Manually collecting academic material you find online can be a boring task. In this blog post, I will demonstrate how I use Python to collect academic theses, journals, and other materials for my profession.
Online Scientific Research Journals:
Here my professor wants all the journals published by "Scientific Research and Community Publishers" (onlinescientificresearch.com), together with their details, neatly arranged in a spreadsheet table.
The specific details required are the journal name/title, the page URL, the description, the cover image, and the ISSN.
All the details should be organized in a spreadsheet as seen below.
The code:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Section 1: Scrape journals page URLs and thumbnail images
url = 'https://www.onlinescientificresearch.com/journals.php'
# Get user-agent from: http://www.useragentstring.com/
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
journals = soup.find_all("div", {'class':'col-12 col-sm-6 col-lg-3'})
print(len(journals))
# ---------------------------------------------------------------
# Section 2: Extract paths to journals URL and thumbnail image...
url_list = []
image_list = []
for j in journals:
    url = j.find('a')['href']
    img = j.find('img')['src']
    url_list.append(url)
    image_list.append(img)
print('Done...')
# ---------------------------------------------------------------
# Section 3: Create dataframe and construct other details...
df = pd.DataFrame([url_list, image_list]).T
df.columns = ['Journal URL', 'Journal IMAGE URL']
# -------------------------------------
####### Construct Journal Name #######
df['Journal Name'] = df['Journal URL'].apply(lambda row: row.split('/')[-1].replace('.php', '').replace('-', ' ').title())
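# For example (a hypothetical URL, assuming journal pages end in '<journal-slug>.php'):
# 'https://www.onlinescientificresearch.com/journal-of-example-studies.php'
# -> 'journal-of-example-studies' -> 'journal of example studies' -> 'Journal Of Example Studies'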
####### Construct Journal Description #######
def get_journal_descr(url):
    # Get user-agent from: http://www.useragentstring.com/
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    journal_descr = soup.find("div", {'class':'card-body'})
    return journal_descr.text
# -------------------------------------
# Scrape Journal description into a list
j_descr_list = []
i = 1
for url in df['Journal URL']:
    print(i, 'Processing...', url)
    j_descr = get_journal_descr(url)
    j_descr_list.append((url, j_descr))
    i = i + 1
desc_df = pd.DataFrame(j_descr_list)
# -------------------------------------
# We have to access each journal url page to get its description...
# df['Journal description'] = df['Journal URL'].apply(lambda url: get_journal_descr(url))
df['Journal description'] = desc_df[1]
####### Construct Journal ISSN #######
# We have to use OCR on the journal thumbnail image to get its ISSN...
# Using OCR API at: https://ocr.space/ocrapi....
headers = {
    'apikey': 'helloworld',   # 'helloworld' is the free demo API key
    'content-type': 'application/x-www-form-urlencoded',
}
issn_list = []
for thumbnail in df['Journal IMAGE URL']:
    print('Processing....', thumbnail)
    data = f'isOverlayRequired=true&url={thumbnail}&language=eng'
    response = requests.post('https://api.ocr.space/Parse/Image', headers=headers, data=data, verify=False)
    result = json.loads(response.content.decode())  # Convert the response to a dictionary using json.loads()
    # Check the dict keys, the ISSN is in: ParsedResults >> 0 >> ParsedText
    issn = result['ParsedResults'][0]['ParsedText'].strip().split('\r\n')[-1]
    issn_list.append(issn)
df['Journal ISSN'] = issn_list
df
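Finally, since the goal is a spreadsheet, the dataframe can be written out to a file. A minimal sketch (the file name is my own choice, and to_excel assumes the openpyxl package is installed):
# Write the final table to a spreadsheet (file name is arbitrary)...
df.to_excel('scientific_research_journals.xlsx', index=False)
# ...or to CSV if Excel support is not installed:
# df.to_csv('scientific_research_journals.csv', index=False)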
M.Sc. in GIST Theses
Master of Science (Geographic Information Science and Technology) theses by the University of Southern California.
Here our professor wants the thesis details arranged in a table as seen above.
Let's start by inspecting the HTML tags on the web page.
Here I copied the parent div tag that contains the needed data into a local HTML file. With this, we don't need to send a request to the website.
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Copy the parent div tag into a html/txt file...
html_file = r"C:\Users\Yusuf_08039508010\Documents\Jupyter_Notebook\2022\M.S. IN GIST THESES\M.S. IN GIST THESES.HTML"
# Use BeautifulSoup to read the html div tag....
with open(html_file, encoding='utf-8') as f:
    div_data = f.read()
soup = BeautifulSoup(div_data, 'html.parser')
thesis_years = soup.find_all("h3")
thesis_authors = soup.find_all("strong")
thesis_authors = [ a.text for a in thesis_authors ]
thesis_topics = soup.find_all("em")
thesis_topics = [ t.text for t in thesis_topics ]
thesis_advisor = soup.find_all("p")
thesis_advisor = [ a.text for a in thesis_advisor if 'Advisor:' in a.text ]
thesis_pdf = soup.find_all("a")
thesis_pdf = [ link.get('href') for link in thesis_pdf if 'Abstract Text' not in link.text ]
# --------------------------------------------
df = pd.DataFrame(thesis_authors, columns=['Author'])
df['Topic'] = thesis_topics
df['Advisor'] = thesis_advisor
df['PDF Link'] = thesis_pdf
df
i = 1
for indx, row in df.iterrows():
    link = row['PDF Link']
    print('Processing...', link)
    pdf_name = str(i) + '_' + link.split('/')[-1]
    pdf_file = requests.get(link, timeout=10).content
    # Note: the 'Thesis PDF' folder must already exist...
    with open(f'Thesis PDF\\{pdf_name}', 'wb') as f:
        f.write(pdf_file)
    i += 1
    # break
print('Finished...')
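To keep the thesis details themselves in a table as requested, the dataframe can also be saved to a spreadsheet. A minimal sketch, with a file name of my own choosing:
df.to_excel('GIST_theses.xlsx', index=False)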
Journal - Nigerian Institution of Surveyors
This last task collects the papers listed on the Nigerian Institution of Surveyors journal page, together with their download links.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://nisngr.net/journal/'
response = requests.get(url, verify=False)  # verify=False skips SSL certificate verification
html = response.text
# ----------------------------
soup = BeautifulSoup(html, 'html.parser')
div_boxes = soup.find_all("div", {'class':'wpb_text_column wpb_content_element'})
# ----------------------------
papers_dict = {}
for div in div_boxes:
    papers = div.find_all('a')
    for link in papers:
        papers_dict[link.text] = link['href']
# ----------------------------
df = pd.DataFrame([papers_dict]).T
df
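As with the other tables, the result can be exported. A minimal sketch, assuming the single column should be labelled 'Paper URL' (the paper titles end up in the dataframe index):
df.columns = ['Paper URL']
df.to_excel('NIS_journal_papers.xlsx', index_label='Paper Title')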
Thank you for reading.