Saturday, November 19, 2022

Scrape online academic materials using Python

Manually collecting academic material you find online can be a boring task. In this blog post, I will demonstrate how I use Python to collect academic theses, journals, and other materials for my profession.


Online Scientific Research Journals: 

Here my professor wants to have all the journals and their details published by "Scientific Research and Community Publishers" (onlinescientificresearch.com) neatly arranged in a spreadsheet table.

The specific details required are the journal name/title, the page URL, the description, the cover image, and the ISSN number.

All the details should be organized in a spreadsheet as seen below.


The code:

import json
import requests
import pandas as pd
from bs4 import BeautifulSoup



# Section 1: Scrape journals page URLs and thumbnail images

url = 'https://www.onlinescientificresearch.com/journals.php'

# Get user-agent from: http://www.useragentstring.com/
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
html = response.text

soup = BeautifulSoup(html, 'html.parser')
journals = soup.find_all("div", {'class':'col-12 col-sm-6 col-lg-3'})

print(len(journals))
# ---------------------------------------------------------------
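Before moving on, it doesn't hurt to confirm that the request actually succeeded and that the CSS class we searched for still matches something. A quick optional check (my addition, not part of the original script) could look like this:

# Optional sanity check: stop here if the server returned an error
# status or if no journal cards were found on the page.
response.raise_for_status()
assert len(journals) > 0, 'No journal cards found - has the page layout changed?'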

# Section 2: Extract each journal's page URL and thumbnail image URL...

url_list = []
image_list = []

for j in journals:
    url = j.find('a')['href']
    img = j.find('img')['src']
    
    url_list.append(url)
    image_list.append(img)
    
print('Done...')
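One thing to watch for: href and src attributes are sometimes relative paths rather than full URLs. The links on this site appeared to be absolute, but if you ever get relative ones, urllib.parse.urljoin will normalize them (it leaves absolute URLs untouched):

from urllib.parse import urljoin

# Normalize any relative paths against the site root (a defensive sketch).
base = 'https://www.onlinescientificresearch.com/'
url_list = [urljoin(base, u) for u in url_list]
image_list = [urljoin(base, img) for img in image_list]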


# ---------------------------------------------------------------
# Section 3: Create dataframe and construct other details...

df = pd.DataFrame([url_list, image_list]).T
df.columns = ['Journal URL', 'Journal IMAGE URL']
# -------------------------------------
####### Construct Journal Name #######
df['Journal Name'] = df['Journal URL'].apply(lambda url: url.split('/')[-1].replace('.php', '').replace('-', ' ').title())


####### Construct Journal Description #######
def get_journal_descr(url):
    # Get user-agent from: http://www.useragentstring.com/
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

    response = requests.get(url, headers=headers)
    html = response.text
    
    soup = BeautifulSoup(html, 'html.parser')
    journal_descr = soup.find("div", {'class':'card-body'})
    
    return journal_descr.text
# -------------------------------------
# Scrape Journal description into a list 
j_descr_list = []
i = 1

for url in df['Journal URL']:
    print(i, 'Processing...', url)
    j_descr = get_journal_descr(url)
    
    j_descr_list.append((url, j_descr))
    i = i+1

desc_df = pd.DataFrame(j_descr_list)
# -------------------------------------

# We have to access each journal url page to get its description...
# df['Journal description'] = df['Journal URL'].apply(lambda url: get_journal_descr(url))
df['Journal description'] = desc_df[1]
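A small caveat with the assignment above: desc_df[1] relies on both frames sharing the same row order. That holds here because desc_df was built by iterating over df['Journal URL'], but joining on the URL column is the safer pattern if the order ever changes. A sketch of that alternative (use it instead of the assignment above, not in addition to it):

# Name the columns, then join on the URL instead of relying on row order.
desc_df.columns = ['Journal URL', 'Journal description']
df = df.merge(desc_df, on='Journal URL', how='left')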


####### Construct Journal ISSN #######
# We have to use OCR on the journal thumbnail to get its ISSN...
# Using OCR API at: https://ocr.space/ocrapi....

headers = {
    'apikey': 'helloworld',  # 'helloworld' is the free demo key from ocr.space
    'content-type': 'application/x-www-form-urlencoded',
}

issn_list = []

for thumbnail in df['Journal IMAGE URL']:
    print('Processing....', thumbnail)
    
    data = f'isOverlayRequired=true&url={thumbnail}&language=eng'

    response = requests.post('https://api.ocr.space/parse/image', headers=headers, data=data, verify=False)

    result = json.loads(response.content.decode()) # Convert the result to dictionary using json.loads() function
    # type(result)

    # Check the dict keys, the ISSN is in: ParsedResults >> 0 >> ParsedText
    issn = result['ParsedResults'][0]['ParsedText'].strip().split('\r\n')[-1]

    issn_list.append(issn)

df['Journal ISSN'] = issn_list

df
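Since the end goal was a spreadsheet, the finished table can be written out directly with pandas. A minimal example (the file name is my choice; .xlsx export requires the openpyxl package):

df.to_excel('SRC_journals.xlsx', index=False)
# or, with no extra dependency:
df.to_csv('SRC_journals.csv', index=False)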
Extracting the journal ISSN was definitely the trickiest part, as it required working with an OCR API.
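Because OCR output is noisy, it can also help to validate what comes back instead of trusting the last line of parsed text. A small sketch, assuming the usual NNNN-NNNC ISSN format (four digits, a hyphen, then three digits plus a check digit that may be 'X'):

import re

def clean_issn(ocr_text):
    # Pull the first ISSN-looking token out of noisy OCR text; None if absent.
    match = re.search(r'\b\d{4}-\d{3}[\dXx]\b', ocr_text)
    return match.group(0) if match else None

# e.g. clean_issn('ISSN: 1234-567X') returns '1234-567X' (illustrative value)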



M.Sc. in GIST Theses

Master of Science (Geographic Information Science and Technology) theses by the University of Southern California.


Here my professor wants the thesis details arranged in a table, as seen above.

Let's start by inspecting the HTML tags on the web page.

Here I copied the parent div tag that contains the needed data into a local HTML file. With this, we don't need to send a request to the website.
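As an aside, if you would rather not copy the markup by hand, you could fetch the page once and cache it to disk, then keep parsing from the local copy on every re-run. A minimal sketch (the page URL below is a placeholder, not the real address):

import requests

page_url = 'https://example.edu/ms-in-gist-theses'  # placeholder, not the real page URL
html_file = 'M.S. IN GIST THESES.HTML'

html = requests.get(page_url, timeout=10).text
with open(html_file, 'w', encoding='utf-8') as f:
    f.write(html)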

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Copy the parent div tag into a html/txt file...
html_file = r"C:\Users\Yusuf_08039508010\Documents\Jupyter_Notebook\2022\M.S. IN GIST THESES\M.S. IN GIST THESES.HTML"

# Use BeautifulSoup to read the html div tag....
with open(html_file, encoding='utf-8') as f:
    div_data = f.read()

soup = BeautifulSoup(div_data, 'html.parser')

thesis_years = soup.find_all("h3")

thesis_authors = soup.find_all("strong")
thesis_authors = [ a.text for a in thesis_authors ]

thesis_topics = soup.find_all("em")
thesis_topics = [ t.text for t in thesis_topics ]

thesis_advisor = soup.find_all("p")
thesis_advisor = [ a.text for a in thesis_advisor if 'Advisor:' in a.text ]

thesis_pdf = soup.find_all("a")
thesis_pdf = [ link.get('href') for link in thesis_pdf if 'Abstract Text' not in link.text ]

# --------------------------------------------
df = pd.DataFrame(thesis_authors, columns=['Author'])
df['Topic'] = thesis_topics
df['Advisor'] = thesis_advisor
df['PDF Link'] = thesis_pdf

df

The code below will download the PDF files to local disk using the requests library.
i = 1
for indx, row in df.iterrows():
    link = row['PDF Link']
    print('Processing...', link)

    pdf_name = str(i) + '_' + link.split('/')[-1]
    pdf_file = requests.get(link, timeout=10).content

    # NOTE: the 'Thesis PDF' folder must already exist on disk
    with open(f'Thesis PDF\\{pdf_name}', 'wb') as f:
        f.write(pdf_file)

    i += 1
    # break


print('Finished...')
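For long download runs it can help to make the loop resumable and tolerant of broken links. A more forgiving variant of the loop above (my sketch, not part of the original code):

import os
import requests

# Skip files we already have and keep going when a single link fails.
os.makedirs('Thesis PDF', exist_ok=True)

for i, link in enumerate(df['PDF Link'], start=1):
    pdf_path = os.path.join('Thesis PDF', f"{i}_{link.split('/')[-1]}")
    if os.path.exists(pdf_path):
        continue  # already downloaded on a previous run
    try:
        resp = requests.get(link, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as e:
        print('Skipped', link, '->', e)
        continue
    with open(pdf_path, 'wb') as f:
        f.write(resp.content)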





Journal - Nigerian Institution of Surveyors



This was a little bit tricky because the web page had inconsistent HTML tags.

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://nisngr.net/journal/'
response = requests.get(url, verify=False)  # verify=False skips SSL certificate verification
html = response.text
# ----------------------------


soup = BeautifulSoup(html, 'html.parser')
div_boxes = soup.find_all("div", {'class':'wpb_text_column wpb_content_element'})
# ----------------------------


papers_dict = {}
for div in div_boxes:
    papers = div.find_all('a')
    
    for link in papers:
        papers_dict[link.text] = link['href']
# ----------------------------

df = pd.DataFrame([papers_dict]).T
df
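The transposed frame leaves the paper titles in the index, so one last tidy-up step moves them into a proper column before exporting (the file name is my choice):

df = df.reset_index()
df.columns = ['Paper Title', 'PDF Link']
df.to_excel('NIS_journal_papers.xlsx', index=False)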




Thank you for reading.
