Saturday, November 19, 2022

Scrape online academic materials using Python

Manually collecting academic material you find online can be a boring task. In this blog post, I will demonstrate how I use Python to collect academic theses, journals, and other materials for my profession.


Online Scientific Research Journals: 

Here my professor wants all the journals published by "Scientific Research and Community Publishers" (onlinescientificresearch.com), together with their details, neatly arranged in a spreadsheet table.

The specific details required are the journal name/title, the page URL, the description, the cover image and the ISSN.

All the details should be organized in a spreadsheet as seen below.


The code:

import json
import requests
import pandas as pd
from bs4 import BeautifulSoup



# Section 1: Scrape journals page URLs and thumbnail images

url = 'https://www.onlinescientificresearch.com/journals.php'

# Get user-agent from: http://www.useragentstring.com/
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
html = response.text

soup = BeautifulSoup(html, 'html.parser')
journals = soup.find_all("div", {'class':'col-12 col-sm-6 col-lg-3'})

print(len(journals))
# ---------------------------------------------------------------

# Section 2: Extract paths to journals URL and thumbnail image...

url_list = []
image_list = []

for j in journals:
    url = j.find('a')['href']
    img = j.find('img')['src']
    
    url_list.append(url)
    image_list.append(img)
    
print('Done...')


# ---------------------------------------------------------------
# Section 3: Create dataframe and construct other details...

df = pd.DataFrame([url_list, image_list]).T
df.columns = ['Journal URL', 'Journal IMAGE URL']
# -------------------------------------
####### Construct Journal Name #######
df['Journal Name'] = df['Journal URL'].apply(lambda row: row.split('/')[-1].replace('.php', '').replace('-', ' ').title())


####### Construct Journal Description #######
def get_journal_descr(url):
    # Get user-agent from: http://www.useragentstring.com/
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

    response = requests.get(url, headers=headers)
    html = response.text
    
    soup = BeautifulSoup(html, 'html.parser')
    journal_descr = soup.find("div", {'class':'card-body'})
    
    # Return an empty string if the page has no 'card-body' div
    return journal_descr.text if journal_descr else ''
# -------------------------------------
# Scrape Journal description into a list 
j_descr_list = []
i = 1

for url in df['Journal URL']:
    print(i, 'Processing...', url)
    j_descr = get_journal_descr(url)
    
    j_descr_list.append((url, j_descr))
    i = i+1

desc_df = pd.DataFrame(j_descr_list)
# -------------------------------------

# We have to access each journal url page to get its description...
# df['Journal description'] = df['Journal URL'].apply(lambda url: get_journal_descr(url))
df['Journal description'] = desc_df[1]


####### Construct Journal ISSN #######
# We have to use OCR on the journal thumb nail to get its ISSN...
# Using OCR API at: https://ocr.space/ocrapi....

headers = {
    'apikey': 'helloworld', # 'helloworld'
    'content-type': 'application/x-www-form-urlencoded',
}

issn_list = []

for thumbnail in df['Journal IMAGE URL']:
    print('Processing....', thumbnail)
    
    # Pass the form fields as a dict so requests URL-encodes the thumbnail URL
    data = {'isOverlayRequired': 'true', 'url': thumbnail, 'language': 'eng'}

    response = requests.post('https://api.ocr.space/Parse/Image', headers=headers, data=data, verify=False)

    result = response.json() # Convert the JSON response to a dictionary
    # type(result)

    # Check the dict keys, the ISSN is in: ParsedResults >> 0 >> ParsedText
    issn = result['ParsedResults'][0]['ParsedText'].strip().split('\r\n')[-1]

    issn_list.append(issn)

df['Journal ISSN'] = issn_list

df
Extracting the journal ISSN was definitely the trickiest part, as it requires working with an OCR API.
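
Since OCR output can be noisy, it may help to sanity-check the extracted value before storing it. The sketch below is not part of the original workflow: it assumes the OCR text contains the ISSN in the usual NNNN-NNNC form and uses the standard ISSN check digit (weights 8 down to 2, modulo 11) to reject misreads. The names `is_valid_issn` and `extract_issn` are hypothetical helpers.

```python
import re

# Four digits, an optional hyphen, three digits, then a digit or X
# as the check character.
ISSN_PATTERN = re.compile(r'\b(\d{4})-?(\d{3}[\dXx])\b')


def is_valid_issn(issn):
    """Return True if the ISSN check digit is consistent."""
    chars = issn.replace('-', '').upper()
    if len(chars) != 8 or not chars[:7].isdigit():
        return False
    # Weights 8, 7, ..., 2 are applied to the first seven digits.
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = 'X' if check == 10 else str(check)
    return chars[7] == expected


def extract_issn(ocr_text):
    """Return the first valid ISSN in the text as NNNN-NNNC, else None."""
    for match in ISSN_PATTERN.finditer(ocr_text):
        candidate = f'{match.group(1)}-{match.group(2).upper()}'
        if is_valid_issn(candidate):
            return candidate
    return None


print(extract_issn('Journal of Examples\r\nISSN: 2049-3630'))
```

A value that fails the check could be flagged for manual review instead of being written to the spreadsheet.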



M.Sc. in GIST Theses

Master of Science (Geographic Information Science and Technology) Theses by University of Southern California. 


Here our professor wants the thesis details arranged in a table, as seen above.

Let's start by inspecting the HTML tags on the web page.

Here I copied the parent div tag that contains the needed data into a local HTML file. With this, we don't need to send a request to the website.

import pandas as pd
from bs4 import BeautifulSoup

# Copy the parent div tag into a html/txt file...
html_file = r"C:\Users\Yusuf_08039508010\Documents\Jupyter_Notebook\2022\M.S. IN GIST THESES\M.S. IN GIST THESES.HTML"

# Use BeautifulSoup to read the html div tag....
with open(html_file, encoding='utf-8') as f:
    div_data = f.read()

soup = BeautifulSoup(div_data, 'html.parser')

thesis_years = soup.find_all("h3")

thesis_authors = soup.find_all("strong")
thesis_authors = [ a.text for a in thesis_authors ]

thesis_topics = soup.find_all("em")
thesis_topics = [ t.text for t in thesis_topics ]

thesis_advisor = soup.find_all("p")
thesis_advisor = [ a.text for a in thesis_advisor if 'Advisor:' in a.text ]

thesis_pdf = soup.find_all("a")
thesis_pdf = [ link.get('href') for link in thesis_pdf if 'Abstract Text' not in link.text ]

# --------------------------------------------
df = pd.DataFrame(thesis_authors, columns=['Author'])
df['Topic'] = thesis_topics
df['Advisor'] = thesis_advisor
df['PDF Link'] = thesis_pdf

df
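
Building the table column by column only works if every list has the same number of items, which inconsistent markup can easily break. A small check like the one below (a sketch, with a hypothetical helper name `check_equal_lengths` and stand-in data) reports mismatched lists before the DataFrame is built.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths


# Stand-in data; in the post these would be the scraped lists.
authors = ['A. Author', 'B. Author']
topics = ['Topic one', 'Topic two']
print(check_equal_lengths(Author=authors, Topic=topics))
```

The error message lists the length of each column, which makes it easier to see which tag selection picked up extra or missing items.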

The code below will download the PDF files to the local disk using the requests library.
import os
import requests

# Make sure the output folder exists before writing into it
os.makedirs('Thesis PDF', exist_ok=True)

i = 1
for indx, row in df.iterrows():
    link = row['PDF Link']
    print('Processing...', link)

    pdf_name = str(i) +'_'+ link.split('/')[-1]
    pdf_file = requests.get(link, timeout=10).content

    with open( f'Thesis PDF\\{pdf_name}', 'wb' ) as f:
        f.write(pdf_file)
        
    i += 1
    # break


print('Finished...')
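
Taking the last part of the link as the file name works for plain URLs, but it can carry along query strings or percent-encoded characters. The following is a minimal sketch of a more defensive alternative using only the standard library; `pdf_filename` and the fallback name `download.pdf` are assumptions for illustration, not something from the original script.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def pdf_filename(index, url):
    """Build '<index>_<file name>' from the path part of a URL."""
    # Keep only the path (no query string) and decode sequences like %20.
    path = unquote(urlsplit(url).path)
    # Fall back to a generic name if the URL path has no file name.
    name = PurePosixPath(path).name or 'download.pdf'
    return f'{index}_{name}'


print(pdf_filename(1, 'https://example.com/files/My%20Thesis.pdf?download=1'))
```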





Journal - Nigerian Institution of Surveyors



This was a little bit tricky because the web page had inconsistent HTML tags.
import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://nisngr.net/journal/'
response = requests.get(url, verify=False)
html = response.text
# ----------------------------


soup = BeautifulSoup(html, 'html.parser')
div_boxes = soup.find_all("div", {'class':'wpb_text_column wpb_content_element'})
# ----------------------------


papers_dict = {}
for div in div_boxes:
    papers = div.find_all('a')
    
    for link in papers:
        papers_dict[link.text] = link['href']
# ----------------------------

df = pd.DataFrame([papers_dict]).T
df




Thank you for reading.

Thursday, November 10, 2022

Automate boring tasks in QGIS with PyQGIS

In this post, I will use PyQGIS to automate some boring tasks I often encounter in QGIS. I hope you will find something useful for your workflow. Let's get started...

If you don't know what PyQGIS is, then read this definition by hatarilabs.com: "PyQGIS is the Python environment inside QGIS with a set of QGIS libraries plus the Python tools with the potential of running other powerful libraries as Pandas, Numpy or Scikit-learn".

PyQGIS allows users to automate workflows and extend QGIS with Python libraries; the documentation can be accessed here.

This means knowledge of Python programming is required to understand some of the code below.



  Task 1~ Count number of opened/loaded layers in the layer panel

I often find myself trying to count the layers in my QGIS project layer panel, so a simple PyQGIS script to automate the process is ideal, especially when there are many layers to count.

# This will return all the layers on the layer panel
all_layers = QgsProject.instance().mapLayers().values()
print('There are', len(all_layers), 'layers on the layer panel.')


  Task 2~ Count features in loaded vector layer

In this task, I want to get the number of features in each layer I am working on. This is similar to the 'Show Feature Count' function when you right-click on a vector layer.

# Get all layers into a list....
all_layers = list(QgsProject.instance().mapLayers().values())

# Get the displayed name and feature count of each vector layer
# (raster layers are skipped because they have no featureCount() method) ...
ftCounts = [ (l.name(), l.featureCount()) for l in all_layers if isinstance(l, QgsVectorLayer) ]
print(ftCounts)


  Task 3~ Switch on/off all layers

Turning all layers ON or OFF can be frustrating when you have many layers to click through. So why not automate it in just a click?

# Get list of layers from the layer's panel...
qgis_prjt_lyrs = QgsProject.instance().layerTreeRoot().findLayers()

# Use an index to set a single layer on or off
# (index 20 assumes the project has at least 21 layers)....
qgis_prjt_lyrs[20].setItemVisibilityChecked(True) # True=On, False=Off

# Do for all...
for l in qgis_prjt_lyrs:
    l.setItemVisibilityChecked(False)


  Task 4~ Identify layers that are on/off

Let's extend Task 3 above, so we know which layers are on (visible) and which layers are off (hidden).

# Get list of layers from the layer's panel...
qgis_prjt_lyrs = QgsProject.instance().layerTreeRoot().findLayers()

# Check if a layer is visible or not...
layer_visibility_check = [ (l.name(), l.isVisible()) for l in qgis_prjt_lyrs ]
print(layer_visibility_check)

visibility_true = [ l.name() for l in qgis_prjt_lyrs if l.isVisible() ]
print('Number of visible layers:', len(visibility_true))

visibility_false = [ l.name() for l in qgis_prjt_lyrs if not l.isVisible() ]
print('Number of hidden layers:', len(visibility_false))


  Task 5~ Read file path of layers

This is useful when you have many layers and don't know where they are located on your machine. You will also see interesting paths to remote layers such as WMS.

# Returns path to every layer...
layer_paths = [layer.source() for layer in QgsProject.instance().mapLayers().values()]
print(layer_paths)


  Task 6~ Read layer type of layers

We can check the 'type' of a layer.

# Get dict of layers from the layer's panel...
layersDict = QgsProject.instance().mapLayers()


# Loop variable names avoid shadowing the built-ins id() and map()
for (layer_id, map_layer) in layersDict.items():
    print(map_layer.name(), '>>', map_layer.type())


  Task 7~ Create multiple attribute fields/columns

Let's say we want to add multiple integer fields/columns to a vector layer. The code below will create attribute fields for the years 2000 to 2022, that is twenty-three (23) attribute columns/fields on the selected vector layer.

# Get Layer by name...
layer = QgsProject.instance().mapLayersByName("NIG LGA")[0]

# Define dataProvider for layer
layer_provider = layer.dataProvider()

# Add an Integer attribute field and update fields...
layer_provider.addAttributes([QgsField("2000", QVariant.Int)])
layer.updateFields()

# Add bulk attribute fields...
for x in range(2001, 2023):
    layer_provider.addAttributes([QgsField(str(x), QVariant.Int)])
    layer.updateFields()

print('Done...')
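
The number of fields created follows from how range() treats its end value. The short sketch below builds the same field names as a plain list, which makes the count easy to verify outside QGIS; the commented lines show how such a list might be passed to the provider in one call, reusing the names from the code above.

```python
# range() excludes its end value, so an end of 2023 stops at 2022.
field_names = [str(year) for year in range(2000, 2023)]

print(len(field_names), field_names[0], field_names[-1])

# Inside QGIS, the whole list could then be added in one call, e.g.:
# layer_provider.addAttributes([QgsField(n, QVariant.Int) for n in field_names])
# layer.updateFields()
```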


  Task 8~ Read/List all names of layers on layer panel

Here we just want to return the displayed names of layers.
# Get all layers into a list....
all_layers = list(QgsProject.instance().mapLayers().values())

# Get all displayed names of layer
all_layers_names = [ l.name() for l in all_layers ]
print(all_layers_names)


  Task 9~ Save attribute table to dataframe

# Save attribute table into Dataframe...

import pandas as pd

# Get Layer by name...
layer = QgsProject.instance().mapLayersByName("NIG LGA")[0]

# get attribute columns names
col_names = [ field.name() for field in layer.fields() ]

lga_list = []
state_list = []
apc_list = []
pdp_list = []
lp_list = []
nnpp_list = []
winner_list = []


for feature in layer.getFeatures():
    lga_list.append(feature['lga_name'])
    state_list.append(feature['state_name'])
    apc_list.append(feature['APC'])
    pdp_list.append(feature['PDP'])
    lp_list.append(feature['LP'])
    nnpp_list.append(feature['NNPP'])
    winner_list.append(feature['Winner'])

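df = pd.DataFrame([state_list, lga_list, apc_list, pdp_list, lp_list, nnpp_list, winner_list]).T
# Label the columns so the CSV header is readable
df.columns = ['state_name', 'lga_name', 'APC', 'PDP', 'LP', 'NNPP', 'Winner']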

df.to_csv(r'C:\Users\Yusuf_08039508010\Desktop\...\test.csv')

print('Done....')
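
The per-column lists above can also be avoided by looping over the field names. The sketch below shows the idea with plain dictionaries standing in for QGIS features (an assumption made so that it runs outside QGIS, with made-up values) and uses the standard csv module in place of pandas. Inside QGIS, `feature[name]` works the same way on a real feature.

```python
import csv
import io

col_names = ['state_name', 'lga_name', 'Winner']

# Stand-ins for layer.getFeatures(); the values are made up for illustration.
features = [
    {'state_name': 'State A', 'lga_name': 'LGA 1', 'Winner': 'Party X'},
    {'state_name': 'State B', 'lga_name': 'LGA 2', 'Winner': 'Party Y'},
]

# One row per feature, one value per column name.
rows = [[feature[name] for name in col_names] for feature in features]

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(col_names)
writer.writerows(rows)
print(buffer.getvalue())
```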


  Task 10~ Select from multiple layers and attribute fields

Here we want to select, across all listed layers and all attribute fields, every feature that matches any of the given keywords.
# Query to Select from all listed layers and all attribute fields
search_for = {'Bauchi', 'SSZ', 'Edo', 'Yobe'}

for lyr in QgsProject.instance().mapLayers().values():
    if isinstance(lyr, QgsVectorLayer):
        to_select = []
        # fieldlist = [f.name() for f in lyr.fields()]
        for f in lyr.getFeatures():
            # Check if any of the search keyword intersects to
            # feature's row attribute. If true, get the feature ID for selection...
            if len(search_for.intersection(f.attributes())) > 0:
                to_select.append(f.id())
        if len(to_select) > 0:
            lyr.select(to_select)
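
The matching step relies on set intersection, which can be tried outside QGIS. In this sketch, a dictionary of made-up feature ids and attribute lists stands in for `f.id()` and `f.attributes()`, and `matching_ids` is a hypothetical helper name.

```python
search_for = {'Bauchi', 'SSZ', 'Edo', 'Yobe'}

# Feature id -> attribute values; stand-ins for f.id() and f.attributes().
features = {
    1: ['Bauchi', 'NEZ', 120],
    2: ['Lagos', 'SWZ', 85],
    3: ['Benin City', 'Edo', 40],
}


def matching_ids(features, keywords):
    """Return ids of features whose attributes share a value with keywords."""
    return [fid for fid, attrs in features.items() if keywords.intersection(attrs)]


print(matching_ids(features, search_for))
```

Note that the intersection compares whole attribute values, so a keyword such as 'Edo' would not match a value like 'Edo State'.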




  Task 11~ Convert multiple GeoJSON files to shapefiles

import glob

input_files = glob.glob(r'C:\Users\Yusuf_08039508010\Desktop\Working_Files\GIS Data\US Zip Codes\*.json')
for f in input_files:
    out_filename = f.split('\\')[-1].split('.')[0]
    input_file = QgsVectorLayer(f, "polygon", "ogr")
    
    if input_file.isValid() == True:
        QgsVectorFileWriter.writeAsVectorFormat(input_file, rf"C:\Users\Yusuf_08039508010\Desktop\Working_Files\Fiverr\2021\05-May\Division_Region_Area Map\SHP\US ZipCode\{out_filename}.shp", "UTF-8", input_file.crs(), "ESRI Shapefile")
    else:
        print(f, 'is not a valid input file')
        
print('Done processing', len(input_files), 'file(s).')
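
Splitting the path on backslashes and dots works for simple names, but it cuts a name such as `zip.codes.2021.json` down to `zip`. A possible alternative, sketched here with the standard pathlib module and a hypothetical helper `output_name`, keeps everything before the final extension. `PureWindowsPath` is used so the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath


def output_name(path):
    """Return the file name without its final extension."""
    return PureWindowsPath(path).stem


print(output_name(r'C:\data\zip.codes.2021.json'))
```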



Thank you for reading.

Friday, November 4, 2022

Search nearby places - Comparing three APIs (Google Places API, Geoapify API and HERE API)

In this post, I will compare APIs from three different providers for searching nearby places. The three APIs are: Google Places API, Geoapify API and HERE API.

For each of the platforms, you need to register and get a developer API key. All the platforms offer a limited free API quota to start with.
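
Since each provider issues its own key, one option is to keep the keys out of the script and read them from environment variables. This is only a sketch of that idea: the helper `get_api_key` and the variable name `DEMO_PLACES_API_KEY` are assumptions, and the dummy value is set in the code purely so the example runs as-is.

```python
import os


def get_api_key(env_name):
    """Read an API key from an environment variable, failing early if unset."""
    key = os.environ.get(env_name, '').strip()
    if not key:
        raise RuntimeError(f'Set the {env_name} environment variable first.')
    return key


# Demo only: a dummy value is set here so the example runs as-is.
os.environ['DEMO_PLACES_API_KEY'] = 'dummy-key'
print(get_api_key('DEMO_PLACES_API_KEY'))
```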



Google Places API


import json
import requests
import pandas as pd
from datetime import datetime

df = pd.read_csv('datafile.csv')


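YOUR_API_KEY = 'AIza......'

# Output subfolder used in the file path below. It was not defined in the
# original snippet, so this value is a placeholder; set it to your own folder name.
state_folder = 'state_name'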

i = 1
for row, col in df.iterrows():
    lat = col['Latitude']
    long = col['Longitude']
    print(i, 'Processing...', lat, long)
    
    url = f'https://maps.googleapis.com/maps/api/place/nearbysearch/json?location={lat}%2C{long}&radius=4850&type=laundry&keyword=laundromats&key={YOUR_API_KEY}'

    payload={}
    headers = {}

    response = requests.request("GET", url, headers=headers, data=payload)

    # Get current time...
    now = datetime.now()
    current_time = now.strftime("%Y%m%d__%H%M%S")

    # Write to file....
    with open(fr'JSON folder\\GoogleAPI\\{state_folder}\\{current_time}.json', 'w') as outfile:
        json.dump(response.json(), outfile)

    i = i+1
    
    
print('Done...')
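
The request URL in the loop is assembled by hand, including the `%2C` that encodes the comma between latitude and longitude. As a sketch of an alternative, the standard library can do the encoding from a dictionary of parameters. The endpoint and parameter values are taken from the code above; the helper name `nearby_search_url` and the sample coordinates are assumptions.

```python
from urllib.parse import urlencode

BASE_URL = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json'


def nearby_search_url(lat, lng, api_key, radius=4850):
    """Build the Nearby Search URL with the same parameters as the loop above."""
    params = {
        'location': f'{lat},{lng}',  # urlencode turns the comma into %2C
        'radius': radius,
        'type': 'laundry',
        'keyword': 'laundromats',
        'key': api_key,
    }
    return f'{BASE_URL}?{urlencode(params)}'


print(nearby_search_url(27.95, -82.45, 'YOUR_API_KEY'))
```

The same dictionary could instead be passed to `requests.get(BASE_URL, params=params)`, which performs the encoding itself.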


Geoapify API


GeoApify_API_KEY = '378122b08....'

url = 'https://api.geoapify.com/v2/places'

# Query parameters as a dictionary (the original mixed ':' and '=' inside dict())
params = {
    'categories': 'commercial',
    'filter': 'rect:7.735282,48.586797,7.756289,48.574457',
    'limit': 2000,
    'apiKey': GeoApify_API_KEY,
}

resp = requests.get(url=url, params=params)
data = resp.json()

print(data)
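
The `filter` value above is a bounding box written as a single string. A small helper can build it from four numbers. This sketch assumes the order is longitude, latitude, longitude, latitude, which is how the example value is read here but should be checked against the Geoapify documentation; `rect_filter` is a hypothetical name.

```python
def rect_filter(lon1, lat1, lon2, lat2):
    """Format a bounding box as 'rect:lon1,lat1,lon2,lat2'."""
    return f'rect:{lon1},{lat1},{lon2},{lat2}'


print(rect_filter(7.735282, 48.586797, 7.756289, 48.574457))
```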




HERE API


HERE_API_KEY = 'WEYn....'
coord = '27.95034271398129,-82.45670935632066' # lat, long
url = f'https://places.ls.hereapi.com/places/v1/discover/here?apiKey={HERE_API_KEY}&at={coord}&laundry'

response = requests.get(url).json() # Already parsed into a dictionary
# print(response)

# Get current time...
now = datetime.now()
current_time = now.strftime("%Y%m%d__%H%M%S")


# Write to file....
with open(fr'JSON folder\\{current_time}.json', 'w') as outfile:
    json.dump(response, outfile)
    
print('Done...')



Tuesday, November 1, 2022