Friday, April 9, 2021

Extracting data from .HAR file

An HTTP Archive file (shortened to 'HAR file') is a JSON format used for recording the traffic between a web browser and a website. The common extension for these files is '.har'.

In Python, there is a third-party module called "Haralyzer" developed for getting useful data out of HAR files.

Since HAR files are in JSON format, I will not use the "Haralyzer" module; instead, I will read the .har file and extract data from the text directly. Another reason I don't want to use the library is that I don't want to install a new third-party library on my machine, especially since the haralyzer module depends on another third-party library, "six".

Other than that, there is nothing wrong with using a library that reads the .har file directly.

Let's get our hands dirty...


How to get a HAR file

Practically, any website that uses JSON as its data communication format will generate traffic on the client's browser that can be exported as a .har file from the browser's developer tools.

Let's use this website on earthquake data by USGS. Open the website and go to your browser's developer tools, then select the 'Network' tab >> XHR >> Export HAR...


This will download a HAR file that contains a JSON representation of the earthquake data, as seen below...


You can save the file with any name in a location you can remember; we will use it in the next section. Note that the file is GeoJSON with padding (JSONP).
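Since a HAR file is plain JSON, Python's built-in json module is enough to read it. Below is a minimal sketch, assuming the exported file was saved as 'earthquakes.har' (a hypothetical name):

import json

# Load the HAR file as ordinary JSON...
with open('earthquakes.har', encoding='utf-8') as f:
    har_data = json.load(f)

# A HAR file has a top-level 'log' object whose 'entries' list holds
# one record per request/response pair...
for entry in har_data['log']['entries']:
    print(entry['request']['url'], entry['response']['status'])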

Thursday, April 1, 2021

How to make black and white Road network map in Mapbox Studio

Mapbox Studio is a platform used by developers to prepare maps for mobile, desktop and web applications.

In this post, we are going to prepare a black and white road network map similar to what you see below.


There are many startup web applications that allow you to create this kind of map and download it for a fee! In a few moments, you will learn how to make your own maps, so you don't have to spend money buying them.

See an example below that costs $59.


Step-by-step instructions



  • Create a new map by clicking on "New Style" button.


  • Select from the map style templates (here we will use the Blank Template)


  • Rename the map to a suitable name and add map components and layers.


To achieve this type of map, we need to add the following map components and layers:-

  1. Road Network
  2. Land, Water and Sky
  3. Administrative Boundaries

On each component/layer, adjust the settings to fit what you want. For example, I set the administrative boundaries base to white, etc.


After that, you can publish and share the map as a WMTS layer for use in desktop software like QGIS for further map processing, as you will see in a moment.


Wednesday, March 24, 2021

Looping over an iterable (array/list) in JavaScript Vs Python

Let's see what it takes to loop over an iterable using a for-loop in both JavaScript and Python. By the way, an iterable is an object capable of returning its members one at a time, permitting it to be iterated over in a for-loop.

Assuming we have this iterable: m = [3.23, 4.56, 5.3, 2.44, 6.7, 12.4, 566] and we want to perform some math operation on each element (in this case: 2 to the power of the element, divided by 2).

The math formula is as follows:-

For JavaScript

Math.pow(2, element) / 2


For Python

(2**element)/2


The solutions

JavaScript Solution

const m = [3.23, 4.56, 5.3, 2.44, 6.7, 12.4, 566];

for (let i = 0; i < m.length; ++i) {
    console.log(Math.pow(2, m[i]) / 2);
}


const m = [3.23, 4.56, 5.3, 2.44, 6.7, 12.4, 566];

// Note: for...in iterates over the array indices, not the values...
for (let i in m) {
    console.log(Math.pow(2, m[i]) / 2);
}

Python Solution

m = [3.23, 4.56, 5.3, 2.44, 6.7, 12.4, 566]

for i in m:
    print((2**i)/2)
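The same Python computation can also be written in one line as a list comprehension, collecting the results in a list instead of printing them one by one:

m = [3.23, 4.56, 5.3, 2.44, 6.7, 12.4, 566]

# 2 to the power of each element, divided by 2...
results = [(2 ** i) / 2 for i in m]
print(results)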




That is it!

Thursday, March 18, 2021

Python script - Merge PDF files into a single file

This script makes use of the PyPDF2 library to merge a list of PDF files into one big file.


from PyPDF2 import PdfFileMerger
from os import listdir

# Your input directory path...
input_dir = r"C:\Users\Yusuf_08039508010\ND 2 WAEC Result Check\Result"


merge_list = []

# Collect only the PDF files in the input folder...
for x in listdir(input_dir):
    if not x.endswith('.pdf'):
        continue
    merge_list.append(input_dir + '\\' + x)

merger = PdfFileMerger()

for pdf in merge_list:
    merger.append(pdf)

merger.write(input_dir + '\\pdf_file_name.pdf') # your output directory and PDF file name
merger.close()

print('Finished...')


Enjoy!

Tuesday, March 9, 2021

Spread column-wise data to row-wise

On the web, it is very common to find datasets displayed in a column-wise manner, as seen below.


As you will have noticed, each new record is marked by the company name in bold capital letters. So, let's add a sign to separate one record from the other. I used this "------------------------" sign, but you can use anything, as long as it is unique and not part of the records themselves.

So, the working data copied from the website is like this:-

ACCESS CREDIT MANAGEMENT, INC.
Tim Cullen, Attorney
11225 Huron Ln Ste 222
Little Rock, AR 72211-1861
United States
Phone: (501) 664-2922
Fax: (501) 664-3207
MAP Attorney
------------------------
CREDIT CONTROL CO., INC.
Bill Caldwell, President
Bill Caldwell, Ethics Contact
10201 W Markham St Ste 104
Little Rock, AR 72205-2180
United States
Phone: (501) 225-2050
Fax: (501) 225-2135
ACA Member since 1982
Line of Business: Third Party Collections
------------------------
THE MCHUGHES LAW FIRM, PLLC
Becky A. McHughes Esq., Attorney at Law
10810 Executive Center Dr
Danville Bldg Ste 312
Little Rock, AR 72211
United States
Phone: (501) 376-9131
Fax: (501) 374-9332
http://www.mchugheslaw.com
ACA Member since 2013
Line of Business: Law Firm
Line of Business: Third Party Collections
------------------------
THE MCHUGHES LAW FIRM, PLLC
Becky A. McHughes Esq., Attorney at Law
10809 Executive Center Dr
Danville Bldg Ste 312
Little Rock, AR 72204
United States
Phone: (501) 376-9131
Fax: (501) 374-9332
MAP Attorney
Lowell
------------------------
CENTRAL RESEARCH, INC.
Karena Holt, Vice President of Operations
Karena Holt, Ethics Contact
122 N. Bloominton Ste 1
Lowell, AR 72745
United States
Phone: (479) 419-5456
Fax: (479) 419-5460
http://www.central-research.com
ACA Member since 2016
Line of Business: Third Party Collections
------------------------
CENTRAL RESEARCH, INC.
Shane Taylor
106 N Bloomington
Ste S
Lowell, AR 72745-8988
United States
Phone: (479) 419-5456
MAP Attorney
Mabelvale
------------------------
FIRST COLLECTION SERVICES
Chris Dunkum, President
Chris Dunkum, Ethics Contact
10925 Otter Creek East Blvd
Mabelvale, AR 72103-1661
United States
Phone: (501) 455-1658
http://www.FCScollects.com
ACA Member since 1983
Line of Business: Outsourced First Party or Billing Company
Line of Business: Third Party Collections


What we really want is something like this:-

ACCESS CREDIT MANAGEMENT, INC. :: Tim Cullen, Attorney :: 11225 Huron Ln Ste 222 :: Little Rock, AR 72211-1861 :: United States :: Phone: (501) 664-2922 :: Fax: (501) 664-3207 :: MAP Attorney
------------------------
CREDIT CONTROL CO., INC. :: Bill Caldwell, President :: Bill Caldwell, Ethics Contact :: 10201 W Markham St Ste 104 :: Little Rock, AR 72205-2180 :: United States :: Phone: (501) 225-2050 :: Fax: (501) 225-2135 :: ACA Member since 1982 :: Line of Business: Third Party Collections
------------------------
THE MCHUGHES LAW FIRM, PLLC :: Becky A. McHughes Esq., Attorney at Law :: 10810 Executive Center Dr :: Danville Bldg Ste 312 :: Little Rock, AR 72211 :: United States :: Phone: (501) 376-9131 :: Fax: (501) 374-9332 :: http://www.mchugheslaw.com :: ACA Member since 2013 :: Line of Business: Law Firm :: Line of Business: Third Party Collections
------------------------
THE MCHUGHES LAW FIRM, PLLC :: Becky A. McHughes Esq., Attorney at Law :: 10809 Executive Center Dr :: Danville Bldg Ste 312 :: Little Rock, AR 72204 :: United States :: Phone: (501) 376-9131 :: Fax: (501) 374-9332 :: MAP Attorney :: Lowell
------------------------
CENTRAL RESEARCH, INC. :: Karena Holt, Vice President of Operations :: Karena Holt, Ethics Contact :: 122 N. Bloominton Ste 1 :: Lowell, AR 72745 :: United States :: Phone: (479) 419-5456 :: Fax: (479) 419-5460 :: http://www.central-research.com :: ACA Member since 2016 :: Line of Business: Third Party Collections

From a vertical arrangement to a horizontal arrangement: the horizontal (row-wise) arrangement works best in a spreadsheet, where we will have a common column for the same fields across records.

What we have (vertical/column-wise arrangement)


What we want (horizontal/row-wise arrangement)
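Here is a minimal Python sketch of the conversion, assuming the copied text is saved in a file named 'records.txt' (a hypothetical name), with the "------------------------" sign between records:

# Read the column-wise text copied from the website...
with open('records.txt', encoding='utf-8') as f:
    text = f.read()

rows = []
for record in text.split('------------------------'):
    # Keep only the non-empty lines of each record...
    lines = [line.strip() for line in record.splitlines() if line.strip()]
    if lines:
        # Join the record's lines into a single row-wise line...
        rows.append(' :: '.join(lines))

for row in rows:
    print(row)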

Monday, March 8, 2021

RegexOne.com alternative solution

RegexOne.com has interactive lessons on Regular Expressions, and in this post, I want to solve all the lessons with a solution different from the one they provided.

For example, \w matches any word character (equal to [a-zA-Z0-9_]), so if the solution on RegexOne.com is \w, then I have to look for another way, like [a-zA-Z0-9_], to solve the lesson.
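You can verify in Python that such alternative patterns behave the same (for ASCII text, at least):

import re

sample = 'abc123_XYZ'

# Both patterns should match exactly the same characters...
print(re.findall(r'\w', sample) == re.findall(r'[a-zA-Z0-9_]', sample))  # True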

Let's get started...

Exercise 1: Matching Characters



Exercise 1½: Matching Digits



Exercise 2: Matching With Wildcards




Exercise 3: Matching Characters



Exercise 4: Excluding Characters



Exercise 5: Matching Character Ranges


Tuesday, March 2, 2021

PyQGIS - Add multiple shapefile vector layers to the QGIS project instance

Sometimes, I need to load many shapefiles located in various folders into the QGIS project. A handy way to overcome this repetitive, boring task is to use the PyQGIS script below.


import glob

# Use glob to recursively search all folders for .shp files...
shp_files = glob.glob(r'C:\Users\Yusuf_08039508010\Desktop\GIS Data\NGR\**\*.shp', recursive=True)
# print(shp_files)

layer_count = 0
for shp in shp_files:
    print("Loading...", shp)
    layer_name = shp.split('\\')[-1].split('.')[0]
    vlayer = QgsVectorLayer(shp, layer_name, "ogr")
    
    if not vlayer.isValid():
        print("Error: Layer Failed to Load!")
    else:
        QgsProject.instance().addMapLayer(vlayer)
        layer_count += 1

print(f'Finished Loading total of: {layer_count} shapefiles.')


As seen below, I would have to open 11 folders and subfolders to load all the shapefiles into the QGIS project manually. But with the script above, I just run it once and all the shapefiles in both parent and child folders are loaded in a few seconds.


Here, the script loaded 66 shapefiles from all the directories, as seen below.




Enjoy!

Sunday, February 28, 2021

Rename multiple files with new names in excel spreadsheet

In the past, I have written similar script titled "Python script to rename multiple files/folders".

The only difference here is that the new file names will come from a column in an Excel spreadsheet instead of being generated within the script.

Below is the spreadsheet file that contains the current file names and their corresponding new names.




For example, image '4.jpg' would be renamed to 'Barack Obama.jpg', '9.jpg' to 'Donald Trump.jpg', '30.jpg' to 'Joseph Robinette Biden Jr.jpg'... and so on.

Note that all the images are of the same extension (.jpg), so we will maintain the extension.


The script

First, we will read the Excel file using pandas (into a dataframe) and create a dictionary from the two columns, where the keys are the 'old names' and the values are the 'new names'.

import os
import pandas as pd


names_df = pd.read_excel(r"C:\Users\Yusuf_08039508010\Desktop\rename.xlsx")
names_df


names_df_dict = dict(zip(names_df['Old Name'], names_df['New Name']))
names_df_dict

Now, we can access the values of the dictionary by their keys like so: names_df_dict['1.jpg']. With this, we will loop over the keys dynamically and rename the images accordingly.

images_folder = r'C:\Users\Yusuf_08039508010\Documents\US Presidents'

for file in os.listdir(images_folder):
    print ('Renaming...', names_df_dict[file])
    
    # Use os.path.join() to construct the absolute path to the images...
    # Alternatively, we could change directory (os.chdir()) to the images folder
    old_img_name = os.path.join(images_folder, file)
    new_img_name = os.path.join(images_folder, names_df_dict[file] + '.jpg')
    
    os.rename(old_img_name, new_img_name)
    
print('Finished....')

To be sure our renaming script did a perfect job, let's verify the last three presidents, that is:-

  • '4.jpg' would be renamed to 'Barack Obama.jpg', 
  • '9.jpg' to 'Donald Trump.jpg', 
  • '30.jpg' to 'Joseph Robinette Biden Jr.jpg'


That is it!

Tuesday, February 16, 2021

Get Emails from Google search given company Name/Domain

Given a list of company names, search Google to retrieve their email addresses:-

import re

import requests
from bs4 import BeautifulSoup


list_of_url = ['http://umaryusuf.com', 'another website']

# REGEX to search for emails...
EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

unique_emails_list = []

for name in list_of_url:    
    search_query = name + " email"
    print('Processing...', name)

    # -------------- FOR BULK GOOGLE SEARCH USE A PROXY -----------------
    params = (
        ('api_key', 'XXXXXXXXXXXXXXXXXXXXXXXXXXX'),
        ('url', 'https://www.google.com/search?q=' + search_query),
    )
    response = requests.get('http://api.scraperapi.com/', params=params)
    # -------------------------------------------------------------------


    print(response.status_code)

    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.get_text()

    emails_1 = [re_match.group() for re_match in re.finditer(EMAIL_REGEX, text)]

    emails_2 = re.findall(r"[A-Za-z0-9._%+-]+"
                         r"@[A-Za-z0-9.-]+"
                         r"\.[A-Za-z]{2,4}", text)

    unique_emails = list(set(emails_1 + emails_2))
    data = name, unique_emails

    unique_emails_list.append(data)
    print(data)


Given a list of company domain names, access each domain's web page and get all the emails from the page:-

import re
import urllib.request


list_of_url = ['http://umaryusuf.com']



site_list = []

for domain in list_of_url:

    print('Processing...', domain)
    
    try:
        f = urllib.request.urlopen(domain)
        s = f.read().decode('ISO-8859-1')
        emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)
        newemails = list(set(emails))
        d = domain, newemails

        site_list.append(d)
        print (d)
    except Exception:
        d = domain, 'Error Occurred'
        site_list.append(d)
        print (d)

print("Finished...")


Enjoy!

Monday, February 15, 2021

PyQGIS - Write vector layers field name and type to text file

The PyQGIS script below will write the field name and field type of a given vector layer to a text file.

This is the same information you will find when you access the 'Fields' tab from the layer's property window as seen below.


Currently, there is no way to copy/save this table to a text or similar file for use outside the QGIS interface. So, it would be great if we wrote a little script to do the hard work for us.


The code:-

# Read active layer from the QGIS layer panel or read the shapefile from its path
layer = qgis.utils.iface.activeLayer()

# vector_file = r"C:\path_to_shapefile.shp"
# layer = QgsVectorLayer(vector_file, 'DISPLAYNAME', 'ogr')

# Count the number of features (rows) and number of fields (columns)
featureCount = layer.featureCount()
fieldCount = layer.fields().count()

# Loop through the layer fields to get each field name and type
data_list = []
for field in layer.fields():
    field_name = field.name()
    field_type = field.typeName()

    data = field_name, field_type
    data_list.append(data)


# Write the data_list to text file...
txtFileName = layer.name() # from layer name
with open(txtFileName +'.txt', 'w', encoding="utf-8") as f:
    print(data_list, end='\n', file = f)

# Print location of the text file...    
import os
print('The text file is saved at: ', os.getcwd(), ' and its file name is: ', txtFileName)

The comments in the code are self-explanatory; also remember to import the necessary modules.



You can extend the script by writing the data to a spreadsheet file using the csv or pandas module.
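For example, here is a minimal sketch of the csv extension (data_list and txtFileName come from the script above):

import csv

# Write one field per row instead of a single printed list...
with open(txtFileName + '.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['field_name', 'field_type'])
    writer.writerows(data_list)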

That is it!

Tuesday, February 9, 2021

Scrape world university data from Webometrics

In this post, I will work through the process of extracting data from the "Webometrics Ranking of World Universities" website, as requested by my client.

The website has a table with thousands of records representing the ranking system for the world's universities as seen above.

Note that the following three columns ('University', 'Det.' and 'Country') contain hyperlinks. We would like to get those hyperlinks as well, instead of just the icons.

For this reason, we cannot use the pandas.read_html(html_page) method, because it won't return the hyperlinks from those columns. So, we have to use the BeautifulSoup library to look up the hyperlinks in the HTML source after sending a GET request using the requests or selenium library.

At the end we will save the data into a spreadsheet using the pandas library.

Summary
1) Send a GET request to the web pages - Requests or Selenium
2) Extract data from the response html content - BeautifulSoup
3) Format and Save the data to file - Pandas

A quick lookup shows that to get to the next page, a page query string is added to the URL like so: http://webometrics.info/en/world?page=1, http://webometrics.info/en/world?page=2, http://webometrics.info/en/world?page=3, http://webometrics.info/en/world?page=4, etc. The last page is http://webometrics.info/en/world?page=120 at the time of writing.

Now, let's extract data from the first page (http://webometrics.info/en/world?page=0), then use a for loop to extract from all the other pages.
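Below is a minimal sketch of that first-page extraction; the exact table markup and column layout on webometrics.info may differ from what this assumes:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://webometrics.info/en/world?page=0'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

rows = []
table = soup.find('table')
for tr in table.find_all('tr')[1:]:  # skip the header row
    row = []
    for td in tr.find_all('td'):
        # Keep the cell text plus any hyperlink inside the cell...
        link = td.find('a')
        row.append(td.get_text(strip=True))
        row.append(link['href'] if link else '')
    if row:
        rows.append(row)

# Format and save the data to a spreadsheet...
pd.DataFrame(rows).to_excel('webometrics_page0.xlsx', index=False)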

Saturday, February 6, 2021

QGIS 'Spreadsheet Layers' Plugin

Don't want to use CSV files? Load layers from spreadsheet files (*.ods, *.xls, *.xlsx).

As at the time of writing, QGIS has no built-in support for Microsoft Excel spreadsheet files (.xls or .xlsx). Fortunately, CampToCamp developed a plugin named "Spreadsheet Layers" to fill this gap.



Search for and install the plugin as usual. It will then be available under the Layer >> Add Layer menu.



Enjoy!

Tuesday, February 2, 2021

Writing multiple spreadsheet files to worksheets

Here we have 37 spreadsheet files within a folder, and I want all the files to be in a single spreadsheet file, with each file on a separate worksheet.

This requirement is different from merging the files into a single Excel worksheet. What is required here is to have each file as a worksheet within one Excel file, as seen below.


The code is as follow:-

It makes use of the pandas ExcelWriter method. The parameter "options={'strings_to_urls': False}" is set to allow writing cell values that look like URLs and are longer than 255 characters (xlsxwriter would otherwise try to convert them to hyperlinks).


import glob
import pandas as pd
folder = r"C:\Users\Yusuf_08039508010\Documents\Distinguished Senators"

senators_files = glob.glob(folder + '/*.xlsx')
len(senators_files)


# Writing multiple dataframes to worksheets...
writer = pd.ExcelWriter('DistinguishedSenators.xlsx', engine='xlsxwriter', options={'strings_to_urls': False})

for sheet in senators_files:
    print("Writting sheet...", sheet)
    
    sheetname = sheet.split('\\')[-1].split('.')[0]
    
    sheet_df = pd.read_excel(sheet)
    sheet_df = sheet_df.head(-1)
    
    print(sheet_df.shape)
    
    sheet_df.to_excel(writer, sheet_name=sheetname, index=None) # Save each df to excel

writer.save()

Related Materials

1) How to Write Pandas DataFrames to Multiple Excel Sheets

2) Example: Pandas Excel with multiple dataframes

Wednesday, January 27, 2021

Running a custom python function in QGIS

In this post, I will explain how custom Python functions are made in QGIS.

Functions in QGIS are listed with the following categories:-

Aggregates Functions
Array Functions
Color Functions
Conditional Functions
Conversions Functions
Custom Functions
Date and Time Functions
Fields and Values
Files and Paths Functions
Form Functions
Fuzzy Matching Functions
General Functions
Geometry Functions
Layout Functions
Map Layers
Maps Functions
Mathematical Functions
Operators
Processing Functions
Rasters Functions
Record and Attributes Functions
Relations
String Functions
User Expressions
Variables
Recent Functions


So, if QGIS has all these functions, why would one ever need a custom function?

The answer is simple: these are not all the functions in the world. Most likely, there is a function that hasn't been implemented yet, and in that case you can write your own custom function.

Let's look at a simple scenario.

Assuming we have this little Python script that generates hex color codes like these: #40B994, #13E7BC, #3F50EB, #E28326, etc., and we want to use it to generate a new attribute column with random hex color codes.

from random import choice

def color_func():
    # The sixteen hex digits to choose from...
    hex_digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, "A", "B", "C", "D", "E", "F"]

    # Pick six random digits and join them into a color code...
    hex_list = [str(choice(hex_digits)) for x in range(6)]

    hex_string = '#' + ''.join(hex_list)
    return hex_string
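Note that to actually register the function with the field calculator, it must be entered in the Function Editor tab of the expression dialog, wrapped with the @qgsfunction decorator; QGIS passes the extra feature and parent arguments automatically. A minimal sketch of that wrapper:

from random import choice

from qgis.core import qgsfunction

@qgsfunction(args='auto', group='Custom')
def color_func(feature, parent):
    """Return a random hex color code, e.g. #40B994."""
    hex_digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, "A", "B", "C", "D", "E", "F"]
    return '#' + ''.join(str(choice(hex_digits)) for x in range(6))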

The function will then be called like this in the field calculator. Note that the function's name is: color_func()


Sunday, January 24, 2021

Convert cURL command to Python Request

cURL stands for Client URL (Uniform Resource Locator). It is a tool to transfer data from or to a server, using any of the supported protocols (HTTP, FTP, IMAP, POP3, SCP, SFTP, SMTP, TFTP, TELNET, LDAP or FILE).

A simple usage call is: curl http://umaryusuf.com

This will return the HTML content of the given web page, as seen below.



cURL is commonly used by service providers to demonstrate access to their APIs without depending on any particular programming language.

As an example, let's convert this sample API code from ScraperAPI in cURL format to Python requests format:-

curl "http://api.scraperapi.com?api_key=we709a6dkbask80kjbaskjoie2nsaqa7&url=http://httpbin.org/ip"


There are several ways to convert cURL to Python; the one I use most often is a tool by Nick Carneiro (https://curl.trillworks.com). So, copy and paste the cURL code to generate a Python version.


import requests

params = (
    ('api_key', 'we709a6dkbask80kjbaskjoie2nsaqa7'),
    ('url', 'http://httpbin.org/ip'),
)

response = requests.get('http://api.scraperapi.com/', params=params)

#NB. Original query string below. It seems impossible to parse and
#reproduce query strings 100% accurately so the one below is given
#in case the reproduced version is not "correct".
# response = requests.get('http://api.scraperapi.com?api_key=we709a6dkbask80kjbaskjoie2nsaqa7&url=http://httpbin.org/ip')

That is it!

Thursday, January 21, 2021

Assign color to vector layer based on HTML Notation (HEX color) codes in attribute table

Here, I have a polygon layer with a 'color' attribute column which contains HEX color codes, as seen below.


We want to assign each polygon its color value from the attribute table; for example: Delta = #6808C9.

This is close to 'Categorized Symbology', but the difference is that the colors come directly from the attribute column. To do this, we have to edit the 'Fill Color' expression to read the color column.
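In the GUI, this means opening the data-defined override for Fill Color and setting the expression to the field name, "color". If you prefer scripting it, here is a minimal PyQGIS sketch, assuming the layer uses a single symbol whose first symbol layer is a simple fill:

from qgis.core import QgsProperty, QgsSymbolLayer
from qgis.utils import iface

# Read each feature's fill color from its 'color' attribute...
layer = iface.activeLayer()
symbol = layer.renderer().symbol()
symbol.symbolLayer(0).setDataDefinedProperty(
    QgsSymbolLayer.PropertyFillColor,
    QgsProperty.fromField('color')
)
layer.triggerRepaint()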




This way, each polygon is given the HEX color code that corresponds to the value in its attribute column.




Happy Mapping!