Sunday, February 28, 2021

Rename multiple files with new names in excel spreadsheet

In the past, I wrote a similar script titled "Python script to rename multiple files/folders".

The only difference here is that the new file names will come from a column in an Excel spreadsheet instead of being generated within the script.

Below is the spreadsheet file that contains the current file names and their corresponding new names.




For example, image '4.jpg' would be renamed to 'Barack Obama.jpg', '9.jpg' to 'Donald Trump.jpg', '30.jpg' to 'Joseph Robinette Biden Jr.jpg', and so on.

Note that all the images have the same extension (.jpg), so we will maintain the extension.


The script

First, we will read the Excel file into a pandas dataframe, then create a dictionary from the two columns where the keys are the 'Old Name' entries and the values are the 'New Name' entries.

import os
import pandas as pd


names_df = pd.read_excel(r"C:\Users\Yusuf_08039508010\Desktop\rename.xlsx")
names_df


names_df_dict = dict(zip(names_df['Old Name'], names_df['New Name']))
names_df_dict

Now, we can access the values of the dictionary by their keys like so: names_df_dict['1.jpg']. With this, we will loop over the files in the images folder and rename each one accordingly.

images_folder = r'C:\Users\Yusuf_08039508010\Documents\US Presidents'

for file in os.listdir(images_folder):
    print('Renaming...', names_df_dict[file])
    
    # Use os.path.join() to construct the absolute path to the images... 
    # Alternatively, we could change directory (os.chdir()) to the images folder
    old_img_name = os.path.join(images_folder, file)
    new_img_name = os.path.join(images_folder, names_df_dict[file] + '.jpg')
    
    os.rename(old_img_name, new_img_name)
    
print('Finished....')
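
The script above assumes every file in the folder appears in the spreadsheet and that all the files end with '.jpg'. If the folder could ever contain extra files or mixed extensions, a slightly more defensive sketch (reusing the same images_folder and names_df_dict from above) would look like this:

for file in os.listdir(images_folder):
    # Skip files that are not listed in the spreadsheet...
    if file not in names_df_dict:
        print('Skipping (not in spreadsheet)...', file)
        continue

    # Keep whatever extension the file already has...
    extension = os.path.splitext(file)[1]

    old_img_name = os.path.join(images_folder, file)
    new_img_name = os.path.join(images_folder, names_df_dict[file] + extension)

    os.rename(old_img_name, new_img_name)

print('Finished....')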

To be sure our renaming script did a perfect job, let's verify the last three presidents, that is:-

  • '4.jpg' would be renamed to 'Barack Obama.jpg', 
  • '9.jpg' to 'Donald Trump.jpg', 
  • '30.jpg' to 'Joseph Robinette Biden Jr.jpg'


That is it!

Tuesday, February 16, 2021

Get Emails from Google search given company Name/Domain

Given a list of company names, search Google to retrieve their email addresses:-

import re
import pandas as pd

import requests
from bs4 import BeautifulSoup


list_of_url = ['http://umaryusuf.com', 'another website']

# REGEX to search for emails...
EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

unique_emails_list = []

for name in list_of_url:    
    search_query = name + " email"
    print('Processing...', name)

    # -------------- FOR BULK GOOGLE SEARCH USE A PROXY -----------------
    params = (
            ('api_key', 'XXXXXXXXXXXXXXXXXXXXXXXXXXX'),
            ('url', 'https://www.google.com/search?q='+search_query),
        )
    response = requests.get('http://api.scraperapi.com/', params=params)
    # -------------------------------------------------------------------


    print(response.status_code)

    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.get_text()

    emails_1 = [re_match.group() for re_match in re.finditer(EMAIL_REGEX, text)]

    emails_2 = re.findall(r"[A-Za-z0-9._%+-]+"
                         r"@[A-Za-z0-9.-]+"
                         r"\.[A-Za-z]{2,4}", text)

    unique_emails = list(set(emails_1 + emails_2))
    data = name, unique_emails

    unique_emails_list.append(data)
    print(data)
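
Since pandas is already imported, the collected (name, emails) pairs can be written straight to a spreadsheet. Here is a minimal sketch (the output file name is just an example):

emails_df = pd.DataFrame(unique_emails_list, columns=['Name', 'Emails'])
emails_df.to_excel('scraped_emails.xlsx', index=False)

Note that each 'Emails' cell will hold the whole list of addresses found for that name, written as plain text.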


Given a list of company domain names, access each domain web page and get all emails from the web page:-

import re
import urllib.request


list_of_url = ['http://umaryusuf.com']



site_list = []

for domain in list_of_url:

    print('Processing...', domain)
    
    try:
        f = urllib.request.urlopen(domain)
        s = f.read().decode('ISO-8859-1')
        emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)
        newemails = list(set(emails))
        d = domain, newemails

        site_list.append(d)
        print (d)
    except Exception:
        d = domain, 'Error Occurred'
        site_list.append(d)
        print (d)

print("Finished...")


Enjoy!

Monday, February 15, 2021

PyQGIS - Write vector layers field name and type to text file

The PyQGIS script below will write the field names and field types of a given vector layer to a text file.

This is the same information you will find when you access the 'Fields' tab from the layer's property window as seen below.


Currently, there is no way to copy/save this table to a text or similar file for use outside the QGIS interface. So, it would be great if we wrote a little script to do the hard work for us.


The code:-

# Read active layer from the QGIS layer panel or read the shapefile from its path
layer = qgis.utils.iface.activeLayer()

# vector_file = r"C:\path_to_shapefile.shp"
# layer = QgsVectorLayer(vector_file, 'DISPLAYNAME', 'ogr')

# Count the number of features (rows) and the number of fields (columns)
featureCount = layer.featureCount()
fieldCount = layer.fields().count()

# Loop through the layer fields to get each field name and type
data_list = []
for field in layer.fields():
    field_name = field.name()
    field_type = field.typeName()

    data = field_name, field_type
    data_list.append(data)


# Write the data_list to a text file, one field per line...
txtFileName = layer.name() # use the layer name as the file name
with open(txtFileName + '.txt', 'w', encoding="utf-8") as f:
    for field_name, field_type in data_list:
        print(field_name, field_type, sep=', ', file=f)

# Print the location of the text file...    
import os
print('The text file is saved at:', os.getcwd(), 'and its file name is:', txtFileName + '.txt')

The comments in the code are self-explanatory; also remember to import the necessary modules.



You can extend the script by writing the data to a spreadsheet file using the csv or pandas module, as shown below.
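
For example, here is a minimal sketch using the built-in csv module; it reuses the txtFileName and data_list variables from the script above:

import csv

# Write each (field name, field type) pair as a row in a CSV file...
with open(txtFileName + '.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Field Name', 'Field Type']) # header row
    csv_writer.writerows(data_list)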

That is it!

Tuesday, February 9, 2021

Scrape world university data from Webometrics

In this post, I will work through the process of extracting data from the "Webometrics Ranking of World Universities" website, as requested by my client.

The website has a table with thousands of records representing the ranking system for the world's universities as seen above.

Note that the following three columns ('University', 'Det.' and 'Country') contain hyperlinks. We would like to get those hyperlinks as well, instead of just the icons.

For this reason, we cannot use the pandas.read_html(html_page) method, because it won't return the hyperlinks from those columns. So, we have to use the BeautifulSoup library to look up the hyperlinks in the HTML source after sending a GET request using the requests or selenium library.

At the end we will save the data into a spreadsheet using the pandas library.

Summary
1) Send a GET request to the web pages - Requests or Selenium
2) Extract data from the response HTML content - BeautifulSoup
3) Format and save the data to file - Pandas

A quick lookup shows that to get to the next page, a page query string is added to the URL like so: http://webometrics.info/en/world?page=1, http://webometrics.info/en/world?page=2, http://webometrics.info/en/world?page=3, http://webometrics.info/en/world?page=4, etc. The last page is http://webometrics.info/en/world?page=120 at the time of writing.

Now, let's extract data from the first page (http://webometrics.info/en/world?page=0), then use a for loop to extract all the other pages.
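
Here is a sketch of the whole process for the first page. The cell positions and the location of the hyperlinks are my assumptions from inspecting the page, so adjust the indexes if the markup differs; the same pattern applies to the 'Det.' and 'Country' cells:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://webometrics.info/en/world?page=0'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

data_rows = []
table = soup.find('table') # assumption: the ranking table is the first table on the page
for tr in table.find_all('tr')[1:]: # skip the header row
    cells = tr.find_all('td')
    if len(cells) < 3:
        continue

    ranking = cells[0].get_text(strip=True)
    university = cells[1].get_text(strip=True)
    # Grab the hyperlink as well, not just the visible text...
    university_link = cells[1].a.get('href') if cells[1].a else ''

    data_rows.append((ranking, university, university_link))

df = pd.DataFrame(data_rows, columns=['Ranking', 'University', 'University Link'])
df.to_excel('webometrics_page0.xlsx', index=False)

Extracting the remaining pages is then just a matter of formatting the URL with each page number (range(0, 121)) and appending every page's rows to data_rows before saving.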

Saturday, February 6, 2021

QGIS 'Spreadsheet Layers' Plugin

Don't want to use a CSV file? Load layers from spreadsheet files (*.ods, *.xls, *.xlsx).

As at the time of writing, QGIS has no built-in support for Microsoft Excel spreadsheet files (.xls or .xlsx). Fortunately, CampToCamp developed a plugin named "Spreadsheet Layers" to fill this gap.



Search for and install the plugin as usual. It will then be available under the Layer >> Add Layer menu.



Enjoy!

Tuesday, February 2, 2021

Writing multiple spreadsheet files to worksheets

Combine individual files into one file of multiple tabs

Here we have 37 spreadsheet files within a folder, and I want all the files to be in a single spreadsheet file with each file on a separate worksheet.

This requirement is different from merging the files into a single Excel worksheet. What is required here is to have each file as a worksheet within one Excel file, as seen below.


The code is as follow:-

It makes use of the pandas ExcelWriter method. The parameter "options={'strings_to_urls': False}" stops xlsxwriter from converting URL-like strings into hyperlinks, which would otherwise fail for cell values longer than 255 characters.


import glob
import pandas as pd
folder = r"C:\Users\Yusuf_08039508010\Documents\Distinguished Senators"

senators_files = glob.glob(folder + '/*.xlsx')
len(senators_files)


# Writing multiple dataframes to worksheets...
writer = pd.ExcelWriter('DistinguishedSenators.xlsx', engine='xlsxwriter', options={'strings_to_urls': False})

for sheet in senators_files:
    print("Writting sheet...", sheet)
    
    sheetname = sheet.split('\\')[-1].split('.')[0] # file name without folder and extension
    
    sheet_df = pd.read_excel(sheet)
    sheet_df = sheet_df.head(-1) # drop the last row
    
    print(sheet_df.shape)
    
    sheet_df.to_excel(writer, sheet_name=sheetname, index=None) # Save each df to excel

writer.save()
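
One thing to watch out for: Excel limits worksheet names to 31 characters, so if any of the file names are longer than that, truncate the sheet name inside the loop before calling to_excel():

    sheetname = sheetname[:31] # Excel worksheet names are limited to 31 characters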

Related Materials

1) How to Write Pandas DataFrames to Multiple Excel Sheets

2) Example: Pandas Excel with multiple dataframes