Tuesday, August 3, 2021

GIS data wrangling with python - case study of 'African Journals Online' featured countries

 Real world GIS data are usually messy and you have to wrangle and clean them before a GIS software can take them in to make useful analysis.

Of course, the data wrangling process can be done manually but when you have a very large data with complex and messy structure then 'Python GIS data wrangling' is the sure way to go.

In this post, we will take a look at the following case study of 'African Journals Online' featuring countries. This online journal features 33 African countries as listed on the website and it is require we visualize those countries on a map of Africa.

As you can see the data is fairly simple to prepare by hand, but we will use python so we can develop our skill and be prepared for more complex data.

The data we are interested in are the country name and the number of publish articles. For example, Algeria (5) means country name is Algeria and it has 5 number of publish articles.

Step 1: Copy the data into python

There are several ways we can use to achieve this. The easiest is to copy it into a spreadsheet file as seen below...

Then we can read the data and wrangle/clean it into the format for GIS use.

import pandas as pd

aoj_df = pd.read_excel("aoj.xlsx")

Step 2: Write function to clean the data

A good starting point is to pick just one record and wrangle it before applying the solution to all other records.

aoj_df['Country'] = aoj_df['AOJ'].apply(lambda x : x.split('(')[0].strip())
aoj_df['NumberOfArticles'] = aoj_df['AOJ'].apply(lambda x : x.split('(')[1].replace(')', '').strip())

Step 3: Export the data into a spreadsheet

Since the resulting data will be use in GIS environment, we have to save it to file we can import into QGIS/ArcGIS or any GIS software we want to work with.

# Save to file...
aoj_df.to_excel('aoj_cleaned.xlsx', index=False)

Our data is now ready for GIS analysis.

Using the data

To use the data above, I made use of QGIS to join the spreadsheet to an African shapefile map. But there was an issue with the following counties ('Congo, Democratic Republic', 'Congo, Republic', 'Côte d'Ivoire', 'Egypt, Arab Rep.', 'Eswatini', 'South Sudan', 'Tanzania') either their names where differently spelt on both the spreadsheet and the shapefile or the country doesn't exist on the map.

How to wrangle both data to lookup the difference was explained on this post: PyQGIS Shapefile attribute lookup. Refer to the post to use python in QGIS to wrangle the data.

Basically, we need to compare two lists one for the map countries and the second for the AOJ spreadsheet countries to find out which countries will be missing from the join operation as listed above. The resulting code should look like the one below:-

# Select active layer for both map and spreadsheet...
map_layer = iface.activeLayer()
spreadsheet_layer = iface.activeLayer()

# Get map column attribute into list
map_countries = []
for feature in map_layer.getFeatures():

# Get map column attribute into list
spreadsheet_countries = []
for feature in spreadsheet_layer.getFeatures():

# Comparing the countries list...
missing_countries = []
for country in spreadsheet_countries:
    if country not in map_countries:

As you can see, these  ('Congo, Democratic Republic', 'Congo, Republic', 'Côte d'Ivoire', 'Egypt, Arab Rep.', 'Eswatini', 'South Sudan', 'Tanzania') countries names would have to be adjusted to match both the shapefile map and the spreadsheet data.

You can do this adjustment manually, however if you want to try editing them with python feel free to do so.

Hint: You can use the replace() function to complete this.

That is it!

No comments:

Post a Comment