Friday, July 1, 2022

Python GIS Data Wrangling - U.S. Drought Monitor

This post was inspired by John Nelson's YouTube video "How to Make This Drought Map Pt 1: DATA WRANGLING", in which he manually wrangled the dataset for the year 2018.

What he did works well if you are only processing a single year. If you intend to repeat the workflow for several years, however, the process becomes time-consuming and prone to mistakes. For this reason, I will recreate the workflow with Python scripting so the whole process can be automated with a few button clicks.


More specifically, I will cover the following processes:

  1. Download and extract the zip files
  2. Combine the shapefiles into a single folder
  3. Merge the shapefiles into a single shapefile


Let's get started.

1) Construct and download the zip files

First, we need to download the dataset for all previous years. Here, let's use Python to generate the zip download links for each year.

# Construct a list of Drought Monitor shapefile download links for the years 2000-2022...

dm_url_list = []

for x in range(0, 23):
    year = f'20{x:02}'  # zero-pad single-digit values: 2000, 2001, ..., 2022
    base_url = f'https://droughtmonitor.unl.edu/data/shapefiles_m//{year}_USDM_M.zip'
    dm_url_list.append(base_url)


print(dm_url_list)
The code above will produce the list of URLs shown below, and we can simply loop over the list to download the weekly shapefile datasets.
['https://droughtmonitor.unl.edu/data/shapefiles_m//2000_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2001_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2002_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2003_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2004_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2005_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2006_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2007_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2008_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2009_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2010_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2011_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2012_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2013_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2014_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2015_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2016_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2017_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2018_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2019_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2020_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2021_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2022_USDM_M.zip']

There are many ways to download files from a URL in Python; however, simply using the built-in webbrowser module will do what we intend here, as shown below.

import webbrowser

dm_url_list = ['https://droughtmonitor.unl.edu/data/shapefiles_m//2000_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2001_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2002_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2003_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2004_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2005_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2006_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2007_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2008_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2009_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2010_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2011_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2012_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2013_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2014_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2015_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2016_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2017_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2018_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2019_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2020_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2021_USDM_M.zip', 'https://droughtmonitor.unl.edu/data/shapefiles_m//2022_USDM_M.zip']

for url in dm_url_list:
    webbrowser.open(url, new=2)
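
If you would rather not open a browser tab for every file, another option is to download the zip files directly with the built-in urllib.request module. Below is a minimal sketch; the download_folder name is just a placeholder, so adjust it to your own working directory.

import os
import urllib.request

# Rebuild the dm_url_list constructed earlier
dm_url_list = [f'https://droughtmonitor.unl.edu/data/shapefiles_m//20{x:02}_USDM_M.zip' for x in range(0, 23)]

# Hypothetical download folder - adjust to your own working directory
download_folder = 'dm_downloads'
os.makedirs(download_folder, exist_ok=True)

for url in dm_url_list:
    zip_name = url.split('/')[-1]                  # e.g. 2017_USDM_M.zip
    destination = os.path.join(download_folder, zip_name)
    urllib.request.urlretrieve(url, destination)   # download and save the zip file
    print(f'Downloaded {zip_name}')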


With the files downloaded, next we need to extract the zip files.


2) Combine the shapefiles into a single folder

Let's extract the zip files into a single folder.

We will first extract the contents of the parent/main zip file before extracting the sub zip files it contains.

# Extracting year main zip file
import os
from zipfile import ZipFile


# specifying the zip file name
zipfile_name = r"C:\Users\Yusuf_08039508010\Desktop\Working_Files\...\DataWrangling - U.S. Drought Monitor\2019_USDM_M.zip"

# opening the zip file in READ mode
with ZipFile(zipfile_name, 'r') as zip_file:
    # zip_file.printdir() # printing all the contents of the zip file

    folder_name = os.path.basename(zipfile_name).split('.')[0]
    folder_path = os.path.dirname(zipfile_name)
    complete_path = os.path.join(folder_path, folder_name)

    # Make directory
    os.makedirs(complete_path, exist_ok=True)

    # Extract the content of the zip file into complete_path
    zip_file.extractall(path=complete_path)
    print('Done!')

# Extracting the sub zip files for a given year
year_folder = r"C:\Users\Yusuf_08039508010\Desktop\Working_Files\...\DataWrangling - U.S. Drought Monitor\2017_USDM_M"

shp_folder = os.path.join(year_folder, os.path.basename(year_folder) + 'SHP')
# Make directory
os.makedirs(shp_folder, exist_ok=True)
    
for dirpath, subdirs, files in os.walk(year_folder):
    for f in files:
        if f.endswith(".zip"):
            with ZipFile(os.path.join(dirpath, f), 'r') as zip_file:
                zip_file.extractall(path=shp_folder)
print('Done!')
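
Since the whole point is to repeat this for every year, the two extraction steps above can be wrapped into a single loop. Here is a minimal sketch, assuming all the yearly zip files were saved into one folder (the download_folder path below is a placeholder):

import os
import glob
from zipfile import ZipFile

# Hypothetical folder holding all the yearly zip files - adjust to your own path
download_folder = r'C:\path\to\drought_monitor_zips'

for year_zip in glob.glob(os.path.join(download_folder, '*_USDM_M.zip')):
    # Extract the main zip for the year, e.g. 2017_USDM_M.zip -> 2017_USDM_M folder
    year_folder = os.path.splitext(year_zip)[0]
    os.makedirs(year_folder, exist_ok=True)
    with ZipFile(year_zip, 'r') as zip_file:
        zip_file.extractall(path=year_folder)

    # Extract the weekly sub zips into a ...SHP subfolder, mirroring the steps above
    shp_folder = os.path.join(year_folder, os.path.basename(year_folder) + 'SHP')
    os.makedirs(shp_folder, exist_ok=True)
    for sub_zip in glob.glob(os.path.join(year_folder, '*.zip')):
        with ZipFile(sub_zip, 'r') as zip_file:
            zip_file.extractall(path=shp_folder)

    print(f'Extracted {os.path.basename(year_zip)}')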


3) Merge the shapefiles into a single shapefile

There are many ways to accomplish this using Python. One easy way is to run a processing algorithm from the QGIS Python console.

Note that the merge we will perform here is on polygon shapefiles with the same attribute fields and shape type.

import glob
import processing

# Alternative using the os module
# pa = 'C:/Users/Yusuf_08039508010/Desktop/.../DataWrangling - U.S. Drought Monitor/2017_USDM_M/2017_USDM_MSHP'
# shp_files = [pa + '/' + x for x in os.listdir(pa) if x.endswith('.shp')]

shp_folder = r'C:\Users\Yusuf_08039508010\Desktop\...\DataWrangling - U.S. Drought Monitor\2017_USDM_M\2017_USDM_MSHP'
shp_files = glob.glob(f'{shp_folder}\\*.shp')

parameters = {
    'LAYERS': shp_files,
    'CRS': None,
    'OUTPUT': 'C:/Users/Yusuf_08039508010/Desktop/Working_Files/.../DataWrangling - U.S. Drought Monitor/2017_USDM_M/2017_USDM_MSHP/Merge_USDM.shp'
}

processing.runAndLoadResults("native:mergevectorlayers", parameters)

print('Done....')


We can also use geopandas to merge the shapefiles as follows:

import glob
import pandas as pd
import geopandas as gpd


shp_folder = r'C:\Users\Yusuf_08039508010\Desktop\Working_Files\Fiverr\2021\012-December\DataWrangling - U.S. Drought Monitor\2017_USDM_M\2017_USDM_MSHP'
shp_files = glob.glob(f'{shp_folder}\\*.shp')


shp_gdf_list = []
for shp in shp_files:
    shp_gdf = gpd.read_file(shp)
    shp_gdf_list.append(shp_gdf)

# Merge the shapefiles by concatenating them together...
merge_shp = gpd.GeoDataFrame(pd.concat(shp_gdf_list, ignore_index=True))

# Save to shp...
merge_shp.to_file('merge_shp_from_Geopandas.shp')
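
Finally, to automate the merge across every year, the same geopandas steps can be wrapped in a loop over the yearly SHP folders produced in step 2. This is a minimal sketch; the parent_folder path and the output filenames are placeholders, so adjust them to your own setup.

import os
import glob
import pandas as pd
import geopandas as gpd

# Hypothetical parent folder containing the yearly 20XX_USDM_M folders - adjust to your own path
parent_folder = r'C:\path\to\DataWrangling - U.S. Drought Monitor'

for year_folder in glob.glob(os.path.join(parent_folder, '*_USDM_M')):
    shp_folder = os.path.join(year_folder, os.path.basename(year_folder) + 'SHP')
    shp_files = glob.glob(os.path.join(shp_folder, '*.shp'))
    if not shp_files:
        continue  # skip year folders that have not been extracted yet

    # Read and concatenate all the weekly shapefiles for the year
    shp_gdf_list = [gpd.read_file(shp) for shp in shp_files]
    merge_gdf = gpd.GeoDataFrame(pd.concat(shp_gdf_list, ignore_index=True))

    # Save one merged shapefile per year, e.g. Merge_2017_USDM_M.shp
    out_path = os.path.join(year_folder, f'Merge_{os.path.basename(year_folder)}.shp')
    merge_gdf.to_file(out_path)
    print(f'Merged {len(shp_files)} shapefiles for {os.path.basename(year_folder)}')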


Conclusion

In this article, we have seen how to use Python to wrangle the U.S. Drought Monitor data. We now have an automated workflow that we can replicate to make the process faster.

That is it!
