Friday, June 24, 2022

Extracting 'Standardized Precipitation-Evapotranspiration Index (SPEI)' data from netCDF file into CSV spreadsheet

 In this post, we shall learn how to use python to extract netCDF data into CSV.


What is NetCDF?

NetCDF stands for "network Common Data Form" and it is a file format for storing multidimensional scientific data (variables) such as temperature, humidity, pressure, wind speed, precipitation and direction. There are several versions of netCDF and the version we are using here is version 4 (netCDF4).


The Dataset

The dataset am going to use is the Global 01-month 1901-2020 'Standardized Precipitation-Evapotranspiration Index (SPEI )' provided by Digital.csis



Read more details on this dataset on: SPEI Global Drought Monitor Database

How to open netCDF dataset

The netCDF data can be visualized using QGIS software. This will give us a quick overview as to what variables and attributes are contained in the file.


Some GIS packages allow reading netCDF data, such as QGIS, ArcGIS and IDRISI Taiga. There are other netCDF viewers, such as Panoply (developed in Java), ncBrowse, ncview, and nCDF_Browser.

Using Python to read netCDF file

First you need to install it using: pip install netCDF4

Then after which you will import the module then read the file and apply methods or attributes defined in the module. Lets see some code below:-

# import modules...
from netCDF4 import Dataset, num2date
import numpy as np

# Open an existing netCDF file...
netcdf_file = r"spei01.nc"
data = Dataset(netcdf_file, 'r')
Now we have the netCDF data object that we can access variables, attributes, dimensions, groups etc from the file in question.
print(data.variables)
print(data.variables.keys())
print(data.variables.values())
print(data.dimensions)

Since we only have four variable keys (dict_keys(['lon', 'lat', 'time', 'spei'])) in the netCDF data, lets explore each one after the other.

# Exploring lat variable...
print(data.variables['lat'].shape)
print(data.variables['lat'].size)
print(data.variables['lat'].dimensions)
print(data.variables['lat'][:])
print(data.variables['lat'].units)
print(data.variables['lat'].limits)
print(data.variables['lat'].long_name)
print(data.variables['lat'].convention)

# Exploring lon variable...
print(data.variables['lon'].shape)
print(data.variables['lon'].size)
print(data.variables['lon'].dimensions)
print(data.variables['lon'][:])
print(data.variables['lon'].units)
print(data.variables['lon'].limits)
print(data.variables['lon'].long_name)
print(data.variables['lon'].convention)

# Exploring time variable...
print(data.variables['time'].shape)
print(data.variables['time'].size)
print(data.variables['time'].dimensions)
print(data.variables['time'][:])
print(data.variables['time'].units)
print(data.variables['time'].limits)
print(data.variables['time'].long_name)

# Exploring SPEI variable...
print(data.variables['spei'].shape)
print(data.variables['spei'].size)
print(data.variables['spei'].dimensions)
print(data.variables['spei'][:100]) # Huge data Warning: Slice only to the first 100th row\
print(data.variables['spei'].units)
print(data.variables['spei']._FillValue)
print(data.variables['spei'].long_name)

For all the four variables, we are able see some of their properties like: units, limits, shape, size, long names etc. While most of the units are as usual, the time variable unit is in "days since 1900-1-1", it may require conversion to make is readable by human. There is a function that comes with the netCDF4 module named "num2date()" that does the conversion easily as follow.


The num2date() function takes the number of days and the "CF date-time unit_string" to generate a real_datetime object like this: "real_datetime(2021, 1, 15, 0, 0)". After which we used pandas to_datetime() and strftime() functions to return a human readable date.

# convert 'units: days since 2020-1-1' to date
num_days = 380
day_since = 'days since 2020-1-1'

time_convert = num2date(num_days, day_since, only_use_cftime_datetimes=False)
date_range = pd.to_datetime(time_convert).strftime('%Y-%m-%d') # Use pandas to get the human readable date
print(date_range)


# Bulk conversion...
num_days = [380, 390, 400, 410, 420, 430, 440, 450, 1000]
time_convert = num2date(num_days, day_since, only_use_cftime_datetimes=False)

date_range = [ pd.to_datetime(t).strftime('%Y-%m-%d') for t in time_convert ]
print(date_range)

With all these information in our pam, all that is left is to use other python programming syntax to manipulate and extract the needed variables accordingly.

In the code snippet below, we have csv file containing list of locations that we need to extract their corresponding SPEI data into csv files.

import pandas as pd
from netCDF4 import Dataset, num2date
netcdf_file = r"C:\Users\Yusuf_08039508010\Desktop\...\spei01.nc"
sample_pt_df = pd.read_csv(r"C:\Users\Yusuf_08039508010\Desktop\...\samples.csv")

# Read in the netCDF file...
netcdf_data = Dataset(netcdf_file, 'r')

# Storing the lat and lon data into the variables 
lat = netcdf_data.variables['lat'][:]
lon = netcdf_data.variables['lon'][:]

# for loc_id, site_name, xlong, ylat in zip(sample_pt_df['ID'], sample_pt_df['site'], sample_pt_df['lon'], sample_pt_df['lat']):
for index, row in sample_pt_df.iterrows():
    loc_id = row['ID']
    site_name = row['site']
    xlong = row['lon']
    ylat = row['lat']
    
    print('Processing...', loc_id)
    fname = str(loc_id) + '_' + site_name

    # Defining the lat, lon for the location of your interest
    zamfara_lat = ylat
    zamfara_lon = xlong


    # Squared difference of lat and lon 
    sq_diff_lat = (lat - zamfara_lat)**2
    sq_diff_lon = (lon - zamfara_lon)**2

    # Identifying the index of the minimum value for lat and lon 
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()

    # Accessing the 'Standardized Precipitation-Evapotranspiration Index' data
    spei = netcdf_data.variables['spei']
    # spei1 = netcdf_data.variables['spei'][:]
    # spei2 = np.ma.getdata(spei1)

    # Creating an empty pandas dataframe
    # convert 'units: days since 1900-1-1' to date
    time_net = data.variables['time']
    time_convert = num2date(time_net[:], time_net.units, only_use_cftime_datetimes=False)

    date_range = [ pd.to_datetime(t).strftime('%Y-%m-%d') for t in time_convert ]
    df = pd.DataFrame(0, columns = ['SPEI'], index = date_range)
    
    dt = np.arange(0, netcdf_data.variables['time'].size)
    for time_index in dt:
        df.iloc[time_index] = spei[time_index, min_index_lat, min_index_lon]

    # Saving the time series into a csv
    df.to_csv(f'resultFolder\\{fname}.csv')
    
print('Finished...')


Related material:

1) netCDF4 API documentation - https://unidata.github.io/netcdf4-python/

No comments:

Post a Comment