Monday, April 3, 2023

Wrangling and Mapping ‘list of colleges teaching MBBS in India’ with Python

 In this post, I will demonstrate how to wrangle and map the ‘list of colleges teaching MBBS in India’. The list is available on this webpage.

First I will collected the dataset, clean it into a GIS friendly format, geocode the colleges, then map it on the map of India sourced from GADM. This is a project you could 100% complete without coding using tools like QGIS or ArcGIS, but here I will use python code to do it from scratch.

Data collection

Lets collect the list of data from the web page into CSV file. There are ways to do this in python like using requests module, selenium module, beautifulsoup module, scrapy module etc.

In this situation, I will copy the html element the represent the table into a local file, then extract the table from the local CSV spreadsheet using python.

Using pandas library, with few lines of code we got the list into a CSV file as seen below;-

mbbs_df = pd.read_html(r"mbbs.html")
mbbs_df[0].to_csv(r"mbbs.csv", index=False)
mbbs_df = mbbs_df[0]


Geocode the colleges

Next lets geocode the addresses. As always, there are several ways to do it. Here I will use an API provided by OpenCageData. Other geocoding API that could be used are Google Geocode, Bing Geocode, MapQuest Geocode, OpenStreetMap Geocoder etc

Create an account on OpenCageData, the free plan will serve our purpose. The API documentation page has useful example to get you started in no time. One reason why I prefer the OpenCageData API over other options is that the API call returns lot more useful information (they call them 'annotations') than just the latitude and longitude of the address. Some of these information are: 'DMS', 'MGRS', 'Maidenhead', 'Mercator', 'OSM', 'geohash', 'qibla', 'roadinfo', 'sun', 'timezone', 'what3words' etc.

An API endpoint or URL is like this:{YOUR-API-KEY}&q={college}&pretty=1

You will replace the variables in the {} with the appropriate information. If you load the URL, you will get a JSON data. Use a json editor to inspect the json items/keys as seen below.

The needed information are in the 'results' key, here we got multiple results. Usually the first one will be the best result.

Lets run this on our dataframe to add additional columns of information on latitude, longitude etc.

i = 1
data = []

for college in mbbs_df['Name and Address of Medical College/Medical Institution'][:30]:
    print(f'Processing {i} {college}')
    url = f'{college}&pretty=1'
    response = json.loads(requests.get(url).text)
        lat = response['results'][0]['geometry']['lat']
        lng = response['results'][0]['geometry']['lng']

        # Other information...
        what3words = response['results'][0]['annotations']['what3words']['words']
        qibla = response['results'][0]['annotations']['qibla']
        data_dict = {}
        data_dict['College'] = college
        data_dict['Latitude'] = lat
        data_dict['Longitude'] = lng
        data_dict['What3words'] = what3words
        data_dict['Qibla'] = qibla
    except Exception:
        data_dict['College'] = college
        data_dict['Latitude'] = 'NILL'
        data_dict['Longitude'] = 'NILL'
        data_dict['What3words'] = 'NILL'
        data_dict['Qibla'] = 'NILL'

    i += 1

data_dict_df = pd.DataFrame(data)

Map the data

We have to convert the latitude and longitude columns into a geometry column. 

The first step in plotting latitude and longitude from the CSV file with GeoPandas is to create a geopandas dataframe (geodataframe). This is achieved by GeoDataFrame, a module of GeoPandas library. This creates a new column called geometry which saves the latitude and longitude in POINT, POLYGON or MULTIPOLYGON format. It is from the geodataframe that the colleges can be drawn on map.

# Draw/Plot the MBBS data on a world map...
# Let use the dataset that comes with geopandas available at: gpd.datasets.available
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# We can restrict to Asia region only as seen below...
ax = world.plot(color='white', edgecolor='black', figsize=(30, 30))
ax.set_axis_off() # Off the axis

# We can now plot our ``GeoDataFrame``.
mbbs_gdf.plot(ax=ax, color='red', markersize=6)
# plt.savefig('world_MBBS.png', dpi=1080) # Save map to file...

To restrict the map to Asia region, filter the world map axis to 'Asia' based on the column continent like so;-
ax = world.loc[world['continent'] == 'Asia'].plot(color='white', edgecolor='black', figsize=(30, 30))

To restrict the map to India region, filter the world map axis to 'India' based on the column name like so;-

ax = world.loc[world['name'] == 'India'].plot(color='white', edgecolor='black', figsize=(20, 20))

Lets read the CSV file we saved from above and the shapefile or geopackage file we downloaded from

# Read the CSV, drop empty rows...
mbbs_df = pd.read_csv(r"mbbs.csv", encoding='unicode_escape').dropna(subset=['Sl. No'])

# Create geodf by adding a geometry column from lat/long columns...
mbbs_gdf = gpd.GeoDataFrame(mbbs_df, geometry=gpd.points_from_xy(mbbs_df.Longitude, mbbs_df.Latitude))

# Read GeoPackage file of India from
map_of_India = gpd.read_file(r"gadm41_IND.gpkg", layer='ADM_ADM_1')

# A quick plot of the data is like so...
mbbs_gdf.plot(cmap='Greys') # Blues, Oranges... more colors at:
mbbs_gdf.plot(color ='red') # Uniform color

# Color based on column: 'Management of College' and set map size...
mbbs_gdf.plot(cmap='jet', column='Management of College', figsize=(10, 10))

# Plot/Draw two maps together...
# First create subplot and add the map layers on it
fig, ax = plt.subplots(1, figsize=(10, 10))

map_of_India.plot(ax=ax, color='Grey', edgecolor='red')
mbbs_gdf.plot(ax=ax, color='black', alpha=0.4, markersize=10, marker='o')

We can plot a single state by filtering the datasets like so;-

# Plot single state...
# Filter out the dataset...
college_andhra_pradesh = mbbs_gdf.loc[mbbs_gdf['State'] == 'Andhra Pradesh']
map_andhra_pradesh = map_of_India.loc[ map_of_India['NAME_1'] == 'Andhra Pradesh' ]

# Define the subplot axis...
fig, ax = plt.subplots(1, figsize=(12, 6))
ax.set_axis_off() # Off the axis
fig.suptitle('State of Andhra Pradesh') # Set title for map

# Plot the maps...
map_andhra_pradesh.plot(ax=ax, color='gray', edgecolor='red')
college_andhra_pradesh.plot(ax=ax, color='black')

To add label to the polygon

# Plot/Draw two maps together...
# First create subplot and add the map layers on it
fig, ax = plt.subplots(1, figsize=(10, 10))

map_of_India.plot(ax=ax, alpha=0.2, color='Grey', edgecolor='red', legend=True, legend_kwds={"label": "Participation"})
# Add label to polygons...
map_of_India.apply(lambda x: ax.annotate(text=x['NAME_1'][:5], xy=x.geometry.centroid.coords[0], ha='center', color='blue'), axis=1)

mbbs_gdf.plot(ax=ax, color='black', alpha=0.4, markersize=10, marker='o')


The greatest advantage of scripting base wrangling and mapping dataset as seen above over GUI based like QGIS is its reproducibility. We can effortlessly apply what we did above to a new dataset, however if you are new and looking for user friendly approach, go with QGIS.

Thanks for following.

No comments:

Post a Comment