Friday, March 29, 2019

Geocoding and Reverse Geocoding with Python

Disclaimer: I originally submitted this article to DataCamp on "Jan 27, 2018". Since they didn't publish it on the platform, I have decided to do it here so that someone out there will find it useful.

Download the original files in HTML and Jupyter Notebook formats

DataCamp Tutorial - Geocoding and Reverse Geocoding with Python

The increasing use of location-aware data, and of technologies that can give directions relative to a location and access geographically aware data, has given rise to a category of data scientists with strong knowledge of geospatial data: Geo-data Scientists.
In this tutorial, you will discover how to use Python to carry out geocoding tasks. Specifically, you will learn to use the GeoPy, Pandas and Folium Python libraries to complete geocoding tasks. Because this is a geocoding tutorial, the article will cover more of GeoPy than Pandas. If you are not familiar with Pandas, you should definitely consider studying the Pandas Tutorial by Karlijn Willems; this Pandas cheat sheet will also be handy for your learning.

Tutorial Overview

  • What is Geocoding?
  • Geocoding with Python
  • Putting it all together – Bulk Geocoding
  • Accuracy of the Result
  • Mapping Geocoding Result
  • Conclusion

What is Geocoding?

A very common task faced by Geo-data Scientists is the conversion of physical, human-readable addresses of places into latitude and longitude geographical coordinates. This process is known as “Geocoding”, while the reverse case (that is, converting latitude and longitude coordinates into physical addresses) is known as “Reverse Geocoding”. To clarify this explanation, here is an example using the DataCamp USA office address:
Geocoding: converting an address like “Empire State Building 350 5th Ave, Floor 77 New York, NY 10118” to “latitude 40.7484284, longitude -73.9856546”.

Reverse Geocoding: converting “latitude 40.7484284, longitude -73.9856546” to the address “Empire State Building 350 5th Ave, Floor 77 New York, NY 10118”.
Now that you have seen what forward and reverse geocoding look like on a single address, let’s see how they can be done programmatically in Python on a larger dataset by calling some APIs.
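Before calling a real service, it may help to picture the two directions with a toy lookup table. The sketch below is only an illustration: the `PLACES` table and the `forward` and `reverse` helpers are made up for this example and are not part of GeoPy. A real geocoder searches a huge address database instead of a two-entry dictionary.

```python
# Toy illustration only: a hand-made table stands in for a geocoding service.
PLACES = {
    "Empire State Building, New York": (40.7484284, -73.9856546),
    "One World Trade Center, New York": (40.713, -74.0132),
}


def forward(address):
    """Address -> (latitude, longitude), or None if the address is unknown."""
    return PLACES.get(address)


def reverse(lat, lon):
    """(latitude, longitude) -> the closest address in the toy table."""
    def squared_gap(item):
        _, (p_lat, p_lon) = item
        return (p_lat - lat) ** 2 + (p_lon - lon) ** 2

    address, _ = min(PLACES.items(), key=squared_gap)
    return address


print(forward("Empire State Building, New York"))
print(reverse(40.7484, -73.9857))
```

The reverse direction picks the nearest known point, which is roughly why a reverse geocoder can return an address even when the coordinates are slightly off.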

Geocoding with Python

There are a good number of Python modules for Geocoding and Reverse Geocoding. In this tutorial, you will use the Python Geocoding Toolbox named GeoPy, which provides support for several popular geocoding web services including Google Geocoding API, OpenStreetMap Nominatim, ESRI ArcGIS, Bing Maps API, etc.
You will make use of the OpenStreetMap Nominatim API because it is open source and free to use without an API key. Note that the public Nominatim server has a usage policy (at most one request per second), so bulk jobs should be kept small and spaced out. But first, you need to install the libraries (geopy, pandas and folium) in your Python environment using “pip install geopy pandas folium”.
Let's import the libraries...
In [1]:
# Import the modules needed for this tutorial
# folium: visualizing data on an interactive map
# pandas: fast, flexible, and expressive data structures

import folium
import pandas as pd
from geopy.geocoders import Nominatim, ArcGIS, GoogleV3 # Geocoder APIs
Note: You don’t have to import all three geocoding APIs (Nominatim, ArcGIS and GoogleV3) from the geopy module. I did so that you can test and compare the results from the different APIs to find out which is more accurate with your specific dataset. To follow along and get familiar with geocoding, make use of the “OpenStreetMap Nominatim API” for this article.
To do forward geocoding (convert an address to latitude/longitude), you first create a geocoder API object by calling the Nominatim() API class.
In [2]:
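# Recent geopy versions require a custom user_agent for Nominatim; the name here is arbitrary
g = Nominatim(user_agent="geocoding_tutorial") # You can try out ArcGIS or GoogleV3 APIs to compare the results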
In the next few lines of code below, you will do forward Geocoding and Reverse Geocoding respectively.
In [3]:
# Geocoding - Address to lat/long

n = g.geocode('Empire State Building New York', timeout=10) # Address to geocode
print(n.latitude, n.longitude)
40.7484284 -73.9856546198733
By calling the geocode() method on the defined API object, you supply an address as the first parameter and get back a result whose latitude and longitude attributes hold the corresponding coordinates.
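One detail worth guarding against: geocode() returns None when the service finds no match, so reading `.latitude` directly can fail. A minimal sketch of a guard follows. The `to_lat_lon` helper is a hypothetical name of mine, and a `SimpleNamespace` object stands in for a real GeoPy result so the snippet runs without a network call.

```python
from types import SimpleNamespace


def to_lat_lon(location):
    """Return (latitude, longitude), or None when the geocoder found nothing."""
    if location is None:
        return None
    return (location.latitude, location.longitude)


# Stand-in for a geocoder result, using the coordinates printed above.
fake_hit = SimpleNamespace(latitude=40.7484284, longitude=-73.9856546198733)

print(to_lat_lon(fake_hit))
print(to_lat_lon(None))
```

With a real geocoder object you would call something like `to_lat_lon(g.geocode(address, timeout=10))`.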
In [4]:
# Reverse Geocoding - lat/long to Address

n = g.reverse((40.7484284, -73.9856546198733), timeout=10) # Lat, Long to reverse geocode
print(n.address)
Empire State Building, 350, 5th Avenue, Korea Town, Manhattan Community Board 5, New York County, NYC, New York, 10018, United States of America
To reverse the process, you call the reverse() method on the same API object and supply the latitude and longitude coordinate values, in that order, to obtain the corresponding address attribute.
The process above is the very basics of geocoding a single address and reverse geocoding a single pair of latitude and longitude coordinates using Python.
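Because the order of the pair matters, a small range check before calling reverse() can catch some mix-ups. This is only a sketch with a hypothetical helper name (`as_lat_lon`). It cannot catch every swap, since a value such as -73.98 is valid both as a latitude and as a longitude.

```python
def as_lat_lon(lat, lon):
    """Validate ranges and return the (latitude, longitude) pair."""
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")
    return (lat, lon)


# Coordinates of the Empire State Building from the cell above.
print(as_lat_lon(40.7484284, -73.9856546198733))
```

The returned tuple can then be passed on, for example as `g.reverse(as_lat_lon(lat, lon), timeout=10)`.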
Now, let’s process a larger dataset in the next section. You will use the Pandas library for the data handling/wrangling and Folium to subsequently visualize the geocoded result.

Putting it all together – Bulk Geocoding

In the previous section, you geocoded a single place/address: "Empire State Building, New York". Now, you will work with a bulk dataset, broadened to contain a list of similar places (buildings) in New York City.
On this Wikipedia page, there is an awesome list of the tallest buildings in New York City. Unfortunately, the table has no detailed addresses or geographic coordinates for the buildings.
You will fill in this missing data by applying the geocoding technique you learned in the previous section. Specifically, you are going to look at the 'Name' column of the first table on the page, where "Empire State Building" is the third-ranked tallest building.
There are many ways of importing such a tabulated list into a Python environment; in this case, use the pandas read_clipboard() method. Copy the “Rank” and “Name” columns to your clipboard and create a dataframe.
In [5]:
# Create a dataframe from the copied table columns on the clipboard and display its first 10 records

df = pd.read_clipboard()
df.head(10)
Out[5]:
   Rank  Name
0  1     One World Trade Center
1  2     432 Park Avenue
2  3     Empire State Building
3  4     Bank of America Tower
4  5     Three World Trade Center*
5  6=    Chrysler Building
6  6=    The New York Times Building
7  8     One57
8  9     Four World Trade Center
9  10    220 Central Park South
Just like with any other data science dataset, you should do some clean-up on the data. In particular, remove special characters (such as * “ ? # ‘ \ %) from the input dataset. This will enable the system to read the names correctly without confusing their meaning.
In [6]:
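# Remove every character except English letters, digits and whitespace
# (regex=True is needed on recent pandas versions, where str.replace treats the pattern literally by default)

df['Name'] = df['Name'].str.replace(r'[^A-Za-z\s0-9]+', '', regex=True)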
df.head(10)
Out[6]:
   Rank  Name
0  1     One World Trade Center
1  2     432 Park Avenue
2  3     Empire State Building
3  4     Bank of America Tower
4  5     Three World Trade Center
5  6=    Chrysler Building
6  6=    The New York Times Building
7  8     One57
8  9     Four World Trade Center
9  10    220 Central Park South
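To see what the cleaning pattern keeps and removes, here is a small stand-alone sketch using Python's built-in `re` module on two sample names. The second helper, `clean_keep_base_letters`, is my own optional variation, not part of the original notebook. It decomposes accented letters first so that the base letter survives, which may matter for a name such as "Hammarskjöld" (presumably the original spelling behind "Hammarskjld" in the output further down).

```python
import re
import unicodedata

# Same character class as in the pandas call above.
PATTERN = re.compile(r"[^A-Za-z\s0-9]+")


def clean(name):
    """Drop everything except English letters, digits and whitespace."""
    return PATTERN.sub("", name)


def clean_keep_base_letters(name):
    """Split accented letters into base letter + accent, then clean."""
    decomposed = unicodedata.normalize("NFKD", name)
    return PATTERN.sub("", decomposed)


for raw in ["Three World Trade Center*", "One Dag Hammarskjöld Plaza"]:
    print(clean(raw), "|", clean_keep_base_letters(raw))
```

Whether a geocoder matches the accent-free spelling better than the truncated one is something to test on your own data.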
Also, the same names may well be in use in some other part of the world. You can help the system understand that you are primarily concerned with the building names in New York City by appending “New York City” to each building name as follows.
In [7]:
# Create a new column "Address_1" to hold the updated building names

df['Address_1'] = (df['Name'] + ', New York City')
df.head(10)
Out[7]:
   Rank  Name                         Address_1
0  1     One World Trade Center       One World Trade Center, New York City
1  2     432 Park Avenue              432 Park Avenue, New York City
2  3     Empire State Building        Empire State Building, New York City
3  4     Bank of America Tower        Bank of America Tower, New York City
4  5     Three World Trade Center     Three World Trade Center, New York City
5  6=    Chrysler Building            Chrysler Building, New York City
6  6=    The New York Times Building  The New York Times Building, New York City
7  8     One57                        One57, New York City
8  9     Four World Trade Center      Four World Trade Center, New York City
9  10    220 Central Park South       220 Central Park South, New York City
The next step is to loop through each record in the 'Address_1' column and get the corresponding address and geographic coordinates.
In [8]:
add_list = [] # an empty list to hold the geocoded results

for add in df['Address_1']:
    print ('Processing .... ', add)
    
    try:
        n = g.geocode(add, timeout=10)
        
        data = (add, n.latitude, n.longitude, n.address)
        add_list.append(data)
        
    except Exception:
        data = (add, "None", "None", "None")
        add_list.append(data)
        
Processing ....  One World Trade Center, New York City
Processing ....  432 Park Avenue, New York City
Processing ....  Empire State Building, New York City
Processing ....  Bank of America Tower, New York City
Processing ....  Three World Trade Center, New York City
Processing ....  Chrysler Building, New York City
Processing ....  The New York Times Building, New York City
Processing ....  One57, New York City
Processing ....  Four World Trade Center, New York City
Processing ....  220 Central Park South, New York City
Processing ....  70 Pine Street, New York City
Processing ....  30 Park Place, New York City
Processing ....  40 Wall Street, New York City
Processing ....  Citigroup Center, New York City
Processing ....  10 Hudson Yards, New York City
Processing ....  8 Spruce Street, New York City
Processing ....  Trump World Tower, New York City
Processing ....  30 Rockefeller Plaza, New York City
Processing ....  56 Leonard Street, New York City
Processing ....  CitySpire Center, New York City
Processing ....  28 Liberty Street, New York City
Processing ....  4 Times Square, New York City
Processing ....  MetLife Building, New York City
Processing ....  731 Lexington Avenue, New York City
Processing ....  Woolworth Building, New York City
Processing ....  50 West Street, New York City
Processing ....  One Worldwide Plaza, New York City
Processing ....  Madison Square Park Tower, New York City
Processing ....  Carnegie Hall Tower, New York City
Processing ....  383 Madison Avenue, New York City
Processing ....  1717 Broadway, New York City
Processing ....  AXA Equitable Center, New York City
Processing ....  One Penn Plaza, New York City
Processing ....  1251 Avenue of the Americas, New York City
Processing ....  Time Warner Center South Tower, New York City
Processing ....  Time Warner Center North Tower, New York City
Processing ....  200 West Street, New York City
Processing ....  60 Wall Street, New York City
Processing ....  One Astor Plaza, New York City
Processing ....  7 World Trade Center, New York City
Processing ....  One Liberty Plaza, New York City
Processing ....  20 Exchange Place, New York City
Processing ....  200 Vesey Street, New York City
Processing ....  Bertelsmann Building, New York City
Processing ....  Times Square Tower, New York City
Processing ....  Metropolitan Tower, New York City
Processing ....  252 East 57th Street, New York City
Processing ....  100 East 53rd Street, New York City
Processing ....  500 Fifth Avenue, New York City
Processing ....  JP Morgan Chase World Headquarters, New York City
Processing ....  General Motors Building, New York City
Processing ....  3 Manhattan West, New York City
Processing ....  Metropolitan Life Insurance Company Tower, New York City
Processing ....  Americas Tower, New York City
Processing ....  Solow Building, New York City
Processing ....  Marine Midland Building, New York City
Processing ....  55 Water Street, New York City
Processing ....  277 Park Avenue, New York City
Processing ....  5 Beekman, New York City
Processing ....  Morgan Stanley Building, New York City
Processing ....  Random House Tower, New York City
Processing ....  Four Seasons Hotel New York, New York City
Processing ....  1221 Avenue of the Americas, New York City
Processing ....  Lincoln Building, New York City
Processing ....  Barclay Tower, New York City
Processing ....  Paramount Plaza, New York City
Processing ....  Trump Tower, New York City
Processing ....  One Court Square, New York City
Processing ....  Sky, New York City
Processing ....  1 Wall Street, New York City
Processing ....  599 Lexington Avenue, New York City
Processing ....  Silver Towers I, New York City
Processing ....  Silver Towers II, New York City
Processing ....  712 Fifth Avenue, New York City
Processing ....  Chanin Building, New York City
Processing ....  245 Park Avenue, New York City
Processing ....  Sony Tower, New York City
Processing ....  Tower 28, New York City
Processing ....  225 Liberty Street, New York City
Processing ....  1 New York Plaza, New York City
Processing ....  570 Lexington Avenue, New York City
Processing ....  MiMA, New York City
Processing ....  345 Park Avenue, New York City
Processing ....  400 Fifth Avenue, New York City
Processing ....  W R Grace Building, New York City
Processing ....  Home Insurance Plaza, New York City
Processing ....  1095 Avenue of the Americas, New York City
Processing ....  W New York Downtown Hotel and Residences, New York City
Processing ....  101 Park Avenue, New York City
Processing ....  One Dag Hammarskjld Plaza, New York City
Processing ....  Central Park Place, New York City
Processing ....  888 7th Avenue, New York City
Processing ....  Waldorf Astoria New York, New York City
Processing ....  1345 Avenue of the Americas, New York City
Processing ....  Trump Palace Condominiums, New York City
Processing ....  Olympic Tower, New York City
Processing ....  Mercantile Building, New York City
Processing ....  425 Fifth Avenue, New York City
Processing ....  One Madison, New York City
Processing ....  919 Third Avenue, New York City
Processing ....  New York Life Building, New York City
Processing ....  750 7th Avenue, New York City
Processing ....  The Epic, New York City
Processing ....  Eventi, New York City
Processing ....  Tower 49, New York City
Processing ....  555 10th Avenue, New York City
Processing ....  The Hub, New York City
Processing ....  Calyon Building, New York City
Processing ....  Baccarat Hotel and Residences, New York City
Processing ....  250 West 55th Street, New York City
Processing ....  The Orion, New York City
Processing ....  590 Madison Avenue, New York City
Processing ....  11 Times Square, New York City
Processing ....  1166 Avenue of the Americas, New York City
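The loop above can also be wrapped in a reusable function. The sketch below is one possible shape, not the notebook's own code. `bulk_geocode` is a hypothetical helper that accepts any callable as the geocoder, pauses between requests (the public Nominatim usage policy asks for at most one request per second), and stores real `None` values instead of the string "None". A fake geocoder is used so that the example runs offline.

```python
import time
from types import SimpleNamespace


def bulk_geocode(addresses, geocode, delay_seconds=1.0):
    """Geocode each address with the supplied callable, pausing between calls.

    Returns (address, latitude, longitude, full_address) tuples; the last
    three fields are None when a lookup fails or finds nothing.
    """
    rows = []
    for i, address in enumerate(addresses):
        if i:
            time.sleep(delay_seconds)  # space out consecutive requests
        try:
            hit = geocode(address)
        except Exception:  # deliberately broad for a sketch
            hit = None
        if hit is None:
            rows.append((address, None, None, None))
        else:
            rows.append((address, hit.latitude, hit.longitude, hit.address))
    return rows


def fake_geocode(address):
    """Offline stand-in for a real geocoder (illustration only)."""
    if address.startswith("Empire"):
        return SimpleNamespace(latitude=40.7484, longitude=-73.9857,
                               address="Empire State Building, 350, 5th Avenue")
    if address.startswith("Broken"):
        raise RuntimeError("simulated network failure")
    return None


rows = bulk_geocode(
    ["Empire State Building, New York City", "Broken lookup", "Nowhere at all"],
    fake_geocode,
    delay_seconds=0,  # no need to wait for the fake geocoder
)
for row in rows:
    print(row)
```

With the objects from this tutorial, a call could look like `bulk_geocode(df['Address_1'], lambda a: g.geocode(a, timeout=10))`. GeoPy also ships a `RateLimiter` class in `geopy.extra.rate_limiter` that covers the throttling part.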
Save the result into a dataframe.
In [9]:
# make a new dataframe to hold the geocoded result

add_list_df = pd.DataFrame(add_list, columns=['Address_1', 'Latitude', 'Longitude', 'Full Address'])
add_list_df.head(10)
Out[9]:
   Address_1                                   Latitude  Longitude  Full Address
0  One World Trade Center, New York City       40.713    -74.0132   One World Trade Center, 1, Fulton Street, Batt...
1  432 Park Avenue, New York City              40.7615   -73.9719   432 Park Avenue, 432, Manhattan Community Boar...
2  Empire State Building, New York City        40.7484   -73.9857   Empire State Building, 350, 5th Avenue, Korea ...
3  Bank of America Tower, New York City        40.7555   -73.9847   Bank of America Tower, 115, West 42nd Street, ...
4  Three World Trade Center, New York City     None      None       None
5  Chrysler Building, New York City            40.7516   -73.9753   Chrysler Building, East 43rd Street, Tudor Cit...
6  The New York Times Building, New York City  40.7559   -73.9893   The New York Times Building, 620, 8th Avenue, ...
7  One57, New York City                        40.7655   -73.9791   One57, West 57th Street, Diamond District, Man...
8  Four World Trade Center, New York City      None      None       None
9  220 Central Park South, New York City       40.767    -73.9806   220 Central Park South, Manhattan Community Bo...

Accuracy of the Result

A quick inspection of the latest data frame reveals that the obtained geographical coordinates of the buildings lie in the neighbourhood of New York City's own coordinates (that is: 40°42′46″N, 74°00′21″W). Some buildings were not geocoded (no results were found for them). This indicates that their geocode results are not available in the OpenStreetMap Nominatim API.
Now, you can make use of some other API to check whether it has geocode results for them.
First, use the pandas “loc” indexer to separate the records whose geocode results were found from those that were not found.
In [10]:
# Extract the records where value of Latitude and Longitude are the same (that is: None)

geocode_found = add_list_df.loc[add_list_df['Latitude'] != add_list_df['Longitude']]

geocode_not_found = add_list_df.loc[add_list_df['Latitude'] == add_list_df['Longitude']]
geocode_not_found
Out[10]:
    Address_1                                          Latitude  Longitude  Full Address
4   Three World Trade Center, New York City            None      None       None
8   Four World Trade Center, New York City             None      None       None
27  Madison Square Park Tower, New York City           None      None       None
34  Time Warner Center South Tower, New York City      None      None       None
35  Time Warner Center North Tower, New York City      None      None       None
49  JP Morgan Chase World Headquarters, New York City  None      None       None
50  General Motors Building, New York City             None      None       None
71  Silver Towers I, New York City                     None      None       None
72  Silver Towers II, New York City                    None      None       None
77  Tower 28, New York City                            None      None       None
87  W New York Downtown Hotel and Residences, New ...  None      None       None
89  One Dag Hammarskjld Plaza, New York City           None      None       None
92  Waldorf Astoria New York, New York City            None      None       None
There are many ways to get this done; in this case you simply compare the latitude and longitude columns, knowing that for these buildings their numeric values can never be the same. Wherever the latitude and longitude cells have the same value, it will be the string value “None”, which means a geocode result wasn’t found for that building’s name.
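A slightly more explicit alternative, offered as a sketch rather than as the method used in this tutorial, is to convert the “None” placeholder strings into real missing values and split on those. The tiny `results` data frame below is made up from two rows of the output above so that the snippet is self-contained.

```python
import pandas as pd

results = pd.DataFrame(
    [
        ("Empire State Building, New York City", 40.7484, -73.9857),
        ("Three World Trade Center, New York City", "None", "None"),
    ],
    columns=["Address_1", "Latitude", "Longitude"],
)

# Turn the "None" placeholder strings into real missing values (NaN).
lat = pd.to_numeric(results["Latitude"], errors="coerce")

found = results[lat.notna()]
not_found = results[lat.isna()]
print(not_found["Address_1"].tolist())
```

This does not rely on latitude and longitude never being equal, which makes the intent a little easier to read.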
Now, you will redefine the geocoder API object to call a different API (ArcGIS, for example) by calling the ArcGIS() API class.
In [11]:
g = ArcGIS() # redefine the API object
You can then loop through the “geocode_not_found” data frame to see if you can get some results from the new API.
In [12]:
add_list = []

for add in geocode_not_found['Address_1']:
    print ('Processing .... ', add)
    
    try:
        n = g.geocode(add, timeout=10)
        
        data = (add, n.latitude, n.longitude, n.address)
        add_list.append(data)
        
    except Exception:
        data = (add, "None", "None", "None")
        add_list.append(data)
        
Processing ....  Three World Trade Center, New York City
Processing ....  Four World Trade Center, New York City
Processing ....  Madison Square Park Tower, New York City
Processing ....  Time Warner Center South Tower, New York City
Processing ....  Time Warner Center North Tower, New York City
Processing ....  JP Morgan Chase World Headquarters, New York City
Processing ....  General Motors Building, New York City
Processing ....  Silver Towers I, New York City
Processing ....  Silver Towers II, New York City
Processing ....  Tower 28, New York City
Processing ....  W New York Downtown Hotel and Residences, New York City
Processing ....  One Dag Hammarskjld Plaza, New York City
Processing ....  Waldorf Astoria New York, New York City
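Here you can see that ArcGIS returned geocode results for the buildings that the Nominatim API couldn’t retrieve. Note, however, that some of the matched addresses (for example “Headquarters”, “GM” or “Silver St, Bronx”) are only loose matches, so these results should be checked before use.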
In [13]:
add_list_df = pd.DataFrame(add_list, columns=['Address_1', 'Latitude', 'Longitude', 'Full Address'])
add_list_df.head(10)
Out[13]:
   Address_1                                          Latitude   Longitude   Full Address
0  Three World Trade Center, New York City            40.709690  -74.011670  World Trade Center
1  Four World Trade Center, New York City             40.709900  -74.012090  Four World Trade Center
2  Madison Square Park Tower, New York City           40.741500  -73.987580  Madison Square
3  Time Warner Center South Tower, New York City      40.767857  -73.982391  Time Warner Ctr, New York, 10019
4  Time Warner Center North Tower, New York City      40.767857  -73.982391  Time Warner Ctr, New York, 10019
5  JP Morgan Chase World Headquarters, New York City  40.727050  -73.825910  Headquarters
6  General Motors Building, New York City             40.879330  -73.871330  GM
7  Silver Towers I, New York City                     40.843822  -73.847128  Silver St, Bronx, New York, 10461
8  Silver Towers II, New York City                    40.843822  -73.847128  Silver St, Bronx, New York, 10461
9  Tower 28, New York City                            40.593850  -74.186119  28 Towers Ln, Staten Island, New York, 10314
You could also import the latitudes and longitudes as points onto Google Maps to further validate their positional accuracy. By that check, more than 95% of the latitude and longitude positions were accurately geocoded.
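Positional checks can also be done numerically. One common approach, sketched here under my own assumptions, is to measure the great-circle distance between two coordinate pairs that should describe the same place and flag the pair when they are too far apart. The haversine formula is a standard technique; the `disagree` helper and the 0.5 km tolerance are illustrative choices, not recommendations from the original article.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def disagree(point_a, point_b, tolerance_km=0.5):
    """True when two (lat, long) pairs are further apart than the tolerance."""
    return haversine_km(*point_a, *point_b) > tolerance_km


# Sample inputs taken from outputs earlier in this tutorial.
empire_full = (40.7484284, -73.9856546198733)   # single-address result
empire_rounded = (40.7484, -73.9857)            # same building, rounded
time_warner = (40.767857, -73.982391)           # a different building

print(disagree(empire_full, empire_rounded))
print(disagree(empire_full, time_warner))
```

In practice you would compare the results of two geocoders for the same address and inspect by hand only the pairs that disagree.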

Mapping Geocoding Result

An obvious purpose of geocoding is to visualize places/addresses on a map. Here, you will learn to visualize the “geocode_found” data frame on a simple interactive map using the folium library (recall that you imported the library at the beginning of this tutorial). Folium makes it easy to visualize data that has been manipulated in Python on an interactive LeafletJS map.
In [14]:
# convert Full Address, Latitude and Longitude dataframe columns to list
full_address_list = list(geocode_found['Full Address'])
long_list = list(geocode_found["Longitude"])
lat_list = list(geocode_found["Latitude"])


# create folium map object
geocoded_map = folium.Map(location=[40.7484284, -73.9856546], zoom_start=13) # location=[Lat, Long]


# loop through the lists and create markers on the map object
for long, lat, address in zip(long_list, lat_list, full_address_list):
    geocoded_map.add_child(folium.Marker(location=[lat, long], popup=address))
    geocoded_map.add_child(folium.CircleMarker(location=[lat, long], popup=address, radius=5, color='green', fill_color='green', fill_opacity=.2))


# Display the map inline
geocoded_map
Out[14]:
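The map above is centred on a hard-coded location. As a small optional refinement (my suggestion, with a hypothetical helper name), the centre can be computed from the geocoded points themselves, so that the map still opens in a sensible place for a different dataset.

```python
from statistics import mean


def map_center(points):
    """Average a sequence of (latitude, longitude) pairs into [lat, long]."""
    lats, lons = zip(*points)
    return [mean(lats), mean(lons)]


# Three coordinate pairs taken from the geocoded results above.
sample_points = [
    (40.713, -74.0132),
    (40.7615, -73.9719),
    (40.7484, -73.9857),
]
print(map_center(sample_points))
```

The result could be passed as `folium.Map(location=map_center(zip(lat_list, long_list)), zoom_start=13)`. Note that the helper raises an error on an empty list, so check that at least one point was geocoded first.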

Conclusion

You have just learned about geocoding and reverse geocoding in Python, primarily using the third-party GeoPy module. The knowledge you have gained here will help you locate addresses and places when working on datasets that are amenable to mapping. Geocoding is useful for plotting and extracting places/addresses on a map, for purposes which may include:
  • To visualize distances such as roads and pipelines
  • To deliver insight into public health information
  • To determine voting demographics
  • To analyze law enforcement and intelligence data, etc.
Be skeptical of your geocoding results. Always inspect actual address match locations against other data sources, like street basemaps. Compare your results against more than one geocoding API if possible. For example, if you geocoded with OpenStreetMap Nominatim, import the results into Google Maps to see if they match its basemap.