This tutorial will teach you how to join POI datasets with Placekey using Python in a Google Colab environment.
Check out the Google Colab notebook for this tutorial so you can access the code and run it for yourself!
Getting started
Before moving forward, make sure you are familiar with Placekey. There is a growing number of resources available.
Imports and Installations
In the first code block, we install the placekey package in our Google Colab environment and import the necessary packages.
!pip install placekey

from placekey.api import PlacekeyAPI
import pandas as pd
import json
from ast import literal_eval  # parses stringified lists/dicts in CSV columns

# Google authentication and Drive access in Colab
from google.colab import auth
from google.colab import drive as mountGoogleDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from oauth2client.client import GoogleCredentials
Authentication
Run this code block to authenticate yourself with Google, giving you access to the datasets.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
print("You are fully authenticated and can edit and re-run everything in the notebook. Enjoy!")
Set API key
Replace the asterisks below with your Placekey API key. If you don’t have one yet, it’s completely free.
placekey_api_key = "*****************"  # fill this in with your personal API key (do not share publicly)
pk_api = PlacekeyAPI(placekey_api_key)
Datasets
This tutorial uses two POI datasets: SafeGraph Patterns and Forsyth County polling locations/results.
Define functions
First, define a couple of functions to make it easier to read the datasets in from Google Drive.
def pd_read_csv_drive(id, drive, dtype=None, converters=None, encoding=None):
    downloaded = drive.CreateFile({'id': id})
    downloaded.GetContentFile('Filename.csv')
    return pd.read_csv('Filename.csv', dtype=dtype, converters=converters, encoding=encoding)
def get_drive_id(filename):
    drive_ids = {
        'forsyth-polling': '17o6bXhJ1kfnPKUp7b7HQSW5FbKBVcdvt',
        'patterns-election-week': '1fnnCSjoslSQCJqDf9AQJe6YXSJec296q',
    }
    return drive_ids[filename]
Read datasets
The Forsyth polling dataset comes from the Forsyth County website and the Georgia results website. It includes address columns and three columns with in-person vote totals for each presidential candidate on Election Day. We remove the one row for absentee ballots because the data source does not make clear whether those ballots were delivered on Election Day.
locations = pd_read_csv_drive(get_drive_id('forsyth-polling'), drive=drive, dtype={'Precinct' : int, 'Zip': str})
locations = locations[locations.PrecinctName != 'ABSENTEE']
print(locations.shape)
locations.head(3)
The Patterns dataset from SafeGraph contains POI-specific foot traffic data. Along with address columns, it includes visitor counts, visitor home Census Block Groups, median distance traveled, and more. Column descriptions can be found in the documentation. The image below shows a subset of columns.
patt = pd_read_csv_drive(
    get_drive_id('patterns-election-week'),
    drive=drive,
    dtype={'naics_code': str, 'postal_code': str, 'poi_cbg': str},
    converters={'visits_by_day': literal_eval, 'visitor_home_cbgs': literal_eval}
)
print(patt.shape)
patt.head(3)
Adding Placekey to polling dataset
As can be seen above, SafeGraph Patterns comes with a Placekey column built-in. The polling dataset, on the other hand, does not have a Placekey column, so we need to add Placekeys to the polling dataset. There are several ways to add Placekeys to a dataset (including some no-code integrations!), which you can find on the Placekey website here. In this example, we will use Python’s placekey package.
Map columns to appropriate fields
In this step, we create a new dataframe with just the address columns from the polling dataset, renamed to conform to the Placekey API: index maps to query_id, Location to location_name, Address to street_address, City to city, region stays as region, and Zip maps to postal_code.
def get_df_for_api(df,
                   column_map={
                       "index": "query_id",
                       "Location": "location_name",
                       "Address": "street_address",
                       "City": "city",
                       "region": "region",
                       "Zip": "postal_code"
                   }):
    df_for_api = df.rename(columns=column_map)
    cols = list(column_map.values())
    df_for_api = df_for_api[cols]
    df_for_api['iso_country_code'] = 'US'
    return df_for_api
locations['index'] = locations['Precinct'].astype(str)
locations['region'] = 'GA'
df_for_api = get_df_for_api(locations)
df_for_api.head(3)
Convert the dataframe to JSON
Each row is converted to a JSON object so that the records conform to the format the Placekey API expects.
data_jsoned = json.loads(df_for_api.to_json(orient="records"))
print("number of records: ", len(data_jsoned))
print("example record:")
data_jsoned[0]
Request Placekeys from the Placekey API
After getting the responses, we convert them to a dataframe stored in df_placekeys.
responses = pk_api.lookup_placekeys(data_jsoned, verbose=True)
df_placekeys = pd.read_json(json.dumps(responses), dtype={'query_id':str})
df_placekeys.head(7)
Missing Placekeys
Placekey has excellent, continuously improving coverage of POIs. Most datasets will have high coverage. However, you may occasionally come across POIs that are not matched with a Placekey.
In this example, four polling locations were not given a Placekey. In the above dataframe, the error column gives some troubleshooting information. Namely, the locations had invalid addresses.
This type of error mainly occurs for one of two reasons: (1) errors in the address columns or (2) the POI is not registered with Placekey.
For the purposes of this tutorial, you can ignore the missing Placekeys.
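That said, if you'd like to see exactly which lookups failed, a minimal sketch like the one below will surface them. It assumes, consistent with the responses above, that a failed lookup comes back without a placekey field (so it reads as missing in df_placekeys) and carries an error message instead.
# List the lookups that came back without a Placekey, along with the
# API's error message (assumes 'placekey' is missing/NaN for failures).
missing = df_placekeys[df_placekeys['placekey'].isna()]
print(missing[['query_id', 'error']])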
Address vs. POI
You'll notice some of your Placekeys have 15 alphanumeric characters, while others have just 12 (not counting the @ and - separators). A 12-character Placekey corresponds to an address, while a 15-character Placekey corresponds to a POI at an address. Another unique characteristic of Placekey is that it is a unique identifier that evolves with the changing world. For example, a McDonald's that used to be a Taco Bell counts as two different POIs (with two different Placekeys) even though they existed at the same address.
We know our two datasets are temporally aligned because the Patterns dataset and the polling dataset both correspond to Election Week, so we can comfortably join on just the address part of the Placekey. In fact, we have to: the Placekey API matched some of our polling locations only to 12-character address Placekeys rather than full 15-character POI Placekeys, and joining two datasets on Placekey requires apples-to-apples keys on both sides. The workaround is to strip the POI portion from any Placekey that has one, reducing every key to a 12-character address Placekey. For example, 123-456@abc-def-ghi becomes 123@abc-def-ghi. This reduction is sketched below and applied for real in the join code later in the tutorial.
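Here is a quick illustration of the reduction, using a made-up Placekey (real Placekeys use different characters, but the shape is the same):
# Strip the POI portion of a Placekey, keeping the 3-character address part
# plus the 12-character "@where" part. Made-up Placekey for illustration.
pk = "123-456@abc-def-ghi"
address_pk = pk[:3] + pk[-12:]
print(address_pk)  # 123@abc-def-ghi
Note that a Placekey that is already address-only passes through unchanged, since its first 3 and last 12 characters already make up the whole key.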
When would we need the full POI Placekey? Some datasets might be specific to the POI over time, but not specific to the address. Continuing the same example, suppose a McDonald's (owned by Tom) in San Francisco used to be a Taco Bell (owned by Amy) at the same address. Suppose we also have a dataset containing data on the owners of every business in the history of San Francisco. In that case, we would need the POI part of the Placekey to distinguish between Tom's McDonald's and Amy's Taco Bell. More information can be found here.
Add Placekeys back to the original polling dataset
Join locations and df_placekeys on index and query_id, respectively, while converting any POI Placekeys to just address Placekeys.
def merge_and_format(loc_df, placekeys_df):
    lr_placekey = pd.merge(loc_df, placekeys_df, left_on="index", right_on="query_id", how='left')
    lr_placekey = lr_placekey.drop('error', axis=1)
    # Reduce each Placekey to its address portion: first 3 characters + "@where"
    lr_placekey['address_placekey'] = lr_placekey.placekey.str[:3] + lr_placekey.placekey.str[-12:]
    lr_placekey = lr_placekey[['placekey', 'address_placekey', 'Precinct', 'PrecinctName',
                               'Location', 'Address', 'City', 'region', 'Zip',
                               'in_person_trump', 'in_person_biden', 'in_person_jorgensen']]
    return lr_placekey
loc_placekey = merge_and_format(locations, df_placekeys)
loc_placekey.head(3)
Joining on Placekey
Now, we are ready to join the POI datasets on Placekey. Before joining, we will also need to reduce all POI Placekeys in the Patterns dataset to just address Placekeys.
def merge_with_patterns(patterns_df, loc_res_placekey):
    patterns_df['address_placekey'] = patterns_df.placekey.str[:3] + patterns_df.placekey.str[-12:]
    df = loc_res_placekey.merge(patterns_df.drop('placekey', axis=1), how='inner', on='address_placekey')
    df = df.drop_duplicates('Precinct')
    df = df.reset_index().drop('index', axis=1)
    return df
df = merge_with_patterns(patt, loc_placekey)
cols = list(df.columns)
cols.pop(cols.index('address_placekey'))
df = df[['address_placekey'] + cols]
print(df.shape)
df.head(3)
cols_of_interest = ['Precinct','PrecinctName','address_placekey','placekey','in_person_trump','in_person_biden','in_person_jorgensen','visits_by_day','visitor_home_cbgs']
df[cols_of_interest].head(3)
Now we have a single dataframe connecting voting precinct results to SafeGraph Patterns visit data. This gives us powerful insight into Election Day voter preferences at the Census Block Group (CBG) level, made possible by joining these two POI datasets with Placekey.
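As a taste of that CBG-level view, here is a minimal sketch (not part of the original analysis) that unpacks the visitor_home_cbgs dictionary for one polling location into per-CBG visitor counts:
# visitor_home_cbgs was parsed by literal_eval into a dict of {home CBG: visitor count}.
row = df[['PrecinctName', 'visitor_home_cbgs']].iloc[0]
cbg_counts = pd.Series(row['visitor_home_cbgs'], name='visitors')
print(row['PrecinctName'])
print(cbg_counts.sort_values(ascending=False).head())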
Conclusion
While the datasets in this example are relatively small (20 polling locations and 2,462 POIs in SafeGraph Patterns), joining them without Placekey would be very tough. Short of manually matching each row, the de facto standard is to use a third-party address-matching service, which can vary widely in match quality and cost. Placekey solves the same problem, linking data, by providing a free and open universal standard identifier. The value of Placekey also compounds with adoption: the more datasets carry Placekeys, the more data we have for problem-solving. That is all to say, whether you have 20 POIs or 200,000 POIs, Placekey enables you to join POI datasets effectively, and for free.