Making Address Validation Easy with Placekey

This tutorial will teach you how to validate addresses with Placekey using Python in a Google Colab environment.

We use two datasets in this tutorial: real (valid) addresses and fake (invalid) addresses.

The goal of this tutorial is to get you familiar with address validation using Placekey, including the limitations and considerations to be aware of.


Getting Started

Before moving forward, make sure you are familiar with Placekey. A growing number of resources are available; see the links at the end of this tutorial.


Introduction

Placekey is a useful tool for address validation. When using Placekey for address validation, we assume the following:

A location that is given a Placekey is a valid address.
A location that is not given a Placekey is not a valid address.

In reality, there are some important considerations.

An address that is not given a Placekey is not necessarily invalid; it is only probably invalid.
Fuzzy matching on addresses may cause addresses that contain errors to come back as valid.
There's no functionality to standardize addresses.
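The working assumption and its first caveat can be sketched as a small helper. This is only an illustration, not part of the Placekey library: the function name `classify_response` and the example dictionaries are hypothetical, and the sketch assumes each API response is a dictionary that carries a `placekey` field on success and an `error` field otherwise, as in the responses inspected later in this tutorial.

```python
def classify_response(response):
    """Label one API response dict under the tutorial's working assumption."""
    # A missing or empty placekey means "probably invalid", not "certainly invalid".
    if response.get("placekey"):
        return "likely valid"
    return "probably invalid"


# Hypothetical responses, for illustration only
examples = [
    {"query_id": "0", "placekey": "hypothetical-placekey"},
    {"query_id": "1", "error": "hypothetical error message"},
]
labels = [classify_response(r) for r in examples]
print(labels)
```

The hedged labels are deliberate: because of fuzzy matching, a returned Placekey does not guarantee that the input was error-free.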

Imports and Installations

In the first code block, we install the Python Placekey package in our Google Colab environment and import the necessary packages.

!pip install placekey

from placekey.api import PlacekeyAPI
import placekey as pk
import pandas as pd
from ast import literal_eval
import json
from google.colab import drive as mountGoogleDrive 
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

Authentication

Run this code block to authenticate yourself with Google, giving you access to the datasets.

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
print("You are fully authenticated and can edit and re-run everything in the notebook. Enjoy!")

You are fully authenticated and can edit and re-run everything in the notebook. Enjoy!

Set API key

Replace the asterisks below with your Placekey API key. If you don’t have one yet, you can get one for free.

placekey_api_key = "********" # replace the asterisks with your personal API key (do not share it publicly)

pk_api = PlacekeyAPI(placekey_api_key)


Datasets

This tutorial uses two datasets: real addresses from Miami-Dade and fake addresses.

Define functions

First, define a couple of functions to make it easier to read in the datasets.

def pd_read_csv_drive(file_id, drive, dtype=None, converters=None, encoding=None):
  # Download the Google Drive file to a local CSV, then load it with pandas.
  downloaded = drive.CreateFile({'id': file_id})
  downloaded.GetContentFile('Filename.csv')
  return pd.read_csv('Filename.csv', dtype=dtype, converters=converters, encoding=encoding)

def get_drive_id(filename):
    drive_ids = {'real_addresses' : '1zr6tclTsNg2-EMGluIMUwQocZQ50mufM',
                 'fake_addresses' : '1X4hmM14uEirHqjurhTzaXNyH-LLD0Xne'
                 }
    return(drive_ids[filename])

Read datasets

Our real addresses are all from Miami-Dade County. We filter to just the address columns.

real_addresses = pd_read_csv_drive(get_drive_id('real_addresses'), drive=drive)[['hse_num','sname','MAILING_MUNIC','zip']]
print(real_addresses.shape)
real_addresses.head(3)

We then take a random sample of 10,000 addresses from the dataset.

real_addresses = real_addresses.sample(10000)

We standardize the columns to (a) conform to the Placekey API and (b) allow for concatenation with the fake addresses so that we can send all API requests together.

real_addresses['street_address'] = real_addresses['hse_num'].astype(str) + " " + real_addresses['sname']
real_addresses['city'] = real_addresses['MAILING_MUNIC']
real_addresses['region'] = 'FL'
real_addresses['postal_code'] = real_addresses['zip'].astype(str)
real_addresses['iso_country_code'] = 'US'
real_addresses = real_addresses.reset_index()
real_addresses['query_id'] = real_addresses['index'].astype(str)
real_addresses = real_addresses[['query_id','street_address','city','region','postal_code','iso_country_code']]
real_addresses['valid'] = True
real_addresses.head(3)

The fake addresses are not limited to any geographic region.

fake_addresses = pd_read_csv_drive(get_drive_id('fake_addresses'), drive=drive)
print(fake_addresses.shape)
fake_addresses.head(3)

We standardize the column names the same way as the real addresses.

fake_addresses['region'] = fake_addresses['state']
fake_addresses['postal_code'] = fake_addresses['zip_code'].astype(str)
fake_addresses['iso_country_code'] = 'US'
fake_addresses = fake_addresses.reset_index()
fake_addresses['query_id'] = fake_addresses['index'].astype('str')
fake_addresses = fake_addresses[['query_id','street_address','city','region','postal_code','iso_country_code']]
fake_addresses['valid'] = False
fake_addresses.head(3)
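One detail worth checking when casting ZIP codes with astype(str): if pandas reads a ZIP column as integers, leading zeros are lost before the cast. Whether that happens with this tutorial's files depends on how they store ZIP codes, which is not shown here. The sketch below uses a made-up one-row CSV to show the effect and two possible remedies: reading the column as a string (the `dtype` argument that `pd_read_csv_drive` already forwards to `pd.read_csv`), or padding afterwards.

```python
import io

import pandas as pd

# Made-up CSV content, for illustration only
csv_text = "street_address,zip_code\n1 Hypothetical Rd,02139\n"

# Default parsing infers an integer column, so the leading zero disappears.
as_int = pd.read_csv(io.StringIO(csv_text))
lost = as_int["zip_code"].astype(str).iloc[0]

# Option 1: read the column as a string from the start.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})
kept = as_str["zip_code"].iloc[0]

# Option 2: pad back to five digits after casting.
padded = as_int["zip_code"].astype(str).str.zfill(5).iloc[0]

print(lost, kept, padded)
```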

We combine the datasets into one so that we can send all Placekey API requests together. We also shuffle the order so that the valid and invalid addresses are not all grouped together. The valid column indicates whether or not the address is valid.

addresses = pd.concat([real_addresses, fake_addresses])
addresses = addresses.sample(frac=1).reset_index(drop=True)
addresses['query_id'] = addresses.index.astype(str)
print(addresses.shape)
addresses.head(3)


Adding Placekey to addresses dataframe

There are several ways to add Placekeys to a dataset (including some no-code integrations!), which you can find on the Placekey website. In this example, we will use Python’s Placekey package.

Map columns to appropriate fields

In this step, the columns must be named to conform to the Placekey API. Since we already renamed the columns of our dataframes before concatenating the real and fake addresses into one dataframe, all we have to do is drop the valid column.

df_for_api = addresses.drop('valid', axis=1)
df_for_api.head(3)

Convert the dataframe to JSON

Each row will be represented by a JSON object, so that it conforms to the Placekey API.

data_jsoned = json.loads(df_for_api.to_json(orient="records"))
print("number of records: ", len(df_for_api))
print("example record:")
data_jsoned[0]
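To make the record format concrete, here is a minimal sketch of the same conversion on a two-row dataframe. The addresses are made up; only the column names follow the ones used in this tutorial.

```python
import json

import pandas as pd

# Made-up rows, for illustration only; the second one is missing its postal code.
toy = pd.DataFrame({
    "query_id": ["0", "1"],
    "street_address": ["123 Hypothetical St", "456 Hypothetical Ave"],
    "city": ["Miami", "Miami"],
    "region": ["FL", "FL"],
    "postal_code": ["33101", None],
    "iso_country_code": ["US", "US"],
})

# orient="records" yields one JSON object per row; json.loads turns it into a list of dicts.
records = json.loads(toy.to_json(orient="records"))
print(records[0])
print(records[1]["postal_code"])
```

Going through to_json means missing values are written as JSON null, so they come back as None after json.loads.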

Request Placekeys from the Placekey API

After getting the responses, we convert them to a dataframe stored in df_placekeys.

This step will take a couple of minutes. Rate limiting is automatically handled by the Python library and Placekey API.

responses = pk_api.lookup_placekeys(data_jsoned, verbose=True)
df_placekeys = pd.read_json(json.dumps(responses), dtype={'query_id':str})
df_placekeys.head(3)

How many addresses were not given a Placekey?

len(df_placekeys[df_placekeys['placekey'].isnull()])

What kind of errors did we get?

df_placekeys['error'].unique()
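Beyond listing the distinct error messages, it can help to count how often each one occurs. The sketch below does that on a made-up response table; the error strings are placeholders, not actual Placekey API messages.

```python
import pandas as pd

# Made-up responses, for illustration only; the error text is a placeholder.
toy_responses = pd.DataFrame({
    "query_id": ["0", "1", "2", "3"],
    "placekey": ["hypothetical-key-a", None, None, "hypothetical-key-b"],
    "error": [None, "hypothetical error A", "hypothetical error A", None],
})

# Count rows without a Placekey, then tally each distinct error message.
missing = int(toy_responses["placekey"].isnull().sum())
error_counts = toy_responses["error"].value_counts()
print(missing)
print(error_counts)
```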

Add Placekeys back to the original addresses dataframe

def merge_and_format(loc_df, placekeys_df):
  lr_placekey = pd.merge(loc_df, placekeys_df, on="query_id", how='left')
  # if 'error' in lr_placekey.columns:
  #   lr_placekey = lr_placekey.drop('error', axis=1)
  lr_placekey = lr_placekey.drop("query_id", axis=1)
  return(lr_placekey)

addresses_placekey = merge_and_format(addresses, df_placekeys)
print(addresses_placekey.shape)
addresses_placekey.head(3)
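The left merge matters here: it keeps every address, including those with no matching response, and leaves their placekey empty. A minimal sketch on made-up data follows. The `validate="one_to_one"` argument is an optional addition that is not in the function above; it makes pandas raise an error if a query_id appears more than once on either side.

```python
import pandas as pd

# Made-up data, for illustration only
toy_addresses = pd.DataFrame({"query_id": ["0", "1", "2"], "valid": [True, False, True]})
toy_placekeys = pd.DataFrame({
    "query_id": ["0", "2"],
    "placekey": ["hypothetical-key-a", "hypothetical-key-b"],
})

# how="left" keeps all three addresses; query_id "1" gets a missing placekey.
merged = pd.merge(toy_addresses, toy_placekeys, on="query_id", how="left", validate="one_to_one")
print(merged)
```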


Assess accuracy

We are ready to assess the accuracy of our Placekey address validation. Remember: our assumption is that an address is valid if and only if it is given a Placekey. The following code block computes basic measures of accuracy for our real and fake addresses.

tp = len(addresses_placekey[(addresses_placekey['valid'] == True) & (addresses_placekey['placekey'].notnull())])
fp = len(addresses_placekey[(addresses_placekey['valid'] == False) & (addresses_placekey['placekey'].notnull())])
fn = len(addresses_placekey[(addresses_placekey['valid'] == True) & (addresses_placekey['placekey'].isnull())])
tn = len(addresses_placekey[(addresses_placekey['valid'] == False) & (addresses_placekey['placekey'].isnull())])
print("Accuracy:", (tp + tn)/(tp + fp + fn + tn))
print("Sensitivity:", tp/(tp + fn))
print("Specificity:", tn/(tn + fp))
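To see how the four counts and the three measures relate, here is the same calculation worked through on five made-up rows, using boolean masks instead of filtered dataframes. The data is invented for illustration and says nothing about Placekey's actual performance.

```python
import pandas as pd

# Made-up data, for illustration only: three real addresses and two fake ones.
toy = pd.DataFrame({
    "valid": [True, True, True, False, False],
    "placekey": ["key-a", "key-b", None, None, "key-c"],
})

actual = toy["valid"]
predicted = toy["placekey"].notnull()  # "given a Placekey" is treated as "predicted valid"

tp = int((actual & predicted).sum())    # real address, got a Placekey
fn = int((actual & ~predicted).sum())   # real address, no Placekey
fp = int((~actual & predicted).sum())   # fake address, got a Placekey
tn = int((~actual & ~predicted).sum())  # fake address, no Placekey

accuracy = (tp + tn) / len(toy)
sensitivity = tp / (tp + fn)  # share of real addresses that received a Placekey
specificity = tn / (tn + fp)  # share of fake addresses that received none
print(tp, fn, fp, tn)
print(accuracy, sensitivity, specificity)
```

Here two of the three real addresses received a Placekey (sensitivity 2/3) and one of the two fake addresses did not (specificity 1/2), for an overall accuracy of 3/5.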


Conclusion

This tutorial was an introduction to address validation with Placekey. With our samples of real and fake addresses, we had an accuracy of over 98% in using Placekey to determine address validity. This includes a sensitivity of about 96.8%, meaning that 96.8% of valid addresses were correctly identified as valid. Additionally, our specificity was about 99.3%, meaning that 99.3% of invalid addresses were correctly identified as invalid.

Want to learn more?

Check out the Placekey website

Placekey Tutorials