Address Matching in Python
If you ask any Data Scientist, they will tell you that most of their time at work is spent cleaning messy data. Data comes in countless forms, so getting it into a practical format can be tricky and time-consuming.
One of the most challenging types of data to work with is Address Data. Unfortunately for us, there is no single correct way to write an address, which means addresses can be very inconsistent. For example:
- A flat could be called a flat, an apartment or a studio
- We could have the same address, but one source has it listed as Ground Floor Flat whereas another source has it as Flat 1
- A Postcode could be written with or without the space in the centre
… and the list goes on!
This tutorial focuses on matching UK Address Data. We will generate an example dataset using the Faker Python package and perform some simple data cleaning steps that could help match the addresses. We will also investigate the use of the difflib Python module, which searches for close matches.
The Faker package, installed and imported below, is a Python package that generates fake data.
In [1]:
! pip install faker
In [2]:
# importing libraries
import pandas as pd
import difflib
from faker import Faker
Below, we create a DataFrame that consists of 25 fake addresses.
In [3]:
# want to generate Great British addresses
fake = Faker("en_GB")

# setting seed
Faker.seed(1234)

addresses = [fake.address() for _ in range(25)]
addresses = [address.replace('\n', ', ') for address in addresses]
df = pd.DataFrame({'address': addresses})
In [4]:
# outputting first five rows of df
df.head()
Out[4]:
| | address |
|---|---|
0 | 1 Vanessa square, North Owenbury, B0 9UZ |
1 | 2 Stephen parkway, Shawbury, W1 8YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT |
Saving the above DataFrame to a .csv file:
In [5]:
df.to_csv('fake_addresses.csv', index=False)
Note: for the purposes of this example, an additional .csv file has been created that contains the same addresses with some formatting changes. This can be seen below, where 1 Vanessa square, North Owenbury, B0 9UZ has been changed to lower case.
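The second file was prepared in advance for this example. If you would like to build a similar variant yourself, a minimal sketch could look like the following (the file name my_fake_addresses_2.csv and the single lower-casing tweak are illustrative, not the exact changes used to create fake_addresses_2.csv):

# illustrative only: derive a second file with a formatting quirk
df_variant = pd.read_csv('fake_addresses.csv')

# lower-case the first address so it no longer matches exactly
df_variant.loc[0, 'address'] = df_variant.loc[0, 'address'].lower()

# rename the column and save under a hypothetical file name
df_variant.rename(columns={'address': 'address_2'}).to_csv('my_fake_addresses_2.csv', index=False)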
In [6]:
# outputting first five rows of df_2
df_2 = pd.read_csv('fake_addresses_2.csv')
df_2.head()
Out[6]:
| | address_2 |
|---|---|
0 | 1 vanessa square, north owenbury, b0 9uz |
1 | 2 Stephen parkway, Shawbury, W1 8YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT |
First, we can do a direct merge to see how many addresses match up exactly.
In [7]:
df_merged = df.merge(df_2, left_on='address', right_on='address_2', how='left')
df_merged
Out[7]:
| | address | address_2 |
|---|---|---|
0 | 1 Vanessa square, North Owenbury, B0 9UZ | NaN |
1 | 2 Stephen parkway, Shawbury, W1 8YB | 2 Stephen parkway, Shawbury, W1 8YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG | 475 Smith flats, Hawkinstown, E0F 7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG | 721 Gareth burg, Jamiehaven, M70 4BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT | 27 Lynne lights, Oliviamouth, W7A 9GT |
5 | Flat 12A, Hilton pines, Port Daniellefurt, EN3… | NaN |
6 | Studio 1, Hannah forest, West Ianbury, E5E 1YL | NaN |
7 | Flat 16, Geraldine mount, Port Rebecca, S29 4BQ | NaN |
8 | Studio 75, Wayne causeway, East Janice, N2 0DF | Studio 75, Wayne causeway, East Janice, N2 0DF |
9 | 65 Frances light, Shannonport, HD92 4LS | NaN |
10 | Studio 16, Stevens stravenue, Barnettmouth, W2… | NaN |
11 | Studio 7, Joanne road, Frasershire, SA9 3GW | NaN |
12 | Studio 07Q, Richard lakes, Collinsberg, N5U 4YP | Studio 07Q, Richard lakes, Collinsberg, N5U 4YP |
13 | 418 Declan pass, Dannymouth, BT8 3LQ | NaN |
14 | 00 Dorothy drives, Georgiastad, ML14 7YJ | NaN |
15 | 095 Ben neck, Richardsonberg, E2J 2WW | NaN |
16 | Flat 89, Thomas cliff, Sianchester, E5 9HQ | NaN |
17 | Flat 0, Forster village, Rachelmouth, W30 9QS | Flat 0, Forster village, Rachelmouth, W30 9QS |
18 | 033 Donna curve, Naomiton, W5K 4RS | 033 Donna curve, Naomiton, W5K 4RS |
19 | 05 O’Neill rapid, Port Charlieside, N8C 4QL | NaN |
20 | Flat 08, Smith mills, Clivestad, PE6V 9YU | NaN |
21 | Studio 59g, Lee crossroad, Port Paulinefort, G… | Studio 59g, Lee crossroad, Port Paulinefort, G… |
22 | Studio 8, Lynda mountains, Newtonmouth, E8U 1DH | NaN |
23 | Flat 64, Arthur burgs, Lake Guy, E0A 6PP | NaN |
24 | Studio 92y, Bailey ranch, Geraldinemouth, L2W 9TZ | NaN |
Looking at the table above, we have matched some of the addresses, but many of them have failed to match.
Taking the first address as an example, it will not match exactly because the version in the second dataset is lower case.
Let’s sum our matched addresses:
In [8]:
df_merged['address_2'].notna().sum()
Out[8]:
9
So, from a direct merge with no cleaning, we manage to match 9 of the 25 addresses.
To try to improve our number of matches, we can apply some simple cleaning steps to the addresses. These include removing punctuation, spaces and common words such as the. We can also make the addresses more consistent by converting everything to upper case. This is performed in the function below:
In [9]:
def tidy_address(df: pd.DataFrame, col: str) -> pd.Series:
    """Takes in a column of a DataFrame and performs cleaning steps.

    Args:
        df: DataFrame
        col: column of DataFrame to clean

    Returns:
        clean_address: cleaned column
    """
    clean_address = df[col].str.replace(',', '', regex=False)\
                           .str.replace(' ', '', regex=False)\
                           .str.upper()\
                           .str.replace('THE', '', regex=False)\
                           .str.replace('.', '', regex=False)
    return clean_address
By applying this cleaning function, we can add a column of cleaned addresses to each of the original DataFrames, as shown below.
In [10]:
df['clean_address'] = tidy_address(df, 'address')
df_2['clean_address_2'] = tidy_address(df_2, 'address_2')
df.head()
Out[10]:
| | address | clean_address |
|---|---|---|
0 | 1 Vanessa square, North Owenbury, B0 9UZ | 1VANESSASQUARENORTHOWENBURYB09UZ |
1 | 2 Stephen parkway, Shawbury, W1 8YB | 2STEPHENPARKWAYSHAWBURYW18YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG | 475SMITHFLATSHAWKINSTOWNE0F7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG | 721GARETHBURGJAMIEHAVENM704BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT | 27LYNNELIGHTSOLIVIAMOUTHW7A9GT |
In [11]:
df_2.head()
Out[11]:
| | address_2 | clean_address_2 |
|---|---|---|
0 | 1 vanessa square, north owenbury, b0 9uz | 1VANESSASQUARENORTHOWENBURYB09UZ |
1 | 2 Stephen parkway, Shawbury, W1 8YB | 2STEPHENPARKWAYSHAWBURYW18YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG | 475SMITHFLATSHAWKINSTOWNE0F7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG | 721GARETHBURGJAMIEHAVENM704BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT | 27LYNNELIGHTSOLIVIAMOUTHW7A9GT |
Now, merging these tables together on the cleaned addresses, we can see how our number of matches has changed.
In [12]:
df_merged_cleaned = df.merge(df_2, left_on='clean_address', right_on='clean_address_2', how='left')
df_merged_cleaned.head()
Out[12]:
| | address | clean_address | address_2 | clean_address_2 |
|---|---|---|---|---|
0 | 1 Vanessa square, North Owenbury, B0 9UZ | 1VANESSASQUARENORTHOWENBURYB09UZ | 1 vanessa square, north owenbury, b0 9uz | 1VANESSASQUARENORTHOWENBURYB09UZ |
1 | 2 Stephen parkway, Shawbury, W1 8YB | 2STEPHENPARKWAYSHAWBURYW18YB | 2 Stephen parkway, Shawbury, W1 8YB | 2STEPHENPARKWAYSHAWBURYW18YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG | 475SMITHFLATSHAWKINSTOWNE0F7RG | 475 Smith flats, Hawkinstown, E0F 7RG | 475SMITHFLATSHAWKINSTOWNE0F7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG | 721GARETHBURGJAMIEHAVENM704BG | 721 Gareth burg, Jamiehaven, M70 4BG | 721GARETHBURGJAMIEHAVENM704BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT | 27LYNNELIGHTSOLIVIAMOUTHW7A9GT | 27 Lynne lights, Oliviamouth, W7A 9GT | 27LYNNELIGHTSOLIVIAMOUTHW7A9GT |
Summing the matched addresses, we get the following:
In [13]:
df_merged_cleaned['clean_address_2'].notna().sum()
Out[13]:
17
So, from carrying out some preprocessing steps, we take our matches from 9 to 17 of the 25 addresses.
Let us have a look at the remaining 8:
In [14]:
df_remaining = df_merged_cleaned[df_merged_cleaned['address_2'].isna()]
df_remaining
Out[14]:
| | address | clean_address | address_2 | clean_address_2 |
|---|---|---|---|---|
5 | Flat 12A, Hilton pines, Port Daniellefurt, EN3… | FLAT12AHILTONPINESPORTDANIELLEFURTEN3N2YL | NaN | NaN |
6 | Studio 1, Hannah forest, West Ianbury, E5E 1YL | STUDIO1HANNAHFORESTWESTIANBURYE5E1YL | NaN | NaN |
7 | Flat 16, Geraldine mount, Port Rebecca, S29 4BQ | FLAT16GERALDINEMOUNTPORTREBECCAS294BQ | NaN | NaN |
11 | Studio 7, Joanne road, Frasershire, SA9 3GW | STUDIO7JOANNEROADFRASERSHIRESA93GW | NaN | NaN |
15 | 095 Ben neck, Richardsonberg, E2J 2WW | 095BENNECKRICHARDSONBERGE2J2WW | NaN | NaN |
16 | Flat 89, Thomas cliff, Sianchester, E5 9HQ | FLAT89THOMASCLIFFSIANCHESTERE59HQ | NaN | NaN |
20 | Flat 08, Smith mills, Clivestad, PE6V 9YU | FLAT08SMITHMILLSCLIVESTADPE6V9YU | NaN | NaN |
24 | Studio 92y, Bailey ranch, Geraldinemouth, L2W 9TZ | STUDIO92YBAILEYRANCHGERALDINEMOUTHL2W9TZ | NaN | NaN |
Here, we see the cases where the cleaning did not work. Cleaning can only take you so far; sometimes addresses are simply written in two different ways, e.g., one source includes the borough or county while another does not.
To try and match these remaining addresses, we will investigate the difflib module to find potential close matches.
In [15]:
# searching for close matches in the second DataFrame
# n=1 returns only the single closest match
# cutoff=0.6 is the default; suggestions must be at least that similar to count
close_matches = df_remaining['address'].apply(
    lambda x: difflib.get_close_matches(x, df_2['address_2'], n=1, cutoff=0.6)[0]
)
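One thing to be aware of: indexing the result with [0] assumes every remaining address has at least one candidate above the cutoff. If nothing clears the cutoff, get_close_matches returns an empty list and the [0] raises an IndexError. A small defensive variant (a sketch, not one of the original steps; the names closest_match and close_matches_safe are just for illustration) could return None instead:

def closest_match(address, candidates, cutoff=0.6):
    """Return the single closest candidate, or None if nothing clears the cutoff."""
    matches = difflib.get_close_matches(address, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# hypothetical safer version of the step above
close_matches_safe = df_remaining['address'].apply(lambda x: closest_match(x, df_2['address_2']))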
In [16]:
# creating a DataFrame to see the close matches
df_close_match = pd.DataFrame({'original_address': df_remaining['address'],
                               'close_match': close_matches})
df_close_match
Out[16]:
| | original_address | close_match |
|---|---|---|
5 | Flat 12A, Hilton pines, Port Daniellefurt, EN3… | Flat 12 Hilton pines, Port Daniellefurt, EN3N 2YL |
6 | Studio 1, Hannah forest, West Ianbury, E5E 1YL | Flat 1, Hannah forest, West Ianbury, E5E 1YL |
7 | Flat 16, Geraldine mount, Port Rebecca, S29 4BQ | Geraldine mount, Flat 16, Port Rebecca, S29 4BQ |
11 | Studio 7, Joanne road, Frasershire, SA9 3GW | 7 Joanne road Frasershire SA9 3GW |
15 | 095 Ben neck, Richardsonberg, E2J 2WW | 95 Ben neck, Richardsonberg, E2J2WW |
16 | Flat 89, Thomas cliff, Sianchester, E5 9HQ | 89 Thomas cliff Sianchester E5 9HQ |
20 | Flat 08, Smith mills, Clivestad, PE6V 9YU | flat 8, smith mills, clivestad, pe6v9yu |
24 | Studio 92y, Bailey ranch, Geraldinemouth, L2W 9TZ | Studio 92, Bailey ranch, Geraldinemouth, L2W 9TZ |
This method has found matches for all of the remaining addresses that were not matched directly.
This step does involve some manual checking to ensure it has matched the correct addresses, but it is a very helpful tool to guide you to the closest match!
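To help prioritise that manual check, you could also record how similar each suggested pair is. The snippet below is an illustrative addition (not one of the original steps) that uses difflib.SequenceMatcher to add a similarity score between 0 and 1, so the lowest-scoring pairs can be reviewed first:

# adding a similarity score to help prioritise manual review (illustrative)
df_close_match['similarity'] = df_close_match.apply(
    lambda row: difflib.SequenceMatcher(None, row['original_address'], row['close_match']).ratio(),
    axis=1,
)

# the lowest-scoring pairs are the ones most worth checking by hand
df_close_match.sort_values('similarity').head()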
Note: if you are searching for matches in a very large dataset, it could be helpful to split the addresses by postcode and only search through the subset of addresses with the postcodes you are interested in. This reduces the amount of data the matcher has to look through and takes much less time.
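As a rough sketch of that idea, you could block on the postcode before fuzzy matching. The helpers below are illustrative only: they assume the postcode is the final comma-separated part of each address and use its first two characters as the block key, which is good enough for this fake data but would need a more robust postcode parser on real data:

def postcode_block(address):
    """Crude block key: first two characters of the last comma-separated part."""
    return address.split(',')[-1].strip().upper()[:2]

def blocked_match(address, candidates_df, cutoff=0.6):
    """Only fuzzy-match against candidates that share the same postcode block."""
    block = postcode_block(address)
    candidates = candidates_df.loc[candidates_df['block'] == block, 'address_2']
    matches = difflib.get_close_matches(address, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# hypothetical usage on the example data
df_2['block'] = df_2['address_2'].apply(postcode_block)
close_matches_blocked = df_remaining['address'].apply(lambda x: blocked_match(x, df_2))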
Throughout this blog we have investigated three approaches to matching addresses:
- Direct matching
- Matching after cleaning
- Generating close matches
Although these steps have worked well in the example shown here, they may not work for all cases. There are many other approaches you could try if these do not work; some of these are outlined in this article.