Match Me If You Can

Address Matching in Python

If you ask any Data Scientist, they will tell you that the majority of their time spent at work is cleaning messy data. Data comes in countless forms so getting it into a practical format can be tricky and time-consuming.

One of the most challenging types of data to work with can be Address Data. Unfortunately for us, there is no correct way to write an address. This results in addresses being very inconsistent, for example:

  • A flat could be called a flat, an apartment or a studio
  • We could have the same address but one source has it listen as Ground Floor Flat whereas another source has it as Flat 1
  • A Postcode could be written with or without the space in the centre

… and the list goes on!

This tutorial focuses on matching UK Address Data. We will generate an example dataset using the Faker Python package and perform some simple data cleaning steps that could help match the addresses. We will also investigate the use of the difflib Python module which searches for close matches.

The Faker package, installed and imported below, is a Python package that generates fake data.

In [1]:

! pip install faker

In [2]:

# importing libraries
import pandas as pd
import difflib

from faker import Faker

Below, we create a DataFrame that consists of 25 fake addresses.

In [3]:

# want to generate Great British addresses
fake = Faker("en_GB")

# setting seed 
Faker.seed(1234)

addresses = [fake.address() for _ in range(25)]
addresses = [address.replace('\n', ', ') for address in addresses]

df = pd.DataFrame({'address': addresses})

In [4]:

# outputting first five rows of df
df.head()

Out[4]:

address
01 Vanessa square, North Owenbury, B0 9UZ
12 Stephen parkway, Shawbury, W1 8YB
2475 Smith flats, Hawkinstown, E0F 7RG
3721 Gareth burg, Jamiehaven, M70 4BG
427 Lynne lights, Oliviamouth, W7A 9GT

Saving above DataFrame to .csv

In [5]:

df.to_csv('fake_addresses.csv', index=False)

Note: For example purposes, an additional .csv that includes the same addresses with some formatting changes has been created. This can be seen below where 1 Vanessa square, North Owenbury, B0 9UZ has changed to lower case.

In [6]:

# outputting first five rows of df_2
df_2 = pd.read_csv('fake_addresses_2.csv')
df_2.head()

Out[6]:

address_2
01 vanessa square, north owenbury, b0 9uz
12 Stephen parkway, Shawbury, W1 8YB
2475 Smith flats, Hawkinstown, E0F 7RG
3721 Gareth burg, Jamiehaven, M70 4BG
427 Lynne lights, Oliviamouth, W7A 9GT

Firstly, We can do a direct merge to see how many addresses match up exactly.

In [7]:

df_merged = df.merge(df_2, left_on='address', 
                                   right_on='address_2', how='left')

df_merged

Out[7]:

addressaddress_2
01 Vanessa square, North Owenbury, B0 9UZNaN
12 Stephen parkway, Shawbury, W1 8YB2 Stephen parkway, Shawbury, W1 8YB
2475 Smith flats, Hawkinstown, E0F 7RG475 Smith flats, Hawkinstown, E0F 7RG
3721 Gareth burg, Jamiehaven, M70 4BG721 Gareth burg, Jamiehaven, M70 4BG
427 Lynne lights, Oliviamouth, W7A 9GT27 Lynne lights, Oliviamouth, W7A 9GT
5Flat 12A, Hilton pines, Port Daniellefurt, EN3…NaN
6Studio 1, Hannah forest, West Ianbury, E5E 1YLNaN
7Flat 16, Geraldine mount, Port Rebecca, S29 4BQNaN
8Studio 75, Wayne causeway, East Janice, N2 0DFStudio 75, Wayne causeway, East Janice, N2 0DF
965 Frances light, Shannonport, HD92 4LSNaN
10Studio 16, Stevens stravenue, Barnettmouth, W2…NaN
11Studio 7, Joanne road, Frasershire, SA9 3GWNaN
12Studio 07Q, Richard lakes, Collinsberg, N5U 4YPStudio 07Q, Richard lakes, Collinsberg, N5U 4YP
13418 Declan pass, Dannymouth, BT8 3LQNaN
1400 Dorothy drives, Georgiastad, ML14 7YJNaN
15095 Ben neck, Richardsonberg, E2J 2WWNaN
16Flat 89, Thomas cliff, Sianchester, E5 9HQNaN
17Flat 0, Forster village, Rachelmouth, W30 9QSFlat 0, Forster village, Rachelmouth, W30 9QS
18033 Donna curve, Naomiton, W5K 4RS033 Donna curve, Naomiton, W5K 4RS
1905 O’Neill rapid, Port Charlieside, N8C 4QLNaN
20Flat 08, Smith mills, Clivestad, PE6V 9YUNaN
21Studio 59g, Lee crossroad, Port Paulinefort, G…Studio 59g, Lee crossroad, Port Paulinefort, G…
22Studio 8, Lynda mountains, Newtonmouth, E8U 1DHNaN
23Flat 64, Arthur burgs, Lake Guy, E0A 6PPNaN
24Studio 92y, Bailey ranch, Geraldinemouth, L2W 9TZNaN

Looking at the table above, we have matched some of the addresses, but a lot of them have failed.

Taking the first address as an example, this will not match exactly due to the second address now being lower case.

Let’s sum our matched addresses:

In [8]:

df_merged['address_2'].notna().sum()

Out[8]:

9

So, from a direct merge with no cleaning, we manage to match 9 of the 25 addresses.

To try and improve our number of matches, we can do some simple cleaning steps to the addresses. This includes removing punctation and common words such as the. We can also make the addresses more consistent by changing all the case to upper case. This is performed in the function below:

In [9]:

def tidy_address(df: pd.DataFrame, col: str) -> pd.Series:
    """Takes in column of DataFrame and performs cleaning steps.

    Args:
        df: DataFrame
        col: column of DataFrame to clean

    Returns:
        clean_address: cleaned column
    """
    
    clean_address = df[col].str.replace(',','')\
    .str.replace(' ','')\
    .str.upper()\
    .str.replace('THE','')\
    .str.replace('.','')\
    
    return clean_address

By carrying out this cleaning step, we can add a column to the original DataFrames of the cleaned addresses.

Examples shown below.

In [10]:

df['clean_address'] = tidy_address(df, 'address')
df_2['clean_address_2'] = tidy_address(df_2, 'address_2')

df.head()

Out[10]:

addressclean_address
01 Vanessa square, North Owenbury, B0 9UZ1VANESSASQUARENORTHOWENBURYB09UZ
12 Stephen parkway, Shawbury, W1 8YB2STEPHENPARKWAYSHAWBURYW18YB
2475 Smith flats, Hawkinstown, E0F 7RG475SMITHFLATSHAWKINSTOWNE0F7RG
3721 Gareth burg, Jamiehaven, M70 4BG721GARETHBURGJAMIEHAVENM704BG
427 Lynne lights, Oliviamouth, W7A 9GT27LYNNELIGHTSOLIVIAMOUTHW7A9GT

In [11]:

df_2.head()

Out[11]:

address_2clean_address_2
01 vanessa square, north owenbury, b0 9uz1VANESSASQUARENORTHOWENBURYB09UZ
12 Stephen parkway, Shawbury, W1 8YB2STEPHENPARKWAYSHAWBURYW18YB
2475 Smith flats, Hawkinstown, E0F 7RG475SMITHFLATSHAWKINSTOWNE0F7RG
3721 Gareth burg, Jamiehaven, M70 4BG721GARETHBURGJAMIEHAVENM704BG
427 Lynne lights, Oliviamouth, W7A 9GT27LYNNELIGHTSOLIVIAMOUTHW7A9GT

Now, merging these tables together on the clean addresses, we can see how our number of matches have changed.

In [12]:

df_merged_cleaned = df.merge(df_2, left_on='clean_address', 
                             right_on='clean_address_2', how='left')

df_merged_cleaned.head()

Out[12]:

addressclean_addressaddress_2clean_address_2
01 Vanessa square, North Owenbury, B0 9UZ1VANESSASQUARENORTHOWENBURYB09UZ1 vanessa square, north owenbury, b0 9uz1VANESSASQUARENORTHOWENBURYB09UZ
12 Stephen parkway, Shawbury, W1 8YB2STEPHENPARKWAYSHAWBURYW18YB2 Stephen parkway, Shawbury, W1 8YB2STEPHENPARKWAYSHAWBURYW18YB
2475 Smith flats, Hawkinstown, E0F 7RG475SMITHFLATSHAWKINSTOWNE0F7RG475 Smith flats, Hawkinstown, E0F 7RG475SMITHFLATSHAWKINSTOWNE0F7RG
3721 Gareth burg, Jamiehaven, M70 4BG721GARETHBURGJAMIEHAVENM704BG721 Gareth burg, Jamiehaven, M70 4BG721GARETHBURGJAMIEHAVENM704BG
427 Lynne lights, Oliviamouth, W7A 9GT27LYNNELIGHTSOLIVIAMOUTHW7A9GT27 Lynne lights, Oliviamouth, W7A 9GT27LYNNELIGHTSOLIVIAMOUTHW7A9GT

Summing the matched addresses, we get the following:

In [13]:

df_merged_cleaned['clean_address_2'].notna().sum()

Out[13]:

17

So, from carrying out some preprocessing steps, we take our matches from 9 to 17 of the 25 addresses.

Let us have a look at the remaining 8:

In [14]:

df_remaining = df_merged_cleaned[df_merged_cleaned['address_2'].isna()]

df_remaining

Out[14]:

addressclean_addressaddress_2clean_address_2
5Flat 12A, Hilton pines, Port Daniellefurt, EN3…FLAT12AHILTONPINESPORTDANIELLEFURTEN3N2YLNaNNaN
6Studio 1, Hannah forest, West Ianbury, E5E 1YLSTUDIO1HANNAHFORESTWESTIANBURYE5E1YLNaNNaN
7Flat 16, Geraldine mount, Port Rebecca, S29 4BQFLAT16GERALDINEMOUNTPORTREBECCAS294BQNaNNaN
11Studio 7, Joanne road, Frasershire, SA9 3GWSTUDIO7JOANNEROADFRASERSHIRESA93GWNaNNaN
15095 Ben neck, Richardsonberg, E2J 2WW095BENNECKRICHARDSONBERGE2J2WWNaNNaN
16Flat 89, Thomas cliff, Sianchester, E5 9HQFLAT89THOMASCLIFFSIANCHESTERE59HQNaNNaN
20Flat 08, Smith mills, Clivestad, PE6V 9YUFLAT08SMITHMILLSCLIVESTADPE6V9YUNaNNaN
24Studio 92y, Bailey ranch, Geraldinemouth, L2W 9TZSTUDIO92YBAILEYRANCHGERALDINEMOUTHL2W9TZNaNNaN

Here, we see the cases where the cleaning did not work. Cleaning can only take you so far. Sometimes addresses are just written in two different ways, e.g., sometimes you can include the borough or county, other times not.

To try and match these remaining addresses, we will investigate the difflib module to find potential close matches.

In [15]:

# searching for close matches in the second dataframe
# n=1 returns for the closest match
# 0.6 is the default cut off where the suggested similarities are at least that similar
close_matches = df_remaining['address'].apply(lambda x: difflib\
                                              .get_close_matches(x, df_2['address_2'], n=1, cutoff=0.6)[0])

In [16]:

# creating dataframe to see the close matches
df_close_match = pd.DataFrame({'original_address': df_remaining['address'],
                               'close_match': close_matches})

df_close_match

Out[16]:

original_addressclose_match
5Flat 12A, Hilton pines, Port Daniellefurt, EN3…Flat 12 Hilton pines, Port Daniellefurt, EN3N 2YL
6Studio 1, Hannah forest, West Ianbury, E5E 1YLFlat 1, Hannah forest, West Ianbury, E5E 1YL
7Flat 16, Geraldine mount, Port Rebecca, S29 4BQGeraldine mount, Flat 16, Port Rebecca, S29 4BQ
11Studio 7, Joanne road, Frasershire, SA9 3GW7 Joanne road Frasershire SA9 3GW
15095 Ben neck, Richardsonberg, E2J 2WW95 Ben neck, Richardsonberg, E2J2WW
16Flat 89, Thomas cliff, Sianchester, E5 9HQ89 Thomas cliff Sianchester E5 9HQ
20Flat 08, Smith mills, Clivestad, PE6V 9YUflat 8, smith mills, clivestad, pe6v9yu
24Studio 92y, Bailey ranch, Geraldinemouth, L2W 9TZStudio 92, Bailey ranch, Geraldinemouth, L2W 9TZ

This method has found all of the additional addresses that were not matched directly.

This step does involve some manual checking to ensure it has matched the correct addresses but it is a very helpful tool to guide you to the closest match!

Note: if you are searching for matches in a very large dataset, it could be helpful to split the address by postcode and only search through the subset of postcodes you are interested in. This reduces the amout of data the matcher has to look through and takes much less time.

Throughout this blog we have investigated three approaches to matching addresses:

  • Direct matching
  • Matching after cleaning
  • Generating close matches

Although these steps have worked well in the example shown here, they may not work for all cases. There are many other approaches you could try if these do not work, some of these are outlined in this article.

By Holly Jones

Leave a comment

Your email address will not be published. Required fields are marked *