Address Matching in Python
If you ask any Data Scientist, they will tell you that most of their time at work is spent cleaning messy data. Data comes in countless forms, so getting it into a practical format can be tricky and time-consuming.
One of the most challenging types of data to work with is Address Data. Unfortunately for us, there is no single correct way to write an address, which means addresses can be very inconsistent. For example:
- A flat could be called a flat, an apartment or a studio
- We could have the same address, but one source has it listed as Ground Floor Flat whereas another source has it as Flat 1
- A Postcode could be written with or without the space in the centre
… and the list goes on!
This tutorial focuses on matching UK Address Data. We will generate an example dataset using the Faker Python package and perform some simple data cleaning steps that could help match the addresses. We will also investigate the use of the difflib Python module, which searches for close matches.
The Faker package, installed and imported below, is a Python package that generates fake data.
In [1]:
! pip install faker
In [2]:
# importing libraries
import pandas as pd
import difflib
from faker import Faker
Below, we create a DataFrame that consists of 25 fake addresses.
In [3]:
# want to generate Great British addresses
fake = Faker("en_GB")

# setting seed
Faker.seed(1234)

addresses = [fake.address() for _ in range(25)]
addresses = [address.replace('\n', ', ') for address in addresses]
df = pd.DataFrame({'address': addresses})
In [4]:
# outputting first five rows of df
df.head()
Out[4]:
| | address |
|---|---|
0 | 1 Vanessa square, North Owenbury, B0 9UZ |
1 | 2 Stephen parkway, Shawbury, W1 8YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT |
Saving the above DataFrame to a .csv file:
In [5]:
df.to_csv('fake_addresses.csv', index=False)
Note: for the purposes of this example, an additional .csv file has been created that contains the same addresses with some formatting changes. This can be seen below, where 1 Vanessa square, North Owenbury, B0 9UZ has been changed to lower case.
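The second file was prepared in advance for this example. If you would like to build a similar variant yourself, a minimal sketch could look like the following (the file name my_fake_addresses_2.csv and the single lower-casing tweak are illustrative, not the exact changes used to create fake_addresses_2.csv):

# illustrative only: derive a second file with a formatting quirk
df_variant = pd.read_csv('fake_addresses.csv')

# lower-case the first address so it no longer matches exactly
df_variant.loc[0, 'address'] = df_variant.loc[0, 'address'].lower()

# rename the column and save under a hypothetical file name
df_variant.rename(columns={'address': 'address_2'}).to_csv('my_fake_addresses_2.csv', index=False)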
In [6]:
# outputting first five rows of df_2
df_2 = pd.read_csv('fake_addresses_2.csv')
df_2.head()
Out[6]:
| | address_2 |
|---|---|
0 | 1 vanessa square, north owenbury, b0 9uz |
1 | 2 Stephen parkway, Shawbury, W1 8YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT |
First, we can do a direct merge to see how many addresses match up exactly.
In [7]:
df_merged = df.merge(df_2, left_on='address', right_on='address_2', how='left')
df_merged
Out[7]:
| | address | address_2 |
|---|---|---|
0 | 1 Vanessa square, North Owenbury, B0 9UZ | NaN |
1 | 2 Stephen parkway, Shawbury, W1 8YB | 2 Stephen parkway, Shawbury, W1 8YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG | 475 Smith flats, Hawkinstown, E0F 7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG | 721 Gareth burg, Jamiehaven, M70 4BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT | 27 Lynne lights, Oliviamouth, W7A 9GT |
5 | Flat 12A, Hilton pines, Port Daniellefurt, EN3… | NaN |
6 | Studio 1, Hannah forest, West Ianbury, E5E 1YL | NaN |
7 | Flat 16, Geraldine mount, Port Rebecca, S29 4BQ | NaN |
8 | Studio 75, Wayne causeway, East Janice, N2 0DF | Studio 75, Wayne causeway, East Janice, N2 0DF |
9 | 65 Frances light, Shannonport, HD92 4LS | NaN |
10 | Studio 16, Stevens stravenue, Barnettmouth, W2… | NaN |
11 | Studio 7, Joanne road, Frasershire, SA9 3GW | NaN |
12 | Studio 07Q, Richard lakes, Collinsberg, N5U 4YP | Studio 07Q, Richard lakes, Collinsberg, N5U 4YP |
13 | 418 Declan pass, Dannymouth, BT8 3LQ | NaN |
14 | 00 Dorothy drives, Georgiastad, ML14 7YJ | NaN |
15 | 095 Ben neck, Richardsonberg, E2J 2WW | NaN |
16 | Flat 89, Thomas cliff, Sianchester, E5 9HQ | NaN |
17 | Flat 0, Forster village, Rachelmouth, W30 9QS | Flat 0, Forster village, Rachelmouth, W30 9QS |
18 | 033 Donna curve, Naomiton, W5K 4RS | 033 Donna curve, Naomiton, W5K 4RS |
19 | 05 O’Neill rapid, Port Charlieside, N8C 4QL | NaN |
20 | Flat 08, Smith mills, Clivestad, PE6V 9YU | NaN |
21 | Studio 59g, Lee crossroad, Port Paulinefort, G… | Studio 59g, Lee crossroad, Port Paulinefort, G… |
22 | Studio 8, Lynda mountains, Newtonmouth, E8U 1DH | NaN |
23 | Flat 64, Arthur burgs, Lake Guy, E0A 6PP | NaN |
24 | Studio 92y, Bailey ranch, Geraldinemouth, L2W 9TZ | NaN |
Looking at the table above, we have matched some of the addresses, but many of them have failed to match.
Taking the first address as an example, it will not match exactly because the version in the second dataset is lower case.
Let’s sum our matched addresses:
In [8]:
df_merged['address_2'].notna().sum()
Out[8]:
9
So, from a direct merge with no cleaning, we manage to match 9 of the 25 addresses.
To try to improve our number of matches, we can apply some simple cleaning steps to the addresses. These include removing punctuation, spaces and common words such as the. We can also make the addresses more consistent by converting everything to upper case. This is performed in the function below:
In [9]:
def tidy_address(df: pd.DataFrame, col: str) -> pd.Series:
    """Takes in a column of a DataFrame and performs cleaning steps.

    Args:
        df: DataFrame
        col: column of DataFrame to clean

    Returns:
        clean_address: cleaned column
    """
    clean_address = df[col].str.replace(',', '', regex=False)\
                           .str.replace(' ', '', regex=False)\
                           .str.upper()\
                           .str.replace('THE', '', regex=False)\
                           .str.replace('.', '', regex=False)
    return clean_address
By applying this cleaning function, we can add a column of cleaned addresses to each of the original DataFrames, as shown below.
In [10]:
df['clean_address'] = tidy_address(df, 'address')
df_2['clean_address_2'] = tidy_address(df_2, 'address_2')
df.head()
Out[10]:
| | address | clean_address |
|---|---|---|
0 | 1 Vanessa square, North Owenbury, B0 9UZ | 1VANESSASQUARENORTHOWENBURYB09UZ |
1 | 2 Stephen parkway, Shawbury, W1 8YB | 2STEPHENPARKWAYSHAWBURYW18YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG | 475SMITHFLATSHAWKINSTOWNE0F7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG | 721GARETHBURGJAMIEHAVENM704BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT | 27LYNNELIGHTSOLIVIAMOUTHW7A9GT |
In [11]:
df_2.head()
Out[11]:
| | address_2 | clean_address_2 |
|---|---|---|
0 | 1 vanessa square, north owenbury, b0 9uz | 1VANESSASQUARENORTHOWENBURYB09UZ |
1 | 2 Stephen parkway, Shawbury, W1 8YB | 2STEPHENPARKWAYSHAWBURYW18YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG | 475SMITHFLATSHAWKINSTOWNE0F7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG | 721GARETHBURGJAMIEHAVENM704BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT | 27LYNNELIGHTSOLIVIAMOUTHW7A9GT |
Now, merging these tables together on the cleaned addresses, we can see how our number of matches has changed.
In [12]:
df_merged_cleaned = df.merge(df_2, left_on='clean_address', right_on='clean_address_2', how='left')
df_merged_cleaned.head()
Out[12]:
| | address | clean_address | address_2 | clean_address_2 |
|---|---|---|---|---|
0 | 1 Vanessa square, North Owenbury, B0 9UZ | 1VANESSASQUARENORTHOWENBURYB09UZ | 1 vanessa square, north owenbury, b0 9uz | 1VANESSASQUARENORTHOWENBURYB09UZ |
1 | 2 Stephen parkway, Shawbury, W1 8YB | 2STEPHENPARKWAYSHAWBURYW18YB | 2 Stephen parkway, Shawbury, W1 8YB | 2STEPHENPARKWAYSHAWBURYW18YB |
2 | 475 Smith flats, Hawkinstown, E0F 7RG | 475SMITHFLATSHAWKINSTOWNE0F7RG | 475 Smith flats, Hawkinstown, E0F 7RG | 475SMITHFLATSHAWKINSTOWNE0F7RG |
3 | 721 Gareth burg, Jamiehaven, M70 4BG | 721GARETHBURGJAMIEHAVENM704BG | 721 Gareth burg, Jamiehaven, M70 4BG | 721GARETHBURGJAMIEHAVENM704BG |
4 | 27 Lynne lights, Oliviamouth, W7A 9GT | 27LYNNELIGHTSOLIVIAMOUTHW7A9GT | 27 Lynne lights, Oliviamouth, W7A 9GT | 27LYNNELIGHTSOLIVIAMOUTHW7A9GT |
Summing the matched addresses, we get the following:
In [13]:
df_merged_cleaned['clean_address_2'].notna().sum()
Out[13]:
17
So, from carrying out some preprocessing steps, we take our matches from 9 to 17 of the 25 addresses.
Let us have a look at the remaining 8:
In [14]:
df_remaining = df_merged_cleaned[df_merged_cleaned['address_2'].isna()]
df_remaining
Out[14]:
| | address | clean_address | address_2 | clean_address_2 |
|---|---|---|---|---|
5 | Flat 12A, Hilton pines, Port Daniellefurt, EN3… | FLAT12AHILTONPINESPORTDANIELLEFURTEN3N2YL | NaN | NaN |
6 | Studio 1, Hannah forest, West Ianbury, E5E 1YL | STUDIO1HANNAHFORESTWESTIANBURYE5E1YL | NaN | NaN |
7 | Flat 16, Geraldine mount, Port Rebecca, S29 4BQ | FLAT16GERALDINEMOUNTPORTREBECCAS294BQ | NaN | NaN |
11 | Studio 7, Joanne road, Frasershire, SA9 3GW | STUDIO7JOANNEROADFRASERSHIRESA93GW | NaN | NaN |
15 | 095 Ben neck, Richardsonberg, E2J 2WW | 095BENNECKRICHARDSONBERGE2J2WW | NaN | NaN |
16 | Flat 89, Thomas cliff, Sianchester, E5 9HQ | FLAT89THOMASCLIFFSIANCHESTERE59HQ | NaN | NaN |
20 | Flat 08, Smith mills, Clivestad, PE6V 9YU | FLAT08SMITHMILLSCLIVESTADPE6V9YU | NaN | NaN |
24 | Studio 92y, Bailey ranch, Geraldinemouth, L2W 9TZ | STUDIO92YBAILEYRANCHGERALDINEMOUTHL2W9TZ | NaN | NaN |
Here, we see the cases where the cleaning did not work. Cleaning can only take you so far; sometimes addresses are simply written in two different ways, e.g., one source includes the borough or county while another does not.
To try and match these remaining addresses, we will investigate the difflib module to find potential close matches.
In [15]:
# searching for close matches in the second DataFrame
# n=1 returns only the single closest match
# cutoff=0.6 is the default; suggestions must be at least that similar to count
close_matches = df_remaining['address'].apply(
    lambda x: difflib.get_close_matches(x, df_2['address_2'], n=1, cutoff=0.6)[0]
)
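One thing to be aware of: indexing the result with [0] assumes every remaining address has at least one candidate above the cutoff. If nothing clears the cutoff, get_close_matches returns an empty list and the [0] raises an IndexError. A small defensive variant (a sketch, not one of the original steps; the names closest_match and close_matches_safe are just for illustration) could return None instead:

def closest_match(address, candidates, cutoff=0.6):
    """Return the single closest candidate, or None if nothing clears the cutoff."""
    matches = difflib.get_close_matches(address, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# hypothetical safer version of the step above
close_matches_safe = df_remaining['address'].apply(lambda x: closest_match(x, df_2['address_2']))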
In [16]:
# creating a DataFrame to see the close matches
df_close_match = pd.DataFrame({'original_address': df_remaining['address'],
                               'close_match': close_matches})
df_close_match
Out[16]:
| | original_address | close_match |
|---|---|---|
5 | Flat 12A, Hilton pines, Port Daniellefurt, EN3… | Flat 12 Hilton pines, Port Daniellefurt, EN3N 2YL |
6 | Studio 1, Hannah forest, West Ianbury, E5E 1YL | Flat 1, Hannah forest, West Ianbury, E5E 1YL |
7 | Flat 16, Geraldine mount, Port Rebecca, S29 4BQ | Geraldine mount, Flat 16, Port Rebecca, S29 4BQ |
11 | Studio 7, Joanne road, Frasershire, SA9 3GW | 7 Joanne road Frasershire SA9 3GW |
15 | 095 Ben neck, Richardsonberg, E2J 2WW | 95 Ben neck, Richardsonberg, E2J2WW |
16 | Flat 89, Thomas cliff, Sianchester, E5 9HQ | 89 Thomas cliff Sianchester E5 9HQ |
20 | Flat 08, Smith mills, Clivestad, PE6V 9YU | flat 8, smith mills, clivestad, pe6v9yu |
24 | Studio 92y, Bailey ranch, Geraldinemouth, L2W 9TZ | Studio 92, Bailey ranch, Geraldinemouth, L2W 9TZ |
This method has found matches for all of the remaining addresses that were not matched directly.
This step does involve some manual checking to ensure it has matched the correct addresses, but it is a very helpful tool to guide you to the closest match!
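To help prioritise that manual check, you could also record how similar each suggested pair is. The snippet below is an illustrative addition (not one of the original steps) that uses difflib.SequenceMatcher to add a similarity score between 0 and 1, so the lowest-scoring pairs can be reviewed first:

# adding a similarity score to help prioritise manual review (illustrative)
df_close_match['similarity'] = df_close_match.apply(
    lambda row: difflib.SequenceMatcher(None, row['original_address'], row['close_match']).ratio(),
    axis=1,
)

# the lowest-scoring pairs are the ones most worth checking by hand
df_close_match.sort_values('similarity').head()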
Note: if you are searching for matches in a very large dataset, it could be helpful to split the addresses by postcode and only search through the subset of addresses with the postcodes you are interested in. This reduces the amount of data the matcher has to look through and takes much less time.
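As a rough sketch of that idea, you could block on the postcode before fuzzy matching. The helpers below are illustrative only: they assume the postcode is the final comma-separated part of each address and use its first two characters as the block key, which is good enough for this fake data but would need a more robust postcode parser on real data:

def postcode_block(address):
    """Crude block key: first two characters of the last comma-separated part."""
    return address.split(',')[-1].strip().upper()[:2]

def blocked_match(address, candidates_df, cutoff=0.6):
    """Only fuzzy-match against candidates that share the same postcode block."""
    block = postcode_block(address)
    candidates = candidates_df.loc[candidates_df['block'] == block, 'address_2']
    matches = difflib.get_close_matches(address, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# hypothetical usage on the example data
df_2['block'] = df_2['address_2'].apply(postcode_block)
close_matches_blocked = df_remaining['address'].apply(lambda x: blocked_match(x, df_2))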
Throughout this blog we have investigated three approaches to matching addresses:
- Direct matching
- Matching after cleaning
- Generating close matches
Although these steps have worked well in the example shown here, they may not work for all cases. There are many other approaches you could try if these do not work; some of these are outlined in this article.