Using GeoPandas to count nearby bus stops
Introduction
At Bays Consulting, one of our current passions lies in delivering insight for local areas in the UK. As a data scientist, I am in my comfort zone when working with data that describes abstract entities, like financial transactions or computer network events. At a basic level, analysis and profiling of such data generally relies on simple, exact keys for joining and aggregation. But to explore more deeply what’s happening in a local community, we need to expand our set of available tools and get to grips with geospatial analysis; that is, analysis of data that includes points, lines and polygons in space.
In this blog I am going to take a short dive into some geospatial analysis, by asking the question of how many bus stops can be reached easily from a local area. I’ll be making particular use of the geopandas library.
Where to get the data?
I’m going to use two publicly available data sets in today’s blog.
Firstly, to profile small localities, I’m going to work at the level of Lower Layer Super Output Areas (LSOAs). These are a common unit of statistical geography here in the UK, each aiming to encompass about 1,500 people. It’s handy to work with these when building up a profile of a local area, as there is a good supply of up-to-date open socioeconomic data despite their relatively small size. I’ve taken boundary data for the LSOAs in a portion of North Yorkshire from the ONS’s Open Geography Portal.
Secondly, to count public transport access points, the Department for Transport in the UK helpfully publishes a comprehensive, standardised gazetteer of all transport stops, known as NaPTAN.
Bringing the data in
To run these examples in Python 3, you will need to install the packages below.
In [1]:
import requests       # for internet downloads
import urllib.parse   # URL manipulation (the parse submodule is used below)
import pandas as pd   # standard dataset manipulation
import geopandas      # pandas extension for working with spatial datasets
import descartes      # needed for making the figures
import shapely        # base geometry
import matplotlib     # plotting
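If any of these are missing from your environment, they can usually be installed with pip; a rough sketch (package names here are assumed to match the imports above, and geopandas will pull in its own dependencies such as pyproj):

# from a notebook cell (drop the leading ! if running in a terminal)
!pip install requests pandas geopandas descartes shapely matplotlib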
Now I’ll construct a query to the Geoportal to download the LSOAs in my area of interest, as JSON, and pop them into a GeoDataFrame. I’ll explain the significance of some of the code in comments. You’ll notice we’re working in latitude and longitude, also known as EPSG:4326; more on this later!
In [2]:
# the base URL
base_url = "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/lsoa/FeatureServer/0/query"

# a rectangular slice of North Yorks
bounds = [-1.26, 53.82, 0.00, 54.39]

params = {
    'outFields': ','.join(("LSOA11CD", "LSOA11NM")),   # choose to keep LSOA code and name in the results
    'geometryType': "esriGeometryEnvelope",            # our area of interest is a rectangle
    'spatialRel': "esriSpatialRelIntersects",          # get all data intersecting our rectangle
    'geometry': ','.join(str(b) for b in bounds),      # send the bounding rectangle as a comma-delimited string
    'inSR': 4326, 'outSR': 4326,                       # use good old latitude/longitude as the coordinates
    'f': "json",                                       # JSON please, something GeoPandas can work with
    'resultType': 'standard',                          # download all results, rather than paginating
}

url = base_url + '?' + urllib.parse.urlencode(params)    # combine our parameters into a query string
areas = geopandas.read_file(url).set_index('LSOA11CD')   # download and parse using GeoPandas, use LSOA code as index
Now we need the bus stops from NaPTAN. The full dataset is rather large, but we can save some time by only requesting data for North Yorkshire and the East Riding; these areas are coded as 320 and 220 in the National Public Transport Gazetteer, and the city of York is labelled separately as 329. The data comes as a zipped set of CSVs; we’re just interested in the stop locations today. Unhelpfully, the CSV files are nested inside a second zip file for each region, but the Python code given here can nevertheless dig them out.
In [3]:
# libraries to work with zip files
from io import BytesIO
from zipfile import ZipFile

# the North Yorks dataset
url = "https://naptan.app.dft.gov.uk/DataRequest/Naptan.ashx?format=csv&LA=220|320|329"  # high-level areas of interest

stops = []
with requests.get(url) as r:
    with ZipFile(BytesIO(r.content)) as z1:
        for f in z1.filelist:                         # list the contents of top level zip file
            if f.filename.endswith(".zip"):           # find all nested zip files
                with z1.open(f.filename) as f1:       # open the nested zip file for reading
                    with ZipFile(f1) as z2:
                        with z2.open("Stops.csv") as f2:  # open the csv for reading
                            # use pandas to read the csv data, keeping certain columns of interest
                            stops.append(pd.read_csv(
                                f2,
                                usecols=['ATCOCode', 'CommonName', 'Latitude', 'Longitude'],
                                index_col='ATCOCode',
                                encoding='iso-8859-1'))   # the string data is encoded in a particular way

stops = pd.concat(stops)  # combine the list of data frames into one big data frame
Let’s have a quick look at that data. The good news is that each stop is named and geo-located in latitude/longitude co-ordinates, but we need to do a little more processing to get the data into GeoPandas for spatial analysis.
In [4]:
stops.head(3)
Out[4]:
ATCOCode       CommonName               Longitude   Latitude
2200YEA00079   Beverley Lairgate        -0.430858   53.839716
2200YEA00067   Beverley Swinemoor Lane  -0.409905   53.846380
2200YEZ01389   Cowden Lane End          -0.134892   53.865639
To convert the plain pandas DataFrame into a GeoDataFrame, we apply a function to each row of the data, combining the latitude and longitude into a Point object. We can use the resulting set of points as the geometry of our new data frame. Finally, to constrain today’s analysis a bit, we can use the clip utility to subset our stops to the total bounding rectangle of our areas of interest.

Note that the Point constructor takes its arguments in the order x (i.e. east/west) followed by y, so we must pass in longitude and then latitude; usually latitude and longitude are quoted in the reverse order!
In [5]:
from shapely.geometry import Point, box

points = stops.apply(lambda x: Point(x.Longitude, x.Latitude), axis='columns')
stops = geopandas.GeoDataFrame(stops, geometry=points, crs='EPSG:4326')  # convert to GeoDataFrame, marking that lat/longs are used
stops = geopandas.clip(stops, box(*areas.total_bounds))                  # form a bounding rectangle from the areas, and clip the stop data
Let’s summarise what we’ve done so far. We have created two geometry-aware data frames:
- areas: names and polygons of local areas (LSOAs)
- stops: point locations of public transport access points, i.e. bus stops.
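Before plotting, a quick sanity check (a small sketch, not part of the original walkthrough) confirms that both frames carry geometry in the same co-ordinate reference system, and gives a feel for their size:

print(areas.crs, stops.crs)    # both should report EPSG:4326 at this stage
print(len(areas), len(stops))  # number of LSOA polygons and number of clipped stops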
We can use the built-in geopandas plotting utilities to visualize these.
In [6]:
ax1 = areas.plot(facecolor='lightblue', edgecolor='darkblue', figsize=(15, 10))  # the LSOA polygons
ax2 = stops.plot(ax=ax1, color='red', marker='^', markersize=2.0)                # the stops, using the same plot
ax3 = ax1.set(title='LSOAs and Bus Stops in Yorkshire')                          # add a title
Reassuringly, the population centres of York (bottom left), Bridlington (middle right), and Scarborough (upper right) stand out. Looking more closely at the rural areas, we can see how the bus stops string out along what we can presume are major roads.
Finding the density of stops
Now that we have the data in, we can start to analyse it further. Today I would like to join the two datasets together, in order to count how many bus stops are accessible from each LSOA.
Usually in a data analysis task of this nature, I would be looking for a key on which to join the two data frames together. But in this kind of geographic analysis, it’s a spatial join: matching the two sets of data whenever one of the points lies within, or near, one of the polygons. Fortunately, geopandas takes care of most of the heavy lifting with the sjoin (i.e. spatial join) method.
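To preview the shape of that call before we refine it below (an illustrative sketch only; recent versions of geopandas use the keyword predicate, while older releases call it op):

# illustrative only: match each stop to the LSOA polygon it falls inside
matched = geopandas.sjoin(
    areas,                    # left frame: the polygons
    stops.to_crs(areas.crs),  # right frame: the points, in the same co-ordinate system
    how='inner',              # keep only rows where a point actually matches a polygon
    predicate='intersects',   # geometric test to use (older geopandas: op='intersects')
)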
Looking outside the box
Before I can employ the sjoin, I must remember that I don’t want to match bus stops only when they lie within an area, but also when they are near enough to an area to be considered accessible. Without getting into the complications of roads, footpaths etc., I shall make the rough assumption that a point is ‘accessible’ if it lies within 1km of the area.
We can use the geopandas buffer method to expand each polygon area by a fixed distance beyond its original border, but first we must reproject our geometric co-ordinates.
As I mentioned above, we have obtained and processed the data so far using the spherical concepts of latitude and longitude. The libraries we’re using, however, calculate distance in a Euclidean manner, using our good old friend Pythagoras’s theorem. This is not a big problem if our analysis is constrained to the UK, since we can use “eastings and northings” as coordinates, which project the UK onto a Cartesian grid, with distances measured in metres. These units may already be familiar from the markings on an Ordnance Survey map! The standardised name for this system is EPSG:27700.
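As a quick illustration (a rough sketch using two made-up points at York’s latitude; the exact figure may differ by a metre or two):

from shapely.geometry import Point

# two hypothetical points 0.01 degrees of longitude apart
pair = geopandas.GeoSeries([Point(-1.08, 53.96), Point(-1.07, 53.96)], crs='EPSG:4326')
pair_bng = pair.to_crs('EPSG:27700')                # reproject to eastings/northings
print(pair_bng.iloc[0].distance(pair_bng.iloc[1]))  # a straight-line distance in metres, roughly 650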
Let’s start by putting this into practice on one example area.
In [7]:
example = areas.loc[['E01012940']].to_crs('EPSG:27700')                 # pick an example, and transform co-ordinates
ax1 = example.buffer(1000).plot(facecolor='lightblue', figsize=(5, 5))  # fill the area expanded by 1000 metres
ax2 = example.plot(facecolor='none', edgecolor='darkblue', ax=ax1)      # add the original boundary line to the same plot
We can clearly see the original area surrounded by its expanded, or buffered, extent at 1km away. Notice the axes are now labelled with the units in eastings/northings, just like an OS map.
Putting it all together
We’re now ready to combine the two datasets with a spatial join. We mustn’t forget to reproject the points as well as the areas, or we would not be comparing like with like. We’ll summarize the results by counting and plotting the number of matched points per area.
In [8]:
buffers = areas.set_geometry(areas.to_crs('EPSG:27700').buffer(1000))  # new GeoDataFrame with expanded areas
joined = geopandas.sjoin(buffers, stops.to_crs('EPSG:27700'))          # spatial join
counts = joined.index.value_counts().rename('stopsPerLSOA')            # count the number of resulting rows per LSOA
To illustrate the results, I have plotted the stop count for each LSOA to show the variation in bus stop density, firstly as a bar chart and secondly as a shaded map. You can see quite clearly the concentration of bus stops that lie close to the City of York in the bottom-left quadrant.
In [9]:
# set up a plot area
from matplotlib import pyplot as plt
fig, axes = plt.subplots(ncols=2, figsize=(20, 5))

# bar chart, no need for x-axis
axes[0].bar(counts.index, counts, width=1)
axes[0].get_xaxis().set_visible(False)

# shaded map aka choropleth
counts_areas = geopandas.GeoDataFrame(counts, geometry=areas.geometry)       # new GeoDataFrame combining count and polygon
counts_areas.plot('stopsPerLSOA', cmap='coolwarm', legend=True, ax=axes[1])  # make a shaded map

text = fig.suptitle('Bus stops in Yorkshire per LSOA')
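If you want to dig past the charts, one option (a small sketch, not part of the original figures) is to rank the areas and attach their names, to see which LSOAs come out with the most and the fewest accessible stops:

ranked = counts.to_frame().join(areas['LSOA11NM'])                 # attach the human-readable LSOA names
print(ranked.sort_values('stopsPerLSOA', ascending=False).head())  # the best-served areas
print(ranked.sort_values('stopsPerLSOA').head())                   # the least-served areas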
Conclusion
In my blog today I have aimed to provide some examples of working on geospatial data with GeoPandas, including:
- data import and cleansing
- understanding and transforming co-ordinate systems
- geometric operations such as clip and buffer
- spatial joins.
Beyond the technical walk-through, along the way I’ve shown a first step of insight into two openly available datasets concerning public transport in the UK. Although at first glance these might seem of interest only to nerdy enthusiasts, the maps produced hint at the wide variation in accessibility of public transport, particularly in rural areas of the UK. As we work together to level up and rebuild local economies, not least in the light of the COVID-19 pandemic, this provides food for thought on the barriers faced by many people up and down the land. In the short term these barriers can prevent access to assistance, including food aid, but above all they can be part of structural challenges to skills and employment.
I’m looking forward to engaging more with these important areas in my time here at Bays Consulting. For now, I hope this is useful for any data scientists looking to take first steps in geospatial analysis.
By Matthew Silverman,