Web Scraping 101

The web is a huge source of information and data, but that information is often presented in an unhelpful manner. Sometimes all you need is a concise summary – this is where web scraping comes in.

Take looking for jobs as an example: do you really want to spend all evening trawling through websites when only a handful of jobs are applicable to you? Web scraping can be used to extract specific pieces of information from multiple pages.

It’s important to note that not all websites allow you to scrape their pages. Before you begin, make sure to check the terms and conditions of the website.

This tutorial will show you how to web scrape, using the government’s apprenticeship site (https://www.findapprenticeship.service.gov.uk) as an example. Scraping this site is permitted under the OGL (Open Government Licence).

In this blog we will cover:

  • Web scraping a single page using the Beautiful Soup library in Python
  • Automating the process to scrape multiple pages

Importing the packages

Before we begin, we need to import the packages that will be useful throughout this example. The requests library allows us to download the data from a page. We will be using urllib.parse to combine URL components into a URL string. Last but not least, we will be using BeautifulSoup to scrape the information from the web page’s HTML.

In [1]:
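A minimal sketch of what the imports described above could look like. It assumes the requests and beautifulsoup4 packages are installed; the choice of urlencode and urljoin from urllib.parse is our assumption about which helpers are needed.

```python
# Download pages over HTTP
import requests

# Build URL strings from their components
from urllib.parse import urlencode, urljoin

# Parse the downloaded HTML
from bs4 import BeautifulSoup
```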

In addition to the link shown above, we need to set some parameters to get to the page we want to scrape. For the purpose of this example, we are going to look at all levels of apprenticeships within 30 miles of Nottingham, with the results set to show 50 per page.

The page we want to scrape can be seen below.

 

We can define these parameters beforehand and then, by using urllib.parse, combine the components into a URL string.

In [2]:
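Here is a rough sketch of defining the parameters first and then combining them. The base address comes from the link above, but the search path and the query-parameter names are hypothetical placeholders: copy the real ones from your browser's address bar after running the search by hand.

```python
from urllib.parse import urlencode, urljoin

# Base address of the site, as given in the text
BASE_URL = "https://www.findapprenticeship.service.gov.uk"

# Hypothetical path and parameter names, for illustration only
SEARCH_PATH = "/apprenticeships"
params = {
    "Location": "Nottingham",
    "WithinDistance": 30,          # miles
    "ApprenticeshipLevel": "All",
    "ResultsPerPage": 50,
}

# urljoin attaches the path to the base address;
# urlencode turns the dictionary into a query string
search_url = urljoin(BASE_URL, SEARCH_PATH) + "?" + urlencode(params)
print(search_url)
```

Because the location lives in a dictionary, trying a different town only means changing one value.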

Extracting the ‘soup’

As mentioned earlier, requests.get downloads a page when given the website’s URL.

The code below shows how to use urllib.parse to join the parameters defined in the code block above into a URL string.

The resulting URL string could also be typed straight into requests.get. However, defining the parameters beforehand lets you change the variables easily. You can even write a simple loop to run over multiple locations by defining the parameters first!

In [3]:
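A sketch of the download-and-parse step, under a couple of assumptions: get_soup is a helper name made up for this example, and the built-in "html.parser" is used as the parser. The live request is wrapped in a function so that the parsing can be tried on a small inline page without touching the network.

```python
import requests
from bs4 import BeautifulSoup


def get_soup(url, timeout=10):
    """Download a page and parse its HTML into a BeautifulSoup object."""
    response = requests.get(url, timeout=timeout)
    # Stop early on 4xx/5xx responses rather than parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# The parsing step on its own, using a tiny made-up page
sample_html = "<html><body><h1>Find an apprenticeship</h1></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
print(sample_soup.h1.get_text())
```

For the real page you would call get_soup with the URL string built earlier.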

Web pages are written in HTML, which is made up of many different elements. We want to find specific elements and extract these from the main content of the page.

We will be looking for unique class or id properties in the HTML code to specify which parts of the ‘soup’ we want to extract.

Inspecting the HTML code shows that the main content of the page sits inside an element with the class search-results. This can be extracted using the find command as shown in the code block below.

 

In [4]:
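To illustrate find, here is a small sketch run against a simplified stand-in for the page. Only the class name search-results comes from the text; the surrounding markup is invented, so the real page will look different.

```python
from bs4 import BeautifulSoup

# Simplified, made-up page: only the class name is taken from the text
html = """
<div class="header">Site navigation</div>
<div class="search-results">
  <ul><li>Advert one</li><li>Advert two</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because class is a reserved word in Python
main_content = soup.find(class_="search-results")
print(len(main_content.find_all("li")))
```

find returns the first match only, or None when nothing matches, so it is worth checking the result before going further.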

Once the main content of the page is extracted, the next step is to identify which information you want to extract from the page. For this example, we will look at the occupation description, the company offering the apprenticeship, the apprenticeship level and the salary. First, we define these variables as empty lists.

In [5]:

Within the main content, we now need to find all of the apprenticeship adverts individually. Each apprenticeship card is an li element with class=research-result sfa-section-bordered.

We can find all li elements with class=research-result in a similar way to how we found the main content. The following piece of code can be used.

In [6]:
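A sketch of the find_all step on made-up markup. The class names are the ones quoted in the text (check them against the live page when you inspect it); the advert text is invented.

```python
from bs4 import BeautifulSoup

# Made-up markup using the class names quoted in the text
html = """
<div class="search-results">
  <ul>
    <li class="research-result sfa-section-bordered">Advert one</li>
    <li class="research-result sfa-section-bordered">Advert two</li>
    <li class="pagination">Next</li>
  </ul>
</div>
"""
main_content = BeautifulSoup(html, "html.parser").find(class_="search-results")

# Searching for one class is enough: class_ matches any single class
# of an element that carries several
adverts = main_content.find_all("li", class_="research-result")
print(len(adverts))
```

Unlike find, find_all returns a list of every match, which is what lets us loop over the adverts afterwards.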

Within each advert we can find the variables that we defined above. By inspecting the information we are interested in (occupation description is used as an example in the image below) we can extract that part of the soup for each apprenticeship advert.

By looping over each apprenticeship advert, we can find the information, extract the text and append it to a list. This is all shown in the code block below.

In [7]:
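The loop could be sketched roughly as follows. The tag and class names inside each advert (title, employer, level, wage) are hypothetical, as are the list names and the sample adverts; only the wage format is taken from the text.

```python
from bs4 import BeautifulSoup

# Made-up adverts; the inner class names are hypothetical
html = """
<ul>
  <li class="research-result sfa-section-bordered">
    <h2 class="title">Business Administration Apprentice</h2>
    <span class="employer">Example Company Ltd</span>
    <span class="level">Intermediate</span>
    <span class="wage">Wage : £168.75 per week</span>
  </li>
  <li class="research-result sfa-section-bordered">
    <h2 class="title">Engineering Apprentice</h2>
    <span class="employer">Another Example Ltd</span>
    <span class="level">Advanced</span>
    <span class="wage">Wage : £200.00 per week</span>
  </li>
</ul>
"""
adverts = BeautifulSoup(html, "html.parser").find_all("li", class_="research-result")

# One empty list per piece of information we want to keep
descriptions, companies, levels, salaries = [], [], [], []

for advert in adverts:
    descriptions.append(advert.find(class_="title").get_text(strip=True))
    companies.append(advert.find(class_="employer").get_text(strip=True))
    levels.append(advert.find(class_="level").get_text(strip=True))
    wages = advert.find(class_="wage").get_text()
    # Keep only the part after the colon; same as wages.split(":")[1].strip()
    salaries.append(str.split(wages, ":")[1].strip())

print(salaries)
```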

Note that we are using str.split(wages, ':')[1].strip(). If we had printed wages before using this line of code, we would get output of the form Wage : £168.75 per week (for example). So, by splitting this string at ':', we can take the part after the colon with [1] and remove the surrounding whitespace with .strip().

Putting all of the information into a dataframe:

In [8]:
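A sketch of building the dataframe with pandas, which we assume is installed. The dataframe name df_apprenticeships appears later in this post; the column names and the sample values are our own invention, standing in for the lists filled by the scraping loop.

```python
import pandas as pd

# Made-up values standing in for the lists filled while scraping
descriptions = ["Business Administration Apprentice", "Engineering Apprentice"]
companies = ["Example Company Ltd", "Another Example Ltd"]
levels = ["Intermediate", "Advanced"]
salaries = ["£168.75 per week", "£200.00 per week"]

# Each list becomes one column; the column names are illustrative
df_apprenticeships = pd.DataFrame({
    "Occupation": descriptions,
    "Company": companies,
    "Level": levels,
    "Salary": salaries,
})
print(df_apprenticeships.shape)
```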

In [9]:

Out[9]:

Automating the process over multiple pages

Now we want to automate this process to run over multiple pages. How do we do this without using a for loop? Don’t get me wrong – a for loop would work for this, but if you want to look at multiple locations, how do you know in advance how many pages you need to sift through?

We are basically going to use the same code within a while loop. But first, we need to inspect the next button.

We will start with

In [10]:

This is the same as in the earlier example. However, this time we will overwrite the next link in our loop every time we move to a new page.

By inspecting the next page button (shown above), we can extract the URL using the following code:

In [11]:
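A sketch of pulling the URL out of the next button. The markup, the class name next and the PageNumber parameter are hypothetical stand-ins for whatever you see when you inspect the real button.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.findapprenticeship.service.gov.uk"

# Hypothetical markup for the next page button
html = '<a class="next" href="/apprenticeships?PageNumber=2">Next</a>'
soup = BeautifulSoup(html, "html.parser")

next_button = soup.find("a", class_="next")
# The href is relative, so join it onto the base address.
# If there is no button we are on the last page, so use None.
next_link = urljoin(BASE_URL, next_button["href"]) if next_button else None
print(next_link)
```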

We put all of this inside a while loop, which runs all of the code written above for as long as a next link exists. This allows us to scrape multiple pages.

The final piece of code is shown below.

In [12]:
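One way the whole thing could fit together is sketched here; it is not necessarily how the original notebook did it. The function names, the inner class names and the example.com addresses are all hypothetical, and the rows are collected as a list of dictionaries rather than four separate lists. Two fake pages stand in for the live site so the loop can be run without a network connection.

```python
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def download(url):
    """Fetch a page's HTML over HTTP."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_all_pages(start_url, fetch=download):
    """Follow the 'next' links from start_url, collecting one row per advert."""
    rows = []
    next_link = start_url
    while next_link:
        soup = BeautifulSoup(fetch(next_link), "html.parser")
        for advert in soup.find_all("li", class_="research-result"):
            wages = advert.find(class_="wage").get_text()
            rows.append({
                "Occupation": advert.find(class_="title").get_text(strip=True),
                "Company": advert.find(class_="employer").get_text(strip=True),
                "Level": advert.find(class_="level").get_text(strip=True),
                "Salary": str.split(wages, ":")[1].strip(),
            })
        # Overwrite next_link on every pass; with no next button it becomes
        # None, which ends the while loop
        next_button = soup.find("a", class_="next")
        next_link = urljoin(next_link, next_button["href"]) if next_button else None
    return rows


def fake_advert(title):
    """Build one made-up advert card using the hypothetical markup."""
    return (
        '<li class="research-result sfa-section-bordered">'
        f'<h2 class="title">{title}</h2>'
        '<span class="employer">Example Company Ltd</span>'
        '<span class="level">Intermediate</span>'
        '<span class="wage">Wage : £168.75 per week</span></li>'
    )


# Two fake pages stand in for the live site so the loop can run offline
fake_site = {
    "https://example.com/search?page=1": (
        fake_advert("Advert one") + fake_advert("Advert two")
        + '<a class="next" href="/search?page=2">Next</a>'
    ),
    "https://example.com/search?page=2": fake_advert("Advert three"),
}

rows = scrape_all_pages("https://example.com/search?page=1", fetch=fake_site.__getitem__)
print(len(rows), "adverts scraped")
```

Against the real site you would leave fetch at its default and pass in the search URL; pandas.DataFrame(rows) then turns the result into a dataframe with one column per key.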

In [13]: df_apprenticeships
Out[13]:

1531 rows × 4 columns

This now shows all of the apprenticeship opportunities within 30 miles of Nottingham.

There are so many uses of web scraping and this example is just scratching the surface! But hopefully this tutorial provides an introduction to the basics.

REMEMBER: Check the terms and conditions before you go ahead!

By Holly Jones