Web Scraping 101
The web is a huge source of information and data, but that information is often presented in an unhelpful way. Sometimes all you need is a concise summary, and this is where web scraping comes in.
Take job hunting as an example: do you really want to spend all evening trawling through websites when only a handful of the jobs are applicable to you? Web scraping can be used to extract specific pieces of information from multiple pages.
It’s important to note that not all websites allow you to scrape their pages. Before you begin, make sure to check the website’s terms and conditions.
This tutorial will show you how to web scrape using the example of the government’s apprenticeship site (https://www.findapprenticeship.service.gov.uk), which is permitted under the Open Government Licence (OGL).
In this blog we will cover:
- Web scraping a single page using the Beautiful Soup library in Python
- Automating the process to scrape multiple pages
Importing the packages
Before we begin, we need to import the packages that will be used throughout this example. The requests library allows us to download the data from a page, urllib.parse lets us combine URL components into a URL string, and last but not least BeautifulSoup scrapes the information from the web page HTML.
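The imports for this walkthrough might look like the following sketch (it assumes the requests and beautifulsoup4 packages are installed alongside the standard library):

```python
# requests downloads pages, urllib.parse builds URL query strings,
# and BeautifulSoup parses the downloaded HTML.
import requests
from urllib.parse import urlencode
from bs4 import BeautifulSoup
```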
In addition to the link shown above, we need to set some parameters to reach the page we want to scrape. For this example, we are going to look at apprenticeships of all levels within 30 miles of Nottingham, showing 50 results per page.
The page we want to scrape can be seen below.
We can define these parameters beforehand and then by using urllib.parse we can combine the components into a URL string.
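A parameter dictionary along these lines would do the job. The key names below are my own illustrative guesses; the real query-string keys used by findapprenticeship.service.gov.uk may differ, so check the URL in your browser's address bar after running a search.

```python
# Hypothetical search parameters; inspect a real search URL on the site
# to confirm the exact query-string key names.
base_url = "https://www.findapprenticeship.service.gov.uk/apprenticeships/search"
params = {
    "Location": "Nottingham",      # centre of the search
    "WithinDistance": 30,          # miles from the location
    "ApprenticeshipLevel": "All",  # include all apprenticeship levels
    "ResultsPerPage": 50,          # show 50 adverts per page
}
```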
Extracting the ‘soup’
As mentioned earlier, we can download the page by passing the website’s URL to requests.get.
Below shows how to use urllib.parse to join together the parameters defined in the code block above into a URL string.
The resulting URL string works just as well when passed straight into requests.get. However, defining the parameters beforehand makes it easy to change the variables. You can even write a simple loop to run over multiple locations by defining the parameters first!
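Putting the two steps together might look like the sketch below. The parameter names are again assumptions, and the download is wrapped in a try/except so the sketch degrades gracefully when no network connection is available.

```python
from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup

# Illustrative parameter names; the site's real query-string keys may differ.
base_url = "https://www.findapprenticeship.service.gov.uk/apprenticeships/search"
params = {"Location": "Nottingham", "WithinDistance": 30, "ResultsPerPage": 50}

# urlencode turns the dict into "Location=Nottingham&WithinDistance=30&..."
url = base_url + "?" + urlencode(params)

# Downloading requires network access, so fail soft if it is unavailable.
try:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
except requests.RequestException:
    soup = None
```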
Web pages are written in HTML, which is made up of many different elements. We want to find specific elements and extract them from the main content of the page.
We will be looking for unique class or id properties in the HTML code to specify which parts of the ‘soup’ we want to extract.
By inspecting the HTML code, we can see that we need the class search-results to extract the main content from the page. This can be done using the find method, as shown in the code block below.
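A minimal sketch of the find call, run here on a toy HTML snippet that mirrors the structure described above (the real page is of course much larger):

```python
from bs4 import BeautifulSoup

# Toy snippet standing in for the downloaded page.
html = """
<html><body>
  <div class="header">navigation</div>
  <div class="search-results"><ul><li>advert</li></ul></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) avoids clashing with Python's keyword.
results = soup.find(class_="search-results")
```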
Once the main content of the page is extracted, the next step is to identify which information you want to extract from the page. For this example, we will look at the occupation description, the company offering the apprenticeship, the apprenticeship level and the salary. First, we define these variables as empty lists.
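Something like the following, where the variable names are my own choices rather than the original post's:

```python
# Empty lists to collect each field as we loop over the adverts.
titles = []
companies = []
levels = []
salaries = []
```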
Within the main content, we now need to find each apprenticeship advert individually. Each apprenticeship card is an li element with the class research-result sfa-section-bordered. We can find all li elements with the class research-result in a similar way to the main content, using the following piece of code.
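A sketch of the find_all call, again on toy markup using the class names described above; the live site's markup may well have changed since this was written:

```python
from bs4 import BeautifulSoup

# Toy markup: two advert cards plus one unrelated list item.
html = """
<ul>
  <li class="research-result sfa-section-bordered">advert one</li>
  <li class="research-result sfa-section-bordered">advert two</li>
  <li class="page-navigation">next</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all matches every li whose class list contains research-result.
adverts = soup.find_all("li", class_="research-result")
```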
By looping over each apprenticeship advert we can find the information, extract the text and append it to a list. This is all shown in the code block below.
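The find–extract–append pattern might look like this. The class names inside each card (title, company, level, salary) are assumptions chosen purely to illustrate the pattern; inspect the real cards to find the actual ones.

```python
from bs4 import BeautifulSoup

# Toy advert cards; the inner class names are hypothetical.
html = """
<ul>
  <li class="research-result sfa-section-bordered">
    <h2 class="title">Software Developer Apprentice</h2>
    <p class="company">Acme Ltd</p>
    <p class="level">Advanced</p>
    <p class="salary">£12,000</p>
  </li>
  <li class="research-result sfa-section-bordered">
    <h2 class="title">Engineering Apprentice</h2>
    <p class="company">Widgets plc</p>
    <p class="level">Higher</p>
    <p class="salary">£15,000</p>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

titles, companies, levels, salaries = [], [], [], []
for advert in soup.find_all("li", class_="research-result"):
    # get_text(strip=True) extracts the text and trims the whitespace.
    titles.append(advert.find(class_="title").get_text(strip=True))
    companies.append(advert.find(class_="company").get_text(strip=True))
    levels.append(advert.find(class_="level").get_text(strip=True))
    salaries.append(advert.find(class_="salary").get_text(strip=True))
```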
Putting all of the information into a dataframe:
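With the four lists filled, a pandas DataFrame is one line away. The values below are the illustrative ones from the loop sketch, not real scraped data:

```python
import pandas as pd

# Example lists as produced by the extraction loop (illustrative values).
titles = ["Software Developer Apprentice", "Engineering Apprentice"]
companies = ["Acme Ltd", "Widgets plc"]
levels = ["Advanced", "Higher"]
salaries = ["£12,000", "£15,000"]

# One column per field, one row per advert.
df = pd.DataFrame({
    "title": titles,
    "company": companies,
    "level": levels,
    "salary": salaries,
})
```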
Automating the process over multiple pages
Now we want to automate this process to run over multiple pages. How do we do this while avoiding a for loop? Don’t get me wrong, a for loop would work here, but if you want to look at multiple locations, how do you know in advance how many pages you will need to sift through?
We are basically going to reuse the same code inside a while loop. But first, we need to inspect the next button.
We will start with
This is the same as in the earlier example. However, this time we will rewrite the next link in our loop every time we move to a new page.
By inspecting the next page button (shown above) we can extract its URL with the following code:
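A sketch of pulling the href out of the next button. The class name next on the anchor is an assumption standing in for whatever the real button uses:

```python
from bs4 import BeautifulSoup

# Toy pagination markup; the real button's class name may differ.
html = '<div class="pager"><a class="next" href="/apprenticeships/search?page=2">Next</a></div>'
soup = BeautifulSoup(html, "html.parser")

next_link = soup.find("a", class_="next")
# Indexing with ["href"] reads the URL from the anchor tag's attributes;
# on the last page find returns None, so guard before indexing.
next_url = next_link["href"] if next_link else None
```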
We then put all of this inside a while loop: while the next link exists, run all of the code written above. This allows us to scrape across multiple pages.
The final piece of code is shown below.
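The whole pipeline can be sketched as follows. To keep the sketch self-contained it walks two small in-memory pages instead of the live site; in practice fetch would simply return requests.get(url).text, and all the class names remain the assumptions flagged earlier.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Two toy pages stand in for the live site; page1 links to page2,
# and page2 has no next link, which terminates the loop.
PAGES = {
    "page1": '<ul><li class="research-result"><h2 class="title">Dev Apprentice</h2>'
             '<p class="company">Acme</p><p class="level">Advanced</p>'
             '<p class="salary">£12,000</p></li></ul>'
             '<a class="next" href="page2">Next</a>',
    "page2": '<ul><li class="research-result"><h2 class="title">Eng Apprentice</h2>'
             '<p class="company">Widgets</p><p class="level">Higher</p>'
             '<p class="salary">£15,000</p></li></ul>',
}

def fetch(url):
    # In real use: return requests.get(url, timeout=10).text
    return PAGES[url]

rows = []
url = "page1"
# While a next link exists, scrape the page and move on; the loop stops
# on the last page, so we never need to know the page count in advance.
while url is not None:
    soup = BeautifulSoup(fetch(url), "html.parser")
    for advert in soup.find_all("li", class_="research-result"):
        rows.append({
            "title": advert.find(class_="title").get_text(strip=True),
            "company": advert.find(class_="company").get_text(strip=True),
            "level": advert.find(class_="level").get_text(strip=True),
            "salary": advert.find(class_="salary").get_text(strip=True),
        })
    next_link = soup.find("a", class_="next")
    url = next_link["href"] if next_link else None

df = pd.DataFrame(rows)
```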
1531 rows × 4 columns