PDF files are all over the internet and often contain a lot of important information. However, extracting that information can sometimes seem an impossible task.
In this blog post we will learn how to extract the text from the PDF given here and collect clean sentences into a data frame.
First we need to import the following libraries. Note that pdfminer.six is the actively maintained fork of pdfminer used throughout this tutorial; if you do not already have it, it can be installed with pip (pip install pdfminer.six).
In [1]:
import pandas as pd
import re
from io import StringIO
from typing import Iterable, Tuple, Optional

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
In [2]:
file = "C:\\Users\\HollyBays\\smart-meter-policy-framework-post-2020-consultation.pdf"
I have saved the file to my local drive, so this path will look different depending on where you decide to save the PDF.
The following function takes the filename as an argument and yields the page ID and raw text of each page back to the caller. Some PDFs are in a protected format, which can cause problems when you try to scrape the text – setting the argument check_extractable to False lets you work around this.
In [3]:
def convert_pdf2txt(filename) -> Iterable[Tuple[int, str]]:
    rsrcmgr = PDFResourceManager()
    with open(filename, 'rb') as fp:
        pages = PDFPage.get_pages(fp, check_extractable=False)
        for page in pages:
            with StringIO() as s:
                with TextConverter(rsrcmgr, s) as device:
                    PDFPageInterpreter(rsrcmgr, device).process_page(page)
                # Read the page's text before the buffer is closed
                text = s.getvalue()
            yield page.pageid, text
Now we can write a function which uses our convert_pdf2txt function and outputs a data frame of cleaned sentences. Wrapping the generator in enumerate attaches an index to each (page_id, page_text) tuple, which gives us the page numbers of our PDF.
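The reason for using enumerate is that the page_id pdfminer reports is an internal object identifier rather than a sequential page number. A minimal sketch, using a hypothetical stand-in for the (page_id, page_text) pairs our generator yields:

```python
# Hypothetical stand-in for the pairs yielded by convert_pdf2txt;
# note the page IDs are not sequential
fake_pages = [(4, "first page text"), (9, "second page text")]

page_numbers = []
for page_num, (page_id, page_text) in enumerate(fake_pages):
    # enumerate counts from 0, so add 1 for the human-readable page number
    page_numbers.append(page_num + 1)

print(page_numbers)  # [1, 2]
```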
To find the sentences we use re.split(), which splits the string at each occurrence of a regex pattern. In our case, we split the string every time there is a full stop, unless that full stop appears within a decimal number. We also split whenever there is a double line break or a bullet point.
Note: if you are unfamiliar with regex, you can practice writing regular expressions at https://regex101.com/.
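To see the pattern in action, here is a small sketch on a hypothetical snippet of page text containing a decimal number, a double line break and bullet points:

```python
import re

# Hypothetical snippet of page text
text = "Costs fell to 13.5 pence. See below:\n\n• item one • item two"

# Split on a full stop not followed by a digit, on a double line break,
# or on a bullet point
parts = re.split(r"[.](?![0-9])|\n{2}|•", text)
print(parts)
```

Note that the full stop inside 13.5 does not trigger a split, thanks to the negative lookahead (?![0-9]).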
Once we have split the text into sentences, we use re.sub(), which replaces every match of a pattern with a given string. To make the sentences cleaner, we replace any run of whitespace with a single space, and then use .strip() to remove the spaces at the beginning and end of each sentence.
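For example, on a hypothetical sentence fragment with the kind of whitespace pdfminer often leaves behind:

```python
import re

# Hypothetical fragment with leading spaces, a line break and tabs
raw = "  Delivering a\nsmart   system\t "

# Collapse every run of whitespace to a single space, then trim the ends
clean = re.sub(r"\s+", " ", raw).strip()
print(clean)  # Delivering a smart system
```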
We keep a sentence only if it contains more than one character and, if any sentences are found, we put them into a data frame.
In [4]:
def clean_sentences(pdf) -> Optional[pd.DataFrame]:
    sentences = []
    for page_num, (page_id, page_text) in enumerate(convert_pdf2txt(pdf)):
        # Split on a full stop (unless it sits inside a decimal number),
        # a double line break, or a bullet point
        for sentence in re.split(r"[.](?![0-9])|\n{2}|•", page_text):
            # Collapse runs of whitespace and trim the ends
            sentence_clean = re.sub(r"\s+", " ", sentence).strip()
            if len(sentence_clean) > 1:
                sentences.append({'sentence': sentence_clean,
                                  'page_number': page_num + 1})
    if len(sentences) > 0:
        return pd.DataFrame(sentences)
    return None
In [5]:
clean_sentences(file)
Out[5]:
| | sentence | page_number |
|---|---|---|
| 0 | September 2019 DELIVERING A SMART SYSTEM Consu… | 1 |
| 1 | © Crown copyright 2019 This publication is lic… | 2 |
| 2 | To view this licence, visit nationalarchives | 2 |
| 3 | gov | 2 |
| 4 | uk/doc/open-government-licence/version/3 or wr… | 2 |
| … | … | … |
| 738 | uk/government/consultations/smart-meter-policy… | 41 |
| 739 | gov | 41 |
| 740 | uk | 41 |
| 741 | Please tell us what format you need | 41 |
| 742 | It will help us if you say what assistive tech… | 41 |
743 rows × 2 columns
With the code above you will find it much easier to extract important information from PDFs, store it neatly, and analyse it with Python.
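As a quick illustration of that last point, here is a minimal sketch of filtering the resulting data frame and saving it for later analysis. The two-row data frame and the output filename are hypothetical stand-ins for the real output of clean_sentences:

```python
import pandas as pd

# Hypothetical miniature of the data frame clean_sentences returns
df = pd.DataFrame([
    {"sentence": "September 2019 DELIVERING A SMART SYSTEM", "page_number": 1},
    {"sentence": "Please tell us what format you need", "page_number": 41},
])

# Keep only sentences mentioning "format" (case-insensitive)
matches = df[df["sentence"].str.contains("format", case=False)]

# Save for later analysis; the filename is illustrative
matches.to_csv("smart_meter_sentences.csv", index=False)
print(len(matches))  # 1
```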