Pages of Difficulty and Frustration: PDFs, and how to use Python to turn their content into Pretty Data Frames

PDF files are all over the internet and often contain a lot of important information. However, extracting that information can sometimes seem an impossible task.

In this blog we will learn how to extract the text from a PDF (in this example, the UK government's Smart Meter Policy Framework post-2020 consultation document) and print out clean sentences in a data frame.

First we need to install and import the following libraries and packages. Note that pdfminer.six (installed with pip install pdfminer.six) is the version of pdfminer used throughout this tutorial.

In [1]:

import pandas as pd
import re

from io import StringIO
from typing import Iterable, Tuple, Optional
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

In [2]:

file = "C:\\Users\\HollyBays\\smart-meter-policy-framework-post-2020-consultation.pdf"

I have saved the file to my local drive, so this path will look different depending on where you decide to save the PDF.
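
If you would rather not hard-code a Windows-style path, a pathlib version works just as well (the folder below is only an example; point it at wherever you saved the PDF):

from pathlib import Path

# Example location only: adjust to wherever you saved the consultation PDF
file = Path.home() / "Downloads" / "smart-meter-policy-framework-post-2020-consultation.pdf"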

The following function takes the filename as an argument and iteratively yields each page's id and extracted text back to the function caller.

Sometimes PDFs are in a protected format and can cause problems when you are trying to scrape the text – setting the argument check_extractable to False enables you to overcome this.

In [3]:

def convert_pdf2txt(filename) -> Iterable[Tuple[int, str]]:
    # Shared resource manager for fonts and other objects reused across pages
    rsrcmgr = PDFResourceManager()

    with open(filename, 'rb') as fp:
        # check_extractable=False lets us read pages even when the PDF is flagged as protected
        pages = PDFPage.get_pages(fp, check_extractable=False)

        for page in pages:
            with StringIO() as s:
                # TextConverter writes the rendered text of the page into the StringIO buffer
                with TextConverter(rsrcmgr, s) as device:
                    PDFPageInterpreter(rsrcmgr, device).process_page(page)

                text = s.getvalue()

            # Yield the page's id alongside its extracted text
            yield page.pageid, text
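
As a quick sanity check, you can loop over the generator and look at the first page before building anything on top of it (purely illustrative):

# Illustrative check: print the first page's id and the first 200 characters of its text
for page_id, page_text in convert_pdf2txt(file):
    print(page_id)
    print(page_text[:200])
    break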

Now, we can write a function which uses our convert_pdf2txt function and outputs a data frame of cleaned sentences.

Using the enumerate function gives us a running index alongside each (page_id, page_text) tuple, which we use to record the page numbers of our PDF.
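
If enumerate is new to you, here is a tiny illustration of the pattern with made-up data:

# Made-up data: enumerate pairs a counter (starting at 0) with each item it is given
for page_num, (page_id, page_text) in enumerate([(12, 'first page text'), (27, 'second page text')]):
    print(page_num + 1, page_id, page_text)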

To find the sentences we do a regex split with re.split(). This splits the string at every occurrence of the regex pattern. In our case, we split every time there is a full stop, unless that full stop is immediately followed by a digit (which keeps decimal numbers intact). We also split on two consecutive newline characters (a blank line) and on bullet points.
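
Here is a small illustration of the split on some made-up text (not taken from the PDF):

# Made-up text: "3.5" is left intact, while the full stops, the blank line and the bullet all trigger splits
sample = "Savings of 3.5 hours were reported.\n\n• First point. Second point."
print(re.split("[.](?![0-9])|\\n{2}|•", sample))

Some of the resulting pieces are empty or padded with white space, which is exactly what the next cleaning step deals with.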

Note: if you are unfamiliar with regex, you can practice writing regular expressions at https://regex101.com/.

Once we have split the text into sentences, we use re.sub(), which replaces every match of a pattern with a given string. To make the sentences cleaner, we replace any run of white space with a single space. We then use .strip() to remove the spaces at the beginning and end of each sentence.
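
Again on a made-up string, the cleaning step looks like this:

# Made-up string: collapse every run of white space to a single space, then trim the ends
messy = "  It will help us if you \n  say what assistive technology you use   "
print(re.sub('\\s+', ' ', messy).strip())
# prints: It will help us if you say what assistive technology you use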

We append a sentence only if it contains more than one character, and if any sentences were found we put them into a data frame.

In [4]:

def clean_sentences(pdf) -> Optional[pd.DataFrame]:
    sentences = []

    for page_num, (page_id, page_text) in enumerate(convert_pdf2txt(pdf)):
        # Split on full stops (unless followed by a digit), blank lines and bullet points
        for sentence in re.split("[.](?![0-9])|\\n{2}|•", page_text):
            # Collapse runs of white space and trim the ends
            sentence_clean = re.sub('\\s+', ' ', sentence).strip()
            if len(sentence_clean) > 1:
                sentences.append({'sentence': sentence_clean, 'page_number': page_num + 1})

    if len(sentences) > 0:
        df = pd.DataFrame(sentences)
        return df

In [5]:

clean_sentences(file)

Out[5]:

     sentence                                           page_number
0    September 2019 DELIVERING A SMART SYSTEM Consu…    1
1    © Crown copyright 2019 This publication is lic…    2
2    To view this licence, visit nationalarchives       2
3    gov                                                 2
4    uk/doc/open-government-licence/version/3 or wr…    2
...                                                     ...
738  uk/government/consultations/smart-meter-policy…    41
739  gov                                                 41
740  uk                                                  41
741  Please tell us what format you need                 41
742  It will help us if you say what assistive tech…    41

743 rows × 2 columns

Using the above code you will find it much easier to extract important information from PDFs, store it neatly, and analyse it with Python.
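
For example, once the sentences are in a data frame, a keyword search across the whole document only takes a couple of lines (the keyword below is purely illustrative):

df = clean_sentences(file)

# Illustrative only: find every sentence mentioning "smart meter" and see which pages they appear on
if df is not None:
    matches = df[df['sentence'].str.contains('smart meter', case=False)]
    print(matches[['page_number', 'sentence']].head())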

By Holly Jones
