PDF files are all over the internet and often contain a lot of important information. However, extracting that information can sometimes seem an impossible task.
In this blog post we will learn how to extract the text from the PDF given here and collect clean sentences into a data frame.
First we need to import the following libraries. Note that pdfminer.six is the actively maintained fork of pdfminer used throughout this tutorial; if you do not already have it, it can be installed with pip (pip install pdfminer.six).
In [1]:
import pandas as pd
import re
from io import StringIO
from typing import Iterable, Tuple, Optional

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
In [2]:
file = "C:\\Users\\HollyBays\\smart-meter-policy-framework-post-2020-consultation.pdf"
I have saved the file to my local drive, so this path will look different depending on where you decide to save the PDF.
The following function takes the filename as an argument and yields the page ID and raw text of each page back to the caller. Some PDFs are in a protected format, which can cause problems when you try to scrape the text – setting the argument check_extractable to False lets you work around this.
In [3]:
def convert_pdf2txt(filename) -> Iterable[Tuple[int, str]]:
    rsrcmgr = PDFResourceManager()
    with open(filename, 'rb') as fp:
        pages = PDFPage.get_pages(fp, check_extractable=False)
        for page in pages:
            with StringIO() as s:
                with TextConverter(rsrcmgr, s) as device:
                    PDFPageInterpreter(rsrcmgr, device).process_page(page)
                # Read the page's text before the buffer is closed
                text = s.getvalue()
            yield page.pageid, text
Now we can write a function which uses our convert_pdf2txt function and outputs a data frame of cleaned sentences. Wrapping the generator in enumerate attaches an index to each (page_id, page_text) tuple, which gives us the page numbers of our PDF.
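The reason for using enumerate is that the page_id pdfminer reports is an internal object identifier rather than a sequential page number. A minimal sketch, using a hypothetical stand-in for the (page_id, page_text) pairs our generator yields:

```python
# Hypothetical stand-in for the pairs yielded by convert_pdf2txt;
# note the page IDs are not sequential
fake_pages = [(4, "first page text"), (9, "second page text")]

page_numbers = []
for page_num, (page_id, page_text) in enumerate(fake_pages):
    # enumerate counts from 0, so add 1 for the human-readable page number
    page_numbers.append(page_num + 1)

print(page_numbers)  # [1, 2]
```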
To find the sentences we use re.split(), which splits the string at each occurrence of a regex pattern. In our case, we split the string every time there is a full stop, unless that full stop appears within a decimal number. We also split whenever there is a double line break or a bullet point.
Note: if you are unfamiliar with regex, you can practice writing regular expressions at https://regex101.com/.
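To see the pattern in action, here is a small sketch on a hypothetical snippet of page text containing a decimal number, a double line break and bullet points:

```python
import re

# Hypothetical snippet of page text
text = "Costs fell to 13.5 pence. See below:\n\n• item one • item two"

# Split on a full stop not followed by a digit, on a double line break,
# or on a bullet point
parts = re.split(r"[.](?![0-9])|\n{2}|•", text)
print(parts)
```

Note that the full stop inside 13.5 does not trigger a split, thanks to the negative lookahead (?![0-9]).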
Once we have split the text into sentences, we use re.sub(), which replaces every match of a pattern with a given string. To make the sentences cleaner, we replace any run of whitespace with a single space, and then use .strip() to remove the spaces at the beginning and end of each sentence.
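For example, on a hypothetical sentence fragment with the kind of whitespace pdfminer often leaves behind:

```python
import re

# Hypothetical fragment with leading spaces, a line break and tabs
raw = "  Delivering a\nsmart   system\t "

# Collapse every run of whitespace to a single space, then trim the ends
clean = re.sub(r"\s+", " ", raw).strip()
print(clean)  # Delivering a smart system
```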
We keep a sentence only if it contains more than one character and, if any sentences are found, we put them into a data frame.
In [4]:
def clean_sentences(pdf) -> Optional[pd.DataFrame]:
    sentences = []
    for page_num, (page_id, page_text) in enumerate(convert_pdf2txt(pdf)):
        # Split on a full stop (unless it sits inside a decimal number),
        # a double line break, or a bullet point
        for sentence in re.split(r"[.](?![0-9])|\n{2}|•", page_text):
            # Collapse runs of whitespace and trim the ends
            sentence_clean = re.sub(r"\s+", " ", sentence).strip()
            if len(sentence_clean) > 1:
                sentences.append({'sentence': sentence_clean,
                                  'page_number': page_num + 1})
    if len(sentences) > 0:
        return pd.DataFrame(sentences)
    return None
In [5]:
clean_sentences(file)
Out[5]:
| | sentence | page_number |
|---|---|---|
| 0 | September 2019 DELIVERING A SMART SYSTEM Consu… | 1 |
| 1 | © Crown copyright 2019 This publication is lic… | 2 |
| 2 | To view this licence, visit nationalarchives | 2 |
| 3 | gov | 2 |
| 4 | uk/doc/open-government-licence/version/3 or wr… | 2 |
| … | … | … |
| 738 | uk/government/consultations/smart-meter-policy… | 41 |
| 739 | gov | 41 |
| 740 | uk | 41 |
| 741 | Please tell us what format you need | 41 |
| 742 | It will help us if you say what assistive tech… | 41 |
743 rows × 2 columns
With the code above you will find it much easier to extract important information from PDFs, store it neatly, and analyse it with Python.
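As a quick illustration of that last point, here is a minimal sketch of filtering the resulting data frame and saving it for later analysis. The two-row data frame and the output filename are hypothetical stand-ins for the real output of clean_sentences:

```python
import pandas as pd

# Hypothetical miniature of the data frame clean_sentences returns
df = pd.DataFrame([
    {"sentence": "September 2019 DELIVERING A SMART SYSTEM", "page_number": 1},
    {"sentence": "Please tell us what format you need", "page_number": 41},
])

# Keep only sentences mentioning "format" (case-insensitive)
matches = df[df["sentence"].str.contains("format", case=False)]

# Save for later analysis; the filename is illustrative
matches.to_csv("smart_meter_sentences.csv", index=False)
print(len(matches))  # 1
```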