I want to see how many times proper nouns are mentioned in the text I’m using (the first part of Pride and Prejudice) and whether any of them are Mr. Darcy.
There are a few ways to identify parts of speech in text, but I’m going to be using TextBlob, a Python library. I’ll show you how to filter a .csv file to help identify the parts of speech in specific text and count how many times they appear.
(If you’re going to follow along and haven’t used TextBlob before then read this for installation details: https://textblob.readthedocs.io/en/dev/install.html )
I’ve pre-loaded a .csv file with the first couple of pages of Pride and Prejudice. The text is in a column called ‘text’. In this file I have another column called ‘chapter’ that corresponds to the chapter the text is from. This column is labelled 1, 2, 3, 4…
If, for instance, I decide that I only want to look at chapter 1 then I’ll need to add a filter to my code.
The TextBlob library can be used for several text-processing tasks, such as part-of-speech tagging (what we’re doing in this blog), sentiment analysis, translation, phrase extraction, spelling correction and more. To have a look at the features and origins of TextBlob visit: https://textblob.readthedocs.io/en/dev/
Firstly, we need to import TextBlob and we’ll also use Pandas to import the .csv file.
Then we can import the csv file.
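For readers without the original file, here is a minimal sketch of the loading step. It assumes a two-column layout matching the description above (`chapter` and `text`); the in-memory CSV and its two sample rows are stand-ins for my file, not its actual contents.

```python
import io

import pandas as pd

# Stand-in for the real .csv file: one row per passage, holding the
# chapter number and the passage text. The rows are illustrative only.
csv_data = io.StringIO(
    "chapter,text\n"
    '1,"It is a truth universally acknowledged, that a single man in '
    'possession of a good fortune, must be in want of a wife."\n'
    '2,"Mr. Bennet was among the earliest of those who waited on Mr. Bingley."\n'
)

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(csv_data)

print(df.columns.tolist())
print(len(df))
```

With a real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` object.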
The next two lines of code are completely optional. This is where we are going to filter our text column by other columns. I want to only use the text where the chapter column equals 1.
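As a rough sketch of that filtering step, the example below builds a boolean mask and uses it to keep only the chapter 1 rows. The small DataFrame and the names `is_chapter_one` and `chapter_one` are made up for illustration.

```python
import pandas as pd

# Illustrative rows; the real file holds the novel's text.
df = pd.DataFrame(
    {
        "chapter": [1, 1, 2],
        "text": ["first passage", "second passage", "third passage"],
    }
)

# Build a boolean mask, then keep only the rows where it is True.
is_chapter_one = df["chapter"] == 1
chapter_one = df[is_chapter_one]

print(chapter_one["text"].tolist())
```

Boolean indexing drops the non-matching rows outright, whereas `DataFrame.where` keeps them as NaN-filled rows, which would then end up in the text passed to the tagger.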
The next line of code is also optional. If, when you are analysing your text, you find that your text is being truncated because you are exceeding the normal Pandas column width, you can remove the max column width like so:
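The setting is the same one that appears, commented out, in the full code at the end of this post. A minimal sketch, where `None` means "no limit":

```python
import pandas as pd

# None removes the cap on how many characters pandas shows per cell.
pd.options.display.max_colwidth = None

print(pd.get_option("display.max_colwidth"))
```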
(For more on truncation and other Pandas display options take a look at this blog: https://towardsdatascience.com/6-pandas-display-options-you-should-memories-84adf8887bc3 )
Now it’s time to tell TextBlob where the text is located; in this case it’s in the ‘text’ column.
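TextBlob works on a single string, so the column has to be flattened first. The sketch below shows the `to_string` approach used in this post next to an alternative that joins the raw cell values; the alternative and the sample rows are my own illustration rather than part of the original code.

```python
import pandas as pd

# Illustrative rows standing in for the 'text' column of the real file.
df = pd.DataFrame({"text": ["Mr. Bennet was so odd.", "Mr. Bingley came."]})

# Approach used in this post: render the column as one printable string.
rendered = df["text"].to_string(index=False)

# Alternative (an assumption, not the post's method): join the raw cell
# values, which does not depend on pandas display settings.
joined = " ".join(df["text"].astype(str))

print(joined)
```

Either string can then be passed to `TextBlob(...)`.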
In the last part of my code I’m going to create an empty dictionary (words) and check the part-of-speech tag of each word in blob (built from input_text). I don’t want any old parts of speech though. I’ve decided to look for NNP (singular proper nouns) and NNPS (plural proper nouns) in my hunt for Mr. Darcy and others.
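The counting logic can be tried without TextBlob installed. The sketch below uses `collections.Counter` in place of the plain dictionary, and a hand-written list of (word, tag) pairs in the same shape as `blob.tags`; both the list and the name `PROPER_NOUN_TAGS` are assumptions made for illustration.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the same shape as blob.tags;
# in real use these would come from TextBlob rather than being typed in.
tagged = [
    ("Bennet", "NNP"),
    ("replied", "VBD"),
    ("that", "IN"),
    ("Bingley", "NNP"),
    ("was", "VBD"),
    ("young", "JJ"),
    ("Bennet", "NNP"),
]

PROPER_NOUN_TAGS = {"NNP", "NNPS"}

# Count only the words whose tag marks them as proper nouns.
words = Counter(word for word, tag in tagged if tag in PROPER_NOUN_TAGS)

print(words.most_common())
```

A `Counter` returns 0 for words it has never seen, so `words["Darcy"]` is a safe way to check for a missing name.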
If you want to see a list of the possible tags then check out this GitHub link. I can’t see a dedicated list for TextBlob, whose tags (such as NNP) follow the Penn Treebank convention, but this spaCy glossary is pretty comprehensive and covers those tags alongside the universal ones: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py
This is my print:
It is a truth universally acknowledged that I found four ‘Bennet’ and two ‘Bingley’ but, unfortunately, no Mr. Darcy. In this example I didn’t use much text from the chapter, so I’m not surprised that he isn’t mentioned. Maybe I should try the whole book next time?
If you want to read Pride and Prejudice (I’d thoroughly recommend it) check out: https://www.gutenberg.org/files/1342/1342-h/1342-h.htm
If you want to have a go yourself then the full code is below. Why not try adding some of your own stop words to filter out words you aren’t interested in?
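One possible way to do the stop-word filtering is sketched here. The counts are in the shape produced by the tagging loop, and the extra entries and stop-word choices are made up for illustration.

```python
# Example counts in the shape produced by the tagging loop; the extra
# entries and the stop-word choices are made up for illustration.
words = {"Bennet": 4, "Bingley": 2, "Michaelmas": 1, "Monday": 1}

# Words to ignore, compared case-insensitively.
stop_words = {"michaelmas", "monday"}

filtered = {
    word: count
    for word, count in words.items()
    if word.lower() not in stop_words
}

print(filtered)
```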
Happy tagging!
By Hannah Johnstone
```python
import pandas as pd
from textblob import TextBlob

df = pd.read_csv(r'C:\Users\name\file_name.csv')

# Optional: keep only the rows where a column matches a value.
# Boolean indexing drops the other rows instead of leaving them as NaN.
#filter1 = df["column_name"] == "value"
#df = df[filter1]

# Optional: stop pandas truncating long text.
#pd.options.display.max_colwidth = None

input_text = df['column_name'].to_string(index=False)
blob = TextBlob(input_text)

# Count how often each proper noun (NNP or NNPS) appears.
words = {}
for word, pos in blob.tags:
    #print(word, pos)
    if pos in ('NNP', 'NNPS'):
        words[word] = words.get(word, 0) + 1

print(words)
```