Machine Learning for Sentence Classification

Recently I’ve been studying NLP more than other data science fields, and one challenge I face more often than not is the cleaning part of the process. Building NLP models requires many pre-processing steps, and if the data is not properly treated, we can end up with poor models, which is exactly what we want to avoid.
In this article, we’re going to focus on PDF documents. The goal here is to open a PDF file, convert it to plain text, understand the need for Data Cleaning and build a machine learning model for that purpose.
In this post we will:
- Open a PDF file and convert it into a text string
- Split that text into sentences and build a data set
- Manually label that data with user interaction
- Make a classifier to remove unwanted sentences
Some libraries we’re going to use:
- pdfminer → read PDF files
- textblob → text processing
- pandas → data analysis
PDF Reader
As always, I’ll try to explain the code as we go, so feel free to skip the snippets if you’d like. Let’s start by importing some modules:
from collections import Counter
from IPython.display import clear_output
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from textblob import TextBlob
import io
import math
import numpy as np
import pandas as pd
import string
We are going to use pdfminer to build our PDF reader:
def read_pdf(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    text = " ".join(text.replace(u"\xa0", " ").strip().split())
    fp.close()
    device.close()
    retstr.close()
    return text
Although this function seems long, it’s just reading a PDF file and returning its text as a string. We’ll apply it to a paper called "A Hands-on Guide to Google Data":
By just looking at the first page, we quickly see that an article contains much more than simple sentences, including elements like dates, line counts, page numbers, titles and subtitles, section separators, equations, and so on. Let’s check how those elements come out when the paper is converted to plain text (primer.pdf is the name of the file, stored locally on my computer):
read_pdf('primer.pdf')

It’s clear here that we lost all of the text structure. Line counts and page numbers are scattered as if they were part of the sentences, while titles and references can’t be clearly distinguished from the body text. There are probably many ways to preserve the text structure while reading a PDF, but let’s keep it messy for the sake of explanation (this is very often what raw text data looks like).
Text Cleaning
A full cleaning pipeline has many steps, and to become familiar with them I suggest following some tutorials ([this one](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/) is a great starting point). Broadly, the cleaning chain includes the steps listed below (a minimal sketch of a few of them follows the list):
- Tokenization
- Normalization
- Entity extraction
- Spelling and grammar correction
- Removing punctuation
- Removing special characters
- Word Stemming
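To make those stages concrete, here is a minimal sketch of the first few of them (tokenization, lowercasing and punctuation removal) using textblob and the standard library. It is illustrative only and not part of the classifier we build below:
from textblob import TextBlob
import string

def basic_clean(text):
    # Tokenization: split the raw text into word tokens
    words = TextBlob(text).words
    # Normalization: lowercase every token
    words = [w.lower() for w in words]
    # Drop any punctuation-only tokens that survived tokenization
    return [w for w in words if w not in string.punctuation]

basic_clean('Building NLP models requires many pre-processing steps!')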
Our goal here isn’t to replace any of those stages, but rather to build a more general tool that removes whatever is unwanted for us. Take it as a complementary step that helps along the way.
Let’s suppose we want to get rid of any sentence that does not look human-written. The idea is to classify those sentences as "unwanted" or "weird" and consider the remaining sentences "normal". For example:
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 related.
Or
51 52 53 54 55 # read data from correlate and make it a zoo time series dat <- read.csv("Data/econ-HSN1FNSA.csv") y <- zoo(dat[,2],as.
Those sentences are clearly messed up because of the text conversion, and if we’re building, say, a PDF summarizer, they shouldn’t be included.
To remove them, we could manually analyze the text, figure out some patterns and apply regular expressions. But in some cases it might be better to build a model that finds those patterns for us, and that’s what we’re doing here: a classifier that recognizes weird sentences, so that we can easily remove them from the text body.
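For comparison, a purely rule-based pass could look like the sketch below: a regular expression that strips long runs of standalone numbers (the leftover line counters and page numbers). The pattern is an assumption about what the noise looks like, and it gets brittle as soon as new kinds of noise appear, which is exactly where a classifier helps:
import re

def strip_number_runs(text, min_run=3):
    # Remove sequences of min_run or more standalone integers,
    # the typical residue of line counters and page numbers.
    pattern = r'(?:\b\d{1,4}\b[ ]*){%d,}' % min_run
    return re.sub(pattern, ' ', text)

strip_number_runs('32 33 34 35 36 37 related.')  # the run of counters is removed, 'related.' stays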
Building The Data Set
Let’s build a function to open the PDF file, split the text into sentences and save them into a data frame with columns label and sentence:
def pdf_to_df(path):
    content = read_pdf(path)
    blob = TextBlob(content)
    sentences = blob.sentences
    df = pd.DataFrame({'sentence': sentences, 'label': np.nan})
    df['sentence'] = df.sentence.apply(''.join)
    return df
df = pdf_to_df('primer.pdf')
df.head()

Since we don’t have the data labeled (as "weird" or "normal"), we’re going to do it manually to fill the label column. This dataset will be updatable, so we can append new documents to it and label their sentences.
Let’s first save the unlabelled dataset into a .pickle file:
df.to_pickle('weird_sentences.pickle')
Now we’ll create a user-interaction function to manually classify the data points. For each sentence in the dataset, we’ll display a text box for the user to type ‘1’ or nothing. If the user types ‘1’, the sentence is classified as "weird".
I’m using a Jupyter Notebook so I’ve called the clear_output() function from IPython.display to improve the interaction.
def manually_label(pickle_file):
    print('Is this sentence weird? Type 1 if yes.\n')
    df = pd.read_pickle(pickle_file)
    for index, row in df.iterrows():
        if pd.isnull(row.label):
            print(row.sentence)
            label = input()
            if label == '1':
                df.loc[index, 'label'] = 1
            if label == '':
                df.loc[index, 'label'] = 0
            clear_output()
            # save after every answer so progress is never lost
            df.to_pickle(pickle_file)
    print('No more labels to classify!')
manually_label('weird_sentences.pickle')
This is what the output looks like for each sentence:

Since this sentence looks pretty normal, I won’t type ‘1’; I’ll simply press Enter and move on to the next one. The process repeats until the dataset is fully labeled or until you interrupt it. Each input is saved to the pickle file, so the dataset is updated at every sentence. This simple interaction made it relatively fast to label the data: it took me about 20 minutes to label around 500 data points.
Two other functions were written to keep things simple: one to append another PDF file to our dataset, and another to reset all the labels (setting the label column back to np.nan).
def append_pdf(pdf_path, df_pickle):
    new_data = pdf_to_df(pdf_path)
    df = pd.read_pickle(df_pickle)
    df = df.append(new_data)
    df = df.reset_index(drop=True)
    df.to_pickle(df_pickle)

def reset_labels(df_pickle):
    df = pd.read_pickle(df_pickle)
    df['label'] = np.nan
    df.to_pickle(df_pickle)
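As a usage sketch (the file name below is just a placeholder for any other PDF you want to add), appending a second paper and resuming the labelling loop would look like this:
# 'another_paper.pdf' is a hypothetical path to a second document
append_pdf('another_paper.pdf', 'weird_sentences.pickle')
manually_label('weird_sentences.pickle')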
Since we ended up with more "normal" than "weird" sentences, I built a function to undersample the dataset; otherwise some machine learning algorithms wouldn’t perform well:
def undersample(df, target_col, r=1):
    falses = df[target_col].value_counts()[0]
    trues = df[target_col].value_counts()[1]
    relation = float(trues) / float(falses)
    if trues >= r * falses:
        df_drop = df[df[target_col] == True]
        drop_size = int(math.fabs(int((relation - r) * falses)))
    else:
        df_drop = df[df[target_col] == False]
        drop_size = int(math.fabs(int((r - relation) * falses)))
    df_drop = df_drop.sample(drop_size)
    df = df.drop(labels=df_drop.index, axis=0)
    return df
df = pd.read_pickle('weird_sentences.pickle').dropna()
df = undersample(df, 'label')
df.label.value_counts()

645 labeled data points. Not enough to make a decent model, but we’ll use it as a playground example.
Text Transformation
Now we need to transform the sentences into a form the algorithm can understand. One way of doing that is to count the occurrences of each character in the sentence, which is essentially a bag-of-words technique at the character level.
def bag_of_chars(df, text_col):
    # count the occurrences of each character in every sentence
    df['char_list'] = df[text_col].apply(list)
    df['char_counts'] = df.char_list.apply(Counter)
    for index, row in df.iterrows():
        for c in row.char_counts:
            df.loc[index, c] = row.char_counts[c]
    df = df.fillna(0).drop([text_col, 'char_list', 'char_counts'], 1)
    return df
data = bag_of_chars(df, 'sentence')
data.head()

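As a side note, a more compact way to build the same kind of character-count representation would be scikit-learn’s CountVectorizer with a character analyzer. A minimal sketch, not used in the rest of the post:
from sklearn.feature_extraction.text import CountVectorizer

# analyzer='char' counts single characters instead of words
vectorizer = CountVectorizer(analyzer='char')
X_chars = vectorizer.fit_transform(df['sentence'])
X_chars.shape  # (number of sentences, number of distinct characters seen)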
Machine Learning Model
Perfect! Now we’re left with a standard machine learning problem: many features and one target in a classification task. Let’s split the data into train and test sets:
data = data.sample(len(data)).reset_index(drop=True)
train_data = data.iloc[:400]
test_data = data.iloc[400:]
x_train = train_data.drop('label', 1)
y_train = train_data['label']
x_test = test_data.drop('label', 1)
y_test = test_data['label']
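Since the dataset was undersampled to a roughly balanced set, this plain positional split is fine here. A slightly more robust alternative, sketched below, is scikit-learn’s train_test_split with stratification, so both splits keep the same label proportions:
from sklearn.model_selection import train_test_split

X = data.drop('label', axis=1)
y = data['label']

# stratify=y keeps the weird/normal ratio equal in both splits
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)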
We’re ready to choose an algorithm and check its performance. Here I’m using a Logistic Regression just to see what we can achieve:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr = LogisticRegression()
lr.fit(x_train, y_train)
accuracy_score(y_test, lr.predict(x_test))

86% accuracy. That’s pretty good for a tiny dataset, a shallow model and a bag-of-chars approach. The only problem is that, although we split the data into training and test sets, we’re evaluating the model on the same document we trained on. A more appropriate approach would be to use a new document as the test set.
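A rough sketch of that idea, assuming a second paper has already been labeled with the same manual loop and saved to a separate pickle file (the file name is hypothetical):
# Hypothetical held-out document, labeled with the same manually_label() loop
holdout = pd.read_pickle('holdout_sentences.pickle').dropna()
holdout = bag_of_chars(holdout, 'sentence')

# Align with the training features: keep known characters, fill unseen ones with 0
x_holdout = holdout.drop('label', axis=1).reindex(columns=x_train.columns, fill_value=0)
y_holdout = holdout['label']
accuracy_score(y_holdout, lr.predict(x_holdout))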
Let’s make a function that enables us to predict any custom sentence:
def predict_sentence(sentence):
    sample_test = pd.DataFrame({'label': np.nan, 'sentence': sentence}, [0])
    for col in x_train.columns:
        sample_test[str(col)] = 0
    sample_test = bag_of_chars(sample_test, 'sentence')
    sample_test = sample_test.drop('label', 1)
    # align columns with the training features; characters unseen in training are dropped
    sample_test = sample_test.reindex(columns=x_train.columns, fill_value=0)
    pred = lr.predict(sample_test)[0]
    if pred == 1:
        return 'WEIRD'
    else:
        return 'NORMAL'
Normal Sentence:
We just built a cool Machine Learning model
normal_sentence = 'We just built a cool machine learning model'
predict_sentence(normal_sentence)

Weird Sentence:
jdaij oadao //// fiajoaa32 32 5555
weird_sentence = 'jdaij oadao //// fiajoaa32 32 5555'
predict_sentence(weird_sentence)

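Coming back to the original cleaning goal, the classifier can now be applied to every sentence of a document so that only the "normal" ones are kept. A minimal sketch reusing the functions above:
def clean_document(path):
    # Split the PDF into sentences and keep only the ones classified as normal
    sentences = pdf_to_df(path)['sentence']
    kept = [s for s in sentences if predict_sentence(s) == 'NORMAL']
    return ' '.join(kept)

# e.g. clean_document('primer.pdf')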
And our model scores! Unfortunately, when I tried more sentences, it misclassified some of them. The bag-of-words (in this case, bag-of-chars) method probably isn’t the best option, the algorithm itself could be greatly improved, and we would need to label many more data points for the model to become reliable. The point is that you could use this same approach for a lot of different tasks, e.g. recognizing specific elements such as links, dates, names, topics, titles, equations or references. Used the right way, text classification can be a powerful tool in the cleaning process and shouldn’t be overlooked. Happy cleaning!
Thank you for reading all the way to the end. This article focused on using text classification to handle cleaning problems. Please follow my profile for more on Data Science, and feel free to leave me any comments or concerns. See you in the next post!