Machine Learning for Text Classification Using SpaCy in Python

Susan Li
Towards Data Science
5 min readApr 14, 2018

--

Photo Credit: Pixabay

spaCy is a popular and easy-to-use natural language processing library in Python. It provides current state-of-the-art accuracy and speed levels, and has an active open source community. However, since SpaCy is a relative new NLP library, and it’s not as widely adopted as NLTK. There is not yet sufficient tutorials available.

In this post, we will demonstrate how text classification can be implemented using spaCy without having any deep learning experience.

The Data

It s often time consuming and frustrating experience for a young researcher to find and select a suitable academic conference to submit his (or her) academic papers. We define “suitable conference”, meaning the conference is aligned with the researcher’s work and have a good academic ranking.

Using the conference proceeding data set, we are going to categorize research papers by conferences. Let’s get started. The data set can be found here.

Explore

Take a quick peek:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import base64
import string
import re
from collections import Counter
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
df = pd.read_csv('research_paper.csv')
df.head()
Figure 1

There is no missing values.

df.isnull().sum()

Title 0
Conference 0
dtype: int64

Split the data to train and test sets:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.33, random_state=42)
print('Research title sample:', train['Title'].iloc[0])
print('Conference of this paper:', train['Conference'].iloc[0])
print('Training Data Shape:', train.shape)
print('Testing Data Shape:', test.shape)

Research title sample: Cooperating with Smartness: Using Heterogeneous Smart Antennas in Ad-Hoc Networks.
Conference of this paper: INFOCOM
Training Data Shape: (1679, 2)
Testing Data Shape: (828, 2)

The dataset consists of 2507 short research paper titles, which have been classified into 5 categories (by conferences). The following figure summarizes the distribution of research papers by different conferences.

fig = plt.figure(figsize=(8,4))
sns.barplot(x = train['Conference'].unique(), y=train['Conference'].value_counts())
plt.show()
Figure 2

The following is one way to do text preprocessing in SpaCy. After that, we are trying to find out the top words used in the papers that submit to the first and second categories (conferences) — INFOCOM & ISCAS

import spacynlp = spacy.load('en_core_web_sm')
punctuations = string.punctuation
def cleanup_text(docs, logging=False):
texts = []
counter = 1
for doc in docs:
if counter % 1000 == 0 and logging:
print("Processed %d out of %d documents." % (counter, len(docs)))
counter += 1
doc = nlp(doc, disable=['parser', 'ner'])
tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
tokens = ' '.join(tokens)
texts.append(tokens)
return pd.Series(texts)
INFO_text = [text for text in train[train['Conference'] == 'INFOCOM']['Title']]IS_text = [text for text in train[train['Conference'] == 'ISCAS']['Title']]INFO_clean = cleanup_text(INFO_text)
INFO_clean = ' '.join(INFO_clean).split()
IS_clean = cleanup_text(IS_text)
IS_clean = ' '.join(IS_clean).split()
INFO_counts = Counter(INFO_clean)
IS_counts = Counter(IS_clean)
INFO_common_words = [word[0] for word in INFO_counts.most_common(20)]
INFO_common_counts = [word[1] for word in INFO_counts.most_common(20)]
fig = plt.figure(figsize=(18,6))
sns.barplot(x=INFO_common_words, y=INFO_common_counts)
plt.title('Most Common Words used in the research papers for conference INFOCOM')
plt.show()
Figure 3
IS_common_words = [word[0] for word in IS_counts.most_common(20)]
IS_common_counts = [word[1] for word in IS_counts.most_common(20)]
fig = plt.figure(figsize=(18,6))
sns.barplot(x=IS_common_words, y=IS_common_counts)
plt.title('Most Common Words used in the research papers for conference ISCAS')
plt.show()
Figure 4

The top words in INFOCOM are “networks” and “network”. It is obvious that INFOCOM is a conference in the field of networking and closely related areas.

The top words in ISCAS are “base” and “design”. It indicates that ISCAS is a conference about database, system design and related topics.

Machine Learning with spaCy

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
import string
import re
import spacy
spacy.load('en')
from spacy.lang.en import English
parser = English()

Below is another way to clean text using spaCy:

STOPLIST = set(stopwords.words('english') + list(ENGLISH_STOP_WORDS))
SYMBOLS = " ".join(string.punctuation).split(" ") + ["-", "...", "”", "”"]
class CleanTextTransformer(TransformerMixin): def transform(self, X, **transform_params):
return [cleanText(text) for text in X]
def fit(self, X, y=None, **fit_params):
return self
def get_params(self, deep=True):
return {}

def cleanText(text):
text = text.strip().replace("\n", " ").replace("\r", " ")
text = text.lower()
return text
def tokenizeText(sample):
tokens = parser(sample)
lemmas = []
for tok in tokens:
lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
tokens = lemmas
tokens = [tok for tok in tokens if tok not in STOPLIST]
tokens = [tok for tok in tokens if tok not in SYMBOLS]
return tokens

Define a function to print out the most important features, the features that have the highest coefficients:

def printNMostInformative(vectorizer, clf, N):
feature_names = vectorizer.get_feature_names()
coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
topClass1 = coefs_with_fns[:N]
topClass2 = coefs_with_fns[:-(N + 1):-1]
print("Class 1 best: ")
for feat in topClass1:
print(feat)
print("Class 2 best: ")
for feat in topClass2:
print(feat)
vectorizer = CountVectorizer(tokenizer=tokenizeText, ngram_range=(1,1))
clf = LinearSVC()

pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer), ('clf', clf)])
# data
train1 = train['Title'].tolist()
labelsTrain1 = train['Conference'].tolist()
test1 = test['Title'].tolist()
labelsTest1 = test['Conference'].tolist()
# train
pipe.fit(train1, labelsTrain1)
# test
preds = pipe.predict(test1)
print("accuracy:", accuracy_score(labelsTest1, preds))
print("Top 10 features used to predict: ")

printNMostInformative(vectorizer, clf, 10)
pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer)])
transform = pipe.fit_transform(train1, labelsTrain1)
vocab = vectorizer.get_feature_names()
for i in range(len(train1)):
s = ""
indexIntoVocab = transform.indices[transform.indptr[i]:transform.indptr[i+1]]
numOccurences = transform.data[transform.indptr[i]:transform.indptr[i+1]]
for idx, num in zip(indexIntoVocab, numOccurences):
s += str((vocab[idx], num))

accuracy: 0.7463768115942029
Top 10 features used to predict:
Class 1 best:
(-0.9286024231429632, ‘database’)
(-0.8479561292796286, ‘chip’)
(-0.7675978546440636, ‘wimax’)
(-0.6933516302055982, ‘object’)
(-0.6728543084136545, ‘functional’)
(-0.6625144315722268, ‘multihop’)
(-0.6410217867606485, ‘amplifier’)
(-0.6396374843938725, ‘chaotic’)
(-0.6175855765947755, ‘receiver’)
(-0.6016682542232492, ‘web’)
Class 2 best:
(1.1835964521070819, ‘speccast’)
(1.0752051052570133, ‘manets’)
(0.9490176624004726, ‘gossip’)
(0.8468395015456092, ‘node’)
(0.8433107444740003, ‘packet’)
(0.8370516260734557, ‘schedule’)
(0.8344139814680707, ‘multicast’)
(0.8332232077559836, ‘queue’)
(0.8255429594734555, ‘qos’)
(0.8182435133796081, ‘location’)

from sklearn import metrics
print(metrics.classification_report(labelsTest1, preds,
target_names=df['Conference'].unique()))
precision recall f1-score support

VLDB 0.75 0.77 0.76 159
ISCAS 0.90 0.84 0.87 299
SIGGRAPH 0.67 0.66 0.66 106
INFOCOM 0.62 0.69 0.65 139
WWW 0.62 0.62 0.62 125

avg / total 0.75 0.75 0.75 828

Here you have it. We now have done machine learning for text classification with the help of SpaCy.

Source code can be found on Github. Have a learning weekend!

Reference: Kaggle

--

--