Deep Learning for Natural Language Processing Using word2vec-keras

A deep learning approach for NLP by combining Word2Vec with Keras LSTM

Published in Towards Data Science · 8 min read · Nov 4, 2019

Natural language processing (NLP) is a research subfield shared by fields such as linguistics, computer science, information engineering, and artificial intelligence. NLP is concerned with the interactions between computers and human natural languages, and in particular with how to use computers to process and analyze natural language data (e.g., text and voice). Some of the major challenges in NLP include speech recognition, natural language understanding, and natural language generation.

Text is one of the most widespread forms of NLP data. It can be treated as either a sequence of characters or a sequence of words, but with the advance of deep learning, the trend is to work at the level of words. A sequence of words must be converted into numbers before it can be consumed by a machine learning or deep learning model such as an LSTM. One straightforward way is to use one-hot encoding to map each word to a sparse vector whose length equals the vocabulary size. Another method (e.g., Word2vec) uses word embedding to convert a word into a compact vector of configurable length.
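The difference between the two representations can be sketched in a few lines of plain Python. The vocabulary, the embedding size, and the helper names one_hot() and embed() are assumptions made for illustration only, and the embedding values are random placeholders rather than vectors learned by Word2vec:

```python
import random

# Toy vocabulary; the words and the embedding size are illustrative only.
vocabulary = ["free", "call", "now", "see", "you", "later"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a sparse vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

# Stand-in for a trained embedding table: one short dense vector per word.
# Word2vec learns these values from a corpus; here they are random placeholders.
embedding_size = 3
rng = random.Random(0)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
                   for _ in vocabulary]

def embed(word):
    """Look up the dense vector of a word by its vocabulary index."""
    return embedding_table[word_to_index[word]]

sentence = ["call", "now"]
one_hot_vectors = [one_hot(w) for w in sentence]
dense_vectors = [embed(w) for w in sentence]
print(one_hot_vectors[0])     # length equals the vocabulary size
print(len(dense_vectors[0]))  # length equals embedding_size
```

The one-hot vector grows with the vocabulary, while the length of the dense vector stays fixed at whatever embedding size is chosen.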

In NLP for traditional machine learning [1], both textual data preprocessing and feature engineering are required. Recently, a new deep learning model, the Word2Vec-Keras Text Classifier [2], was released for text classification without feature engineering. It combines the Word2Vec model of Gensim [3] (a Python library for topic modeling, document indexing and similarity retrieval with large corpora) with Keras LSTM through an embedding layer as input.

In this article, similarly to [1], I use the public Kaggle SMS Spam Collection Dataset [4] to evaluate the performance of the Word2VecKeras model in SMS spam classification without feature engineering. The following two scenarios are covered:

SMS spam classification with data preprocessing
SMS spam classification without data preprocessing

The following code is to import all the necessary Python libraries:

from word2vec_keras import Word2VecKeras
from pprint import pprint
import pandas as pd
import matplotlib.pyplot as plt
import itertools
import numpy as np
import nltk
import string
import re
import ast # abstract syntax tree: https://docs.python.org/3/library/ast.html
from sklearn.model_selection import train_test_split
import mlflow
import mlflow.sklearn
%matplotlib inline

Once the SMS dataset file spam.csv is downloaded, the following code loads the local dataset file into a Pandas DataFrame on Mac:

column_names = ['label', 'body_text', 'missing_1', 'missing_2', 'missing_3']
raw_data = pd.read_csv('./data/spam.csv', encoding = "ISO-8859-1")
raw_data.columns = column_names
raw_data.drop(['missing_1', 'missing_2', 'missing_3'], axis=1, inplace=True)
raw_data = raw_data.sample(frac=1.0)
raw_data.head()

Note that loading this dataset requires the encoding ISO-8859-1 rather than the default encoding UTF-8.
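Why the encoding matters can be seen with a single byte. The snippet below is a minimal sketch with an invented message, not a row from the dataset: it assumes a pound sign stored as the single byte 0xA3, which is how ISO-8859-1 encodes it.

```python
# An invented message whose pound sign is stored as the single byte 0xA3,
# which is how ISO-8859-1 encodes that character.
raw_bytes = b"Win \xa31000 cash"

# Decoding with the matching codec recovers the text.
decoded = raw_bytes.decode("ISO-8859-1")
print(decoded)

# Decoding the same bytes as UTF-8 fails, because 0xA3 on its own
# is not a valid UTF-8 sequence.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("UTF-8 decoding succeeded:", utf8_ok)
```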

1. Spam Classification with Data Preprocessing

In this section, a data preprocessing procedure similar to [1] is first applied to clean the SMS dataset. The resulting clean dataset is then fed into the Word2VecKeras model for model training and prediction of spam SMS. mlflow [5][6] is used to track the history of model executions.

1.1 Data Preprocessing

The preprocessing() method of the Preprocessing class preprocesses the raw SMS data as follows:

remove punctuation
tokenization
remove stopwords
apply stemming
apply lemmatizing
join tokens into sentence
drop intermediate data columns

class Preprocessing(object):
    def __init__(self, data, target_column_name='body_text_clean'):
        self.data = data
        self.feature_name = target_column_name
        
    def remove_punctuation(self, text):
        # Drop every character that appears in string.punctuation
        text_nopunct = "".join([char for char in text if char not in string.punctuation])
        return text_nopunct
    
    def tokenize(self, text):
        # Split on runs of one or more non-word characters
        tokens = re.split(r'\W+', text)
        return tokens
    
    def remove_stopwords(self, tokenized_list):
        # Remove all English Stopwords
        stopword = nltk.corpus.stopwords.words('english')
        text = [word for word in tokenized_list if word not in stopword]
        return text
    
    def stemming(self, tokenized_text):
        ps = nltk.PorterStemmer()
        text = [ps.stem(word) for word in tokenized_text]
        return text
    
    def lemmatizing(self, tokenized_text):
        wn = nltk.WordNetLemmatizer()
        text = [wn.lemmatize(word) for word in tokenized_text]
        return text
    
    def tokens_to_string(self, tokens_string):
        try:
            list_obj = ast.literal_eval(tokens_string)
            text = " ".join(list_obj)
        except Exception:  # e.g. a missing value or a string that is not a list literal
            text = None
        return text
    
    def dropna(self):
        feature_name = self.feature_name
        if self.data[feature_name].isnull().sum() > 0:
            column_list=[feature_name]
            self.data = self.data.dropna(subset=column_list)
            return self.data
        
    def preprocessing(self):
        self.data['body_text_nopunc'] = self.data['body_text'].apply(lambda x: self.remove_punctuation(x))
        self.data['body_text_tokenized'] = self.data['body_text_nopunc'].apply(lambda x: self.tokenize(x.lower())) 
        self.data['body_text_nostop'] = self.data['body_text_tokenized'].apply(lambda x: self.remove_stopwords(x))
        self.data['body_text_stemmed'] = self.data['body_text_nostop'].apply(lambda x: self.stemming(x))
        self.data['body_text_lemmatized'] = self.data['body_text_nostop'].apply(lambda x: self.lemmatizing(x))
        
        # save cleaned dataset into csv file and load back
        self.save()
        self.load()
        
        self.data[self.feature_name] = self.data['body_text_lemmatized'].apply(lambda x: self.tokens_to_string(x))
        
        self.dropna()
        
        drop_columns = ['body_text_nopunc', 'body_text_tokenized', 'body_text_nostop', 'body_text_stemmed', 'body_text_lemmatized'] 
        self.data.drop(drop_columns, axis=1, inplace=True)
        return self.data
    
    def save(self, filepath="./data/spam_cleaned.csv"):
        self.data.to_csv(filepath, index=False, sep=',')  
        
    def load(self, filepath="./data/spam_cleaned.csv"):
        self.data = pd.read_csv(filepath)
        return self.data

The resulting data is saved in a new column body_text_clean as shown below:

The above data preprocessing requires both the stopwords and the wordnet data files of the Natural Language Toolkit (NLTK). On Mac they need to be downloaded manually (available here). The nltk.download() method does not work properly.
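The tokens_to_string() method in the Preprocessing class relies on ast.literal_eval() because the save() and load() round trip turns each token list into its string representation. The following minimal sketch reproduces that effect with the standard csv module standing in for Pandas; the tokens are invented for illustration:

```python
import ast
import csv
import io

tokens = ["free", "entry", "wkly", "comp"]

# Writing a list to CSV stores its string representation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["body_text_lemmatized"])
writer.writerow([tokens])

# Reading it back yields a string, not a list.
buffer.seek(0)
rows = list(csv.reader(buffer))
cell = rows[1][0]
print(type(cell).__name__, cell)

# ast.literal_eval safely parses the string back into a Python list,
# which can then be joined into a sentence.
restored = ast.literal_eval(cell)
sentence = " ".join(restored)
print(sentence)
```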

1.2 Modeling

The prepare_data() method of the SpamClassifier class prepares the SMS data for modeling as follows:

load the dataset file spam.csv into Pandas DataFrame
use the Preprocessing class to preprocess the raw data (see the body_text_clean column)
split the clean data after data preprocessing into training and testing datasets
reformat the training and testing datasets as Python lists to be aligned with the model Word2VecKeras API [2]

Once the data is prepared for modeling, the train_model() method can be called to train the Word2VecKeras model. Then the methods evaluate() and predict() can be called to obtain model performance metrics (e.g., accuracy) and perform prediction respectively.
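To make the reported metrics concrete, the sketch below computes accuracy and a confusion matrix from lists of true and predicted labels. It is not the internal implementation of Word2VecKeras; the function name accuracy_and_confusion() and the labels are hypothetical and serve only to illustrate what such metrics measure:

```python
from collections import Counter

def accuracy_and_confusion(y_true, y_pred, labels):
    """Return accuracy and a confusion matrix indexed as [true][predicted]."""
    pair_counts = Counter(zip(y_true, y_pred))
    matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
    correct = sum(pair_counts[(label, label)] for label in labels)
    return correct / len(y_true), matrix

# Hypothetical labels and predictions, not results from this article.
y_true = ["ham", "ham", "spam", "ham", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham"]

accuracy, matrix = accuracy_and_confusion(y_true, y_pred, ["ham", "spam"])
print(accuracy)
print(matrix)
```

Each row of the matrix corresponds to a true class and each column to a predicted class, so the diagonal holds the correctly classified messages.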

The mlFlow() method combines the above method calls, the tracking of model execution results, and the logging of the trained model to a file into one workflow.

Note that the hyper-parameter w2v_min_count makes the model ignore all words whose total frequency is lower than its value, so it needs to be adjusted for the specific dataset. If it is set too high (e.g., a value of 5 for the SMS spam dataset used in this article), a vocabulary error will occur due to empty sentences.
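How a high minimum count can leave sentences empty is easy to see on a toy corpus. The sketch below only mimics the frequency filter in plain Python; it is not Gensim code, and the helper filter_rare_words() and the messages are assumptions made for illustration:

```python
from collections import Counter

def filter_rare_words(sentences, min_count):
    """Drop words whose total corpus frequency is below min_count."""
    frequency = Counter(word for sentence in sentences for word in sentence.split())
    return [[word for word in sentence.split() if frequency[word] >= min_count]
            for sentence in sentences]

# Invented toy messages, much shorter than a real corpus.
messages = ["call now", "call me later", "ok"]

kept_all = filter_rare_words(messages, min_count=1)
kept_frequent = filter_rare_words(messages, min_count=2)
print(kept_all)
print(kept_frequent)  # the last message loses all of its words
```

Short SMS messages made up of rare words are the ones most likely to end up empty, which is why a low value such as 1 is the safer choice here.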

class SpamClassifier(object):
    def __init__(self):
        self.model = Word2VecKeras()
        
    def load_data(self):
        column_names = ['label', 'body_text', 'missing_1', 'missing_2', 'missing_3']
        data = pd.read_csv('./data/spam.csv', encoding = "ISO-8859-1")
        data.columns = column_names
        data.drop(['missing_1', 'missing_2', 'missing_3'], axis=1, inplace=True)
        self.raw_data = data.sample(frac=1.0) 
        
        return self.raw_data
    
    def split_data(self):
        self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(self.x, self.y, test_size=0.25, random_state=42)
        
    def numpy_to_list(self):
        self.x_train = self.x_train.tolist()
        self.y_train = self.y_train.tolist()
        self.x_test  = self.x_test.tolist()
        self.y_test  = self.y_test.tolist()
    
    def prepare_data(self, feature, label='label'):
        self.load_data()
        pp = Preprocessing(self.raw_data)
        self.data = pp.preprocessing()
        self.x = self.data[feature].values
        self.y = self.data[label].values
        self.split_data()
        self.numpy_to_list()
        
        return self.data
        
    def train_model(self):
        self.w2v_size = 300
        self.w2v_min_count = 1 # 5
        self.w2v_epochs = 100
        self.k_epochs = 5 # 32
        self.k_lstm_neurons = 512
        self.k_max_sequence_len = 1000
        
        self.model.train(self.x_train, self.y_train, 
            w2v_size=self.w2v_size, 
            w2v_min_count=self.w2v_min_count, 
            w2v_epochs=self.w2v_epochs, 
            k_epochs=self.k_epochs, 
            k_lstm_neurons=self.k_lstm_neurons, 
            k_max_sequence_len=self.k_max_sequence_len, 
            k_hidden_layer_neurons=[])
        
    def evaluate(self):
        self.result = self.model.evaluate(self.x_test, self.y_test)
        self.accuracy = self.result["ACCURACY"]
        self.clf_report_df = pd.DataFrame(self.result["CLASSIFICATION_REPORT"])
        self.cnf_matrix = self.result["CONFUSION_MATRIX"]
        return self.result
    
    def predict(self, idx=1):
        print("LABEL:", self.y_test[idx])
        print("TEXT :", self.x_test[idx])
        print("\n============================================")
        print("PREDICTION:", self.model.predict(self.x_test[idx]))
        
    def mlFlow(self, feature='body_text_clean'):
        np.random.seed(40)  
        with mlflow.start_run():
            self.prepare_data(feature=feature) # feature should be 'body_text' when no preprocessing is needed
            self.train_model()
            self.evaluate()
            self.predict()
            mlflow.log_param("feature", feature) 
            mlflow.log_param("w2v_size", self.w2v_size)  
            mlflow.log_param("w2v_min_count", self.w2v_min_count)
            mlflow.log_param("w2v_epochs", self.w2v_epochs)
            mlflow.log_param("k_lstm_neurons", self.k_lstm_neurons)
            mlflow.log_param("k_max_sequence_len", self.k_max_sequence_len)
            mlflow.log_metric("accuracy", self.accuracy)
            mlflow.sklearn.log_model(self.model, "Word2Vec-Keras")

The following code shows how to instantiate a SpamClassifier object and call the mlFlow() method for modeling and prediction with data preprocessing:

spam_clf = SpamClassifier()
spam_clf.mlFlow(feature='body_text_clean')

1.3 Comparison

A naive baseline classification algorithm is to predict the majority (i.e., ham) of the classes (spam or ham). Any useful supervised machine learning classification model must beat it in performance. In the Kaggle SMS spam collection dataset, there are 5,572 samples in total, 747 are spam and 4,825 are ham. Thus the baseline algorithm performance in accuracy is about 86.6%.
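The baseline figure follows directly from the class counts given above, as this short calculation shows:

```python
# Class counts of the Kaggle SMS spam collection dataset, as stated in the text.
spam_count = 747
ham_count = 4825

total = spam_count + ham_count

# Always predicting the majority class is correct for every majority sample.
baseline_accuracy = max(spam_count, ham_count) / total

print(total)
print(f"{baseline_accuracy:.1%}")
```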

In [1], a similar data preprocessing procedure was applied to the same Kaggle SMS spam dataset first. Then feature engineering was performed on the preprocessed dataset to obtain modeling features such as text message length and percentage of punctuations in text. Then the scikit-learn RandomForestClassifier model was trained for prediction. The obtained accuracy is about 97.7%.
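For readers unfamiliar with this kind of feature engineering, the two features mentioned above might be computed roughly as follows. This is a sketch under assumptions: the exact definitions used in [1] are not reproduced here, and the function names and the sample message are hypothetical.

```python
import string

def message_length(text):
    """Number of non-space characters in the message (an assumed definition)."""
    return len(text) - text.count(" ")

def punctuation_percent(text):
    """Percentage of non-space characters that are punctuation."""
    length = message_length(text)
    if length == 0:
        return 0.0
    punctuation_count = sum(1 for char in text if char in string.punctuation)
    return 100.0 * punctuation_count / length

# Invented sample message.
sample = "WINNER!! Call now."
print(message_length(sample))
print(punctuation_percent(sample))
```

The deep learning approach in this article skips this step entirely and lets the embedding layer and the LSTM learn from the word sequence itself.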

In this article, after data preprocessing, the Word2VecKeras model is directly trained on the preprocessed dataset for prediction without any feature engineering. The achieved accuracy with 5 epochs is about 98.5%.

The results show that both the traditional machine learning method [1] and the new deep learning method in this article significantly outperformed the baseline algorithm in accuracy.

2. Spam Classification without Data Preprocessing

In this section, once the Kaggle SMS spam collection dataset is loaded, the raw data (see the body_text column) is directly fed into the Word2Vec-Keras model for model training and prediction of spam SMS. Neither data preprocessing nor feature engineering is used. As in the previous section, mlflow [5][6] is used to track the history of model executions. This is achieved as follows:

spam_clf = SpamClassifier()
spam_clf.mlFlow(feature='body_text')

As shown below, the obtained accuracy with 5 epochs is about 99.0%, which is comparable to the accuracy of the Word2VecKeras model with data preprocessing in the previous section.

The following snapshot of the mlflow UI shows the history of model executions:

Summary

In this article, the public Kaggle SMS Spam Collection Dataset [4] was used to evaluate the performance of the new Word2VecKeras model in SMS spam classification without feature engineering.

Two scenarios were covered. One applied the common textual data preprocessing to clean the raw dataset and then used the clean dataset to train the model for prediction. The other directly used the raw dataset without any data preprocessing for model training and prediction.

The accuracy results show that the Word2VecKeras model outperformed the traditional NLP method in [1] and performed similarly in both scenarios above. This indicates that the new Word2VecKeras model has the potential to be applied directly to raw textual data for text classification without either textual data preprocessing or feature engineering.

All of the source code in this article is available on GitHub [7].

References

[1]. B. Shetty, Natural Language Processing(NLP) for Machine Learning

[2]. Word2Vec-Keras Text Classifier

[3]. Gensim

[4]. Kaggle SMS Spam Collection Dataset

[5]. mlflow

[6]. Y. Zhang, Object-Oriented Machine Learning Pipeline with mlflow for Pandas and Koalas DataFrames

[7]. Y. Zhang, Jupyter notebook in Github

DISCLOSURE STATEMENT: © 2019 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.

Written by Yuefeng Zhang, PhD
