
An End-to-End Web Service Implementation for Text Classification using Word2Vec and LGBM

An implementation of a web service, the training of a Word2Vec model, the training and validation of an LGBM classifier, and the use of all of these together

While looking through my previous posts, I noticed the sentence "I’m going to talk about Word2Vec models in my next article" and decided to write the third post of my NLP series. For a while I had been planning to share a blog post containing a basic web service that predicts text classes, and I have now finalized both the source code and the post.

Before starting my post, I’d like to say that you can definitely find better language models, such as BERT and GPT-3, to achieve higher classification accuracy than Word2Vec. Here, however, I want to use a simpler language model in a simple web service that can be easily productionized. In addition to the Word2Vec model, I wanted to use Light Gradient Boosting Machine (LGBM) to predict text classes from the output of the language model. A web service that combines the classifier and the language model was implemented using the Flask microframework.

source, lost s01e01

Before starting to explain the code, I’d like to explain the models/architecture briefly.


Word2Vec

Using neural networks to learn distributed vector representations of words has attracted growing attention in recent years. Word2Vec predicts a target word from its neighbouring words and maps words with similar meanings to nearby points in the vector space. It was introduced by Tomas Mikolov. The model generates vector representations of words, and arithmetic operations can be performed on these vectors. As you may guess, the LGBM model uses the output of the Word2Vec model, which is trained on the text data.

Beyond its function as a language model, Word2Vec is a shallow, fully connected neural network with a single hidden layer. The number of nodes in the input and output layers is equal to the vocabulary size. The size of the hidden layer is defined by the user, and enlarging it makes the model more complex; it is also equal to the size of the vector representation. Unlike many other neural network structures, there is no activation function at the hidden layer, so the weighted sum is passed directly to the output layer, where the softmax function generates a probability distribution. The window size, the length of the sliding window used during training, is one of the model’s key parameters: Word2Vec learns the relationships between words by moving this window across the context. With a larger window, similar vectors tend to indicate words with related meanings; with a smaller window, similar vectors tend to indicate words that are interchangeable. The similarity between vector representations is measured with cosine similarity. Word2Vec offers two training methods, Continuous Bag of Words (CBOW) and Skip-gram: CBOW predicts the target word from its context, while Skip-gram predicts the context from a given word. To implement the Word2Vec features, I used the Gensim library.
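
As a quick illustration, here is a minimal sketch of training such a model with Gensim. The corpus and parameter values are only illustrative, not the ones used in the service, and the parameter names follow Gensim 4 (in Gensim 3 the vector size argument is called size):

from gensim.models import Word2Vec

# a tiny illustrative corpus: each sentence is a list of tokens
sentences = [
    ["the", "wine", "from", "bordeaux", "was", "excellent"],
    ["tuscany", "produces", "great", "italian", "wine"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # size of the hidden layer = dimensionality of the word vectors
    window=5,          # length of the sliding window used during training
    min_count=1,       # keep every word for this tiny example
    sg=0,              # 0 = CBOW, 1 = Skip-gram
)

vector = model.wv["wine"]                         # 100-dimensional vector for "wine"
similar = model.wv.most_similar("wine", topn=3)   # nearest words by cosine similarity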

Before completing this part of the post, I have to share the obligatory example demonstrating vector arithmetic.

bordeaux – france + italy = tuscany
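
With Gensim, the same arithmetic can be expressed through most_similar; this is only a sketch, assuming a Word2Vec model (called model here) trained on a corpus large enough that all four words are in its vocabulary:

# positive vectors are added, negative vectors are subtracted
result = model.wv.most_similar(positive=["bordeaux", "italy"], negative=["france"], topn=1)
print(result)  # e.g. [("tuscany", 0.71)] - the exact neighbour and score depend on the training corpus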


LightGBM

If you’re interested in tree-based AI models, you probably know that there are two types of ensemble models, boosting and bagging, for solving classification or regression problems. Briefly, in ensemble models a number of trees are trained sequentially or in parallel, depending on the ensemble type, and each tree votes on the final result of the forest.

Despite the risk of overfitting, boosting models with optimal parameters generally outperform bagging models. Recently, you may have noticed three popular models, XGBoost, CatBoost, and LightGBM, appearing in Kaggle competitions, Medium blogs, and related places. LightGBM is a high-performance gradient-boosting algorithm for many AI tasks. Leaf-wise tree growth and the use of discrete bins derived from continuous values provide great accuracy and a faster training procedure (leaf-wise tree growth is also supported by XGBoost). In my recent projects I have usually obtained better performance with CatBoost, but implementing LightGBM models is more reasonable in terms of efficiency. (I don’t want to generalize about which model is best overall; I just wanted to share feedback from my own experience.)
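
To make the leaf-wise growth and histogram binning a bit more concrete, here is a small sketch of how they appear as LGBMClassifier parameters; the values shown are simply the library defaults, not tuned settings:

import lightgbm as lgb

clf = lgb.LGBMClassifier(
    num_leaves=31,      # leaf-wise growth: trees are constrained by the number of leaves, not by depth
    max_bin=255,        # continuous features are bucketed into discrete bins (histograms)
    n_estimators=100,
    learning_rate=0.1,
)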


Flask

Flask is a framework for implementing web services in Python. Three main classes, Controller, Application, and Init, are used here to implement the web service with Flask. You can think of the Controller class as a gate: it is registered in Init and provides communication with the application from outside. The Init class also contains the URL prefix of the Controller. To run the service, the Application class is invoked; it keeps the host and port configuration. Briefly, a basic application, such as one printing "Bonjour!", can be implemented using only these three classes. Since our service is a bit more complex, I used multiple layers to increase code quality.

In this architecture, the Controller class doesn’t contain any logic; it only extracts the input from the request and forwards it to the related services. The services are the modules that handle the complex operations, including classification and text-to-vector transformation. A class is created for each result type, and the Controller returns them in JSON format.
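
A minimal sketch of this three-part structure might look like the following; the names are illustrative and not the exact classes in the repository:

# controller.py - the "gate": extracts the input from the request and delegates to the services
from flask import Blueprint, Flask, jsonify

controller = Blueprint("controller", __name__)

@controller.route("/hello", methods=["GET"])
def hello():
    return jsonify({"message": "Bonjour!"})

# init.py - registers the controller under a URL prefix
def create_app():
    app = Flask(__name__)
    app.register_blueprint(controller, url_prefix="/api/v1")
    return app

# application.py - keeps the host and port configuration and starts the service
if __name__ == "__main__":
    create_app().run(host="0.0.0.0", port=8080)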


Before starting the code review, I’d like to talk about the data. Amazon product reviews, which can be found on Kaggle, were used for text classification. Each review has a score, and the application predicts this score. To simplify the problem, I grouped the scores into two groups, greater than 3 or not (the maximum score is 5, the minimum is 0), so the problem became a binary classification problem.
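
As a hedged sketch, the binarization can be done with a couple of pandas lines; the file name and the "Score" column below are assumptions based on the Kaggle dataset, not the exact names used in the repository:

import pandas as pd

reviews = pd.read_csv("Reviews.csv")                     # Amazon product reviews from Kaggle (assumed file name)
reviews["label"] = (reviews["Score"] > 3).astype(int)    # 1 if the score is greater than 3, otherwise 0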

In the Controller, there are three main methods, which handle:

  • the training of a Word2Vec model
  • the training of the classifier model
  • the prediction of a text’s class.

train_wv_model receives a text file for model training and returns the model name as a response. I wanted to use model identifiers in order to support multiple models for both the word2vec and the classification model. train_classifier_model also receives the text file, since I didn’t want to store the text file in the application. In addition to the file, the model identifier generated after the word2vec model is trained is passed in the request to retrieve the vector representation of each text. Lastly, predict_text_class retrieves the text and the model ID from the request, and the review is classified by the model whose ID is given.
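
A simplified sketch of this controller could look as follows; the route paths match the cURL calls shown later, while the service objects and their method names are placeholders for the actual services described below:

from flask import Blueprint, jsonify, request

controller = Blueprint("controller", __name__)

@controller.route("/wv-model-training", methods=["POST"])
def train_wv_model():
    file = request.files["file"]
    model_id = word_embedding_service.train_word2vec(file)              # placeholder service call
    return jsonify({"model_id": model_id})

@controller.route("/text-classifier-training", methods=["POST"])
def train_classifier_model():
    file = request.files["file"]
    wv_model_id = request.form["model_id"]
    model_info = text_classifier_service.train(file, wv_model_id)       # placeholder service call
    return jsonify(model_info)

@controller.route("/predict-text", methods=["POST"])
def predict_text_class():
    payload = request.get_json()
    result = prediction_service.predict(payload["text"], payload["model_id"])  # placeholder service call
    return jsonify(result)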

In WordEmbeddingService, there are two main tasks: training a word2vec model and generating vector representations of the reviews. The second method, create_word_embeddings, is called by the classifier to use the vectors as input.

When the text is obtained, it is lowercased, non-alphabetic words and stop words are removed, and each row is tokenized into a word list. After preprocessing is completed, the word2vec model is trained and then saved using the datetime as an identifier. Afterwards, this model ID is returned to the controller.
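
Here is a hedged sketch of this flow, with illustrative parameter values and NLTK’s English stop-word list standing in for whatever list the service actually uses:

import re
from datetime import datetime
from gensim.models import Word2Vec
from nltk.corpus import stopwords      # requires nltk.download("stopwords") once

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    # lowercase, keep only alphabetic tokens, drop stop words
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

def train_word2vec(reviews):
    sentences = [preprocess(review) for review in reviews]
    model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=2)
    model_id = datetime.now().strftime("%Y%m%d%H%M%S")    # datetime-based model identifier
    model.save(f"word2vec_{model_id}.model")
    return model_id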

In the second method, a vector representation of a given text is generated using the existing model. When a word is missing from the vocabulary, a default vector filled with zeros is used. Once all word vectors are fetched from the model, the average of the vectors is taken to generate the final representation of the text.
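
In code, the averaging could be sketched like this, assuming a Gensim model and the token list produced by the preprocessing step:

import numpy as np

def create_word_embeddings(tokens, model):
    vector_size = model.wv.vector_size
    vectors = [
        model.wv[token] if token in model.wv else np.zeros(vector_size)   # zero vector for missing words
        for token in tokens
    ]
    if not vectors:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)                                        # average over all word vectors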

TextClassifierService handles the training of the LGBM and the prediction of a given text using the trained classifier model. The first method requires the text dataframe and a model ID to train the classifier; the second method requires a vector representation instead of a dataframe.

train_classifier_model starts with text preprocessing. As I mentioned above, the same preprocessing is applied as for the word2vec model training, and it is implemented in WordEmbeddingService. Separating the preprocessing from WordEmbeddingService would further increase code quality, but I wanted to stop the implementation there. After preprocessing is completed, the list of vector representations is split into training and test data sets. The model is trained, and the classification performance is measured using the AUC and F-score metrics; since most classification data sets are imbalanced, I avoid using the accuracy metric. Regarding the performance measurement, no hyperparameter tuning is applied for either the language model or the classifier; you can also try different optimal-parameter search methods. Finally, the performance metrics and the model ID are saved into the model_info object, which is returned to the Controller.
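
A minimal sketch of this training step, with illustrative parameters and the same two metrics:

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

def train_classifier(vectors, labels):
    X_train, X_test, y_train, y_test = train_test_split(vectors, labels, test_size=0.2, random_state=42)
    classifier = lgb.LGBMClassifier(n_estimators=200)      # no hyperparameter tuning, as noted above
    classifier.fit(X_train, y_train)

    auc = roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1])
    f_score = f1_score(y_test, classifier.predict(X_test))
    return {"auc": auc, "f_score": f_score}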

The predict method does not contain any complex logic because the complexity is carried by the predict_text_class proxy service, which I’m going to discuss last. It receives a vector representation of the text and a model ID to handle the prediction. Once the result object is created, it is sent to our final service.

I created this service to make the code more readable and to reduce the complexity of the prediction implementation. As the other services do, it conducts the text preprocessing before prediction. The vector representation of each word is obtained from the word2vec model, and the representation of the whole text is generated as the average of the word vectors. At the end of the method, the classifier predicts the text’s class, and the result is returned to the Controller.
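
Putting the pieces together, the prediction flow might be sketched as below, reusing the preprocess and create_word_embeddings helpers from the sketches above:

def predict_text_class(text, w2v_model, classifier):
    tokens = preprocess(text)                               # same preprocessing as for training
    text_vector = create_word_embeddings(tokens, w2v_model)
    return int(classifier.predict([text_vector])[0])        # predicted class of the review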

To call these three endpoints, you can use the cURL commands below:

  • Word2Vec model training:
curl -X POST 'http://localhost:8080/api/v1/wv-model-training' \
--form 'file=@{file_location}'
  • LGBM training:
curl -X POST 'http://localhost:8080/api/v1/text-classifier-training' \
--form 'file=@{file_location}' \
--form 'model_id="{model_id_which_is_returned_from_the_first_method}"'
  • Text prediction:
curl -X POST 'http://localhost:8080/api/v1/predict-text' \
--header "Content-Type: application/json" \
--data '{"text": "{any_review_you_want_to_classify}", "model_id": "{model_id_which_is_returned_from_the_first_method}"}'

Briefly, the controller layer handles requests and forwards the information to the related services, which carry out the training and/or prediction operations. I tried to keep business logic out of the controller layer, leaving it to deal only with the technology side, and vice versa for the service layers. The service layers hold the whole logic and are responsible for the complex operations, model storage, and meta-operations. You can swap in other language and predictor models, and you can tune the hyper-parameters to increase the performance of both.

You can find the whole implementation in my GitHub repo.


In summary, I tried to explain the technologies I used for the language model, the predictor, and the basic web service architecture, as well as the services and functions that were implemented end-to-end. I hope this post will be useful and help you implement your own language applications. Honestly, I can’t predict the topic of my next blog post in the NLP series. If you have any questions about the post or the implementation, feel free to ask me. Before saying goodbye, I want to thank the first readers of this post, Mustafa Kirac and Mehmet Emin Ozturk!


To say "hi" or ask me anything:

linkedin: https://www.linkedin.com/in/ktoprakucar/

github: https://github.com/ktoprakucar

e-mail: [email protected]


References

  1. https://arxiv.org/pdf/1310.4546.pdf
  2. https://arxiv.org/pdf/1402.3722.pdf
  3. http://jalammar.github.io/illustrated-word2vec/
  4. https://en.wikipedia.org/wiki/Cosine_similarity
  5. https://radimrehurek.com/gensim/
  6. https://en.wikipedia.org/wiki/Boosting_(machine_learning)
  7. https://en.wikipedia.org/wiki/Bootstrap_aggregating
  8. https://lightgbm.readthedocs.io/en/latest/
  9. https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/
  10. https://flask.palletsprojects.com/en/1.1.x/
  11. https://towardsdatascience.com/gridsearch-vs-randomizedsearch-vs-bayesiansearch-cfa76de27c6b
