
A Step-by-Step Tutorial for Conducting Sentiment Analysis

Part 3: The Last Step, Applying Logistic Regression

Photo by Markus Winkler on Unsplash

Following the steps from my previous articles, I preprocessed the text data and transformed the "cleaned" data into a sparse matrix. Please follow the links to check out more details.

Now I am at the last step of conducting news sentiment analysis on WTI crude oil futures prices. In this article, I will discuss the use of logistic regression and some interesting results I found in my project. I have some background introduction to this project here.

Define and Construct the Target Value

As discussed briefly in my previous articles, conducting sentiment analysis means solving a (usually binary) classification problem with machine learning models and text data. A classification problem is a supervised machine learning problem, which requires both features and target values when training the model. In a binary classification problem, the target values are usually positive sentiment and negative sentiment. How they are assigned and defined in detail depends on the context of your research question.

Take my project as an example: its purpose is to predict the change in crude oil futures prices from recently released news articles. I define positive news as articles that predict a price increase, and negative news as articles that predict a price decrease. Since I have already collected and transformed the text data and will use it as the features, I now need to assign the target values for my dataset.

The target value of my project is the direction of the price change with respect to each news article. I collected high-frequency trading data from Bloomberg for the WTI crude oil futures close price, which updates every five minutes. I plot the data in the following graph:

Source: Bloomberg

The data covers the last quarter of 2019. There are a lot of fluctuations in the price but no obvious trend, which is ideal for sentiment analysis. The price data here is a continuous variable, and I need to transform it into a categorical variable with binary values for sentiment analysis.

Under the assumption that the financial market is perfectly efficient and reacts fast enough to new information, I define the effect of news on the WTI crude oil futures price as being reflected within five minutes after the news has been released. I establish a dummy variable: if the price increases within five minutes after a news article is released, the price dummy is one; if the price decreases or doesn't change, the price dummy is zero. There is almost no case in the dataset where the price remains the same within the five-minute window, so when the price dummy equals zero, it means the price has decreased within five minutes after the news release.
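The labeling code isn't shown in this article, but here is a minimal sketch of how it could be done, assuming a hypothetical DataFrame news with a timestamp column for the release times and a Series prices of five-minute close prices indexed by time:

import pandas as pd

def label_news(news: pd.DataFrame, prices: pd.Series) -> pd.Series:
    labels = []
    for ts in news['timestamp']:
        p_now = prices.asof(ts)                              # close at (or just before) release
        p_later = prices.asof(ts + pd.Timedelta(minutes=5))  # close five minutes later
        labels.append(int(p_later > p_now))                  # 1 = increase, 0 = decrease/no change
    return pd.Series(labels, index=news.index, name='price')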

For each news article, I match the news to the price dummy by looking at the price change within five minutes after its release. The figure below shows that the data is roughly balanced, comparing the number of price-increase episodes with the number of price-decrease episodes:

Introducing Logistic Regression

After constructing the target value, I have both the text features (the TFIDF-vectorized text data) and the price dummy ready for each news article. Now I need to apply an estimator to build the machine learning model. There are many models that solve binary classification problems, and the one I chose here is logistic regression.

Logistic regression is a linear classifier: it is a transformation of a linear function

f(x) = b_0 + b_1 x_1 + b_2 x_2 + … + b_n x_n

where b_0, b_1, …, b_n are the estimated regression coefficients for the set of independent variables x = (x_1, x_2, …, x_n). The logistic regression function p(x) is the sigmoid transformation of f(x):

p(x) = 1 / (1 + e^{-f(x)})

After the transformation, the value of p(x) lies between 0 and 1, which can be interpreted as a probability. Generally, p(x) is interpreted as the predicted probability that x is in the positive class, and 1 − p(x) as the probability that x is in the negative class. In this project, p(x) is defined as the probability that the WTI crude oil futures price increases within five minutes after news article i's release.
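As a quick numerical illustration (not from the original analysis), the sigmoid squashes any real-valued f(x) into that range:

import numpy as np

def sigmoid(f):
    # maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-f))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))
# [0.00669285 0.5        0.99330715]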

Applying logistic regression to news sentiment analysis, I treat each news article as an observation and the contents of the article as the features, and estimate β_w0, β_w1, …, β_wj from the following equation:

P(Y_i = 1) = 1 / (1 + e^{-(β_w0 + β_w1 X_{i,w1} + β_w2 X_{i,w2} + … + β_wj X_{i,wj})})

Here i indexes each news article as an observation, and wj is the jth unique word across all news articles. On the left-hand side, Y_i is the price change dummy described in the previous section. Specifically, the value of Y_i is determined by the following conditions:

Y_i = 1 if the price increases within five minutes after article i's release
Y_i = 0 if the price decreases (or stays the same) within five minutes after article i's release

On the right-hand side, the first term is a sparse matrix, with each row standing for a news article and each column for a unique word. There are 20,606 unique words appearing across the 4,616 news articles, which gives the sparse matrix its shape. Each value X_{i,wj} of the sparse matrix is the TFIDF value of unique word wj in news article i. For more details about the TFIDF transformation, please check out my previous article.
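This matrix comes straight out of scikit-learn's TfidfVectorizer; a minimal sketch, assuming df['news'] holds the cleaned articles as in the code below:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(lowercase=False)
X = vectorizer.fit_transform(df['news'])  # scipy sparse matrix
# (number of articles, number of unique words): should be roughly
# (4616, 20606) for this dataset, before any min_df/max_df filtering
print(X.shape)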

Implement Logistic Regression

To implement logistic regression and train the model, I first split my dataset into training and test sets. df['news'] holds the "cleaned" news articles and df['price'] the price dummies used as the target value. To find the best transformer and the best estimator, I build a machine learning pipeline and use GridSearchCV to search for the best hyper-parameters. I have attached the code here for your reference:

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

#train and test split
X_train, X_test, y_train, y_test = train_test_split(df['news'],
                                                    df['price'],
                                                    random_state=0)
#build a machine learning pipeline
est = Pipeline([('vectorizer', TfidfVectorizer(lowercase=False)),
                ('classifier', LogisticRegression(solver='liblinear'))])
#GridSearchCV with a transformer and an estimator
parameters = {'vectorizer__max_df': (0.8, 0.9),
              'vectorizer__min_df': [20, 50, 0.1],
              'classifier__C': np.logspace(-3, 3, 7),
              'classifier__penalty': ['l1', 'l2']}
gs = GridSearchCV(est, param_grid=parameters)
#fit the training data
gs.fit(X_train, y_train)
#evaluate the model on the test set (the pipeline vectorizes internally)
predictions = gs.predict(X_test)
print('AUC: ', roc_auc_score(y_test, predictions))
# AUC:  0.719221201684

If not specified otherwise, GridSearchCV searches for the hyper-parameters that generate the highest accuracy. Sometimes accuracy is not the best metric for evaluating the model, and we may use other metrics. You can specify one via the scoring argument of the GridSearchCV function. For how to choose the right metric, my article "The Ultimate Guide of Classification Metrics for Model Evaluation" answers this question in detail.
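For example, to select hyper-parameters by AUC instead of accuracy, you can pass scoring='roc_auc':

gs = GridSearchCV(est, param_grid=parameters, scoring='roc_auc')
gs.fit(X_train, y_train)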

I used the AUC as my model metric, and it reaches 0.71 on my test set. Given the number of observations I have to train the model (over 4,000 news articles), I believe the model is ready to deploy.
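Once the grid search has finished, the winning combination can be read off the fitted object; the values in the comment below are only illustrative:

print(gs.best_params_)  # e.g. {'classifier__C': 1.0, 'classifier__penalty': 'l2', ...}
print(gs.best_score_)   # mean cross-validated score (under the chosen scoring) of the best candidate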

Interesting Findings

Following the procedure above, I estimated the coefficients (βs) of each unique word. In total, I got over 20,000 unique words, and the plot below shows the β for each unique word:

Each point on the x-axis stands for a unique word collected from all news articles, and there are 20,606 of them. The y-axis shows the sign and the size of the coefficient for each word. The figure indicates that most unique words, by themselves, have a very limited effect on the price, with coefficients very close to zero. However, some words have coefficients with an absolute value over 0.5, and those can be very predictive in estimating the price change.
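The plotting code isn't shown here, but the per-word βs can be pulled out of the fitted pipeline roughly as follows (step names as in the code above; get_feature_names_out requires scikit-learn ≥ 1.0):

import pandas as pd

best = gs.best_estimator_
words = best.named_steps['vectorizer'].get_feature_names_out()  # one entry per unique word
betas = best.named_steps['classifier'].coef_[0]                 # one beta per word
coef = pd.Series(betas, index=words).sort_values()
print(coef.head(10))  # most negative words
print(coef.tail(10))  # most positive words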

Using the Python wordcloud package and the values of the coefficients, I plot the most positive and most negative words in predicting the change of price in each direction. A bigger font size indicates a bigger impact in predicting the price change.
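A minimal sketch of how such clouds can be drawn with the wordcloud package, reusing the coef series from the sketch above; font size scales with the supplied weight:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# 'coef' is the per-word coefficient Series built in the previous sketch
pos_cloud = WordCloud(background_color='white').generate_from_frequencies(
    coef[coef > 0].to_dict())
neg_cloud = WordCloud(background_color='white').generate_from_frequencies(
    (-coef[coef < 0]).to_dict())  # use absolute values as weights

for cloud, title in [(pos_cloud, 'Positive words'), (neg_cloud, 'Negative words')]:
    plt.figure()
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
plt.show()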

Deploy at Heroku

After constructing and evaluating the model, I implemented the model as a Flask web app and deployed it at Heroku:

In the "News Body" box, you can paste any news article and press "Submit"; the model will then predict the sentiment of the article, i.e., the probability that the price increases after this news has been released.

Building the web app requires deploying the trained model online and making predictions based on new inputs. Besides coding in Python to build the machine learning model and construct a Flask app, you also need some background knowledge of HTML for the web app. In the future, I will write a tutorial about deploying machine learning models.
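The app's code isn't part of this article, but a bare-bones Flask sketch could look like the following; the file model.pkl and the form field news_body are hypothetical names, and a templates/index.html page is assumed to exist:

import pickle
from flask import Flask, request, render_template

app = Flask(__name__)
# load the fitted pipeline, saved earlier with pickle.dump(gs.best_estimator_, ...)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/', methods=['GET', 'POST'])
def index():
    prob = None
    if request.method == 'POST':
        text = request.form['news_body']  # the pasted news article
        # probability that the price increases after this news release
        prob = model.predict_proba([text])[0, 1]
    return render_template('index.html', prob=prob)

if __name__ == '__main__':
    app.run()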

On the web app, there is also some exploratory data analysis and other interesting findings from my project. Feel free to check it out and play with it here.

This is all for conducting sentiment analysis. Please feel free to contact me if you have any comments or questions. Thank you for reading!

Here is the list of all my blog posts. Check them out if you are interested!

My Blog Posts Gallery
