How to build a model that recognizes human sentiment from audio and text recordings. Co-authors: Qianwen Guan, Alexandre Laurent.

For the final project of Le Wagon's bootcamp, my team and I decided to take on a fascinating task: speech sentiment recognition.
It is a great challenge to take up because emotions are subjective: they vary across culture, gender, language and even from one individual to the next, which makes human sentiment difficult to classify universally.
Our data
We found a dataset from Carnegie Mellon University called _CMU-MOSEI_¹, the largest dataset of sentence-level sentiment analysis and emotion recognition in online videos. It contains more than 65 hours of annotated video from more than 1,000 speakers and 250 topics.
The data was divided into segments of variable length, each representing a full spoken sentence (our features), and a sentiment label, our target, ranging from -3 to 3 (from negative to positive, 0 being neutral).
We decided to analyse both the audio recordings and the text transcripts to predict the sentiment behind a person's sentence. Our intuition was that combining two models with two different sources, using multi-modal learning, could improve our performance.
Data preprocessing
The first step of our work was to clean both the text and the audio data.
Since the text was already extracted from the videos, cleaning it essentially consisted in basic formatting of the text files (removing punctuation, numbers and uppercase). However, in Natural Language Processing (NLP), it is difficult to choose which parts of the text to remove and which to keep (single words, sentences, whole conversations). I attempted to lemmatize and stem words but found no improvement in performance.
The audio format was slightly more complex, and we tried two approaches:
- The first was audio feature extraction using Python's librosa library, which extracts 5 major features (as mean values) from each recording: MFCC, Chroma, Mel Spectrogram, Spectral Centroid & Tonnetz. From those, I obtained around 190 features which could then be used for modelling as tabular data (a sketch of this extraction follows the list).
- The second was converting audio to a Mel spectrogram, which allowed us to interpret the audio as an image and to model it from a visual point of view. In a Mel spectrogram image, the x-axis represents time (s), the y-axis frequency (Hz), and the color intensity the amplitude of the signal (dB). In this case, the images allowed feature extraction to take place in a Deep Learning setting (a convolutional network).
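As a rough illustration of the first approach, here is a minimal sketch of the kind of librosa extraction we used (the exact parameters, such as the number of MFCC coefficients, are assumptions):

```python
import numpy as np
import librosa

def extract_features(path):
    """Mean-aggregated acoustic features for one recording (sketch, parameters assumed)."""
    y, sr = librosa.load(path)  # librosa resamples to 22,050 Hz by default
    # Compute each feature matrix and keep the mean over time of every coefficient.
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    centroid = np.mean(librosa.feature.spectral_centroid(y=y, sr=sr), axis=1)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)
    # Concatenated, this gives roughly 190 values per clip (40 + 12 + 128 + 1 + 6 = 187 here).
    return np.concatenate([mfcc, chroma, mel, centroid, tonnetz])
```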

Stacking Machine Learning models
ML text model
I first tried to see what results we could achieve with a simple Bag of words NLP model.
The Bag-of-words representation consists in counting the occurrences of each word in a text; the count for each word becomes a column. I decided to use scikit-learn's CountVectorizer and implemented a grid search looking for the best hyper-parameters.
Those hyper-parameters include:
- ignoring words that have a frequency in the dataset higher than the specified threshold (max_df)
- the number of top features to keep when vectorizing (max_features)
- the length of the word sequences to be considered (ngram_range)
Surprisingly, the ngram_range that gave the best results was the one keeping only single words ((1, 1)). Our model would therefore potentially not be able to detect "not happy" as a negative sentiment. Our interpretation was that, most of the time, the model relies on key words ("good", "disaster") to detect the right sentiment.
After vectorizing, I used a linear regression model with Ridge regularization: the idea is to keep the model from over-fitting by adding a penalty term to the loss function based on the regression coefficients (the betas). We chose an L2 penalty because we assumed all coefficients had a similar impact on the prediction.
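Putting the vectorizer and the Ridge regression together, the grid search could look like the sketch below (the parameter grids are illustrative, not our exact search space):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Vectorize the transcripts, then fit a Ridge regression on the word counts.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("regressor", Ridge()),
])

# Hypothetical grids for the three hyper-parameters listed above.
param_grid = {
    "vectorizer__max_df": [0.8, 0.9, 1.0],
    "vectorizer__max_features": [1_000, 5_000, 10_000],
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
}

# Score with negative MAE so GridSearchCV selects the lowest-error combination.
search = GridSearchCV(pipeline, param_grid, scoring="neg_mean_absolute_error", cv=5)
# search.fit(train_transcripts, y_train)  # train_transcripts: list of cleaned sentences
```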
To evaluate the performance of the model, I used the mean absolute error (MAE), a measure of the error between predicted values and 'real' observations. This basic NLP model gave us a 0.87 MAE, meaning the difference between the predicted sentiment and the real value was 0.87 on average, on a sentiment scale of [-3, 3].
As a comparison, I created a baseline model using random samples from a uniform distribution, which gave a 1.77 MAE.
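Such a baseline can be computed in a few lines; a sketch assuming the uniform draw spans our [-3, 3] sentiment scale:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def uniform_baseline_mae(y_true, low=-3, high=3, seed=42):
    """MAE obtained by predicting sentiments drawn uniformly at random over the scale."""
    rng = np.random.default_rng(seed)
    y_random = rng.uniform(low, high, size=len(y_true))
    return mean_absolute_error(y_true, y_random)

# Usage (y_test holds the true sentiment scores of the test split):
# print(uniform_baseline_mae(y_test))
```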
ML Audio model
As mentioned above, the input variables (X) for audio are acoustic features extracted from the audio files.
To predict the sentiments, we built a Random Forest (RF) model using scikit-learn. RF is an ensemble method that bags a set of decision trees on sub-samples of a dataset. The advantage of this method is that RF uses averaging to improve the predictive accuracy and control over-fitting.
Then, we performed a grid search to optimize RF's hyper-parameters (sketched below), which include:
- the number of trees in the forest (n_estimators)
- the minimum number of samples required to split an internal node (min_samples_split)
- the maximum depth of the trees (max_depth)

This RF model fitted on the audio features gave us a 0.91 MAE.
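A sketch of that grid search, with illustrative (not our exact) parameter grids:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical search space over the three hyper-parameters listed above.
param_grid = {
    "n_estimators": [100, 300, 500],
    "min_samples_split": [2, 5, 10],
    "max_depth": [None, 10, 30],
}

rf_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
# rf_search.fit(X_train_audio, y_train)  # X_train_audio: ~190 librosa features per clip
```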
Stacking ML models to improve our predictions
Once the two models were built, we intended to combine their predictions to see if it could improve our results.
First, I created a custom Feature Selector to enable our pipeline to select the right features for each model before stacking them:
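A minimal sketch of what such a selector can look like, assuming the text and the audio features are stored as columns of a single DataFrame (the column names here are hypothetical):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    """Keep only the columns a given sub-model should see."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn: the selector only slices columns.
        return self

    def transform(self, X):
        return X[self.columns]

# Example: one selector at the start of each sub-model's pipeline.
# text_selector = FeatureSelector(columns="transcript")          # a single text column
# audio_selector = FeatureSelector(columns=audio_feature_names)  # the ~190 librosa features
```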
Then, I used scikit-learn's StackingRegressor & MLPRegressor to create the structure of our stacked model. The idea is to add a layer of neurons that combines the predictions of both models within the pipeline. After iterating, we selected a single layer with five neurons.
This meta-model works like a small Deep Learning neural network: I used 500 epochs (max_iter), a rectified linear activation function in each neuron (activation='relu') & early stopping to limit over-fitting.
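Assuming text_pipeline and audio_pipeline are the two pipelines described above (feature selector + vectorizer + Ridge, and feature selector + Random Forest), the stacked model might be assembled like this:

```python
from sklearn.ensemble import StackingRegressor
from sklearn.neural_network import MLPRegressor

# Meta-model: a single hidden layer of five neurons, as described above.
meta_model = MLPRegressor(
    hidden_layer_sizes=(5,),
    activation="relu",      # rectified linear activation in each neuron
    max_iter=500,           # 500 epochs
    early_stopping=True,    # stop early to limit over-fitting
)

stacked_model = StackingRegressor(
    estimators=[("text", text_pipeline), ("audio", audio_pipeline)],
    final_estimator=meta_model,
)
# stacked_model.fit(X_train, y_train)  # X_train holds both the transcript and audio columns
```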
Stacking the ML models enabled us to significantly improve our predictions: we reached an MAE of 0.78 with our final stacked model.
Stacking Deep Learning models
NLP CNN model
To analyse text with neural networks, I selected a convolutional network model with a custom embedding.
Embedding consists in placing each word of our training set in a multi-dimensional space of our own making. We decided to build our own vocabulary to potentially put more emphasis on the sentimental 'value' of each word, and therefore gain more precision than with a generic embedding such as Word2Vec. To do so, we created a "Vocabulary" class to train & save this vocabulary for future predictions.
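A minimal sketch of the idea behind such a class (word-to-index mapping plus saving for later use; the implementation details are assumptions, not our exact code):

```python
import pickle

class Vocabulary:
    """Map every training word to an integer index, reusable at prediction time (sketch)."""

    def __init__(self):
        self.word_to_index = {"<pad>": 0, "<unk>": 1}

    def fit(self, sentences):
        # Assign an index to every word seen in the cleaned training transcripts.
        for sentence in sentences:
            for word in sentence.split():
                if word not in self.word_to_index:
                    self.word_to_index[word] = len(self.word_to_index)
        return self

    def transform(self, sentences):
        # Turn each sentence into a sequence of indices; unknown words map to <unk>.
        return [[self.word_to_index.get(w, 1) for w in s.split()] for s in sentences]

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self.word_to_index, f)
```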
As for the model, we implemented a Convolutional Neural Network (CNN): this type of Deep Learning model is widely used in image processing and also performs well on certain NLP tasks², sentiment prediction being one of them.
The following code shows our neural network construction with TensorFlow's Keras library. After an integrated embedding, the training data passes through one convolution layer. It is then flattened and fed into one Dense layer composed of 32 neurons. All neurons have a rectified linear activation function (activation='relu').
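A sketch in that spirit (the vocabulary size, sequence length, embedding dimension and number of filters are placeholders, not necessarily the values we used):

```python
from tensorflow.keras import Sequential, layers

vocab_size, seq_len = 10_000, 50   # assumed vocabulary size and padded sentence length

text_cnn = Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(input_dim=vocab_size, output_dim=50),        # integrated embedding
    layers.Conv1D(filters=32, kernel_size=3, activation="relu"),  # one convolution layer
    layers.Flatten(),
    layers.Dense(32, activation="relu"),                          # one dense layer of 32 neurons
    layers.Dense(1, activation="linear"),                         # sentiment score in [-3, 3]
])
text_cnn.compile(optimizer="adam", loss="mae")
```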
This CNN model gave us a 0.75 MAE in our test data, becoming our best model.
Audio CNN model
Our next approach was to use the audio Mel spectrogram, which is widely adopted in Deep Learning. We converted frequencies to the Mel scale; the resulting Mel spectrogram becomes the input (as an image) of a CNN model.
Because humans do not perceive frequencies on a linear scale, the Mel scale is closer to human perception of pitch. Therefore, in our study, all frequencies were mapped to 128 Mel bands.
Since all inputs of a CNN must have the same shape, we padded the shorter recordings with silence and clipped the longer ones in order to get a single input shape of (128, 850, 1), where 128 represents the 128 Mel bands, 850 the length in frames, and 1 the single channel (a grey-scale image).
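A sketch of that preparation step (the padding value and the librosa defaults are assumptions):

```python
import numpy as np
import librosa

N_MELS, MAX_FRAMES = 128, 850

def audio_to_spectrogram(path):
    """Fixed-size Mel spectrogram 'image' of shape (128, 850, 1) for one recording."""
    y, sr = librosa.load(path)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS)
    mel_db = librosa.power_to_db(mel, ref=np.max)   # decibel scale, closer to perceived loudness
    if mel_db.shape[1] < MAX_FRAMES:
        # Pad short clips with the minimum dB value (near-silence) up to 850 frames.
        mel_db = np.pad(mel_db, ((0, 0), (0, MAX_FRAMES - mel_db.shape[1])),
                        constant_values=mel_db.min())
    else:
        # Clip longer recordings to the same width.
        mel_db = mel_db[:, :MAX_FRAMES]
    return mel_db[..., np.newaxis]   # add the grey-scale channel dimension
```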
Here is the final CNN model we built to predict sentiment from Mel spectrograms.
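An illustrative sketch of such a spectrogram CNN (the layer and filter counts here are placeholders, not our exact architecture):

```python
from tensorflow.keras import Sequential, layers

audio_cnn = Sequential([
    layers.Input(shape=(128, 850, 1)),             # 128 Mel bands x 850 frames, 1 channel
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="linear"),          # sentiment score in [-3, 3]
])
audio_cnn.compile(optimizer="adam", loss="mae")
```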
This analysis of Mel spectrograms as images gave us a 0.89 MAE.
Stacking DL models

From the ML results, we had learnt that stacking the NLP and audio models can improve our predictions. Thus, we stacked the outputs of our two DL models and added one dense layer before the output layer, using Keras' "Concatenate" layer.
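Assuming text_cnn and audio_cnn are the two trained models above, the stacking might look like this sketch (the size of the intermediate dense layer is an assumption):

```python
from tensorflow.keras import Model, layers

# Concatenate the two sub-models' outputs, add one dense layer, then predict the sentiment.
merged = layers.Concatenate()([text_cnn.output, audio_cnn.output])
hidden = layers.Dense(8, activation="relu")(merged)      # dense layer size is a placeholder
output = layers.Dense(1, activation="linear")(hidden)

stacked_dl = Model(inputs=[text_cnn.input, audio_cnn.input], outputs=output)
stacked_dl.compile(optimizer="adam", loss="mae")
# stacked_dl.fit([X_text, X_spectrograms], y_train)  # one input per sub-model
```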
Using the same method as in the ML stacking, the model essentially performs a regression based on both models' outputs. Unfortunately, it did not improve our predictions on the test data, as this stacking model also gave a 0.75 MAE.
Results & looking further

Overall, the best model for our task turned out to be our NLP Deep Learning model. We found our final 0.75 MAE acceptable considering the time spent on the project, the size of our dataset and its quality:
First, the dataset was annotated by humans: since sentiment and emotion are highly subjective (culture, varied interpretations of meaning, sarcasm, etc.), the quality of the labels limited the model. Secondly, most of the sentiments were neutral, meaning the data was skewed. This led our model to disproportionately predict sentiment as neutral (even when it was positive or negative). A solution that avoids sacrificing the dataset's size through balancing would be to gather more negative and positive data, in order to learn features that predict a wider range of sentiment.
Besides, we believe that our results could have been improved in a few ways:
- Spending more time on model tuning, paying attention to hyper-parameters, text-cleaning steps or kernel sizes.
- Trying to stack other Deep Learning models to see if they could improve our predictions. The Mel spectrogram did not seem to capture patterns complementary to the text analysed by our NLP CNN, and we concluded that this issue deserved to be dug into more deeply.
- Taking a different approach and building an emotion classifier able to predict happiness, anger, surprise, etc., rather than a sentiment rating.
The source code can be found on GitHub. We look forward to hearing any feedback or questions.
To conclude, we believe sentiment recognition from audio and text has a very exciting future, as it allows great insights to be collected from people. Pushed even further and combined with emotion classification, these kinds of projects could add great value to society. For instance:
- Improving phone customer service by redirecting customers depending on their emotions/sentiments: happy customers can be directed to sales, unhappy customers to retention, confused-sounding customers to technical support, etc.
- Evaluating service quality and monitoring a brand.
We are definitely open to comments and suggestions from the community and look forward to seeing the improvements in this field.
[1] Zadeh A., Liang P. P., Poria S., Vij P., Cambria E. and Morency L.-P. (2018). Multi-attention recurrent network for human communication comprehension. In Thirty-Second AAAI Conference on Artificial Intelligence.
[2] Taha Binhuraib (October 13, 2020). NLP with CNNs. https://towardsdatascience.com/nlp-with-cnns-a6aa743bdc1e