
What can we learn by applying Natural Language Processing and classification algorithms to Room Rater’s tweets?
How much thought have you put into what’s behind you on Zoom calls? Working remotely, at least part of the time, might be here to stay for many of us. While you may be able to get away with wearing sweatpants on the job, there’s a new element to your presentation that you may not have had to consider before: your background decor, lighting, and camera angle.
This project predicts the quality of people’s web conference backgrounds using Natural Language Processing and Machine Learning pipelines. The models are trained and tested on tweets collected from the Room Rater (@ratemyskyperoom) Twitter account. This account posts photos of people’s web conference backgrounds, critiquing the background aesthetics and assigning them a score of 0–10.
Natural Language Processing is used to tokenize the tweet text to identify key vocabulary used in the background evaluation criteria. The tokenized text is then fed into the machine learning model, where several classifiers are tested in their ability to predict ratings.
Here are examples of lower and higher scores Room Rater has awarded:


The Challenge
At the start of the COVID pandemic, many professionals found themselves suddenly launched into working from home, requiring that they attend meetings from their kitchens, living rooms, bedrooms, and if they’re lucky, home offices.
Many people are not used to presenting themselves in this context, and may be even less aware of how their backgrounds form a part of the impression they give.
Enter Room Rater. This Twitter account began scoring people’s web conference backgrounds, applauding them for a good arrangement of plants and books in the background, or critiquing their lighting and camera angles.
If we were to apply Natural Language Processing to Room Rater’s tweets, could we crack their scoring system and predict what makes a 10/10 background?
Machine Learning Pipeline: Inputs and Outputs
Variables
Predictive variable: Tweet text
Outcome variable: Rating, a multi-class variable on a 0–10 scale
Evaluation metrics
To evaluate the classification models, the following metrics will be used:
- Accuracy – the proportion of labels correctly predicted
- Precision – the proportion of predictions of a specific class that are correct (e.g., how many backgrounds predicted to be 9 were actually 9)
- Recall – the proportion of a specific class that is correctly predicted (e.g., how many actual 9s were predicted to be 9)
- F1 score – the harmonic mean of precision and recall
- ROC AUC score – the area under the ROC curve (true positive rate vs. false positive rate), with 0.5 signifying performance on par with random classification
Because the rating is ordinal, the above metrics don’t account for the degree of misclassification: they won’t recognize that misclassifying a 10 as a 9 is preferable to misclassifying a 10 as a 2. Therefore, we’ll also use the following metric to assess the distance from the correct value:
- Average absolute difference between the actual and predicted ratings (computed as sketched below)
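This metric can be computed in one line (a sketch; y_test and y_pred stand for hypothetical arrays of actual and predicted ratings):

import numpy as np

# Ordinal-aware metric: average distance between actual and predicted ratings
avg_diff = np.mean(np.abs(np.array(y_test) - np.array(y_pred)))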
Data Collection
Here are resources to get you started with the Twitter API and a helpful Python package for getting tweets:
- https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/introduction
- https://docs.tweepy.org/en/latest/
- https://docs.tweepy.org/en/v3.10.0/cursor_tutorial.html
Creating a function to collect tweets from @ratemyskyperoom allows us to easily return and collect more tweets if we want to expand our dataset down the road.
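Here’s a minimal sketch of such a function using Tweepy v3 (the collect_tweets name and credential variables are hypothetical, and you’ll need your own Twitter developer keys):

import pandas as pd
import tweepy

# Hypothetical credentials from your Twitter developer account
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

def collect_tweets(screen_name="ratemyskyperoom", limit=3000):
    """Page through a user's timeline and return the tweets as a dataframe."""
    cursor = tweepy.Cursor(api.user_timeline, screen_name=screen_name,
                           tweet_mode="extended").items(limit)
    rows = [{"id": t.id, "created_at": t.created_at, "text": t.full_text}
            for t in cursor]
    return pd.DataFrame(rows)

tweets_df = collect_tweets()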
Once you collect the tweets and store them in a dataframe, here’s what they look like:

We’ll need to parse this data into our X and y variables before loading it into the models.
Data Preprocessing
The first step in preparing the dataset for analysis is to extract the outcome variable, the rating. Fortunately, Room Rater is consistent in their rating format, enabling us to identify the ratings as the numbers preceding "/10".
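A regular expression handles that extraction; here’s a sketch (the column names are assumptions based on the dataframe above):

import re

def extract_rating(text):
    """Pull the score out of patterns like '8/10'."""
    match = re.search(r"(\d{1,2})/10", text)
    return int(match.group(1)) if match else None

tweets_df["rating"] = tweets_df["text"].apply(extract_rating)
tweets_df = tweets_df.dropna(subset=["rating"])  # keep only scored tweets
tweets_df["rating"] = tweets_df["rating"].astype(int)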
Data Exploration
We should take time to understand our data before selecting and applying the models.
A histogram reveals that the outcome variable is heavily imbalanced toward 10/10. We will explore ways to address this through various models.
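That histogram takes only a couple of lines to produce (a sketch, assuming the rating column created during preprocessing):

import matplotlib.pyplot as plt

tweets_df["rating"].hist(bins=11)  # one bin per rating, 0 through 10
plt.xlabel("Rating")
plt.ylabel("Number of tweets")
plt.show()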

Natural Language Processing
Several Natural Language Processing techniques were used to identify the most common words in Room Rater’s tweets:
- Removal of punctuation, URLs, and other non-text characters, as well as normalization of case
- Removal of English stop words (the, a, an)
- Word tokenization to break sentences into word units for analysis
- Lemmatization to collapse words like plant and plants into a single token (a tokenizer sketch follows)
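Here’s a minimal tokenizer implementing these steps with NLTK (one possible implementation; the post doesn’t pin down the exact library calls):

import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download(["punkt", "stopwords", "wordnet"], quiet=True)

def tokenize(text):
    """Normalize, tokenize, and lemmatize tweet text."""
    text = re.sub(r"http\S+", " ", text)           # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # lowercase; drop punctuation and digits
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in word_tokenize(text)
            if t not in stop_words]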
Before feeding the tokens into the classification models, it can be helpful to visualize the tokens.
Let’s take a look at the most common words for 10/10.

We can deduce that Room Rater applauds great use of art in the background. Plants, flowers, pillows, and books are also good accessories, as well as creating a sense of depth.
Repeating this analysis for other ratings, we see that low- to mid-rated backgrounds need work on camera angle and keeping their cords out of sight. Backgrounds scoring in the 7–9 range have the basics down and can focus on incorporating elements like plants and art to add interest. If you have a blank white wall behind you, you’ll be accused of making a hostage video and likely receive a rating of 2 or 3.

Machine Learning Pipeline
Five classifiers were evaluated:
- Random Forest Classifier (fits multiple decision tree classifiers on different sub-samples to minimize over-fitting)
- Balanced Random Forest Classifier (balances the Random Forest Classifier by employing under-sampling)
- Gradient Boosting Classifier (runs multiple decision tree classifiers to minimize the loss function)
- Easy Ensemble Classifier (with the AdaBoost Classifier as a base estimator, employs random under-sampling on the bootstrap samples)
- Ordinal Logistic Regression (takes into account that the order of the ratings is meaningful)
GridSearchCV was implemented to evaluate several combinations of parameters for the classifiers.
The following functions can help us re-run evaluations for multiple classifiers, without having to repeat code for each model.
- As the predicted variable is multi-class (with outputs ranging from 0–10), the outcome labels must be binarized in order to calculate the ROC AUC score.
- Merging the predicted rating back onto the original dataframe will allow us to compare the actual and predicted values through visualization and statistics.
- Creating a visualization function, such as a heatmap of actual vs. predicted values, can provide a quick visual summary of the results.
- Printing a collection of evaluation scores can help us compare models more easily (a sketch of such a helper follows).
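A print_scores helper along these lines would produce the output shown later (a sketch; the pred_df argument mirrors the call below but is unused here):

import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.preprocessing import label_binarize

def print_scores(y_test, y_pred, pred_df=None, classes=range(11)):
    """Print evaluation metrics for the multi-class rating classifier."""
    # Binarize labels so the multi-class ROC AUC can be computed
    # (assumes every rating class appears at least once in y_test)
    y_test_bin = label_binarize(y_test, classes=list(classes))
    y_pred_bin = label_binarize(y_pred, classes=list(classes))
    print(f"accuracy score: {accuracy_score(y_test, y_pred):.2f}")
    print(f"precision score: {precision_score(y_test, y_pred, average='weighted', zero_division=0):.2f}")
    print(f"recall score: {recall_score(y_test, y_pred, average='weighted'):.2f}")
    print(f"f1 score: {f1_score(y_test, y_pred, average='weighted'):.2f}")
    print(f"roc_auc_score: {roc_auc_score(y_test_bin, y_pred_bin, average='weighted'):.2f}")
    # Ordinal-aware metric: average distance between actual and predicted
    diff = np.abs(np.asarray(y_test) - np.asarray(y_pred))
    print(f"avg diff, actual v pred: {diff.mean():.2f}")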
Let’s look at two different models, the Random Forest Classifier, and a variation that addresses the class imbalance in the data.
Building and Applying the Models
The pipeline below first transforms the text using the tokenizer developed earlier, and then applies a classifier of choice. Using GridSearchCV, you can assess several sets of parameters simultaneously and return the best parameters.
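Here’s a sketch of that pipeline for the Random Forest Classifier (the grid values are illustrative; only the parameter names match the results below, and X_train/y_train come from a standard train/test split):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer(tokenizer=tokenize)),  # tokenizer from the NLP step
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

parameters = {
    "clf__n_estimators": [50, 100],
    "clf__min_samples_leaf": [1, 2],
    "clf__class_weight": ["balanced", "balanced_subsample"],
}

model_rf = GridSearchCV(pipeline, param_grid=parameters, cv=3)
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)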
Parameters chosen by GridSearchCV:
model_rf.best_params_
{'clf__class_weight': 'balanced_subsample',
'clf__min_samples_leaf': 2,
'clf__n_estimators': 100}
A quick look at the actual versus predicted ratings:

print_scores(y_test_rf, y_pred_rf, pred_df_rf)
accuracy score: 0.45
precision score: 0.5
recall score: 0.45
f1 score: 0.46
roc_auc_score: 0.65
avg diff, actual v pred: 2.36
The model performance isn’t strong on any of the evaluation metrics. On average, its predictions are just over 2 ratings away from the actual rating. Let’s re-run the model using the Balanced Random Forest Classifier, which is built to address the class imbalance through under-sampling.
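Swapping in the imbalanced-learn classifier is a one-line change to the pipeline (a sketch, reusing the structure above):

from imblearn.ensemble import BalancedRandomForestClassifier

# Same pipeline, but each tree is fit on a random under-sample of the data
pipeline_brf = Pipeline([
    ("vect", CountVectorizer(tokenizer=tokenize)),
    ("tfidf", TfidfTransformer()),
    ("clf", BalancedRandomForestClassifier(random_state=42)),
])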
Looking at the metrics for the Balanced Random Forest Classifier, we see that the model scores even worse:
accuracy score: 0.1
precision score: 0.33
recall score: 0.1
f1 score: 0.11
roc_auc_score: 0.55
avg diff, actual v pred: 4.54
To understand why that might be, let’s compare the distribution of actual ratings with the predicted ratings of the Random Forest Classifier and the Balanced Random Forest Classifier:

The distribution plot above compares the actual ratings with the predicted ratings from both the Random Forest Classifier and the Balanced Random Forest Classifier. We can see the balanced classifier yields a wider distribution of predictions.
We might want to consider whether the test dataset reflects real-world conditions. We know that Room Rater awards a lot of 10/10 ratings. Does this match reality, or is Room Rater introducing bias through the photos they choose to tweet? From personal experience, the average background I see among my colleagues (including my own!) is rather plain and one-dimensional, and would likely be closer to a 3/10 than a 10/10.
Gradient Boosting, Balanced AdaBoost, and Ordinal Logistic Regression
The remaining models, Gradient Boosting Classifier, a Balanced AdaBoost Classifier, and an Ordinal Logistic Regression, performed similarly on the metrics.
Like the Balanced Random Forest Classifier, the Balanced AdaBoost Classifier also flattens the distribution, though it performs better than the Balanced Random Forest. We can see above that the Balanced Random Forest led to a more extreme re-balancing, with peaks around ratings of 0/10 and 7/10 instead of 10/10 like the actual values. The Balanced AdaBoost peaks at 9/10, a closer approximation of the actual values.

The Ordinal Logistic Regression didn’t outperform the other models despite leveraging the ordinal nature of the outcome variable, but developing a custom ensemble method that applies an ordinal approach to other classifiers could be worth testing. Its predictions were also more closely aligned with the true dataset, assigning ratings in closer proximity to 10/10 than the balanced models did (see the distribution plot below).
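As a starting point, the mord package offers a threshold-based ordinal model; here’s a sketch (my choice of library, not necessarily the post’s):

from mord import LogisticAT
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize with the same tokenizer, then fit a threshold-based ordinal model
tfidf = TfidfVectorizer(tokenizer=tokenize)
X_train_tfidf = tfidf.fit_transform(X_train)

ordinal_model = LogisticAT(alpha=1.0)  # regularization strength
ordinal_model.fit(X_train_tfidf.toarray(), y_train)  # mord expects dense arrays
y_pred_ord = ordinal_model.predict(tfidf.transform(X_test).toarray())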


Takeaways
Through Natural Language Processing, we were able to identify key elements that make a good background: art, plants, and books. We also learned that merely having these elements behind you does not a great background make. For top ratings, one must also create a sense of depth, pay attention to lighting, and use skillful framing and camera angles.
The machine learning models did not perform as well. One challenge was the under-representation of low and mid ratings. It’s also hard to judge how well the models would perform in the real world, since the sample data was pre-selected by Room Rater. It’s possible that the over-representation of 10/10 backgrounds is due to Room Rater selecting photos they like, and that your everyday background may be more mediocre.
Further Improvement
There are several directions we could take to extract more out of this data.
Sentiment analysis
Adding sentiment analysis steps to the pipeline could improve the model. We see common words like "depth", "lighting", and "reframe", but we would gain more value if we could distinguish when these words are used positively versus negatively.
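As a sketch, NLTK’s VADER analyzer could score each tweet’s polarity and contribute it as an extra feature (one option among many; not what the original project used):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# Compound score runs from -1 (most negative) to +1 (most positive)
tweets_df["sentiment"] = tweets_df["text"].apply(
    lambda t: sia.polarity_scores(t)["compound"])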
Verb-form analysis
Identifying verb forms, to see whether the command form is used, would also help us determine whether Room Rater is applauding effective use or making a recommendation. For example, the gerund (-ing) in "good reframing" is positive, whereas the imperative "reframe" is a suggestion for improvement.
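A rough sketch using NLTK’s part-of-speech tagger (treating VB as a proxy for imperatives and VBG for gerunds is my heuristic, not the post’s method):

import nltk

nltk.download(["punkt", "averaged_perceptron_tagger"], quiet=True)

def verb_forms(text):
    """Return (token, tag) pairs for verbs; VB ~ imperative, VBG ~ gerund."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [(tok, tag) for tok, tag in tagged if tag.startswith("VB")]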
Deep learning and image classification
Another data source that could enhance the model is the actual photo. A neural network could be developed to identify visual similarities in what makes a 10/10 background. Like sentiment analysis, this model could help assess the quality of the lighting, placement of the decor, and position of the camera.
One could take this project further and create an app that allows users to upload a photo of their web background, and through image analysis, recommendations would be made to add plants, artwork, books, or adjust lighting and framing.
What do you think makes a 10/10 background? Does Room Rater have the final word?
Access the GitHub repository here: github.com/laurenmarar/RoomRater
Acknowledgements
This blog post was completed as a part of the Capstone Project for the Data Science Nanodegree at Udacity.com.