
Systematically Improving Your Machine Learning Model

The Machine Learning Inevitables

Photo by GR Stocks on Unsplash

When working on a Machine Learning task, it's highly likely the process will be extremely iterative. By its nature, Machine Learning involves a computer learning without being explicitly programmed, so we rarely get things right on the first attempt.

Similar to how we work on mock exams to improve ourselves in an attempt to achieve higher grades on the real exam, we use mock instances – better known as validation data – to infer how we can expect to do when the model is placed in the real world.

If you think about how we improve using mock exams, it generally consists of identifying what we got wrong and revisiting those subjects. Likewise, if we can identify the areas in which our Machine Learning model makes errors, we can revisit those areas and make adjustments to improve our "mock score", which in turn implies we will achieve a better result when the model meets the real world.

Therefore, before we can begin to understand the types of errors our Machine Learning model is making, we must have already made some attempt at prediction – we need to take the exam first. Hence, many experienced practitioners advise beginners to start with a simple algorithm that can be implemented quickly and tested on validation data.

This first implementation does not have to be great. As long as we can make predictions and gain some insight into our data, we can begin to improve our model systematically.

An example of a quick and dirty implementation is the beginning of my Twitter Sentiment Analysis project.

Note: For the rest of this article, I will be making references to this project. Please familiarise yourself with what was done by reading the article below before carrying on with this post.

Predicting Tweet Sentiment with Machine Learning

In this project, the aim is to predict whether a tweet is about a disaster or not given the tweet, location, and keyword (although not all tweets are accompanied by location or keyword).

"Let evidence guide you on where to spend your time rather than use gut feeling which is often wrong" – Andrew Ng

Learning Curves

In machine learning, a learning curve (or training curve) shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much a machine learning model benefits from adding more training data and whether the estimator suffers more from a variance error or a bias error (Source: Wikipedia).

Learning curves are extremely useful as a sanity check that your Machine Learning algorithm is working correctly, and as a guide when you want to improve its performance. We will apply a learning curve to the classifier we used for Twitter sentiment classification to better understand our model.

Given that we have a training and a validation set, it's important to understand how the classifier performs as the amount of training data changes. A learning curve is therefore a plot of the model's performance (often measured as a score or cost) on the training and validation sets against the number of training examples used.

Note: I prefer to do all analysis and exploratory work in a Jupyter notebook, as I've done to generate the outputs shown in Figure 1. See GitHub for the code.
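As a rough illustration of how a plot like Figure 1 can be produced, here is a minimal sketch using scikit-learn's learning_curve. This is not the exact project code: the bag-of-words LogisticRegression pipeline is only a stand-in for whichever classifier you are evaluating, and X_text and y are assumed to hold the processed tweets and their labels.

# A minimal learning-curve sketch (stand-in pipeline; X_text, y assumed to exist)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline

# stand-in model: a simple bag-of-words classifier
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# score the model on increasing fractions of the training data
train_sizes, train_scores, val_scores = learning_curve(
    model, X_text, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="f1", n_jobs=-1,
)

plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="Training score")
plt.plot(train_sizes, val_scores.mean(axis=1), "o-", label="Validation score")
plt.xlabel("Number of training examples")
plt.ylabel("F1 score")
plt.legend()
plt.show()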

Figure 1: Learning Curve on the Twitter Sentiment Data (Image by Author)

From the learning curve, we can see our model is not doing particularly well on the training data and is doing considerably worse on the validation data. The large gap between the training score and the validation score suggests the model suffers from high variance. One common remedy is to add more training data; however, since this is a Kaggle dataset, adding more data isn't straightforward (though in the real world it may well be a useful way to overcome this problem).

However, after some analysis of my data, I realized the error was in the preprocessing I had done. I attempted to concatenate the three text columns (text, location, keyword) into one column containing all of our text, but the location and keyword columns have missing values. When I added those columns to the text column without first dealing with the missing values, the concatenation produced missing values for those rows, effectively deleting a number of tweets – see Figure 2.

import pandas as pd

# read train data
df = pd.read_csv("../inputs/train.csv")
# shuffle data
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
# create new column "all_text" -- BUG: keyword/location contain NaN,
# so the concatenated result is NaN for those rows
df["all_text"] = df["text"] + df["keyword"] + df["location"]
# split into features and labels
X = df.drop(["text", "keyword", "location", "target"], axis=1)
y = df["target"]
# process tweets (process_tweet is defined earlier in the project)
X["all_text"] = X["all_text"].apply(process_tweet)
X.head()
Figure 2: Erroneous way of adding the columns together, since some columns have missing values. This affects how the model learns because it has less data to learn from. (Image by author)

Therefore, it is likely we deleted too much data, which made it difficult for the model to learn well. We can easily repeat the process, but this time correct how we join the columns by filling the missing values with a "none" marker – see Figure 3 for the result.

# new empty df 
X_new = pd.DataFrame()
# new features with correctly joined columns
X_new["all_text"] = df["text"] + df["keyword"].fillna("none") + df["location"].fillna("none")
# process tweets
X_new["all_text"] = X_new["all_text"].apply(process_tweet)
X_new.head()
Figure 3: We have corrected how we join the columns by filling the missing values with "none". Notice the difference between this and Figure 2. (Image by Author)

But the all-important question is: did this slight change improve our model? We can answer this by checking in with our learning curve once again.

Figure 4: Learning curve after fixing the data deletion error (Image by Author)

You bet it improved the model! The model now performs much better on both the training and the cross-validation data. However, there is still a large gap between the training and cross-validation scores. At this point, many Kagglers would rush to more powerful models. That is not necessarily bad, but in the real world you may have limited resources and need a model that can work in production, which is not the case for many of the winning solutions on Kaggle.

Instead, we will first perform something known as error analysis to get a better look at what we can adjust to improve our model's fit on new instances.

Note: The small preprocessing change we made took our public leaderboard score from 0.71008 to 0.79374.

For a more in-depth tutorial about Learning curves, I recommend viewing "Learning Curves for Diagnosing Machine Learning Performance" by Jason Brownlee at Machine Learning Mastery.

Error Analysis

Error analysis is the process of manually examining the examples in our validation data that our model predicted incorrectly. The goal is to identify systematic trends in the errors the model made.

To illustrate error analysis, we can return to the exam-preparation analogy above. Say we have just completed a mock maths exam, and on marking our work we see we scored 40/60. We would now go through the paper, see what we got wrong, and sort the mistakes into categories:

  • Linear Algebra – 60%
  • Statistics – 25%
  • Calculus – 15%

The breakdown above tells us that among the 20 errors we made, 15% were to do with Calculus, 25% with Statistics, and 60% with Linear Algebra. Based on this information, what would you seek to improve first?

Hopefully, you said you would spend your time improving your Linear Algebra skills, because improving there would give a more significant boost to your overall score on the next mock paper than improving Statistics or Calculus first.

Therefore, in a Machine Learning setting, error analysis is what provides direction on where to go next when you are attempting to improve your model. You simply identify the systematic trends in the errors the model made on the validation data (e.g. tweets with hashtags were always classified a certain way, or tweets with URLs were always classified incorrectly).

Let's do some error analysis using our Twitter sentiment data. The first thing I will do is take 100 misclassified instances (20 from each fold). Next, I will create a DataFrame, which I will save as a CSV file, containing the original tweet, the processed tweet, the actual label, and the predicted label.
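Below is a minimal sketch of how such a table might be assembled; it is not the exact project code. make_model is a hypothetical factory that returns a fresh text-classification pipeline, and X_new["all_text"], y, and df refer to the objects created in the earlier snippets.

# Hedged sketch: collect misclassified validation tweets across CV folds
# (make_model is a hypothetical factory for a fresh classification pipeline)
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def collect_misclassified(make_model, texts, y, df, per_fold=20, n_splits=5):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    rows = []
    for train_idx, val_idx in skf.split(texts, y):
        model = make_model()
        model.fit(texts.iloc[train_idx], y.iloc[train_idx])
        preds = pd.Series(model.predict(texts.iloc[val_idx]),
                          index=texts.index[val_idx])
        wrong = preds.index[preds.values != y.iloc[val_idx].values]
        # keep up to `per_fold` misclassified examples from this fold
        for i in wrong[:per_fold]:
            rows.append({
                "original_tweet": df.loc[i, "text"],
                "processed_tweet": texts.loc[i],
                "actual": y.loc[i],
                "predicted": preds.loc[i],
            })
    return pd.DataFrame(rows)

errors_df = collect_misclassified(make_model, X_new["all_text"], y, df)
errors_df.to_csv("misclassified_tweets.csv", index=False)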

Figure 5: Misclassified Tweets (Image by Author)

Please note that there may be some offensive language used in some of the tweets.

After that, I will manually examine the errors and group them under separate headings. We do this by looking at each error against various factors, such as the type of tweet (e.g. news report, experience, opinion) and the kinds of mistakes that, if corrected, we believe could improve the algorithm (e.g. metaphorical language, spelling mistakes, sarcasm).
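Before (or alongside) the manual pass, a quick programmatic check for surface patterns, such as the hashtags and URLs mentioned earlier, can give a cheap first signal. Here is a hedged sketch, reusing the hypothetical errors_df table from the snippet above:

# Illustrative only: fraction of misclassified tweets containing common surface patterns
import re
import pandas as pd

def tag_patterns(text):
    return {
        "has_url": bool(re.search(r"https?://\S+", text)),
        "has_hashtag": "#" in text,
        "has_mention": "@" in text,
    }

pattern_counts = errors_df["original_tweet"].apply(tag_patterns).apply(pd.Series).mean()
print(pattern_counts)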

As an example, I've done this manual categorization with 50 randomly selected misclassified instances (ideally you would do it on more), based on the type of tweet. Here are my findings:

  • Other – 13 (26%)
  • Opinion – 3 (6%)
  • Mislabelled – 16 (32%)
  • Experience – 7 (14%)
  • News – 9 (18%)

32% of the 50 misclassified samples we took appear to be mislabelled. That is, in my own opinion, they either do not come across as a disaster yet are labelled as one, or they read as a disaster yet are labelled as not one.

For example…

"I liked a @YouTube video http://t.co/z8Cp77lVza Boeing 737 takeoff in snowstorm. HD cockpit view + ATC audio – Episode 18snowstormPorthcawl"

…is labelled as being a disaster, but a human wouldn't instantly class this as a disaster.

Another example is…

"@PrablematicLA @Adweek I’m actually currently dressed for a snowstorm…despite being in the middle of a Texas summer. Thanks office A/C.snowstormAustin, TX".

This is marked as being a disaster, but we can easily tell it is not. It may feel like a disaster to the person making the tweet, but in the general public's eyes it is not a disaster one would be concerned about.

The data we are using is a Kaggle dataset, so going back to relabel instances won't necessarily benefit us, since the test data we have no access to would presumably be labelled in the same way. However, in a real-world scenario, I'd definitely seek clarity on the definition of the term "disaster" in the context we are using it. Based on that definition, we may have to relabel the data, or instead look at the second most common type of tweet our algorithm made errors on and begin improving the algorithm's ability on those.

Error analysis is very beneficial for determining what sort of errors our algorithm is making, and hence it provides good insight into which improvements would give the most significant boost to our model. On the other hand, error analysis cannot tell us whether something like stemming, or the way we vectorize the text, is beneficial to the final solution; the only way to find out is to try the change and evaluate it against our evaluation metric.

With that being said, let’s attempt to use TF-IDF to convert our text into vectors.
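Here is a minimal sketch of such an experiment; it is not the project's exact code. The LogisticRegression classifier and F1 scoring are stand-ins for whichever model and metric the project uses, and X_new["all_text"] and y come from the earlier snippets.

# Hedged sketch: swap in TF-IDF features and score with cross-validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

tfidf_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(tfidf_model, X_new["all_text"], y, cv=5, scoring="f1")
print(f"Mean CV F1 with TF-IDF: {scores.mean():.4f}")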

Figure 7: Score with TF-IDF vectorization

This didn't improve our score of 0.79374, so I would not continue with TF-IDF as it is, though I might try TF-IDF without stemming, and so on.

If you enjoyed this post, you’d also like:

Using Machine Learning To Detect Fraud

How To Make Your Data Science Projects Stand Out

Getting Started With Sentiment Analysis

Wrap Up

You've now seen the iterative nature of Machine Learning in practice. I personally like to use evidence as much as possible when attempting to improve my machine learning models: I try to understand the bias and variance of the model with learning curves, then use error analysis to allot my time to the task that would provide the most significant improvement to the model.

Thank you for reading to the end, feel free to shoot me a message on LinkedIn:

Kurtis Pykes – Data Scientist – Freelance, self-employed | LinkedIn

If you are interested in starting your own blog, subscribe to me on YouTube for all my best tips on how to get started:

Kurtis Pykes

