Machine Learning Introduction: Applying Logistic Regression to a Kaggle Dataset with TensorFlow

Luciano Strika
Towards Data Science
9 min read · Sep 18, 2018


Machine Learning can create complexly beautiful systems. Source: Pixabay.

Machine Learning is the perfect dessert after a good couple days of Feature Engineering and Exploratory Analysis.

If you’ve been following my previous posts, you’ve read this one, and this one, where I analysed this Kaggle Dataset. I looked at the history of the Olympics’ events and their participants, narrowing the analysis down to a few aspects:

  • Medal winning.
  • Female and Male participation over time.
  • Height and Weight statistics of each sport’s athletes, given a gender.

Finally, I mentioned developing a Machine Learning model using the data could be interesting.

What tipped me off was seeing the strong correlation between an athlete’s Weight and Height and whether they were biologically male or female. It made me think it should be possible to model the prediction of an athlete’s sex as a classification problem.

Therefore, in my first ever Machine Learning article, we'll be looking at how to predict an Olympic athlete's sex (for the sake of brevity, I'll be using the word sex from now on to refer to biological sex). To achieve this, we'll be modeling the problem with the best features we can find.
We’ll use TensorFlow for the model’s architecture.

In this first installment, we'll look at a simple logistic regression, which actually achieved better performance than I expected. In the next articles we may look at more complex models, like Deep Neural Networks. However, I am a firm believer that we should start off by solving a problem with the simplest model we can possibly use, before complicating things further.

While I’ll be adding the relevant code snippets, I encourage you to follow along with the notebook in this GitHub Project, especially if you’d like to experiment with the code and actually run it.

But first, a few definitions.

Machine Learning

Machine Learning is a branch of artificial intelligence. It encompasses all kinds of algorithms, but their common thread is their ability to ‘learn’ from their inputs.

They usually require large amounts of data to actually learn anything, but can be very powerful if trained correctly. Among the many problems Machine Learning can solve (image synthesis, fraud prevention, recommendation systems), many fall under the category of classification: assigning one or more labels to a certain sample of the modeled data. In this particular case, we'll be assigning a label (Male or Female) to each sample of our data (each athlete).

Logistic Regression

Back in the ancient times (the ’50s), David Cox, a British Statistician, invented an algorithm to predict the probabilities of events given certain variables.

Logistic Regression assigns a certain probability (from 0 to 1) to a binary event, given its context.
To use it, we’ll first create the input vectors, where each vector corresponds to an athlete, and each of a vector’s fields is a (numerical) feature of that athlete (for instance, their Weight or Height).
We'll then try to predict the probability of a binary label being 1 or 0 (in our case, 1 could mean female and 0 male, for instance).

It does this by

  • Performing an affine transformation on the input features: multiplying them by a matrix and adding a bias vector to the product. We call the elements of the matrix and the bias vector the ‘weights’.
  • Composing that operation with the sigmoid function, which ‘crunches’ numbers from the whole real line into the (still infinite) interval between 0 and 1. This gives the result a notion of probability.
  • ‘Penalizing’ the model with a cost function (in our case, we’ll use cross entropy): if we want the model to learn a certain thing, we have to penalize it for not learning it.
  • Finding the gradient of that cost function (which we wish to minimize) with respect to the model’s weights.
  • Updating the weights by a small step, whose size is set by a constant called the ‘learning rate’, in the direction opposite the gradient, so the cost function decreases on the next iteration.

This cycle is performed over and over, iterating over the whole input set many times. Each full pass over the inputs is called an ‘epoch’.
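To make the cycle concrete, here is a minimal NumPy sketch of one such update step, assuming a feature matrix X (one row per athlete) and binary labels y. The names and the helper itself are illustrative, not the article's actual code.

```python
import numpy as np

def sigmoid(z):
    # 'Crunch' any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(X, y, w, b, learning_rate=0.01):
    p = sigmoid(X @ w + b)                                      # affine transformation + sigmoid
    cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))    # cross entropy
    grad_w = X.T @ (p - y) / len(y)                             # gradient w.r.t. the weights
    grad_b = np.mean(p - y)                                     # gradient w.r.t. the bias
    w -= learning_rate * grad_w                                 # small step against the gradient
    b -= learning_rate * grad_b
    return w, b, cost
```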

Eventually the weights converge to a value that is usually a local minimum of the cost function. It is, however, not guaranteed to be a global minimum.

This whole process is called ‘training’ the model. After the model has trained, we will look at how well it performs by measuring its accuracy: how many of its predictions were correct, divided by the total number of predictions it made. In statistical terms, the true positives plus the true negatives, over the whole set of predictions.
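In code, accuracy is just the fraction of correct predictions. With the sketch above it could be computed like this (X_test and y_test are a hypothetical held-out set):

```python
predictions = (sigmoid(X_test @ w + b) > 0.5).astype(int)
accuracy = np.mean(predictions == y_test)  # (true positives + true negatives) / total
```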

After this small introduction, I encourage you to read further on Wikipedia, and maybe a book. I particularly like O’Reilly editions.

Getting dirty: preprocessing the data.

We can finally start with the practical, fun part of this article. In order for the model to consume the data, we have to first decide which columns of our CSV are relevant, and turn them into numerical values. That includes categorical features like ‘Sport’.

Keeping the right columns

For this particular problem, I think the only columns we'll actually find useful are Height, Weight, Sport and Sex. I'll also add each athlete's Name, as we may want to check for uniqueness.

We do not want athletes that participated in more than one event biasing our data. To avoid that, we pick only one row for each of them. I assume their weight and height did not vary that greatly from event to event, and that there isn't a significant number of athletes trying a different sport each time. In order to pick unique athletes, we'll use our DataFrame's groupby method, and call the unique method.
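Here is one way such a deduplication could look, assuming the CSV has already been loaded into a DataFrame df with the Kaggle column names (I use first() for simplicity here; the notebook's exact call may differ):

```python
# One row per athlete: group by Name and keep the first occurrence.
# Dropping rows without Height or Weight is an assumption of this sketch.
df_unique = (df[['Name', 'Sex', 'Sport', 'Height', 'Weight']]
             .dropna(subset=['Height', 'Weight'])
             .groupby('Name', as_index=False)
             .first())

print(df_unique.shape[0], 'unique athletes,', df_unique['Sport'].nunique(), 'sports')
```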

We see we now have almost 100000 unique athletes in our Dataset, and they practice 55 different sports.

One-Hot Encoding for Categorical Columns

A column with 55 different categorical values would be hard to digest for a Machine Learning model. Because of that, we’ll use One-Hot Encoding on it. That means we’ll turn a vector with 55 possible different values into 55 binary vectors. Sounds complicated? It would be a bit of a bother to code that by hand. Luckily, Pandas has us covered with a built-in method. Here’s the One-Hot Encoding snippet.
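A sketch of that snippet (the notebook's variable names may differ slightly):

```python
import pandas as pd

# One binary column per sport; drop_first removes one category, which
# is then represented by a row of all zeros.
df_one_hot_encode = pd.get_dummies(df_unique['Sport'], prefix='Sport', drop_first=True)
```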

And just like that, df_one_hot_encode is a 100k-by-54 matrix with at most one non-zero value per row, and that value is always a 1. The drop_first argument makes it discard one of the categories, since ‘all columns are 0’ can stand for a category on its own.

Converting to NumPy Arrays

After the encoding is done, we add the Height and Weight columns to it, and of course our label.

Finally we’ll drop our DataFrame altogether and switch to NumPy arrays, since TensorFlow takes matrices as input. I’ll admit right away I am not as proficient in NumPy as in Pandas. If any of the following lines is too ugly please let me know and I’ll fix them.

This is how we convert our DataFrame into a matrix:
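Something along these lines (a reconstruction, not the notebook's exact code):

```python
import pandas as pd
import numpy as np

# Glue the one-hot Sport columns to Height and Weight...
df_features = pd.concat([df_one_hot_encode, df_unique[['Height', 'Weight']]], axis=1)

# ...and hand TensorFlow a plain NumPy matrix.
features = df_features.values.astype(np.float32)
```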

My avid readers may be wondering why I'm actually dropping the Sport columns. To be honest, I tried a few different approaches, but the sports only added noise to the model. If anyone can come up with a way to use them cleverly (probably grouping them into a few clusters by correlation, though I didn't try that today), please fork the project and do so. I'd love to see interesting results.

The other controversial thing I'm doing is using two label columns, one for males (binary) and one for females (1 minus the male column). To do a regression on more than one column, we'll use the softmax transformation, a generalization of the sigmoid for multi-class classification problems. It assigns a probability to each of the classes in a dependent manner, so they all add up to 1. It may be overkill, but it will make it very easy to adapt this code to predict over a wider range of classes (say, predicting sports…?) in the future.
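Here is a sketch of those two label columns, plus the softmax itself for reference (names and the helper are illustrative; the 'F'/'M' values follow the Kaggle CSV):

```python
# One column per class, [female, male]: each row is a one-hot label vector.
labels = np.stack([(df_unique['Sex'] == 'F').astype(np.float32),
                   (df_unique['Sex'] == 'M').astype(np.float32)], axis=1)

def softmax(z):
    # Exponentiate and normalize so the class probabilities sum to 1.
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract the max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)
```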

Actual Machine Learning: Logistic Regression in TensorFlow.

Defining our model

Any good TensorFlow program that doesn’t import a configuration file starts with bureaucracy. I promise it gets better.
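Here is a sketch of that bureaucracy in the TensorFlow 1.x API this article was written against (the variable names are mine, not necessarily the notebook's):

```python
import tensorflow as tf

num_features = features.shape[1]  # one input per feature
num_classes = 2                   # one output for male, one for female
learning_rate = 0.01              # how far each gradient step 'jumps'

# Placeholders for a batch of inputs and their labels.
x = tf.placeholder(tf.float32, [None, num_features])
y = tf.placeholder(tf.float32, [None, num_classes])

# The model's weights: a matrix and a bias vector, initialized to zero.
W = tf.Variable(tf.zeros([num_features, num_classes]))
b = tf.Variable(tf.zeros([num_classes]))

# Logistic regression: an affine transformation composed with softmax.
pred = tf.nn.softmax(tf.matmul(x, W) + b)

init = tf.global_variables_initializer()
```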

There we define the number of inputs our model will have (one for each feature) and the number of outputs (2, one for male and one for female). We also set up the learning rate (how far the gradient will ‘jump’ on each descent), define the model and initialize the variables for the weights.

An interesting thing to note is that TensorFlow variables are lazily evaluated: they won't have any value until we run them in a session. Printing them would only show that they're tf Variables, without actually getting the value. This allows for a lot of optimizations, as for instance many additions and subtractions could be compacted into a single operation.
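For instance (illustrative):

```python
print(W)  # shows something like <tf.Variable 'Variable:0' shape=(..., 2) dtype=float32_ref>, no values

with tf.Session() as sess:
    sess.run(init)
    print(sess.run(W))  # only now do we see the actual zeros
```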

In the following snippet, we deal with how the Machine Learning model will be trained, and evaluated.
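A sketch of that snippet (the small epsilon inside the log is my addition to avoid log(0); the notebook may handle it differently):

```python
# Cost: the mean of the cross entropy over all inputs.
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred + 1e-10), axis=1))

# Optimizer: plain gradient descent on that cost.
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Accuracy: fraction of rows where the most probable class matches the label.
correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
```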

We define accuracy, and set up an Optimizer (Gradient Descent). We also add our cost function (the mean of the cross entropy for all inputs).

Training our model

Finally, this beautiful snippet actually starts the training, and we can see our Logistic Regression in action.
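A sketch of what that training loop might look like, assuming the data has already been split into train_x/train_y and test_x/test_y (a possible split is sketched a bit further down); the epoch and batch counts are illustrative:

```python
epochs = 30
batch_size = 128

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(epochs):
        # Iterate the whole training set in mini-batches, one optimizer step each.
        for start in range(0, len(train_x), batch_size):
            batch_x = train_x[start:start + batch_size]
            batch_y = train_y[start:start + batch_size]
            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})
        # Accuracy at the end of each epoch, measured on the held-out set.
        test_accuracy = sess.run(accuracy, feed_dict={x: test_x, y: test_y})
        print('Epoch {}: test accuracy {:.3f}'.format(epoch + 1, test_accuracy))
```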

Notice how we simply iterate over the inputs on each epoch, and run the optimizer. Since TensorFlow operations are lazy, we have to evaluate them by calling run from a tf.Session, passing the values for the placeholders in a feed_dict argument (the Variables, after some trouble getting out of bed, are finally initialized by running init).

At the end of each epoch, we print the accuracy, the metric we've been trying to improve (the optimizer itself minimizes the cross entropy).

Notice we measure accuracy on a completely different dataset from the training one. This is done to detect overfitting: a Machine Learning model may start to simply memorize its inputs and repeat them back when asked, without actually learning to generalize from the data. The way I picture it is fitting a few points on a line with a cubic polynomial: sure, it'll fit them, but ask it where another point may be and it could point anywhere off the line.
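Holding out that separate set can be as simple as a shuffled split; a sketch (the notebook's split may differ):

```python
rng = np.random.RandomState(42)
shuffled = rng.permutation(len(features))
split = int(0.8 * len(features))   # 80% train, 20% test

train_x, train_y = features[shuffled[:split]], labels[shuffled[:split]]
test_x, test_y = features[shuffled[split:]], labels[shuffled[split:]]
```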

Conclusions and Insights

That was the whole code introduction, but now let’s do some actual Data Science.

I first ran the training feeding the model all 56 inputs. Results were pretty bad, with an accuracy just below 60%. The model was basically just guessing that everyone was male. It therefore performed only as well as the class imbalance (with 70% of the athletes being male) allowed. Not only that, but in the few cases where it predicted a female athlete, it was usually wrong. My theory is that the Sport inputs, being too sparse, only added noise.

I did away with them and tried a very simple model that only looked at Height and Weight. It overfit the data so quickly that it turned to predicting 100% males in just one epoch. Try it for yourself, you'll see it's true.

At first I thought I must be initializing the variables wrong or something, so I passed it an all-female training set. It then just started predicting 100% females, so it was clear it was actually learning.

I tried normalizing the data, and I finally saw some results. However, it was barely better than predicting all males again.

Finally, I resorted to an old trick. I fed the model new, non-linear features: not just the Weight and Height, but also their ratio, their product, and their squares and cubes, all normalized.
The model achieved 76% accuracy. This means it has to be guessing both sexes correctly in some cases (better than everything else I achieved today), and is actually a lot better than a random guess.

To generate those features, I used the following snippet:
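(What follows is a reconstruction of that snippet from the description above; the notebook's version may differ in names and details.)

```python
def normalize(column):
    # Zero mean, unit variance.
    return (column - column.mean()) / column.std()

height, weight = df_unique['Height'], df_unique['Weight']

# Height, Weight, their ratio, product, squares and cubes: all normalized.
features = np.stack([normalize(height), normalize(weight),
                     normalize(height / weight), normalize(height * weight),
                     normalize(height ** 2), normalize(weight ** 2),
                     normalize(height ** 3), normalize(weight ** 3)],
                    axis=1).astype(np.float32)
```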

I hope you’ve enjoyed this small introduction to the huge world of Machine Learning. In the future, we may tackle this problem with a different model altogether (hint: think deep), and maybe look at the models’ performance in a more visual way.

As usual, the code is available on the GitHub project. If you want to improve it, or add any other features to it, feel free to do so. My code is your code.

If there’s any part of this article you didn’t find clear enough, or simply have a better way of explaining any of these things, please let me know. Your feedback is very helpful to me, and means a lot.
If you’re new to Machine Learning, I hope you’ve found this introduction useful, or at least interesting.
I also hope to see you again soon.

Follow me for more Data Science and Machine Learning tutorials. If you haven't done so already, please follow me on Twitter to find out when I write a new article.

Originally published at www.dataden.tech on September 18, 2018.
