Logistic Regression For Facial Recognition

Confusion matrices and ROC-AUC curves with sklearn

Audrey Lorberfeld
Towards Data Science


**ATTENTION**: This post is no longer up-to-date and may contain errors. It is preserved for transparency. The updated version can be found here.

Facial recognition algorithms have always fascinated me, and wanting to flex my newfound logistic regression skills on some data, I created a model based on a dataset I found called “Skin Segmentation.”

As noted in its description, the data in Skin Segmentation were collected “by randomly sampling B,G,R values from face images of various age groups (young, middle, and old), race groups (white, black, and asian), and genders obtained from FERET database and PAL database.” The dataset has 245,057 rows and 4 columns (B, G, R, and a binary column indicating whether each sample was classified as containing skin or not). That binary column made this dataset ripe for logistic regression.

(If you want to read the two (pay-walled) scholarly articles that used this dataset, see here and here.)

Onto The Regression

My wheelhouse is Python, and I particularly wanted to test out the scikit-learn library since it is the industry standard for scientific tasks like regression. I also opted to use Jupyter Notebook as my IDE over text editors like Atom because of its low barrier to entry; with its narrative structure, Jupyter lets almost anyone get a feel for how code works.

To run this code on your own, you will need Python 3 with scikit-learn, pandas, NumPy, matplotlib, seaborn, and imbalanced-learn (imblearn) installed on your machine or in your virtual environment; itertools ships with the Python standard library. (A requirements.txt file is coming to the repo soon!)

You can find the repo for this project here. There, you can find an in-depth Jupyter notebook outlining all the steps summarized in this post.

Munging

The data munging I had to do on this dataset was fairly straightforward. After reading in the text file via URL and saving it to a Pandas dataframe, you can see that its only real issue is that it lacks column headers.

Original dataframe

After pushing down the top row into the dataframe and renaming the column headers, I was able to get to this:

Munged dataframe
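For anyone following along without the notebook, a minimal sketch of that munging step might look like this. The URL, separator, and column names below are my assumptions based on the UCI dataset description, not the notebook’s exact code:

```python
import pandas as pd

# The Skin Segmentation data ships as a headerless, tab-separated text file.
# The URL and column names here are assumptions based on the UCI description.
url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "00229/Skin_NonSkin.txt"
)
df = pd.read_csv(url, sep="\t", header=None,
                 names=["B", "G", "R", "skin_or_no_skin"])

print(df.shape)   # expect (245057, 4)
print(df.head())
```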

In this dataset, 1s indicated pixels with no skin in them, and 2s indicated pixels with skin in them.

As you might anticipate, the 1s and 2s are helpful, but we really want 0s and 1s for logistic regression. So, I created dummies for the Skin or No-Skin column. (I also coerced the datatype to int64 to be in line with the other columns in the dataframe.)

Making dummies and coercing datatype.

As you can see, I turned this Skin or No-Skin column into my y-variable (my dependent variable). The other columns I turned into a matrix (X).

X matrix
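Here is roughly what that recoding and split could look like in code; the column names follow the sketch above and may not match the notebook exactly:

```python
# Recode the 1/2 labels into a 0/1 dummy column and coerce it to int64.
# With drop_first=True, the surviving dummy is 1 where the original label
# was 2 and 0 where it was 1.
dummies = pd.get_dummies(df["skin_or_no_skin"], drop_first=True)
df["skin"] = dummies.iloc[:, 0].astype("int64")

# Dependent variable and feature matrix.
y = df["skin"]
X = df[["B", "G", "R"]]
```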

In my X matrix, I also wanted to get all my variables on the same scale, so I ran them through a min-max normalizer and got them all on a 0–1 scale.

Normalizing X values
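A min-max scaler from scikit-learn is one straightforward way to do that; the sketch below assumes the X matrix from the previous step:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale B, G, and R so each column lies on a 0-1 scale.
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
```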

Train-Test-Split & Building The Model

After getting my data into a workable state, I split it into test and train sets using sklearn. I also created my model, fit it to my training data, and then ran my model on both my train and test data to see how well it performed.

Train-test-split
Creating and fitting my model
Running the model on my training and test data
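In code, those three steps look roughly like this (the split ratio, random_state, and solver are my assumptions, not the notebook’s exact settings):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hold out a test set; the 25% split and random_state are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=42
)

# Fit a logistic regression model and predict on both splits.
logreg = LogisticRegression(solver="liblinear")
logreg.fit(X_train, y_train)
y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)
```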

Evaluating Performance, Part 1: Simple Readouts

I looked at the difference between my predicted values (y_hat_train and y_hat_test) and my original y_train and y_test values to evaluate my model’s accuracy.

Evaluating initial performance

In the readouts, the number of 0s is the number of predictions that matched the true values, and the number of 1s is the number of incorrect predictions. Passing normalize = True simply turns the counts in the first part of the readout into percentages.
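A quick sketch of that readout, assuming the y_hat_train and y_hat_test variables from the sketch above:

```python
# A residual of 0 means the prediction matched the true label; 1 means it did not.
train_residuals = abs(y_train - y_hat_train)
print(train_residuals.value_counts())                # counts of correct vs. incorrect
print(train_residuals.value_counts(normalize=True))  # same readout as proportions

test_residuals = abs(y_test - y_hat_test)
print(test_residuals.value_counts(normalize=True))
```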

You can see that, on our training data, our model was accurate 90.7% of the time, while on our test data it was accurate 90.6% of the time. Pretty good!

Evaluating Performance, Part 2: Confusion Matrix

I also wanted to see how my model did in terms of a confusion matrix. (For a great read on confusion matrices, see here.)

Confusion matrix for initial model

As you can see in the comments, here 0s are pixels classified as having no skin and 1s are pixels classified as having skin.
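Generating the matrix itself is a one-liner with sklearn.metrics; this sketch reuses the test predictions from above:

```python
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels (0 = no skin, 1 = skin).
cnf_matrix = confusion_matrix(y_test, y_hat_test)
print(cnf_matrix)
```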

In order to make my confusion matrix’s values a bit more understandable, I then ran a Classification Report in order to get some common metrics like precision, recall, and f1-score.

Classification Report from sklearn.metrics
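For reference, the report comes straight out of sklearn.metrics as well:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, and f1-score, plus overall accuracy.
print(classification_report(y_test, y_hat_test))
```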

You can see here that our model’s precision is 96%, its recall is 92%, and its f1-score is 94%. I also wanted to know my accuracy rate, so I simply calculated it manually and got 91% (the same number we got with our test data in the simple readouts section above).

For a nice representation of what each score means, see the diagram below.

Recall and precision mapped onto a confusion matrix

Evaluating Performance, Part 3: ROC-AUC Curve

ROC-AUC curves are a helpful evaluation tool because they plot the true positive rate against the false positive rate, with the AUC (area under the curve) summarizing that trade-off in a single number.

I generated an ROC-AUC curve for both my train and test data, but they seemed essentially the same, so I will only show the train image here. (The AUC for the training data was 93.4%, while the AUC for the test data was 93.2%.)

Creating my ROC curve
ROC-AUC curve for training data.
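A minimal version of that plot, assuming the fitted logreg model from the earlier sketch (the notebook’s exact plotting code may differ):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Score the positive (skin) class, then plot TPR against FPR.
y_score_train = logreg.predict_proba(X_train)[:, 1]
fpr, tpr, thresholds = roc_curve(y_train, y_score_train)
print("Train AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr, label="Logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```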

More on ROC-AUC curves and their purpose can be found here and here.

As you can see, our ROC curve is hugging the top, left corner of our graph, which is what we want.

Rewinding A Bit: ROC-AUC Curves & Class Imbalance

One of the most useful things about ROC-AUC curves is that they allow us to evaluate models even when there is a class imbalance within the original dataset.

Class imbalance occurs when one class appears far more often in your data than another, and it can really mess up your analysis. Take, for instance, a skewed dataset of rare-disease cases (i.e. one with class imbalance issues). Let’s say there are only 2 positive cases in 1,000. Even a useless model that classifies everything as negative will achieve an accuracy rate of 99.8% (i.e. 998 out of 1,000 classifications were correct). So, you need more context to truly evaluate your model when it’s being run on an imbalanced dataset.

So, let’s rewind a bit and see if our data is imbalanced. Maybe we can make our model better.

Class balance for original data

We can see here that our data is a little out of whack: about 79% is classified as 1s (i.e. pixels containing skin) and only about 21% as 0s (i.e. pixels not containing skin).
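For reference, checking that balance is a one-liner on the recoded target column from the earlier sketch:

```python
# Proportion of each class in the full dataset.
print(y.value_counts(normalize=True))
```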

So, before we call it a day, let’s run something called SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates synthetic samples of the minority class to even out the balance (in our case, that means more 0 data points). Although creating synthetic data might seem like cheating, it’s common practice in data science, where datasets are rarely well balanced.

Running SMOTE on our original data and re-generating our confusion matrix and our Classification Report, we get the following.

SMOTE confusion matrix
Classification Report for SMOTE-ized data, showing both training and test readouts
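A sketch of the SMOTE step using imbalanced-learn; the random_state and the choice to resample only the training split are my assumptions, not necessarily the notebook’s settings:

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Oversample the minority class in the training data, then refit and
# re-evaluate on the untouched test set.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

logreg_smote = LogisticRegression(solver="liblinear")
logreg_smote.fit(X_train_res, y_train_res)
y_hat_test_smote = logreg_smote.predict(X_test)

print(confusion_matrix(y_test, y_hat_test_smote))
print(classification_report(y_test, y_hat_test_smote))
```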

We can see that these numbers are significantly different from our original numbers. Recall that the first time around we had a precision of 96%, a recall of 92%, and an f1-score of 94%. After running our data through SMOTE, we have a precision of 85%, a recall of 92%, and an f1-score of 89%. So, our model actually got worse after trying to compensate for class imbalance.

As data scientists, it’s up to us to decide when to run things like SMOTE on our data and when to leave the data as-is. In this case, we can see that leaving our data as-is was the way to go.

So, What Does This All Mean?

Besides being a good learning opportunity, this project really did produce a good model. We also now know that doing other things to our data, such as trying to compensate for class imbalance, is definitively not what we should do in this case. Our original model was our best performer, identifying skin pixels with an f1-score of 94%.

While this might sound like a good classification rate, there are myriad ethical issues that come with creating algorithms used for facial recognition. What if our model was going to be used to identify wanted terrorists in a crowd? This is certainly different than recognizing a face for a filter in Snapchat.

From Christoph Auer-Welsbach’s post “The Ethics of AI: Building technology that benefits people and society”

This is where we as data scientists need as much information as possible when creating algorithms. If our model were to be used for Snapchat filters, we’d likely want to optimize towards upping our recall score — it’s ostensibly better that we mistakenly identify things as faces when they’re not faces than only recognize real faces some of the time.

On the other hand, if our model were to be used for spotting wanted criminals in a crowd, we’d likely still want to optimize towards recall, but the consequences would be vastly different. Is it worth bringing innocent people in for questioning and potentially violating their rights if it meant that you’d catch wanted criminals almost 100% of the time? Perhaps not.

We need to think carefully about these questions when playing around in the world of AI. As Cathy O’Neil writes in Weapons of Math Destruction, constructing algorithms requires “moral imagination, [which] only humans can provide.”
