The optimal subgroup puzzle

Jonathan Harel
Towards Data Science
7 min read · May 12, 2018


Harnessing the power of machine learning algorithms to solve unusual problems

These days, many career shifts are taking place as more and more computer scientists, statisticians, mathematicians, and others enter the world of data science. Being a junior data scientist myself, I meet a lot of people at the beginning of their journey, just like me. It recently occurred to me that many of the juniors I meet share a certain mindset: if a problem can't be solved by a known model, then it's probably a major-league problem that should be left to the pros. Personally, I think that even the tools we own as junior data scientists are extremely powerful, and I'd like to prove it with an interesting problem that I faced.

I first met the optimal subgroup problem while working in the ad-tech industry. I was a junior data scientist, and among my team's responsibilities (the team was just me back then, actually) was the segmentation process. Simply put, given a set of features describing a person, we had to determine whether that person likes shoes, horses, sports, etc. But even though we achieved high accuracy, and in some cases even had features mapping directly to segments, the business guys were not happy with our results:

“We can’t sell some segments to advertisers, they aren’t market appealing”.

They further explained that while they completely trust our segmentation system, they can’t sell the sports fans segment, for example, if it contains 70% women. “If the segment’s statistics don’t make sense to the advertisers, they won’t buy it”.

So how do we deal with this issue? Let’s continue with the sports fans example: we have self-reported data from many individuals, so we are very confident about our labels. The thing is, our data is distributed differently than the population, meaning that our women-to-men ratio (or any other ratio for that matter) might be different than the real-world one. For example, let’s look at age distribution in the US according to our data and according to Wikipedia:

It is clear that our data is distributed differently. Moreover, we also see the effect of bad data on this graph, as it is highly unlikely that we know about ~15 million more people aged 55 than Wikipedia. Anyhow, the problem is out there, and not only is it relevant for age, it is also relevant for gender, income, and many more features. So in order to keep using our segmentation models, and meet business needs, we need to solve the optimal subgroup problem:

Given a group of data points, G, that has distribution P over the set of features F, find the largest subgroup, G’, that has a wanted distribution P’ over F.

Keep in mind that the size of G’ affects P’ directly. It is possible that a certain G’ will match P’ exactly, but if this same G’ contains five people, then we can’t really sell this group, can we?

So how do we tackle this?

The problem we face has no straightforward, scikit-learn-implemented solution. But using the basic yet amazing tools from our data science toolbox, an elegant solution can be crafted. Let's think about the optimal solution: if a little birdie knew it, the birdie could tell us whether each person is included in or excluded from G'. So maybe we can transform this problem into a simpler, binary classification problem. Since we eventually want to optimize G', it seems natural to choose logistic regression, which uses gradient descent to find the solution that minimizes the error.

The problem is that our data is not labeled, so we will have to define a loss function that fits our needs. We defined the loss function like this:

Why didn't we just add up the two types of error? Because, business-wise, we favored accurate feature ratios over segment size. With the formula above, if the segment-size-error is small, the loss function still depends on the feature-ratios-error. On the other hand, if the feature-ratios-error is small, the loss function is barely affected by the segment-size-error.

Now let’s define the individual errors using the MSE estimator:

The definition for segment-size-error
The definition for feature-ratios-error
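The formulas above appeared as images in the original post, so the exact forms below are assumptions. Here is a minimal NumPy sketch consistent with the prose: MSE-style terms, combined so that a small feature-ratios-error damps the influence of the segment-size-error (the function and argument names are made up for this illustration):

```python
import numpy as np

def subgroup_losses(y, X_F, target_ratios, target_size):
    """Hypothetical reconstruction; the post's formulas were images.
    y        -- soft membership scores in [0, 1], one per person
    X_F      -- binary matrix of the features in F (n_people x n_features)
    """
    size = y.sum()
    # segment-size error: squared relative deviation from the wanted size
    e_size = ((size - target_size) / target_size) ** 2
    # feature-ratios error: MSE between achieved and wanted ratios
    ratios = (X_F.T @ y) / size
    e_ratios = np.mean((ratios - target_ratios) ** 2)
    # one plausible combination matching the description: when e_ratios is
    # small, e_size has little effect; when e_size is small, e_ratios remains
    e_total = e_ratios * (1.0 + e_size)
    return e_total, e_ratios, e_size
```

A group of two people (out of four) hitting both the wanted ratio and the wanted size would score zero on all three terms.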

We will proceed just like general regression models: first we'll calculate who's in G' using a set of weights, then we'll use the error derivative to update that same set of weights. But before we move on, let's think about the structure of the data we want to work with. The right features will help us speed up calculations and extend the use of vectorization, as you'll see next.

We manipulated our input data, X, so that among other features, it also includes the binary form of all the features in F, the features whose distribution we are trying to optimize. For example, the feature age was divided into the buckets whose distribution we cared about: age_18_24, age_25_30, etc. (one-hot encoding). Another example: the feature gender was transformed into the binary feature is_male.
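As a concrete illustration (the column and bucket names here are made up for the example), this transformation can be done with pandas:

```python
import pandas as pd

# toy input: three people with a categorical age bucket and a gender column
people = pd.DataFrame({
    "age_bucket": ["18_24", "25_30", "18_24"],
    "gender": ["male", "female", "male"],
})

# one-hot encode the age buckets into binary columns age_18_24, age_25_30, ...
X = pd.get_dummies(people, columns=["age_bucket"], prefix="age")
# turn gender into a single binary feature
X["is_male"] = (X.pop("gender") == "male").astype(int)
print(X.columns.tolist())  # ['age_18_24', 'age_25_30', 'is_male']
```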

Why would we want these binary features? Because they allow us to easily calculate the wanted ratios, and do it fast. Suppose we had the label vector y, then calculating each feature ratio reduces to this vectorized form:

X at feature represents the feature column in X
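In code, that vectorized form is a single dot product (a sketch, with `y` assumed to hold soft membership scores):

```python
import numpy as np

def feature_ratio(X, y, feature):
    """Fraction of the (soft) segment that has the binary feature set:
    (X[:, feature] . y) / sum(y)."""
    return X[:, feature] @ y / y.sum()

# toy data: 4 people, column 0 is a binary feature such as is_male
X = np.array([[1.0], [1.0], [0.0], [0.0]])
y = np.array([1.0, 0.5, 0.5, 0.0])  # soft membership in G'
print(feature_ratio(X, y, 0))  # 1.5 / 2.0 = 0.75
```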

How do we get the label vector, y? Similar to the logistic regression model:

For person i, y is the logistic function applied to f(w)
f(w) gives each person a “score” based on his/her features
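As a sketch, taking f(w) to be a linear score per person (an assumption; the post shows the formula only as an image):

```python
import numpy as np

def membership_scores(X, w):
    """y_i = logistic(f_i(w)), with f_i(w) = X[i] . w assumed linear."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

# with zero weights, every person gets a neutral score of 0.5
print(membership_scores(np.zeros((3, 2)), np.zeros(2)))  # [0.5 0.5 0.5]
```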

At last, we are ready to find our weights!

Calculating the error derivative

Let’s take the derivative of E_total w.r.t the weights vector, w:

E = Error, t = total, f = feature-ratios, s = segment-size

The segment-size-error derivative is really simple, so I won’t describe it here. I still want to show the feature-ratios-error derivative, as it can be vectorized nicely using our binary features. By the quotient rule, the derivative of the feature-ratios-error, for a specific feature a, over all data points i equals:

If we define:

We get that the derivative (w.r.t feature a) equals:

The feature-ratio error derivative for specific feature a. The sums are over all data points.

So this is really cool, but in fact it can be vectorized even further if we replace X at feature a with X at all the features we care about, F.
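Putting the pieces together, here is a hedged end-to-end sketch of the training loop. This is not the author's code: the multiplicative loss combination, the linear score inside the logistic function, and all names are assumptions carried over from the sketches above; only the quotient-rule step for the ratio derivative follows the derivation in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_subgroup(X, F_idx, target_ratios, target_size,
                 lr=0.5, n_iter=2000, seed=0):
    """Hypothetical sketch: gradient descent on
    E = E_ratios * (1 + E_size), an assumed form of the post's loss."""
    n, d = X.shape
    w = np.random.default_rng(seed).normal(scale=0.01, size=d)
    X_F = X[:, F_idx]                      # binary columns of the features in F
    for _ in range(n_iter):
        y = sigmoid(X @ w)                 # soft membership scores
        S = y.sum()
        T = X_F.T @ y                      # per-feature (soft) counts inside G'
        r_err = T / S - target_ratios
        e_ratios = np.mean(r_err ** 2)
        s_dev = (S - target_size) / target_size
        e_size = s_dev ** 2
        dy = y * (1.0 - y)                 # d y_i / d f_i for the logistic
        # quotient rule: d(T_a / S) / d y_i = (x_ia * S - T_a) / S**2
        d_ratio_dy = (X_F * S - T) / S ** 2
        g_ratios = (2.0 / len(F_idx)) * X.T @ (dy * (d_ratio_dy @ r_err))
        g_size = (2.0 * s_dev / target_size) * (X.T @ dy)
        # gradient of E = E_ratios * (1 + E_size) by the product rule
        w -= lr * (g_ratios * (1.0 + e_size) + e_ratios * g_size)
    return w, sigmoid(X @ w)
```

On toy data this drives the achieved ratios toward the targets while pulling the soft segment size toward the wanted one; it is a sketch of the technique, not a production implementation.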

Results

When we started working on this project, we did not define a segment-size error. This caused the number of people in the segment to drop by almost half, but the ratios looked great:

The first run of the algorithm, no segment-size error was defined.

Following these results, we added the segment-size-error to the feature-ratios-error and got these numbers:

Too many people, feature-ratios accuracy dropped sharply

Well, this is obviously too aggressive. Finally, we came up with the formula given above and got these results, somewhat of a sweet spot between quality and quantity:

One thing to notice about all of the results is how the algorithm struggled with young and old people. Our data distribution was highly skewed towards older people, and this can be clearly seen in the results. But hey, the business guys were happy with the ratios we got out of our skewed data, and we reached convergence on 200K people in less than a minute!

To sum things up, I see junior data scientists out there sticking to the known-and-implemented. That isn't bad, obviously, but don't be afraid to use the tools you own to solve non-trivial problems. It just might work!
