
Sometimes, I want answers, not constraints.


Creating challenges in a data science world

I love to participate in code challenges. There are tons of ’em online, each with its own strengths and weaknesses. Most of them focus on data structures and concern themselves with _Big O_ for speed and memory. However, my main occupation is data science and AI, not software engineering.

Sure, there’s Kaggle, and data and problems can be found everywhere, but none of that provides the quick "1-question challenge" format I love so much about software engineering challenges.

Anyway, I decided to make something myself.

Data playground

So here’s the idea: a one-button dataset generator that comes with a question, e.g. which feature in X is the top predictor for a given y?

I slapped some code together, which you can find here: data-playground. Below is a piece of code that captures the idea:

import numpy as np

def generate_data(size=100, n_vars=5):
    # start from an empty (size x n_vars) matrix
    data = np.zeros((size, n_vars))

    # insights: inject the relationships to discover, together with
    # the answer object that describes them
    data, answer = add_insights(data)

    # return as X, y: the last column is the target
    X = data[:, :-1]
    y = data[:, -1]

    return X, y, answer
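
To make the idea concrete, here’s a rough sketch of what an add_insights step could do. The value ranges and factor bounds below are made up for illustration, and the real answer in data-playground is an object with a reveal() method rather than a plain string, but the core ingredient is the same: one randomly chosen column drives the target linearly.

import numpy as np

def add_insights(data, rng=None):
    # sketch only: pick one column as the true predictor and make the
    # last column (the target) a linear function of it
    rng = rng or np.random.default_rng()
    size, n_vars = data.shape

    # fill the feature columns with random values (range chosen arbitrarily)
    data[:, :-1] = rng.uniform(500, 1500, size=(size, n_vars - 1))

    # pick a predictor column and a linear factor at random
    predictor = rng.integers(0, n_vars - 1)
    factor = rng.uniform(0.5, 1.5)
    data[:, -1] = factor * data[:, predictor]

    answer = f"Col {predictor} adds linearly to target with factor {factor}"
    return data, answer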

Test run

Let’s see if we can solve a challenge we made for ourselves. Let’s answer the question: Which feature in X is the strongest predictor for y?

To get data:

import data_playground
X, y, answer = data_playground.generate_data()
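
A quick sanity check on the shapes: with the default size=100 and n_vars=5, the last column becomes the target, which should leave four feature columns in X.

print(X.shape, y.shape)  # expected with the defaults: (100, 4) (100,)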

Let’s have a look at the data we got, using the pandas boxplot function:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(X)
df['y'] = y
fig = plt.figure(figsize=(16,10)) # show large in Jupyter lab
df.boxplot()
Box plots showing the four feature variables in X and the target variable y

Okey dokey, great stuff. Now let’s check out the scatter plots of each feature vs the target separately.
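
The plotting code isn’t shown here, but a minimal sketch that puts each feature on the x-axis against the target on the y-axis could look like this:

import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, X.shape[1], figsize=(16, 4), sharey=True)
for i, ax in enumerate(axes):
    ax.scatter(X[:, i], y)  # one feature per panel vs the target
    ax.set_xlabel(f'feature {i}')
axes[0].set_ylabel('y')
plt.show()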

Scatterplots of each feature variable (x-axis) vs the target variable (y-axis)

Well… it seems our predictor has been found. Since noise is not (yet) part of the data generator, it’s only logical the correlation between the predictor variable and the target variable is 1.
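
For a quick numeric confirmation, one option is a pandas one-liner on the df we built for the box plots:

# correlation of each feature column with the target
print(df.corr()['y'].drop('y'))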

Let’s fit a line using the predictor we found and the target variable:

from numpy.polynomial.polynomial import polyfit
x = X[:, 2] # select the predictor as x
# Fit with polyfit
b, m = polyfit(x, y, 1)

Oh no! We get an error:

LinAlgError: SVD did not converge in Linear Least Squares

Let’s check out our predictor data:

array([ 804.47413666,  842.97551303,  748.1380677 , 1215.17156503,
       1276.81035949,  584.42126573, 1234.71707604,  882.6102377 ,
        706.84325884, 1199.374478  , 1004.41162981, 1057.4067627 ,
        659.40211494, 1087.62207358,  956.11178348,  823.99392245,
       1143.45258157,           nan, 1431.19129368, 1437.55897016,
       1230.64007   ,  617.81872218, 1062.12360521, 1201.80237097,
       1206.63668203,  882.78966205,  853.57276331, 1308.64976626,
       1097.63082906,  786.40660024, 1162.29695854,  674.5633299 ,
        624.71361214, 1389.87441326, 1104.76126325, 1200.38449289,
       1431.31478774,  591.59451421, 1398.08042032,  686.02957951,
       1201.50246581,  674.86646811,  919.48563472, 1155.23023783,
        914.36359672,  640.30226409, 1332.67749091,  740.15844807,
        829.20419027,           nan, 1003.42401407, 1410.10634418,
       1413.14987011,  772.21443592, 1118.37010464,  657.38659094,
        969.84855077,  926.63859525,  840.08468061, 1127.08928725,
       1235.22602886,  806.12183971, 1321.99606412,  867.78834899,
       1153.57274271,  667.14093202,  763.41042449,  953.45037745,
        670.57004238, 1125.52558347,  733.96942094, 1124.68968026,
        629.2784078 ,  614.83884103, 1112.5953336 , 1274.8264045 ,
       1094.69476358, 1382.11932072,  644.08898909, 1239.15631856,
       1070.14816499, 1224.89101018,  733.14740355, 1374.17210495,
       1054.74359293,  818.35508666, 1123.32875169,  934.47218873,
       1057.92690666, 1331.08625474,  814.7746755 ,  680.6920455 ,
       1024.99541524,  801.86090882, 1339.60803444, 1305.60718156,
        981.19456372, 1346.10765152,  777.84669881, 1154.19337888])

Did you spot it?

There are nan values in our set, one of the tricks implemented in the data generator to keep you on your toes.

Let’s replace the nan values with the mean of the feature and try again.

import numpy as np
from numpy.polynomial.polynomial import polyfit
x = X[:, 2]
# set the nan values to the mean of the non-nan values
x[np.isnan(x)] = x[~np.isnan(x)].mean()
# Fit with polyfit
b, m = polyfit(x, y, 1)

Yes, it worked this time!
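
(As an aside, np.nanmean(x) would have given the same mean in a single call.) The plot of the cleaned-up predictor against the target, with the fitted line on top, can be reproduced with something along these lines:

import numpy as np
import matplotlib.pyplot as plt

xs = np.sort(x)  # sort only so the line is drawn left to right
plt.figure(figsize=(10, 6))
plt.scatter(x, y, label='data')
plt.plot(xs, b + m * xs, color='red', label='fitted line')
plt.xlabel('predictor (col 2)')
plt.ylabel('y')
plt.legend()
plt.show()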

Scatter plot of predictor variable vs target variable with a best-fitted line; the two nan values that were replaced by the mean are clearly shown off the line.

Inspecting the b and m variables shows b = -48.40 and m = 0.888.

Now let’s take a look at the given answer:

>>> answer.reveal()
Answers: Col 2 adds linearly to target with factor 0.8882422214069292

That seems about right!

Summing up

It was definitely fun to work on this, but it does need a bit more work before it becomes a useful exercise. More complex relations and noise need to be implemented, as well as outliers and the other pesky things data scientists have to deal with.

I’d love to continue working on this in the future (and merge requests are very welcome!). Let’s see where it goes!

