Creating challenges in a data science world
I love to participate in code challenges. There are tons of ’em online, each with its own strengths and weaknesses. Most of them focus on data structures and concern themselves with _Big O_ for speed and memory. However, my main occupation is data science and AI, not software engineering.
Sure, there’s Kaggle, and there are data and problems to be found everywhere, but none of that provides the quick "1-question challenge" format I love so much about software engineering challenges.
Anyway, I decided to make something myself.
Data playground
So here’s the idea: a 1-button dataset generator that comes with a question, e.g. which feature in X is the top predictor for a given y?
I slapped some code together, which you can find here: data-playground. Below is a piece of code that captures the idea:
import numpy as np

def generate_data(size=100, n_vars=5):
    data = np.zeros((size, n_vars))
    # add_insights injects the hidden relationships and returns the ground-truth answer
    data, answer = add_insights(data)
    # return as X (features), y (target) and the answer object
    X = data[:, :-1]
    y = data[:, -1]
    return X, y, answer
Test run
Let’s see if we can solve a challenge we made for ourselves by answering the question: which feature in X is the strongest predictor for y?
To get data:
import data_playground
X, y, answer = data_playground.generate_data()
Let’s have a look at the data we got, using pandas’ boxplot function:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(X)
df['y'] = y
fig = plt.figure(figsize=(16,10)) # show large in Jupyter lab
df.boxplot()

Okey dokey, great stuff. Now let’s check out the scatter plots of each feature vs the target separately.
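Something along these lines produces those plots (a minimal sketch; it simply loops over whatever feature columns the generator happened to create):

# scatter plot of each feature against the target
features = df.columns.drop('y')
fig, axes = plt.subplots(1, len(features), figsize=(16, 4))
for ax, col in zip(axes, features):
    ax.scatter(df[col], df['y'])
    ax.set_xlabel(f'feature {col}')
    ax.set_ylabel('y')
plt.tight_layout()
plt.show()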

Well… it seems our predictor has been found. Since noise is not (yet) part of the data generator, it’s only logical that the correlation between the predictor variable and the target variable is 1.
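For a quick numeric confirmation, the correlation of each column with the target can be computed with pandas (a sketch; corr uses Pearson correlation by default):

# correlation of every feature with the target; the top predictor should stand out
corr_with_y = df.corr()['y'].drop('y')
print(corr_with_y.abs().sort_values(ascending=False))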
Let’s fit a line using the predictor we found and the target variable:
from numpy.polynomial.polynomial import polyfit
x = X[:, 2] # select the predictor as x
# Fit with polyfit
b, m = polyfit(x, y, 1)
Oh no! We get an error:
LinAlgError: SVD did not converge in Linear Least Squares
Let’s check out our predictor data:
array([ 804.47413666, 842.97551303, 748.1380677 , 1215.17156503,
1276.81035949, 584.42126573, 1234.71707604, 882.6102377 ,
706.84325884, 1199.374478 , 1004.41162981, 1057.4067627 ,
659.40211494, 1087.62207358, 956.11178348, 823.99392245,
1143.45258157, nan, 1431.19129368, 1437.55897016,
1230.64007 , 617.81872218, 1062.12360521, 1201.80237097,
1206.63668203, 882.78966205, 853.57276331, 1308.64976626,
1097.63082906, 786.40660024, 1162.29695854, 674.5633299 ,
624.71361214, 1389.87441326, 1104.76126325, 1200.38449289,
1431.31478774, 591.59451421, 1398.08042032, 686.02957951,
1201.50246581, 674.86646811, 919.48563472, 1155.23023783,
914.36359672, 640.30226409, 1332.67749091, 740.15844807,
829.20419027, nan, 1003.42401407, 1410.10634418,
1413.14987011, 772.21443592, 1118.37010464, 657.38659094,
969.84855077, 926.63859525, 840.08468061, 1127.08928725,
1235.22602886, 806.12183971, 1321.99606412, 867.78834899,
1153.57274271, 667.14093202, 763.41042449, 953.45037745,
670.57004238, 1125.52558347, 733.96942094, 1124.68968026,
629.2784078 , 614.83884103, 1112.5953336 , 1274.8264045 ,
1094.69476358, 1382.11932072, 644.08898909, 1239.15631856,
1070.14816499, 1224.89101018, 733.14740355, 1374.17210495,
1054.74359293, 818.35508666, 1123.32875169, 934.47218873,
1057.92690666, 1331.08625474, 814.7746755 , 680.6920455 ,
1024.99541524, 801.86090882, 1339.60803444, 1305.60718156,
981.19456372, 1346.10765152, 777.84669881, 1154.19337888])
Did you spot it?
There are nan values in our set, one of the tricks implemented in the data generator to keep you on your toes.
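A quick way to confirm is to count the missing values per feature column, something like:

import numpy as np

# number of NaN values in each feature column
print(np.isnan(X).sum(axis=0))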
Let’s replace the nan values with the mean of the feature and try again.
import numpy as np
from numpy.polynomial.polynomial import polyfit
x = X[:, 2]
# set the nan values to the mean of the non-nan values
x[np.isnan(x)] = x[~np.isnan(x)].mean()
# Fit with polyfit
b, m = polyfit(x, y, 1)
Yes, it worked this time!
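For completeness, the (imputed) predictor and the fitted line can be drawn with something along these lines (a minimal sketch, reusing x, y, b and m from above):

# scatter of the predictor vs the target, with the fitted line on top
plt.figure(figsize=(10, 6))
plt.scatter(x, y, label='data')
plt.plot(x, b + m * x, color='red', label='fitted line')
plt.xlabel('feature 2')
plt.ylabel('y')
plt.legend()
plt.show()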

Inspecting the b and m variables shows b = -48.40 (intercept) and m = 0.888 (slope).
Now let’s take a look at the given answer:
>>> answer.reveal()
Answers: Col 2 adds linearly to target with factor 0.8882422214069292
That seems about right!
Summing up
It was definitely fun to work on this, but it does need a bit more work before it becomes a useful exercise. More complex relations and noise need to be implemented, as well as outliers and other pesky things data scientists have to deal with.
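To give an idea of the direction, a rough and purely hypothetical sketch of such a noise-and-outliers step could look like this (add_noise_and_outliers does not exist in the repo, it’s just an illustration):

import numpy as np

def add_noise_and_outliers(data, noise_std=0.1, outlier_frac=0.02, rng=None):
    """Hypothetical helper: add Gaussian noise and a few outlier rows to generated data."""
    rng = np.random.default_rng(rng)
    # add noise scaled to each column's spread
    noisy = data + rng.normal(0, noise_std * data.std(axis=0), size=data.shape)
    # turn a small fraction of rows into outliers by scaling them up
    n_outliers = int(outlier_frac * len(noisy))
    idx = rng.choice(len(noisy), size=n_outliers, replace=False)
    noisy[idx] *= rng.uniform(3, 10, size=(n_outliers, 1))
    return noisy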
I’d love to continue working on this in the future (and merge requests are very welcome!). Let’s see where it goes!