Determine If Your Data Can Be Modeled

Have you collected the right set of data?

Some data sets simply do not have a spatial representation that can be clustered. Your features may show great variance, and may be theoretically sound as well, but that does not mean the data is statistically separable.

So, WHEN DO I STOP?

  1. Always visualize your data based on the class label you are trying to predict
import seaborn as sns
import matplotlib.pyplot as plt

# Pair-plot the numeric features, colored by the target label
columns_pairplot = x_train.select_dtypes(include=['int', 'float']).join(y_train)
sns.pairplot(columns_pairplot, hue='readmitted')
plt.show()
Image by Author: Inseparable Classes

The distributions of the different classes are almost identical. It is, of course, an imbalanced dataset, but notice how the spread of the classes overlaps as well.
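A quick way to quantify that imbalance is to look at the normalized class counts. A minimal sketch, assuming y_train is a pandas Series of the readmitted labels (the values below are made-up stand-ins, not the article's data):

```python
import pandas as pd

# Toy stand-in for y_train; in the article it holds the 'readmitted' labels
y_train = pd.Series(['NO', 'NO', 'NO', '>30', '>30', '<30'])

# Normalized counts reveal how skewed the classes are
class_shares = y_train.value_counts(normalize=True)
print(class_shares)
```

On the real data, strongly unequal shares here would confirm the imbalance visible in the pair plot.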

  2. Apply the t-SNE visualization

t-SNE stands for "t-distributed stochastic neighbor embedding". It maps higher-dimensional data to 2-D space while approximately preserving the nearness of the samples.

You might need to try different learning rates to find the best one for your dataset; values between 50 and 200 are a usual starting point.

The hyperparameter perplexity balances the importance t-SNE gives to local and global variability in the data. It is roughly a guess at the number of close neighbors each point has. Use values between 5 and 50, higher if there are more data points; the perplexity should never exceed the number of data points.

NOTE: The axes of a t-SNE plot are not interpretable, and they will be different every time t-SNE is applied.
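Since the right perplexity depends on the data, one practical approach is to embed the data at a few perplexities and compare the resulting plots. A minimal sketch on synthetic data (the perplexity values 5/30/50 are only illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 10))  # synthetic stand-in for x_train

embeddings = {}
for perplexity in (5, 30, 50):
    # perplexity must stay below the number of samples (here 100)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(x)
```

Each entry of embeddings can then be scattered and colored by class label to see which perplexity best reveals (or fails to reveal) structure.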

Image by Author: T-SNE Visualization

Hmm, let’s look a bit more and tweak some hyperparameters.

# reduce dimensionality with t-SNE
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=1000, learning_rate=50)
tsne_results = tsne.fit_transform(x_train)
Image by Author: Overlapping Data- Unclassifiable!

Do you see how the clusters cannot be separated? I should have stopped here, but I could not pull myself out of the rabbit hole. [YES, WE ALL GO DOWN ONE SOMETIMES.]

  3. Multi-Class Classification

We already know from the plots above that the decision boundaries are non-linear, so we can use an SVC (Support Vector Classifier) with an RBF kernel.

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svc_model = SVC()  # default kernel - RBF
parameters = {'C': [0.1, 1, 10], 'gamma': [0.00001, 0.0001, 0.001, 0.01, 0.1]}
searcher = GridSearchCV(svc_model, param_grid=parameters, n_jobs=4, verbose=2, return_train_score=True)
searcher.fit(x_train, y_train)

# Report the best parameters and the corresponding score
print("Best params:", searcher.best_params_)
print("Best CV score:", searcher.best_score_)

Train Score: 0.59 Test Score: 0.53 F1 Score: 0.23 Precision Score: 0.24
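For reference, macro-averaged scores like the F1 and precision reported above can be computed with sklearn.metrics. A toy sketch with made-up predictions for a 3-class problem (not the article's actual model output):

```python
from sklearn.metrics import f1_score, precision_score

# Made-up labels and predictions, purely illustrative
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# 'macro' averages the per-class scores, so minority classes
# drag the score down even on an imbalanced dataset
f1 = f1_score(y_true, y_pred, average='macro')
precision = precision_score(y_true, y_pred, average='macro')
print(f1, precision)
```

Low macro scores alongside a decent accuracy, as in the results above, are a telltale sign that the model is mostly predicting the majority class.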

So, I should have stopped earlier. It is always good to understand your data before you over-tune and over-complicate the model in the hope of better results. Good luck!
