Generating Synthetic Classification Data using Scikit
This is part 1 in a series of articles about imbalanced and noisy data. Part 2, about skewed classification metrics, is out.
Why do we need Data Generators?
Data generators help us create data with different distributions and profiles to experiment on. If you are testing various algorithms and want to find out which one works in which cases, these generators can produce case-specific data on which to test each algorithm.
For example, say you want to check whether gradient boosting trees can do well given just 100 data points and 2 features. You could search for a dataset with 100 data points, or use the dataset you are already working on. But how would you know whether the classifier was a good choice, given that with so little data even cross-validation and testing leave a fair chance of overfitting? Instead, you can use generated data and see what usually works well in such a case: a boosting algorithm or a linear model.
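To make this concrete, here is a minimal sketch of such an experiment (the generated dataset, the two models and the 5-fold cross-validation setup are illustrative choices, not the only reasonable ones):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Generate the hypothetical small dataset: 100 data points, 2 features
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# Compare a linear model against a boosting model with 5-fold CV
for model in (LogisticRegression(), GradientBoostingClassifier()):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```

Repeating this over many random seeds gives a much more reliable picture than a single small real dataset would.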
A few reasons why you need generated data
- Can your models handle noisy labels?
- What happens when 99% of your labels are negative and only 1% are positive?
- Can your models tell you which features are redundant?
- If the model provides feature importances, how does it handle redundant features?
- Does removing redundant features improve your model’s performance?
- How does your model behave when redundant features, noise and imbalance are all present at once in your dataset?
- If you have N data points and M features, what are safe values of N and M so your model doesn’t overfit?
Finding a real dataset that meets such a combination of criteria at known levels is very difficult. So let us look at a few capabilities a generator must have to give good approximations of real-world datasets.
Generator Capabilities
While looking for generators we look for certain capabilities. Below I list the important ones and classify the generators accordingly.
Supports Imbalancing the Classes
A lot of the time, classification data comes with a huge imbalance. For example, fraud detection data is imbalanced such that most examples (99%) are non-fraud. To check how your classifier does in imbalanced cases, you need the ability to generate multiple types of imbalanced data.
- Gaussian Quantiles
- Make classification API
Support Generating Noisy Data
Can your classifier perform its job even if the class labels are noisy? What if some fraud examples are marked non-fraud and some non-fraud examples are marked fraud? How do you know how your chosen classifier behaves in the presence of noise? And how do you select a robust classifier?
- Make classification API
Adding Redundant/Useless features
These are linear combinations of your useful features. Many models, like linear regression, give arbitrary coefficients to correlated features. Tree models get their feature importances messed up by them, and use these features randomly and interchangeably for splits. Removing correlated features usually improves performance.
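As a quick illustration (a sketch using make_classification, which is covered in detail later), a redundant feature makes the feature matrix rank-deficient, which is one way to spot it:

```python
import numpy as np
from sklearn.datasets import make_classification

# 2 informative features plus 1 redundant one
# (the redundant column is a linear combination of the informative ones)
X, y = make_classification(n_samples=500, n_features=3, n_informative=2,
                           n_redundant=1, n_repeated=0, random_state=0)

# Because one column is a linear combination of the others,
# the 3-column matrix only has rank 2
print(np.linalg.matrix_rank(X))
```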
- Make classification API
Examples
The notebook used for this article is on Github; the helper functions are defined in this file.
Here we will go over 3 very good data generators available in scikit-learn and see how you can use them for various cases.
Gaussian Quantiles
2 Class 2D
import pandas as pd
from sklearn.datasets import make_gaussian_quantiles

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=3.,
                                 n_samples=10000, n_features=2,
                                 n_classes=2, random_state=1)
X1 = pd.DataFrame(X1, columns=['x', 'y'])
y1 = pd.Series(y1)
visualize_2d(X1, y1)
Multi-Class 2D
import pandas as pd
from sklearn.datasets import make_gaussian_quantiles

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=3.,
                                 n_samples=10000, n_features=2,
                                 n_classes=3, random_state=1)
X1 = pd.DataFrame(X1, columns=['x', 'y'])
y1 = pd.Series(y1)
visualize_2d(X1, y1)
2 Class 3D
import pandas as pd
from sklearn.datasets import make_gaussian_quantiles

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=1.,
                                 n_samples=10000, n_features=3,
                                 n_classes=2, random_state=1)
X1 = pd.DataFrame(X1, columns=['x', 'y', 'z'])
y1 = pd.Series(y1)
visualize_3d(X1, y1)
A Harder Boundary by Combining 2 Gaussians
We create 2 Gaussians with different centre locations. mean=(4,4)
in the 2nd Gaussian centres it at x=4, y=4. Next we invert the 2nd Gaussian's labels and add its data points to the first Gaussian's data points.
import numpy as np
import pandas as pd
from sklearn.datasets import make_gaussian_quantiles

# Gaussian 1
X1, y1 = make_gaussian_quantiles(cov=3.,
                                 n_samples=10000, n_features=2,
                                 n_classes=2, random_state=1)
X1 = pd.DataFrame(X1, columns=['x', 'y'])
y1 = pd.Series(y1)

# Gaussian 2
X2, y2 = make_gaussian_quantiles(mean=(4, 4), cov=1,
                                 n_samples=5000, n_features=2,
                                 n_classes=2, random_state=1)
X2 = pd.DataFrame(X2, columns=['x', 'y'])
y2 = pd.Series(y2)

# Combine the Gaussians, inverting the labels of the 2nd one
X = pd.DataFrame(np.concatenate((X1, X2)))
y = pd.Series(np.concatenate((y1, - y2 + 1)))
visualize_2d(X, y)
Blobs
In case you want simpler, easily separable data, blobs are the way to go. These can be separated by linear decision boundaries. Here I will show an example of a 4-class 3D (3-feature) blob.
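A sketch of generating such a blob with scikit-learn's make_blobs (the cluster_std value here is an arbitrary choice; visualize_3d is the plotting helper from the notebook):

```python
import pandas as pd
from sklearn.datasets import make_blobs

# 4 classes in 3 dimensions; each blob is an isotropic Gaussian cluster
X, y = make_blobs(n_samples=10000, n_features=3, centers=4,
                  cluster_std=1.5, random_state=17)
X = pd.DataFrame(X, columns=['x', 'y', 'z'])
y = pd.Series(y)
# visualize_3d(X, y)  # plotting helper defined in the article's notebook
```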
You can notice how the blobs can be separated by simple planes. As such, these data points are good for testing linear algorithms like LogisticRegression.
Make Classification API
This is the most sophisticated scikit-learn API for data generation, and it comes with all the bells and whistles. It lets you have multiple features, and also add noise and imbalance to your data.
Some of the niftier features include adding redundant features, which are basically linear combinations of existing features; adding non-informative features, to check whether the model overfits these useless features; and adding directly repeated features as well.
To increase the complexity of the classification, you can also give each class multiple clusters and decrease the separation between classes, forcing the classifier to learn a complex non-linear boundary.
I provide below various ways to use this API.
3 Class 3D simple case
import pandas as pd
from sklearn.datasets import make_classification

# weights should sum to at most 1
X, y = make_classification(n_samples=10000, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=2, class_sep=1.5,
                           flip_y=0, weights=[0.34, 0.33, 0.33])
X = pd.DataFrame(X)
y = pd.Series(y)
visualize_3d(X, y)
3 Class 2D with Noise
Here we will use the flip_y parameter to add label noise. This can be used to test whether our classifiers work well in the presence of noise. If we have noisy real-world data (say from IoT devices) and a classifier that doesn't cope well with noise, our accuracy is going to suffer.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification

f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))

# Generate clean data
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, class_sep=2,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax1)
ax1.set_title("No Noise")

# Generate noisy data: 20% of the labels are flipped
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, class_sep=2,
                           flip_y=0.2, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax2)
ax2.set_title("With Noise")
plt.show()
2 Class 2D with Imbalance
Here we will have 9x more negative examples than positive examples.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification

f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))

# Generate balanced data
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=2,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax1)
ax1.set_title("No Imbalance")

# Generate imbalanced data: 90% negative, 10% positive
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=2,
                           flip_y=0, weights=[0.9, 0.1], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax2)
ax2.set_title("Imbalance 9:1 :: Negative:Positive")
plt.show()
Using Redundant features (3D)
This adds redundant features, which are linear combinations of other useful features.
from sklearn.datasets import make_classification

# All unique features
X, y = make_classification(n_samples=10000, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=2,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
visualize_3d(X, y, algorithm="pca")

# 2 useful features, 3rd feature a linear combination of the first 2
X, y = make_classification(n_samples=10000, n_features=3, n_informative=2,
                           n_redundant=1, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=2,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
visualize_3d(X, y, algorithm="pca")
Notice how, in the presence of redundant features, the 2nd graph appears to be composed of data points lying in a certain 3D plane, not filling the full 3D space. Contrast this with the first graph, where the data points spread as clouds in all 3 dimensions.
For the 2nd graph, intuitively, if I change my coordinates to the 3D plane in which the data points lie, the data will still be separable but its dimension will reduce to 2D; i.e. I will lose no information by reducing the dimensionality of the 2nd graph. But if I reduce the dimensionality of the first graph, the data will no longer remain separable, since all 3 features are non-redundant. Let's try this idea.
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=0.75,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
visualize_2d(X, y, algorithm="pca")

X, y = make_classification(n_samples=1000, n_features=3, n_informative=2,
                           n_redundant=1, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=0.75,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
visualize_2d(X, y, algorithm="pca")
Using Class separation
Changing the class separation changes the difficulty of the classification task: with lower class separation, the data points no longer remain easily separable.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification

f, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(20, 5))

# Low class sep, hard decision boundary
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=0.75,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax1)
ax1.set_title("Low class Sep, Hard decision boundary")

# Avg class sep, normal decision boundary
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=1.5,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax2)
ax2.set_title("Avg class Sep, Normal decision boundary")

# Large class sep, easy decision boundary
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=3,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax3)
ax3.set_title("Large class Sep, Easy decision boundary")
plt.show()
Testing Various Classifiers to See the Use of Data Generators
We will generate two sets of data and show how you can test your binary classifiers and check their performance. Our first set is standard 2-class data with easy separability. Our 2nd set is 2-class data with a non-linear boundary and minor class imbalance.
Hypothesis to Test
The hypothesis we want to test is: Logistic Regression alone cannot learn a non-linear boundary, while Gradient Boosting is most efficient at learning non-linear boundaries.
The Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification

f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))

# Easy decision boundary
X1, y1 = make_classification(n_samples=10000, n_features=2, n_informative=2,
                             n_redundant=0, n_repeated=0, n_classes=2,
                             n_clusters_per_class=2, class_sep=2,
                             flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X1[:, 0], y=X1[:, 1], hue=y1, ax=ax1)
ax1.set_title("Easy decision boundary")

# Hard decision boundary: overlay two imbalanced datasets
X2, y2 = make_classification(n_samples=10000, n_features=2, n_informative=2,
                             n_redundant=0, n_repeated=0, n_classes=2,
                             n_clusters_per_class=2, class_sep=1,
                             flip_y=0, weights=[0.7, 0.3], random_state=17)
X2a, y2a = make_classification(n_samples=10000, n_features=2, n_informative=2,
                               n_redundant=0, n_repeated=0, n_classes=2,
                               n_clusters_per_class=2, class_sep=1.25,
                               flip_y=0, weights=[0.8, 0.2], random_state=93)
X2 = np.concatenate((X2, X2a))
y2 = np.concatenate((y2, y2a))
sns.scatterplot(x=X2[:, 0], y=X2[:, 1], hue=y2, ax=ax2)
ax2.set_title("Hard decision boundary")
plt.show()

X1, y1 = pd.DataFrame(X1), pd.Series(y1)
X2, y2 = pd.DataFrame(X2), pd.Series(y2)
We will test 3 algorithms on these datasets and see how they perform:
- Logistic Regression
- Logistic Regression with Polynomial Features
- XGBoost (Gradient Boosting Algorithm)
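For instance, the 2nd approach can be sketched as a scikit-learn pipeline (degree 2 is an illustrative choice; the notebook's actual helper may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data; in the experiment below this would be X1, y1 or X2, y2
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.0, random_state=17)

# Degree-2 polynomial features let the linear model fit a curved boundary
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
score = cross_val_score(model, X, y, cv=5).mean()
print(round(score, 3))
```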
Testing on Easy decision boundary
Refer Notebook section 5 for full code.
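The helpers themselves live in the notebook; below is a minimal sketch of what a helper like run_logistic_plain might look like (the name matches the notebook's helper, but this body is an assumption):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def run_logistic_plain(X, y, ax=None):
    """Fit plain logistic regression, report mean CV accuracy and,
    if an axis is given, draw the learned decision boundary."""
    X, y = np.asarray(X), np.asarray(y)
    score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    model = LogisticRegression().fit(X, y)
    if ax is not None:
        # Evaluate the classifier on a grid to shade the class regions
        xx, yy = np.meshgrid(
            np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
            np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
        zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
        ax.contourf(xx, yy, zz, alpha=0.3)
        ax.scatter(X[:, 0], X[:, 1], c=y, s=5)
        ax.set_title("Logistic Regression, CV accuracy %.3f" % score)
    return score
```

The other two helpers would follow the same pattern with a different estimator.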
f, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))
lr_results = run_logistic_plain(X1, y1, ax1)
lrp_results = run_logistic_polynomial_features(X1, y1, ax2)
xgb_results = run_xgb(X1, y1, ax3)
plt.show()
Let's plot the performance and decision boundary structure.
Testing on Hard decision boundary
Decision Boundary
Performance
Notice how XGBoost, with a score of 0.916, emerges as the clear winner. This is because gradient boosting allows learning complex non-linear boundaries.
We were able to test our hypothesis and conclude that it was correct. Given how easy it was to generate data, we saved time on the initial data-gathering process and were able to test our classifiers very fast.
Other Resources
This is the 1st article in a series where I plan to analyse the performance of various classifiers given noise and imbalance. Part 2 is here.
Thanks for Reading!!
I solve real-world problems leveraging data science, artificial intelligence, machine learning and deep learning. Feel free to reach out to me on LinkedIn.