Generating Synthetic Classification Data using Scikit
This is part 1 in a series of articles about imbalanced and noisy data. Part 2, about skewed classification metrics, is out.
Why do we need Data Generators?
Data generators help us create data with different distributions and profiles to experiment on. If you are testing various algorithms and want to find out which one works in which cases, these generators can produce case-specific data on which to test each algorithm.
For example, say you want to check whether gradient boosting trees can do well given just 100 data points and 2 features. You could search for a dataset with 100 data points, or use the dataset you are already working on. But how would you know whether the classifier was a good choice, given that with so little data even cross-validation and testing leave a fair chance of overfitting? Instead, you can use generated data and see what usually works well in such a case: a boosting algorithm or a linear model.
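To make this concrete, here is a minimal sketch of such an experiment (the generated dataset, the two models and the 5-fold cross-validation setup are illustrative choices, not the only reasonable ones):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Generate the hypothetical small dataset: 100 data points, 2 features
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# Compare a linear model against a boosting model with 5-fold CV
for model in (LogisticRegression(), GradientBoostingClassifier()):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```

Repeating this over many random seeds gives a much more reliable picture than a single small real dataset would.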
A few reasons why you need generated data
- Can your models handle noisy labels?
- What happens when 99% of your labels are negative and only 1% are positive?
- Can your models tell you which features are redundant?
- If the model provides feature importances, how does it handle redundant features?
- Does removing redundant features improve your model’s performance?
- How does your model behave when redundant features, noise and imbalance are all present at once in your dataset?
- If you have N data points and M features, what are safe values of N and M so your model doesn’t overfit?
Finding a real dataset that meets such a combination of criteria at known levels is very difficult. So let us look at a few capabilities a generator must have to give good approximations of real-world datasets.
Generator Capabilities
While looking for generators we look for certain capabilities. Below I list the important ones and classify the generators accordingly.
Supports Imbalancing the Classes
A lot of the time, classification data comes with a huge imbalance. For example, fraud detection data is imbalanced such that most examples (99%) are non-fraud. To check how your classifier does in imbalanced cases, you need the ability to generate multiple types of imbalanced data.
- Gaussian Quantiles
- Make classification API
Support Generating Noisy Data
Can your classifier perform its job even if the class labels are noisy? What if some fraud examples are marked non-fraud and some non-fraud examples are marked fraud? How do you know how your chosen classifier behaves in the presence of noise? And how do you select a robust classifier?
- Make classification API
Adding Redundant/Useless features
These are linear combinations of your useful features. Many models, like linear regression, give arbitrary coefficients to correlated features. Tree models get their feature importances messed up by them, and use these features randomly and interchangeably for splits. Removing correlated features usually improves performance.
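As a quick illustration (a sketch using make_classification, which is covered in detail later), a redundant feature makes the feature matrix rank-deficient, which is one way to spot it:

```python
import numpy as np
from sklearn.datasets import make_classification

# 2 informative features plus 1 redundant one
# (the redundant column is a linear combination of the informative ones)
X, y = make_classification(n_samples=500, n_features=3, n_informative=2,
                           n_redundant=1, n_repeated=0, random_state=0)

# Because one column is a linear combination of the others,
# the 3-column matrix only has rank 2
print(np.linalg.matrix_rank(X))
```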
- Make classification API
Examples
The notebook used for this article is on Github; the helper functions are defined in this file.
Here we will go over 3 very good data generators available in scikit-learn and see how you can use them for various cases.
Gaussian Quantiles
2 Class 2D
import pandas as pd
from sklearn.datasets import make_gaussian_quantiles

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=3.,
                                 n_samples=10000, n_features=2,
                                 n_classes=2, random_state=1)
X1 = pd.DataFrame(X1, columns=['x', 'y'])
y1 = pd.Series(y1)
visualize_2d(X1, y1)
Multi-Class 2D
import pandas as pd
from sklearn.datasets import make_gaussian_quantiles

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=3.,
                                 n_samples=10000, n_features=2,
                                 n_classes=3, random_state=1)
X1 = pd.DataFrame(X1, columns=['x', 'y'])
y1 = pd.Series(y1)
visualize_2d(X1, y1)
2 Class 3D
import pandas as pd
from sklearn.datasets import make_gaussian_quantiles

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=1.,
                                 n_samples=10000, n_features=3,
                                 n_classes=2, random_state=1)
X1 = pd.DataFrame(X1, columns=['x', 'y', 'z'])
y1 = pd.Series(y1)
visualize_3d(X1, y1)
A Harder Boundary by Combining 2 Gaussians
We create 2 Gaussians with different centre locations. mean=(4,4)
in the 2nd Gaussian centres it at x=4, y=4. Next we invert the 2nd Gaussian's labels and add its data points to the first Gaussian's data points.
import numpy as np
import pandas as pd
from sklearn.datasets import make_gaussian_quantiles

# Gaussian 1
X1, y1 = make_gaussian_quantiles(cov=3.,
                                 n_samples=10000, n_features=2,
                                 n_classes=2, random_state=1)
X1 = pd.DataFrame(X1, columns=['x', 'y'])
y1 = pd.Series(y1)

# Gaussian 2
X2, y2 = make_gaussian_quantiles(mean=(4, 4), cov=1,
                                 n_samples=5000, n_features=2,
                                 n_classes=2, random_state=1)
X2 = pd.DataFrame(X2, columns=['x', 'y'])
y2 = pd.Series(y2)

# Combine the Gaussians, inverting the labels of the 2nd one
X = pd.DataFrame(np.concatenate((X1, X2)))
y = pd.Series(np.concatenate((y1, - y2 + 1)))
visualize_2d(X, y)
Blobs
In case you want simpler, easily separable data, blobs are the way to go. These can be separated by linear decision boundaries. Here I will show an example of a 4-class 3D (3-feature) blob.
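A sketch of generating such a blob with scikit-learn's make_blobs (the cluster_std value here is an arbitrary choice; visualize_3d is the plotting helper from the notebook):

```python
import pandas as pd
from sklearn.datasets import make_blobs

# 4 classes in 3 dimensions; each blob is an isotropic Gaussian cluster
X, y = make_blobs(n_samples=10000, n_features=3, centers=4,
                  cluster_std=1.5, random_state=17)
X = pd.DataFrame(X, columns=['x', 'y', 'z'])
y = pd.Series(y)
# visualize_3d(X, y)  # plotting helper defined in the article's notebook
```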
You can notice how the blobs can be separated by simple planes. As such, these data points are good for testing linear algorithms like LogisticRegression.
Make Classification API
This is the most sophisticated scikit-learn API for data generation, and it comes with all the bells and whistles. It lets you have multiple features, and also add noise and imbalance to your data.
Some of the niftier features include adding redundant features, which are basically linear combinations of existing features; adding non-informative features, to check whether the model overfits these useless features; and adding directly repeated features as well.
To increase the complexity of the classification, you can also give each class multiple clusters and decrease the separation between classes, forcing the classifier to learn a complex non-linear boundary.
I provide below various ways to use this API.
3 Class 3D simple case
import pandas as pd
from sklearn.datasets import make_classification

# weights should sum to at most 1
X, y = make_classification(n_samples=10000, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=2, class_sep=1.5,
                           flip_y=0, weights=[0.34, 0.33, 0.33])
X = pd.DataFrame(X)
y = pd.Series(y)
visualize_3d(X, y)
3 Class 2D with Noise
Here we will use the flip_y parameter to add label noise. This can be used to test whether our classifiers work well in the presence of noise. If we have noisy real-world data (say from IoT devices) and a classifier that doesn't cope well with noise, our accuracy is going to suffer.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification

f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))

# Generate clean data
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, class_sep=2,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax1)
ax1.set_title("No Noise")

# Generate noisy data: 20% of the labels are flipped
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, class_sep=2,
                           flip_y=0.2, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax2)
ax2.set_title("With Noise")
plt.show()
2 Class 2D with Imbalance
Here we will have 9x more negative examples than positive examples.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification

f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))

# Generate balanced data
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=2,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax1)
ax1.set_title("No Imbalance")

# Generate imbalanced data: 90% negative, 10% positive
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=2,
                           flip_y=0, weights=[0.9, 0.1], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax2)
ax2.set_title("Imbalance 9:1 :: Negative:Positive")
plt.show()
Using Redundant features (3D)
This adds redundant features, which are linear combinations of other useful features.
from sklearn.datasets import make_classification

# All unique features
X, y = make_classification(n_samples=10000, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=2,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
visualize_3d(X, y, algorithm="pca")

# 2 useful features, 3rd feature a linear combination of the first 2
X, y = make_classification(n_samples=10000, n_features=3, n_informative=2,
                           n_redundant=1, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=2,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
visualize_3d(X, y, algorithm="pca")
Notice how, in the presence of redundant features, the 2nd graph appears to be composed of data points lying in a certain 3D plane, not filling the full 3D space. Contrast this with the first graph, where the data points spread as clouds in all 3 dimensions.
For the 2nd graph, intuitively, if I change my coordinates to the 3D plane in which the data points lie, the data will still be separable but its dimension will reduce to 2D; i.e. I will lose no information by reducing the dimensionality of the 2nd graph. But if I reduce the dimensionality of the first graph, the data will no longer remain separable, since all 3 features are non-redundant. Let's try this idea.
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=0.75,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
visualize_2d(X, y, algorithm="pca")

X, y = make_classification(n_samples=1000, n_features=3, n_informative=2,
                           n_redundant=1, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=0.75,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
visualize_2d(X, y, algorithm="pca")
Using Class separation
Changing the class separation changes the difficulty of the classification task: with lower class separation, the data points no longer remain easily separable.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification

f, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(20, 5))

# Low class sep, hard decision boundary
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=0.75,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax1)
ax1.set_title("Low class Sep, Hard decision boundary")

# Avg class sep, normal decision boundary
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=1.5,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax2)
ax2.set_title("Avg class Sep, Normal decision boundary")

# Large class sep, easy decision boundary
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=3,
                           flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, ax=ax3)
ax3.set_title("Large class Sep, Easy decision boundary")
plt.show()
Testing Various Classifiers to See the Use of Data Generators
We will generate two sets of data and show how you can test your binary classifiers and check their performance. Our first set is standard 2-class data with easy separability. Our 2nd set is 2-class data with a non-linear boundary and minor class imbalance.
Hypothesis to Test
The hypothesis we want to test is: Logistic Regression alone cannot learn a non-linear boundary, while Gradient Boosting is most efficient at learning non-linear boundaries.
The Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification

f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))

# Easy decision boundary
X1, y1 = make_classification(n_samples=10000, n_features=2, n_informative=2,
                             n_redundant=0, n_repeated=0, n_classes=2,
                             n_clusters_per_class=2, class_sep=2,
                             flip_y=0, weights=[0.5, 0.5], random_state=17)
sns.scatterplot(x=X1[:, 0], y=X1[:, 1], hue=y1, ax=ax1)
ax1.set_title("Easy decision boundary")

# Hard decision boundary: overlay two imbalanced datasets
X2, y2 = make_classification(n_samples=10000, n_features=2, n_informative=2,
                             n_redundant=0, n_repeated=0, n_classes=2,
                             n_clusters_per_class=2, class_sep=1,
                             flip_y=0, weights=[0.7, 0.3], random_state=17)
X2a, y2a = make_classification(n_samples=10000, n_features=2, n_informative=2,
                               n_redundant=0, n_repeated=0, n_classes=2,
                               n_clusters_per_class=2, class_sep=1.25,
                               flip_y=0, weights=[0.8, 0.2], random_state=93)
X2 = np.concatenate((X2, X2a))
y2 = np.concatenate((y2, y2a))
sns.scatterplot(x=X2[:, 0], y=X2[:, 1], hue=y2, ax=ax2)
ax2.set_title("Hard decision boundary")
plt.show()

X1, y1 = pd.DataFrame(X1), pd.Series(y1)
X2, y2 = pd.DataFrame(X2), pd.Series(y2)
We will test 3 algorithms on these datasets and see how they perform:
- Logistic Regression
- Logistic Regression with Polynomial Features
- XGBoost (Gradient Boosting Algorithm)
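For instance, the 2nd approach can be sketched as a scikit-learn pipeline (degree 2 is an illustrative choice; the notebook's actual helper may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data; in the experiment below this would be X1, y1 or X2, y2
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.0, random_state=17)

# Degree-2 polynomial features let the linear model fit a curved boundary
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
score = cross_val_score(model, X, y, cv=5).mean()
print(round(score, 3))
```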
Testing on Easy decision boundary
Refer Notebook section 5 for full code.
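The helpers themselves live in the notebook; below is a minimal sketch of what a helper like run_logistic_plain might look like (the name matches the notebook's helper, but this body is an assumption):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def run_logistic_plain(X, y, ax=None):
    """Fit plain logistic regression, report mean CV accuracy and,
    if an axis is given, draw the learned decision boundary."""
    X, y = np.asarray(X), np.asarray(y)
    score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    model = LogisticRegression().fit(X, y)
    if ax is not None:
        # Evaluate the classifier on a grid to shade the class regions
        xx, yy = np.meshgrid(
            np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
            np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
        zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
        ax.contourf(xx, yy, zz, alpha=0.3)
        ax.scatter(X[:, 0], X[:, 1], c=y, s=5)
        ax.set_title("Logistic Regression, CV accuracy %.3f" % score)
    return score
```

The other two helpers would follow the same pattern with a different estimator.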
f, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))
lr_results = run_logistic_plain(X1, y1, ax1)
lrp_results = run_logistic_polynomial_features(X1, y1, ax2)
xgb_results = run_xgb(X1, y1, ax3)
plt.show()
Let's plot the performance and decision boundary structure.
Testing on Hard decision boundary
Decision Boundary
Performance
Notice how XGBoost, with a score of 0.916, emerges as the clear winner. This is because gradient boosting allows learning complex non-linear boundaries.
We were able to test our hypothesis and conclude that it was correct. Given how easy it was to generate data, we saved time on the initial data-gathering process and were able to test our classifiers very fast.
Other Resources
This is the 1st article in a series where I plan to analyse the performance of various classifiers given noise and imbalance. Part 2 is here.
Thanks for Reading!!
I solve real-world problems leveraging data science, artificial intelligence, machine learning and deep learning. Feel free to reach out to me on LinkedIn.