
How Do I Finish My Data Science Projects Quickly?

Untold secrets of a data scientist

Photo by Campaign Creators on Unsplash

Today, machine learning model development is within reach even of people who are not data scientists. You just need to know the right approach to problem-solving. This is no big secret: it is a matter of keeping up with recent developments in machine learning and understanding the frameworks that make development quicker and better. In this article, I will first describe the traditional approach that data scientists have followed for many years and then discuss the modern approach they follow today.

Traditional Approach

For years, when given a data science project, a data scientist would start by exploring the dataset. He would cleanse the data, impute the missing values, look at the variance in column values, search for correlations, do feature engineering, and so on. There are any number of tasks he would perform to create a dataset for training his machine learning algorithms or for feeding a neural network that he has designed. The list is endless, time-consuming, and mostly laborious.
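To make a few of these chores concrete, here is a minimal sketch of median imputation, a zero-variance check, and a correlation check. It assumes NumPy and scikit-learn are installed; the toy matrix is hypothetical data invented for illustration, not the dataset used in this article.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer

# Hypothetical toy feature matrix: column 0 has a missing value,
# column 2 is constant and therefore carries no signal.
X = np.array([
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 5.0],
    [np.nan, 30.0, 5.0],
    [4.0, 40.0, 5.0],
])

# 1. Impute missing values with the column median.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# 2. Drop columns whose variance is zero.
X_reduced = VarianceThreshold(threshold=0.0).fit_transform(X_imputed)

# 3. Look at pairwise correlations between the remaining columns.
corr = np.corrcoef(X_reduced, rowvar=False)

print(X_reduced.shape)
print(np.round(corr, 3))
```

On a real project, each of these steps is repeated and adjusted many times, which is where the hours go.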

After preparing the dataset, his next task would be to try out a machine learning algorithm selected on the basis of his knowledge and experience. After training the model, he may find that the selected algorithm gives just 60% accuracy, which is certainly not acceptable. So he may go back to data preprocessing, feature engineering, dimensionality reduction, and so on to improve the model’s accuracy. He may fiddle with the hyper-parameters to look for performance improvements. If all this still does not work out, he will move on to another algorithm. The process goes on until he finds the best-performing algorithm with fine-tuned hyper-parameters for his dataset. At this stage, he would call his model ready for deployment.
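The try-tune-repeat loop described above can be sketched with plain scikit-learn. This is only an illustration under my own assumptions: the built-in wine dataset, the three candidate algorithms, and the small hyper-parameter grids are stand-ins, not the setup from my test runs.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Candidate algorithms, each with a small hyper-parameter grid to tune.
candidates = {
    "logreg": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
    "svc": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1.0, 10.0]},
    ),
    "forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100]},
    ),
}

# Tune every candidate with cross-validation and remember its best score.
results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3)
    search.fit(X_train, y_train)
    results[name] = (search.best_score_, search.best_estimator_)

# Keep the candidate with the highest cross-validated score.
best_name = max(results, key=lambda n: results[n][0])
best_model = results[best_name][1]
test_accuracy = best_model.score(X_test, y_test)
print(best_name, round(test_accuracy, 3))
```

Even this tiny version needs the candidates and grids to be chosen by hand, which is exactly the knowledge-and-experience step the traditional approach depends on.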

Modern Approach

The process I have described so far is, as you can easily see, highly laborious and time-consuming. Not only that, you need excellent knowledge of statistics, EDA (Exploratory Data Analysis), machine learning algorithms, the metrics used to evaluate an algorithm’s performance, and so on.

Would it not be nice if somebody automated this entire process? It sounds difficult, but such tools have been available in the industry for quite some time. You need to learn how to use them in your projects. We call this AutoML. I will give you a quick introduction to such tools. For reference, I will start with a tool based on the widely used sklearn ML library. Auto-sklearn is one such tool that takes the classical ML approach, while AutoKeras takes the ANN approach. There are several other commercial and free-to-use toolkits available. I will mention a few at the end so that you can select the one that suits your purpose.

Classical ML Approach

With auto-sklearn, you just need to know your task – regression or classification. First, I will discuss the auto model development for classification.

Classification Task

In auto-sklearn, it takes just two lines of code to get the best-performing model on your dataset. For a classification task, you would use code like this:

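from autosklearn.classification import AutoSklearnClassifier
# Total search budget: 4 minutes; each candidate fit is capped at 30 seconds.
model = AutoSklearnClassifier(time_left_for_this_task=4*60,
                              per_run_time_limit=30, n_jobs=-1)
model.fit(X_train, y_train)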
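The two timing arguments in the call above are both given in seconds. As a rough sketch of how I think about them, the helper below bundles the budget and adds a sanity check of my own. The names automl_budget and fit_auto_classifier are hypothetical, and the check is not something auto-sklearn itself enforces in this form.

```python
def automl_budget(total_seconds: int, per_run_seconds: int) -> dict:
    """Bundle the two auto-sklearn timing arguments after a basic sanity check."""
    if per_run_seconds >= total_seconds:
        # My own rule of thumb: one run should never eat the whole budget.
        raise ValueError("per-run limit must be smaller than the total budget")
    return {
        "time_left_for_this_task": total_seconds,
        "per_run_time_limit": per_run_seconds,
    }


def fit_auto_classifier(X_train, y_train, total_seconds=4 * 60, per_run_seconds=30):
    """Fit an AutoSklearnClassifier within the given time budget."""
    # Imported lazily so the budget helper works without auto-sklearn installed.
    from autosklearn.classification import AutoSklearnClassifier

    model = AutoSklearnClassifier(
        n_jobs=-1, **automl_budget(total_seconds, per_run_seconds)
    )
    model.fit(X_train, y_train)
    return model


budget = automl_budget(4 * 60, 30)
# At most this many full-length runs fit back to back on a single worker.
max_sequential_runs = budget["time_left_for_this_task"] // budget["per_run_time_limit"]
print(budget, max_sequential_runs)
```

With n_jobs=-1 the runs are spread over all cores, so the total number of runs can be higher than the single-worker figure.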

The AutoSklearnClassifier does the magic. It takes a few parameters that give you control over the execution time: time_left_for_this_task is the total search budget in seconds, and per_run_time_limit caps each individual model fit. Finding the best-fit model can take a long time, sometimes hours, so this control is essential. The fit call runs until the budget is used up and, at the end, gives you the best model for your dataset. Once this is done, you can use it straight away for inference:

predictions = model.predict(X_test)
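To turn predictions into an accuracy figure, scikit-learn's accuracy_score is enough. The labels below are hypothetical placeholders for y_test and the model's predictions, used only so that the snippet runs on its own.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, standing in for y_test and
# the output of model.predict(X_test).
y_test = [0, 1, 1, 0, 1]
predictions = [0, 1, 0, 0, 1]

# Fraction of predictions that match the true labels (4 of 5 here).
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.3f}")
```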

If you are curious to know what happened behind the scenes, you can print some statistics.

print(model.sprint_statistics())

One of my test runs on a Kaggle dataset produced the following statistics:

auto-sklearn results:
Dataset name: 1e3ad955125d649ab6cd828885a8d3fb
Metric: accuracy
Best validation score: 0.986602
Number of target algorithm runs: 24
Number of successful target algorithm runs: 15
Number of crashed target algorithm runs: 0
Number of target algorithms that exceeded the time limit: 9
Number of target algorithms that exceeded the memory limit: 0
Accuracy: 0.993

You can see it made 24 runs of candidate algorithms, 15 of which completed within the time limit. I would not have the patience to do this by hand. The model has given me 99% accuracy. What more would you ask for? If I am curious about what the final model comprises, I can ask it to show me the final model configuration:

model.show_models()

This is the output I got on my test run. It is only a partial output, showing one classifier (extra_trees) used in the ensemble.

'[(0.340000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'extra_trees', 'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'no_coalescense', 'data_preprocessing:numerical_transformer:imputation:strategy': 'median', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'minmax', 'feature_preprocessor:__choice__': 'no_preprocessing', 'classifier:extra_trees:bootstrap': 'True', 'classifier:extra_trees:criterion': 'gini', 'classifier:extra_trees:max_depth': 'None', 'classifier:extra_trees:max_features': 0.5033866291997137, 'classifier:extra_trees:max_leaf_nodes': 'None', 'classifier:extra_trees:min_impurity_decrease': 0.0, 'classifier:extra_trees:min_samples_leaf': 2, 'classifier:extra_trees:min_samples_split': 14, 'classifier:extra_trees:min_weight_fraction_leaf': 0.0},\ndataset_properties={\n 'task': 1,\n 'sparse...'

Most of the time, I do not even care to look at these details. As long as the model gives good accuracy (extremely good in the above case) on test data, I just deploy it on a production server.

Just to compare it with the manual way of building models, I tried the SVC classifier on the same dataset. This is the comparison result:

Image by Author

Now tell me, why should I ever try the manual way?

A similar approach applies to regression tasks.

Regression Task

For regression, you use these three lines of code:

from autosklearn.regression import AutoSklearnRegressor
model_auto_reg = AutoSklearnRegressor(time_left_for_this_task=4*60, per_run_time_limit=30, n_jobs=-1)
model_auto_reg.fit(X_train_scaled, label_train)
print(model_auto_reg.sprint_statistics())

The last line just prints the statistics, which produced the following results on my test run on a UCI dataset.

auto-sklearn results:
Dataset name: 17f763731cd1a8b8a6021d5cd0369d8f
Metric: r2
Best validation score: 0.911255
Number of target algorithm runs: 19
Number of successful target algorithm runs: 6
Number of crashed target algorithm runs: 0
Number of target algorithms that exceeded the time limit: 11
Number of target algorithms that exceeded the memory limit: 2

Again, you see, it made 19 runs, although only 6 completed within the time and memory limits. That is a lot of patience to ask of a human being. I also tried the LinearRegression implementation in sklearn to compare the above model with the manual effort. These are the error metrics for my two test runs:

Image by Author
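For readers who want to reproduce the manual side of such a comparison, here is a minimal LinearRegression baseline with common error metrics. It is a sketch under my own assumptions: it uses scikit-learn's built-in diabetes dataset rather than the UCI dataset from my test run, and the choice of R2, MAE, and MSE is illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): not the one used in the article's test run.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, label_train, label_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Manual baseline: ordinary least squares.
baseline = LinearRegression().fit(X_train_scaled, label_train)
pred = baseline.predict(X_test_scaled)

metrics = {
    "r2": r2_score(label_test, pred),
    "mae": mean_absolute_error(label_test, pred),
    "mse": mean_squared_error(label_test, pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The same three metric calls can be applied to the predictions of the auto-sklearn regressor to fill in the other column of the comparison.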

If you are curious, you can examine the developed model by calling the show_models method, as discussed above for classification tasks.

As a further check on the developed model, I plotted the regression line for the two approaches, which is shown in this screenshot:

Image by Author

The auto-sklearn features

Internally, auto-sklearn does much more than an exhaustive search over algorithms. You can see this quickly in the architecture diagram given on their site.

Image source: Efficient and Robust Automated Machine Learning

As you see, besides algorithm selection, it also provides hyper-parameter tuning. It builds an ensemble of the top-performing algorithms. It uses meta-learning and Bayesian optimization to create an efficient pipeline. It also allows you to examine the model details, and you have control over the ensemble process. Just imagine doing all this on your own.

The only thing I find missing is that it does not give me an ipynb file based on the selected model so that I can do further tuning on my data pipeline, ensemble, and hyper-parameters.
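To give a feel for what ensembling the top performers means, here is a small weighted soft-voting ensemble built by hand with scikit-learn. This is a stand-in for illustration only: it is not auto-sklearn's internal ensemble-building method, and the member models, the weights, and the wine dataset are my own assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each member gets a weight, much like the 0.34 shown next to the
# extra_trees pipeline in the show_models output. Weights are hypothetical.
ensemble = VotingClassifier(
    estimators=[
        ("extra_trees", ExtraTreesClassifier(n_estimators=100, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("logreg", make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))),
    ],
    voting="soft",  # average the predicted class probabilities
    weights=[0.34, 0.33, 0.33],
)
ensemble.fit(X_train, y_train)

ensemble_accuracy = ensemble.score(X_test, y_test)
print(round(ensemble_accuracy, 3))
```

Choosing the members and their weights by hand is the part that an AutoML tool takes over.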

Now comes automated ANN development.

ANN/DNN Approach

With the splendid success of ANN/DNN in solving problems that were otherwise unsolvable with classical ML, data scientists are exploring, or rather already using, them for their ML tasks. Many data scientists design their own ANN architectures and/or use pre-trained models to achieve their business goals. Designing and using ANN architectures requires skills beyond those of a statistician. You have to step into the shoes of an ML engineer and know optimization and mathematics. Fortunately, there are automated tools for even this kind of task.

Like auto-sklearn, AutoKeras facilitates the design of networks for both regression and classification tasks. Not only that, it handles structured, image, and text data too.

Again, using this library is fairly simple. For classification on structured data, the code looks like this:

from autokeras import StructuredDataClassifier
search = StructuredDataClassifier(max_trials=10, num_classes=3)
search.fit(x=X_train_s, y=y_train, verbose=0, epochs=5)

For image data, it would be something similar to this:

import autokeras as ak
clf = ak.ImageClassifier(num_classes=10, overwrite=True, max_trials=1)

For regression tasks, the code is a bit more involved because of the callback I use to reduce the learning rate. This is shown here:

from autokeras import StructuredDataRegressor
from tensorflow.keras.callbacks import ReduceLROnPlateau
# Halve the learning rate when the training MSE fails to improve for one epoch.
lr_reduction = ReduceLROnPlateau(monitor='mean_squared_error', patience=1, verbose=1, factor=0.5, min_lr=0.000001)
regressor = StructuredDataRegressor(max_trials=3, loss='mean_absolute_error')
regressor.fit(x=X_train_scaled, y=label_train,
              callbacks=[lr_reduction], verbose=0, epochs=200)
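If the callback looks mysterious, the following pure-Python simulation shows the idea behind ReduceLROnPlateau with the same factor, patience, and min_lr as above. It is a simplified sketch, not the Keras implementation: it ignores cooldown and the minimum-improvement threshold, and the starting learning rate and the loss values are hypothetical.

```python
def simulate_reduce_lr(losses, lr=1e-3, factor=0.5, patience=1, min_lr=1e-6):
    """Return the learning rate in effect after each epoch.

    Simplified model of a reduce-on-plateau schedule: the rate is multiplied
    by `factor` once the monitored loss has failed to improve for `patience`
    consecutive epochs, and it never drops below `min_lr`.
    """
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        if loss < best:
            # Improvement: remember the new best and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Hypothetical per-epoch values of the monitored metric.
schedule = simulate_reduce_lr([0.9, 0.7, 0.7, 0.6, 0.65, 0.66])
print(schedule)
```

Every epoch without improvement halves the rate here because patience is 1, which is why such a run can afford a generous epochs=200.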

After the library creates the network model, you can evaluate its performance on the test data, get its evaluation score, look up error metrics, use it for predictions on unseen data, and so on. I will not discuss all those details here. Rather, I will show you the network model created by the library. You use a built-in function for exporting the model. On my test run done on a UCI dataset for classification, this was the model created:

Image by Author
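The evaluate, predict, and export steps mentioned above can be sketched as one function. This assumes an AutoKeras release that still ships StructuredDataClassifier, together with TensorFlow; the function name run_structured_search and its default arguments are hypothetical.

```python
def run_structured_search(X_train, y_train, X_test, y_test, max_trials=10, epochs=5):
    """Search for a network, score it on test data, and export the Keras model."""
    # Imported lazily: requires autokeras and TensorFlow to be installed.
    import autokeras as ak

    search = ak.StructuredDataClassifier(max_trials=max_trials, overwrite=True)
    search.fit(x=X_train, y=y_train, epochs=epochs, verbose=0)

    # Loss followed by the tracked metrics, measured on the held-out data.
    scores = search.evaluate(X_test, y_test, verbose=0)

    # Predictions on unseen data.
    predictions = search.predict(X_test)

    # Export the best network found as a regular Keras model and print its layers.
    keras_model = search.export_model()
    keras_model.summary()
    return keras_model, scores, predictions
```

Calling it with your own train and test splits returns the exported model, so you can save it or inspect it like any other Keras model.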

And this is the network plot for the same:

Isn’t this wonderful? As a data scientist, you need not even learn the skills of an ML engineer. Though I used the above two tools for demonstrating AutoML capabilities, there are several more available in the market, probably with many more features.

A Few Frameworks

Just to mention a few frameworks: you have H2O.ai, TPOT, MLBox, PyCaret, DataRobot, Databricks, and the recently announced BlobCity AutoAI. The list is certainly not exhaustive; I have restricted it to the few that I have evaluated. Some of these are paid services and others are free-to-use and open-source. Some support R. H2O in particular supports R, Python, Java, and Scala, so you have a wider choice of coding language. Some have nice GUIs, so no command-line coding is required. Some operate in the cloud. You may want to do your own evaluation when selecting an appropriate AutoML framework for your purpose.

One thing I would certainly like to mention here is that the newcomer, BlobCity AutoAI, generates the code for me, which no other AutoML tool I have seen so far does. It generates an ipynb file along with nice, detailed documentation. With other libraries, it is like working with a black-box model. This feature is a great boon for me, as all my customers demand the source code before accepting delivery.

The company also claims that the tool selects automatically between classical ML and ANN. This can be a great aid to data scientists, who always face the dilemma of choosing between the two. They also claim automatic selection between regression and classification, which other tools do too, so this by itself is not a big deal for me.

I must say here that I have not fully verified their claims so far. It is an open-source project, which gives me an opportunity to validate such claims. All said, I will not jump into it directly. The project is newly launched, is under development, and is likely to have lots of bugs unless they have tested it thoroughly. If they deliver the features they are talking about, it will be a great boon to data scientists and to the whole data science space.

Concluding Recommendations

I have shared my current approach to handling data science projects. I use AutoML tools. I do not wish to say which one; at times it is more than one. Mentioning names may look like an endorsement, so I am avoiding it. That said, using AutoML tools will save you, as a data scientist, lots of energy. A word of warning: all the tools I have tested and used so far have one or more bugs, as I can see from the issues raised on their GitHub repositories. If you find a reproducible issue, I suggest you submit it to them. Such tools will make our lives easier and will surely not make us (data scientists) extinct.


Credits

Pooja Gramopadhye – Copy editing

George Saavedra – Program development

Join Medium with my referral link – Poornachandra Sarang

