The world’s leading publication for data science, AI, and ML professionals.

AutoML will not replace your data science profession

Let’s walk through the steps of the machine learning process to find out "Why?"

Photo by Alex Knight on Unsplash

Many people who are already data scientists, or who are new to the field, are looking for an answer to the question "Will AutoML (Automated Machine Learning) replace data scientists?" Asking a question like this is very reasonable because automation has already been introduced to machine learning and plays a key role in the modern world. In addition, people who want to become data scientists are thinking about how to secure a spot in the job market for the long term.

AutoML will **NOT** replace your data science profession. It's just here to make things easier for you: assisting you with boring repetitive tasks, saving your valuable time, helping with code maintenance and consistency, and so on!

Let's walk through the steps of a machine learning process to find out why AutoML will NOT replace your data science profession. We'll also discuss some popular automation options that can be applied along the way.

Let’s introduce some key definitions

Basically, Machine Learning is the ability of computers to learn from data without being explicitly programmed, which sets it apart from traditional programming. Automation is a process that requires minimal human input, and there are various types of automation. Only AI automation uses machine learning: it combines automation with machine learning, so AI-automated systems can learn and make decisions based on data. Applying automation to machine learning means using automation options to accomplish repetitive tasks in a machine learning process with minimal human effort.

The steps of a machine learning process

The following diagram shows the steps of a machine learning process divided into three main categories: automation usually applicable, semi-automation applicable, and automation usually not applicable.

(Image by author)

Let’s discuss each step in detail.

Problem formulation

This is where data scientists are truly needed. Problem formulation cannot be automated since every problem is different and requires a lot of domain knowledge. There is no single approach to problem formulation; data scientists should use different strategies depending on the given scenario. This is something that AutoML cannot handle. In this step, AutoML cannot replace data scientists.

Collecting data

Data scientists or data engineers should decide what type of data and how much of it needs to be collected. This depends on the problem they want to solve. These decisions cannot be automated. However, they can use relevant automation options for data mining to avoid repetitive tasks. Therefore, data collection is something that can be semi-automated! Data scientists are still needed for this step, and AutoML cannot fully replace them.

Data cleaning

Data scientists and data engineers spend 60–70% of their time on data cleaning. This is because every dataset is different and needs domain-specific knowledge. This is the most important step in the machine learning process. Data cleaning involves handling missing values, outlier detection, encoding categorical variables, etc. Dealing with missing values is the most time-consuming part for data scientists. Outlier detection involves a lot of domain knowledge. If we detect an outlier, what should we do next? Should we remove it? Should we keep it? Or should we replace it with a relevant value? This depends on domain knowledge of the particular problem under analysis. There may be an interesting story behind an outlier. These things cannot be done with automated systems. Therefore, data science professionals are needed here.
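As a small illustration, here is how some of these repetitive cleaning steps might look in pandas. The dataset and column names are made up for the example; note that the mechanical parts (imputation, flagging, encoding) can be automated, but deciding what to do with a flagged outlier still needs a human:

```python
import pandas as pd

# A tiny hypothetical dataset with a missing value and an obvious outlier
df = pd.DataFrame({
    "age": [25, 32, None, 41, 39],
    "salary": [40000, 52000, 48000, 1_000_000, 45000],
    "city": ["Colombo", "Kandy", "Colombo", "Galle", "Kandy"],
})

# Handle missing values: impute with the median (easily automatable)
df["age"] = df["age"].fillna(df["age"].median())

# Flag outliers with the IQR rule; deciding whether to remove, keep,
# or replace a flagged value still requires domain knowledge
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)

# Encode the categorical variable
df = pd.get_dummies(df, columns=["city"])
print(df["is_outlier"].sum())  # 1 outlier flagged
```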

Model selection

(Image by author)

Model selection refers to choosing an appropriate machine learning algorithm to solve your problem. Data scientists or machine learning engineers should make multiple decisions to select the best algorithm, depending on the type of problem at hand and the amount and type of data collected.

If your data has labels (ground truth values), you can select a supervised method such as regression or classification. If the labels are class values, choose a classification algorithm; if the labels are continuous values, choose a regression algorithm. If you have non-linear data, you can select a non-linear classifier or regressor (e.g. Decision Tree, Random Forest, XGBoost). If you have linear data, you can use a linear regressor for regression tasks and a support vector machine with a linear kernel for classification tasks.

If your data does not contain any labels (ground truth values), you can select an unsupervised method such as dimensionality reduction or clustering. To find hidden patterns in unlabeled data, you can use the KMeans algorithm when the number of clusters is known, or try the MeanShift algorithm when it is not. To reduce the dimensionality of the data, you can use an algorithm like PCA or LDA for linear data, or Kernel PCA and t-SNE for non-linear data.

Algorithm selection also depends on the amount of data you have. In general, it depends on various criteria, and data science professionals may look for ways to automate these decisions. Some Python frameworks are available today, but they cannot fully automate the process.
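The decision logic above can be sketched as a toy helper function. This is only an illustration of the branching, not a real AutoML system; the function name and its flags are invented for the example:

```python
from sklearn.cluster import KMeans, MeanShift
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

def suggest_model(has_labels, label_type=None, linear=True, n_clusters=None):
    """Toy sketch of the decision logic above -- not a real AutoML tool."""
    if has_labels:
        if label_type == "class":  # classification
            return SVC(kernel="linear") if linear else RandomForestClassifier()
        return LinearRegression() if linear else RandomForestRegressor()  # regression
    # Unsupervised: cluster count known -> KMeans, unknown -> MeanShift
    return KMeans(n_clusters=n_clusters, n_init=10) if n_clusters else MeanShift()

print(type(suggest_model(True, "class", linear=False)).__name__)  # RandomForestClassifier
```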

Feature selection

The selected algorithm should be able to select features by considering feature importances. Backward elimination, forward selection and random forest methods can be used for feature selection. These algorithms can automatically select the best features, but data scientists should still manually set the parameter values before using these methods. This means data scientists are needed here as well.
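For example, Scikit-learn's SelectFromModel can pick features automatically from random forest importances, but notice how many choices (estimator, threshold, tree count) the data scientist still makes by hand. The dataset here is synthetic, for illustration only:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, only 3 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=42)

# The data scientist still chooses the estimator and the selection threshold
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=42),
    threshold="median",  # keep features at or above the median importance
)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)
```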

Hyperparameter tuning

Model parameters learn their values from the input data during the training process. In contrast, model hyperparameters do not learn their values during training; data science professionals should specify them before the training process begins. A machine learning model will typically contain many hyperparameters. The data scientist's task is to try different values for each hyperparameter and find the best combination. Doing this manually, one value at a time, would take a very long time. Automation options such as Grid Search or Randomized Search are available for this. Most of the time, the hyperparameter tuning process contains hundreds or even thousands of iterations that cannot be handled manually by data scientists. Some hyperparameters take values ranging from 0 to infinity. The data scientist's task is to use domain knowledge to narrow down the range of hyperparameter values, and then apply Grid Search or Randomized Search for an optimal tuning process!
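To make the "narrow down, then search" idea concrete, here is a small Randomized Search sketch (Grid Search appears later in this article). The dataset and the chosen ranges are illustrative assumptions:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data
X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# Domain knowledge narrows the ranges; Randomized Search samples from them
param_distributions = {
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=0),
    param_distributions,
    n_iter=5,   # only 5 random combinations instead of an exhaustive grid
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```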

Model evaluation

The model evaluation process ensures that the model is well fitted on the training data and is generalizable to new unseen data as well. For supervised learning, the model evaluation process is easy because labels (ground truth values) are available. In contrast, model evaluation is challenging in unsupervised learning where labels (ground truth values) are not available. It’s hard to find any automation option for model evaluation in unsupervised learning.
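Internal metrics such as the silhouette coefficient are among the few automatable checks for clustering, but interpreting what a score means for a given problem still falls to the data scientist. A minimal sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic unlabeled data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Without ground-truth labels, an internal metric like the silhouette
# coefficient gives a rough quality signal (closer to 1 = better separated)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 2))
```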

Key responsibilities of data scientists

Now, we can figure out some of the key responsibilities of data scientists in a Machine Learning process.

  • Data scientists should formulate the problem.
  • Data scientists should instruct the algorithms on how to learn from data.
  • They should identify true relationships between the features (variables).
  • They should provide a well-prepared adequate amount of data to the algorithms.
  • In most cases, they should be able to interpret the model and its final output.

When taking on these responsibilities, data scientists can use automation options for some parts of a machine learning process. But, AutoML cannot fully replace these responsibilities of a data scientist.

Automation options available for machine learning tasks

In this section, we’ll introduce some of the automation options that can be applied to the steps of a machine learning process. Python code is also included for some methods.

Cross-validation with cross_val_score()

We can use the Scikit-learn cross_val_score() function for model evaluation through cross-validation. The following Python code performs cross-validation for a regression model built on the "house-prices dataset".

(Image by author)
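Since the original code is shown as an image, here is a minimal reconstruction of the idea, with a synthetic dataset standing in for the house-prices data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the house-prices dataset
X, y = make_regression(n_samples=200, n_features=8, noise=10, random_state=42)

# 5-fold cross-validation; the scorer returns negative MSE,
# so negate it and take the square root to get RMSE per fold
scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X, y, cv=5, scoring="neg_mean_squared_error",
)
rmse = np.sqrt(-scores).mean()
print(f"Average RMSE: {rmse:.2f}")
```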

Here, we've trained the model several times with different data folds and then taken the average RMSE value. Instead of doing this manually, we've automated the process with the Scikit-learn cross_val_score() function. Therefore, automation can be used to handle repetitive tasks in model evaluation.

Hyperparameter tuning with grid search

We can use the Scikit-learn GridSearchCV() function for hyperparameter tuning. The following Python code performs hyperparameter tuning for a regression model built on the "house-prices dataset".

(Image by author)
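The original code is shown as an image; a reconstruction with the same 8 x 8 x 3 grid might look like this. The dataset is synthetic, and n_estimators is lowered to 10 here to keep the demo fast (the article's example uses 100 trees per forest):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the house-prices dataset
X, y = make_regression(n_samples=150, n_features=8, random_state=42)

# 8 x 8 x 3 = 192 combinations, each evaluated with 5-fold CV (960 fits)
param_grid = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8, 9],
    "min_samples_leaf": [1, 2, 3, 4, 5, 6, 7, 8],
    "max_features": ["sqrt", "log2", 1.0],
}
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=42),
    param_grid, cv=5, n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_)
```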

Here, we tune 3 hyperparameters, ‘max_depth’, ‘min_samples_leaf’ and ‘max_features’, in RandomForestRegressor. The hyperparameter space is therefore 3-dimensional, and each combination contains 3 values. The number of combinations is 192 (8 x 8 x 3), because max_depth takes 8 values, min_samples_leaf takes 8 values and max_features takes 3 values. This means we train 192 different models! Each combination is repeated 5 times in the 5-fold cross-validation process, so the total number of iterations is 960 (192 x 5). Also note that each RandomForestRegressor has 100 decision trees, so the total number of tree fits is 96,000 (960 x 100)! The automation process makes things easier for data scientists.

Training multiple models using Pipeline()

By using the Scikit-learn Pipeline(), we can automate the training process of multiple complex models. The following diagram shows the general workflow of building a Polynomial Regression model. The steps should be applied in the given order.

General workflow of a Polynomial Regression model (Image by author)

Pipelines automate the training process by sequentially applying a list of transformers and a final predictor. In our workflow,

  1. StandardScaler() is a transformer.
  2. PCA() is a transformer.
  3. PolynomialFeatures() is a transformer.
  4. LinearRegression() is a predictor.

The following Python code will automate the above workflow.
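The original code appears as an image; a minimal sketch of such a pipeline could look like the following. The dataset is a synthetic stand-in, and the PCA component count and polynomial degree are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for a real dataset
X, y = make_regression(n_samples=100, n_features=5, noise=5, random_state=1)

# Transformers run in the given order; the final step is the predictor
poly_reg_model = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),          # n_components=2 is an illustrative choice
    ("poly", PolynomialFeatures(degree=2)),
    ("regressor", LinearRegression()),
])
```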

Now, you can train all the estimators by a single .fit() call.

poly_reg_model.fit(X, y)

Other automation options

The following is a list of other popular automation options that we will not discuss in detail in this article.

  • PyCaret – This is the Python version of the Caret package available in R. Most machine learning tasks in PyCaret are as simple as a function call. You can train and visualize multiple machine learning models with a few lines of code.
  • Microsoft Azure AutoML – This applies automation to machine learning models with speed and scale. If you combine this with Microsoft Power BI, you can extract maximum value from your data.
  • TPOT – This is a Python library that handles machine learning tasks by applying automation.
  • Google Cloud AutoML – You can train high-quality custom machine learning models with minimal effort.

Key takeaways

Now, you have a clear understanding of AutoML and how it applies to machine learning tasks. We've walked through the steps of a machine learning process and found the reasons why AutoML will NOT replace your data science profession. We've also discussed some of the automation options available today. You can try them out! In future articles, I'll cover some of them as well.

There are two fundamental reasons why the machine learning process cannot be fully automated:

  • The need for domain knowledge
  • The presence of unlabeled data (in unsupervised learning)

Because of these things, AutoML cannot replace data scientists.

In addition to that, I’d like to give you the following two pieces of advice on using AutoML:

  • If you’re new to the field of data science or machine learning, do not start your machine learning process with easy-to-use frameworks like Microsoft Azure AutoML or Google Cloud AutoML. Instead, it’s better to learn Python (or R) and its related packages. After you have solid foundational knowledge of machine learning theory and those packages, you can try out Microsoft Azure AutoML or Google Cloud AutoML. Doing so will create a long and clear path to becoming a master in the field.
  • When you learn data science and machine learning, focus more on data cleaning tasks such as handling missing values, outlier detection and feature encoding, and on unsupervised learning methods. In addition, focus on gaining domain knowledge of the specific problem and on interpreting your results in plain English so that even a non-technical person can understand your findings. Those are the things that cannot be replaced by automation.

Finally, data scientists are indeed needed! Automation just makes things easier for them.

Thanks for reading!

This tutorial was designed and created by Rukshan Pramoditha, the Author of Data Science 365 Blog.

Read my other articles at https://rukshanpramoditha.medium.com

2021–05–11

