Undoubtedly, Scikit-learn is one of the best machine learning libraries available today, and there are several reasons for that. One is the consistency among Scikit-learn estimators: every model follows the same .fit()/.predict() (or .fit()/.transform()) paradigm, a level of consistency that is hard to find in other machine learning libraries. Another is its versatility: it can be used for classification, regression, clustering, dimensionality reduction and anomaly detection.
Therefore, Scikit-learn is a must-have Python library in your Data Science toolkit. But learning to use Scikit-learn is not straightforward. It’s not as simple as you might imagine: you have to set up some background before learning it, and even while learning Scikit-learn you should follow some guidelines and best practices. In this article, I’m happy to share 9 guidelines that worked for me to master Scikit-learn without giving up the learning process in the middle. Whenever possible, I will include links to my previous posts, which will help you set up the background and continue learning Scikit-learn.
Setting up the background
Guideline 1: You should be familiar with NumPy before you start using Scikit-learn
NumPy is a powerful library used to perform numerical calculations in Python. Scikit-learn and many other Python libraries used for data analysis and machine learning are built on top of NumPy. Scikit-learn estimators take their inputs and return their outputs in the form of NumPy arrays.
About 30–40% of the mathematical knowledge required for data science and machine learning comes from linear algebra, where matrix operations play a significant role. We often use NumPy to perform matrix operations in Python; it also has special classes and sub-packages for them.
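For instance, here is a minimal sketch of a few common matrix operations in NumPy (the matrices are purely illustrative):
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A @ B)               # matrix multiplication
print(A.T)                 # transpose
print(np.linalg.inv(A))    # inverse
print(np.linalg.det(A))    # determinant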
From the above facts, it is clear that you should be familiar with NumPy before you start using Scikit-learn and machine learning. The following articles written by me cover NumPy topics such as basics, array creation, array indexing, arithmetic operations and linear algebra with NumPy. They were specially designed to give you hands-on experience while learning these NumPy-related topics.
- NumPy for Data Science: Part 1 – NumPy Basics and Array Creation
- NumPy for Data Science: Part 2 – Array Indexing and Slicing
- NumPy for Data Science: Part 3 – Arithmetic Operations on NumPy Arrays
- NumPy for Data Science: Part 4 – Linear Algebra with NumPy
- Top 10 Matrix Operations in Numpy with Examples
Guideline 2: Parallel learning is much more efficient
Once you’re familiar with NumPy, you’re ready for parallel learning – learning several Python packages simultaneously. You can start learning the pandas, matplotlib and seaborn packages at the same time. The following articles written by me cover topics related to pandas, matplotlib and seaborn.
- pandas for Data Science: Part 1 – Data Structures in pandas
- pandas for Data Science: Part 2 – Exploring a Dataset
- Say "Hello!" to the World of Plots
Guideline 3: Set up your own coding environment
Practising while you’re learning is the key to success in the fields of data science and machine learning, so it’s better to have your own coding environment on your computer. The simplest and easiest way to get Python and the other data science libraries is to install them through Anaconda, the most popular Python distribution for data science. It includes hundreds of packages, IDEs, a package manager, a navigator and much more. It also lets you install new libraries: all you need to do is run the relevant command in the Anaconda terminal. To get started with Anaconda:
- Go to https://www.anaconda.com/products/individual
- Click on the relevant download option
- After downloading the setup file, double click on it and follow the on-screen instructions to install Anaconda on your local machine. While installing, please keep the default settings recommended by Anaconda.
Note: At the time of writing, the Anaconda installer is available with Python 3.8 for Windows (both 64-bit and 32-bit), macOS and Linux. You can download the relevant installer depending on your computer.
After installing, you will find the Anaconda icon on your desktop. Double click on it to launch the Anaconda Navigator. Most of the frequently used packages, such as numpy, pandas, matplotlib, seaborn and scikit-learn, already come with Anaconda, so you do not need to install them separately. From the Anaconda Navigator, you can launch the Jupyter Notebook IDE to run your own code.
Start using Scikit-learn
Guideline 4: Distinguish between supervised and unsupervised learning
In supervised learning, we train the model using labelled data. Both X and y are involved in the training process. X includes the input variables. y includes the labels. In machine learning terminology, X is called the feature matrix (usually a two-dimensional numpy array or a pandas DataFrame) and y is called the target vector (usually a one-dimensional numpy array or a pandas Series).

Mathematically, when we have X and y, we use supervised learning algorithms to learn the mapping function from the input to the output, y = f(X). The goal of supervised learning is to approximate the mapping function so well that it can find a label (in the case of classification) or a value (in the case of regression) for new, unseen data.
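As a minimal sketch of this setup (using the built-in iris dataset and a k-nearest neighbours classifier purely as an illustration):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # X: feature matrix, y: target vector

model = KNeighborsClassifier()      # a supervised (classification) estimator
model.fit(X, y)                     # learn the mapping y = f(X)
print(model.predict(X[:3]))         # predict labels (in practice, for unseen data – see Guideline 9)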
Classification and Regression are two types of supervised learning. You can learn classification and regression algorithms by reading the following articles written by me. All of them will give you hands-on experience. You can learn how those algorithms work behind the scenes while implementing them with Python and Scikit-learn.
- Linear Regression with Gradient Descent
- Support Vector Machines with Scikit-learn
- Train a regression model using a decision tree
- Random forests – An ensemble of decision trees
- Polynomial Regression with a Machine Learning Pipeline
- A Journey through XGBoost: Milestone 1 – Setting up the background
- A Journey through XGBoost: Milestone 2 – Classification with XGBoost
- A Journey through XGBoost: Milestone 3 – Regression with XGBoost
In unsupervised learning, we train the model using only the input variables (X), not using the labels (y).

The goal of unsupervised learning is to find the hidden patterns or the underlying structure or outliers in the given input data.
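As a minimal sketch (using K-Means clustering on synthetic data, purely as an illustration):
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # only X is used, no labels

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)                  # find the hidden group structure
print(kmeans.labels_[:10])     # cluster assignments for the first 10 samples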
Clustering, Dimensionality Reduction and Anomaly (Outlier) Detection are three types of unsupervised learning. You can learn clustering, dimensionality reduction and anomaly detection algorithms by reading the following articles written by me. All of them will give you hands-on experience. You can learn how those algorithms work behind the scenes while implementing them with Python and Scikit-learn.
- Hands-On K-Means Clustering
- 4 Useful clustering methods you should know in 2021
- Principal Component Analysis (PCA) with Scikit-learn
- Principal Component Analysis for Breast Cancer Data with R and Python
- Statistical and Mathematical Concepts behind PCA
- Factor Analysis on "Women Track Records" Data with R and Python
- Two outlier detection techniques you should know in 2021
- 4 Machine learning techniques for outlier detection in Python
Guideline 5: Be familiar with Scikit-learn consistency
In Scikit-learn, machine learning models are commonly known as estimators. There are two main types of estimators: predictors and transformers. The predictors are further divided into classifiers and regressors.
The .fit()/.predict() paradigm is applied to predictors, and the .fit()/.transform() paradigm is applied to transformers. For example:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
predictions = lin_reg.predict(X_test)
When we call the fit() method of lin_reg, the model begins to learn from the data (i.e. it finds the linear regression model coefficients). When we call the predict() method of lin_reg, predictions are made on new, unseen data. This kind of .fit()/.predict() consistency applies to all Scikit-learn predictors. Also, note the consistency when importing the LinearRegression class: it lives in the linear_model subpackage, which is why we write "from sklearn.sub_package_name" followed by "import class_name". This kind of consistency applies when importing all Scikit-learn predictors.
The learned attributes of lin_reg can be accessed through their names followed by an underscore. For example, lin_reg.coef_ and lin_reg.intercept_ give the linear regression model coefficients and the intercept. The trailing underscore is another Scikit-learn consistency, used for estimator attributes learned during fitting.
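Continuing the example above, a minimal sketch of accessing those attributes:
print(lin_reg.coef_)        # learned coefficients (one per feature)
print(lin_reg.intercept_)   # learned intercept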
Let’s take another example:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X)
scaled_data = sc.transform(X)
When we call the fit() method of sc, the transformer learns from the data (i.e. it calculates the mean and standard deviation of each column of X). When we call the transform() method of sc, the transformation happens (i.e. the values of X are scaled). This kind of .fit()/.transform() consistency applies to all Scikit-learn transformers. Also, note the consistency when importing the StandardScaler class: it lives in the preprocessing subpackage, which is why we write "from sklearn.sub_package_name" followed by "import class_name". This kind of consistency applies when importing all Scikit-learn transformers.
You can also call the fit() and transform() methods at once by running:
scaled_data = sc.fit_transform(X)
Guideline 6: Do not memorize Scikit-learn syntax, instead use help()
When doing machine learning, you do not need to memorize the Scikit-learn syntax. All you need to do is think about the workflow of your model and use the help() function to look up the syntax. For example:
from sklearn.linear_model import LinearRegression
help(LinearRegression)

Guideline 7: Distinguish between model parameters and hyperparameters
Model parameters learn their values during the training process. We do not set values for them manually; they are learned from the data that we provide. For example, the linear regression model coefficients learn their values during the training process in order to find the best model, i.e. the one that minimizes the RMSE.
In contrast, model hyperparameters do not learn their values from data, so we have to set them manually. We always set values for the model hyperparameters when creating a particular model, before we start the training process. For example, the n_estimators hyperparameter refers to the number of trees in the forest in ensemble models such as random forests. Its default value is 100. We can change it to a higher value to increase the number of trees; doing so may improve model performance, but it also consumes more computational power. Our goal is to set it to an optimum value so that we maintain the balance between model performance and computational cost.
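A minimal sketch of setting a hyperparameter at model creation (the value 200 is purely illustrative):
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200)   # hyperparameter set manually, before training
# rf.fit(X_train, y_train)                      # model parameters (the trees) are learned here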
Scikit-learn provides default values for the hyperparameters of its estimators. In most cases, those are not the optimum values, and we often want to find the optimum values depending on our data and the problem we are trying to solve. The process of finding the optimum values for the hyperparameters is called hyperparameter tuning.
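As a minimal sketch of hyperparameter tuning with GridSearchCV (the dataset and the candidate values are purely illustrative):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {'n_estimators': [50, 100, 200]}    # candidate hyperparameter values
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)    # the optimum value found for n_estimators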
To learn more about the hyperparameter tuning process, please read the "Using k-fold cross-validation for hyperparameter tuning" section of my "k-fold cross-validation explained in plain English" article and "Validation Curve Explained – Plot the influence of a single hyperparameter" article.
Guideline 8: Scikit-learn classes are different from objects
Different objects (models) can be created from the same class by changing the value(s) of hyperparameters. Let’s look at an example:
from sklearn.decomposition import PCA
pca_1 = PCA()
When we create the pca_1 object from the PCA() class in this way, the following default hyperparameter values are applied.

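You can also inspect the default values programmatically, for example:
print(pca_1.get_params())
# e.g. {'copy': True, 'n_components': None, 'svd_solver': 'auto', ...} depending on your Scikit-learn version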
But, we can modify the values of hyperparameters to create different objects.
from sklearn.decomposition import PCA
pca_1 = PCA()
pca_2 = PCA(n_components=2)
pca_3 = PCA(n_components=3)
pca_4 = PCA(n_components=2, svd_solver='randomized')
So, you have the freedom to change the values of the hyperparameters depending on your requirements.
Guideline 9: Do not evaluate your model with the same data used for training
The main intention of doing any kind of machine learning is to develop a more generalized model which can perform well on unseen data. One can build a perfect model on the training data with 100% accuracy or 0 error, but it may fail to generalize for unseen data. A good ML model not only fits the training data very well but also is generalizable to new input data. A model’s performance can only be measured with data points that have never been used during the training process. That is why we often split our data into a training set and a test set. The data splitting process can be done more effectively with k-fold cross-validation. To learn more about the model evaluation process, please read the "Using k-fold cross-validation for evaluating a model’s performance" section of my "k-fold cross-validation explained in plain English" article.
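A minimal sketch of both approaches (the dataset and model are purely illustrative):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Hold out a test set that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))          # evaluate only on unseen data

# Or use k-fold cross-validation for a more robust estimate
print(cross_val_score(model, X, y, cv=5))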
Summary
The above-mentioned guidelines worked for me. All the links included in this article point to my own work. You can read them as well and judge how effective these guidelines are for learning Scikit-learn and machine learning. Building a solid foundation is very important so that you never give up the learning process in the middle.
Thanks for reading!
This tutorial was designed and created by Rukshan Pramoditha, the Author of Data Science 365 Blog.
Read my other articles at https://rukshanpramoditha.medium.com
2021–04–05