
8 Recommended Python Libraries to Start Your Machine Learning Journey

Without learning them properly, you'll give up in the middle!

Photo by Chetan Kolte on Unsplash

Machine learning is one of the trending topics these days, and Python is the number one programming language among many users. However, Python is a general-purpose programming language, meaning that it is used in many different fields. To use Python for machine learning, you need to learn some additional libraries on top of general Python.

In this post, I’ll give an overview of the most fundamental Python libraries that will help you begin your machine learning journey. I highly recommend that you become as familiar with them as possible because they are the fundamentals. Without learning them properly, you’ll give up in the middle of the learning journey!

1. NumPy

NumPy, which stands for Numerical Python, is the mother library on top of which many other libraries are built. The ndarray (N-dimensional array) object is the main data structure in NumPy. In machine learning, we often work with vectors (1D arrays) and matrices (2D arrays), and NumPy provides easy methods to create those arrays. When working with image data, we typically deal with 3D NumPy arrays. NumPy also provides a large collection of mathematical functions, especially for linear algebra.

Resources

Screenshot by author

Installation

NumPy comes with the Anaconda installer by default, so if you’ve installed Python through Anaconda, you do not need to install NumPy again. Otherwise, there are two ways to install it.

conda installation

conda install -c anaconda numpy
#OR
conda install -c conda-forge numpy

pip installation

pip install numpy

Import convention

The community-accepted import convention for NumPy is:

import numpy as np
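
To make the idea of 1D, 2D, and 3D arrays concrete, here is a minimal sketch. The values are made up for illustration, and the "image" is just a placeholder array of zeros rather than real image data.

```python
import numpy as np

# A vector is a 1D array.
vector = np.array([1.0, 2.0, 3.0])

# A matrix is a 2D array (here: 2 rows x 3 columns).
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# A color image is commonly stored as a 3D array: height x width x channels.
# This is an all-black 4x4 RGB placeholder, not real image data.
image = np.zeros((4, 4, 3), dtype=np.uint8)

print(vector.ndim, matrix.ndim, image.ndim)  # 1 2 3
print(matrix.shape)                          # (2, 3)

# Linear algebra: matrix-vector product.
product = matrix @ vector
print(product)  # [14. 32.]
```

The `ndim` and `shape` attributes are usually the first things to check when an array does not behave as expected.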

2. Pandas

Pandas is the Python data manipulation and analysis library. It was built on top of NumPy, which means that it works with NumPy N-dimensional arrays. Pandas is one of the most widely used libraries in the data science community. It provides methods for data loading, data cleaning, variable encoding, data transforming, and much more. Pandas also provides plotting functions because plotting libraries such as Matplotlib have been integrated with it. Series and DataFrame are the two main data structures in Pandas. A Series can be created from a 1-dimensional NumPy array, while a DataFrame can be created from a 2-dimensional NumPy array.

Resources

Screenshot by author

Installation

Pandas comes with the Anaconda installer by default, so if you’ve installed Python through Anaconda, you do not need to install Pandas again. Otherwise, there are two ways to install it.

conda installation

conda install -c anaconda pandas
#OR
conda install -c conda-forge pandas

pip installation

pip install pandas

Import convention

The community-accepted import convention for Pandas is:

import pandas as pd
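
As a rough illustration of how Series and DataFrames relate to NumPy arrays, and of the cleaning and encoding steps mentioned above, consider the following sketch. The column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

# A Series wraps a 1D array and adds an index (labels).
ages = pd.Series(np.array([25, 32, 47]), name="age")

# A DataFrame wraps a 2D array and adds row and column labels.
values = np.array([[25, 50000.0],
                   [32, np.nan],      # a missing income value
                   [47, 82000.0]])
df = pd.DataFrame(values, columns=["age", "income"])

# Data cleaning: fill the missing income with the column mean.
df["income"] = df["income"].fillna(df["income"].mean())

# Variable encoding: one-hot encode a categorical column.
df["city"] = ["Paris", "London", "Paris"]
encoded = pd.get_dummies(df, columns=["city"])

print(encoded.columns.tolist())
```

Filling with the mean is only one possible cleaning strategy; it is used here because it fits in one line.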

3. Matplotlib

Matplotlib is the basic plotting library in Python, yet it provides tons of customization options for your plots. It is also the mother library for other, more advanced plotting libraries. The library has two different application programming interfaces (APIs): the Pyplot interface and the object-oriented interface.

Resources

Installation

Matplotlib comes with the Anaconda installer by default, so if you’ve installed Python through Anaconda, you do not need to install Matplotlib again. Otherwise, there are two ways to install it.

conda installation

conda install -c conda-forge matplotlib

pip installation

pip install matplotlib

Import convention

The community-accepted import convention for Matplotlib is:

import matplotlib.pyplot as plt
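
The difference between the two interfaces is easier to see side by side. The sketch below draws one simple curve with each of them; the data and titles are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# 1) Pyplot interface: state-based functions act on the "current" figure.
plt.figure()
plt.plot(x, np.sin(x), label="sin(x)")
plt.title("Pyplot interface")
plt.legend()

# 2) Object-oriented interface: call methods on explicit Figure and Axes objects.
fig, ax = plt.subplots()
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_title("Object-oriented interface")
ax.legend()

# plt.show()  # uncomment to display both figures when running as a script
```

The Pyplot interface is convenient for quick plots, while the object-oriented interface tends to be easier to manage once a figure has several subplots.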

4. Seaborn

Seaborn is a high-level data visualization library, meaning that it automatically does many things for us! It also provides attractive default styles for your plots. Because Seaborn is built on top of Matplotlib, you can customize Seaborn plots by using Matplotlib.

Resources

Installation

Seaborn comes with the Anaconda installer by default, so if you’ve installed Python through Anaconda, you do not need to install Seaborn again. Otherwise, there are two ways to install it.

conda installation

conda install -c anaconda seaborn

pip installation

pip install seaborn

Import convention

The community-accepted import convention for Seaborn is:

import seaborn as sns
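
Here is a small sketch of what "high-level" means in practice: one Seaborn call colors the points by group and builds the legend, and the result can still be customized with Matplotlib. The dataset is made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Small made-up dataset, for illustration only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 58, 65, 70, 78, 85],
    "group": ["A", "A", "A", "B", "B", "B"],
})

sns.set_theme()  # apply Seaborn's default styling

# One call handles the color mapping and the legend for each group.
ax = sns.scatterplot(data=df, x="hours_studied", y="exam_score", hue="group")

# The returned object is a Matplotlib Axes, so Matplotlib methods customize it.
ax.set_title("Exam score vs. hours studied")
ax.set_xlabel("Hours studied")

# plt.show()  # uncomment to display the plot when running as a script
```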

5. Scikit-learn

Scikit-learn is a Python machine learning library. Its syntax is so consistent that even beginners can get familiar with the entire library by creating one or two models. Its official documentation provides all the support you need for using the library. It includes algorithms for classification, regression, clustering, and dimensionality reduction. It also provides advanced methods for data preprocessing.

Resources

Screenshot by author

Installation

Scikit-learn comes with the Anaconda installer by default, so if you’ve installed Python through Anaconda, you do not need to install Scikit-learn again. Otherwise, there are two ways to install it.

conda installation

conda install -c anaconda scikit-learn
#OR
conda install -c conda-forge scikit-learn

pip installation

pip install scikit-learn

Import convention

We do not import the entire library at once. Instead, we import the classes and functions as we need them.
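
As a minimal sketch of this import style and of the consistent syntax, the example below preprocesses the built-in iris dataset and fits a classifier. The choice of dataset, model, and parameter values is mine, purely for illustration.

```python
# Import only the pieces that are needed, not the whole library.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Preprocessing: transformers share fit() / transform().
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics

# Modeling: estimators share fit() / predict() / score().
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
accuracy = model.score(X_test_scaled, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping `LogisticRegression` for another classifier usually requires changing only the import and the line that creates the model, which is what makes the library easy to pick up.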

6. Yellowbrick

Yellowbrick is a machine learning visualization library. It is suited to machine learning-related visualizations, and its syntax is very similar to that of the Scikit-learn library. With Yellowbrick, you can create advanced plots with just one or two lines of code!

Resources

Screenshot by author

Installation

Yellowbrick doesn’t come with the Anaconda installer by default. Therefore, you need to install it separately. There are two methods.

conda installation

conda install -c districtdatalabs yellowbrick

pip installation

pip install yellowbrick

Import convention

Like Scikit-learn, we do not import the entire library at once. Instead, we import the classes and functions as we need them.

7. XGBoost

When we consider the performance of machine learning models, XGBoost (Extreme Gradient Boosting) is among the most preferred machine learning algorithms for data scientists and machine learning engineers. XGBoost (this time, the library) is available for many programming languages, including Python. It offers a Scikit-learn wrapper (a Scikit-learn compatible API) so that we can use XGBoost like Scikit-learn. There is also a non-Scikit-learn compatible API, but it is more difficult to use than the wrapper. Therefore, I recommend that you first use XGBoost’s Scikit-learn wrapper and then go to the non-Scikit-learn version (if you want).

Resources

Screenshot by author

Installation

XGBoost doesn’t come with the Anaconda installer by default. Therefore, you need to install it separately using one of the following commands.

conda install -c anaconda py-xgboost #Windows
conda install -c conda-forge xgboost #MacOS or Linux

Import convention

The community-accepted import convention for XGBoost is:

import xgboost as xgb

8. TensorFlow

TensorFlow is a library created for deep learning tasks. Deep learning is a subset of machine learning, and TensorFlow can also be used for general machine learning. It has two APIs: a high-level API and a low-level API. Its main data structure is the tensor.

Resources

Screenshot by author

Installation

TensorFlow doesn’t come with the Anaconda installer by default. Therefore, you need to install it separately. There are two methods.

conda installation

conda install -c conda-forge tensorflow #CPU-only

pip installation

pip install tensorflow #Both CPU and GPU support

Import convention

The community-accepted import convention for TensorFlow is:

import tensorflow as tf
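
The sketch below touches both levels: tensors and a matrix multiplication at the low level, and a tiny Keras model at the high level. It assumes TensorFlow 2.x, and the layer sizes are arbitrary values chosen only to keep the example small.

```python
import tensorflow as tf

# Low-level API: create tensors and apply operations to them directly.
a = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])
b = tf.constant([[1.0],
                 [1.0]])
c = tf.matmul(a, b)  # matrix product, shape (2, 1)
print(c.numpy())

# High-level API (Keras): describe a model as a stack of layers.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Calling the model on a tensor builds its weights and returns a tensor.
prediction = model(a)
print(prediction.shape)
```

The model here is untrained, so the predicted values are meaningless; the point is only the shape of the workflow.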

In what order should we learn these libraries?

As in the above order! Begin with NumPy basics and array creation methods, then practice NumPy array indexing and slicing. Once you are familiar with them, go to the Pandas basics: creating Series and DataFrames. From here, you can learn NumPy and Pandas in parallel. At this stage, get familiar with NumPy arithmetic and linear algebra operations as well as more advanced Pandas topics such as subsetting, data cleaning, variable encoding, and data transforming.

Then, move on to Matplotlib and Seaborn, beginning with Matplotlib. My recommendation is that you use Pandas plotting functions for basic visualizations. If you need more customization, go for Matplotlib. For advanced visualizations, use Seaborn.

Now, it is time to learn Scikit-learn, which will be easy because you are already familiar with NumPy and Pandas. When you’re familiar with Scikit-learn, you can use Yellowbrick for machine learning visualizations. Then go for TensorFlow and the deep learning part. By learning in this way, you’ll never give up in the middle of the learning journey!


Have these methods worked for you? Let me know in the comment section.

Thanks for reading!

Until next time, happy learning to everyone!

Special credit goes to Chetan Kolte on Unsplash, who provided the nice cover image for this post.

Rukshan Pramoditha 2021–08–04

