
Welcome to another article series! This time, we are discussing XGBoost (Extreme Gradient Boosting) – one of the most popular and most preferred machine learning algorithms among data scientists in the 21st century. Some people even call XGBoost a money-making algorithm, because it often outperforms other algorithms, delivers excellent scores and has helped its users win cash prizes in data science competitions.
The topic we are discussing is broad and important, so we will cover it through a series of articles. It is like a journey, maybe a long journey for newcomers, and we will cover the entire topic step by step. Here, each subtopic is called a milestone. When you complete the journey through all the milestones, you will have good knowledge and hands-on experience (implementing the algorithm effectively using R and Python) in the following areas. When implementing the algorithm, the default programming language is Python. I will occasionally use R where it is important.
- Milestone 1: Setting up the background
- Milestone 2: Classification with XGBoost
- Milestone 3: Regression with XGBoost
- Milestone 4: Evaluating your XGBoost model through cross-validation
- Milestone 5: XGBoost’s hyperparameters tuning
- Milestone 6: Get your data ready for XGBoost
- Milestone 7: Building a pipeline with XGBoost
Prerequisites
I assume that you are already familiar with popular Python libraries such as numpy, pandas and scikit-learn, and with using Jupyter Notebook and RStudio. It is recommended to have a good understanding of machine learning techniques such as cross-validation and machine learning pipelines, and of algorithms such as decision trees and random forests. You can also refresh your memory by reading the following content previously written by me.
- Train a regression model using a decision tree
- Random forests – An ensemble of decision trees
- k-fold cross-validation explained in plain English
- Polynomial Regression with a Machine Learning Pipeline
Setting up the coding environment
It is necessary to set up your local machine before doing any machine learning task. For Python and R users, the simplest way to get Python, R and other data science libraries is to install them through Anaconda, the most preferred distribution of Python and R for data science. It includes everything you need: hundreds of packages, IDEs, a package manager, a navigator and much more. It also makes it easy to install new libraries: all you need to do is run the relevant command in the Anaconda terminal. To get started with Anaconda:
- Go to https://www.anaconda.com/products/individual
- Click on the relevant download option
- After downloading the setup file, double click on it and follow the on-screen instructions to install Anaconda on your local machine

After installing, you will find the Anaconda icon on your desktop. Double click on it to launch the Anaconda Navigator. Most of the frequently used packages such as numpy, pandas and scikit-learn already come with Anaconda, so you do not need to install them separately. However, the XGBoost package doesn't come built in; we need to install it manually. If you are using the Windows operating system:
- Go to https://anaconda.org/anaconda/py-xgboost
- Copy conda install -c anaconda py-xgboost
- Open the Anaconda Navigator
- Click on the Environments tab and then click on the arrow next to base (root)
- From the dropdown menu, select Open Terminal.
- A new window should appear now. Paste conda install -c anaconda py-xgboost and hit Enter.
- Follow the instructions to complete the installation

Now launch the Jupyter Notebook through Anaconda Navigator. From the Jupyter homepage, open a Python 3 notebook and run
import xgboost as xgb
If there is no error, you have successfully installed the XGBoost package for Python. Now you're all set to use the XGBoost package in Python.
Note: If you’re using MacOS or Linux,
- Go to https://anaconda.org/conda-forge/xgboost
- Copy conda install -c conda-forge xgboost
- Repeat the same steps above
You can also use XGBoost with the R programming language in Jupyter Notebook. To install and run R in a Jupyter Notebook:
- Start Anaconda Navigator
- To install the R language and the r-essentials packages, select the Environments tab and click Create to create a new environment.
- Name the environment "R". Next to Packages, select the Python version (at this time, Python 3.8) and R, then select r from the dropdown menu. Click Create.
- Wait till the installation completes. Click on the arrow next to R
- From the dropdown menu, select Open Terminal.
- A new window should appear now. Run conda install -c r r-xgboost
- Follow the instructions to complete the installation
- After installation, click again on the arrow next to R. Select the Open with Jupyter Notebook option.
- To create a new notebook for the R language, in the Jupyter Notebook menu, select New, then select R.
- Run library("xgboost") in the new notebook.
- If there is no error, you have successfully installed the XGBoost package for R. Now you’re all set to use the XGBoost package with R within Jupyter Notebook.
To install XGBoost in RStudio:
- Launch RStudio.
- Go to the Tools tab and then Install Packages.
- In the new window, type xgboost in the Packages field.
- Click Install.

Note: Alternatively, you can run install.packages("xgboost") in the R console to install XGBoost in RStudio.
Background knowledge needed for XGBoost
Decision trees
Decision Trees are a non-parametric supervised learning method, capable of finding complex nonlinear relationships in the data. They can perform both classification and regression tasks. Decision trees are the foundation of XGBoost models.
XGBoost is an ensemble (group) method that combines multiple machine learning models. The individual models that make up the ensemble in XGBoost are called base learners. The most commonly used XGBoost base learners are decision trees. That's why knowledge of decision trees is so important when learning XGBoost.
Find out more about decision trees by reading the following resources.
- Train a regression model using a decision tree
- Decision tree classifier in scikit-learn
- Decision tree regressor in scikit-learn
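To refresh the idea, here is a minimal sketch of training a decision tree classifier with scikit-learn on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset (150 samples, 3 classes)
X, y = load_iris(return_X_y=True)

# A shallow tree: max_depth limits how many splits can be stacked
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))
```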
Random forests
The Random Forest is one of the most powerful machine learning algorithms available today. It is a supervised machine learning algorithm that can be used for both classification (predicts a discrete-valued output, i.e. a class) and regression (predicts a continuous-valued output) tasks.
Like XGBoost, random forests are ensembles (groups) of decision trees. In a random forest, the decision trees are combined via bagging (bootstrap aggregating). You can learn more about bagging and random forests by reading my article Random forests – An ensemble of decision trees, listed above.
It is also recommended to take a look at the Scikit-learn official documentation of the random forests algorithms.
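For reference, here is a minimal random forest sketch with scikit-learn (a held-out split is used so the score reflects generalization rather than memorization):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 bagged decision trees, each trained on a bootstrap sample
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```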
Weak learners vs strong learners
A weak learner is a machine learning algorithm that performs only slightly better than chance. For example, a decision tree whose predictions are only slightly better than 50% accurate can be considered a weak learner.
A strong learner, by contrast, is a machine learning algorithm that has learned much from the data and performs quite well.
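A classic weak learner is a decision stump: a tree with only one split. The sketch below (assuming scikit-learn, with a synthetic dataset) estimates a stump's cross-validated accuracy, which sits well above the 50% chance baseline but far from perfect:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# A depth-1 tree ("stump") can only use a single feature split,
# so it typically scores modestly above the 50% chance baseline
stump = DecisionTreeClassifier(max_depth=1, random_state=1)
stump_score = cross_val_score(stump, X, y, cv=5).mean()
print("Stump accuracy:", stump_score)
```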
Bagging vs boosting
If you have read my previous article, Random forests – An ensemble of decision trees, you are already familiar with bagging. In a random forest, the trees are combined via bagging.
XGBoost combines trees via boosting – a method in which each new tree learns from the errors made by the previous trees. Boosting converts weak learners into a strong learner through hundreds of iterations.
Summary
In this article, we have just started the journey through XGBoost. There are many milestones to complete. We have also made an outline for the entire journey.
As we progress, you will learn how XGBoost works behind the scenes. You will also get hands-on experience in implementing the XGBoost algorithm in the next milestones. So, reading the next articles is very important. I will publish them as soon as possible. Stay tuned for updates!
Thanks for reading!
This tutorial was designed and created by Rukshan Pramoditha, the Author of Data Science 365 Blog.
Read my other articles at https://rukshanpramoditha.medium.com
2021–03–02