How to use Scikit-Learn Datasets for Machine Learning

We’ll explore the breast cancer dataset and create a model that classifies tumors

Wafiq Syed
Towards Data Science


Scikit-Learn provides clean datasets for you to use when building ML models. And when I say clean, I mean the type of clean that’s ready to be used to train an ML model. The best part? The datasets come with the Scikit-Learn package itself. You don’t need to download anything. Within just a few lines of code, you’ll be working with the data.

Having ready-made datasets is a huge asset because you can get straight to creating models without spending time obtaining, cleaning, and transforming the data, which is where data scientists spend much of their time.

Even with all the groundwork complete, you might find using the Scikit-Learn datasets a bit confusing at first. Not to worry: in a few minutes you’re going to know exactly how to use the datasets and be well on your way to exploring the world of Artificial Intelligence. This article assumes you have Python, scikit-learn, pandas, and Jupyter Notebook installed (or you may use Google Colab). Let’s begin.

Intro to Scikit-Learn’s Datasets

Scikit-Learn provides seven datasets, which it calls toy datasets. Don’t be fooled by the word “toy”. These datasets are powerful and serve as a strong starting point for learning ML. Here are a few of the datasets and how ML can be used with them:

  • Boston House Prices: use ML to predict house prices based on attributes such as the number of rooms and the crime rate in the town (note that load_boston was removed in scikit-learn 1.2, so this one is only available in older versions)
  • Breast Cancer Wisconsin (diagnostic) dataset: use ML to diagnose cancer scans as benign (does not spread to the rest of the body) or malignant (spreads to the rest of the body)
  • Wine Recognition: use ML to identify the type of wine based on chemical features

In this article, we’ll be working with the “Breast Cancer Wisconsin” dataset. We will import the data and understand how to read it. As a bonus, we’ll build a simple ML model that is able to classify cancer scans as either malignant or benign. To read more about the datasets, see Scikit-Learn’s documentation.

How do I Import the Datasets?

The datasets can be found in sklearn.datasets. Let’s import the data. We first import datasets, which holds all seven datasets.

from sklearn import datasets

Each dataset has a corresponding function used to load the dataset. These functions follow the same format: “load_DATASET()”, where DATASET refers to the name of the dataset. For the breast cancer dataset, we use load_breast_cancer(). Similarly, for the wine dataset we would use load_wine(). Let’s load the dataset and store it into a variable called data.

data = datasets.load_breast_cancer()
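To see the naming pattern in action, here is a small sketch that loads two other toy datasets the same way. It assumes your scikit-learn installation ships load_wine and load_iris; the variable names wine and iris are only illustrative.

```python
from sklearn import datasets

# Each loader follows the load_DATASET() naming pattern
wine = datasets.load_wine()
iris = datasets.load_iris()

# The feature data is a NumPy array of shape (samples, features)
print(wine.data.shape)  # (178, 13)
print(iris.data.shape)  # (150, 4)
```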

So far, so good. These load functions (such as load_breast_cancer()) don’t return data in the tabular format we may expect. They return a Bunch object. Don’t know what a Bunch is? No worries.

Think of a Bunch object as Scikit-Learn’s fancy name for a dictionary


Let’s quickly refresh our memory on dictionaries. A dictionary is a type of data structure that stores data as keys and values. Think of a dictionary just like the dictionary book you’re used to. You search for words (keys), and get their definition (value). In programming, you can make the keys and values anything you choose (words, numbers, etc.). For example, to store a phonebook, the keys can be names, and the values can be phone numbers. So you see, a dictionary in Python isn’t just limited to the typical dictionary you’re familiar with, but can be applied to whatever you like.
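Here is a minimal sketch of the phonebook idea, followed by a quick check that a Bunch really does behave like a dictionary. The names and phone numbers are invented for illustration.

```python
from sklearn import datasets

# A plain Python dictionary: names are keys, phone numbers are values
# (the entries are made up for this example)
phonebook = {"Alice": "555-0100", "Bob": "555-0199"}
print(phonebook["Alice"])  # 555-0100

# A Bunch behaves like a dictionary and also allows attribute access
data = datasets.load_breast_cancer()
print(isinstance(data, dict))         # True
print(data["target"] is data.target)  # True
```

So data["target"] and data.target are two ways of reaching the same value.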

What’s in our Dictionary (Bunch)?

Scikit-Learn’s dictionary, or Bunch, is really powerful. Let’s begin exploring this dictionary by looking at its keys.

print(data.keys())

We get the following keys:

  • data is all the feature data (the attributes of the scan that help us identify if the tumor is malignant or benign, such as radius, area, etc.) in a NumPy array
  • target is the target data (the variable you want to predict, in this case whether the tumor is malignant or benign) in a NumPy array

These two keys hold the actual data. The remaining keys (below) serve a descriptive purpose. It’s important to note that all of Scikit-Learn’s datasets are divided into data and target. data represents the features, which are the variables that help the model learn how to predict. target includes the actual labels. In our case, the target data is a single column that classifies each tumor as either 0 (malignant) or 1 (benign).

  • feature_names are the names of the feature variables, in other words names of the columns in data
  • target_names are the names of the target classes, in other words what each label in target stands for (here, malignant and benign)
  • DESCR, short for DESCRIPTION, is a description of the dataset
  • filename is the path to the actual file of the data in CSV format.
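To make these keys concrete, the following sketch prints a few of them. It reloads the dataset so that it runs on its own; the .tolist() calls are only there to turn NumPy arrays into plain Python lists for printing.

```python
import numpy as np
from sklearn import datasets

data = datasets.load_breast_cancer()

# 569 samples, each described by 30 features
print(data.data.shape)    # (569, 30)
print(data.target.shape)  # (569,)

# Label 0 maps to the first class name, label 1 to the second
print(data.target_names.tolist())  # ['malignant', 'benign']

# How many samples carry each label (index 0, then index 1)
class_counts = np.bincount(data.target).tolist()
print(class_counts)  # [212, 357]

# The first three column names of the feature data
print(data.feature_names[:3].tolist())
```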

To look at a key’s value, you can type data.KEYNAME, where KEYNAME represents the key. So if we wanted to see the description of the dataset, we would run:

print(data.DESCR)

Here’s a preview of the output (the full description is too long to include):

Description of Scikit-Learn’s Breast Cancer dataset

You can also view the dataset info in Scikit-Learn’s documentation, which is neater and much easier to read.

Working with the Dataset

Now that we understand what the load function returns, let’s see how we can use the dataset in our ML model. Before anything, if you want to explore the dataset, you can use pandas to do so. Here’s how:

# Import pandas
import pandas as pd
# Read the DataFrame, first using the feature data
df = pd.DataFrame(data.data, columns=data.feature_names)
# Add a target column, and fill it with the target data
df['target'] = data.target
# Show the first five rows
df.head()

Cropped view of the dataframe (does not include all columns)
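As an alternative sketch, recent scikit-learn versions (0.23 and later, which is an assumption about your installation) can build the DataFrame for you through the as_frame parameter. The names bunch_with_frame and df_alt are only illustrative.

```python
from sklearn import datasets

# as_frame=True asks the loader to return pandas objects
bunch_with_frame = datasets.load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a 'target' column
df_alt = bunch_with_frame.frame
print(df_alt.shape)        # (569, 31)
print(df_alt.columns[-1])  # target
```

This should give the same table as the manual pandas steps above, with one less thing to type.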

You should be proud. You’ve loaded a dataset into a pandas DataFrame that’s ready to be explored and used. To really see the value of this dataset, run:

df.info()

df.info() output: notice that there are no missing values

There are a few things to observe:

  • There aren’t any missing values: all the columns have 569 non-null values. This saves us from having to handle missing values.
  • All the data types are numerical. This is important because Scikit-Learn models generally expect numerical input and do not accept raw categorical variables such as strings. In the real world, when we get categorical variables, we transform them into numerical variables. Scikit-Learn’s toy datasets are free of categorical variables.
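Both observations can be checked in code. The sketch below also shows one common way to turn a categorical column into numbers, using pandas get_dummies; the color column is made up for illustration and is not part of the breast cancer data.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Total number of missing values across all columns
missing_total = int(df.isna().sum().sum())
print(missing_total)  # 0

# Every feature column is stored as a 64-bit float
all_float = bool((df.dtypes == "float64").all())
print(all_float)  # True

# A hypothetical categorical column, one-hot encoded into numeric columns
toy = pd.DataFrame({"color": ["red", "blue", "red"]})
encoded = pd.get_dummies(toy, columns=["color"])
print(list(encoded.columns))  # ['color_blue', 'color_red']
```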

Hence, Scikit-Learn takes care of the data cleansing work. Their datasets are extremely valuable. You will benefit from learning ML by using them.

Let’s do some AI

Finally, the exciting part. Let’s build a model that classifies cancer tumors as malignant (spreading) or benign (non-spreading). This will show you how to use the data for your own models. We’ll build a simple K-Nearest Neighbors model.

First, let’s split the dataset into two parts: one for training the model (giving it data to learn from) and one for testing the model (seeing how well it performs on scans it hasn’t seen before).

# Store the feature data
X = data.data
# Store the target data
y = data.target
# Split the data using Scikit-Learn's train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

This gives us two datasets, one for training and one for testing. Let’s move on to training the model.
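If you want the split to be the same on every run, a variation like the following may help. The random_state and stratify arguments are additions that the walkthrough itself does not use, and the value 0 is an arbitrary choice.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer()
X, y = data.data, data.target

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# By default, 25% of the rows are held out for testing
print(X_train.shape)  # (426, 30)
print(X_test.shape)   # (143, 30)
```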

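from sklearn.neighbors import KNeighborsClassifier
# Create a K-Nearest Neighbors classifier that looks at the 6 closest points
knn = KNeighborsClassifier(n_neighbors=6)
# Train on the training set, then report accuracy on the test set
knn.fit(X_train, y_train)
knn.score(X_test, y_test)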

Did you get an output of around 0.91? This means the model is about 91% accurate! Your exact number may differ slightly, because train_test_split shuffles the data randomly on each run unless you set random_state. Isn’t that amazing? In just a few minutes you made a model that classifies cancer scans with roughly 91% accuracy. Now, of course, it’s more complicated than this in the real world, but you’re off to a great start. You will learn a lot by trying to build models using Scikit-Learn’s datasets. When in doubt, just Google any question you have. There’s a huge machine learning community, and it’s likely your question has been asked before. Happy AI learning!
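If you would like one next step to try, here is a hedged sketch of a small refinement that is not part of the walkthrough above. K-Nearest Neighbors relies on distances, so features measured on large scales can dominate; standardizing the features first is a common remedy. The pipeline, the random_state value, and the variable names are all assumptions made for illustration, and your accuracy will depend on the split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0, stratify=data.target
)

# Standardize each feature, then classify with the same 6 neighbors
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=6))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")

# Translate the numeric prediction for the first test row into a class name
first_label = model.predict(X_test[:1])[0]
predicted_name = str(data.target_names[first_label])
print(predicted_name)
```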
