
When I was first learning how to code, I would practice my data skills on different data sets to create mini Jupyter Notebook reference guides. Since I have found this to be an incredibly useful tool, I thought I’d start sharing code walk-throughs. Hopefully this is a positive resource for those learning to use Python for data science, and something that can be referenced in future projects. Full code is available on GitHub.
Getting Set Up
The first step is to import the preloaded data sets from the scikit-learn Python library. More info on the "toy" data sets included in the package can be found here. The data description will also give more information on the features, statistics, and sources.
from sklearn.datasets import load_iris
#save data information as variable
iris = load_iris()
#view data description and information
print(iris.DESCR)

The data is returned as a dictionary-like object whose keys include "data" and "target", each paired with a NumPy array as the value (other keys, such as "feature_names" and "target_names", hold the labels). Printing the object gives output like this:
{'data': array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3] ...
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ...}
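For a quick look at what the loaded object contains, here is a minimal sketch that lists its keys and the shapes of the main arrays. It only uses attributes that load_iris returns; nothing here changes the data.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The returned object is dictionary-like, so its keys can be listed
print(list(iris.keys()))

# 150 flowers, each with 4 feature measurements
print(iris.data.shape)      # (150, 4)
print(iris.target.shape)    # (150,)

# Names that go with the feature columns and the numeric class labels
print(iris.feature_names)
print(iris.target_names)
```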
Putting Data into a Data Frame
Feature Data
To view the data more easily we can put this information into a data frame by using the Pandas library. Let’s create a data frame to store the data information about the flowers’ features first.
import pandas as pd
#make sure to save the data frame to a variable
data = pd.DataFrame(iris.data)
data.head()

Using data.head() will automatically output the first 5 rows of data, but we can also specify how many rows we want inside the parentheses, for example data.head(10).
Now, we have a data frame with the iris data, but the columns are not clearly labeled. Looking at the data description we printed above, or referencing the source code tells us more about the features. In the documentation the data features are listed as:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
Let’s rename the columns so the features are clear.
data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
#note: it is common practice to use underscores between words, and avoid spaces
data.head()
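As an alternative to typing the column names by hand, they could be derived from iris.feature_names. This is only a sketch: the name data_alt is made up for illustration, and the string clean-up assumes the feature names end in " (cm)" as they do in current scikit-learn releases.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# feature_names look like 'sepal length (cm)'; drop the unit and use underscores
clean_names = [name.replace(' (cm)', '').replace(' ', '_')
               for name in iris.feature_names]

# data_alt is a hypothetical name, used so the original data frame is untouched
data_alt = pd.DataFrame(iris.data, columns=clean_names)
print(data_alt.columns.tolist())
```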

Target Data
Now that the data related to the features is neatly in a data frame, we can do the same with the target data.
#put target data into data frame
target = pd.DataFrame(iris.target)
#Let's rename the column so that we know that these values refer to the target values
target = target.rename(columns = {0: 'target'})
target.head()
The target data frame is only one column, and it gives a list of the values 0, 1, and 2. We will use the information from the feature data to predict if a flower belongs in group 0, 1, or 2. But what do these numbers refer to?
- 0 is Iris Setosa
- 1 is Iris Versicolour
- 2 is Iris Virginica
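The mapping from numbers to species can also be read from iris.target_names rather than memorized. A small sketch, using a separate data frame (target_named, a name made up here) so the original target frame stays as it is:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# target_named is a hypothetical helper frame, not used elsewhere in the walk-through
target_named = pd.DataFrame(iris.target, columns=['target'])

# look up each numeric label in the array of species names
target_named['species'] = target_named['target'].map(lambda i: iris.target_names[i])
print(target_named['species'].value_counts())
```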
Exploratory Data Analysis (EDA)
To help us understand our data better, let’s first combine the two data frames we just created. By doing this we can see the features and class determination of the flowers together.
df = pd.concat([data, target], axis = 1)
#note: it is common practice to name your data frame as "df", but you can name it anything as long as you are clear and consistent
#in the code above, axis = 1 tells the data frame to add the target data frame as another column of the data data frame, axis = 0 would add the values as another row on the bottom
df.head()

Data Cleaning
It’s super important to look through your data, make sure it is clean, and begin to explore relationships between features and target variables. Since this is a relatively simple data set there is not much cleaning that needs to be done, but let’s walk through the steps.
- Look at Data Types
df.dtypes

- float = number with decimals (1.678)
- int = integer, or whole number without decimals (1, 2, 3)
- obj = object, usually a string or words (‘hello’)
The 64 after these data types refers to how many bits of storage the value occupies. You will often see 32 or 64.
In this data set, the data types are all ready for modeling. In some instances the number values will be coded as objects, so we would have to change the data types before performing statistical modeling.
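If numbers do arrive as text, one common way to convert them is pd.to_numeric. The tiny data frame below is invented for illustration; the iris data does not need this step.

```python
import pandas as pd

# Hypothetical example: measurements that were read in as strings
messy = pd.DataFrame({'petal_length': ['1.4', '1.3', '4.7']})
print(messy.dtypes)   # petal_length is stored as text, not as a number

# errors='coerce' turns anything unparseable into NaN instead of raising an error
messy['petal_length'] = pd.to_numeric(messy['petal_length'], errors='coerce')
print(messy.dtypes)   # petal_length is now float64
```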
- Check for Missing Values
df.isnull().sum()
This data set is not missing any values. While this makes modeling much easier, it is rarely the case; real-world data is usually messy. If there were missing values, you could delete the rows that contain them, or fill each gap in one of several ways (with the column’s mean, the previous value, and so on).
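The options mentioned above can be sketched on a small invented data frame (toy is a made-up name, and the values are not from the iris data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing measurement
toy = pd.DataFrame({'sepal_width': [3.5, np.nan, 3.2, 3.1]})

dropped = toy.dropna()                 # option 1: remove rows with missing values
mean_filled = toy.fillna(toy.mean())   # option 2: fill with the column mean
forward_filled = toy.ffill()           # option 3: carry the previous value forward

print(len(dropped))                            # 3 rows remain
print(mean_filled['sepal_width'].tolist())
print(forward_filled['sepal_width'].tolist())  # [3.5, 3.5, 3.2, 3.1]
```

Which option is appropriate depends on the data; dropping rows is simplest but throws information away.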
- Statistical Overview
df.describe()

This allows us to get a quick overview of the data. We can check for outliers by looking at the min and max values of each column in relation to the mean. Spend a bit of time looking through this chart to begin understanding the spread of the data.
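Beyond eyeballing the min and max, one common rule of thumb flags values that lie more than 1.5 times the interquartile range (IQR) outside the middle 50% of a column. This is a sketch of that rule, not a step from the original notebook; the names features and is_outlier are made up here.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(
    iris.data,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])

q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1

# True where a value is more than 1.5 * IQR below Q1 or above Q3 of its column
is_outlier = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)

# number of flagged values per column
print(is_outlier.sum())
```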
Visualizing
The next step in the EDA process is to start visualizing some relationships.
Correlations
The Seaborn library has a great heat map visual for mapping the correlations between features. The closer a value is to 1 or -1, the stronger the correlation between the two elements. A high positive correlation indicates that the two elements have a positive linear relationship (as one increases the other also increases), and a strongly negative correlation indicates a negative linear relationship (as one increases the other decreases).
import seaborn as sns
sns.heatmap(df.corr(), annot = True);
#annot = True adds the numbers onto the squares

Petal length and petal width are most correlated with the target, meaning that as these numbers increase, so does the target value. In this case, it means that flowers in class 2 often have longer and wider petals than flowers in class 0. Sepal width is the most anti-correlated, indicating that flowers in class 0 tend to have greater sepal width than those in class 2. We can also see some intercorrelation between features; for example, petal width and petal length are highly correlated with each other. This information is not necessarily the best way to analyze the data, but it allows us to start seeing these relationships.
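The same numbers can be read without the plot by pulling out the target column of the correlation matrix and sorting it. A minimal sketch that rebuilds the combined data frame under a made-up name (df_corr) so it runs on its own:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_corr = pd.DataFrame(
    iris.data,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df_corr['target'] = iris.target

# correlation of each feature with the target, strongest positive first
target_corr = df_corr.corr()['target'].drop('target').sort_values(ascending=False)
print(target_corr)
```

Keep in mind that the target is a class label coded as 0, 1, 2, so these correlations are only a rough guide.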
Scatter plot
To start looking at the relationships between features, we can create scatter plots to further visualize the way the different classes of flowers relate to sepal and petal data.
import matplotlib.pyplot as plt
# The indices of the features that we are plotting (features 0 & 1: sepal length and sepal width)
x_index = 0
y_index = 1
# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.figure(figsize=(5, 4))
plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index])
plt.tight_layout()
plt.show()

Now let’s create the same scatter plot to compare the petal data points.
x_index = 2
y_index = 3
# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.figure(figsize=(5, 4))
plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index])
plt.tight_layout()
plt.show()
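Since the two scatter plots differ only in which feature indices they use, the plotting code could be wrapped in a small function. This is one possible refactoring, not the original notebook's code; plot_feature_pair is a made-up name.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from sklearn.datasets import load_iris

iris = load_iris()

def plot_feature_pair(x_index, y_index):
    """Scatter two iris features against each other, colored by class."""
    # label the colorbar ticks with the species names
    formatter = FuncFormatter(lambda i, *args: iris.target_names[int(i)])
    fig, ax = plt.subplots(figsize=(5, 4))
    points = ax.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
    fig.colorbar(points, ax=ax, ticks=[0, 1, 2], format=formatter)
    ax.set_xlabel(iris.feature_names[x_index])
    ax.set_ylabel(iris.feature_names[y_index])
    fig.tight_layout()
    return ax

sepal_ax = plot_feature_pair(0, 1)   # sepal length vs. sepal width
petal_ax = plot_feature_pair(2, 3)   # petal length vs. petal width
```

In a Jupyter Notebook the figures render when the cell runs; in a plain script, a call to plt.show() at the end would display them.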

Modeling
Now that we have cleaned and explored the data, we can begin to develop a model. Our goal is to create a Logistic Regression Classification model that will predict which class the flower is based on petal and sepal sizes.
#divide our data into predictors (X) and target values (y)
X = df.copy()
y = X.pop('target')
Train Test Split
Once we separate the features from the target, we can create a train set and a test set. As the names suggest, we will train our model on the train set, and test the model on the test set. We will randomly select 80% of the data for training and keep the remaining 20% for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify = y)
#by stratifying on y we ensure that each class is represented in the train and test sets in the same proportion as in the full data (this prevents, for example, all of class 1 from ending up in the test set only)
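To see what stratifying does, the class counts in each split can be checked directly. A self-contained sketch with the same split settings; the _demo and short variable names are made up so they do not clash with the walk-through's variables.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_demo = pd.DataFrame(iris.data)
y_demo = pd.Series(iris.target, name='target')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

# with 50 flowers per class and a stratified 80/20 split,
# each class contributes 40 flowers to train and 10 to test
print(y_tr.value_counts().sort_index())
print(y_te.value_counts().sort_index())
```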
Standardize
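With the X values split between training and test, now we can standardize the values. This rescales each feature to a mean of 0 and a standard deviation of 1, which puts the features on a consistent scale while keeping the relative differences between values within each feature. Note that the scaler is fit on the training set only, and the same transformation is then applied to the test set.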
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
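What StandardScaler computes can be checked by hand on a tiny invented array: each value has the training mean subtracted and is divided by the training standard deviation. The arrays and the name scaler_demo are made up for this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

scaler_demo = StandardScaler().fit(train)

# manual version of the same calculation: z = (x - mean) / std
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(scaler_demo.transform(train), manual))   # True

# the test value is scaled with the TRAIN mean and std, not its own
test_scaled = scaler_demo.transform(test)
print(test_scaled)
```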
Baseline Prediction
The baseline is the accuracy we could expect from predicting a class before any model is implemented. If the data is split evenly into 2 classes, there is already a 50% chance of randomly assigning an element to the correct class. The goal of our model is to improve on this baseline, or random prediction. Also, if there is a strong class imbalance (if 90% of the data was in class 1), then we could rebalance the proportion of each class to help the model predict more accurately.
df.target.value_counts(normalize= True)
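One way to turn the baseline into a number that is directly comparable with a model score is scikit-learn's DummyClassifier, which ignores the features entirely. This is an optional sketch, not part of the original notebook; baseline_model is a made-up name.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier

iris = load_iris()

# strategy='most_frequent' always predicts the most common class it saw in fit
baseline_model = DummyClassifier(strategy='most_frequent')
baseline_model.fit(iris.data, iris.target)

baseline_accuracy = baseline_model.score(iris.data, iris.target)
print(round(baseline_accuracy, 3))   # 0.333, since each class is a third of the data
```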

Logistic Regression Model
from sklearn.linear_model import LogisticRegression
#create the model instance
model = LogisticRegression()
#fit the model on the training data
model.fit(X_train, y_train)
#the score, or accuracy of the model
model.score(X_test, y_test)
# Output = 0.9666666666666667
#the test score is already very high, but we can use the cross validated score to ensure the model's strength
from sklearn.model_selection import cross_val_score
import numpy as np
scores = cross_val_score(model, X_train, y_train, cv=10)
print(np.mean(scores))
# Output = 0.9499999999999998
Without any adjustments or tuning, this model is already performing very well, with a test score of 0.9667 and a cross validation score of 0.95. This means that the model is predicting the correct class for the flower about 95% of the time. Much higher than the baseline of 33%!
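A variation worth knowing: wrapping the scaler and the model in a scikit-learn pipeline makes cross validation refit the scaler inside each fold, so no fold's validation rows influence the scaling. This is a sketch of that alternative, not the approach used above, and its score is not guaranteed to match the numbers quoted here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=1, stratify=y_all)

# the pipeline scales and then fits the model, separately within each CV fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(pipeline, X_tr, y_tr, cv=10)
print(cv_scores.mean())
```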
Understanding the Predictions
Normally there will be lots of fine tuning and experimentation with parameters to find the model that performs with the highest scores. However, since this data set was straightforward, we can move on for now and start looking at how the model made its predictions.
Coefficients
df_coef = pd.DataFrame(model.coef_, columns=X_train.columns)
df_coef

Coefficients are often a bit hard to interpret in Logistic Regression, but we can get an idea of how much of an impact each of the features had in deciding whether a flower belonged to a given class. Each row of the table corresponds to one class (0, 1, and 2) and each column to one feature. For instance, petal length was barely a deciding factor in whether a flower was in class 1, but petal width was a strong predictor for class 2.
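To make the coefficient table easier to read, the rows can be labeled with species names via model.classes_. The sketch below fits a separate model (demo_model, a made-up name) on the full scaled data purely for brevity, so its values will differ from the table above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# assumption: fit on all rows just to illustrate the labeling
demo_model = LogisticRegression().fit(X_scaled, iris.target)

# one row of coefficients per class, one column per feature
coef_table = pd.DataFrame(
    demo_model.coef_,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    index=iris.target_names[demo_model.classes_])
print(coef_table.round(2))
```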
Predicted Values
We can also compare the values that our model predicted with the actual values.
predictions = model.predict(X_test)
#compare predicted values with the actual scores
compare_df = pd.DataFrame({'actual': y_test, 'predicted': predictions})
compare_df = compare_df.reset_index(drop = True)
compare_df

Confusion Matrix
To look more closely at the predictions that the model made, we can use the confusion matrix. In the confusion matrix, the predicted values are the columns and the actual are the rows. It allows us to see where the model makes true and false predictions, and if it predicts incorrectly, we can see which class it is predicting falsely.
from sklearn.metrics import confusion_matrix
pd.DataFrame(confusion_matrix(y_test, predictions, labels=[2, 1, 0]),index=[2, 1, 0], columns=[2, 1, 0])
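How the matrix is laid out is easiest to see on a handful of invented labels. In the sketch below (hypothetical values, default label order 0, 1, 2), one class 2 flower is predicted as class 1, and accuracy is recovered as the diagonal sum divided by the total.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: the last flower is class 2 but is predicted as class 1
actual_demo = [0, 0, 1, 1, 2, 2, 2]
predicted_demo = [0, 0, 1, 1, 2, 2, 1]

cm = confusion_matrix(actual_demo, predicted_demo)
print(cm)
# rows = actual class, columns = predicted class:
# [[2 0 0]
#  [0 2 0]
#  [0 1 2]]

# correct predictions sit on the diagonal
accuracy_demo = np.trace(cm) / cm.sum()
print(round(accuracy_demo, 3))   # 0.857 (6 of 7 correct)
```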

Classification Report
Another good way to check how your model is performing is by looking at the classification report. It shows the precision, recall, F1, and accuracy scores; below is a very brief explanation of these metrics.
- Precision: Number of correctly predicted Iris Virginica flowers (9) out of the total number of flowers predicted to be Iris Virginica (9). Precision in predicting Iris Virginica = 9/9 = 1.0
- Recall: Number of correctly predicted Iris Virginica out of the number of actual Iris Virginica. Recall = 9/10 = .9
- F1 Score: This is the harmonic mean of precision and recall. The formula is F1 Score = 2 × (precision × recall) / (precision + recall)
- Accuracy: Add all the correct predictions together for all classes and divide by the total number of predictions. 29 correct predictions /30 total values = accuracy of .9667.
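Plugging the Iris Virginica precision and recall quoted above into the F1 formula gives a quick worked example. This is just the arithmetic, not output from the classification report.

```python
precision = 1.0    # Iris Virginica precision quoted above
recall = 9 / 10    # Iris Virginica recall quoted above

# harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 4))         # 0.9474

# overall accuracy: correct predictions over all predictions
accuracy = 29 / 30
print(round(accuracy, 4))   # 0.9667
```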
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

Predicted Probabilities
Using the code below we can look at the probabilities of each row of data being assigned to one of the three classes. By default, the model will assign the item to the class with the highest probability. If we wanted to adjust the accuracy or precision, we could do this by changing the threshold of how high the predicted probability would have to be before it was assigned to that class.
In this case, there is no real consequence to incorrectly assigning a flower to another class, but models used to detect cancer cells are often adjusted to ‘assume the worst’ and label a cell as cancerous more readily. This approach is used in many cases where it is better to be overcautious than to mislabel the cell as safe and healthy.
probs = model.predict_proba(X_test)
#put the probabilities into a dataframe for easier viewing
Y_pp = pd.DataFrame(probs,
columns=['class_0_pp', 'class_1_pp', 'class_2_pp'])
Y_pp.head()
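Here is a rough sketch of what changing the threshold could look like. The probabilities are invented, and the rule (assign class 2 whenever its probability reaches 0.35) is just one hypothetical way to make the model more eager to predict a particular class.

```python
import numpy as np

# Hypothetical predicted probabilities for three flowers (columns = class 0, 1, 2)
probs_demo = np.array([
    [0.90, 0.08, 0.02],
    [0.05, 0.55, 0.40],
    [0.02, 0.30, 0.68],
])

# default behaviour: pick the class with the highest probability
default_pred = probs_demo.argmax(axis=1)
print(default_pred)   # [0 1 2]

# custom rule: call class 2 whenever its probability is at least the threshold
threshold = 0.35
custom_pred = np.where(probs_demo[:, 2] >= threshold, 2, default_pred)
print(custom_pred)    # [0 2 2]
```

The second flower flips from class 1 to class 2 under the custom rule, which raises recall for class 2 at the cost of more false positives.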

Conclusion
Hopefully this walk-through helped to show some major steps in the process of a Data Science project. Of course this is not an exhaustive list of steps that could be taken with this data set, but it aims to carefully show some of the important steps of classification.
This is a classic data set because it is relatively straightforward, but the steps highlighted here can be applied to a classification project of any kind. Follow for more simple (and advanced) data set walk-throughs in the future!
Looking for the next step? Read about the basics of regression with a data science project predicting car sale prices.