Knowledge of statistics is important if you work with data. A firm grasp of some fundamental concepts goes a long way toward communicating effectively, and it helps you choose the proper methods to collect data, analyze it, make decisions, and present the results you’ve discovered.
In this article, we are going to be using the Breast Cancer Wisconsin dataset from sklearn to cover some fundamental statistics concepts. It’s a classification dataset with 569 observations and 30 features.
Below we’ve imported the necessary frameworks and loaded our data into memory.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
The Basics
Once a dataset has been built, one of the first things on your mind should be to inspect it. The idea is to check whether the data contains what you expect, and this process will also surface insights you can carry forward into data preprocessing.
The first few things to understand are the facts about the data. That means you want to have a clear idea of:
- Data dimensions – Knowing how much data you have available is important. The amount of data you have may be the deciding factor on which machine learning algorithm to use, or on whether you remove/add certain features.
- Data types – It’s important to know the type of data you have available, as this will influence the statistical tests you conduct. Also, most machine learning models can only deal with numbers, so it’s good to know beforehand what data types are present so they can be converted to a suitable format for ML algorithms.
- Missing data – This comes back to knowing the data you’ve got available. Data can be missing for a variety of reasons but for most machine learning models, it must be dealt with (i.e. imputation or deletion) beforehand.
Let’s take a look at our data and its dimensions:
print(f"feature dimensions: {X.shape}n
target dimension: {y.shape}")
X.head()

Pandas uses None or NaN to represent missing values. We can check for missing values and the data type of our features by calling the info() method on a pandas dataframe.
X.info()

The Non-Null Count column shows us there are no missing values in our dataset: we have 569 observations and 569 non-null values for every feature. Dtype stands for the data type. This column tells us the data type of each feature, which we can see is float64 in every case.
Descriptive statistics
There are two types of statistics: descriptive and inferential.
Inferential statistics refers to making inferences from data, such as taking a sample to make a generalization about a population. Descriptive statistics, on the other hand, is used to better describe data.
When becoming acquainted with some data, it’s important to leverage descriptive statistics, as they describe the sample you actually have and lay the groundwork for any inference you do later.
Some descriptive statistical terms to be aware of are:
Count
The count is as simple as it sounds: the number of items or observations you have. Knowing the count is vital; if you wish to properly evaluate your results, you must have a good sense of how many observations are present in your data.
Mean
The mean is the average of the numbers: calculating it is as simple as working out the sum of the observations and dividing it by the number of observations.
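In symbols, for $n$ observations $x_1, \dots, x_n$:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$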

It’s important because its calculation takes every observation in the population into account. But for this same reason it can be a poor measure of central tendency: outliers in a skewed distribution will pull the mean away from the center.
Standard Deviation
The standard deviation measures the amount of variation of a set of values. It’s calculated by taking the square root of the sum of squared differences from the mean divided by the size of the data.
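In symbols, with mean $\bar{x}$ over $n$ values:

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Note that pandas’ std() (and therefore describe()) divides by $n - 1$ rather than $n$ by default, giving the sample standard deviation.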

We typically use the standard deviation in conjunction with the mean to summarize continuous data: for roughly normally distributed data, about 95% of all values fall within two standard deviations of the mean. In a similar fashion to the mean, when the data is significantly skewed or outliers are present, the standard deviation is impacted.
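As a quick sanity check, we can measure this on one of our features (a minimal sketch assuming the imports and data loading from earlier; mean radius is one of this dataset’s columns):

col = X["mean radius"]
# Fraction of values within two standard deviations of the mean
within = col.between(col.mean() - 2 * col.std(), col.mean() + 2 * col.std())
print(f"within 2 std: {within.mean():.1%}")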
Minimum
The minimum refers to the smallest value in the data set.
25th Percentile
The 25th percentile (also called the lower quartile or first quartile) is the value below which 25% of the values fall when the data is arranged in increasing order.
Median (50th Percentile)
The median is the middle number of a set of values. It is the value separating the higher half from the lower half. Its purpose is to inform us of the center value of a dataset and is more useful than the mean when the data is skewed or outliers are present.
75th Percentile
The 75th percentile (also called the upper quartile or third quartile) is the value below which 75% of the values fall when the data is arranged in increasing order.
Maximum
The maximum refers to the largest value in the data set.
Let’s see how we can find this information in our dataset:
X.describe()

You may also see the interquartile range (IQR) being discussed in certain circles. It refers to the difference between the 75th percentile (Q3) and the 25th percentile (Q1).
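We can compute it directly with pandas’ quantile() method, a minimal sketch:

q1 = X.quantile(0.25)  # 25th percentile of each feature
q3 = X.quantile(0.75)  # 75th percentile of each feature
iqr = q3 - q1
print(iqr.head())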
Univariate analysis
Univariate analysis is possibly the simplest form of statistical analysis. It consists of exploring the individual features of a dataset independently of one another.
Good information to derive from this type of analysis includes things such as the range of values, the central tendency, the skewness, and the distribution of a group. We’ve already covered how we can analyze the range of values (i.e. minimum, maximum, percentiles, and IQR) but we haven’t spoken much about skew and distribution.
Let’s start by taking a look at the distribution of the class then we will talk about why distributions are important.
counts = y.value_counts()
plt.subplots(figsize=(8, 5))
sns.barplot(x=counts.index, y=counts.values)
plt.title("Distribution of target label")
plt.xlabel("Target classes")
plt.ylabel("Counts")
plt.show()

Earlier we stated that it’s important to understand the data types we have available in our data since this influences our choice of statistical tests: we need to know the distribution of our data for the same reason.
For a classification problem, we need to know how balanced the classes are. A problem with highly imbalanced classes may require some manual handling to help our machine learning model predict the minority class better.
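To quantify the balance, a quick sketch using value_counts() with normalize=True to get proportions instead of raw counts:

# Class proportions rather than raw counts
print(y.value_counts(normalize=True))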
Let’s also take a look at the distribution of our features.
X.hist(figsize=(18, 16))
plt.show()

Histograms are a great way of representing the distributions of numerical data. From the image above we can read off the skewness of each feature, which measures the asymmetry of its probability distribution about the mean. We can also learn the modality of a feature, which is the number of peaks.
Understanding the shape of our data matters because it tells us where most of the information lies. It also matters because certain models make assumptions about the distribution of the data, and when our data aligns with those assumptions, the models tend to perform better.
Another way we could have learned the skew of our data is as follows:
X.skew()

A skewness value of 0 means that the data is perfectly symmetrical. If the skewness is greater than 1 or less than -1 then we have highly skewed data. Values between -1 and -0.5, or between 0.5 and 1, indicate moderately skewed features.
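One common way to reduce strong positive skew is a log transform. Here’s a minimal sketch using numpy’s log1p; area error is picked purely for illustration as one of the more heavily skewed features:

import numpy as np

# Compare skewness before and after a log transform
print(X["area error"].skew())
print(np.log1p(X["area error"]).skew())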
Bivariate analysis
Bivariate analysis is possibly the second simplest form of statistical analysis. It refers to analyzing two variables to determine the relationships between them. If there is a relationship between two variables, it will often show up as a correlation.
A heatmap is one way to visualize the relationships within a dataset.
plt.subplots(figsize=(18, 16))
sns.heatmap(X.corr(), annot=True)
plt.show()

With a deeper understanding of the relationships in our data, we can draw more informative insights and judge to what extent knowing the value of one variable makes it easier to predict the value of another.
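As a sketch, here is one way to pull the most strongly related pairs out of the correlation matrix; the 0.9 threshold is an arbitrary choice for illustration:

import numpy as np

corr = X.corr().abs()
# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.9])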
This list is by no means comprehensive, but it’s a good foundation to get started with. Other concepts you may want to learn about once you’ve grounded yourself in these ones include understanding kurtosis, hypothesis testing, and confidence intervals.
Thanks for reading.