Inside AI

Data analysis is a vital element of any machine learning workflow. The performance and accuracy of a machine learning model's predictions hinge on the data analysis and the appropriate data preprocessing that follows it. Every machine learning professional should be adept at data analysis.
In this article, I will discuss four very quick data visualisation techniques that can be achieved with a few lines of code and can help plan the data pre-processing required.
We will be using the Indian Liver Patient Dataset from OpenML to learn quick and efficient data visualisation techniques. This dataset contains a mix of categorical and numerical independent features, and the diagnosis result as liver or non-liver condition.
from sklearn.datasets import fetch_openml
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
I find it easier and more efficient to work with a Pandas DataFrame than with the default Bunch object. The parameter "as_frame=True" ensures that the data is returned as a pandas DataFrame, including columns with appropriate data types.
X, y = fetch_openml(name="ilpd", return_X_y=True, as_frame=True)
print(X.info())
As the independent data is in a Pandas DataFrame, we can view the number of features, the non-null count for each feature (and hence any missing values), and the data type of each feature with "info".

It immediately provides a lot of information about the independent variables with just one line of code.
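As a complement to "info", the missing values and data types can also be pulled out as plain values. The snippet below is a minimal sketch on a small made-up dataframe (the name "toy" and its values are hypothetical and only mimic the shape of X); the same calls should apply to X itself.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X: two numerical columns and one categorical column.
toy = pd.DataFrame(
    {
        "V1": [65.0, 62.0, np.nan, 58.0],
        "V2": pd.Categorical(["Female", "Male", "Male", "Male"]),
        "V3": [0.7, 10.9, 7.3, np.nan],
    }
)

# Missing values per column, the complement of the non-null counts that info() prints.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Split the column names by data type to separate numerical from categorical features.
numerical_columns = toy.select_dtypes(include="number").columns.tolist()
categorical_columns = toy.select_dtypes(include="category").columns.tolist()
print(numerical_columns, categorical_columns)
```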
Since we have several numerical features, it is prudent to understand the correlation among these variables. In the code below, we create a new dataframe X_Numerical without the categorical variable "V2".
X_Numerical = X.drop(["V2"], axis=1)
Some machine learning algorithms don't perform well with highly linearly related independent variables. Also, the objective is always to build a machine learning model with the minimum required variables/dimensions.
We can get the correlation among all the numerical features using the "corr" function in Pandas.
With the seaborn package, we can plot the heatmap of the correlation to get a very quick visual snapshot of the correlation among the independent variables.
relation = X_Numerical.corr(method='pearson')
sns.heatmap(relation, annot=True, cmap="Spectral")
plt.show()
At a glance, we can conclude that the independent variables V4 and V3 are closely related, and a few of the features, like V1 and V10, are weakly negatively correlated.
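When there are many features, reading pairs off a heatmap gets tedious, so it can help to list the strongly related pairs directly from the correlation matrix. The following is a rough sketch, not part of the original workflow: the helper name "strongly_correlated_pairs", the 0.8 threshold and the toy data are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def strongly_correlated_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = frame.corr(method="pearson")
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # The comparison also filters out any NaN entries left by the mask.
    strong = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in strong.items()]


# Hypothetical toy data: "b" is an exact multiple of "a", "c" is only weakly related to both.
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [2.0, 4.0, 6.0, 8.0, 10.0],
        "c": [3.0, 1.0, 4.0, 1.0, 5.0],
    }
)
print(strongly_correlated_pairs(toy, threshold=0.8))
```

Calling the helper on X_Numerical would give a candidate list of features to review for removal; the threshold is a judgment call and depends on the model being built.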

After learning the correlation among the features, it is useful to get a quick sense of the distribution of values for the numerical features.
Just like the correlation function, Pandas has a native function "hist" to get the distribution of the features.
X_Numerical.hist(color="Lightgreen")
plt.show()

By now, the power and versatility of pandas are clearly illustrated. We could get the relation among the numerical features and the distribution of the features with only two lines of code.
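A histogram shows the shape of a distribution, and the skewness puts a number on it, which can help decide whether a transformation is worth trying during pre-processing. This is a minimal sketch on a made-up column (the name "enzyme_level" and its values are hypothetical); the log transform is just one common option for right-skewed, non-negative data, not a recommendation specific to this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature: values double at each step.
toy = pd.DataFrame({"enzyme_level": [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]})

# Sample skewness; values well above 0 indicate a long right tail.
skew_before = toy["enzyme_level"].skew()

# log1p compresses the large values, which pulls the right tail in.
skew_after = np.log1p(toy["enzyme_level"]).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```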
Next, it will be interesting to get a better sense of the numerical independent variables with scatter plots between all the numerical variable combinations.
sns.pairplot(X_Numerical)
plt.show()
With the help of this visualisation, we can identify the variables that are positively or negatively related. The features that have no relation among themselves can also be identified at a glance.

Finally, we can learn about the distribution of the categorical feature with the count plot.
sns.countplot(x="V2", data=X)
plt.show()
We learn that males are over-represented in the dataset compared to females. It is vital to understand if we have an imbalanced dataset and take appropriate action.
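The count plot gives the visual impression, and the same information can be read as proportions with "value_counts". The sketch below uses a tiny made-up column (the dataframe "toy" and its four rows are hypothetical, not the real dataset counts) to show the idea.

```python
import pandas as pd

# Hypothetical categorical column with the same kind of labels as "V2".
toy = pd.DataFrame({"V2": ["Male", "Male", "Male", "Female"]})

# Share of each category; normalize=True returns proportions instead of raw counts.
shares = toy["V2"].value_counts(normalize=True)
print(shares)

# Ratio of the most to the least frequent category as a rough imbalance indicator.
imbalance_ratio = shares.max() / shares.min()
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

The same two calls on X["V2"], or on the target y, should show how lopsided the real data is.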
You can read more on 4 Unique Approaches To Manage Imbalanced Classification Scenarios

Key Takeaways And Conclusion
We can get a lot of information, like the correlation among different features, their distribution and scatter plots, very quickly with fewer than 15 lines of code.
This set of quick visualisations helps to focus on the areas for data pre-processing before embarking on any complex modelling exercise.
With the help of these visualisations, we can learn a lot about the data and make deductions even without formal modelling or advanced statistical analysis.
You can learn more on 5 Advanced Visualisation for Exploratory data analysis (EDA) and 5 Powerful Visualisation with Pandas for Data Preprocessing
"""Full Code"""
from sklearn.datasets import fetch_openml
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
X, y = fetch_openml(name="ilpd", return_X_y=True, as_frame=True)
print(X.info())
X_Numerical = X.drop(["V2"], axis=1)
relation = X_Numerical.corr(method='pearson')
sns.heatmap(relation, annot=True, cmap="Spectral")
plt.show()
X_Numerical.hist(color="Lightgreen")
plt.show()
sns.pairplot(X_Numerical)
plt.show()
sns.countplot(x="V2", data=X)
plt.show()