The world’s leading publication for data science, AI, and ML professionals.

Visualising Machine Learning Datasets with Google’s FACETS.

An open source tool from Google to easily learn patterns from large amounts of data

Photo by Franki Chamaki on Unsplash
Photo by Franki Chamaki on Unsplash

More data beats clever algorithms, but better data beats more data : Peter Norvig

There has been a lot of uproar about how a large quantity of training data can have a tremendous impact on the results of a machine learning model. However, along with data quantity, it is also the quality that is critical to building a powerful and robust ML system. After all, ‘GARBAGE IN: GARBAGE OUT,’ i.e., what you get from the system, will represent what you feed into the system.

A Machine Learning dataset sometimes consists of data points ranging from thousands to millions, which may contain hundreds or thousands of features. Additionally, real-world data is messy, comprising of missing values, unbalanced data, outliers, etc. Therefore it becomes imperative that we clean the data before proceeding with model building. Visualizing the data can help locate these irregularities and point out where the data needs cleaning. Data Visualisation gives an overview of the entire data irrespective of its quantity and enables to perform EDA quickly and accurately.


FACETS

The dictionary meaning of facets boils down to a particular aspect or feature of something. In the same way, the FACETS tool helps us understand the various features of data and explore them without having to code explicitly.

Facets is an open-source visualization tool released by Google under the PAIR(People + AI Research) initiative. This tool helps us to understand and analyze the Machine Learning datasets. Facets consist of two visualizations, which help drill down the data and provide great insights without much work at the user’s end.

  • Facets Overview

As the name suggests, this visualization gives an overview of the entire dataset and gives a sense of the shape of each feature of the data. Facets Overview summarizes statistics for each feature and compares the training and test datasets.

  • Facets Dive

This feature helps the user dive deep into the individual feature/observation of the data to get more information. It helps in interactively exploring large numbers of data points at once.

These visualizations are implemented as Polymer web components, backed by Typescript code and can be easily embedded into Jupyter notebooks or web pages.


Usage & Installation

There are two ways in which FACETS can be used with data:

Web App

It can be used directly from its demo page, whose link is embedded below.

Facets – Visualizations for ML datasets

This website allows anyone to visualize their datasets directly in the browser without the need for any software installation or setup, without the data ever leaving your computer.

Within Jupyter Notebooks/Colaboratory

It is also possible to use FACETS within Jupyter Notebook/Colaboratoty. This gives more flexibility since the entire EDA and modeling can be done in a single notebook. Please refer to their Github Repository for complete details on installation. However, later in the article, we will see how to get going with FACETS in colab.


Data

Although you can work with the data provided on the demo page, I shall be working with another set of data. I will be doing EDA with FACETS on the Loan Prediction Dataset. The problem statement is to predict whether an applicant who a company has granted a loan will repay it back or not. It is a reasonably known example in the ML community.

The dataset, which has already been divided into Training and Testing set can be accessed from here. Let’s load our data into the Colab.

import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Now lets us understand how we can use Facets Overview with this data.

FACETS Overview

The Overview automatically gives a quick understanding of the distribution of values across the various features of the data. The distribution can also be compared across the training and testing datasets instantly. If some anomaly exists in the data, it just pops out from the data there and then.

Some of the information that can be easily accessed through this feature are:

  • Statistics like mean, median, and Standard Deviation
  • Min and Max values of a column
  • Missing data
  • Values that have zero values
  • Since it is possible to view the distributions across the test dataset, we can easily confirm if the training and testing data follow the same distributions.

One would argue that we can achieve these tasks easily with Pandas and why should we invest into another tool. This is true and maybe not required when we have few data points with minimum features. However, the scenario changes when we are talking about a large dataset where it becomes kind of difficult to analyse each and every data point in multiple columns.

Google Colaboaratory makes it very easy to work since we do not need to install other things. By writing a few lines of code, our work gets done.

# Clone the facets github repo to get access to the python feature stats generation code
!git clone https://github.com/pair-code/facets.git

To calculate the feature statistics, we need to use the function GenericFeatureStatisticsGenerator() which lies in a Python Script.


# Add the path to the feature stats generation code.
import sys
sys.path.insert(0, '/content/facets/facets_overview/python/')
# Create the feature stats for the datasets and stringify it.
import base64
from generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([{'name': 'train', 'table': train},
                                  {'name': 'test', 'table': test}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")

Now with the following lines of code, we can easily display the visualization right in our notebook.

# Display the facets overview visualization for this data
from IPython.core.display import display, HTML
HTML_TEMPLATE = """<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/master/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))

As soon as you type Shift+Enter, you are welcomed by this excellent interactive visualization:

Here, we see the Facets Overview visualization of the five numeric features of the Loan Prediction dataset. The features are sorted by non-uniformity, with the feature with the most non-uniform distribution at the top. Numbers in red indicate possible trouble spots, in this case, numeric features with a high percentage of values set to 0. The histograms at the right allow you to compare the distributions between the training data (blue) and test data (orange).

The above visualization shows one of the eight categorical features of the dataset. The features are sorted by distribution distance, with the feature with the biggest skew between the training (blue) and test (orange) datasets at the top.

FACETS Dive

Facets Dive provides an easy-to-customize, intuitive interface for exploring the relationship between the data points across the different features. With Facets Dive, you control each data point’s position, color, and visual representation based on its feature values. If the data points have images associated with them, the images can be used as visual representations.

To use the Dive visualization, the data has to be transformed into JSON format.

# Display the Dive visualization for the training data.
from IPython.core.display import display, HTML
jsonstr = train.to_json(orient='records')
HTML_TEMPLATE = """<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/master/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))

After you run the code, you should be able to see this:

Facets Dive Visualisation
Facets Dive Visualisation

Now we can easily perform Univariate and Bivariate Analysis and let us see some of the results obtained:

Univariate Analysis

Here we will look at the target variable, i.e., Loan_Status, and other categorical features like gender, Marital Status, Employment Status, and Credit history, independently. Likewise, you can play around with different features also.

Inferences:

  • Most of the applicants in the dataset are male.
  • Again a majority of the applicants in the dataset are married and have repaid their debts.
  • Also, most of the applicants have no dependents and are graduates from semi-urban areas.

Now let’s visualize the ordinal variables, i.e., Dependents, Education, and Property Area.

Following inferences can be made from the above bar plots:

  • Most of the applicants don’t have any dependents.
  • Most of the applicants are graduates.
  • Most of the applicants are from semi-urban areas.

Now you can continue your analysis with the numerical data.

Bivariate Analysis

We will find the relationship between the target variable and categorical independent variables.

It can be inferred from the above bar plots that:

  • The proportion of married applicants is higher for the approved loans.
  • The distribution of applicants with one or 3+ dependents is similar across both the categories of Loan_Status.
  • It seems people with a credit history of 1 are more likely to get their loans approved.
  • The proportion of loans getting approved in the semiurban area is higher than that in rural or urban areas.

Conclusion

FACETS provides an easy and intuitive environment to perform EDA for datasets and helps us derive meaningful results. The only catch is that currently, it only works with Chrome.

Before ending this article, let us also see a fun fact highlighting how a small human labeling error in the CIFAR-10 dataset was caught using the FACETS Dive. While analyzing the dataset, it came to notice that an image of a frog had been incorrectly labeled as a cat. Well, this is indeed some achievement since it would be an impossible task for the human eye.

Source
Source

Related Articles