Data Exploration

Recap
In Part I of this course, we introduced the names of several common machine learning algorithms, such as decision trees, k-nearest neighbors, and neural networks, and discussed how they relate to one another. We proceeded to set up our project by downloading a public domain dataset, the 500 Cities dataset, and setting up a JavaScript machine learning library called the DRESS Kit. Next, we went through the data preparation process to extract useful data points from the dataset using several basic functions from the DRESS Kit, including `DRESS.local` (to load a local file), `DRESS.save` (to save a file to your local machine), `DRESS.fromCSV` (to convert a CSV file to native JavaScript objects), `DRESS.print` (to print text onto the HTML page), and `DRESS.async` (to execute a function asynchronously). At the end of Part I, we created a JSON file, `data.json`, containing only the census tract-level crude prevalence data from the 500 Cities dataset for those census tracts with a population count of at least 50. We also created a JSON file, `measures.json`, which groups the `MeasureId`s by `Category` and maps each `MeasureId` from the original dataset to its definition.
Just as a point of reference, here is the general structure of `measures.json`.
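The exact layout depends on how you assembled the file in Part I. An abbreviated, hypothetical sketch, showing a few of the real `MeasureId`s grouped by `Category` with the remaining entries omitted, might look like this:

```json
{
    "Health Outcomes": {
        "CANCER": "Cancer (excluding skin cancer) among adults aged ≥18 years"
    },
    "Prevention": {
        "BPMED": "Taking medicine for high blood pressure control among adults aged ≥18 years with high blood pressure",
        "COREW": "Older adult women aged ≥65 years who are up to date on a core set of clinical preventive services"
    }
}
```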
DRESS Kit Update
Before proceeding with the rest of this course, please take a moment to update the DRESS Kit to release 1.1.0. This version contains several performance improvements as well as a few new features, including the ability to generate histograms and heat maps, which we are going to explore in this part of the course.
Histogram
We begin the data exploration process by performing basic descriptive analysis of the dataset so that we can get a rough idea of the quality of the available data. In particular, we want to pay attention to the presence of any missing or erroneous data points, the relative ranges of the numerical features (do we need to normalize/standardize these features?), the dimensions of the categorical features (is it feasible to do one-hot encoding on these features?), and the distributions of these features (normal, uniform, skewed, etc.).
Once again, we create a boilerplate HTML file named `part2_1.htm` that loads the DRESS Kit as well as our custom JavaScript file `part2_1.js`.
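A minimal sketch of such a boilerplate is shown below; the DRESS Kit script name and path are placeholders, so adjust them to match where you saved release 1.1.0:

```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
  </head>
  <body>
    <!-- The DRESS Kit script. The filename here is a placeholder;
         point it at wherever you saved release 1.1.0. -->
    <script src="dress.js"></script>
    <!-- Our custom script for this part of the course. -->
    <script src="part2_1.js"></script>
  </body>
</html>
```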
We load the dataset created during Part I of the course using the `DRESS.local` function. Notice that the third parameter is no longer set to `false` because the dataset is now stored in the JSON format, which can be parsed natively by JavaScript. We assign all 27 measures of chronic disease to an array so that we don't have to type them over and over again. One of the most popular and straightforward data exploration techniques used by data scientists is the histogram. It describes the distribution of the values in a dataset in a concise and intuitive manner. While there are plenty of statistical software packages that can generate high-quality multi-color histograms, they are really not necessary, because the whole idea of a histogram is that it represents an approximation of the underlying dataset. We can gather all the information we need from the rough outline of a histogram, which we can generate using the `DRESS.histograms` function.
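Here is a sketch of `part2_1.js`. The exact signature of `DRESS.local` varies between DRESS Kit releases, so treat the callback-style call below as an assumption and verify it against the release 1.1.0 documentation; the measure list is abbreviated.

```javascript
// part2_1.js (sketch): assumes DRESS.local creates the File Input button and
// hands the parsed JSON subjects to a callback; verify against the 1.1.0 docs.
DRESS.local(subjects => {
    // The 27 chronic disease measures, collected into one array so that we
    // do not have to retype them for every function call.
    const measures = ['CANCER', 'BPMED', 'COREW', /* ...the other 24 MeasureIds... */];

    // Generate a text-based histogram for each numerical measure and print it.
    DRESS.print(DRESS.histograms(subjects, measures));
});
```

The snippets in the rest of this part are assumed to run inside this same callback, where `subjects` and `measures` are in scope.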
Open `part2_1.htm` in the browser, click on the File Input button, and select the `data.json` file to generate a list of text-based histograms. Here are three of the histograms generated by the function.
Each histogram provides several key pieces of information. First, it displays the number (and percentage) of data points that contain a non-null value. We can see that there are no null values for the `CANCER` and `BPMED` measures, but a small portion (0.4%) of data points is missing from the `COREW` measure. Next, it shows the range of values (the first bar represents the minimum, the difference between adjacent bars represents the interval, and the last bar represents the maximum minus one interval). For instance, we can see that the range for the `CANCER` measure is between 0.7 and 22.9. Most importantly, we can see the rough distribution of values from the bars themselves. The `CANCER` measure is skewed heavily towards the left, while the `BPMED` measure is skewed heavily towards the right. In comparison, the `COREW` measure is nearly evenly distributed around the mean.
The `DRESS.histograms` function can also generate histograms from categorical features. Simply pass the names of the categorical features as the third parameter, as follows:
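For example, to chart the distribution of census tracts by state (the `StateAbbr` column from the original dataset), something along these lines should work, assuming an empty array is acceptable for the numerical features:

```javascript
// Histogram of a categorical feature: numerical features go in the second
// parameter, categorical features in the third.
DRESS.print(DRESS.histograms(subjects, [], ['StateAbbr']));
```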
We can see that data points from California, New York, and Texas make up the majority of the dataset, which is not entirely unexpected given the population distribution of the United States. It is important to remember that each subject in this dataset refers to a census tract, not an individual. It is also worth noting that certain states, such as Delaware, Maine, Vermont, West Virginia, and Wyoming, have very few census tracts, which may create some problems when we actually use the dataset to build our machine learning models. We need to keep this in mind.
Imputation
Notice that certain features in the dataset, such as the `COREW` measure, contain missing data points. Missing data is a fairly common problem in clinical research; participants may drop out in the middle of the study, data may have been entered incorrectly, or data collection forms may have been misplaced. This is the point at which most statistics textbooks bring up the concepts of Missing At Random (MAR), Missing Completely At Random (MCAR), and Missing Not At Random (MNAR). We are not going to waste time discussing the boring theoretical stuff behind these definitions. Suffice it to say that MCAR does not affect data analysis, but, in reality, it is almost impossible to prove that data is missing completely at random. MNAR is the exact opposite situation, in which the reason the data is missing is related to the value of the missing data. The only way to reliably recover the missing data is to address the underlying reason by modifying the data collection step or the data analysis step. For instance, in an online survey that asks about the number of hours spent on the computer, it is likely that those without computer access (or at least internet access) would not be represented in the dataset. The researcher should either attempt to collect the data in person or acknowledge during the analysis that the study is limited to those who can access the survey online. No one-size-fits-all statistical manipulation can resolve MNAR. In most well-designed clinical studies, we are dealing with MAR, which can be addressed through some sort of statistical method.

One strategy is simply to ignore the missing data points. We can either discard the whole subject or just the missing data point, depending on how we want to analyze the data. The second strategy is to replace each missing data point with a reasonable estimate, a process known as imputation. Such an estimate can be computed by borrowing a value from other non-missing data points (e.g. Last Observation Carried Forward or Baseline Observation Carried Forward), by calculating the arithmetic mean or mode, or by some sort of statistical regression analysis.
We will demonstrate some basic imputation techniques when it is time for us to actually build our machine learning models, and we will also learn that some machine learning techniques can, in turn, be used as imputation techniques.
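In the meantime, to make the idea concrete, here is a minimal sketch of the simplest of these strategies, mean imputation, written in plain JavaScript rather than with any DRESS Kit function (the helper name `meanImpute` is ours):

```javascript
// Mean imputation (sketch): replace each missing (null) value of a feature
// with the arithmetic mean of the non-missing values of that feature.
const meanImpute = (subjects, feature) => {
    const values = subjects.map(subject => subject[feature]).filter(value => value !== null);
    const mean = values.reduce((sum, value) => sum + value, 0) / values.length;
    subjects.forEach(subject => {
        if (subject[feature] === null) {
            subject[feature] = mean;
        }
    });
};

// Example: fill in the 0.4% of COREW values that are missing.
meanImpute(subjects, 'COREW');
```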
Mean, Median, and Mode
Another way to explore the data more analytically is to study the dataset's central tendency (mean, median, and mode) and dispersion (variance and interquartile range). These properties are especially important because many other statistical methods, such as regression analysis, operate on them, and we often use them to compare two or more datasets (e.g. does the treatment group have a higher average survival rate than the placebo group?).
Naturally, the DRESS Kit comes with several built-in functions to compute the central tendency and dispersion of a dataset.
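A sketch of those calls, assuming the same subjects-plus-feature-names calling convention as `DRESS.histograms` above:

```javascript
// Central tendency and dispersion of the numerical measures.
DRESS.print(DRESS.means(subjects, measures));    // mean, 95% CI, skewness, excess kurtosis
DRESS.print(DRESS.medians(subjects, measures));  // median, IQR, skewness, excess kurtosis

// Frequency table of a categorical feature; the first value listed is the mode.
DRESS.print(DRESS.frequencies(subjects, ['StateAbbr']));
```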
Here again is a portion of the results produced by the script.
The `DRESS.means` function calculates the mean, the 95% confidence interval of the mean, the skewness (based on the third standardized moment), and the excess kurtosis (based on the fourth standardized moment). The `DRESS.medians` function calculates the median, the interquartile range, the skewness (based on quartiles), and the excess kurtosis (based on percentiles). Finally, the `DRESS.frequencies` function enumerates all the possible values of a categorical feature, sorted by their frequency of occurrence (the first one being the mode).
Some very astute readers may point out that the values of most features in the dataset are not strictly normally distributed, as evidenced by the shapes of the histograms as well as the presence of significant skewness and excess kurtosis. It is, however, important to note that when we are working with a sample size of thousands or tens of thousands, whether the values are normally distributed or not is less of a concern. If we were to apply a formal normality testing algorithm, such as the Shapiro-Wilk test (which can easily be accomplished using the `DRESS.normalities` function), we would find that all of the features in the dataset are, in fact, NOT normally distributed. Just because the test result is statistically significant, however, does not necessarily mean that it is practically meaningful. Here is an excellent review paper that explains the statistics behind this. Suffice it to say, we can safely apply most parametric statistical operations to this dataset without worrying about the underlying assumption of normality.
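For completeness, a sketch of the test itself, assuming the same calling convention as the other descriptive functions:

```javascript
// Shapiro-Wilk-based normality testing on every measure (sketch; the
// subjects-plus-features calling convention is assumed, as above).
DRESS.print(DRESS.normalities(subjects, measures));
```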
Correlation
After learning about the characteristics of each individual feature in the dataset, we should switch our attention to how the various features relate to one another. Remember, the whole idea of machine learning is to build a model based on the dataset and to subsequently use the model to make predictions. Ideally, we want the various features in the dataset to be independent of one another. To put it another way, if all of the features within a dataset are highly correlated, then that dataset really contains no more information than another dataset that contains only one of those features. We also want to check for any correlation between the exposure features and the outcome features in the dataset. If there is a simple linear relationship between the exposures and an outcome of interest, then there is really no need to employ a complex machine learning algorithm.
Of course, the DRESS Kit comes with a function called `DRESS.correlations` that calculates the Pearson correlation coefficient (or the Spearman rank correlation coefficient) automatically. Unfortunately, the text output from the function can be quite long and difficult to interpret. Luckily, we can easily convert the text output into a heat map using the `DRESS.heatmap` function.
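A sketch of the two calls chained together; whether `DRESS.heatmap` accepts the output of `DRESS.correlations` directly is an assumption here, so verify against the release 1.1.0 documentation:

```javascript
// Pairwise correlations among the 27 measures, rendered as a heat map.
DRESS.print(DRESS.heatmap(DRESS.correlations(subjects, measures)));
```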

Green represents positive correlation and blue represents negative correlation. The shade of the color represents the strength of the correlation. Text in red represents statistical significance. We can see that there is little correlation between Mammography or Pap Smear and the other features, while most components of the metabolic syndrome, including high blood pressure, high cholesterol, and diabetes, are highly correlated with each other. Here is an excellent online chapter on multicollinearity. We need to keep this in mind when we are actually trying to build and interpret our machine learning models.
Wrap Up
Let’s review what we have learned in Part II. We went through the basic steps of the data exploration process. We started by creating a series of histograms using the `DRESS.histograms` function, allowing us to identify the ranges and the distributions of values in the dataset. We briefly talked about the different approaches to dealing with missing data. Next, we focused on the central tendency and dispersion of each feature in the dataset by calculating the mean, median, and mode using `DRESS.means`, `DRESS.medians`, and `DRESS.frequencies`. We touched upon the concepts of skewness and kurtosis and discussed how it is often unnecessary to worry about normal distribution when dealing with a large dataset. Finally, we demonstrated a way to assess the degree of correlation among the various features within a dataset using the `DRESS.correlations` and `DRESS.heatmap` functions.
Now that we have a general understanding of the various features within the dataset, we are ready to proceed with building our machine learning models using this dataset.