The world’s leading publication for data science, AI, and ML professionals.

How to Analyze Continuous Data from Two Groups

Statistical Hypothesis Testing + Visuals with SciPy and Seaborn

Data Visualization


Most data science practitioners have not studied traditional statistics, and many traditional statisticians do not use modern figures. Let's bridge this gap.

Imagine a task. You have tens or hundreds of features available to predict an outcome. Let’s say this outcome is the number of clicks on an eCommerce web page. Let’s say you have lots of data on eCommerce web pages: owners, URLs, hosting services, update frequencies, host country, etc. Your model predicts click counts, and it does so with impressive accuracy. But your boss asks you, "Is there a difference in the number of clicks between the web pages that come from the US versus those that come from the EU?" Can you answer this question with a hypothesis test and supporting visuals? If your answer ranges from not-at-all to probably, keep reading.

The previous example is representative of a typical data science problem. To greatly simplify the analysis, let's put together a toy dataset that stands in for the kind of big-data extraction described above.

We want to assess the effectiveness of a physical activity program. This program is meant to encourage participants to exercise more frequently. The study has two independent groups: a control group and an intervention group. Two weeks into the program, the investigator records the daily minutes of exercise each participant reports. These data form the pandas DataFrame below. Full source code here. Let’s explore the following question:

Is there a difference in daily exercise rates (minutes per day) between the control and intervention groups?

from scipy import stats
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
exercise_group = ['control']*38+['intervention']*42
exercise_rates = [25, 20, 75, 0, 50, 0, 40, 0, 0, 0, 0, 0, 25,
                 75, 0, 0, 20, 0, 0, 0, 0, 20, 20, 0, 25, 0,
                 40, 20, 40, 50, 25, 30, 25, 20, 25, 50, 30,
                 40, 20, 30, 25, 50, 0, 40, 75, 10, 15, 3, 15,
                 95, 25, 50, 40, 8, 20, 25, 50, 5, 5, 12, 30,
                 40, 10, 0, 10, 20, 20, 25, 10, 0, 50, 20, 20,
                 5, 15, 30, 10, 25, 20, 15]
exercise = pd.DataFrame({'group': exercise_group,
                         'rates': exercise_rates})
exercise
Practice Data

We can summarize these data in a table. First, get the group names from the group column.

consolidated_unique_strings = []
for unique_string, sub_df in exercise.groupby('group'):  
  consolidated_unique_strings.append(unique_string)
print(consolidated_unique_strings)
>>>['control', 'intervention']
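As an aside, pandas can produce the same list without an explicit loop; a minimal sketch, using an abbreviated stand-in for the exercise DataFrame above:

```python
import pandas as pd

# abbreviated stand-in for the exercise DataFrame built earlier
exercise = pd.DataFrame({'group': ['control']*2 + ['intervention']*3,
                         'rates': [25, 20, 40, 10, 15]})
# unique() preserves first-appearance order; here it matches
# the alphabetically sorted output of groupby
groups = exercise['group'].unique().tolist()
print(groups)  # ['control', 'intervention']
```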

Then aggregate each group's sub-DataFrame with describe(). We see the number of observations in each group, the means, similar standard deviations, and identical medians for the intervention and control groups.

consolidated = pd.DataFrame(sub_df.describe().rename(
    columns={'rates': unique_string}).squeeze()  # one summary row per group
    for unique_string, sub_df
    in exercise.groupby('group'))
consolidated.index = consolidated_unique_strings #replace row names
print(consolidated.round(1)) #round results to one decimal
Aggregated Data Summary by Group
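The same per-group summary can also be produced in one line with pandas' groupby; a minimal sketch, again using an abbreviated stand-in for the exercise DataFrame:

```python
import pandas as pd

# abbreviated stand-in for the exercise DataFrame built earlier
exercise = pd.DataFrame({'group': ['control']*3 + ['intervention']*4,
                         'rates': [25, 20, 0, 40, 10, 15, 30]})
# one describe() row per group: count, mean, std, quartiles, min/max
consolidated = exercise.groupby('group')['rates'].describe()
print(consolidated.round(1))
```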

The conventional way to view these data is a Tukey boxplot. The box has three main components: the bottom edge at the 25th percentile (1st quartile), the middle line at the median (50th percentile, or 2nd quartile), and the top edge at the 75th percentile (3rd quartile). For the blue control box, these fall at 0, 20, and 30, as shown below. The "whiskers" extending from the box mark the outlier thresholds: each whisker reaches to the most extreme observation that lies within 1.5 times the interquartile range (IQR) of the box edge. For the orange intervention box, the IQR is the 3rd quartile minus the 1st quartile, or 30 – 10 = 20, and the median falls at 20. The upper fence is Q3 + 1.5·IQR = 30 + 30 = 60, and the largest observation at or below it is 50, where the upper whisker ends. The lower fence is Q1 – 1.5·IQR = 10 – 30 = -20; however, the whiskers never extend past the range of the data, which here bottoms out at 0. Hence, we find the lower whisker floored at 0.
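The quartile, fence, and whisker values just described can be checked directly with NumPy; a small sketch using the intervention group's rates:

```python
import numpy as np

# intervention-group rates from the DataFrame above
rates = np.array([20, 30, 25, 50, 0, 40, 75, 10, 15, 3, 15,
                  95, 25, 50, 40, 8, 20, 25, 50, 5, 5, 12, 30,
                  40, 10, 0, 10, 20, 20, 25, 10, 0, 50, 20, 20,
                  5, 15, 30, 10, 25, 20, 15])
q1, q3 = np.percentile(rates, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr
# whiskers stop at the most extreme observations inside the fences
upper_whisker = rates[rates <= upper_fence].max()
lower_whisker = rates[rates >= lower_fence].min()
print(q1, q3, upper_whisker, lower_whisker)  # 10.0 30.0 50 0
```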

The two points we see above the top whisker on the orange side are outliers in the intervention group. Unless these outlier points have known data quality issues, they must always be incorporated in the analysis. Please note data science jargon inconsistency #10,000,321: outliers are called "fliers" in matplotlib documentation, on top of which seaborn is built.

sns.boxplot(x='group', y='rates', data=exercise)
Default Boxplot

Personally, I, like probably many of you, do not care for seaborn's default color palette. So, using two pretty colors from a hex chart, I will manually set the palette for the next figures, like so.

sns.set_theme(style='darkgrid')
sns.set_palette(['#299EF0','#40E0D0'])

Below are two side-by-side presentations of the boxplot, without and with the original observations. For most applications, I advocate adding the original observations to boxplots when possible. With the points added, we can see the frequency of observations along the y axis, information historically relegated to histograms; now we have the option to put both types of information on one diagram. In the control group, the (simulated) daily exercise rate remains at zero for a large fraction of participants. The intervention, the exercise program, appears to reduce the number of observations at 0 and may be associated with higher rates.

f, axes = plt.subplots(1, 2, figsize=(7, 7)) 
sns.boxplot(x='group', y='rates',data=exercise,ax=axes[0]
            ).set_title('Traditional Boxplot') 
sns.boxplot(x='group', y='rates', data=exercise,ax=axes[1]
            ).set_title('Boxplot Overlaid with Observations') 
sns.swarmplot(x='group', y='rates', data=exercise, color='0.25',
              ax=axes[1]) 
plt.show()

Alternatively, we can use a violin plot, which mirrors a kernel density estimate around each group, representing the hybrid boxplot-histogram in a more decorative fashion.

h = sns.catplot(x='group', y='rates', kind='violin', inner=None, 
                data=exercise)
sns.swarmplot(x='group', y='rates', color='k', size=3, 
                data=exercise, ax=h.ax)
h.ax.set_title('Violin Plot')

Sometimes, we just want to show the points without distraction, interference, or inference. In this case, the catplot is a great option. This is the type of plot to send someone if you have no interest in telling them anything.

g = sns.catplot(x='group', y='rates', kind='swarm', data=exercise)
g.ax.set_title('Raw Observations')

The boxenplot function is the seaborn implementation of a letter-value plot. This type of plot is particularly powerful for large datasets. Additional detail can be found in this great article.

i = sns.boxenplot(x='group', y='rates', 
                  data=exercise,showfliers=False) 
i = sns.stripplot(x='group', y='rates', 
                  data=exercise,size=4,color='0.25') 
i.set_title('Boxenplot with Points')

Finally, interpretation of the tried-and-true histogram can be improved by a smoothed representation such as the KDE plot. The kernel density view is powerful due to its visual simplicity: a Gaussian kernel, scaled by a smoothing parameter, is centered on each observation, and the kernels are summed into a continuous probability density estimate. If the underlying data are bounded or not smooth, this estimate can introduce distortions. For example, the new x axis extends from -20 to 120, while the original data range was 0 to 95. Even with such limitations, the KDE view illuminates bimodal behavior in the control group; the two clear peaks appear in the other figures but are not highlighted so clearly.

g = sns.displot(exercise, x='rates', hue='group',bins=9)
g.ax.set_title('Overlaid Histograms')

h = sns.displot(exercise, x='rates', hue='group',kind='kde',bw_adjust=.75)
h.ax.set_title('Kernel Density View')
plt.show()
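The boundary distortion mentioned above is easy to confirm: a Gaussian KDE assigns nonzero density to impossible negative exercise rates. A small check with scipy.stats.gaussian_kde (note this uses SciPy's default bandwidth rather than seaborn's bw_adjust setting):

```python
import numpy as np
from scipy.stats import gaussian_kde

# control-group rates from the DataFrame above
control = np.array([25, 20, 75, 0, 50, 0, 40, 0, 0, 0, 0, 0, 25,
                    75, 0, 0, 20, 0, 0, 0, 0, 20, 20, 0, 25, 0,
                    40, 20, 40, 50, 25, 30, 25, 20, 25, 50, 30, 40])
kde = gaussian_kde(control)
# the smoothed density leaks below the true minimum of 0
print(kde(-10.0)[0] > 0)  # True
```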

Now that the visual analysis of these data is thoroughly exhausted, we can return to the original question: is there a difference in daily exercise rates (minutes per day) between the control and intervention groups? Our visual analysis suggests maybe. Using the SciPy package, we can additionally conduct a statistical hypothesis test. The appropriate test here is the two-sided t-test for two independent samples.

The key requirement here is independence. There is no crossover between groups, and we assume participants were randomly assigned to them. If there were pre-existing differences in baseline exercise rates between the groups, the comparison would be confounded, and this t-test would not be appropriate.

The hypothesis test decides between:

Null hypothesis (Ho): There is no difference in exercise rates between the groups.

Alternative hypothesis (Ha): There is a difference in exercise rates between the groups.

As the standard deviations of the two groups were similar (21.2 and 19.8 minutes for the control and intervention groups), we can use the t-statistic formula that assumes equal variances. The resulting t statistic is -0.62, corresponding to a p-value of 0.54. At a significance level α of 0.05, this result is not significant, so we fail to reject the null hypothesis for lack of evidence of a difference in rates between the groups. Correspondingly, we would conclude the exercise program is ineffective.
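This test can be run directly with SciPy's ttest_ind; a minimal sketch using the data defined earlier, split into its two groups:

```python
from scipy import stats

# daily exercise minutes from the DataFrame constructed earlier
control = [25, 20, 75, 0, 50, 0, 40, 0, 0, 0, 0, 0, 25,
           75, 0, 0, 20, 0, 0, 0, 0, 20, 20, 0, 25, 0,
           40, 20, 40, 50, 25, 30, 25, 20, 25, 50, 30, 40]
intervention = [20, 30, 25, 50, 0, 40, 75, 10, 15, 3, 15,
                95, 25, 50, 40, 8, 20, 25, 50, 5, 5, 12, 30,
                40, 10, 0, 10, 20, 20, 25, 10, 0, 50, 20, 20,
                5, 15, 30, 10, 25, 20, 15]
# equal_var=True selects the pooled-variance (Student) t-test
t_stat, p_value = stats.ttest_ind(control, intervention, equal_var=True)
print(round(t_stat, 2), round(p_value, 2))  # -0.62 0.54
```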

This is where most analyses leave the result. However, THERE IS MORE. Remember that these results could contain error by design. In statistics, there are type I and type II errors.

  • Type I error would be to conclude there is a difference between the groups when there really is not.
  • Type II error would be to conclude there is no difference between groups when there really is.

A t-test looks at 1D data from two sources and calculates an aggregate value, or statistic, to determine whether the data resemble the figure below, with distinct, well-separated peaks. The statistic is mapped to its assumed distribution, and a p-value, the area under the tail, is calculated. A p-value below 0.05 indicates the data may look similar to those of this borrowed figure.

Visualization of the general t-test

However, our data do not match that graph. Let us compare it to our actual data. In the figure below, the green region on the left is analogous to the blue region on the right. The p-value of 0.54 is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. In other words, under the null hypothesis, random sampling alone would produce group means at least this far apart more than half the time. Thus, we conclude the peaks are not well separated, and we detect no difference.

(left) Visualization of this t-test. (right) Visualization of the general t-test

However, given the high variation in these data, especially in the control group, the power to detect the observed difference in sample means of about 3 minutes/day is low. To detect this difference with 80% power, i.e., a 20% chance of making a type II error, we would need at least 1,805 observations. Per the calculation below, generated with this free power calculator using group means of 20.8 and 23.6 with a standard deviation of 21.2, an intervention-to-control ratio of 1.1, a significance level of 0.05, and power of 0.8, the trial should have enrolled 859 + 946 = 1,805 participants rather than 80. As conducted, the test had roughly 10% power, leaving about a 90% chance of a type II error. In other words, with only 80 participants, if a true difference of this size existed, we had about a 90% chance of concluding there was no difference.

Power Calculation for t-test
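A back-of-the-envelope version of this sample-size calculation uses the normal approximation for a two-sample t-test; a sketch using the means, standard deviation, α, and power quoted above (equal allocation, so the totals differ slightly from the calculator's 1.1-ratio figures):

```python
from scipy.stats import norm

delta = 23.6 - 20.8        # difference in means to detect (minutes/day)
sd = 21.2                  # assumed common standard deviation
alpha, power = 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
z_beta = norm.ppf(power)            # ~0.84
# equal-allocation sample size per group, normal approximation
n_per_group = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
print(round(n_per_group))  # 900 per group, ~1,800 in total
```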

Though the two-sided t-test between independent groups returned a non-significant result, the test was not sufficiently powered to detect a difference of this size. Thus, the hypothesis-testing results remain inconclusive for practitioners wishing to limit error rates. While the hypothesis test was uninformative, visual analysis of the distributions yields marginal evidence that participant behavior could differ between groups.

Incorporate these tools into your own analysis. Find source notebook here.

