
Small Samples with Meaningful Results

Expand Your Data Science Toolkit with the Nonparametric Sign Test

Photo by Christopher Burns on Unsplash

Data comes in all shapes and sizes. Most statistical methods rely on assumptions about abundant and well-behaved data. Nonparametric methodologies, like the sign test, have few assumptions and so are a good alternative for analyzing small datasets with unusual shapes. Extracting meaningful information from limited data is a common problem in biology (genetic analysis), management (Delphi interviews), geology (water flow), and business (production and consumer sentiment).

This article reviews the sign test and applies it to two practical business scenarios. Ideally, the examples will prompt you to reconsider your approach to small datasets, and perhaps statistics in general. The usual assumptions and rules do not apply. Actionable results with a high P-value? Absolutely. Dependent data? Entirely manageable. A test statistic not derived from the sample’s expected value? Definitely. It’s all possible.

About the sign test

The sign test is a method based on the binomial distribution that is used to investigate small samples. The classic binomial example is flipping a fair coin and counting the numbers of heads and tails in an experiment. A lopsided count of heads or tails is unlikely, and if one did occur, it would be unexpected and worth investigating. The same principle applies to all kinds of phenomena that can be divided into one thing or another by using experiments in which each trial is classified as a success or a failure. The sign test identifies consistent differences between either matched pairs or a single sample and a reference point.

The sign test is nonparametric, meaning it is distribution free: the data does not need to come from any particular distribution. It is also possible that we know the distribution but leave the parameters (like the mean or standard deviation) unspecified. Nonparametric tests thus can be applied to data without visualizing or determining its distribution characteristics. This saves time and effort and opens up unusually distributed data to rigorous statistical analysis.

The sign test has only three assumptions. The first is intuitively obvious: each trial measurement must come from the same continuous population. This assumption is typically addressed in a study’s design phase. For example, an experiment on cola preferences will have observations only about cola and not about catsup. The second assumption is that trial measurements are consistently ordered to distinguish between a success and a failure. [1] The successes are labeled "+"; the failures, "-". This is where the sign test gets its name. Measurements can take the form of dichotomous yes/no answers or A/B preferences, Likert-type scales, or positive/negative text sentiment results. The third and final assumption is that trials are independent, so that one observation does not affect any other. This is one of the idealized properties of coin flipping: a prior toss does not influence the outcome of future ones. Later we will see how dependent data can be modified to generate useful results.

The binomial distribution and base R

The statistical method behind the sign test is the binomial test. Three quantities define it: the number of trials (n), the number of trial successes (x), and the probability of success on a single trial (p). The formula for the binomial probability mass function is:

P(X = x) = [n! / (x!(n − x)!)] · p^x · (1 − p)^(n − x)

Formula from the U.S. Department of Commerce, Public Domain [2]

The right-hand side of the equation consists of two parts: the number of combinations of x successes in n trials, multiplied by the probabilities of success and failure, each raised to an exponent. The probability term acts like a weight on the combination count. To see p's effect, the probability mass function can be plotted for 100 trials using values of 0.35, 0.5, 0.6, and 0.9.

Image by Author
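If you want to reproduce a plot like this yourself, a minimal base R sketch (not the original code behind the figure) could look like:

# Plot the binomial probability mass function for 100 trials at several
# success probabilities
n <- 100
x <- 0:n
probs <- c(0.35, 0.5, 0.6, 0.9)
ymax <- max(sapply(probs, function(p) max(dbinom(x, n, p))))
plot(x, dbinom(x, n, probs[1]), type = "l", ylim = c(0, ymax),
     xlab = "Number of successes in 100 trials", ylab = "Probability")
for (i in 2:length(probs)) {
  lines(x, dbinom(x, n, probs[i]), col = i)
}
legend("topleft", legend = paste("p =", probs),
       col = seq_along(probs), lty = 1)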

These plots show how influential p is for detecting differences because the trial results cluster around the selected probability. For large numbers of trials, the binomial distribution starts to approximate a normal distribution, and the more powerful z-test becomes a better choice.

To run an exact binomial test in R, plug in the values for the three variables, then choose a one-sided or two-sided test, and finally set the confidence level. The general syntax for a binomial test in R (p defaults to 0.5, a fair process) is [3]:

binom.test(x, n, p,  # x = successes, n = trials, p = hypothesized success probability
           alternative = c("two.sided", "less", "greater"),
           conf.level = 0.95)

The choice of hypothesis depends on the study’s goal. The "two.sided" alternative detects consistent differences in either the number of successes or failures. "Less" detects fewer successes than expected (an excess of failures); "greater" detects more successes. How the P-value is reported may vary depending on the statistical software package, so be sure to verify your results.
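For instance, a hypothetical one-sided run with 3 successes in 20 trials looks like this:

# One-sided test: are there consistently fewer successes than a fair
# p = 0.5 process would predict?
binom.test(x = 3, n = 20, p = 0.5, alternative = "less")
# The reported P-value (about 0.0013) is the probability of seeing 3 or
# fewer successes in 20 trials when the true success rate is 0.5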

The binomial test has three main limitations. First, the confidence interval calculation is not very helpful with small numbers of trials. Even with very significant P-values, the intervals will be quite wide with fewer than 50 trials. Second, other tests are more efficient if you know the distribution; a general rule is that with more than about 25 observations, alternatives to the sign test may be better. Finally, as a nonparametric test, the sign test is low powered, meaning it is more likely to fail to reject a false null hypothesis. [4] This is a Type II error and has implications for how a data scientist applies the results from a nonparametric test.

In two examples, we will utilize these statistical concepts to solve practical data science problems in business. As you read through the examples, consider how a business environment’s constraints affect a study’s design. More importantly, think about how these examples differ from classic textbook problems and how they may challenge your beliefs about applying statistics.

Scenario 1, The need for speed: Quick and approximate results

Scenario: A global telemarketing company implemented a new system for reporting sales, but the rollout has gone poorly. Any error made in entering a sale will cause the report to be rejected by an automated system so a manager can review it. Thus, reports are either accepted or rejected, and with a substantial telemarketing operation of 5,000 representatives, it’s unclear which workers would benefit most from additional training to reduce error rates. Senior management has labeled the situation a crisis because too many reports are being rejected, and the managers are overwhelmed reviewing reports manually. The sales and marketing team has asked you to identify 1,000 workers to receive additional training. They give you a data file with one row per worker and up to 25 columns, one for each report, marked accepted or rejected. Some workers submitted a handful of reports, while others sent in 25.

Solution: Limited information and time combine to yield few options to discover patterns and draw statistically relevant conclusions. The data is derived from the census of all submitted reports, N. We can then compute p by dividing the number of successfully submitted reports by N, so the expected number of accepted reports is N * p. We could simply compute an average of successful submissions for each worker and compare it to p, but that would not differentiate among employees with only a few versus many reports. However, with the results from each employee, there is enough information to run 5,000 one-tailed "less" sign tests to determine which 1,000 workers are likely underperforming and need more training.
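Here is a minimal sketch of that approach, assuming a hypothetical data frame reports with one row per worker and one column per report, coded 1 for accepted, 0 for rejected, and NA where no report was filed:

# Census probability of a successful submission across every report
p_hat <- sum(reports == 1, na.rm = TRUE) / sum(!is.na(reports))
# Per-worker counts of accepted reports and total submissions
successes <- apply(reports, 1, function(x) sum(x == 1, na.rm = TRUE))
trials <- apply(reports, 1, function(x) sum(!is.na(x)))
# One "less" one-tailed test per worker: is this worker's acceptance rate
# consistently below the census rate?
p_values <- mapply(function(s, n) binom.test(s, n, p = p_hat,
                                             alternative = "less")$p.value,
                   successes, trials)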

The study’s results will be 5,000 P-values, each associated with a worker. Remember that the P-value tells us how different our data is from what the test hypothesis predicts. As an illustration, "P = 0.01 would indicate that the data are not very close to what the statistical model (including the test hypothesis) predicted they should be, while P = 0.40 would indicate that the data are much closer to the model prediction, allowing for chance variation." [5] In a typical statistical study, the required significance to reject the null hypothesis is established in the design stage based on how the results will be used. Low P-values are important for medical studies to ensure patients receive appropriate care. For example, in disease association studies, the cutoff is quite stringent at 0.01, while gene expression studies are set at 0.05. [6] Published academic articles feature studies reporting low P-values because interesting results are defined as something different from what was expected.

In our scenario, the sales and marketing team has implicitly set the P-value threshold for you because they want a list of the 1,000 lowest-performing workers, i.e., those with the greatest difference relative to what our model’s probability of success predicts. You will select the P-value of the 1,000th-worst-performing worker as your critical value. It is entirely possible that this cutoff will not be statistically significant by conventional standards, but that is fine because the stakes of being wrong are also low. Some workers may simply miss out on training, while others may receive it unnecessarily.
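Continuing the sketch above, the training list and critical value fall out of a simple sort, and the shape check described next is a single histogram call:

# The 1,000 workers with the smallest P-values get the training slots
training_list <- order(p_values)[1:1000]
# The 1,000th-smallest P-value becomes the de facto critical value
critical_value <- sort(p_values)[1000]
# A quick look at the overall shape of the P-values
hist(p_values, breaks = 20, xlab = "P-value", main = "P-values across workers")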

To further understand P-values in studies with a large number of tests, the values can be displayed on a histogram or density plot to determine their approximate shape. Here is a density plot from a highly statistically significant study with 50 P-values:

Image by Author

Thirty P-values are clustered around 0.01 and 0.05, meaning many null hypotheses have been rejected. The remaining twenty P-values are evenly distributed and represent null hypotheses that were not rejected. Very well-behaved data like this is much appreciated by researchers but is rare. Compare this graph to a hypothetical density plot for the first 50 P-values in our scenario:

Image by Author

This plot is much more difficult to interpret. The greatest density is on the right at 1.0, representing individual workers who perfectly filled out all their reports. The workers on the far left had all rejected reports. But there are three other bumps in the middle. There may be many possible explanations for these bumps, but we need more data to determine why they are there. Having said that, the existing P-values are still usable for decision making. For this hypothetical sample of 50 P-values, the bottom 20% of workers (coincidentally) would have a P-value <= 0.2 and be sent for additional training. Graphically, this comprises all of the workers with P-values from the red arrow to 0.0.

At the completion of this task, you can offer further insights. Additional data, like worker location or job tenure, might explain the humps in the P-values. After the training is completed, a paired sign test can be run on participants to look for improvements in the success rate for submissions. More powerful tests can be considered to identify the most successful report submitters for bonus compensation purposes. (Remember the nonparametric sign test is prone to Type II errors, meaning it’s more likely a high-performing worker will be rejected for a bonus when they deserved one. The stakes are higher for compensation than training assignments.) Each new insight will bring clarity and aid in rational decision making.
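As a rough sketch of that paired follow-up test, suppose hypothetical vectors rate_before and rate_after hold each trained worker's acceptance rate before and after the training:

# Sign of each worker's change; ties (no change) are dropped, as is
# customary for the sign test
change <- rate_after - rate_before
plus <- sum(change > 0)
minus <- sum(change < 0)
# Did significantly more workers improve than get worse?
binom.test(plus, plus + minus, p = 0.5, alternative = "greater")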

Scenario 2, Complex data: Simplified statistics

Scenario: An online electronics retailer sells more than 2,000 products on its platform. The corporate buyer will be attending a trade show and needs an analysis to support which products to buy based on customer ratings and reviews from its website. In preparation for your analysis, you have scraped quantitative Likert-type 5-point ratings and qualitative written reviews for all the products. How will you determine which products were the highest rated and best reviewed?

Solution: The dataset is complex and can be analyzed from different perspectives. You could use a mixed methods approach, though some products may have a paucity of written reviews. Also, the data might not be independent because early reviews may influence later ones. Although the dataset itself is large, you will need to drill down to individual products, and there may not be many observations for some electronics. The products are also dissimilar – the average rating for a television may be quite different from one for a hearing aid battery. In any event, the retailer will have to order highly rated products across all categories to carry a complete electronics selection. This implies there is no useful mean for all items sold because of product heterogeneity.

To formulate actionable recommendations, the data can be simplified and transformed to run sign tests. Because there is no useful population mean, the Likert-type scale can serve as an a priori proxy. This is reasonable because consumers make purchasing decisions based on their perceptions and not a statistical average. If "3 stars" is "average," then testing each rating for "stars" > 3 will identify above-average-rated products. Some information is lost, in this case the distinction between 4- and 5-star ratings, but the tradeoff is a new dichotomous variable. The qualitative reviews can be run through a sentiment analysis to classify them as "positive" or "negative." The result is a vector of logical data for each product. The first two assumptions for a sign test are now satisfied; each product is from the same continuous population, and the data is meaningfully ordered.
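To make the sentiment half of that transformation concrete, here is a deliberately simple word-count classifier; a real project would substitute a proper sentiment analysis package, and the word lists below are only placeholders:

# Toy sentiment step: label a review positive (1) if it contains at least
# as many words from a small positive lexicon as from a negative one
positive_words <- c("great", "excellent", "love", "reliable")
negative_words <- c("broken", "poor", "terrible", "returned")
classify_review <- function(text) {
  words <- tolower(unlist(strsplit(text, "\\W+")))
  score <- sum(words %in% positive_words) - sum(words %in% negative_words)
  as.integer(score >= 0)
}
# The star ratings become dichotomous the same way: rating > 3 is a
# success ("+"), anything else a failure ("-")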

The transformed data is almost ready to be analyzed, but the third assumption is a problem because the data is likely not independent. This is difficult to overcome but not impossible if we can determine the nature of the dependency. The most direct method is to examine the data to see if any relationship exists between early and subsequent reviews and ratings. However, this is also an opportunity to embrace external research and incorporate it into our own model. A plethora of academic literature is devoted to consumer psychology and purchasing choices, especially for online retail. Recent research demonstrated the impact the first review has on a product’s subsequent star ratings. Using a case study of electronics products, the researchers found that a negative first review was associated with an average of 0.6 fewer stars, regardless of differences in product quality or price. [7] We can use this kind of information to adjust for data dependency. This R code produces a list of highly rated products at a 0.1 significance level: [8]

library(dplyr)

# Count the number of trial success ratings (stars above 3) for products
# with a positive first review; only the rating columns are compared
pos_review <- filter(v, Review == 1) %>%
  select(3:ncol(v)) %>%
  apply(1, function(x) sum(x > 3))
# Count the success ratings for products with a negative first review;
# the threshold relaxes to >= 3 to adjust for the negative first review
neg_review <- filter(v, Review == 0) %>%
  select(3:ncol(v)) %>%
  apply(1, function(x) sum(x >= 3))
# Create a vector of all successes (positive-review rows are sorted first,
# so this matches the row order of v)
success <- c(pos_review, neg_review)
# Extract just the ratings to find the number of trials (nonzero ratings)
ratings <- select(v, 3:ncol(v))
trials <- apply(ratings, 1, function(x) sum(x > 0))
# Find the one-sided ("greater") P-value for each product
p_val <- mapply(function(s, n) binom.test(s, n, p = 0.5,
                                          alternative = "greater")$p.value,
                success, trials)
# Create a dataframe of each product's P-value and keep products at the
# 0.1 significance level
v_new <- data.frame(select(v, Product), p_val) %>%
  filter(p_val <= 0.1)
print(v_new)

This scenario’s proposed model is greatly simplified to illustrate how a data scientist can choose what constitutes a successful trial. There are many avenues to refine the analysis; for sentiment analysis, you can use intensity levels to determine if more strongly worded reviews were correlated with larger shifts in star ratings. You also can adjust the hypotheses to expand or narrow the scope of the output to suit the corporate buyer. As long as the data is meaningfully measured in a sign test, a researcher can apply a priori information in lieu of observational mean data to develop worthwhile and purposeful results.

Moving forward

Small datasets can be overlooked sources for decision making. Any time data is limited or unusual, think of running a sign test. Its minimal assumptions provide tremendous flexibility to investigate ordered data. Do not be constrained by traditional rules for test statistics, significance, or dependence. Carefully craft your experiment to elicit insights, and the results will be meaningful and actionable.

References and Notes

[1] U.S. Department of Commerce, NIST/SEMATECH e-Handbook of Statistical Methods (2012), https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/signtest.htm

[2] U.S. Department of Commerce, NIST/SEMATECH e-Handbook of Statistical Methods (2012), https://www.itl.nist.gov/div898/handbook/eda/section3/eda366i.htm

[3] DataCamp, RDocumentation (2022), https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/binom.test

[4] W.W. LaMorte, MPH Online Learning Modules (2017), https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric8.html#:~:text=A%20Type%20I%20error%20occurs,due%20to%20small%20sample%20size

[5] S. Greenland et al, "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations," European Journal of Epidemiology (2016), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4877414/

[6] M. Jafari and N. Ansari-Pour, "Why, When and How to Adjust Your P Values?," Cell Journal (2019), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6099145/

[7] S. Park, W. Shin, and J. Xie, "The Fateful First Customer Review," Marketing Science (2021), https://pubsonline.informs.org/doi/abs/10.1287/mksc.2020.1264

[8] The code is for a sample dataset, v, with column 1 being the product name, column 2 the review, and columns 3 to 8 "star" ratings. Rating N/A’s were replaced by zeroes. The data was sorted so that positive reviews were the top rows and negative thereafter. As an example, a dataset might look like this:

Hypothetical Data Created by Author.
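For readers who want to run the code above end to end, a hypothetical v matching this description could be built like so:

# Hypothetical dataset: product name, first-review sentiment (1 = positive,
# 0 = negative), and six star-rating columns with N/A recoded as 0;
# positive-review rows are sorted to the top, as the code assumes
v <- data.frame(
  Product = c("Headphones", "Webcam", "HDMI Cable", "Soundbar"),
  Review = c(1, 1, 0, 0),
  R1 = c(5, 4, 3, 2),
  R2 = c(4, 5, 4, 3),
  R3 = c(5, 3, 3, 1),
  R4 = c(4, 0, 5, 3),
  R5 = c(0, 0, 3, 0),
  R6 = c(0, 0, 0, 0)
)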
