How many samples do you need?

Simulation in R

Dr. Marc Jacobs
Towards Data Science
4 min read · Sep 30, 2021

In aquaculture, experiments are conducted in tanks containing fish, and the fish are measured individually, which is a tedious and stressful endeavor. I was asked to calculate how many fish one would need to sample in order to maintain a power equivalent to that of an experiment in which every fish is measured.

To detect an overall effect, I used the F-value from an ANOVA. The p-value obtained from such a model shows whether, in an experiment containing multiple treatments, there is a ‘significant’ difference between any of the treatments. The limit for ‘significance’ is often set at 0.05, which means we accept a 5% Type 1 Error, or a False Positive Rate of 1 out of every 20 replicated trials (not diving too deep into Frequentist theory here).

To try and answer the original question, I received a dataset containing observations on individual fish per tank across treatments. From Frequentist theory we know that the power to obtain a significant p-value (if a true effect exists) depends on the mean, the standard deviation, and the sample size. The mean and the standard deviation are used to calculate the coefficient of variation within a treatment and the effect size between treatments. In this example, the effect size is the difference between treatments, scaled by the standard deviation.

For this exercise I used the R programming language, but the analysis can easily be replicated in SAS, Python, or any other language of choice.

Let's load the libraries I think I will need.
Then, let's import the data and take a look at it.
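
A minimal sketch of those two steps; the file name ARCdata.csv and the use of the tidyverse are my assumptions, as the original code is not shown:

library(tidyverse)  # dplyr, ggplot2, readr, tibble

# Hypothetical file name; the original dataset is not public
ARCdata <- read_csv("ARCdata.csv") %>%
  mutate(Tank = factor(Tank),
         Diet = factor(Diet))

glimpse(ARCdata)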
Here, you can see the spread of data by treatment and by tank. The code below shows that each tank contains 35 observations.
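
Something along these lines produces that count, assuming the tank identifier column is called Tank as in the output:

ARCdata %>%
  group_by(Tank) %>%
  summarise(no_rows = n())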
# A tibble: 12 x 2
   Tank  no_rows
   <fct>   <int>
 1 120        35
 2 121        35
 3 122        35
 4 123        35
 5 125        35
 6 126        35
 7 127        35
 8 128        35
 9 129        35
10 132        35
11 133        35
12 137        35
The model and metric used to test the current power (post hoc) were the ANOVA and the F-test, respectively. The F-test looks for differences between all treatments in order to state that there is a difference, somewhere between treatments, given the chosen False Positive Rate.

The results, significance, and F-value are shown in the output below.
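A sketch of the call that produces this output, plus one way to extract the p-value and F-value from the fitted model:

# Fit the one-way ANOVA as a linear model
fit <- lm(gain ~ Diet, data = ARCdata)
summary(fit)

# Overall F-test: extract the F-statistic and its p-value
f <- summary(fit)$fstatistic
unname(pf(f[1], f[2], f[3], lower.tail = FALSE))  # p-value
unname(f[1])                                      # F-value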

Call:
lm(formula = gain ~ Diet, data = ARCdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-204.067  -48.865   -4.894   45.291  213.933 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 290.7048     6.9934  41.569   <2e-16 ***
DietB       -12.6381     9.8901  -1.278    0.202    
DietC       -15.9836     9.9139  -1.612    0.108    
DietD         0.2087     9.9139   0.021    0.983    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 71.66 on 414 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.01037,    Adjusted R-squared:  0.003198
F-statistic: 1.446 on 3 and 414 DF,  p-value: 0.2289

[1] 0.2288678
[1] 1.445967
To perform a post-hoc power analysis, I need to look at the probability of obtaining a significant p-value (p <= 0.05). I applied bootstrapping (resampling with replacement) to estimate the population mean and standard deviation (sd). Once I have the population mean and sd, I can simulate 1,000 trials and look at the achieved ‘power’ in a population.

In the output below, you see the differences between the population mean and standard deviation and the sample mean and sd. This difference is often called ‘bias’, and it gives a good indication of how useful the sample data are for simulation.
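
A sketch of the bootstrap under the same assumptions as before; the helper boot_stats and the objects popmean and popsd are my names:

# Bootstrap mean and sd per tank (resampling with replacement)
boot_stats <- function(x, B = 1000) {
  x <- na.omit(x)
  c(mean = mean(replicate(B, mean(sample(x, replace = TRUE)))),
    sd   = mean(replicate(B, sd(sample(x, replace = TRUE)))))
}

est     <- sapply(split(ARCdata$gain, ARCdata$Tank), boot_stats)
popmean <- est["mean", ]
popsd   <- est["sd", ]

# Bias: bootstrap (population) estimate minus the raw sample statistic
popmean - tapply(ARCdata$gain, ARCdata$Tank, mean, na.rm = TRUE)
popsd   - tapply(ARCdata$gain, ARCdata$Tank, sd,   na.rm = TRUE)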

         120          121          122          123          125          126 
 0.020574286 -0.086813862  0.085062857 -0.243245714  0.185080000 -0.052434286 
         127          128          129          132          133          137 
 0.195677143 -0.005534286 -0.153240000  0.114102857  0.044703253  0.031260000 

      120       121       122       123       125       126 
-1.318715 -1.472052 -1.720122 -1.763247 -1.601380 -1.356375 
      127       128       129       132       133       137 
-1.600663 -1.624573 -1.583112 -1.250887 -1.307703 -1.889067 

Then, I started simulating 10,000 new trials, using the population mean and standard deviation, looking at the power achieved when sampling 35 fish per tank.
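
A sketch of that simulation; the power helper and the tank-to-diet mapping are my reconstructions of what the output implies:

# Power: percentage of simulated trials with p <= 0.05
power <- function(p) round(mean(p <= 0.05) * 100, 2)

tank_diet <- ARCdata %>% distinct(Tank, Diet) %>% arrange(Tank)
n <- 35

# Draw new fish per tank from Normal(popmean, popsd) and re-run the F-test
# (assumes popmean/popsd are ordered the same way as tank_diet$Tank)
pvals <- replicate(10000, {
  sim <- data.frame(
    Diet = rep(tank_diet$Diet, each = n),
    gain = rnorm(nrow(tank_diet) * n,
                 mean = rep(popmean, each = n),
                 sd   = rep(popsd,   each = n)))
  f <- summary(lm(gain ~ Diet, data = sim))$fstatistic
  pf(f[1], f[2], f[3], lower.tail = FALSE)
})

x <- data.frame(V1 = pvals)
apply(x, 2, FUN = power)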

apply(x, 2, FUN = power)
   V1 
39.21 

As you can see, the results are not that great, showing a post-hoc power of 39% based on population values. This means that the effect size found in the current experiment does not allow for sampling fewer than 35 fish per tank, not unless we can increase the effect size of the differences.

To simulate such a scenario, we need to take the population mean, standard deviation, and coefficient of variation, and start expanding the effect sizes by increasing the mean value of one single treatment. Here, I chose treatment D (and its tanks 121, 126, 129).

print(popcv <- popsd / popmean)
      120       121       122       123       125       126 
0.2036404 0.2273521 0.2912939 0.2887722 0.2576816 0.2100662 
      127       128       129       132       133       137 
0.2421316 0.2734029 0.2557708 0.2290366 0.2546334 0.2381327 
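
The simulation itself then runs over a grid of effect sizes and sample sizes. A sketch, in which the grid values, object names, and loop structure are my assumptions (for brevity I only keep the p-values):

ES_grid <- seq(1.00, 1.30, by = 0.05)  # multipliers on treatment D's mean
N_grid  <- seq(5, 35, by = 5)          # fish sampled per tank
nsim    <- 1000

tanksD  <- c("121", "126", "129")      # the tanks on treatment D
pvalues <- array(NA, dim = c(length(ES_grid), length(N_grid), nsim),
                 dimnames = list(ES = ES_grid, N = N_grid, sim = NULL))

for (i in seq_along(ES_grid)) {
  mu <- popmean
  mu[tanksD] <- mu[tanksD] * ES_grid[i]  # inflate treatment D's mean
  sdv <- popcv * mu                      # keep the CV per tank constant
  for (j in seq_along(N_grid)) {
    n <- N_grid[j]
    for (k in seq_len(nsim)) {
      sim <- data.frame(
        Diet = rep(tank_diet$Diet, each = n),
        gain = rnorm(length(mu) * n, rep(mu, each = n), rep(sdv, each = n)))
      f <- summary(lm(gain ~ Diet, data = sim))$fstatistic
      pvalues[i, j, k] <- pf(f[1], f[2], f[3], lower.tail = FALSE)
    }
  }
}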

The code above is a lot to digest, but what I basically do is ask R to fill an array with the test results while varying a grid of effect sizes and numbers of fish sampled per tank.

Next comes some data wrangling to get the variables ‘power’, ‘N’, and ‘ES’ into shape, followed by the code to produce the graph, which should be self-explanatory once you see the graph.
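
A sketch of both steps, assuming the pvalues array from the previous block:

# Average power per (ES, N) cell, then reshape for ggplot
powermat <- apply(pvalues, c(1, 2), function(p) mean(p <= 0.05) * 100)
powerdf  <- as.data.frame(as.table(powermat), responseName = "power")
powerdf$N <- as.numeric(as.character(powerdf$N))

ggplot(powerdf, aes(x = N, y = power, colour = ES, group = ES)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = 80, linetype = "dashed") +  # target power
  labs(x = "Number of fish sampled per tank", y = "Power (%)",
       colour = "Effect size (ES)") +
  theme_bw()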

The graph showing the results of the simulation. Changing variables are the Effect Size (ES) and Number of Fish. The curves are not smooth because only 1000 samples were used, which took about 60 minutes on an i7.

The graph above clearly shows that power is a function of both effect size and the number of fish sampled. To reach a power of 80%, you need a certain minimum effect size, and no number of fish can compensate for its absence. Hence, when designing an experiment, you need to do whatever you can to create the effect size necessary to separate the signal from the noise.
