Research Methods Involving Human Studies

Research hypothesis, hypothesis testing, and significance tests

Fangyi Yu
Towards Data Science


I am currently working on two research projects, both of which require me to conduct a human study to validate my research hypotheses. I thought: Why not take advantage of this opportunity to note down the key points of research methods involving human studies for future reference? And that’s why this article exists.

Research Hypothesis

When performing experimental research, the first step is often to formulate a research hypothesis: a specific problem statement that can be empirically evaluated. Typically, an experiment includes at least one null hypothesis (H₀) and one alternative hypothesis (Hₐ). H₀ usually states that no difference exists between experimental treatments, and Hₐ is always a proposition that contradicts H₀. The experiment’s objective is to gather statistical evidence to disprove or nullify H₀ in order to support Hₐ (Rosenthal and Rosnow, 2008). Generally, a good hypothesis meets the following criteria:

  • is focused on a testable topic that can be addressed in one experiment;
  • is written in concise, straightforward language;
  • explicitly describes the control groups or experimental conditions.

Variables

A well-defined hypothesis must specify the dependent variables (DVs) and independent variables (IVs). To distinguish the two: IVs are the treatments or conditions that researchers control, while DVs are the outcomes that researchers measure (Oehlert, 2010). The five types of DVs typically assessed are efficiency, accuracy, subjective satisfaction, ease of learning and retention rate, and physical or cognitive demand. For example, consider the following H₀:

There is no difference in time required to log in when using FaceID and fingerprint.

The IV in this example is the authentication method (FaceID or fingerprint), and the DV is the time spent logging in, a measure of efficiency.
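To make the IV/DV distinction concrete, here is a minimal sketch of how the trial data for this hypothesis might be recorded; the participant IDs and login times are made up purely for illustration.

```python
# Hypothetical trial records: one row per login attempt.
# The column names and values are invented for this example.
import pandas as pd

trials = pd.DataFrame({
    "participant": [1, 2, 3, 4],
    "auth_method": ["FaceID", "FaceID", "fingerprint", "fingerprint"],  # IV
    "login_time_s": [2.3, 2.8, 3.1, 3.4],                               # DV (efficiency)
})

# Mean login time per condition, the quantity the hypothesis is about.
print(trials.groupby("auth_method")["login_time_s"].mean())
```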

Experimental Design

Following that, we need to design the experiment based on the research hypothesis we established. This enables us to paint a broad picture of the experiment’s overall scope and develop a reliable estimate of the experiment’s budget and timeline. The experiment’s structure can be defined by answering two questions:

  • How many IVs do we want to examine throughout the experiment?
  • How many different values (conditions) does each IV contain?

For the first question, if the experiment has just one IV, a simple single-factor design suffices. If there are two or more IVs, a factorial design is required, as sketched below.
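As an illustration, this small sketch enumerates the conditions of a hypothetical 2×2 factorial design; the second IV (device type) is invented purely for this example.

```python
# Enumerating the conditions of a hypothetical 2x2 factorial design:
# two IVs (authentication method and device type), each with two values.
from itertools import product

auth_methods = ["FaceID", "fingerprint"]   # IV 1
devices = ["phone", "tablet"]              # IV 2 (hypothetical)

conditions = list(product(auth_methods, devices))
for auth, device in conditions:
    print(f"condition: auth={auth}, device={device}")
# 2 x 2 = 4 conditions in total
```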

Between-group and Within-group

Depending on the answer to the second question, we adopt either a between-group or a within-group design. In a between-group design, each participant is exposed to just one experimental condition; in a within-group design, each participant stays for the entire experiment and is exposed to all experimental conditions. The following illustrates a between-group and a within-group design for the login example given above.

Between-group (left) and Within-group (right) (image by author)

Both designs have their pros and cons, and these are largely mirror images of each other. The choice between a between-group and a within-group design also determines which type of significance test is appropriate.

Pros and cons of the between-group and within-group design (image by author)

So when should we choose which? Generally, researchers prefer the within-group design unless one of the following conditions holds: the experiment investigates simple tasks with small individual differences; the tasks would be significantly impacted by the learning effect; or the problem cannot be investigated using a within-group design (e.g., the conditions are mutually exclusive, such as users from Canada versus users from China).

The within-group design is favored by researchers, and its limitations can be mitigated in several ways. The negative impact of learning effects and fatigue can be reduced by using a Latin Square Design to counterbalance the order of conditions. The learning effect can also be reduced by allowing appropriate training time for users to become familiar with the task, since the typical human learning curve shows rapid progress in the early stages, followed by progressively smaller gains with continued practice. Moreover, to address fatigue induced by multiple experimental tasks, it is typically recommended that a single session last no more than 60 to 90 minutes.
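As a sketch of the Latin Square idea, the simple cyclic construction below generates condition orders so that every condition appears exactly once in every position across participants; fully balanced Latin squares, which also control first-order carryover effects, require a slightly more involved construction.

```python
# A cyclic Latin square for counterbalancing task order.
# Each row is the condition order for one participant (or group of
# participants); every condition appears once in every position.
def latin_square(conditions):
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)]
            for row in range(n)]

for order in latin_square(["A", "B", "C", "D"]):
    print(order)
```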

Additionally, it is critical to conduct multiple pilot tests prior to conducting the real data collection in order to discover possible biases.

Statistical Analysis

Almost all experimental studies are examined and reported using significance tests, which enable us to assess how confidently the findings obtained from a sample can be generalized to the larger population. However, before deciding on the type of test to use, it is necessary to understand the difference between Type I and Type II errors.

Type I and Type II errors

A Type I error, sometimes referred to as an α error or a “false positive,” is the mistake of rejecting H₀ when it is true and should not be rejected. A Type II error, frequently referred to as a β error or a “false negative,” is the mistake of failing to reject H₀ when it is false and should be rejected (Rosenthal and Rosnow, 2008). For example, in the case of COVID testing, consider the hypothesis: “The person is COVID positive”.

Then the Type I error would be: “The person is tested COVID positive while she/he is actually negative”. The Type II error would be: “The person is tested COVID negative while she/he is actually positive”.

Type I errors are generally considered more harmful than Type II errors. Statisticians refer to Type I errors as “gullibility errors” and Type II errors as “blindness errors.” A Type I error may result in a scenario worse than the existing state, whereas a Type II error may only cost us a chance to improve the current state.

Controlling the risks of Type I and Type II errors

We must consider the possibility of making Type I and Type II errors when planning experiments and analyzing data. In statistics, the probability of making a Type I error is referred to as alpha (α), also called the significance level; the P-value computed from the data is compared against this threshold. The probability of making a Type II error is referred to as beta (β). A test’s statistical power, defined as 1 − β, is the probability of correctly rejecting H₀ when it is false and should be rejected (Cohen, 2013).

Typically, a low significance threshold (α = 0.05) is used to minimize the occurrence of Type I errors. If a significance test yields a statistic greater than the critical t value at P < 0.05, the probability of making a Type I error is less than or equal to 0.05. In other words, the probability of mistakenly rejecting H₀ is less than 0.05.
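A quick simulation can make this concrete: if we repeatedly draw both groups from the same population (so H₀ is true by construction), roughly 5% of t-tests will still come out significant at the 0.05 level, purely by chance. The distribution parameters below are arbitrary.

```python
# Monte Carlo estimate of the Type I error rate at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 10_000
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(loc=3.0, scale=1.0, size=20)  # same population...
    b = rng.normal(loc=3.0, scale=1.0, size=20)  # ...so H0 is true
    _, p = stats.ttest_ind(a, b)
    false_positives += p < 0.05

print(false_positives / n_experiments)  # ~0.05
```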

To control the occurrence of Type II errors, it is typically recommended to use a large sample size, which allows us to detect the difference between two conditions even when the effect size is small.
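For example, an a priori power analysis estimates the required sample size before data collection. The sketch below, using statsmodels, assumes a medium effect size (Cohen’s d = 0.5), α = 0.05, and the conventional 80% power.

```python
# How many participants per group are needed to detect a medium effect
# with an independent-samples t-test at alpha = 0.05 and 80% power?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # roughly 64 participants per group
```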

Significance Tests

The ultimate goal of user studies incorporating multiple conditions or groups is to determine if there are any differences between the conditions or groups. Due to the data’s variance, we cannot simply compare the means of the different conditions and declare that a difference exists because the means are different. Rather than that, we must apply statistical significance tests to determine which variances can be explained by the IVs and which cannot. The significance test will indicate the probability that the observed difference occurred randomly. If the probability of the difference occurring by chance is small (less than 0.05), we can confidently state that the observed difference is due to the difference in the controlled IVs.

Commonly used significance tests for comparing means (Image by author)

The image above illustrates some regularly used significance tests for comparing means. A t-test can be viewed as a simplified version of ANOVA (also known as the “F test”) that involves just two groups or conditions. These tests can easily be run in software such as SPSS, Excel, or Python. Let’s see how to interpret the test results, taking the t-test as an example.
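For readers who prefer Python, here is a minimal sketch of an independent-samples t-test with scipy; the login times below are fabricated for illustration.

```python
# Independent-samples t-test comparing login times (seconds) between
# two made-up groups: one using FaceID, one using fingerprint.
from scipy import stats

faceid_times = [2.1, 2.4, 1.9, 2.6, 2.2, 2.0, 2.5, 2.3]
fingerprint_times = [2.8, 3.0, 2.6, 3.2, 2.9, 2.7, 3.1, 2.8, 3.0]

t_stat, p_value = stats.ttest_ind(faceid_times, fingerprint_times)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```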

The software returns a t value, with larger t values indicating a greater likelihood that H₀ is false. In other words, the greater the t value, the more probable it is that the two means differ. Bear in mind that the t value must be reported together with the degrees of freedom and the significance level, which helps readers determine whether the data analysis was conducted properly and interpret the findings accurately.

Here is an example of reporting a t-test result:

An independent-samples t-test suggests that there is a significant difference in login time between the group that used FaceID and the group that used fingerprint (t(15) = 2.178, p < 0.05).

The t value in the example above is 2.178, which is greater than the critical t value for the given degrees of freedom (df = 15) at the 95% confidence level (t = 2.131, which can be found in a t-table). That is why the difference is reported as significant.
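Instead of consulting a t-table, the critical value can be computed directly; the sketch below looks up the 97.5th percentile of the t distribution with df = 15, which corresponds to a two-tailed test at the 95% confidence level.

```python
# Critical t value for a two-tailed test, alpha = 0.05, df = 15.
from scipy import stats

critical_t = stats.t.ppf(0.975, df=15)  # 97.5th percentile (two-tailed)
print(round(critical_t, 3))  # 2.131
```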

Two-tailed and one-tailed

The tests discussed above are two-tailed, meaning the direction of the difference is not specified: the use of FaceID may increase login speed, decrease it, or have no effect. However, in certain empirical research, the hypothesis specifies the direction of the difference. For instance, in the login time example, suppose the hypothesis were changed to:

Individuals who use FaceID spend less time logging in than those who use fingerprints.

In this case, we expect the use of FaceID to improve login speed, so a one-tailed t-test should be used. For a one-tailed test at the 0.05 level, a t value greater than the critical value at the 90% confidence level of the two-tailed test indicates that H₀ is false and that the difference between the two means is significant.
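Continuing the scipy sketch from above, the one-tailed version only requires the alternative argument (available in scipy 1.6 or later); the login times are the same fabricated data.

```python
# One-tailed independent-samples t-test: the alternative hypothesis is
# that the FaceID group's mean login time is *less* than the
# fingerprint group's.
from scipy import stats

faceid_times = [2.1, 2.4, 1.9, 2.6, 2.2, 2.0, 2.5, 2.3]
fingerprint_times = [2.8, 3.0, 2.6, 3.2, 2.9, 2.7, 3.1, 2.8, 3.0]

t_stat, p_value = stats.ttest_ind(faceid_times, fingerprint_times,
                                  alternative="less")
print(f"t = {t_stat:.3f}, one-tailed p = {p_value:.4f}")
```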

Conclusion

In conclusion, a typical lifecycle of a Human-Computer Interaction research experiment is as follows:

  1. Identify the research hypothesis.
  2. Specify the study design.
  3. Conduct a pilot study to evaluate the research’s design, system, and instruments.
  4. Recruit participants.
  5. Run the actual data collection sessions.
  6. Analyze the data.
  7. Report the results.

We have covered a lot in this article, but there is much more to discuss.

If you are interested in this topic, I highly recommend reading the book “Research Methods in Human-Computer Interaction” by Lazar, Feng, and Hochheiser (Lazar et al., 2017). It further discusses common research approaches such as surveys, diaries, case studies, and interviews; how to manage structured usability tests; how to analyze qualitative data; data collection approaches; ethical issues; and more.

I appreciate you taking the time to read my blogs. Keep an eye out for updates, and feel free to leave a comment or connect with me on LinkedIn.

References:

Rosenthal, R., & Rosnow, R. L. (2008). Essentials of behavioral research: Methods and data analysis. McGraw-Hill.

Oehlert, G. W. (2010). A first course in design and analysis of experiments.

Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Routledge.

Lazar, J., Feng, J. H., & Hochheiser, H. (2017). Research methods in human-computer interaction. Morgan Kaufmann.
