
Do you know the median age of a person in the United States? You don't have to google it; I have done that for you below.

That quick Google search revealed that the median age of a person in the US is 38. Have you ever wondered how statisticians at the Census Bureau came up with that number? Do you think they went and asked everyone, in person or by mail? No, because that would be a waste of time, money, and resources just to find one statistic and put it up on their website, all bold and fancy.
So how do they do it? They use some basic principles of inferential statistics.
Alright, so in this article, we will answer the following question using statistical inference:
Are women paid less than men?
Let us scratch the surface of inferential statistics before diving into the case study.
Some Background on Inferential Statistics
Population: The set that contains all data points in the space we are studying. Population size is denoted by N.
Sample: A randomly selected subset of the population. The sample size is denoted by n.
Distribution: Describes the range of the data (population or sample) and how the data is spread across that range.
Mean: The average value of all data in your population or sample. It is denoted by µ for populations and x̄ for samples.
Standard Deviation: A measure of how spread out your population is, denoted by σ (sigma).
Normal Distribution: When your population is spread symmetrically around the mean value, with the width of the spread set by the standard deviation σ, you get the following bell-shaped curve.

Central Limit Theorem
From Wikipedia:
In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed.
The video below has a very intuitive explanation of the Central Limit Theorem.
In other words, the theorem states that no matter what the shape of the initial population is, the sampling distribution of the sample mean approaches a normal distribution as the sample size grows (provided the population has a finite variance).
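To make the theorem concrete, here is a minimal simulation sketch. It assumes a synthetic, strongly right-skewed population (exponentially distributed values, not GSS data), and the names `skewness` and `sample_means` are illustrative helpers. Skewness is 0 for a symmetric distribution, so a value much closer to 0 for the sample means than for the population is what the theorem leads us to expect.

```python
import random
import statistics


def skewness(values):
    """Average cubed z-score: 0 for a symmetric distribution, > 0 if right-skewed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples random samples (with replacement) and return their means."""
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic right-skewed population (an assumption for illustration only).
population = [rng.expovariate(1 / 40_000) for _ in range(20_000)]

means = sample_means(population, sample_size=100, n_samples=2_000, rng=rng)

print(f"population skewness:   {skewness(population):.2f}")
print(f"sample-means skewness: {skewness(means):.2f}")
print(f"population mean:       {statistics.fmean(population):,.0f}")
print(f"mean of sample means:  {statistics.fmean(means):,.0f}")
```

The sample size of 100 and the 2,000 repetitions are arbitrary choices; a smaller sample size leaves more of the original skew in the sample means.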
The standard error measures how much a sample mean is expected to deviate from the population mean. It is the standard deviation of the sampling distribution of the mean, σ/√n.

Sample size (n) is the number of data points in a sample. The plot below shows the relationship between sample size and standard error: as the sample size increases, the standard error decreases.
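The relationship can be checked with a few lines of arithmetic. This sketch uses the usual formula for the standard error of the mean, σ/√n, with a hypothetical population standard deviation of 30,000 chosen only for illustration.

```python
import math


def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


sigma = 30_000  # hypothetical population standard deviation, in dollars

for n in (25, 100, 400, 1_600):
    print(f"n = {n:>5}: standard error = {standard_error(sigma, n):,.0f}")
```

Because n sits under a square root, quadrupling the sample size only halves the standard error, which is why ever larger samples give diminishing returns.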

A very large sample would keep the standard error small, but collecting one is not feasible in most complex real-world problems. Hence an optimal sample size is needed.
Confidence intervals represent the range of values within which we are fairly sure that our population mean lies. In the image below, the lower limit and the upper limit together mark the confidence interval. The area between the limits is called the acceptance region, while the area outside is called the rejection region.

The p-value is the probability of getting a result at least as extreme as the one we observed if the null hypothesis were true, that is, if only chance were at work. In other words, it is the probability of a result landing this far into the rejection region by chance alone. A lower p-value indicates stronger evidence against the null hypothesis.
The significance level (α) is the threshold p-value used to decide whether the test results are statistically significant. It is usually set to 0.05, 0.01, or 0.001. If the test result's p-value is less than the significance level (α), we conclude that the results are statistically significant, meaning they are unlikely to be due to random chance or noise alone.
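The three ideas above (confidence interval, p-value, and significance level) can be tied together in a short sketch. It assumes a synthetic sample and a large-sample normal approximation; the function names `mean_confidence_interval` and `one_sided_p_value`, and the hypothesised mean of 50, are illustrative choices rather than anything taken from the case study.

```python
import math
import random
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Large-sample (normal approximation) confidence interval for the mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return mean - z * se, mean + z * se


def one_sided_p_value(sample, mu0):
    """Probability of a sample mean at least this far above mu0 if H0 (mean = mu0) holds."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.fmean(sample) - mu0) / se
    return 1 - NormalDist().cdf(z)


rng = random.Random(0)
# Synthetic sample (an assumption): 200 draws from a normal with mean 52, sd 10.
sample = [rng.gauss(52, 10) for _ in range(200)]

alpha = 0.05  # significance level
low, high = mean_confidence_interval(sample)
p = one_sided_p_value(sample, mu0=50)

print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
print(f"p-value for H0: mean = 50 vs H1: mean > 50: {p:.4f}")
print("reject H0" if p < alpha else "fail to reject H0")
```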
Data Collection for the Case Study
For our analysis, we will use data collected by the General Social Survey (GSS), which has been surveying the general American public since 1972, mainly through face-to-face interviews. Below is the description from their website.
The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
The GSS sample is drawn using an area probability design that randomly selects respondents in households across the nation from a mix of urban, suburban, and rural geographic areas. Because random sampling was used, the data is representative of the US population as a whole.
Alright, so now that we have our data ready, let us dive into our case study and find answers.
Case Study: Are women paid less than men?
Motivation
This is a fundamental question in our society. In this case study, we will investigate whether this claim holds any water or is just another perception.
Data
For our analysis, we will be using the most recent survey data available (2018).
We will be using the following variables for our analysis:
- sex: This is the self-reported gender of the respondent who took the survey.
- conrinc: This is the inflation-adjusted annual income earned by the respondent.
- age: This is the age of the respondent.
- race: This is the self-reported race of the respondent.
- uscitzn: This field identifies whether the respondent is a) a U.S. citizen, b) not a U.S. citizen, c) a U.S. citizen born in Puerto Rico, the U.S. Virgin Islands, or the Northern Marianas Islands, d) born outside the U.S. to parents who were U.S. citizens at that time, or e) unsure ("don't know").
We do the following data cleansing to finalize our dataset:
- The maximum value of age is capped at 89, because the source has an open-ended category ('89 and above') that would otherwise prevent us from treating age as a number.
- We create an indicator _uscitzn_ind_ which takes 1 if the respondent is a U.S. citizen (whether born inside or outside the US), 0 if the respondent is not a citizen, and -1 if the source field has an invalid value.
- We noticed that some respondents reported their annual income as $0. They may be out of work, taking a break, or keeping house. We remove these data points from our analysis.
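The three cleansing steps above could be sketched as follows. This is not the original cleaning code: the records are made up, and the text labels for age and citizenship are illustrative stand-ins, not the actual GSS codes.

```python
# Hypothetical raw records; the labels are illustrative, not actual GSS codes.
raw = [
    {"sex": "male", "conrinc": 52_000.0, "age": "45", "uscitzn": "citizen"},
    {"sex": "female", "conrinc": 0.0, "age": "30", "uscitzn": "citizen"},
    {"sex": "female", "conrinc": 31_000.0, "age": "89 and above", "uscitzn": "not a citizen"},
    {"sex": "male", "conrinc": 18_500.0, "age": "62", "uscitzn": "don't know"},
]

# Assumed groupings of the source categories (hypothetical labels).
CITIZEN = {"citizen", "citizen born in a US territory", "born abroad to citizen parents"}
NON_CITIZEN = {"not a citizen"}


def clean_age(value):
    """Treat the open-ended top category as 89 so that age is numeric."""
    text = str(value)
    return 89 if text.startswith("89") else min(int(text), 89)


def citizen_indicator(value):
    """1 = U.S. citizen, 0 = not a citizen, -1 = invalid or unknown."""
    if value in CITIZEN:
        return 1
    if value in NON_CITIZEN:
        return 0
    return -1


cleaned = [
    {
        "sex": r["sex"],
        "conrinc": r["conrinc"],
        "age": clean_age(r["age"]),
        "uscitzn_ind": citizen_indicator(r["uscitzn"]),
    }
    for r in raw
    if r["conrinc"] > 0  # drop respondents reporting $0 income
]

for row in cleaned:
    print(row)
```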
Exploratory Data Analysis
Let us look at the shape of our final dataset:
- We have a total of 1363 data points.
- We have 646 males and 717 females.

Let us divide the data into two groups: Group A with males and Group B with females. Here is the distribution of our initial population.

We can make the following observations from the initial population distributions:
- Both groups have right-skewed distributions and hence are not normally distributed.
- Men have a population mean of 44K and women have a population mean of 29K.
Inference
We start by defining our hypothesis:
- H0 (Null hypothesis): The difference between the mean incomes of men and women is zero.
- H1 (Alternate hypothesis): The difference between the mean incomes of men and women is greater than zero.
First, we need to come up with an ideal sample size for each group. Using the following code we calculate the sample size for both the groups.
For a 95% confidence level and a 5% margin of error, the ideal sample size is 242 for Group A (males) and 251 for Group B (females). This is the minimum sample size needed for the desired confidence and error specification.
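One common way to arrive at such numbers is Cochran's sample size formula with a finite population correction. Whether the original code used exactly this formula is an assumption, but with the group sizes of 646 and 717 it does reproduce the 242 and 251 quoted above. The function name `required_sample_size` and the conservative proportion p = 0.5 are illustrative choices.

```python
import math
from statistics import NormalDist


def required_sample_size(population_size, confidence=0.95, margin=0.05, p=0.5):
    """Cochran's sample size with a finite population correction.

    p = 0.5 is the most conservative assumed proportion (largest sample size).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    n0 = z**2 * p * (1 - p) / margin**2        # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population_size)  # finite population correction
    return math.ceil(n)


print(required_sample_size(646))  # Group A (males)   -> 242
print(required_sample_size(717))  # Group B (females) -> 251
```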

Using the following code we will create sampling distributions for both the groups:
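A minimal sketch of this step is shown below. It is not the original code: the two groups are synthetic right-skewed incomes standing in for the real `conrinc` values, and the number of repeated samples (1,000) is an arbitrary assumption. Each sampling distribution is simply the collection of means from many random samples of the chosen size.

```python
import random
import statistics


def sampling_distribution(values, sample_size, n_samples, rng):
    """Means of n_samples random samples (without replacement) of sample_size."""
    return [
        statistics.fmean(rng.sample(values, sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(2018)  # fixed seed so the run is repeatable

# Synthetic right-skewed incomes (an assumption); the real analysis would use
# the conrinc values of each group instead.
group_a = [rng.lognormvariate(10.4, 0.8) for _ in range(646)]  # males
group_b = [rng.lognormvariate(10.0, 0.8) for _ in range(717)]  # females

dist_a = sampling_distribution(group_a, sample_size=242, n_samples=1_000, rng=rng)
dist_b = sampling_distribution(group_b, sample_size=251, n_samples=1_000, rng=rng)

diff = statistics.fmean(dist_a) - statistics.fmean(dist_b)
print(f"mean of sample means, Group A: {statistics.fmean(dist_a):,.0f}")
print(f"mean of sample means, Group B: {statistics.fmean(dist_b):,.0f}")
print(f"difference of means:           {diff:,.0f}")
```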
And here is the sampling distribution structure for both the groups:

These now look very close to normal distributions! This is the Central Limit Theorem at work.
Regardless of the initial shape of the population distribution, the sampling distribution of the mean will approximate a normal distribution. As the sample size increases, the sampling distribution gets narrower and closer to normal.
From the sampling distributions, we observe that the mean income of men is greater than the mean income of women. Also, the difference of their means is 14706.00. Does that mean we proved the claim? Not quite yet, because we cannot yet rule out that the result obtained from the sample arose by chance.
Hence, to be sure about our claim, we perform a one-sided t-test for the means of two independent samples. The t-test requires the following conditions to be checked:
- Independence condition: GSS data is based on random sampling, so the two sample groups are independent of each other.
- Sample size condition: Each sample should contain at least 30 data points for its sampling distribution to be approximately normal. Both of our groups have sample sizes well above 30. Additionally, we confirmed that both sampling distributions are approximately normal.
Using the code below, we calculate the t-statistic, degrees of freedom, critical value, and p-value.
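The sketch below shows one way to compute these four quantities; it is an illustration under stated assumptions, not the original code. It uses Welch's version of the two-sample t-test (which does not assume equal variances) on synthetic data, and it approximates the t distribution with a normal distribution for the critical value and p-value, which is reasonable only because the degrees of freedom are in the hundreds.

```python
import math
import random
import statistics
from statistics import NormalDist


def welch_t_test(sample_a, sample_b, alpha=0.05):
    """One-sided Welch test of H1: mean(sample_a) > mean(sample_b)."""
    n1, n2 = len(sample_a), len(sample_b)
    v1 = statistics.variance(sample_a) / n1  # squared standard error, group A
    v2 = statistics.variance(sample_b) / n2  # squared standard error, group B

    diff = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    t_stat = diff / math.sqrt(v1 + v2)

    # Welch-Satterthwaite degrees of freedom
    dof = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

    # Normal approximation to the t distribution (assumes large dof).
    critical = NormalDist().inv_cdf(1 - alpha)
    p_value = 1 - NormalDist().cdf(t_stat)
    return t_stat, dof, critical, p_value


rng = random.Random(7)
# Synthetic incomes (an assumption), using the sample sizes from the text.
men = [rng.lognormvariate(10.4, 0.8) for _ in range(242)]
women = [rng.lognormvariate(10.0, 0.8) for _ in range(251)]

t_stat, dof, critical, p_value = welch_t_test(men, women)
print(f"t-statistic:        {t_stat:.2f}")
print(f"degrees of freedom: {dof:.1f}")
print(f"critical value:     {critical:.3f}")
print(f"p-value:            {p_value:.4g}")
```

For exact t-distribution values, SciPy's `scipy.stats.ttest_ind(men, women, equal_var=False, alternative="greater")` performs the same one-sided Welch test.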

We can make the following interpretations from the t-test result:
- A large t-score indicates a greater difference between the means of the two groups relative to the variation within them.
- Every t-score comes with a p-value, which is the probability of seeing a result at least this extreme if the null hypothesis were true. Since the p-value is far below our chosen significance level α=0.05, the test result is statistically significant and we can reject the null hypothesis that the means of the two groups are equal, in favor of the alternate hypothesis.
There is statistically significant evidence that women are paid less than men. At significance level 0.05, the test results are unlikely to be due to random chance.
Confidence Intervals
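Additionally, we can calculate the confidence interval for the difference of means using the following formula:

(x̄1 − x̄2) ± t* × √(s1²/n1 + s2²/n2)

Here, x̄1 and x̄2 are the sample means, s1 and s2 are the sample standard deviations, n1 and n2 are the sample sizes of the two groups, and t* is the critical value for the chosen confidence level.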
Using the code below, we calculate the confidence interval for a 95% confidence level.
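As a hedged illustration of that calculation (not the original code), the sketch below computes the interval as the difference of sample means plus or minus a critical value times √(s1²/n1 + s2²/n2), where s1 and s2 are the sample standard deviations. It assumes synthetic data and uses the normal critical value (about 1.96) as a large-sample stand-in for the t critical value.

```python
import math
import random
import statistics
from statistics import NormalDist


def diff_means_confidence_interval(sample_a, sample_b, confidence=0.95):
    """Large-sample confidence interval for mean(sample_a) - mean(sample_b)."""
    n1, n2 = len(sample_a), len(sample_b)
    s1, s2 = statistics.stdev(sample_a), statistics.stdev(sample_b)
    diff = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # normal approximation
    return diff - z * se, diff + z * se


rng = random.Random(11)
# Synthetic incomes (an assumption), using the sample sizes from the text.
men = [rng.lognormvariate(10.4, 0.8) for _ in range(242)]
women = [rng.lognormvariate(10.0, 0.8) for _ in range(251)]

low, high = diff_means_confidence_interval(men, women)
print(f"95% CI for the difference of means: ({low:,.0f}, {high:,.0f})")
```

If the interval excludes zero, that agrees with rejecting the null hypothesis of equal means at the matching significance level.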

We observe that the difference of means between the two groups falls well within this confidence interval.
More Interesting Insights
By repeating the same analysis for various cross-sectional groups, we can make the following observations:
White men and women: In general, white women's mean income is about 65% that of white men. A t-score of 139 also suggests a huge difference.

Black men and women: The gender pay gap is not very significant in the black population. In general, black women earn as much as 95% of what black men earn.

Immigrant men and women: The difference is worse for the immigrant population. In general, an immigrant woman earns only 48% of what an immigrant man earns.

Millennial men and women: A millennial woman earns about 69% of what a millennial man earns.

Middle-aged men and women: The difference is not too bad for the middle-aged population. Middle-aged women earn as much as 72% of what middle-aged men earn.

Old-aged men and women: The difference again becomes worse for old-aged adults, the retirement population. An old-aged woman earns only 58% of what an old-aged man earns.

Conclusion
Based on our findings, there is statistically significant evidence that women are paid less than men.
All the code and data used for the analysis can be downloaded from this GitHub repository.