
Statistics is everywhere, and so are the terms associated with it. It plays a vital role in every field of human activity. To analyze what is happening around us, we need statistics. However, the sheer number of terms can be daunting, especially in texts on data science and machine learning. ✍️Whenever we read such a text, we almost always come across at least four or five statistical terms or phrases. 😫
One such phrase is independent and identically distributed random variables. It is so common that we see it everywhere and end up looking it up again and again. 💻
In this article, I will try to explain this phrase in an easy way.
Note: i.i.d. is the abbreviated form of independent and identically distributed.
The most basic example in statistics is the flipping of a coin. 😁 So, I will also use this object to explain the idea behind independent and identically distributed variables.

Suppose I flip a fair coin 100 times and get heads 53 times and tails 47 times. Now I want to flip my coin again for the 101st time. What will I get? Well, I can get either a head or a tail, each with a probability of 0.5 (since it is a fair coin). Crucially, the probability of getting a head or a tail does not depend on any of the previous outcomes, that is, on the 53 heads and 47 tails. I would face exactly the same odds even if I had kept no record of my previous outcomes. So, I can say that past behavior does not impact future behavior. Here, the outcomes we get from flipping the coin are independent and identically distributed: independent because one outcome does not depend on any other outcome, and identically distributed because every flip comes from the same distribution (the distribution does not change from one flip to the next).
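The coin-flip argument above can be checked with a small simulation. This is only a sketch: the function name `conditional_head_rates`, the number of flips, and the seed are illustrative choices, not part of the original example. It estimates the overall frequency of heads and compares it with the frequency of heads right after a head and right after a tail.

```python
import random


def conditional_head_rates(n_flips=100_000, p_head=0.5, seed=0):
    """Estimate P(head), P(head | previous head) and P(head | previous tail)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    flips = [rng.random() < p_head for _ in range(n_flips)]  # True means head

    overall = sum(flips) / n_flips

    # Pair every flip with the one before it and split by the previous outcome.
    pairs = list(zip(flips, flips[1:]))
    after_head = [cur for prev, cur in pairs if prev]
    after_tail = [cur for prev, cur in pairs if not prev]

    return (
        overall,
        sum(after_head) / len(after_head),
        sum(after_tail) / len(after_tail),
    )


overall, after_head, after_tail = conditional_head_rates()
print(f"P(head)                 ~ {overall:.3f}")
print(f"P(head | previous head) ~ {after_head:.3f}")
print(f"P(head | previous tail) ~ {after_tail:.3f}")
```

If the flips are independent, all three estimates should land close to 0.5, because knowing the previous outcome tells us nothing about the next one.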
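Note: Identically distributed does not mean equiprobable. The outcomes of each random variable do not have to be equally likely (0.5 each for two outcomes, or 0.25 each for four outcomes) for the variables to be i.i.d. For example, a biased coin that lands heads with probability 0.7 on every flip still produces i.i.d. flips, because each flip follows the same distribution and is independent of the others.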
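To illustrate the note, here is a minimal sketch with a biased coin. The bias of 0.7, the helper name `biased_pair_stats`, and the trial count are assumptions made purely for illustration. The coin is flipped twice per trial, and the code estimates the head probability of each flip and of both flips together.

```python
import random


def biased_pair_stats(p_head=0.7, n_trials=100_000, seed=1):
    """Flip a biased coin twice per trial; return P(X1=H), P(X2=H), P(both H)."""
    rng = random.Random(seed)
    first = second = both = 0
    for _ in range(n_trials):
        x1 = rng.random() < p_head  # first flip, True means head
        x2 = rng.random() < p_head  # second flip, same bias
        first += x1
        second += x2
        both += x1 and x2
    return first / n_trials, second / n_trials, both / n_trials


p1, p2, p12 = biased_pair_stats()
print(f"P(X1 = H)         ~ {p1:.3f}")
print(f"P(X2 = H)         ~ {p2:.3f}")
print(f"P(X1 = H, X2 = H) ~ {p12:.3f}")
print(f"P(X1 = H) * P(X2 = H) ~ {p1 * p2:.3f}")
```

Both flips should show roughly the same head probability (identically distributed), and the joint probability should be close to the product of the two (independent), even though heads and tails are not equally likely.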
Thus, we say that random variables _X_₁, _X_₂, …, _X_ₙ are independent and identically distributed if all _X_ᵢ are mutually independent and they all have (or belong to) the same distribution.
Let us understand it using one more example:
Take an urn and put n balls in it such that each ball has a different number written on it. Now draw m (where m < n) balls from the urn with replacement, and let _X_ᵢ be the number written on the _i_th ball drawn from the urn. Because each ball is put back before the next draw, every draw is equally likely to produce any of the n balls regardless of the earlier draws, and all the balls are drawn from the same urn. So we can say that _X_₁, _X_₂, …, _X_ₘ are i.i.d.
Kindly note that drawing the balls without replacement does not give i.i.d. variables: the draws are not independent, even though they all have the same distribution. The probabilities for each draw depend on which ball or balls were drawn previously.
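The difference between the two urn schemes can be sketched in code. The urn size of 5, the choice of ball number 1 as the ball to track, and the function name `second_draw_stats` are hypothetical choices for this illustration. For two draws, the code estimates the probability that the second ball is number 1, both overall and given that the first ball was number 1.

```python
import random


def second_draw_stats(replace, n_balls=5, n_trials=100_000, seed=2):
    """Return (P(X2 = 1), P(X2 = 1 | X1 = 1)) estimated from two draws per trial."""
    rng = random.Random(seed)
    balls = list(range(1, n_balls + 1))  # balls numbered 1..n_balls
    first_is_1 = second_is_1 = both_1 = 0
    for _ in range(n_trials):
        if replace:
            x1, x2 = rng.choices(balls, k=2)  # the ball is put back between draws
        else:
            x1, x2 = rng.sample(balls, 2)     # the ball is not put back
        first_is_1 += x1 == 1
        second_is_1 += x2 == 1
        both_1 += x1 == 1 and x2 == 1
    return second_is_1 / n_trials, both_1 / first_is_1


with_marginal, with_conditional = second_draw_stats(replace=True)
without_marginal, without_conditional = second_draw_stats(replace=False)
print("with replacement:   ", with_marginal, with_conditional)
print("without replacement:", without_marginal, without_conditional)
```

In both schemes the second draw on its own is ball 1 about one time in five, so the draws share the same distribution. Only with replacement does that probability stay the same after learning the first draw; without replacement it drops to zero, which is exactly the loss of independence.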
An example of random variables that are not i.i.d.:
Let us assume that we have a deck of cards, we draw a card without putting it back, and it is the ace of diamonds. Now, when we draw another card from the deck, we know that it cannot be the ace of diamonds. The outcome of the second draw depends on the first, so the random variables are not mutually independent and thus not i.i.d.
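Because a deck is small, the card example can be verified exactly rather than by simulation. The sketch below assumes a standard 52-card deck and enumerates every ordered pair of distinct cards; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations, product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) tuples
target = ("A", "diamonds")

# Every equally likely outcome of drawing two cards without replacement.
pairs = list(permutations(deck, 2))

# Unconditional probability that the second card is the ace of diamonds.
p_second = Fraction(sum(second == target for _, second in pairs), len(pairs))

# The same probability, given that the first card was the ace of diamonds.
after_target = [second for first, second in pairs if first == target]
p_second_given_target = Fraction(
    sum(second == target for second in after_target), len(after_target)
)

# The same probability, given that the first card was any other card.
after_other = [second for first, second in pairs if first != target]
p_second_given_other = Fraction(
    sum(second == target for second in after_other), len(after_other)
)

print(p_second, p_second_given_target, p_second_given_other)
```

The three probabilities differ (1/52 overall, 0 after drawing the ace of diamonds, 1/51 after drawing any other card), so the second draw is not independent of the first.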
The formal definition of independent and identically distributed random variables is therefore as follows:
A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.
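One common way to write this definition symbolically is through the joint cumulative distribution function. The notation is an addition for illustration and does not come from the text above: F denotes the distribution function shared by all the variables, and n is the number of variables.

```latex
F_{X_1, \dots, X_n}(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} F(x_i)
\qquad \text{for all } x_1, \dots, x_n \in \mathbb{R}
```

The product form expresses mutual independence, and the fact that the same F appears in every factor expresses identical distribution.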
Having independent and identically distributed data is a common assumption in machine learning, statistical procedures, and hypothesis testing. This assumption can be useful in data analysis tasks even when the data is not strictly i.i.d.
The i.i.d. assumption often arises in the context of sequences of random variables, where it states that each random variable in the sequence is independent of the random variables that came before it. This is what distinguishes an i.i.d. sequence from a Markov sequence, in which the probability distribution of a random variable is a function of, or dependent on, the previous random variable in the sequence.
The i.i.d. assumption is also important in the central limit theorem, which is itself a very important concept. In its classical form, the theorem says that the suitably normalized sum of many i.i.d. random variables with finite variance is approximately normally distributed.
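The contrast with a Markov sequence can be made concrete with a small simulation. Everything specific here is an assumption chosen for illustration: a two-state chain that stays in its current state with probability 0.9 (`p_stay`), a fair-coin i.i.d. sequence for comparison, and the helper names.

```python
import random


def iid_sequence(n, rng):
    """n independent fair bits."""
    return [rng.random() < 0.5 for _ in range(n)]


def markov_sequence(n, rng, p_stay=0.9):
    """Two-state chain: each value repeats the previous one with probability p_stay."""
    state = rng.random() < 0.5
    seq = [state]
    for _ in range(n - 1):
        if rng.random() >= p_stay:  # switch state with probability 1 - p_stay
            state = not state
        seq.append(state)
    return seq


def p_one_given_prev_one(seq):
    """Estimate P(X_t = 1 | X_{t-1} = 1) from a single long sequence."""
    following = [cur for prev, cur in zip(seq, seq[1:]) if prev]
    return sum(following) / len(following)


rng = random.Random(3)
iid_rate = p_one_given_prev_one(iid_sequence(100_000, rng))
markov_rate = p_one_given_prev_one(markov_sequence(100_000, rng))
print(f"i.i.d.: P(1 | previous 1) ~ {iid_rate:.3f}")
print(f"Markov: P(1 | previous 1) ~ {markov_rate:.3f}")
```

For the i.i.d. sequence the previous value carries no information, so the estimate stays near 0.5. For the Markov sequence it sits near 0.9, because each value depends on the one before it.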
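As a rough illustration of the role i.i.d. variables play in the central limit theorem, the sketch below averages i.i.d. Uniform(0, 1) draws and standardizes the averages. The choice of the uniform distribution, the sample size of 50, the number of repetitions, and the function name are all assumptions made for the example.

```python
import math
import random


def standardized_means(n=50, n_reps=20_000, seed=4):
    """Standardized sample means of n i.i.d. Uniform(0, 1) draws."""
    rng = random.Random(seed)
    mu = 0.5                    # mean of Uniform(0, 1)
    sigma = math.sqrt(1 / 12)   # standard deviation of Uniform(0, 1)
    scale = sigma / math.sqrt(n)
    result = []
    for _ in range(n_reps):
        mean = sum(rng.random() for _ in range(n)) / n
        result.append((mean - mu) / scale)
    return result


z = standardized_means()
# For a standard normal variable, about 95% of the mass lies within +/- 1.96.
coverage = sum(abs(value) <= 1.96 for value in z) / len(z)
print(f"share of standardized means within 1.96: {coverage:.3f}")
```

Although a single uniform draw looks nothing like a bell curve, the share of standardized means falling within 1.96 of zero should come out close to the 0.95 expected for a normal distribution.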
References:
- http://www.utstat.utoronto.ca/~radford/sta247.F11/lec7.html
- https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0208.pdf