Probability and Statistics / Pareto Distribution

Generating Pareto Distribution in Python

Let’s understand Pareto distribution better

Bipin P.
Towards Data Science
6 min readMar 18, 2020

--

Photo by ©iambipin

1. Pareto Distribution

Pareto distribution is a power-law probability distribution named after Italian civil engineer, economist, and sociologist Vilfredo Pareto, that is used to describe social, scientific, geophysical, actuarial and various other types of observable phenomenon. Pareto distribution is sometimes known as the Pareto Principle or ‘80–20’ rule, as the rule states that 80% of society’s wealth is held by 20% of its population. Pareto distribution is not a law of nature, but an observation. It is useful in many real-world problems. It is a skewed heavily tailed distribution.

After reading the definition, you must be wondering what is power law? The power law is a functional relationship between two quantities such that a change in one quantity triggers a proportional change in the other quantity irrespective of the initial size of two quantities.

Photo by ©iambipin

The 80–20 rule holds true in many cases. For instance, Vilfredo Pareto found that 80% of Italy’s land is owned by 20% of the population. He also found that 80% of peas procured from his garden came from 20% of its pea plants. 82.7% of the world’s income is controlled by 20% of the population. A 2002 report of Microsoft suggests that 80% of errors and crashes of Windows and MS Office are caused by 20% of all the bugs detected. 80% of sales coming from 20% of the products. 80% of customers only use 20% of software features. This 80–20 distribution occurs quite frequently.

2. Generating Pareto distribution in Python

Pareto distribution can be replicated in Python using either Scipy.stats module or using NumPy. Scipy.stats module encompasses various probability distributions and an ever-growing library of statistical functions. Scipy is a Python library used for scientific computing and technical computing. NumPy is a Python library used for scientific computing that apart from its scientific uses can be used as a multi-dimensional container for generic data.

2.1 Using Scipy.stats

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pareto
x_m = 1 #scale
alpha = [1, 2, 3] #list of values of shape parameters
samples = np.linspace(start=0, stop=5, num=1000)
for a in alpha:
output = np.array([pareto.pdf(x=samples, b=a, loc=0, scale=x_m)])
plt.plot(samples, output.T, label='alpha {0}' .format(a))
plt.xlabel('samples', fontsize=15)
plt.ylabel('PDF', fontsize=15)
plt.title('Probability Density function', fontsize=15)
plt.grid(b=True, color='grey', alpha=0.3, linestyle='-.', linewidth=2)
plt.rcParams["figure.figsize"] = [5, 5]
plt.legend(loc='best')
plt.show()
Photo by ©iambipin

x_m and alpha are the two parameters of Pareto distribution. x_m is the scale parameter and represents the smallest value that Pareto distributed random variable can take. alpha is the shape parameter and is equal to n/SUM{ln(x_i/x_m)}.

np.linspace() returns evenly spaced samples(number of samples equal to num) over a specific interval[start, stop]. In the code above, linspace() method returns 1000 evenly spaced samples over a range[0, 5].

The list of shape values -alpha is iterated to plot lines for each value. np.array() creates an array. The scipy.stats.pareto() method returns a Pareto continuous random variable. pareto.pdf() creates a probability density function(PDF). The parameters x, b, loc, and scale are array-like quantiles, array-like shape parameters, array-like optional location parameter(default=0), and array-like optional scale parameter(default=1) respectively.

plt.plot() plots the evenly spaced samples and the array of PDF values. The graph is plotted for each value of alpha. Here, output.T is transposing the output. The output is an array of Pareto distribution values with 3 rows, one for each shape parameter. On transposing, the output is converted to an array of 1000 rows.

The remaining lines of code are almost self-explanatory. plt.xlabel() and plt.ylabel() are used for labelling x-axis and y-axis. plt.title() assigns title to the graph. plt.grid() configures the grid line.

plt.rcParams['figure.figsize'] = [width, height]

The plt.rcParams[] set the current rc params. Matplotlib uses matplotlibrc configuration files to customize all kinds of properties, which are called ‘rc settings’ or ‘rc parameters’. Defaults of almost every property in Matplotlib can be controlled: figure size and DPI, line width, color and style, axes, axis and grid properties, text and font properties and so on.

plt.legend() shows the legend and plt.show() displays all figures. For further reading on ways to make your plot more meaningful, kindly visit here.

2.2 Using Numpy

import numpy as np
import matplotlib.pyplot as plt
x_m, alpha = 1, 3.
#drawing samples from distribution
samples = (np.random.pareto(alpha, 1000) + 1) * x_m
count, bins, _ = plt.hist(samples, 100, normed=True)
fit = alpha*x_m**alpha / bins**(alpha+1)
plt.plot(bins, max(count)*fit/max(fit), linewidth=2, color='r')
plt.xlabel('bins', fontsize=15)
plt.ylabel('probability density', fontsize=15)
plt.title('Probability Density Function', fontsize=15)
plt.grid(b=True, color='grey', alpha=0.3, linestyle='-.', linewidth=2)
plt.rcParams['figure.figsize'] = [8, 8]
plt.show()
Photo by ©iambipin

np.random.pareto() draws random samples from a Pareto II or Lomax distribution with a specified shape. Pareto II distribution is a shifted Pareto distribution. By adding 1 and multiplying by the scale parameter x_m, classical Pareto distribution can be obtained from Lomax distribution.

samples = (np.random.pareto(alpha, 1000) + 1) * x_m

The smallest value of the Pareto II distribution is zero while for the classical Pareto distribution is mu, where the standard Pareto distribution has location mu=1.

plt.hist() plots a histogram. When the parameter density or normed is set to True, the returned tuple will have the first element as count normalized to form probability density. Hence area under the histogram will be 1. This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations. Hence y-axis will have the density of samples.

The ‘_’ in the count, bin, _ conveys that the last value of the returned tuple is not important(plt.hist() would return a tuple with three values).

We will plot the curve on the binned data,

We will fit a Pareto distribution to our randomly sampled data and plot this distribution on top of our data, by computing the probability density of the Pareto distribution at the x-values defined by bins with parameters x_m and alpha.

fit = alpha*x_m**alpha / bins**(alpha+1)

3. Verifying Pareto distribution

Q-Q plot(Quantile-Quantile plot) is used to determine whether the continuous random variables follow Pareto distribution.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
x_m = 10
alpha = 15
size = 100000 #the size of the sample(no. of random samples)
samples = (np.random.pareto(alpha, size) + 1) * x_m
stats.probplot(samples, dist='pareto', sparams=(15, 10), plot=pylab)
plt.show()
Photo by ©iambipin

stats.probplot generates a probability plot of the random sample drawn from the distribution(sample data) against the quantiles of a specified theoretical distribution(Pareto distribution). As the vast majority of blue dots(sample data) are almost aligned with the red line(theoretical distribution), we can conclude that the distribution follows Pareto distribution.

Before concluding, it’s imperative to know the real-world applications of Pareto distribution.

4. Applications of Pareto Distribution

  • The sizes of human settlements(fewer cities and more villages).
  • Sizes of sand particles.
  • The file size of data being distributed over the internet that follows the TCP protocol.
  • The values of oil reserves in oil fields(a few large fields and many small fields)
  • Male dating success in Tinder where 80% of females compete for 20% of most attractive males.
  • Time spend by a user playing various games in Steam(few games are played a lot and more often than the majority of games which are seldom played).

Pareto distribution and its concepts are pretty simple yet powerful. It invariably helps to garner vital clues to understand a wide range of human behavior, scientific and social phenomena. I hope you got a better understanding of Pareto distribution and how to draw samples from it and plot using Pyplot, Numpy, Scipy, and Python. Happy Coding!!!

References

https://en.wikipedia.org/wiki/Pareto_distribution

https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.random.pareto.html

https://stackoverflow.com/questions/19525899/how-to-generate-random-numbers-in-specyfic-range-using-pareto-distribution-in-py

--

--