Intro to Probability Distributions and Distribution Fitting with Python’s SciPy

A tutorial by example on:
- SciPy’s probability distributions, their properties and methods
- an example that models the lifetime of components by fitting a Weibull extreme value distribution
- an automated fitter procedure that selects the best among ~60 candidate distributions
A probability distribution describes phenomena that are influenced by random processes:
- naturally occurring random processes;
- or uncertainties caused by incomplete knowledge.
The outcome of a random process is called a random variable, X. The distribution function maps probabilities to the occurrences of X.
SciPy counts 104 continuous and 19 discrete distributions that can be instantiated via its stats.rv_continuous and stats.rv_discrete classes. Discrete distributions deal with countable outcomes such as customers arriving at a counter. Continuous distributions compute the probability of occurrences between two outcomes or points on the x-axis, such as variations in height, temperature, or time.
Distributions related to engineering and technology – which attempt to model, for instance, the lifetime or time to failure of equipment – as well as to biology and pharmaceutics, have blossomed in recent years, driven by the fast-increasing availability of sensor data and other large sources of quantifiable information. SciPy comprises several variants of the Weibull and Extreme Value distributions, as well as the Lognormal and the Fatigue Life or Birnbaum-Saunders distributions, which are applicable to tasks in reliability engineering. In the literature, one can find more recent amalgamations like the Weibull-Birnbaum-Saunders or Kummer-beta-Birnbaum-Saunders distributions and many more, for fields like production, supply chain, architecture, biology, hydrology, and engineering. Thus, probability distributions are not a tool for stock and options traders alone. We should look beyond the uniform and normal distributions we encounter most often. Many of our data science methods rely on normally distributed data or residuals. To model real-world random processes, though, we need to be prepared to identify and evaluate alternative random models.
123 distributions are available in SciPy:

I’ve imported 60+ of SciPy’s continuous distributions to the fitter script I’m going to describe below. Admittedly, I got on kind of a roll once I saw that the fitter could cope, relatively fast, with the initial 10 distributions I had imported. I concluded that I could append 50 more in a matter of minutes and see if the fitting process would stumble over one or the other of the more exotically parametrized distributions. So far, it has worked without hiccups.
0. Dependencies
We begin by importing – besides our all-weather core libraries such as pandas and numpy – the stats class of SciPy.
Two constants should be added: the number of samples which the Kolmogorov-Smirnov test for goodness of fit will draw from a chosen distribution; and a significance level of 0.05.
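A minimal sketch of these imports and constants could look like this (the concrete value of N_SAMPLES is my choice, not a requirement; I also import matplotlib for the charts that follow):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

N_SAMPLES = 100   # samples the KS test will draw from a fitted candidate
ALPHA = 0.05      # significance level for the goodness-of-fit test
```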
Next, we compose a list of about 60 SciPy distributions we want to instantiate for the fitter and import them.
Python does not accept a list object like distributions in its import statements, therefore the same distribution names must be listed twice. A module like ast could theoretically read and then re-insert the imports automatically, but the loops would require more code lines than copying the imported distribution names down to the list distributions.
The list distributions contains the selection we want to pass as our chosen candidate distributions to the fitter procedure. Of course, you can trim down the list to fewer candidates among which you want to search for the best fit.
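An excerpt of what this looks like in practice – the full script lists 60+ names in both places, here trimmed to a handful for brevity:

```python
# the same names appear twice: once in the import, once in the list
from scipy.stats import (
    beta, expon, gamma, lognorm, norm, weibull_min, weibull_max)

distributions = [
    beta, expon, gamma, lognorm, norm, weibull_min, weibull_max]
```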
1. SciPy’s Distributions and their Properties
Before we delve into the construction of the fitter, let’s go on a quick sightseeing tour around SciPy’s distribution objects to understand how they work and interact.
1.1 Select and Instantiate a Distribution
We choose the Beta distribution as the first example and parametrize it by setting its two shape parameters a(lpha) and b(eta) to 2 and 6.
The Beta distribution has an extremely flexible shape, much more versatile than the normal distribution. Its default support or domain is the interval [0;1] for its random variates x. Below, we will see how the support can be extended to much wider intervals by adding location and scale parameters to the two shape parameters. Our choice of shape parameters will make it look similar to the bell curve of a normal distribution, but not quite aligned with it – it will be somewhat positive-skewed, with a drawn-out tail on the right.
We generate 1,000 random variates x that follow the Beta(2,6) distribution, by applying the rvs() function, and then plot them in a histogram.
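A sketch of this step (the random_state is my addition, for reproducibility):

```python
# draw 1,000 random variates from Beta(2, 6) and plot their histogram
a, b = 2, 6
x = beta.rvs(a, b, size=1000, random_state=42)

plt.hist(x, bins=50, density=True)
plt.title("1,000 random variates of Beta(2, 6)")
plt.show()
```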

SciPy’s beta is an object of the distribution-generator type beta_gen, inherited from its class of continuous distributions, rv_continuous.

1.2 Properties of a Distribution
Let’s retrieve more properties of Beta(2,6): the moments and the shape parameters. We compute them separately, then collect them in a dictionary stats, where we pair them with descriptive names, and finally we use a list comprehension to pretty-print the dictionary row by row.
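In code, roughly (I name the dictionary stats_dict to avoid shadowing scipy.stats):

```python
# compute the moments and shape parameters separately,
# then collect them under descriptive names
stats_dict = {
    "mean":     beta.mean(a, b),
    "variance": beta.var(a, b),
    "skewness": beta.stats(a, b, moments="s"),
    "kurtosis": beta.stats(a, b, moments="k"),
    "shape a":  a,
    "shape b":  b}

# pretty-print the dictionary row by row
_ = [print(f"{k:10s} {float(v):8.3f}") for k, v in stats_dict.items()]
```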

To obtain four core statistics all at once, i.e. the moments up to order 4, we use the stats method with its keyword moments = ‘mvsk’: mean, variance, skewness, and kurtosis.
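For example:

```python
# mean, variance, skewness, and excess kurtosis in a single call
m, v, s, k = beta.stats(a, b, moments="mvsk")
print(f"mean={float(m):.3f}  var={float(v):.3f}  "
      f"skew={float(s):.3f}  kurt={float(k):.3f}")
```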

We observe a skewness different from the yardstick of 0 we would get from a standard normal distribution. The skewness measures the asymmetry of the distribution about its mean. Beta(2,6) is moderately positive-skewed, right-tailed, or right-skewed. Its curve appears left-leaning, but the skewness refers to the asymmetry of its tails. The tails do not balance each other out; instead, the bulk or mass of the distribution is concentrated to the left of the mean while the right tail is drawn out.

The kurtosis is a measure of the "tailedness" of a distribution (not its "peakedness", contrary to interpretations offered by various sources). Skewness measures an imbalance between tails. A distribution with high kurtosis, by contrast, has a propensity to produce more outliers in either tail; it is tail-weighted relative to the center, therefore its center appears skinnier. The tails extend farther out than a normal distribution’s three standard deviations from the mean, which comprise 99.7% of its mass. In finance or engineering, a high kurtosis indicates higher risks of experiencing very small or very large observational values.
A normal distribution, acting as the yardstick, has a kurtosis of 3.0. But SciPy uses the excess kurtosis and calibrates the normal distribution’s metric to 0. The excess kurtosis measures how heavily the tails differ from those of a normal distribution.
Beta(2,6) has a small positive excess kurtosis of 0.11. A value this close to zero indicates that it is mesokurtic – it has only slightly heavier tails than a normal distribution, thus its extreme value patterns will be similar. A more distinctive positive kurtosis makes a distribution leptokurtic, meaning "skinny"; its outliers along the x-axis lie farther away from the center, which therefore appears less bulky. A platykurtic ("broad") distribution, by contrast, with its negative kurtosis, has short tails; it exhibits a reduced propensity for extreme values compared with the normal distribution, and its behavior is easier to predict.
1.3 Freeze the Parametrization
We have carried the parametrization of Beta(2,6) around with us in all previous code lines, by referring to beta(a,b) time and again. If we don’t need to modify the parameters in our future lines of code, we can assign the distribution object with its current parameters to the variable rv. SciPy calls this the frozen random variate object.
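The freezing itself is a one-liner:

```python
# freeze Beta(2, 6): rv carries its parameters from here on
rv = beta(a, b)
print(rv.mean(), rv.var())
```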

1.4 Getting Probabilities and Quantiles
We introduce two more functions: the probability density function pdf and the percent point function ppf.

To plot the probability density function pdf, we calibrate the x-axis by having linspace() draw a line of coordinate points. As endpoints, we choose the 1% and 99% quantiles: these are the values of x that the Beta(2,6) distribution will not exceed with 1% and 99% probability, respectively. The percent point function ppf() provides the x-values at these probabilities. Minimum and maximum values will not work for many distributions with a domain that is unbounded: their x-axis can, in theory, extend to positive and/or negative infinity, as in the case of the normal distribution; in practice, the script would report an error for axis endpoints that are set at infinity.
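A sketch of the plot:

```python
# calibrate the x-axis between the 1% and 99% quantiles, then plot the pdf
x = np.linspace(rv.ppf(0.01), rv.ppf(0.99), 200)
plt.plot(x, rv.pdf(x))
plt.title("pdf of Beta(2, 6)")
plt.show()
```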
1.5 The Domain or Support of a Distribution
What is the domain of the Beta distribution? Let’s ask SciPy, by calling its support() method.
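The call is a one-liner:

```python
# the support of the frozen Beta(2, 6) distribution
print(rv.support())   # (0.0, 1.0)
```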

The standard Beta distribution extends just from 0 to 1 along the x-axis. In this, it is not unlike the standard normal distribution, which is centered around a mean of 0 with a standard deviation of 1. Beyond 3 deviations to the left and right of zero, the standard normal distribution carries only minuscule amounts of its mass – though its tails theoretically do extend to infinity, with smaller and smaller probabilities that those values will ever be observed.

The Gamma(1,1) distribution, by contrast, stretches out along the x-axis all the way to positive infinity.
The standard Beta distribution is too limited for many practical applications. Therefore, let’s equip it with location and scale parameters, in addition to its two shape parameters. Location and scale are parametrizations that are available for most of the other SciPy distributions as well.
1.6 Rescaling and Shifting a Distribution
We define the shifted and rescaled beta distribution by calling beta(shape_a, shape_b, loc, scale) and assigning the result to the variable rv.
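A sketch, using the parameter values discussed below:

```python
# loc = 100 moves the minimum of the support; loc + scale = 320 is the maximum
shape_a, shape_b, loc, scale = 2, 6, 100, 220
rv = beta(shape_a, shape_b, loc, scale)
print(rv.support())   # (100.0, 320.0)
```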

The location parameter loc moved the minimum of its support along the x-axis to a value of 100; the sum of location and scale moved its maximum to 100+220 = 320. This distribution can well reflect a real-world phenomenon that is subject to a random process or uncertainty between the boundaries of 100 and 320.
Next, we compute the first four moments by applying stats() and convert the resulting tuple to a list. Since the variance is hard to interpret, we also calculate the standard deviation, which we append to the list of moments. We write down the names of the five metrics, zip-combine the list of values with the list of names in the dictionary moments, and finally hand the dictionary over to a list comprehension that pretty-prints its contents row by row.
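In code, approximately:

```python
# first four moments as a list, plus the standard deviation
moments_list = list(rv.stats(moments="mvsk"))
moments_list.append(rv.std())

names = ["mean", "variance", "skewness", "excess kurtosis", "standard deviation"]
moments = dict(zip(names, moments_list))
_ = [print(f"{k:20s} {float(v):10.2f}") for k, v in moments.items()]
```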

The random process is centered around a mean of 155, with a standard deviation of 31.8. Since we only shifted and scaled the curve, its shape remains unchanged – it has retained its skewness and excess kurtosis.
Let’s plot the cumulative distribution function cdf and its inverse, the percent point or quantile function ppf.
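A sketch of the two charts:

```python
# cdf over x-values; ppf over cumulative probabilities q
x = np.linspace(rv.ppf(0.01), rv.ppf(0.99), 200)
q = np.linspace(0.01, 0.99, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(x, rv.cdf(x)); ax1.set_title("cdf")
ax2.plot(q, rv.ppf(q)); ax2.set_title("ppf (inverse cdf)")
plt.show()
```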


We feed selected points on the x-axis – among them the mean, the median, and the 1% and 99% quantiles – to the cdf and pdf functions to obtain more precise results than a glance at the charts can offer. A dictionary comprehension combines the x-values, used as the keys, with the cdf and pdf values and then prints them, cdf to the left, pdf to the right.
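For instance:

```python
# evaluate cdf and pdf at selected x-values, among them mean and median
xs = [rv.ppf(0.01), rv.median(), rv.mean(), rv.ppf(0.99)]
table = {x: (rv.cdf(x), rv.pdf(x)) for x in xs}
for x, (c, p) in table.items():
    print(f"x = {x:7.1f}   cdf = {c:.3f}   pdf = {p:.5f}")
```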

- At x = 150.3, we observe the median: 50% of the observations x are lower than 150.3.
- At x = 241.5, we observe the 99% quantile – 99% of the x-values do not exceed 241.5.
Lastly, we pass the cdf’s cumulative probabilities into the percent point function ppf. A dictionary comprehension zips the x-values we evaluated above with the list of quantiles which the ppf function returns. The x-values which the ppf computes should match the original x-values we passed to the cdf. In the table below, we see the x-values in the left column and the matching quantiles to the right. Thus, we have confirmed their consistency.
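A sketch of the round trip:

```python
# feed the cumulative probabilities back into ppf; the returned
# quantiles should reproduce the original x-values
roundtrip = dict(zip(xs, rv.ppf(rv.cdf(xs))))
for x0, x1 in roundtrip.items():
    print(f"original x = {x0:7.1f}   ppf(cdf(x)) = {x1:7.1f}")
```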

2. Distribution Fitting
2.1 Principles
The first distribution that comes to mind for describing a random process is the normal distribution. Despite its dominance in textbooks, it does not qualify for a large number of random processes:
- The normal distribution is symmetric about its mean and median. But often, the observational data make it quite clear that the random process that generated them is left-skewed or right-skewed, particularly if the domain is bounded, for instance if it has a hard minimum value at zero.
- The normal distribution is unbounded: Its domain extends to positive and negative infinity. Most real phenomena defy attempts to describe them in terms of unbounded growth or magnitude; or they cannot assume negative values, which invalidates half of the normal distribution’s probability mass.
- The normal distribution is not heavy-tailed. Many random processes, for instance in financial markets, exhibit so-called long tails: the probability that extreme outcomes will occur – critical for decisions in engineering or finance – is higher than normal distributions would predict on their own.
The candidate distributions we want to fit to our observational data should be chosen based on the following criteria:
- The nature of the random process if we can discern it. Of the 200+ distribution models in existence (yes, really – even SciPy does not yet enumerate all of them), many are tailored to describe specific types of random processes. Examples: To model the time till failure of equipment – if failure depends on how long the equipment has been in operation – a Weibull distribution should be among the candidates chosen for the fitting process; the fatigue-life or Birnbaum-Saunders distribution is an alternative for modeling equipment lifetime. Whereas a waiting line – a different kind of random process – invites a description by a Poisson variate.
- The domain or support of the distribution stands for the interval of x-values on which it is defined. Most natural phenomena and most technical processes, for instance, cannot assume negative values. The time between technical failures cannot become negative. The candidate distributions for these processes need to have a non-negative domain. For some of them, this domain is inherent in their mathematical definition. Others can undergo transformations: they can be truncated or shifted and rescaled to recalibrate them to a non-negative domain, such as the half-normal distribution. SciPy offers location and scale parameters for moving the domain of a distribution to an appropriate interval. Most instances of normal variates represent shifted and rescaled variants of the standard normal distribution: the mean has been shifted from 0 to an observed or estimated value; the standard deviation has been recalibrated from 1 to a rescaled interval around the shifted mean, so that a range within 3 deviations contains ~99.7% of all occurrences.
- The shape of the observational data. If the nature of the random process cannot be described a priori, then the shape of the histogram will provide clues as to whether, for instance, a symmetric, left-skewed, or right-skewed distribution will be a better fit.
- Multi-model validation: When we run data science models, we do not limit the investigation to a single model. Rather, we apply several alternative models, or at least alternative hyperparameter tuples, to the data. Similarly, we need to try out several alternative distributions in a fitting process. The large number of 60+ candidates we will pass to the fitter helps to ensure that we won’t end up with a suboptimal model because we cut our search short.
- Goodness-of-fit tests, such as the Kolmogorov-Smirnov test, provide us with more precise test results than the visual clues we get from charts that compare a histogram of observations with a fitted distribution curve.
2.2 Time to Equipment Failure: Fitting a Weibull Distribution
Before we build the auto-fitter, we want to understand how we can fit a chosen individual distribution to observational data. Let’s set up an example that fits a Weibull distribution to an equipment lifetime problem.
We will work with a fictional dataset: the time to failure of an electronic circuit board, to be used in the first space ship that will propel astronauts to Mars. Thus, there will be no easy way to obtain replacement parts at some truck stop bar & grill beyond earth orbit.
The Weibull model can be mathematically derived as the time to occurrence of the weakest link among parallel failure processes. It is used in a wide range of fields such as reliability engineering, hydrology, biology, or wind power analysis.
SciPy’s "Weibull Minimum Extreme Value" distribution (Weibull Minimum Extreme Value distribution – SciPy v1.7.1 Manual, as opposed to SciPy’s "Weibull Maximum Extreme Value" variate) is equivalent to the Weibull distribution explained by NIST (U.S. National Institute of Standards and Technology, 1.3.6.6.8. Weibull Distribution (nist.gov)) and wikipedia (Weibull distribution – Wikipedia). You will find that the symbols used for the parameters vary widely between sources – it is safer to identify them as shape, scale, and location.
SciPy expresses its distribution objects in standardized form. The weibull_min distribution comes with just a shape parameter c (called k on Wikipedia, gamma at NIST, and beta in other sources). Scale and location parameters are optional.
- if shape < 1: the failure rate decreases with time, after a kind of burn-in phase
- if shape = 1: the failure rate is constant => the Weibull distribution matches an exponential distribution
- if shape > 1: the failure rate increases with time
The scale parameter – denoted lambda, alpha, or beta in various sources – is also called the characteristic life. With increasing scale, the Weibull distribution becomes more symmetric, resembling a normal distribution.
A location parameter, if included to form a 3-parameter Weibull variate, is also known as waiting time: no failure can occur before L hours, with L ≥ 0.
First – while we are waiting for the circuit board manufacturer to provide us with real data – we create an array of synthetic observational data to prepare our model. We want to simulate the time to failure, in hours, for 1,000 components for which test data will soon arrive.
We choose a shape of 1.5 – thus, a failure rate that increases with time – and a characteristic life or scale of 50,000 hours.
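A sketch of the data generation (the random_state is mine, for reproducibility; with a different seed, your fitted values will differ slightly from the ones quoted below):

```python
# 1,000 synthetic times to failure, in hours, from a Weibull Minimum
# with shape 1.5 and characteristic life (scale) 50,000
shape, scale = 1.5, 50_000
actual = weibull_min.rvs(shape, loc=0, scale=scale,
                         size=1000, random_state=42)

plt.hist(actual, bins=50, density=True)
plt.title("synthetic times to failure [hours]")
plt.show()
```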

The long right tail indicates that the actual data are considerably right-skewed, with a skewness above 1.0, compared with 0 for a normal distribution. The failures begin at time zero, so there is no waiting time or location parameter involved – instead, quality problems surface and begin to knock out the first circuits immediately after they have passed the factory gates and are built into prototypes at the space center.
Next, we feed the actual data into the fit() function of the weibull_min distribution object.
Note the argument floc = 0. floc indicates that I have fixed the value of the location parameter at 0 by entering the parameter name with prefix ‘f’. The fit() function will search for shape and scale values so that the Weibull distribution mirrors the actual data; but it will leave the location unchanged at 0. The reason is that a flexible location parameter can massively influence the values found for shape and scale. The fitter looks for the best compromise between all three parameters if we don’t bar it from accessing the location parameter. If we had no prior knowledge about the random process, we might be undecided about the location. But if we know that failures can occur from time zero, even if the actual records start a bit later, we should set the location to zero and have the fitter focus on the shape and the scale without undue interference from an inaccurate location value. The fitter is not supposed to trade a higher location value off against higher or lower scale or shape values.
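The fitting step itself is compact:

```python
# floc=0 pins the location at zero; fit() searches only for shape and scale
shape_fit, loc_fit, scale_fit = weibull_min.fit(actual, floc=0)
print(shape_fit, loc_fit, scale_fit)
# the text reports ~1.54, 0.0, ~49,201 for the data discussed here
```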


The pdf curve closely traces the contours of the histogram. Let’s confirm the visual impression with a hypothesis test for goodness of fit: the Kolmogorov-Smirnov (KS) test.
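One direct way to run the test is to hand the fitted cdf to scipy.stats.kstest:

```python
# KS test: compare the observations against the fitted Weibull cdf
rv_fit = weibull_min(shape_fit, loc_fit, scale_fit)
ks_stat, p_value = stats.kstest(actual, rv_fit.cdf)
print(f"KS statistic = {ks_stat:.4f}   p-value = {p_value:.4f}")
```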

The very high p-value of the KS test confirms with near-certainty that the 3-parameter Weibull-Minimum(1.54, 0, 49201) variate mirrors the structure of the actual data.

The mean time to failure is 45,137 hours.
Let’s assume that a threshold point in time is located at T = 20,000 hours. After time T, the circuits are no longer deemed mission-critical, they will have completed their assigned tasks.
- Below, we compute the cumulative distribution function cdf, which tells us that after 20,000 hours, 22.4% of all circuits will have burned out.
- The survival function sf(T) can also be calculated as 1 – cdf(T). It reports that at 20,000 hours, 77.6% of the circuits will still be functional.
- At 10,000 hours, 91.4% of the circuits will have survived.
- We also observe that only the first 0.1% of circuits will fail during the initial 500 hours of testing.
- The percent point function ppf informs us that it will take 2,329 hours until 1.0% of components will have failed. Thus, extended testing immediately after production, for 500 or even 2,300 hours, will not significantly reduce the failure quantity. Most circuits will break later, i.e. not under terrestrial conditions.
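These numbers fall out of a few one-liners (the percentages in the comments repeat the results discussed above):

```python
T = 20_000
print(rv_fit.cdf(T))        # failed by T: ~22.4%
print(rv_fit.sf(T))         # survivors at T: ~77.6% (= 1 - cdf(T))
print(rv_fit.sf(10_000))    # survivors at 10,000 hours: ~91.4%
print(rv_fit.cdf(500))      # failures within 500 test hours: ~0.1%
print(rv_fit.ppf(0.01))     # hours until 1% have failed: ~2,329
```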
We conclude that we will need a large inventory of spare parts to replace the 22.4% of circuits that will flame out during the first 20,000 hours; and need to train the astronauts to install them. Alternatively, we could try to improve the robustness of the circuit design or reduce the usage hours.



3. The Auto-Fitter
SciPy provides a method .fit() for every distribution object individually. To set up a multi-model evaluation process, we are going to write a script for an automatic fitter procedure. We will feed our list of 60 candidates into the maw of the fitter and have it find the one random variate that most closely aligns with the actual observations.
We start with a function for the KS goodness-of-fit test, which will be called 60 times to compare the distance of a given candidate distribution from the actual data.
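One way to implement it, following the N_SAMPLES idea from section 0 – draw a sample from the fitted candidate and run a two-sample KS test against the observations (the random_state is my addition):

```python
def ks_test(data, dist, params):
    # draw N_SAMPLES from the fitted candidate distribution
    sample = dist.rvs(*params, size=N_SAMPLES, random_state=42)
    # two-sample KS test: fitted sample vs. actual observations
    ks_stat, p_value = stats.ks_2samp(data, sample)
    return ks_stat, p_value
```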
Next, we write a short function fitdist(), which fits the selected candidate and then calls the KS test. As before, we freeze the location parameter to 0 for all candidates: floc=0. This is a default setting you can choose to modify if you want to cast a wider net for solutions.
A list comprehension pulls each of the distribution objects from the list distributions at the top of our script and passes it to the fitter function.
The comprehension res returns the parameters of all fitted distributions, as well as the p-values of their respective KS tests, and tabulates them in a dataframe.
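A sketch of the fitter and the tabulation; unlike the notebook, which spreads the parameters over separate columns (hence the NaN remark below), this version keeps them in a single tuple-valued column:

```python
def fitdist(data, dist):
    # fit the candidate with the location pinned at 0, then test it
    params = dist.fit(data, floc=0)
    ks_stat, p_value = ks_test(data, dist, params)
    return [dist.name, ks_stat, p_value, params]

res = [fitdist(actual, D) for D in distributions]
df = pd.DataFrame(res,
                  columns=["distribution", "KS stat", "p-value", "params"])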

The dataframe contains 61 fitted distributions with their parameters.
We prepare a plot function that can chart all of the distributions which a function caller will pass to it.
The plot function
- determines how many rows with three subplots each will be needed to chart all distributions in the dataframe,
- pulls the row of a distribution with its parameters from the dataframe,
- omits the parameters that have NaN values,
- calibrates the endpoints of the x-axis to the 1% and 99% quantiles,
- draws the histogram of actual observations,
- and plots the pdf of the selected distribution (a sketch follows below).
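A compact sketch of such a plotting helper (it reads the tuple-valued params column from the sketch above):

```python
def plot_fitted_pdf(data, df_plot):
    n = len(df_plot)
    ncols = 3
    nrows = -(-n // ncols)                          # ceiling division
    fig, axes = plt.subplots(nrows, ncols,
                             figsize=(5 * ncols, 4 * nrows), squeeze=False)
    for ax, (_, row) in zip(axes.flat, df_plot.iterrows()):
        dist = getattr(stats, row["distribution"])  # look up by name
        rv = dist(*row["params"])
        x = np.linspace(rv.ppf(0.01), rv.ppf(0.99), 200)
        ax.hist(data, bins=50, density=True, alpha=0.5)
        ax.plot(x, rv.pdf(x))
        ax.set_title(f'{row["distribution"]}   p = {row["p-value"]:.3f}')
    for ax in axes.flat[n:]:                        # hide unused subplots
        ax.set_visible(False)
    plt.tight_layout()
    plt.show()
```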
Before calling the plotting function, we make a selection: we only ask for charts of those candidate distributions which exhibit a KS test result with a p-value higher than the significance level ALPHA = 0.05. We ignore the candidates that provide a worse fit. In this example, just 6 of the 61 candidates successfully passed the KS test. The script creates a dataframe df_ks that is limited to these six KS-approved distributions before the code calls the plotter function plot_fitted_pdf().
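In code:

```python
# keep only the candidates whose p-value clears the significance level
df_ks = df[df["p-value"] > ALPHA].sort_values("p-value", ascending=False)
plot_fitted_pdf(actual, df_ks)
```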

- Not unexpectedly, the 2-parameter Weibull Minimum distribution leads the field with a p-value of 42.5%.
- A 3-parameter Beta distribution, with 2 shape values and 1 scale value, reaches 35.2%.
- The 2-parameter Nakagami distribution is a relative of the Gamma family and reaches a solid p-value of 26.9%.
- The non-central F, Mielke, and Burr distributions are more exotic candidates, with p-values that pass the significance threshold, but fall off in fitting quality compared with the Weibull and Beta distributions.


The script selects the candidate with the highest KS p-value, weibull_min, and assigns it to the variable D_opt.
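With the sorted dataframe from above, the selection is a two-liner:

```python
# df_ks is sorted by p-value, so the best fit sits in the first row
best = df_ks.iloc[0]
D_opt = getattr(stats, best["distribution"])(*best["params"])
```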

We can start to review and work with this best fit just as we did with the other distribution instances in the examples above.

- The stats() function reports the four main moments of the fitted distribution.
- We can obtain the mean time to failure, and ask which percentage of components will have failed at that point in time: 57.8%. The stats() and percent point functions inform us that the mean time to failure will be 42,879 hours. The median time to failure, when 50% of all circuits will have gone dark, will be at 36,949 hours.
- The support of the Weibull Minimum distribution is, of course, the interval from time zero to positive infinity. (A sketch of these calls follows below.)
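A few calls along these lines produce the figures quoted above (exact values depend on the fitted parameters):

```python
print(D_opt.stats(moments="mvsk"))  # the four main moments
mttf = D_opt.mean()                 # mean time to failure
print(mttf, D_opt.cdf(mttf))        # share failed at the mean: ~57.8%
print(D_opt.median())               # median time to failure
print(D_opt.support())              # (0.0, inf)
```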

This concludes today’s tutorial.
- We have reviewed SciPy’s class of distributions and the methods and properties associated with them.
- We fitted a chosen distribution – Weibull-Minimum – to an array of synthetic data that represented time-to-failure observations and thus modeled a problem in reliability engineering.
- Finally, we wrote a fitter procedure that tests the actual data against a list of 61 candidate distributions, three of which offered a very good fit in our example, with p-values that exceeded 20% when we ran the Kolmogorov-Smirnov test function on them.
The Jupyter notebook is available for download on GitHub: h3ik0th/DistFitter: SciPy distribution fitting (github.com)
- title image: by Myriams-Fotos, pixabay
- all other images: by author