Insurance Risk Pricing — Tweedie Approach

An illustrative guide to estimate the pure premium using Tweedie models in GLMs and Machine Learning

Ajay Tiwari
Towards Data Science
9 min readApr 20, 2020

Background

Insurance is a unique industry, probably one of the few where a company doesn’t know the actual cost of the product sold, they deal with the risks of unforeseeable events. Therefore, this industry has been relying on mathematics to understand the risk behavior of its customers and prospects for the last few centuries to estimate expected loss costs. Of course, methods have been evolving from Pascal’s triangle, application of probability, minimum bias procedures to generalized linear models and now machine learning.

Objective

We have discussed the count models (used in claim frequency) and the gamma model (used in claim severity) in previous articles in detail. The pure premium, also known as ‘loss cost’ is just the product of both the model estimates.

Claim Frequency = Claim Count / Exposure

Claim Severity = Claim Cost / Claim Count

Loss Cost = Claim Frequency x Claim Severity

In the current article, we are going to discuss a very important and interesting distribution called Tweedie, this distribution is interesting because it delivers many other distributions from Gaussian to inverse Gaussian by just changing a single parameter.

This distribution will help us in modeling pure premium directly without any need for two different models.

Tweedie Distribution

Tweedie distribution is a special case of exponential dispersion models and is often used as a distribution for generalized linear models. It can have a cluster of data items at zero and this particular property makes it useful for modeling claims in the insurance industry. This model can also be applied in other use cases across industries where you find a mixture of zeros and non-negative continuous data points.

If you see a histogram as below with a spike at zero, it’s a possible candidate to be fitted to a Tweedie model.

This family of distributions has the following characteristics

A mean, E(Y)= µ

A variance, Var(Y)= ϕµᵖ

Let’s understand with this expression “Y ~ Twₚ(µ,ϕ)”, where Y denote the response variable, Twₚ(µ,ϕ) denotes a Tweedie random variable with mean µ and variance ϕµᵖ and ϕ>0 and p ∈ (-∞,0 ] ∪[1,∞).

The p in the variance function is an additional shape parameter for the distribution.

An important point to be noted, this distribution is not defined for values of p between 0 and 1.

  • If 1< p <2, the distribution are continuous for Y>0, with a positive mass at Y=0
  • If p > 2, the distributions are continuous for Y > 0

Let’s look at Some of the commonly used members of the Tweedie family of distributions with their index parameter (p), variance function (V (µ)) and dispersion (ф)

Exponential Dispersion Models

One distribution enables fitting many other distributions. Setting p = 0 gives a normal distribution, p = 1 is Poisson, p = 2 gives a gamma distribution and p = 3 yields an inverse Gaussian.

Parameters conversion from one form to another.

We know Tweedie as a compound Poisson-Gamma distribution where a count N~Poisson(λ) and continuous number Z ~ Gamma(α,θ), in such case parameters can be easily translated into Tweedie parameters as given below:

Poisson-Gamma to Tweedie

Variance can also be translated — Var[Y] = ϕµᵖ =λ⋅θ² ⋅α (α+1)

Similarly, Tweedie parameters can also be converted into Poisson and Gamma parameters as follows:

Tweedie to Posson-Gamma

Now, we got some understanding of this interesting distribution and its relation with its close family members like Poisson and Gamma.

Next, we will use a dataset and check this distribution’s applicability.

Recall our insurance dataset which we used in previous articles. We are hoping Tweedie distribution is the ideal candidate for this dataset, let’s explore this data to confirm our assumption.

This dataset (dataCar)can be downloaded from an R package called “insuranceData”.

A brief overview of the dataset

This dataset is based on one-year vehicle insurance policies taken out in 2004 or 2005. There are 67,856 policies, of which 4624 (6.8% notified claims) filed claims.

Data Snapshot

Only 7% of observation has positive values for the response variable, the rest of the values are zero. Let’s look at the distribution of claim count and cost independently.

As we have seen above

  • Loss Cost = Claim Frequency x Claim Severity

Let’s re-write in terms of original variables

  • Loss Cost = Claim Cost / Exposure

Response Variable — We are going to use ‘Claim Cost’ as a response variable keeping ‘exposure’ as an offset which is a suggested approach for modeling rates and averages.

Let’s have a look at the distribution of this variable.

We can see this data can not be fitted to the normal distribution, Poisson can also be discarded because this is not a count data. Another option we can think of Gamma, but this distribution does not take zero values. So, finally, we are left with Tweedie distribution which might be the best fit for this data.

Now, look at the independent variables. For detailed exploratory analysis and transformation of these variables, you can refer to the source code shared along with this article.

Independent Variables — We are going to use Vehicle body, vehicle age, driver’s age category, gender, area, and vehicle value as predictor variables. Claim occurrence indicator and the number of claims cannot be used as they are related to the dependent variable and can be known only after the event take place.

Derived Variable and Transformation — Vehicle body type by low frequencies can be grouped, also created an alternate categorical variable for vehicle value which can be tested for significance.

Selecting an optimal “p” for Tweedie model

Based on this exploratory analysis, we can confirm Tweedie distribution is an ideal candidate for fitting the model, but we are yet to find an optimum value for “p” which we know varies between 1 to 2.

Approach 1 — Manual selection of the variance power.

  • Test a sequence of “p” between 1 to 2 by running iterative models.
  • The Log-likelihood shows an inverse “U” shape.
  • Select the “p” that corresponds to the “maximum” log-likelihood value.
  • Fit the final model using that “p” value.

Approach 2 — Automated selection of variance power using statistical packages

Statistical packages (macros) are available in various software such as R, SAS, Python, and Emblem (Industry-specific software) to calculate maximum likelihood estimate for a series of “p” value. These are a kind of improvement to approach 1 with many other added features such as visualization. We can see the sample output below in the form of a graph generated from R using the “Tweedie” package.

This graph shows that optimum “p” values lie somewhere between 1.5 and 1.6. We can use the suggested value for fitting the Tweedie model.

In Python, statsmodels has a function called estimate_tweedie_power for obtaining an optimum value.

Now, let’s train the model

We are using R and Python for fitting the Generalized Linear Models. R is quite mature in statistical models, python is also catching up with its statsmodels package. I feel a lack of proper examples and documentation for statsmodels is the only disadvantage.

In machine learning models, the Tweedie loss function is available in many algorithms such as GBM, LightGBM, XGBOOST, etc. Here, I will walk you through XGBOOST implementation.

Generalized Linear Models in R and Python

Implementation in R

Implementation in Python, Statsmodels

Model validation and goodness of fit in GLMs

Sample Output

We can analyze the results like any other regression, by looking at the results summary.

  • If the p-value is less than or equal to the significance level, you can conclude that the variable is significant
  • Other metrics such as AIC, Deviance, and Loglikelihood are useful for comparing related models. Lower the AIC and Deviance, better the model, whereas a higher value of the likelihood is better.

Implementation in Python, XGBOOST

Model validation in XGBOOST

Evaluation metrics like root mean squared error (rmse), mean absolute error (mae) and k-fold cross-validation techniques are some of the methods of validating and comparing the models.

Please feel free to access complete code on my gist path.

Summary

We have discussed Tweedie distribution, one of the most commonly used distributions in the Insurance industry and its relation with other distributions of exponential dispersion models such as Poisson and Gamma. We also learned to estimate the Tweedie variance power which is the most important parameter required to fit an accurate model.

Though we discussed Tweedie implementation examples in some of the popular open-source packages, there are few more software that has this implementation which can be explored such as H2O, SAS, and Emblem.

Thanks for reading, hope you found the article informative. Please feel free to reach me for any queries or suggestions.

References

[1]Tweedie, M. C. K., “An Index which Distinguishes between Some Important Exponential Families,” in Statistics: Applications and New Directions, Proceedings of the Indian Statistical Golden Jubilee International Conference, J. K. Ghosh and J. Roy (Eds.), Indian Statistical Institute, 1984, 579–604

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

Published in Towards Data Science

Your home for data science and AI. The world’s leading publication for data science, data analytics, data engineering, machine learning, and artificial intelligence professionals.

Responses (2)