Notes from Industry

Background information on the writer
I’m a qualified actuary with 10 years of experience practising in the life insurance industry in Australia. For those of you who may not be familiar with what an actuary does day-to-day, at a high level an actuary’s role spans: setting premium rates for mortality and morbidity products, calculating claims obligations and associated capital requirements for the life insurer, calibrating economic value for the life insurer’s shareholders, and understanding deviations in actual experience (e.g. for claims and discontinuance) against expected experience.
About 3 years ago I embarked on a journey to self-learn Data Science/machine learning techniques after coming across a documentary on AlphaGo. I found it particularly fascinating how machines can be trained to learn abstract concepts, in this instance, human intuitions for making a move in the game of Go.
With a reasonable background in Statistics by education, I was quick to pick up a number of data science techniques and apply them in my line of work. I’ll explain how these techniques help improve actuarial analyses later in this article with two (2) use cases.
I consider data science techniques good supplements to traditional actuarial techniques, which are mostly Statistics-based, and I have been fortunate enough not to have had to jump onto the data science bandwagon from a completely unrelated field.
Current state of data science in the life insurance industry
Data tools
In my day-to-day work (and I can speak for most life insurance actuaries in Australia), Excel has been the primary tool for analysing data, with VBA macros automating some of the Excel-based ETL work.
From a programming perspective, most life insurance actuaries would have had a bit of exposure to R while at university.
Excel comes with its pros and cons, which I’ll discuss in a bit more detail later on.
Penetration of data science in life insurance companies
Data science is at an infancy stage for most life insurance companies in Australia, according to a consultancy I spoke to. ‘Machine Learning’ and ‘Data Science’ are buzzwords executives have heard of, but may find hard to relate to. As an example, on one occasion I heard executives commenting on machine learning as if it were the same thing as the still fairly distant artificial general intelligence.
To me, the cause of the low penetration of data science techniques in the life insurance industry is two-fold:
- Data collected by life insurance companies has historically been scarce compared to that of general insurance companies (e.g. motor insurers). This is because most of the population owns at least one car, whilst not necessarily holding a life insurance policy. In addition, it’s much more likely that one claims on a motor insurance policy than on a life insurance policy. This has implicitly made machine learning models which rely on high volumes of training data (e.g. neural networks) less appealing.
- Life insurance executives have not (yet) been presented with practical use cases of how data science techniques can add value to an organisation. Given it’s a relatively new development, few people in the organisation are equipped to build a business case and at the same time promote it to secure buy-in and investment. As an example, few people in the Product team would be aware that a bit of web-scraping can be applied to gather updates on competitors’ product features, as opposed to manually collecting this information by visiting each competitor’s website (a minimal sketch of this idea follows below).
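To make the web-scraping example above concrete, below is a minimal sketch using the requests and BeautifulSoup libraries. The URL and CSS selector are hypothetical placeholders rather than a real competitor page; in practice they would come from inspecting each competitor’s website.
import requests
from bs4 import BeautifulSoup

# Hypothetical competitor page listing product features (placeholder URL)
url = "https://www.example-competitor.com.au/life-insurance/product-features"
response = requests.get(url, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of each listed feature (the CSS class is an assumption)
features = [tag.get_text(strip=True) for tag in soup.select(".product-feature")]
print(features)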
This article aims to demonstrate how data science techniques can add value to a life insurance company, particularly the actuarial workflow, with two (2) use cases based on my own experience. I do have more use cases which I’m hoping to share in future articles.
How data science techniques can help an actuary – use cases
Use case 1 – Natural Language Processing on free-text claims causes
Most life insurers record claims causes for claims lodged by policyholders. Some examples of claims causes recorded are:
- Dislocated, Fractured (L) Hip
- Metastatic Melanoma
- Cardiac Arrest
Claims causes are at times recorded as free text. These are unstructured data which present challenges for actuaries who would like to analyse claims at the claims-cause level.
The three examples of claims causes above have the same meanings as Broken Hip, Skin Cancer and Heart Attack respectively. However, if the computer simply reads the text without human understanding, these claims would all be treated as unique claims.
For example, the computer may recognise 15 unique levels in the ‘Claims cause’ column below, whereas there should really be just one unique level (Neoplasm, or Cancer) actuaries are concerned with.

Before using the NLP technique I’m about to describe for this task, I had been filtering in Excel for claims containing similar strings such as "Cancer" or "Tumor", and manually classifying them into the Neoplasm category. You can imagine how manual and laborious this is, at times requiring me to look up medical terms such as "Cholangiocarcinoma" and "Seminoma" claim by claim.
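For context, a rough pandas equivalent of that manual keyword filter might look like the sketch below (the column name claims_cause and the keyword list are assumptions). It also shows the limitation: a cause like "Seminoma" contains none of the keywords and slips through, which is exactly why the NLP approach below is needed.
import pandas as pd

# Hypothetical claims register with a free-text claims cause column
claims_df = pd.DataFrame(
    {"claims_cause": ["Metastatic Melanoma", "Seminoma", "Cardiac Arrest"]}
)

# Keyword-based classification only catches causes containing the listed words
neoplasm_keywords = ["cancer", "tumor", "melanoma", "carcinoma"]
pattern = "|".join(neoplasm_keywords)
claims_df["is_neoplasm"] = claims_df["claims_cause"].str.lower().str.contains(pattern)
# "Seminoma" is a neoplasm but is flagged False here - the keyword list never
# fully covers the medical vocabulary encountered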
The task in this use case is to automatically classify claims causes into the pre-defined categories using some Natural Language Processing ("NLP"). Some categories at which to aggregate the claims causes may be Accident/Injury, Neoplasm, Musculoskeletal, Circulatory Diseases, Respiratory Diseases, Mental Disorder, just to name a few. This can be achieved in the following five (5) steps.
Step 1
After defining the categories at which to aggregate the claims causes, manually label the free-text claims causes into these categories. This provides training data for the NLP model, and I promise it is the last laborious step in this task!
In practice, the labelling was done for 366 claims out of a dataset of approximately 6,000 claims. Personally, this did not add much effort as I had already been doing it previously.
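As a rough illustration, the labelled data can sit in a small dataframe. The variable name text_claims_register is taken from the Step 3 code below; the column names and the example rows are assumptions for the sketch.
import pandas as pd

# Hypothetical labelled claims register: one row per claim, with the free-text
# cause and the manually assigned category (366 rows in practice)
text_claims_register = pd.DataFrame({
    "text_claims_causes": [
        "Dislocated, Fractured (L) Hip",
        "Metastatic Melanoma",
        "Cardiac Arrest",
    ],
    "category": ["Accident/Injury", "Neoplasm", "Circulatory Diseases"],
})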
Step 2
For the 366 claims that now have a label, perform text cleaning on the claims causes, such as removing special characters, removing double white spaces and lower-casing. It’s easy to expand this step to include other cleaning steps of choice, such as rectifying spelling errors.
Some sample Python code for text cleaning is provided below:
import re
# raw_claims_cause is the list of free-text claims causes to clean
# (e.g. text_claims_register['text_claims_causes'].tolist())

# Remove special characters (keep word characters and white space only)
text_remove_special_character = []
for texts in raw_claims_cause:
    text_remove_special_character.append(re.sub(r'[^\w\s]', '', str(texts).strip()))

# Remove double white spaces
text_remove_white_space = []
for texts in text_remove_special_character:
    text_remove_white_space.append(" ".join(texts.split()))

# Lower-casing
text_lowercase = []
for texts in text_remove_white_space:
    text_lowercase.append(texts.lower())
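The cleaned texts can then be stored back on the claims register so that Step 3 can pick them up. The column name below simply matches the one referenced in the Step 3 code.
# Store the cleaned claims causes alongside the labels for Step 3
text_claims_register['text_claims_causes_cleaned'] = text_lowercase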
Step 3
Create word embeddings for the cleaned claims cause texts. I have chosen a form of BERT vectorisation per the sample Python code below.
from sentence_transformers import SentenceTransformer
bert_input_claims_with_label = text_claims_register['text_claims_causes_cleaned'].tolist()
model_claims_with_label = SentenceTransformer('paraphrase-mpnet-base-v2')
embeddings_claims_with_label = model_claims_with_label.encode(bert_input_claims_with_label)
print('Shape of vector outputs:', embeddings_claims_with_label.shape)
The Bidirectional Encoder Representations from Transformers ("BERT") model is an NLP model developed and released by Google in 2018. It is pre-trained on text passages from Wikipedia and BookCorpus (chosen in part to ensure the training data are grammatically sound).
In essence, each free-text claims cause has been ‘translated’ by BERT into a language the computer can understand, represented by a vector of length 768.
The output of this step is effectively a vector for each labelled free-text claims cause, again of length 768.
The shape of the overall word embedding is as expected:
Shape of vector outputs: (366, 768)
Step 4
Repeat Steps 2 and 3 for claims causes without a label. The output of this step is effectively a vector for each free-text claims cause without a label (i.e. the claims causes we want to label), again of length 768. If you have n claims causes without a label, you should expect the shape of the word embedding to be of dimension (n, 768).
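A minimal sketch of this step, reusing the model loaded in Step 3 (the list text_claims_causes_cleaned_without_label is an assumed name for the unlabelled claims causes after the Step 2 cleaning):
# Embed the cleaned, unlabelled claims causes with the same BERT model
bert_input_claims_without_label = text_claims_causes_cleaned_without_label
embeddings_claims_without_label = model_claims_with_label.encode(bert_input_claims_without_label)
print('Shape of vector outputs:', embeddings_claims_without_label.shape)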
Step 5
Vector output for claims causes without a label is compared against the vector output for claims causes with a label. The claims cause label of the closest match is returned based on cosine similarity.
In simple terms, as the training data consist of 366 claims, for each claims cause we want to automatically label (i.e. for each row in the similarity_df dataframe in the sample code below), there exist 366 columns which numerically score how similar that claims cause is to each of the claims in the training data, and the position of the claim giving the highest score is returned with the pandas idxmax method.
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

embedding_array_claims_with_label = np.array(embeddings_claims_with_label)
embedding_array_claims_without_label = np.array(embeddings_claims_without_label)

# Similarity matrix: one row per unlabelled claim, one column per labelled claim
similarity = cosine_similarity(embedding_array_claims_without_label, embedding_array_claims_with_label)
similarity_df = pd.DataFrame(similarity)

# Return position of closest match, with which the label can be looked up
similarity_index = similarity_df.idxmax(axis=1)
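To complete the lookup, the positions returned by idxmax can be mapped back to the manually assigned categories. The column name category is an assumption carried over from the earlier sketch of the labelled register.
# Look up the category of the closest labelled claim for each unlabelled claim
predicted_categories = text_claims_register['category'].iloc[similarity_index].reset_index(drop=True)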
Below is a snippet of the end outputs for a number of claims, which definitely passed the ‘eye-test’.

Implications for actuaries
There are a number of implications for actuarial analyses if claims causes are able to be mapped and aggregated at this level. Some examples include:
- Investigating trends by claims cause (e.g. the trend in subjective claims, such as Mental Disorder and related claims, during the COVID-19 pandemic)
- Comparing company claims experience by claims cause against the industry experience (which is published by Government authorities in Australia)
- Understanding the top n claims causes by rating factors such as Occupation and Gender
- Triaging claims to be assessed to suitable claims assessors by specialty (e.g. certain claims assessors may be more specialised in Mental Disorder claims than others)
Separately, using a similar methodology, this use case can be extrapolated to Occupation classes. For example, mapping occupation descriptions such as "Plumber, 3 years of experience" to the Blue Collar category and "Lawyer, degree qualified" to the Professional/White Collar category, which can then feed into further actuarial analyses.
Summary – use case 1
To summarise, this use case presents a way for actuaries to automatically classify free-text claims cause data into pre-defined categories for further analyses. Ultimately, with the help of BERT, computers are able to understand human language. In this instance, computers are able to understand and compare medical terms or descriptions of a claims event, which can be messy at times.
The alternative, manual filtering in Excel, is not practical, especially for a large number of claims.
Use case 2 – Complex transformation of data at large scale
As mentioned previously, Excel has been the primary ETL tool for most life insurance actuaries. Whilst Excel offers users transparency in transforming data, it has a number of crippling disadvantages in the context of actuarial applications:
- Excel has a row and column limit. More often than not, even before this limit is reached, navigating through the spreadsheets becomes slow and difficult given the demands on computing resources.
- Excel is not a scalable tool, as it’s often difficult to efficiently connect data in one workbook with another, let alone perform calculations across them, especially when datasets are large.
The task this use case is based upon is not (practically) achievable with Excel, but easily achievable with a bit of dataframe manipulation with PySpark.
To describe the task using a simple example, imagine a life insurance policyholder has purchased a policy with two (2) covers, a cover for the event of Death ("Death cover") and a cover for the event of a pre-defined medical condition ("Trauma cover").
The two covers are associated with different payout amounts, usually more on the Death cover than on the Trauma cover. In addition, in most cases in Australia, paying out the Trauma cover reduces the amount originally purchased with the Death cover. For example, let’s assume the policyholder has purchased a Death cover with a payout amount of $100k and a Trauma cover with a payout amount of $50k. If the policyholder claims on the Trauma cover and is paid out $50k, the Death cover reduces to $50k (i.e. $100k – $50k).
Readers don’t need to understand the technicalities, but this product structure, which is common in Australia, warrants different mortality assumptions for the Death cover. In simple terms, as illustrated in Chart 2.1 below, the blue and green components of the Death cover need to be modelled with different actuarial assumptions. In essence, this requires a ‘split’ of the Death cover into two components for which we’ll model claims costs differently.

Effectively, this requires identifying policyholders who have a Death cover and a ‘linked’ Trauma cover and, within this ‘frame’ of the data for the particular policyholder, adding an additional row for the Death cover and calculating the payout amounts for the two Death cover rows based on the payout amount of the Trauma cover. You should quickly see that this would be extremely difficult to achieve with Excel, in part because the transformation is not one-to-one. In tabular format, the transformation required is illustrated in Chart 2.2 below:

This is when PySpark and its built-in SQL functions become very powerful, and with them, the transformation required can be achieved in the following four (4) steps:
Step 1
Create a new column "Linked Amount", which returns the payout amount of the Trauma cover at a Policyholder ID level as illustrated in Chart 2.3 below.

This can be done by creating a window or ‘frame’ over the Policyholder ID field (which I’ve used interchangeably with Policyholder_ID in the sample code below):
from typing import List
from pyspark.sql import functions as F
from pyspark.sql import Column
from pyspark.sql import Window as W

def compute_Linked_Amount(Policyholder_ID: List[str]) -> Column:
    # Window over all of a policyholder's cover rows
    window = W.partitionBy(*Policyholder_ID)
    # Death rows contribute 0, so the max over the window returns the
    # Trauma cover's payout amount for every row of the policyholder
    return F.max(
        F.when(F.col("Cover") == "Death", F.lit(0.0)).otherwise(F.col("Amount"))
    ).over(window)
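A hedged usage sketch, assuming the policy dataframe is called policy_df (an assumed name) and the partition key is the Policyholder_ID column:
# Add the Linked Amount column per Chart 2.3
policy_df = policy_df.withColumn("Linked Amount", compute_Linked_Amount(["Policyholder_ID"]))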
Step 2
Create another new column "Split Type" of the array data type for rows of Death covers only, and set its value for these rows to [Death Comp 1, Death Comp 2] (i.e. an array of the two strings), as illustrated below:

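A minimal sketch of this step, continuing with the assumed policy_df dataframe and the column names from the charts:
from pyspark.sql import functions as F

# Array column for Death covers only; other covers get a null Split Type and
# will pass through the later explode_outer unchanged
policy_df = policy_df.withColumn(
    "Split Type",
    F.when(
        F.col("Cover") == "Death",
        F.array(F.lit("Death Comp 1"), F.lit("Death Comp 2"))
    )
)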
Step 3
‘Explode’ each of these rows into 2 identical rows, except for the value of the new column being respectively Death Comp 1 and Death Comp 2. This is illustrated below. The ‘explode’ operation can be achieved using the explode_outer SQL function within PySpark.

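A minimal sketch of the ‘explode’, again using the assumed policy_df dataframe; explode_outer keeps the non-Death rows (whose Split Type is null) as single rows:
from pyspark.sql import functions as F

# Each Death row becomes two rows (Death Comp 1 and Death Comp 2);
# non-Death rows are kept as-is with a null Split Type
policy_df = policy_df.withColumn("Split Type", F.explode_outer(F.col("Split Type")))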
Step 4
Create the final payout amount column, formulated with the following WHEN OTHERWISE statement.
from pyspark.sql import functions as F
F.when(
    F.col("Split Type") == "Death Comp 1",
    F.col("Amount") - F.col("Linked Amount")
).otherwise(F.col("Linked Amount"))

You should be able to see that after removing rows with a 0 value in the "Final Amount" column, the transformation is complete per Chart 2.2.
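Putting the last two steps together on the assumed policy_df dataframe:
from pyspark.sql import functions as F

# Create the Final Amount column and drop the zero-value component rows
policy_df = policy_df.withColumn(
    "Final Amount",
    F.when(
        F.col("Split Type") == "Death Comp 1",
        F.col("Amount") - F.col("Linked Amount")
    ).otherwise(F.col("Linked Amount"))
).filter(F.col("Final Amount") != 0)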
Summary – use case 2
To summarise, this use case represents a transformation which requires rows corresponding to the Death covers in a dataframe to be split into separate rows if the same policyholder also has a Trauma cover. This is not easily achievable with operations in Excel.
It has been demonstrated in the use case that such transformation is easily achievable with PySpark, which offers users the following:
- Ability to create a window over a subset of the data (e.g. for each policyholder using the Policyholder ID field) and extract information at this level. In this instance, this identified whether the policyholder has both Death and Trauma covers and, if so, populated the payout amount for the Trauma cover in the row of the Death cover per Step 1 of the use case.
- Ability to insert a column of the array data type into a dataframe, which makes operations like splitting possible with built-in SQL functions such as ‘explode_outer’.
- Efficiency in performing the transformation compared to Excel, as PySpark is known for its speed on large datasets/dataframes. This is critical as we extrapolate the use case to cover types other than Death and Trauma, as well as to all policyholders for commercial use within a life insurance company.
Conclusion
Data science and machine learning may still seem like distant concepts to a life insurance actuary in 2021, but they are closer to home than one might think. Personally (and commercially), it takes a bit of effort to break through the barrier to entry, but that effort eventually pays off.
Lastly, I hope you found this article a good read, and please share it if you did.
As I ride the AI/ML wave, I enjoy writing and sharing step-by-step guides and how-to tutorials in comprehensible language with ready-to-run code.