
With the rise of ChatGPT-like models, it has become accessible for a broader audience to analyze their own data sets and, so to speak, "ask questions". Although this is great, such an approach also has disadvantages when used as an analytical step in automated pipelines, especially when the outcome of a model can have a significant impact. To maintain control and ensure results are accurate, we can also use Bayesian inference to talk to our data set. In this blog, we will go through the steps of learning a Bayesian model and applying do-calculus to the data science salary data set. I will demonstrate how to create a model that allows you to "ask questions" of your data set while maintaining control. You will be surprised by how easy it is to create such a model using the bnlearn library.
Introduction
Extracting valuable insights from data sets is an ongoing challenge for data scientists and analysts. ChatGPT-like models have made it easier to interactively analyze data sets, but at the same time it can become less transparent, or even unknown, why particular choices are made. Relying on such black-box approaches is far from ideal in automated analytical pipelines. Creating transparent models is especially important when the outcome of a model impacts the actions that are taken.
The ability to communicate effectively with data sets has always been an intriguing prospect for researchers and practitioners alike.
In the next sections, I will first introduce the bnlearn library [1] for learning causal networks. Then I will demonstrate how to learn a causal network from a mixed data set and how to apply do-calculus to effectively query the data set. Let’s see how Bayesian inference can help us interact with our data sets!
If you find this article helpful, you are welcome to follow me because I write more about Bayesian learning. If you are thinking of taking a Medium membership, you can support my work a bit by using my referral link. It is the same price as a coffee but this allows you to read unlimited articles every month.
The Bnlearn library
Bnlearn is a powerful Python package that provides a comprehensive set of functions for causal analysis using Bayesian networks. It can handle discrete, mixed, and continuous data sets, and offers a wide range of user-friendly functionalities for causal learning, including structure learning, parameter learning, and making inferences [1–3]. Before we can make inferences, we need to understand structure learning and parameter learning, because inference relies on both.
Learning the causal structure of a data set is one of the great features of bnlearn. Structure learning eliminates the need for prior knowledge or assumptions about the underlying relationships between variables, and results in a so-called Directed Acyclic Graph (DAG). There are three approaches in bnlearn to learn a causal model and capture the dependencies between variables:
- Score-based structure learning: uses scoring functions such as BIC, BDeu, K2, BDS, and AIC, in combination with search strategies such as exhaustivesearch, hillclimbsearch, chow-liu, Tree-augmented Naive Bayes (TAN), and NaiveBayes.
- Constraint-based structure learning (PC): uses statistical tests, such as the chi-square test, to assess edge strengths prior to modeling.
- Hybrid structure learning: a combination of both techniques.
Although all three techniques result in a causal DAG, some can handle a large number of features while others achieve higher accuracy. Read more details regarding structure learning in the blog below [2].
A Step-by-Step Guide in detecting causal relationships using Bayesian Structure Learning in Python.
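To give a flavor of how these approaches are called in practice, below is a minimal sketch using the small sprinkler example data set that ships with bnlearn. The method and score settings are illustrative choices, not recommendations.
# Minimal sketch: score-based vs. constraint-based structure learning on a toy data set.
import bnlearn as bn
# Load the small sprinkler example data set that ships with bnlearn
df_toy = bn.import_example('sprinkler')
# Score-based search: hillclimbsearch with the BIC score
model_hc = bn.structure_learning.fit(df_toy, methodtype='hc', scoretype='bic')
# Constraint-based search (PC algorithm)
model_cs = bn.structure_learning.fit(df_toy, methodtype='cs')
# Plot the learned DAGs for comparison
bn.plot(model_hc)
bn.plot(model_cs)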
Parameter learning is the second important part of Bayesian network analysis, and bnlearn excels in this area as well. By leveraging a set of data samples and a (pre-determined) DAG, we can estimate the Conditional Probability Distributions or Tables (CPDs or CPTs). For more details about parameter learning, I recommend the following blog:
A step-by-step guide in designing knowledge-driven models using Bayesian theorem.
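As a minimal sketch of parameter learning (again on the sprinkler toy example; the edges listed below are an assumption made purely for illustration):
# Minimal sketch: estimate CPTs from data, given a pre-determined DAG.
import bnlearn as bn
# Toy data set and a hand-crafted DAG
df_toy = bn.import_example('sprinkler')
edges = [('Cloudy', 'Sprinkler'), ('Cloudy', 'Rain'),
         ('Sprinkler', 'Wet_Grass'), ('Rain', 'Wet_Grass')]
DAG = bn.make_DAG(edges)
# Parameter learning: estimate the CPTs from the data samples
model = bn.parameter_learning.fit(DAG, df_toy, methodtype='bayes')
# Inspect the learned CPTs
bn.print_CPD(model)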
Bnlearn also provides a plethora of functions and helper utilities to assist users throughout the analysis process. These include data set transformation functions, topological ordering derivation, graph comparison tools, insightful interactive plotting capabilities, and more. The bnlearn library supports loading bif files, converting directed graphs to undirected ones, and performing statistical tests for assessing independence among variables. In case you want to see how bnlearn performs compared to other causal libraries, this blog is for you:
The Power of Bayesian Causal Inference: A Comparative Analysis of Libraries to Reveal Hidden…
In the next section, we will jump into making inferences using do-calculus with hands-on examples. This allows us to ask questions to our data set. As mentioned earlier, structure learning and parameter learning form the basis.
Interrogating data sets requires making inferences using do-calculus.
When we make inferences using do-calculus, it basically means that we can query the data set and "ask questions" of the data. To do this, we need two main ingredients: the DAG and the CPTs that are assigned to each node in the graph. The CPTs contain the probabilities of each variable and capture the causal relationships given its parents. Let’s move on and create an example to see how this really works.
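Before moving to the salary data set, here is a compact sketch of that full chain (DAG and CPTs first, then a query) on the sprinkler toy example; variable names and values are those of that example data set:
# Minimal sketch: learn the DAG and CPTs, then query the model.
import bnlearn as bn
df_toy = bn.import_example('sprinkler')
# Structure learning (DAG) followed by parameter learning (CPTs)
model = bn.structure_learning.fit(df_toy, methodtype='hc', scoretype='bic')
model = bn.parameter_learning.fit(model, df_toy, methodtype='bayes')
# "Ask a question": what is the probability of wet grass given that it rains?
query = bn.inference.fit(model, variables=['Wet_Grass'], evidence={'Rain': 1})
print(query)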
Application with the Data Science Salary Dataset
For demonstration, we will use the data science salary data set derived from ai-jobs.net [5]. The salary data was collected worldwide and contains 11 features for 4134 samples. If we load the data, we can explore the columns and set features as continuous or categorical. Note that model complexity increases with the number of categories, which means that more data and computation time are required to determine a causal DAG.
# Install datazets.
!pip install datazets
# Import library
import datazets as dz
# Get the Data Science salary data set
df = dz.get('ds_salaries')
# The features are as follows
df.columns
# 'work_year' > The year the salary was paid.
# 'experience_level' > The experience level in the job during the year.
# 'employment_type' > Type of employment: Part-time, full time, contract or freelance.
# 'job_title' > Name of the role.
# 'employee_residence' > Primary country of residence.
# 'remote_ratio' > Remote work: less than 20%, partially, more than 80%
# 'company_location' > Country of the employer's main office.
# 'company_size' > Average number of people that worked for the company during the year.
# 'salary' > Total gross salary amount paid.
# 'salary_currency' > Currency of the salary paid (ISO 4217 code).
# 'salary_in_usd' > Converted salary in USD.
Complexity is a major limitation
When features contain many categories, the complexity grows exponentially with the number of parent nodes associated with that table. In other words, when you increase the number of categories, a lot more data is required to obtain reliable results. Think about it like this: when you split the data into categories, the number of samples within a single category becomes smaller after each split. The low number of samples per category directly affects the statistical power. In our example, we have the feature job_title, which contains 99 unique titles, of which 14 job titles (such as data scientist) contain 25 samples or more. The remaining 85 job titles are either unique or seen only a few times. To make sure this feature is not removed by the model because of a lack of statistical power, we need to aggregate some of the job titles. In the code section below we will aggregate the job titles into 7 main categories. This results in categories that have enough samples for Bayesian modeling.
# Group similar job titles
titles = [['data scientist', 'data science', 'research', 'applied', 'specialist', 'ai', 'machine learning'],
          ['engineer', 'etl'],
          ['analyst', 'bi', 'business', 'product', 'modeler', 'analytics'],
          ['manager', 'head', 'director'],
          ['architect', 'cloud', 'aws'],
          ['lead/principal', 'lead', 'principal'],
          ]

# Aggregate job titles
job_title = df['job_title'].str.lower().copy()
df['job_title'] = 'Other'

# Store the new names: every title that contains one of the keywords gets the group name.
for t in titles:
    for name in t:
        df.loc[[name in x for x in job_title], 'job_title'] = t[0]
print(df['job_title'].value_counts())
# engineer 1654
# data scientist 1238
# analyst 902
# manager 158
# architect 118
# lead/principal 55
# Other 9
# Name: job_title, dtype: int64
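To make the effect of this aggregation on model complexity more tangible, here is a rough back-of-the-envelope calculation; the single parent node with five states is a hypothetical configuration, purely for illustration.
# Rough illustration: a CPT needs one cell per combination of child and parent states.
n_parent_states = 5   # hypothetical parent node with 5 states

# Before aggregation: 99 unique job titles
print(99 * n_parent_states)  # 495 cells to estimate from ~4134 samples

# After aggregation: 7 job title categories
print(7 * n_parent_states)   # 35 cells, leaving far more samples per cell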
The next pre-processing step is to rename some of the feature names. In addition, we will add a new feature that describes whether the company is located in the USA or Europe, and remove some redundant variables, such as salary_currency and salary.
# Rename categorical variables for better understanding
df['experience_level'] = df['experience_level'].replace({'EN': 'Entry-level', 'MI': 'Junior Mid-level', 'SE': 'Intermediate Senior-level', 'EX': 'Expert Executive-level / Director'}, regex=True)
df['employment_type'] = df['employment_type'].replace({'PT': 'Part-time', 'FT': 'Full-time', 'CT': 'Contract', 'FL': 'Freelance'}, regex=True)
df['company_size'] = df['company_size'].replace({'S': 'Small (less than 50)', 'M': 'Medium (50 to 250)', 'L': 'Large (>250)'}, regex=True)
df['remote_ratio'] = df['remote_ratio'].replace({0: 'No remote', 50: 'Partially remote', 100: '>80% remote'}, regex=True)

import numpy as np

# Add new feature: company located in the USA or in Europe
df['country'] = 'USA'
countries_europe = ['SM', 'DE', 'GB', 'ES', 'FR', 'RU', 'IT', 'NL', 'CH', 'CF', 'FI', 'UA', 'IE', 'GR', 'MK', 'RO', 'AL', 'LT', 'BA', 'LV', 'EE', 'AM', 'HR', 'SI', 'PT', 'HU', 'AT', 'SK', 'CZ', 'DK', 'BE', 'MD', 'MT']
df.loc[np.isin(df['company_location'], countries_europe), 'country'] = 'europe'

# Keep the continuous salary for discretization, then remove redundant variables
salary_in_usd = df['salary_in_usd']
df.drop(labels=['salary_currency', 'salary'], inplace=True, axis=1)
As a final step, we need to discretize salary_in_usd, which can be done manually or using the discretize function in bnlearn. For demonstration purposes, let’s do both. For the latter case, we assume that salary depends on experience_level and on country. Read this blog [6] for more details. Based on these input variables, the salary is then partitioned into bins (see the code section below).
# Discretize the salary feature.
discretize_method = 'manual'

import bnlearn as bn

if discretize_method == 'manual':
    # Discretize manually into salary bins
    df['salary_in_usd'] = None
    df.loc[salary_in_usd < 80000, 'salary_in_usd'] = '<80K'
    df.loc[np.logical_and(salary_in_usd >= 80000, salary_in_usd < 100000), 'salary_in_usd'] = '80-100K'
    df.loc[np.logical_and(salary_in_usd >= 100000, salary_in_usd < 160000), 'salary_in_usd'] = '100-160K'
    df.loc[np.logical_and(salary_in_usd >= 160000, salary_in_usd < 250000), 'salary_in_usd'] = '160-250K'
    df.loc[salary_in_usd >= 250000, 'salary_in_usd'] = '>250K'
else:
    # Discretize automatically but with prior knowledge.
    tmpdf = df[['experience_level', 'salary_in_usd', 'country']]
    # Create edges
    edges = [('experience_level', 'salary_in_usd'), ('country', 'salary_in_usd')]
    # Create DAG based on edges
    DAG = bn.make_DAG(edges)
    bn.plot(DAG)
    # Discretize the continuous column
    df_disc = bn.discretize(tmpdf, edges, ["salary_in_usd"], max_iterations=1)
    # Store
    df['salary_in_usd'] = df_disc['salary_in_usd']

# Print the distribution over the salary bins
print(df['salary_in_usd'].value_counts())
The Final DataFrame
The final data frame has 10 features with 4134 samples. Each feature is categorical with two or more states. This data frame will be the input for learning the structure and determining the causal DAG.
# work_year experience_level ... country salary_in_usd
# 0 2023 Junior Mid-level ... USA >250K
# 1 2023 Intermediate Senior-level ... USA 160-250K
# 2 2023 Intermediate Senior-level ... USA 100-160K
# 3 2023 Intermediate Senior-level ... USA 160-250K
# 4 2023 Intermediate Senior-level ... USA 100-160K
# ... ... ... ... ...
# 4129 2020 Intermediate Senior-level ... USA >250K
# 4130 2021 Junior Mid-level ... USA 100-160K
# 4131 2020 Entry-level ... USA 100-160K
# 4132 2020 Entry-level ... USA 100-160K
# 4133 2021 Intermediate Senior-level ... USA 60-100K
#
# [4134 rows x 10 columns]
Bayesian Structure Learning to estimate the DAG.
At this point, we have pre-processed the data set and we are ready to learn the causal structure. There are six algorithms implemented in bnlearn that can help with this task. We need a method that does not require a target variable and can handle many categories. The available search strategies are:
- The hillclimbsearch algorithm is a heuristic search method. It starts with an empty network and iteratively adds or removes edges based on a scoring metric. The algorithm explores different network structures and selects the one with the highest score.
- The exhaustivesearch performs an exhaustive search over all possible network structures to find the optimal Bayesian network. It evaluates and scores each structure based on a specified scoring metric. While this method guarantees finding the best network structure, it can be computationally expensive for large networks due to the exponential growth of possibilities.
- The constraintsearch incorporates user-specified constraints or expert knowledge into the structure learning process of a Bayesian network. It uses these constraints to guide the search and restrict the space of possible network structures, ensuring that the learned network adheres to the specified constraints.
- The chow-liu algorithm is a method for learning the structure of a tree-structured Bayesian network. It calculates the mutual information between each pair of variables and constructs a tree by greedily selecting the edges that maximize the total mutual information of the network. This algorithm is efficient and widely used for learning the structure of discrete Bayesian networks but requires setting a root node.
- The naivebayes algorithm assumes that all features in a dataset are conditionally independent given the class variable. It learns the conditional probability distribution of each feature given the class and uses Bayes theorem to calculate the posterior probability of the class given the features. Despite its naive assumption, this algorithm is often used in classification tasks and can be efficient for large datasets.
- The TAN (Tree-Augmented Naive Bayes) algorithm is an extension of the naive Bayes algorithm that allows for dependencies among the features, given the class variable. It learns a tree structure that connects the features and uses this structure to model the conditional dependencies. TAN combines the simplicity of naive Bayes with some modeling power, making it a popular choice for classification tasks with correlated features. This method requires setting a class node.
The scoring types BIC, K2, BDS, AIC, and BDEU are used to evaluate and compare different network structures. As an example, BIC balances model complexity and data fit, while the others consider different types of prior probabilities. In addition, the independence test prunes spurious edges from the model. In our use case, I will use the hillclimbsearch method with scoring type BIC for structure learning. We will not define a target variable but let bnlearn decide the entire causal structure of the data itself.
# Structure learning
model = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')
# independence test
model = bn.independence_test(model, df, prune=False)
# Parameter learning to learn the CPTs. This step is required to make inferences.
model = bn.parameter_learning.fit(model, df, methodtype="bayes")
# Plot
bn.plot(model, title='Salary data set')
bn.plot(model, interactive=True, title='method=hc and score=bic')
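After parameter learning, the learned model can be inspected directly; a brief sketch of two ways to do this with bnlearn:
# Inspect the conditional probability table (CPT) of each node
bn.print_CPD(model)

# The adjacency matrix of the learned DAG is stored in the model dictionary
print(model['adjmat'])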


Chat With Your Data Set.
With the learned DAG (Figures 1 and 2) and the estimated CPTs, we can make inferences using do-calculus (see the code sections below). Let’s start asking questions. Note that the results can change (slightly) due to the stochastic components in the model.
Question 1.
What is the probability of a job title given that you work in a large company? P(job_title | company_size=Large (>250))
After running the code section below, we can see that engineer is the most likely outcome (P=0.34), followed by data scientist (P=0.26).
query = bn.inference.fit(model, variables=['job_title'],
                         evidence={'company_size': 'Large (>250)'})
# +----+----------------+-----------+
# | | job_title | p |
# +====+================+===========+
# | 0 | Other | 0.031616 |
# +----+----------------+-----------+
# | 1 | analyst | 0.209212 |
# +----+----------------+-----------+
# | 2 | architect | 0.0510425 |
# +----+----------------+-----------+
# | 3 | data scientist | 0.265006 |
# +----+----------------+-----------+
# | 4 | engineer | 0.343216 |
# +----+----------------+-----------+
# | 5 | lead/principal | 0.0407967 |
# +----+----------------+-----------+
# | 6 | manager | 0.0591106 |
# +----+----------------+-----------+
Question 2.
What is the probability of a salary range given full-time employment, partially remote work, a data scientist role at entry level, and residence in Germany (DE)?
In the results below we can see our five salary categories, for which the strongest posterior probability, P=0.7, is a salary of <80K under these conditions. Note that the other salary ranges also occur, but less frequently.
By changing the variables and evidence we can ask all kinds of questions. For example, we can now change experience level, residence, job title, etc, and determine how the probabilities are changing.
query = bn.inference.fit(model,
                         variables=['salary_in_usd'],
                         evidence={'employment_type': 'Full-time',
                                   'remote_ratio': 'Partially remote',
                                   'job_title': 'data scientist',
                                   'employee_residence': 'DE',
                                   'experience_level': 'Entry-level'})
# +----+-----------------+-----------+
# | | salary_in_usd | p |
# +====+=================+===========+
# | 0 | 100-160K | 0.0664068 |
# +----+-----------------+-----------+
# | 1 | 160-250K | 0.0424349 |
# +----+-----------------+-----------+
# | 2 | 80-100K | 0.117463 |
# +----+-----------------+-----------+
# | 3 | <80K | 0.707087 |
# +----+-----------------+-----------+
# | 4 | >250K | 0.0666078 |
# +----+-----------------+-----------+
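As mentioned above, the evidence can be changed freely to explore other scenarios. A sketch of such a follow-up query is shown below; the output is omitted because the probabilities depend on the learned model, and the category labels are the ones created during pre-processing.
# Same question, but now for a senior-level data scientist working fully remote.
query = bn.inference.fit(model,
                         variables=['salary_in_usd'],
                         evidence={'employment_type': 'Full-time',
                                   'remote_ratio': '>80% remote',
                                   'job_title': 'data scientist',
                                   'experience_level': 'Intermediate Senior-level'})
print(query)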
Final words.
In this blog, we learned how to create a Bayesian model and how to ask questions of a mixed data set using inferences with do-calculus. With bnlearn it becomes straightforward to set up such models, and the models offer understandable and explainable results that can easily be embedded in data science pipelines.
Be Safe. Stay Frosty.
Cheers E.
References
- Taskesen, E. (2020). Learning Bayesian Networks with the bnlearn Python Package (Version 0.3.22) [Computer software].
- Taskesen, E. (2021). A Step-by-Step Guide in detecting causal relationships using Bayesian Structure Learning in Python. Medium.
- Taskesen, E. (2021). A step-by-step guide in designing knowledge-driven models using Bayesian theorem. Medium.
- Taskesen, E. (2023). The Power of Bayesian Causal Inference: A Comparative Analysis of Libraries to Reveal Hidden Causality in Your Dataset. Medium.
- https://ai-jobs.net/salaries/download/ (CC0: Public Domain)
- Brodersen, K.H., et al. (2015). Inferring causal impact using Bayesian structural time-series models. Annals of Applied Statistics, 9(1), 247–274.
- Taskesen, E. (2023). Create and Explore the Landscape of Roles and Salaries in Data Science. Medium.