
This article might interest several audiences:
- Data scientists curious about Dataiku;
- Students of applied data science & machine learning;
- People learning Python.
Jupyter Notebooks have been the de facto tool for data science prototyping for years; however, they require a certain level of Python coding and data science experience.
Dataiku's Data Science Studio (DSS) platform offers simple-to-use visual recipes for data preparation alongside a suite of AutoML capabilities. Non-coders can import and clean data, and train and launch ML models into production – all in a GUI environment without writing a single line of code.
This can be both good and bad, but I’ll leave that for another day.
Update: I now post analytics tutorials on YouTube.
Platforms like Dataiku DSS have dramatically democratised data analytics and machine learning at the enterprise level. This is not dissimilar to how Microsoft Windows opened up personal computing to the masses after years of the shell-based MS-DOS.
Dataiku – now a unicorn startup worth over $1 billion – is backed by a consortium of investors, including CapitalG, Google's venture capital arm.
I use Jupyter on a regular basis for personal and professional data science work and Dataiku in my job at a large financial services company.
To keep this taste-test comparison concise and focused, we’ll run through some standard steps in the data science workflow on a classification problem.
- Problem formulation
- Importing data
- Exploratory Data Analysis (EDA)
- Data preparation
- Feature engineering
- Modelling
- Deployment
We’ll use the well-known Titanic dataset available on Kaggle. A free version of Dataiku DSS can be downloaded from Dataiku’s website. The latest versions of JupyterLab & Jupyter Notebook ship with [Anaconda](https://www.anaconda.com/products/individual).
New to AI or ML? Check out my explainer articles [here](https://medium.com/swlh/differential-equations-versus-machine-learning-78c3c0615055) and here.
1. Formulating a Problem
Your data science analysis should be driven by a well-stated aim or problem.
In industry, the majority of a data scientist’s time is spent clarifying a business problem, procuring the right data and preparing it for modelling.
Here’s an example of a business problem relevant to the Titanic dataset:
"As a PR executive for White Star Line, I need to prevent deaths at all costs."
White Star Line was RMS Titanic’s owner. A problem statement like this implies we wish to train a high precision model with the target variable passenger survival.
High precision models minimise false positive predictions (Type 1 errors), while high recall models minimise false negatives (Type 2 errors).
For the Titanic, predicting a passenger will survive and being wrong is unacceptable, both ethically and professionally – hence the need to maximise precision over recall.
I’ve written a dedicated article on popular ML metrics here.
Here’s another example of a business problem:
"I want to identify the factors most responsible for survival on the Titanic."
In this situation, we’re more interested in the list of feature importances after our model training and validation.
2. Importing Data
Jupyter Notebooks
We typically start by importing some standard Python Data Science packages.
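A typical first cell might look something like this (a minimal sketch of the usual imports rather than the exact cell used here):

```python
# Standard data science stack: dataframes, numerics and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```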

We can then import a .csv using pandas.
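Along these lines, assuming the file is saved locally as titanic_dataset.csv (the same file uploaded to Dataiku later):

```python
# Load the Kaggle Titanic training data into a DataFrame
df = pd.read_csv('titanic_dataset.csv')
```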

Dataiku DSS
In DSS, I first create a blank new project.

I can then import data from a world of options.

After selecting Upload your files, I can now drag and drop files, including my titanic_dataset.csv.

3. Exploratory Analysis
Jupyter Notebooks
Here’s what the data looks like.
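For instance, a quick preview and shape check (a sketch; df is the DataFrame loaded above):

```python
# Preview the first few rows and check the dataset dimensions
df.head()
df.shape  # (891, 12)
```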


We’ve got 891 passengers with 12 attributes.

Let’s check the data types and get an idea of the missing values in each column.
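Something like the following does the trick:

```python
# Column data types and non-null counts
df.info()

# Number of missing values in each column
df.isnull().sum()
```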

Our target variable is Survived. This is what our Machine Learning model will predict.
Distribution-wise, our dataset is mildly imbalanced with 38% of the passengers making it out alive.
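A quick way to check this (a sketch):

```python
# Class balance of the target: roughly 62% perished, 38% survived
df['Survived'].value_counts(normalize=True)
```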

Extremely imbalanced datasets are notorious in the health industry, where there are often very few ‘infected’ people relative to those uninfected. This necessitates mindful use of sampling and the careful selection of performance metrics.
Anyway, back to the Titanic.
Let’s check out the distributions and correlations for the 7 numerical attributes.
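In a notebook this would typically be a couple of lines, for example:

```python
# Histograms of the numerical attributes
df.hist(figsize=(10, 8))
plt.show()

# Correlation heatmap of the numerical attributes
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm')
plt.show()
```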




Let’s check out the number of classes for each of the 5 categorical attributes.
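For example (a sketch):

```python
# Number of distinct classes per categorical attribute
df.select_dtypes('object').nunique()

# Class distributions for the low-cardinality categoricals
for col in ['Sex', 'Embarked']:
    print(df[col].value_counts(), '\n')
```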

The Cabin and Ticket attributes are noisy, with hundreds of classes, many of which contain only a handful of rows.
Meanwhile, Embarked and Sex each have only a few classes.



It seems most passengers boarded from Southampton (S).
Moreover, the majority of Titanic’s passengers were men.
That’s interesting. Those with domain knowledge on the Titanic’s fateful journey may recall that the majority of survivors were women.

This is probably because, in the competition for the limited number of lifeboats, women and children were given priority.
All of this means we should be careful. One could train a model with high accuracy just by predicting all women survive and men do not!
Dataiku DSS
After drag-and-dropping the csv into Dataiku, I can immediately explore the dataset in a GUI-environment.

The bits automatically flagged in red tell me that there might be data integrity issues. I can also easily see which rows have missing data.
Clicking the Column view button near the top right corner gives me a rundown of my attributes, their data types and where data is missing.

I can go to the Charts and Statistics tabs to perform a wide range of descriptive and visual analytics on my dataset.

We can create a Univariate analysis ‘card’ that will display something similar to .hist() and .describe().

Looks great, doesn’t it!

Let’s also create a Correlation matrix card to reproduce the seaborn correlation heatmap in our Jupyter Notebook.

You can do a lot of EDA very quickly in Dataiku’s GUI, none of which requires coding.
If you do happen to prefer coding, Dataiku offers full Jupyter Notebooks support.


Need to showcase your visual findings in a presentation to stakeholders?
Make sure to leverage data storytelling techniques – an integral career skill for delivering impact and selling the value of your work.
Insights don’t sell themselves.
4. Data Preparation
Jupyter Notebooks
Recall that the Age, Cabin and Embarked attributes have missing data.

Let’s clean this up. We’ll replace the NaNs in the categorical attributes Cabin and Embarked with ‘Missing’. We’ll also create an indicator variable to track which passengers had a missing Age, while replacing those NaNs with 0 for the algorithms.
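A sketch of the cleanup, with the indicator column name (Age_missing) chosen here for illustration:

```python
# Replace missing categorical values with an explicit 'Missing' class
for col in ['Cabin', 'Embarked']:
    df[col] = df[col].fillna('Missing')

# Indicator variable flagging which passengers had a missing Age,
# then fill those gaps with 0 so the algorithms can handle the column
df['Age_missing'] = df['Age'].isnull().astype(int)
df['Age'] = df['Age'].fillna(0)
```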

We need to create dummy variables for our categorical variables. I’ll also drop the Name column from our feature set, since all passenger names are unique.
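Roughly along these lines:

```python
# Drop Name (every value is unique) and one-hot encode the remaining categoricals
df = df.drop('Name', axis=1)
df = pd.get_dummies(df, columns=['Sex', 'Ticket', 'Cabin', 'Embarked'])

df.shape  # the sparse Ticket and Cabin classes blow this out to 800+ columns
```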

Note that since Cabin and Ticket have hundreds of classes between them, our one-hot-encoding has bloated our dataset up to over 800 columns.

Generally one should consolidate sparse classes and have categorical features with a good number of rows in each class, but we’ll keep things simple here.
Dataiku DSS
In Jupyter, my workflow manifests as a set of cells in a notebook. In DSS, I’m constructing a series of recipes on a flow, which is a visual map of my logic and pipelines. Recipes can be either Visual recipes or Code recipes.
Let’s start with a visual recipe called Prepare. Visual recipes are easy to use and don’t require coding. They are great for non-programmers or users who want to quickly prototype ideas without spending hours on coding.

In my initial Prepare recipe, I’ll clean up the missing rows in my categorical attributes Cabin and Embarked. Note that the actions I take are reflected in the dataset instantly. Great!

Next, I’ll make a Code recipe to clean up the missing rows under Age. This is essentially a Jupyter Notebook embedded right inside DSS!
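A DSS Python recipe is just a script that reads the input dataset, transforms it with pandas and writes the output. A minimal sketch, with the dataset names (titanic_prepared, titanic_cleaned) assumed for illustration:

```python
import dataiku
import pandas as pd

# Read the upstream dataset from the flow into a pandas DataFrame
input_ds = dataiku.Dataset("titanic_prepared")  # hypothetical dataset name
df = input_ds.get_dataframe()

# Same Age cleanup as in the notebook: missing-value flag plus fill
df['Age_missing'] = df['Age'].isnull().astype(int)
df['Age'] = df['Age'].fillna(0)

# Write the result back to the flow
output_ds = dataiku.Dataset("titanic_cleaned")  # hypothetical dataset name
output_ds.write_with_schema(df)
```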

Recall the final two steps were dropping Name from the dataset and creating dummy variables for the 4 remaining categorical features.
Well, it turns out I don’t have to do this inside Dataiku.
That’s because Dataiku automatically one-hot encodes categorical features for ML algorithms, and I can easily switch features on-and-off later during the modelling step. Cool!
Alright, so far, my flow has two recipes: a Visual Prepare recipe that cleaned up my categorical features and a Code Python recipe that cleaned up my numerical features.

5. Feature Engineering
Jupyter Notebooks
Better data almost always beats more complicated algorithms. One way to squeeze more out of your data is through the engineering of new features. You can go wild here, especially if you have domain expertise.
Here, I’ll engineer one extra feature, Totalfam, which is the total number of relatives for a given passenger. The hypothesis: perhaps the size of the family a passenger travels with has an effect on their survival?
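A one-liner in pandas, assuming the standard SibSp (siblings/spouses) and Parch (parents/children) columns:

```python
# Total number of relatives aboard for each passenger
df['Totalfam'] = df['SibSp'] + df['Parch']
```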

We now have a dataset ready for modelling. This is called an analytical base table (ABT).
Dataiku DSS
In Dataiku, I can do this with a visual Prepare recipe. Again, it’s nice that I can see my new feature reflected instantly in the dataset!

With the latest transformation attached, my flow now looks like this.

6. Modelling
Jupyter Notebooks
Let’s train some common classification algorithms on our Titanic ABT. First, we’ll split the dataset up into training and test sets.
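For example (the 80/20 split and random seed are choices for illustration):

```python
from sklearn.model_selection import train_test_split

# Separate the target from the features
y = df['Survived']
X = df.drop('Survived', axis=1)

# Hold out 20% of passengers as a test set, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```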

We’ll then set up some pipeline objects to tune and train a set of models all at once: regularised logistic regressions, random forests and gradient boosted trees.
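A sketch of the pipeline-and-grid-search pattern (the hyperparameter grids here are illustrative, not the exact ones used):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# One pipeline per algorithm: scale the features, then fit the classifier
pipelines = {
    'l1': make_pipeline(StandardScaler(), LogisticRegression(penalty='l1', solver='liblinear')),
    'l2': make_pipeline(StandardScaler(), LogisticRegression(penalty='l2')),
    'rf': make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42)),
    'gb': make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=42)),
}

# Hyperparameter grids, keyed by the pipeline step name
hypergrids = {
    'l1': {'logisticregression__C': [0.01, 0.1, 1, 10]},
    'l2': {'logisticregression__C': [0.01, 0.1, 1, 10]},
    'rf': {'randomforestclassifier__n_estimators': [100, 200]},
    'gb': {'gradientboostingclassifier__n_estimators': [100, 200]},
}

# Tune and fit every model with 5-fold cross-validation
fitted_models = {}
for name, pipeline in pipelines.items():
    model = GridSearchCV(pipeline, hypergrids[name], cv=5, n_jobs=-1)
    model.fit(X_train, y_train)
    fitted_models[name] = model
```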

Model evaluation comes next after training.
The following code outputs the scores for a variety of classification metrics for each tuned algorithm.
I’ve written a visual guide on popular ML metrics here.
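A loop over the fitted models along these lines would produce that output:

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Evaluate each tuned model on the held-out test set
for name, model in fitted_models.items():
    y_pred = model.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print('AUROC:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```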

For example, here’s the output for the tuned random forest.

As noted at the beginning of the article, predicting a passenger will survive and being wrong is a disaster. Thus we wish to minimise false positives, which is equivalent to maximising precision.
Our random forest gives a precision of 80%. Specifically, looking at the confusion matrix above, 52 of the 13 + 52 = 65 passengers in our test set that we predicted would survive actually did.
That’s not bad, but nowhere near good enough. If White Star Line used this model, they’d be leading 20% of predicted survivors to their deaths!
Moving on. If our priority is to identify the top predictors for survival, we might code up a plot ranking feature importances.
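One way to do it, pulling the importances out of the tuned random forest pipeline:

```python
# Extract the random forest step from the tuned pipeline
rf = fitted_models['rf'].best_estimator_.named_steps['randomforestclassifier']

# Rank the features by importance and keep the top 15
importances = (
    pd.Series(rf.feature_importances_, index=X_train.columns)
      .sort_values(ascending=False)
      .head(15)
)

# Horizontal bar chart of the top predictors
importances.sort_values().plot(kind='barh', figsize=(8, 6))
plt.xlabel('Feature importance')
plt.show()
```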


As we speculated earlier, the sex of passengers is a powerful predictor of survival on the Titanic. Notice that our engineered feature Totalfam was a reasonable predictor.
And as expected, the hundreds of sparse class features (for Cabin and Ticket) we obtained through the one-hot encoding of our categorical variables end up being just noise.
Dataiku DSS
ML is where DSS really shines. Everything I just did can be done by clicking buttons.

Grabbing my ABT – here I’ve called this df_feat_engineer – I’ll click the Lab button on the right. Let’s now create some ‘Quick Prototypes’ on our target Survived.

Dataiku immediately chooses some algorithms and features for you. You can click Train to get going straight away.

Digging into the Design, we can see the decisions Dataiku has made on our behalf.
It turned off PassengerId and Name, because these columns don’t bring any predictive power to the table, only noise.

Algorithm-wise, Dataiku chose Logistic Regression and Random Forest.

You know what, I’ll flick on the Gradient-boosted Trees and XGBoost too! Why not.
After training is complete, I get a summary of how my tuned models performed under a variety of metrics. For example, the GBT took the cake on the Area Under ROC score.

Similar to Jupyter, I can dig in to find more information, such as my confusion matrix stats and feature importances.


7. Deployment
Jupyter Notebooks
Trained scikit-learn models can be saved as a .pkl file.
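For example, with joblib (the filename here is just illustrative):

```python
import joblib

# Persist the tuned random forest pipeline to disk
joblib.dump(fitted_models['rf'].best_estimator_, 'titanic_rf.pkl')

# Later (or elsewhere), reload it and score new data
model = joblib.load('titanic_rf.pkl')
# predictions = model.predict(new_data)
```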

Individuals can use this to score new data on a batch-run basis, or machine learning engineers can plug it into an automated production pipeline in a larger organisation.
Dataiku DSS
In Dataiku, I can deploy any trained model into my flow by pressing Deploy in the top right corner above.

I can then use the model to score any new dataset imported into the project.
The project itself can be exported as a Dataiku bundle and put into a production environment by machine learning engineers.
At the moment, our platforms team is building a Dataiku path to production that will allow users across the organisation to prototype in Dataiku, then deploy their model into our Hadoop big data cluster with the click of a button.
Exciting stuff!
Find me on Twitter & YouTube [here](https://youtube.com/@col_shoots), [here](https://youtube.com/@col_invests) & here.
My Popular AI, ML & Data Science articles
- AI & Machine Learning: A Fast-Paced Introduction – here
- Machine Learning versus Mechanistic Modelling – here
- Data Science: New Age Skills for the Modern Data Scientist – here
- Generative AI: How Big Companies are Scrambling for Adoption – here
- ChatGPT & GPT-4: How OpenAI Won the NLU War – here
- GenAI Art: DALL-E, Midjourney & Stable Diffusion Explained – here
- Beyond ChatGPT: Search for a Truly Intelligence Machine – here
- Modern Enterprise Data Strategy Explained – here
- From Data Warehouses & Data Lakes to Data Mesh – here
- From Data Lakes to Data Mesh: A Guide to Latest Architecture – here
- Azure Synapse Analytics in Action: 7 Use Cases Explained – here
- Cloud Computing 101: Harness Cloud for Your Business – here
- Data Warehouses & Data Modelling – a Quick Crash Course – here
- Data Products: Building a Strong Foundation for Analytics – here
- Data Democratisation: 5 ‘Data For All’ Strategies – here
- Data Governance: 5 Common Pain Points for Analysts – here
- Power of Data Storytelling – Sell Stories, Not Data – here
- Intro to Data Analysis: The Google Method – here
- Power BI – From Data Modelling to Stunning Reports – here
- Regression: Predict House Prices using Python – here
- Classification: Predict Employee Churn using Python – here
- Python Jupyter Notebooks versus Dataiku DSS – here
- Popular Machine Learning Performance Metrics Explained – here
- Building GenAI on AWS – My First Experience – here
- Math Modelling & Machine Learning for COVID-19 – here
- Future of Work: Is Your Career Safe in Age of AI – here