IN-DEPTH ANALYSIS
Welcome to the promising age of data science and machine learning platforms! These platforms have the advantage of being usable right out of the box, so teams can start analyzing data pretty much from day one. With open-source tools, you need to assemble a lot of the parts by hand, so to speak, and as anyone who has ever done a DIY project can attest, it's often much easier in theory than in practice. Choosing a data science and ML platform wisely (meaning one that is flexible and allows for the incorporation and continued use of open source) can deliver the best of both worlds in the enterprise: cutting-edge open source technology and accessible, governable control over data projects.

Ultimately, data science and ML platforms are about time. That means time savings in all parts of the process (from connecting to data to building ML models to deployment), of course. But it's also about easing the burden of getting started in AI and allowing businesses to dive in now, rather than waiting years for the technology to shore up and the world of AI to become clearer (because, spoiler alert: that may never happen). Getting started on the AI journey is intimidating, but data science and ML platforms can ease that burden and provide a framework that allows companies to learn as they go.
Saturn Cloud
Saturn Cloud allows data scientists to easily provision and host their work in the cloud, without the need for specialized DevOps. Data scientists can then work within a Jupyter Notebook hosted on the server(s) you specify and created by the system. All of the setup for software, networking, security, and libraries is handled automatically by Saturn Cloud. Data scientists can then focus on the actual data science and not the tedious infrastructure work that surrounds it.
You can also share your Jupyter Notebooks with the public or with team members using links. This eliminates the need to understand GitHub for basic data science projects. If you do know how to use GitHub, the platform still offers a fast and convenient way to test and develop code with others. As a result, data scientists can focus on data science and finish their projects more quickly than they otherwise would.

This is the final part of a 3-part article series where I give an example of how I use Saturn Cloud to work on the Instacart Market Basket Analysis challenge. In part 1, I explored the dataset to understand customer shopping behavior on the Instacart platform in fine-grained detail. In part 2, I segmented the customers to figure out what factors separate them from one another.
Motivation
I was looking to run association analysis in Python using the Apriori algorithm to derive rules of the form {A} -> {B}. However, I quickly discovered that it's not part of the standard Python machine learning libraries. Although some implementations exist, I could not find one capable of handling large datasets. "Large" in my case was an orders dataset with 32 million records, containing 3.2 million unique orders and about 50K unique items (file size just over 1 GB).
Therefore, I decided to implement the algorithm myself to generate those simple {A} -> {B} association rules. Since I only care about understanding relationships between any given pair of items, using Apriori to get to item sets of size 2 is sufficient. I went through various iterations, splitting the data into multiple subsets just so I could get functions like crosstab and combinations to run on my machine with less than 10 GB of memory. But even with this approach, I could only process about 1,800 items before my kernel would crash... And that's when I learned about the wonderful world of Python generators.

Python Generators
In a nutshell, a generator is a special type of function that returns an iterator over a sequence of values. However, unlike regular functions that return all of their values at once, a generator yields one value at a time. To get the next value, we must ask for it, either by explicitly calling the built-in next() function on the generator, or implicitly via a for loop.
This is a great property of generators because it means that we don't have to store all of the values in memory at once. We can load and process one value at a time, discard it when we're finished, and move on to the next. This feature makes generators perfect for creating item pairs and counting their frequency of co-occurrence.
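As a quick illustration (a minimal sketch, not code from the original notebook), here is a tiny generator and the two ways of pulling values from it:

def count_up_to(n):
    # Yields the integers 1..n one at a time instead of building a full list
    i = 1
    while i <= n:
        yield i
        i += 1

gen = count_up_to(3)
print(next(gen))   # 1 (explicitly ask for the next value)
print(next(gen))   # 2
for value in gen:  # a for loop implicitly calls next() until the generator is exhausted
    print(value)   # 3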
Here’s a concrete example of what we’re trying to accomplish:
- To get all possible item pairs for a given order
order 1: apple, egg, milk --> item pairs: {apple, egg}, {apple, milk}, {egg, milk}
order 2: egg, milk --> item pairs: {egg, milk}
- To count the number of times each item pair appears
eg: {apple, egg}: 1
{apple, milk}: 1
{egg, milk}: 2
Here’s the generator that implements the above tasks:
from itertools import groupby, combinations

def get_item_pairs(order_item):
    # order_item is an iterable of (order_id, item) tuples, sorted by order_id
    # For each order, generate a list of items in that order
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]
        # For each item list, generate item pairs, one at a time
        for item_pair in combinations(item_list, 2):
            yield item_pair
The get_item_pairs() function above generates a list of items for each order and yields the item pairs for that order, one pair at a time.
- The first item pair is passed to Counter which keeps track of the number of times an item pair occurs.
- The next item pair is taken, and again, passed to Counter.
- This process continues until there are no more item pairs left.
- With this approach, we end up not using much memory, as item pairs are discarded after the count is updated (see the sketch below).
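To make that flow concrete, here is a small usage example that feeds the get_item_pairs() generator defined above into Counter, using the two toy orders from the example (the tuple representation of the orders is my assumption):

from collections import Counter

# Toy data matching the example above: (order_id, item) tuples, sorted by order_id
orders = [(1, 'apple'), (1, 'egg'), (1, 'milk'),
          (2, 'egg'), (2, 'milk')]

# Counter consumes the generator one pair at a time, so the full list
# of pairs never needs to be held in memory
pair_counts = Counter(get_item_pairs(orders))
print(pair_counts)
# Counter({('egg', 'milk'): 2, ('apple', 'egg'): 1, ('apple', 'milk'): 1})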
Apriori Algorithm
Apriori is an algorithm for frequent itemset mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those itemsets appear sufficiently often in the database.
Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

Apriori uses a breadth-first search and a Hash tree structure to count candidate item sets efficiently. It generates candidate itemsets of length k from itemsets of length (k – 1). Then it prunes the candidates which have an infrequent subpattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent itemsets among the candidates.
Here’s an example of Apriori in action, assuming a minimum occurrence threshold of 3:
order 1: apple, egg, milk
order 2: carrot, milk
order 3: apple, egg, carrot
order 4: apple, egg
order 5: apple, carrot
Iteration 1: Count the number of times each item occurs
item set    occurrence count
{apple}     4
{egg}       3
{milk}      2
{carrot}    2
=> {milk} and {carrot} are eliminated because they do not meet the minimum occurrence threshold.
Iteration 2: Build item sets of size 2 using the remaining items from Iteration 1
item set       occurrence count
{apple, egg}   3
=> Only {apple, egg} remains and the algorithm stops since there are no more items to add.
If we had more orders and items, we could continue to iterate, building item sets consisting of more than 2 elements. For the problem we are trying to solve (i.e., finding relationships between pairs of items), it suffices to implement Apriori to get to item sets of size 2, as the sketch below illustrates.
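Here is a minimal sketch of those two iterations in Python (a hypothetical implementation assuming orders are lists of items, not the code used on the full dataset):

from collections import Counter
from itertools import combinations

def apriori_pairs(orders, min_count=3):
    # Iteration 1: count individual items and keep only the frequent ones
    item_counts = Counter(item for order in orders for item in order)
    frequent = {item for item, count in item_counts.items() if count >= min_count}
    # Iteration 2: count pairs built only from the frequent items
    pair_counts = Counter(pair for order in orders
                          for pair in combinations(sorted(set(order) & frequent), 2))
    return {pair: count for pair, count in pair_counts.items() if count >= min_count}

orders = [['apple', 'egg', 'milk'],
          ['carrot', 'milk'],
          ['apple', 'egg', 'carrot'],
          ['apple', 'egg'],
          ['apple', 'carrot']]
print(apriori_pairs(orders))  # {('apple', 'egg'): 3}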
Association Rules Mining
Once the item sets have been generated using apriori, we can start mining association rules. Given that we are only looking at item sets of size 2, the association rules we will generate will be of the form {A} -> {B}. One common application of these rules is in the domain of recommender systems, where customers who purchased item A are recommended item B.

Here are 3 key metrics to consider when evaluating association rules:
1 – Support
This is the percentage of orders that contain the item set. In the example above, there are 5 orders in total and {apple, egg} occurs in 3 of them, so:
support{apple,egg} = 3/5 or 60%
The minimum support threshold required by apriori can be set based on knowledge of your domain. In this grocery dataset for example, since there could be thousands of distinct items and an order can contain only a small fraction of these items, setting the support threshold to 0.01% may be reasonable.
2 – Confidence
Given two items, A and B, confidence measures the percentage of times that item B is purchased, given that item A was purchased. This is expressed as:
confidence{A->B} = support{A,B} / support{A}
Confidence values range from 0 to 1, where 0 indicates that B is never purchased when A is purchased, and 1 indicates that B is always purchased whenever A is purchased. Note that the confidence measure is directional. This means that we can also compute the percentage of times that item A is purchased, given that item B was purchased:
confidence{B->A} = support{A,B} / support{B}
In our example, the percentage of times that egg is purchased, given that apple was purchased, is:
confidence{apple->egg} = support{apple,egg} / support{apple}
= (3/5) / (4/5)
= 0.75 or 75%
Here we see that 75% of the orders containing apple also contain egg. Running the numbers in the other direction, confidence{egg->apple} = (3/5) / (3/5) = 1, so all of the orders that contain egg also contain apple. But does this mean that there is a relationship between these two items, or are they occurring together in the same orders simply by chance? To answer this question, we look at another measure that takes into account the popularity of both items.
3 – Lift
Given two items, A and B, lift indicates whether there is a relationship between A and B, or whether the two items are occurring together in the same orders simply by chance (ie: at random). Unlike the confidence metric, whose value may vary depending on direction (eg: confidence{A->B} may be different from confidence{B->A}), lift has no direction. This means that lift{A,B} is always equal to lift{B,A}:
lift{A,B} = lift{B,A} = support{A,B} / (support{A} * support{B})
In our example, we compute lift as follows:
lift{apple,egg} = lift{egg,apple}
= support{apple,egg} / (support{apple} * support{egg})
= (3/5) / (4/5 * 3/5)
= 1.25
One way to understand lift is to think of the denominator as the likelihood that A and B appear in the same order if there were no relationship between them. In the example above, apple occurred in 80% of the orders and egg occurred in 60% of the orders, so if there were no relationship between them, we would expect both to show up together in the same order 48% of the time (ie: 80% * 60%). The numerator, on the other hand, represents how often apple and egg actually appear together in the same order: in this example, 60% of the time. Dividing the numerator by the denominator tells us how many more times apple and egg actually appear in the same order than we would expect if there were no relationship between them (ie: if they occurred together simply at random).
In summary, lift can take on the following values:
- lift = 1 implies no relationship between A and B. (ie: A and B occur together only by chance)
- lift > 1 implies that there is a positive relationship between A and B. (ie: A and B occur together more often than random)
- lift < 1 implies that there is a negative relationship between A and B. (ie: A and B occur together less often than random)
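Putting the three metrics together, here is a small sketch on the toy orders (a hypothetical helper of my own, not the notebook code):

def pair_metrics(orders, a, b):
    # Support, confidence (both directions), and lift for the pair {a, b}
    n = len(orders)
    support_a = sum(a in order for order in orders) / n
    support_b = sum(b in order for order in orders) / n
    support_ab = sum(a in order and b in order for order in orders) / n
    confidence_a_to_b = support_ab / support_a  # P(B purchased | A purchased)
    confidence_b_to_a = support_ab / support_b  # P(A purchased | B purchased)
    lift = support_ab / (support_a * support_b)
    return support_ab, confidence_a_to_b, confidence_b_to_a, lift

orders = [['apple', 'egg', 'milk'],
          ['carrot', 'milk'],
          ['apple', 'egg', 'carrot'],
          ['apple', 'egg'],
          ['apple', 'carrot']]
print(pair_metrics(orders, 'apple', 'egg'))  # approximately (0.6, 0.75, 1.0, 1.25)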
In this example, apple and egg occur together 1.25 times more often than random, so I conclude that there is a positive relationship between them. Armed with the knowledge of Apriori and association rules mining, let's dive into the data and code to see what relationships I can uncover!
Association Rules Mining on Instacart Data
Getting started with a Jupyter Notebook in the cloud is very intuitive using Saturn Cloud. Once a notebook is running, I can easily share it with the public from within the notebook. To demonstrate this, you can view the complete code from my Association Rules Mining notebook on Saturn Cloud here.
After the data pre-processing steps, I wrote a couple of helper functions to assist the main association rules function:
- freq() returns the frequency counts for items and item pairs.
- order_count() returns the number of unique orders.
- get_item_pairs() returns the generator that yields item pairs, one at a time.
- merge_item_stats() returns the frequency and support associated with the item.
- merge_item_name() returns the name associated with the item.
The big association_rules() function is displayed in the GitHub Gist below:
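In rough outline, the function ties together the helpers described above: count items and pairs, filter by minimum support, and attach the three metrics. Here is a hedged, hypothetical sketch of that shape (not the exact code from the Gist; the function and column names are my own):

import pandas as pd
from collections import Counter
from itertools import combinations, groupby

def association_rules_sketch(order_item, min_support):
    # order_item: (order_id, item) tuples sorted by order_id
    # min_support: minimum fraction of orders a pair must appear in (eg: 0.01)
    orders = [[item for _, item in grp]
              for _, grp in groupby(order_item, lambda x: x[0])]
    n_orders = len(orders)
    item_counts = Counter(item for order in orders for item in order)
    pair_counts = Counter(pair for order in orders
                          for pair in combinations(sorted(set(order)), 2))
    rows = []
    for (a, b), count in pair_counts.items():
        support_ab = count / n_orders
        if support_ab < min_support:
            continue
        support_a = item_counts[a] / n_orders
        support_b = item_counts[b] / n_orders
        rows.append({'item_A': a, 'item_B': b, 'support': support_ab,
                     'confidence_A_to_B': support_ab / support_a,
                     'confidence_B_to_A': support_ab / support_b,
                     'lift': support_ab / (support_a * support_b)})
    return pd.DataFrame(rows).sort_values('lift', ascending=False)

For readability, this sketch holds everything in memory; a memory-conscious version would stream pairs through a generator like get_item_pairs() instead, which is what makes the 32-million-record dataset tractable.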
At this point, I was able to leverage Saturn Cloud's GPU capability for my workloads. Saturn Cloud also enables me to deploy a Spark or Dask cluster with just one click, making distributed computing available for very expensive computations. Considering that this association rules mining algorithm is quite expensive, easy access to that kind of compute power is a clear winner here.
After calling the association_rules() function on the orders data frame with a minimum support of 0.01, I got the results below (7 samples of the mined rules):

From the output above, I observe that the top associations are not surprising, with one flavor of an item being purchased with another flavor from the same item family (eg: Strawberry Chia Cottage Cheese with Blueberry Acai Cottage Cheese, Chicken Cat Food with Turkey Cat Food, etc). As mentioned, one common application of association rules mining is in the domain of recommender systems. Once item pairs have been identified as having a positive relationship, recommendations can be made to customers in order to increase sales. And hopefully, along the way, we can also introduce customers to items they never would have tried before or even imagined existed!
Conclusion
DevOps can be really difficult when trying to get data science group projects off the ground. Behind the scenes, a lot goes into setting up the platform that data scientists use for their work: creating servers, installing the necessary software and environments, setting up security protocols, and the like. Hosting Jupyter Notebooks with Saturn Cloud, which also takes care of versioning and lets you scale in or out as needed, can tremendously simplify your life, shorten your time to market, and reduce both cost and the need for expert cloud skills.
I hope you enjoyed this 3-part series on the benefits of using Saturn Cloud, told through the Instacart Market Basket Analysis challenge. As part of the data science and machine learning platform movement, frameworks like Saturn Cloud can open the door to true data innovation when teams don't have to spend precious time on administrative, organizational, or repeated tasks. The reality is, in the age of AI, businesses of any size can't afford to work without a data science platform that enables and elevates not just their data science team, but the entire company, to the highest level of data competence for the greatest possible impact.