
Mining the ‘2020 Kaggle Machine Learning & Data Science Survey’ data


This article applies the Apriori algorithm to the ‘2020 Kaggle Machine Learning & Data Science Survey’ data to find associations among the technologies used by the respondents. It assumes the reader has a working knowledge of the Apriori algorithm and its implementation in Python.

Photo by William Iven on Unsplash

The 2020 [Kaggle](https://www.kaggle.com/) Machine Learning & Data Science Survey was conducted online by Kaggle in October 2020 and, after data curation, had 20,036 responses. The survey asked 39+ questions covering the respondents’ demographics, the technologies they use for data science and machine learning (programming languages, IDEs, algorithms, libraries and cloud products), the technologies they plan to learn in the future, etc.

Kaggle also launched an annual Data Science Survey Challenge, running from November 19, 2020 to January 7, 2021, inviting the community to use this data to tell a story about the data science and machine learning community. In this article, we’ll mine association rules among the various technologies used by the survey respondents. Let’s get started with mining the ‘2020 Kaggle Machine Learning & Data Science Survey’ data. The complete code for this article can be accessed from my Kaggle Notebook.

Data import and preprocessing

We remove ‘None’ and ‘Other’ from the responses, as they carry no information and cause duplication across the dataset: for example, once the answers are pooled, we cannot tell a ‘None’ from the ‘Programming languages used’ question apart from a ‘None’ in the ‘IDEs used’ question.
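As a rough sketch of this step (assuming the multiple-choice answers are loaded into a pandas DataFrame where each cell holds the selected option text or NaN; the file name below is a placeholder, not necessarily the one used in my notebook):

```python
import numpy as np
import pandas as pd

# Placeholder file name: raw multiple-choice responses, one column per option,
# holding the option text (e.g. 'Python') when selected and NaN otherwise.
responses = pd.read_csv("kaggle_survey_2020_responses.csv", low_memory=False)

# 'None' and 'Other' carry no information for rule mining, so treat them as missing.
responses = responses.replace(["None", "Other"], np.nan)
```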

There are two responses, namely ‘MATLAB’ and ‘Shiny’, which appear in two different questions. ‘MATLAB’ is present in both the ‘Programming languages used’ and ‘IDEs used’ questions, while ‘Shiny’ is present in both the ‘Data visualization libraries used’ and ‘Products used to publicly share machine learning applications’ questions. Hence, we’ll replace ‘MATLAB’ under the ‘IDEs used’ question with ‘MATLAB IDE’ and ‘Shiny’ under the ‘Products used to publicly share machine learning applications’ question with ‘Shiny (Publicly share)’.
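One way to do this renaming, continuing from the previous snippet (the column lists below are hypothetical placeholders for whichever columns belong to those two questions in the raw file):

```python
# Hypothetical column names: substitute the actual columns of the 'IDEs used' and
# 'Products used to publicly share machine learning applications' questions.
ide_cols = ["Q9_Part_1", "Q9_Part_2"]        # placeholder column names
share_cols = ["Q36_Part_1", "Q36_Part_2"]    # placeholder column names

# Rename the duplicated options so they stay distinguishable after pooling.
responses[ide_cols] = responses[ide_cols].replace("MATLAB", "MATLAB IDE")
responses[share_cols] = responses[share_cols].replace("Shiny", "Shiny (Publicly share)")
```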

The data is stored in a data frame, but the ‘apriori’ method of the ‘apyori’ package in Python expects a list of lists instead of a data frame. Hence, we’ll convert the data frame into a list (let’s call it the ‘main list’) of lists, where each list inside the ‘main list’ represents one row/index of the data frame.
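Assuming only the technology-related answer columns remain in the DataFrame, the conversion can be sketched as:

```python
import pandas as pd

# Each inner list holds the technologies one respondent selected, with NaNs dropped.
main_list = [
    [item for item in row if pd.notna(item)]
    for row in responses.itertuples(index=False, name=None)
]
```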

A list in the ‘main list’ (Image by author)

Applying Apriori algorithm to the data

The data is processed and ready for association rule mining. We’ll set the ‘minimum support’ to 0.02, ‘minimum confidence’ to 0.6, ‘minimum lift’ to 3 and ‘maximum length’ to 2.
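With apyori, that call looks roughly like the following (apriori returns a generator of RelationRecord objects, so we materialise it into a list):

```python
from apyori import apriori

# Thresholds mentioned above: support 0.02, confidence 0.6, lift 3, rule length 2.
rules = apriori(
    main_list,
    min_support=0.02,
    min_confidence=0.6,
    min_lift=3,
    max_length=2,
)
results = list(rules)  # list of RelationRecord objects
```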

Top 5 records of rules generated (Image by author)

We’ll then keep the rules with support greater than 0.02, confidence greater than 0.8 and lift greater than 3, and frame them into sentences so that the reader can easily understand them.
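A sketch of this filtering step, using the fields apyori exposes on each RelationRecord (the sentence template is just one possible phrasing):

```python
# Keep rules above the stricter thresholds and phrase each one as a sentence.
sentences = []
for record in results:
    for stat in record.ordered_statistics:
        if not stat.items_base:
            continue  # skip rules with an empty antecedent
        if record.support > 0.02 and stat.confidence > 0.8 and stat.lift > 3:
            lhs = ", ".join(stat.items_base)
            rhs = ", ".join(stat.items_add)
            sentences.append(
                f"{stat.confidence:.0%} of respondents who use {lhs} also use {rhs} "
                f"(support = {record.support:.3f}, lift = {stat.lift:.2f})"
            )

for sentence in sentences[:5]:
    print(sentence)
```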

Image by author

These are a few of the important association rules generated from the Kaggle survey data. The high-confidence rules mostly sound obvious. However, a few rules (like rule 40) show a strong association between ‘Colab’ and ‘GitHub’ that was not expected prior to this association rule mining exercise. Further analysis by tuning the support, confidence, lift and max_length thresholds may uncover additional interesting insights.

Know more about my work at https://ksvmuralidhar.in/

