
A Data Science Template

A template for practicing data science with 6 simple steps.

Photo by Myriam Jessier on Unsplash

Practicing Data Science

Developing a skill is much easier if we are able to streamline the process of practicing it. For me, improving my Data Science skills involves weekly and sometimes daily practice of the different aspects of doing data science work: data collection, data cleaning and preprocessing, data visualization, modelling and more.

In this post, I will share my basic template for practicing data science with a practical example.


A Practice Cycle

One good way to simplify the cycle of practicing your data science skills is to clarify the steps that take you from data to insight. We can see it as a six-step process:

  1. Find a dataset containing some problem or challenge
  2. Pick a question about that dataset that can be answered with basic data science
  3. Clean up the data
  4. Do basic analysis on the data to answer the question you had initially
  5. Present the results in some standard format like a Jupyter notebook
  6. Reconsider if the results you presented indeed answer your question and what could be improved

Practical example

To illustrate these steps, let’s pick a dataset and go through each step mentioned before.

1. Find a dataset

The main issue I found with this step while practicing data science is actually deciding which dataset to use. This is a problem because we tend to either reuse the same dataset and get bored over time, or get overwhelmed by the number of potential datasets we could use.

To avoid these issues, let’s automate this part by using the Kaggle API and a simple rule to randomize our pick of a dataset, so that we won’t have to spend energy deciding and can get down to the actual coding more quickly.

I do this by following a principle of automating all decisions that can be automated effectively.

The point is to be aware of the time and energy you spend on decisions during your practice, and to check whether those can be automated without disregarding the learning potential involved in making such decisions.

Now, my process here is simple:

  • Pick a word related to something I am interested in (this time I chose learning)
  • Use the Kaggle API to search the available datasets involving this particular word and download one of them (a minimal sketch of scripting this pick is shown below)
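If you want to script the whole pick, here is a minimal sketch using the Kaggle Python API. The keyword list is just an illustration, and the attribute names on the returned dataset objects may vary between versions of the kaggle package:

# Rough sketch of the "randomized pick" idea with the Kaggle Python API.
# Assumes the kaggle package is installed and credentials are configured.
import random
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# hypothetical list of topics I care about; the choice of words is up to you
keyword = random.choice(["learning", "music", "climate", "football"])

# search datasets for the keyword and pick one of the top results at random
datasets = list(api.dataset_list(search=keyword))
pick = random.choice(datasets[:20])
print(keyword, pick.ref)

# download and unzip the chosen dataset into a local folder
api.dataset_download_files(pick.ref, path=keyword, unzip=True)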

It is not 100% random, but it is random enough that I don’t feel the friction of deciding where to start, which would make me less productive. The command to search for datasets matching a keyword with the Kaggle API is:

kaggle datasets list -s learning

The output:

ref                                                  title                                                size   lastUpdated          downloadCount  voteCount  usabilityRating
kaggle/kaggle-survey-2018                            2018 Kaggle Machine Learning & Data Science Survey   4MB    2018-11-03 22:35:07  14892          965        0.85294116
kaggle/kaggle-survey-2017                            2017 Kaggle Machine Learning & Data Science Survey   4MB    2017-10-27 22:03:03  22297          821        0.8235294
alopez247/pokemon                                    Pokémon for Data Mining and Machine Learning         715KB  2017-03-05 15:01:26  10102          234        0.85294116
rocki37/open-university-learning-analytics-dataset   Open University Learning Analytics Dataset           84MB   2018-10-27 22:30:00  3595           124        0.7058824
rohan0301/unsupervised-learning-on-country-data      Unsupervised Learning on Country Data                5KB    2020-06-17 07:45:45  3956           79         0.8235294
...

From the options that I got, I chose the open-university-learning-analytics-dataset. To download it, I used:

kaggle datasets download rocki37/open-university-learning-analytics-dataset 

Then I unzipped the data into a folder called learning_analytics:

unzip open-university-learning-analytics-dataset.zip -d learning_analytics

2. Pick a question

This part is tricky because you can either have a question and find a dataset to answer it, or you can have a dataset and discover a question while exploring it.

Let’s start by importing our dependencies, loading the data and selecting the relevant tables. Here, I am already starting from some knowledge about the data and which tables to use.
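A minimal version of this setup could look like the following, assuming the standard OULAD file names inside the learning_analytics folder:

# Sketch of the setup: import dependencies and load the three tables used below.
import pandas as pd
from scipy.stats import pearsonr

df_scores = pd.read_csv("learning_analytics/studentAssessment.csv")
df_assessments = pd.read_csv("learning_analytics/assessments.csv")
df_student_info = pd.read_csv("learning_analytics/studentInfo.csv")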

Now let’s take a look at the data:

df_scores
df_assessments
df_student_info

This dataset is the “Open University Learning Analytics Dataset”, an anonymised dataset on courses and virtual learning environment data. The tables I chose were:

  • Student’s performance: contains the scores of the students in each assessment or exam performed.
  • Student information: contains information about the students, such as the region they are from, how many credits they had studied up to taking the exam, whether or not they passed, and more.
  • Assessments: metadata about the assessments, like type, weight and id.

A detailed description of the dataset and each table can be found here.

Upon looking at the data I came up with a few questions:

  • What is the average score of the students?
  • What is the average score per region?
  • What is the distribution of passed and failed per assessment?
  • What is the ideal amount of time to submit the work to increase chances of passing the exam?
  • Is there a relationship between age and score performance?
  • Could I write a binary classifier that predicts if the student will pass given all of the available information about the student and the course?
  • Is there a relationship between the number of credits studied up to that point and the student’s performance?

As you can see, there are many possible questions that can be asked about a given dataset. So where should I start? Should I try to answer all of them? Should I start with the easiest ones?

My rule for this is simple: I pick the question or group of questions that best fits my practicing goals and the time available on a given day.

What I mean is, if I only want a quick practice session to keep basic data analysis fresh in my mind, I won’t pick a question that involves tasks that take too long, like writing a custom neural network or exploring complex machine learning models.

But, if on a given day I want to write something more interesting and complex, I will pick a question or group of questions that form a consistent narrative and allow me to practice more complicated modelling and analysis. It will all depend on you setting clear goals at the beginning!

The idea here is to pick a narrative that can be summarized by a question, and fill in the gaps of that narrative (like the building blocks of a story) with smaller questions that form a cohesive structure.

For the purpose of this article, let’s keep it simple. My question will be: What are the main factors responsible for score performance in a course?

3. Clean up the data

Now that I chose my question, I can start investigating the data to find the answers that I care about. In order to do that effectively, I want to clean the data and do some basic preprocessing.

The most important points here will be to remove NaNs and columns that will not contribute to the analysis, and to unify the tables into one that has all the columns with the relevant information. The helper functions I will use are:
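As an example of what I mean, a simple NaN counter could look like this (the name show_nans is my own; the original helper may differ):

# Sketch of a helper that prints the number of NaNs per column,
# in the format of the output shown below.
def show_nans(df, name):
    # print the name of the dataframe and the NaN count for each column
    print(name)
    for col in df.columns:
        print(f"Column: {col}")
        print(f"Number of NaNs: {df[col].isna().sum()}")
    print("***")

# e.g. show_nans(df_student_info, "df_student_info"), and so on for each table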

Let’s start by finding how many NaN entries we have in each table:

# Output:
df_student_info
Column: code_module
Number of NaNs: 0
Column: code_presentation
Number of NaNs: 0
Column: id_student
Number of NaNs: 0
Column: gender
Number of NaNs: 0
Column: region
Number of NaNs: 0
Column: highest_education
Number of NaNs: 0
Column: imd_band
Number of NaNs: 1111
Column: age_band
Number of NaNs: 0
Column: num_of_prev_attempts
Number of NaNs: 0
Column: studied_credits
Number of NaNs: 0
Column: disability
Number of NaNs: 0
Column: final_result
Number of NaNs: 0
***
df_scores
Column: id_assessment
Number of NaNs: 0
Column: id_student
Number of NaNs: 0
Column: date_submitted
Number of NaNs: 0
Column: is_banked
Number of NaNs: 0
Column: score
Number of NaNs: 173
***
df_assessments
Column: code_module
Number of NaNs: 0
Column: code_presentation
Number of NaNs: 0
Column: id_assessment
Number of NaNs: 0
Column: assessment_type
Number of NaNs: 0
Column: date
Number of NaNs: 11
Column: weight
Number of NaNs: 0
***

Now let’s remove the NaN entries from their respective columns, as well as the irrelevant columns: code_presentation and is_banked.

I am removing the code_presentation column because it is a simple identification code for the presentation to which a given assessment belongs, so it does not matter for understanding the factors involved in the students’ performance; the same goes for the is_banked column.
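A minimal sketch of this cleanup could be the following (which rows I drop here is my reading of the NaN counts above; the original code may differ):

# Sketch of the cleanup: drop the rows with missing values found above and
# remove the columns that won't contribute to the analysis.
df_student_info = df_student_info.dropna(subset=["imd_band"]).drop(columns=["code_presentation"])
df_assessments = df_assessments.dropna(subset=["date"]).drop(columns=["code_presentation"])
df_scores = df_scores.dropna(subset=["score"]).drop(columns=["is_banked"])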

Let’s take a look at what we have so far:

image by the author

Now, let’s create a column called assessment_type to relate the scores with their respective assessment categories (a minimal sketch of this step follows the list below):

  • CMA: computer marked assessment
  • TMA: tutor marked assessment
  • Exam: the exam or exams of the course
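One way to build this column is to map each assessment id to its type, for example:

# Sketch: map each id_assessment to its assessment_type taken from
# df_assessments, and attach it to the score entries.
type_map = df_assessments.set_index("id_assessment")["assessment_type"]
df_scores["assessment_type"] = df_scores["id_assessment"].map(type_map)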

Let’s take a look at df_scores:

df_scores
image by the author

Great, now let’s merge everything together into one dataframe using the merge() method, removing the resulting NaN entries from the score and studied_credits columns:
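A rough sketch of this merge, under the assumption that id_student is enough to join the two tables (the numbers in the output below are presumably the NaN counts in these two columns before and after dropping):

# Sketch of the merge: combine the scores (which now carry assessment_type)
# with the student information, then drop the remaining NaN entries.
# The exact merge keys and order in the original notebook may differ.
df_merged = df_scores.merge(df_student_info, on="id_student", how="left")
df_merged = df_merged.dropna(subset=["score", "studied_credits"])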

# Output
5768
0
7480
0
df_merged
image by the author

To finish, let’s change the imd_band column to numerical values:
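The imd_band values are strings describing deprivation bands, so one way to make them numerical is to keep the upper bound of each band (this particular mapping is my assumption):

# Sketch: turn imd_band strings such as "0-10%", "10-20", ..., "90-100%"
# into numbers by keeping the upper bound of each band.
df_merged["imd_band"] = (
    df_merged["imd_band"]
    .str.replace("%", "", regex=False)
    .str.split("-")
    .str[-1]
    .astype(float)
)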

4 & 5. Basic analysis and presenting the results

Let’s remember the question I set for myself in the beginning: What are the main factors responsible for score performance in a course?

Given that my interest is in the factors associated with the scores of the students, let’s begin by looking at the distribution of scores:
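A simple histogram is enough for this first look, for example:

# Sketch: plot a simple histogram of the scores.
import matplotlib.pyplot as plt

plt.hist(df_merged["score"], bins=50)
plt.xlabel("score")
plt.ylabel("count")
plt.title("Distribution of scores")
plt.show()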

Now, let’s look into the scores for the different types of assessments:

image by the author

Let’s consider the average score per type of assessment:
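This can be done with a simple groupby or a loop over the assessment types, for example:

# Sketch: print the mean score for each assessment type,
# in the format of the output below.
for a_type in ["TMA", "CMA", "Exam"]:
    mean_score = df_merged.loc[df_merged["assessment_type"] == a_type, "score"].mean()
    print(f"Average score for {a_type}")
    print(mean_score)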

# Output:
Average score for TMA
72.5633924663878
Average score for CMA
81.02705346888426
Average score for Exam
63.546800382043934

The average score is highest for the computer marked assessments, followed by the tutor marked assessments, while the exams have the worst scoring performance.

Let’s start digging by looking at the potential relationship between studied credits and score performance. I would expect that students with more studied credits would have better performance.

image by the author
pearsonr(df_merged["score"],df_merged["studied_credits"])
# Output
(-0.05601315081954174, 1.2536559468075267e-134)

As we can see from the scatter plot and from a simple Pearson correlation, the number of credits and the score performance are slightly negatively correlated. Nothing too striking, but our initial expectation is not supported by the data.

Now, let’s look at the number of days taken to submit and the performance. I would expect that students who deliver too early don’t do as well, and that students who take too long also don’t do as well, with the peak score somewhere in the middle.

image by the author
pearsonr(df_merged["date_submitted"],df_merged["score"])
# Output
(-0.020750337032287382, 6.150386473714662e-20)

Again we find a very small negative correlation, showing that the best grades were not necessarily obtained by taking longer to submit nor the opposite.

Let’s look now at the profile of students that pass the course versus students that fail or withdraw from the course. First, we create two separate dataframes:
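A sketch of this split based on the final_result column (the exact grouping of the result labels is my assumption):

# Sketch: split on the final result of the course. I am treating "Pass" and
# "Distinction" as passing and "Fail" and "Withdrawn" as not passing;
# the original grouping may differ.
df_pass = df_merged[df_merged["final_result"].isin(["Pass", "Distinction"])]
df_fail = df_merged[df_merged["final_result"].isin(["Fail", "Withdrawn"])]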

Let’s begin with the number of people that passed versus the number that failed or withdrew:

print(f"Number of people that passed the course: {len(df_pass)}")
print(f"Number of people that failed the course: {len(df_fail)}")
Number of people that passed the course: 137503
Number of people that failed the course: 56546

We see that considerably more people passed the course than failed. Let’s visualize different aspects of their profiles: time taken to submit assessments, education levels, age, number of credits, region, and the imd_band, which measures the deprivation level of the area where the student was living while taking the course (more details on this metric can be found here).

Let’s start by looking at a comparative distribution of the performance using percentage scores.
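One way to do this comparison is to overlay the two histograms weighted to percentages, for example:

# Sketch: overlay the score distributions of the two groups, with each
# histogram expressed as a percentage of its group.
import numpy as np
import matplotlib.pyplot as plt

bins = np.arange(0, 110, 10)
plt.hist(df_pass["score"], bins=bins,
         weights=np.ones(len(df_pass)) / len(df_pass) * 100,
         alpha=0.5, label="pass")
plt.hist(df_fail["score"], bins=bins,
         weights=np.ones(len(df_fail)) / len(df_fail) * 100,
         alpha=0.5, label="fail/withdrawn")
plt.xlabel("score")
plt.ylabel("% of group")
plt.legend()
plt.show()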

image by the author

As expected, the distribution of grades for people that fail shows a peak at the 60–70 interval, while for the people that passed the peak is at 100.

Let’s look now at the submission time profile of the two groups. Our expectation here is that the students that fail probably deliver sooner, because we saw earlier that there was a slight negative correlation between the number of days to submit and score performance.

Indeed, that is what we see here: the students that fail or withdraw deliver assessments much sooner, and there is a clear difference between the two distributions, with the peak for the failed students skewed to the left at around 0 to 100 days, while for the students that passed the peak is at 100 to 200 days. Let’s confirm by printing the mean number of days to submit for both groups.
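A sketch of this check, printing the group means and running a two-sample t-test of the kind shown in the output below:

# Sketch: mean submission time per group plus a two-sample t-test.
from scipy.stats import ttest_ind

print(df_pass["date_submitted"].mean())
print(df_fail["date_submitted"].mean())
print(ttest_ind(df_pass["date_submitted"], df_fail["date_submitted"]))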

# Output
Ttest_indResult(statistic=91.87803726695681, pvalue=0.0)

A significant 30-day difference. Cool, now let’s look at education levels.

The biggest difference we observe here is that there are proportionally more students with lower than A level education in the group that fails (42.8%) than in the group that passes the course (33%).

Now let’s look at possible age differences:

image by the author

In the age department, both groups have a similar profile, with the fail group having more young people between 0 and 35 years, while the pass group has more older people between 35 and 55 years.

Now let’s look at geographical information: the region where the student lived while taking the course.

image by the author

Here, when we look at the differences between regions, we see only small effects, with 1.6% more people from the South Region doing better and 1.7% more people from the North Western Region doing worse.

Finally, let’s look at the imd_band, which is in essence a measure of deprivation for small areas that is widely used in the UK. According to this report by the Department for Communities and Local Government, the index ranks areas from most deprived to least deprived, so as the percentage increases the area is less deprived: the 0-10% band corresponds to the most deprived (poorest) areas, while the 90-100% band corresponds to the least deprived.

image by the author

Here we see that the biggest difference is in the 0–10% band, the most deprived areas of the UK: the share of the fail group coming from this band is 4.1% higher than the corresponding share of the pass group.

Similarly, in the 10–20% band, 11.7% of the people that failed were from the second most deprived areas of the UK, versus 9.4% in the pass group.

The other interesting aspect is that on the other side of the spectrum these percentages flip: 10% of the people who passed were from the least deprived areas of the country versus 7.3% in the fail group, and similarly for the second least deprived band, with 10.2% of the people who passed versus 8.2% in the fail group.

For me, this seems to point to the most promising answer to my initial question about the factors responsible for score performance. To get a sense of whether these apparent differences reflect a real effect, let’s look at the relationship between score performance and the deprivation measure.

pearsonr(df_merged["imd_band"],df_merged["score"])
(0.08166717850681035, 2.3543308016802156e-284)

In the overall group of students we see a small positive correlation (r ≈ 0.08) between score performance and the deprivation measure.

Let’s look at this correlation for the pass and fail cases separately:


pearsonr(df_pass["imd_band"],df_pass["score"])
(0.0642002991822996, 1.6041785827156085e-125)
pearsonr(df_fail["imd_band"],df_fail["score"])
(0.052778778980020466, 3.5499376419846515e-36)

Again, weak correlations. To investigate further, let’s plot the average score per imd_band value to get an overview of the trends.
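A sketch of these plots, averaging the score per imd_band value for each group:

# Sketch: average score per imd_band value, plotted for the overall data
# and for the pass and fail groups.
import matplotlib.pyplot as plt

for label, df in [("all", df_merged), ("pass", df_pass), ("fail", df_fail)]:
    means = df.groupby("imd_band")["score"].mean()
    plt.plot(means.index, means.values, marker="o", label=label)
plt.xlabel("imd_band (upper bound of the band)")
plt.ylabel("mean score")
plt.legend()
plt.show()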

image by the author

We can observe that if we average the scores for each index band there is a very clear upward trend in score performance as the conditions of the area improve; in other words, scores tend to be higher in the less deprived areas.

When we look at both groups we see a similar trend, but interestingly the fail group shows more volatility in the 20–70% interval and a bigger jump in performance in the 70–80% interval when compared to the pass group.

To investigate this further, let’s fit a regression line to the average score performance against the imd rank.
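A sketch of this fit; the use of scikit-learn here is my assumption, suggested by the shape of the printed coefficient below:

# Sketch: fit a regression line to the mean score per imd_band value.
# The coefficient is printed by the line below.
from sklearn.linear_model import LinearRegression

means = df_merged.groupby("imd_band")["score"].mean()
X = means.index.to_numpy().reshape(-1, 1)
y = means.to_numpy().reshape(-1, 1)

reg = LinearRegression().fit(X, y)
coef = reg.coef_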

print(f"Slope of the Regression Line: {coef}")
# Output 
Slope of the Regression Line: [[0.05571216]]

To confirm these results, let’s use the statsmodels API to get a p-value for the regression coefficient:
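A sketch of the same fit with statsmodels, reusing the per-band means from above:

# Sketch: the same regression with statsmodels, which reports p-values
# for the fitted coefficients.
import statsmodels.api as sm

X_sm = sm.add_constant(means.index.to_numpy())  # add the intercept term
ols = sm.OLS(means.to_numpy(), X_sm).fit()
print(ols.summary())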

image by the author

We observe that we get a good fit: for each increase of one point on the index of deprivation we see an increase of about 0.055 in the mean score, and the coefficient is significant with a p-value < 0.05.

Since we are fitting the regression line to the means of the scores, we lose information on the variability of the scores across imd ranks, but for the purposes of this analysis this is enough evidence to at least suspect that the index of deprivation might be an important factor influencing the performance of the students.

We started with a question and attempted to build a narrative around it, which in this case was to try to understand the factors involved in the scoring performance.

Upon exploring the data we saw that one factor that might play a significant role was the rank of poverty in the area where the students lived while taking the course, and we showed that there was a positive correlation between mean score performance in the course and the index of multiple deprivation in the region, with the least deprived regions performing better than the most deprived.

6. Reconsider

Before wrapping up a data science practice session I like to reconsider what I did and ask myself whether or not the statistical arguments or models I used indeed were grounded in the data and reflected relevant information about it.

In the case of this dataset, I think that for the purposes of an informal practice session they did, although if this were intended for research or a professional report, there would be much more to do to solidify the arguments for the statements I made. Elements that could be reconsidered:

  • Plotting a regression line against the entire score and imd rank values to get information on the variability of the relationship between poverty and score performance
  • Investigating more detailed information on the statistical differences between pass and fail groups
  • Running machine learning models like random forest and XGBoost to classify students into pass or fail given the available information minus the actual score performances

This final point in your data science practice is crucial for developing the necessary critical thinking skills about data analysis and statistics.

The notebook with the source code for this post can be found here.


Thoughts on practicing

As with everything, if you practice you get better. The issue most people have is the time available to actually practice these skills every day, given that a notebook like this can take a few hours to put together.

However, one good strategy is to practice this template and use the question-formulation step to narrow down your analysis to fit the time slot you have available: if you only have 30 minutes, you can ask a very simple question and answer it quickly, even if the arguments won’t be as solid as they could be.

The most important thing is to practice the pipeline of gathering the data, exploring it, cleaning it, running the analysis, and presenting it in a way that tells a story, even if it is a really simple one.


If you liked this post connect with me on Twitter, LinkedIn and follow me on Medium. Thanks and see you next time! 🙂
