
From Trace Data to Time Series With Panda Magic

This article will guide you through quickly transforming millions of rows of trace data into a wide format time series table

Photo by Ilona Froehlich on Unsplash

This article will guide you through quickly transforming millions of rows of trace data into a wide format time series table using the power of mighty Pandas.

The Data

For this demonstration, I’ll be using the OULAD dataset from The Open University. The full dataset, as well as a lovely data description, is available at the link above. This is a dataset about student activity in a virtual learning environment in online classes. It spans two years and four cohorts for each of seven modules.

I’ve written about this dataset before, as it’s one of my favorites. You can read about some of my previous work building predictive models using this data in my article, Predicting Student Outcomes by How They Interact With Their Learning Environment.

Today we will be exploring the student_vle.csv file in the zipped collection. This file contains trace data for student interactions with online activities accessible in the learning environment. Each row represents one interaction a student had with an activity and includes the number of clicks they made in that interaction.
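If you’d like to follow along, loading the file is a one-liner. Here’s a minimal sketch, assuming you’ve already unzipped the dataset into a local data/ folder (the path and filename below are my placeholders, so match them to your extracted files):

import pandas as pd

#Load the raw trace data; adjust the path to wherever you extracted the zip.
student_vle = pd.read_csv('data/student_vle.csv')

print(student_vle.shape)
print(student_vle.head())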

The Goal

In my previous approach to this dataset, I used aggregations of this trace data, namely the total number of clicks each student had made, to predict whether students would pass the class.

But now I use student behavior on a day-by-day basis to make my predictions. This means using time series data to describe how a student’s engagement changes over the course of time. For this article, I’ll just record the number of clicks the student made each day, though for my actual model I use more variables and have multiple columns for each day of the course as input variables.

Since I want time series data for each student, I will create a wide format time series table from the individual activity interaction observations. Each row will be a single student in a given cohort of a given course. Each column describes the number of clicks a student made on activities in the learning environment on that day.

The dates are relative to the start date of each cohort. Some students started working up to 25 days before the start of the course, and the courses are 270 days long, so the days run from -25 to 269. Therefore our final table will have 295 columns.

This is what we want it to look like:

Image by Author

Process Overview

The first time I tried to transform my data from the trace log to the wide-form time series table, I used a nested for loop to read each entry in the trace log and place it in the correct spot in the new table. Don’t do this. I watched the process progress for a bit, using a counter, and did some quick math. With a little more than 10 million rows of trace data, it was going to take over 5 days of run time to complete. There’s a saying: "If you are using a nested for loop with Pandas, there is probably a better way."

Instead, I used boolean masks to create ‘views’, filtered looks into the data that don’t mangle the original. I did this by filtering the data for each day individually and merging the result of each view into a new dataframe in chronological order.

Process Details

The trace data starts as just a list of activity interactions with variables describing the activity, the module, the cohort, the student, and the number of clicks. I combined the module, cohort, and student into one ‘registration’ column for easier indexing.
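A minimal sketch of that combination step might look like this (code_module, code_presentation, and id_student are the column names in the raw OULAD file; the underscore separator is just my choice):

def add_registration(student_vle):
    #Combine module, cohort, and student into a single key.
    student_vle = student_vle.copy()
    student_vle['registration'] = (student_vle['code_module'] + '_'
                                   + student_vle['code_presentation'] + '_'
                                   + student_vle['id_student'].astype(str))
    #Keep only the columns we need downstream.
    return student_vle[['registration', 'date', 'id_site', 'sum_click']]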

Image by Author

As you can see, there are multiple entries per student per day, and even per activity, since each entry is an individual interaction with an individual activity. Students often seem to interact with the same activity several times on the same day. The first thing we need to do, to get data on clicks per day, is to group the data. To learn more about how to use Pandas Groupby functions, you can read my article on the subject.

We can group this table by student and day with the following code:

Group by Student and Day

def group_by_day(student_vle):
    #group by registration and day
    vle_group = student_vle.groupby(by=['registration', 'date'])
    #sum up the clicks for each day for each student
    sum_clicks = vle_group.sum().reset_index()
    sum_clicks = sum_clicks.drop(columns=['id_site'])
    #sort by student and day
    sum_clicks = sum_clicks.sort_values(by=['registration', 'date'])
    return sum_clicks

The function group_by_day() uses the Pandas df.groupby() method to group the data by student and day, and the .sum() aggregation to make a new dataframe with each student’s clicks summed up for each day.

I also do a completely unnecessary sort, because it’s nice to see the data sorted by student and day to get a brief sense of it, but that is not at all needed for the next steps. If you are processing an even larger dataset, or putting code into production, definitely skip this step.

This code returns the following table:

Image by Author

We now have just one entry per student per day. The clicks from each individual activity interaction have been added together to represent the total for the whole day, rather than per interaction.

Thanks to my need for order, the entries are organized by student and day. I think it’s nice to see each student’s workflow through the days of the module.

Now we can start cutting apart this table, day by day, and gluing those pieces back together side by side with the following code:

Filter and Merge

import pandas as pd
from IPython.display import clear_output  #for the notebook progress counter

def create_time_series(sum_clicks):
    #Create a new dataframe to hold our time series data.
    time_series = pd.DataFrame()
    #Create one row for each student registration
    time_series['registration'] = sum_clicks['registration'].unique()
    start_date = sum_clicks['date'].min()
    end_date = sum_clicks['date'].max()
    #range() excludes its stop value, so add 1 to keep the last day.
    date_range = range(start_date, end_date + 1)
    counter = len(date_range)
    #Iterate through the days of the course:
    for date in date_range:
        #create a view of the data, one day at a time.
        single_date_df = sum_clicks[sum_clicks['date'] == date]
        single_date_df = single_date_df.drop(columns=['date'])
        #rename columns to describe date and data.
        new_cols = ['registration'] + [f'clicks_on_day_{date}']
        single_date_df.columns = new_cols
        #merge into the time series dataframe.
        time_series = time_series.merge(single_date_df,
                                        how='left',
                                        on='registration',
                                        validate='1:1')
        print('Days remaining to be processed: ', counter)
        clear_output(wait=True)
        counter -= 1
    #Missing data represents no clicks that day, so fill with 0.
    time_series = time_series.fillna(0)
    time_series = time_series.set_index('registration', drop=True)
    return time_series
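With both functions defined, a minimal end-to-end run (assuming student_vle has already been loaded and given its ‘registration’ column as above) is just:

sum_clicks = group_by_day(student_vle)
time_series = create_time_series(sum_clicks)
print(time_series.shape)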

Filter

The above function iterates through each day of the module and uses a boolean array, an array of Trues and Falses, to filter for data from just that day. It checks the ‘date’ variable for each row and returns only the rows whose dates match the current date for that iteration.

For example:

sum_clicks['date'] == 0 would return a series the same length as sum_clicks, filled with True for each row whose date variable is 0 and False for the rest.

sum_clicks[sum_clicks['date'] == 0] will return a view of only the rows where the boolean array contains True.
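Here’s a toy illustration of the same mechanics, with made-up values:

import pandas as pd

toy = pd.DataFrame({'registration': ['AAA_2013J_1', 'AAA_2013J_1', 'BBB_2013J_2'],
                    'date': [0, 1, 0],
                    'sum_click': [4, 2, 7]})

mask = toy['date'] == 0   #a boolean Series: True, False, True
print(toy[mask])          #only the two day-0 rows survive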

When we apply the above filter to the sum_clicks table, we get:

Image by Author

Notice that we only have the sum of the clicks for each student for day 0, the first day of the course. You might also realize that most students probably don’t work in the virtual learning environment every day, so each daily view will not contain every student. The left join in the next step will resolve that.

The next part of the above function merges each daily view onto a dataframe we already prepared that has a row for each student registration, but no data.

Merge

time_series = time_series.merge(single_date_df,
                                how='left',
                                on='registration',
                                validate='1:1')

df.merge() is one way to merge one dataframe into another. This is equivalent to joining tables in SQL and other data management tools. The magic in this approach is here in the merge. I prepared the dataframe named ‘time_series’ specifically to join each of these views onto in chronological order. This dataframe starts with one column called ‘registration’ that holds the unique registrations present in the data. One row per student registration, no more, no less. This is important.

When I merge each daily view in my for loop, I use ‘registration’ as a key. This ensures that the right data goes into the right row, even if the rows are not in the same order in both ‘time_series’ and the daily view we are merging. We also need to think about our data integrity when we are transforming data like this. Merging with keys is the best way to make sure your values don’t get mixed up between your rows.

how='left' tells Pandas to preserve the records of the left dataframe, in this case, ‘time_series’. Don’t drop any, don’t add any, don’t rearrange them. Just take the rows in the right dataframe, in this case, ‘single_date_df’, and attach the ones that match. Drop any that don’t. We want to make sure there is exactly one record per registration for each day.

on='registration' tells Pandas to use that column as the key to match the rows of each dataframe. Otherwise it won’t know which rows in the single_date_df dataframe go where when they merge! Finally, validate='1:1' means ‘One-to-One’. This tells Pandas to check that each registration appears at most once on each side of the merge, one more way to make sure our data doesn’t get messed up. If there is no record in the right dataframe, ‘single_date_df’, that matches a registration in the left, ‘time_series’, Pandas will fill that value with np.nan, or ‘no data’.

Once this loop finishes, the dataframe will have all the click counts for each day, for each student, arranged chronologically left to right. Anyplace that no clicks were found for a student on a particular day will have an np.nan. We will use time_series = time_series.fillna(0) to replace those np.nans with zeros. After all, that’s what that np.nan really means: the student didn’t click anything that day.

I also set ‘registration’ as the index, because Pandas lets me, it’s fast, and that’s really what that column is, the index of student registrations. I didn’t do it earlier because it simplifies the merging syntax a little to have it as its own column.

Here is our final table. It’s very sparse.

Image by Author

Bonus: Sparsify!

Just because I love you all, I want to show you one more trick. The time series dataframe we created above is mostly full of zeros. This is known in the biz as a ‘sparse dataframe’. This one is just a little sparse, but other dataframes we might use are overwhelmingly filled with zeros. If we save them like this it wastes a lot of space recording each zero separately. Once again, Pandas to the rescue!

We can compress the dataframe by essentially removing the zeros while keeping track of where they should go with:

sparse_time_series = time_series.astype(pd.SparseDtype('int', 0))

The above code cuts the size of the time series dataframe by about a third by removing the zeros. In the pd.SparseDtype('int', 0) part of the code, we tell Pandas that the resulting sparse dataframe will hold values of type ‘int’, or integer, and that the zeros are what we want to compress away.
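If you want to verify the savings on your own machine, a quick check might look like this (the exact numbers will vary with your data):

dense_bytes = time_series.memory_usage(deep=True).sum()
sparse_bytes = sparse_time_series.memory_usage(deep=True).sum()
print(f'dense:  {dense_bytes:,} bytes')
print(f'sparse: {sparse_bytes:,} bytes')
#Fraction of values actually stored, i.e. the non-zero share.
print(sparse_time_series.sparse.density)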

In other applications, especially Natural Language Processing, data can be compressed to MANY times smaller using this method.

Summary

You learned how to convert a trace data record, which could be user interactions, Internet of Things pings, or anything that returns a value periodically, into a wide-form time series dataframe aggregated by a chosen period, in this case by day. You learned how to use df.groupby() to aggregate the data, how to pass a boolean mask to a dataframe, such as df[df['date'] == 0], to create a view of just the data you want, and how to use df.merge() to iteratively attach each period’s data to the previous to build that wide-format time series dataframe.

As a bonus, you learned a compression method to store sparse dataframes more efficiently. This becomes very important when you have very large, very sparse datasets you want to use for modeling. Many standard machine learning tools handle sparse dataframes efficiently.

In the case of the above example, these strategies reduced processing time from over 5 days to a minute or two, depending on your machine.

Full Code:

Here’s a Github Gist of the full code that will download, unzip, and transform this particular dataset. Please feel free to copy it and use it as a template for your own projects.

References:

Kuzilek, J., Hlosta, M., & Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4, 170171. doi:10.1038/sdata.2017.171

