
Introduction to Graph Models for Clickstream Data Pt. 2

From data preparation to model evaluation with code in R

Photo by Robynne Hu on Unsplash

Introduction

In a previous article, "Introduction to Graph Models for Clickstream Data", I introduced a way to analyze clickstream data with Exponential Random Graph Models (ERGMs) [1]. This approach can be helpful when we try to model data from interactions with digital content. The resulting clickstream data is not only a reflection of people’s thoughts: the digital content and its layout have an equally important influence on the way people interact with it. Disentangling those two sources of influence is critical if we want to build more meaningful and more enjoyable digital experiences.

To illustrate the core concept, I use data from a research project at Bamberg University where we wanted to find out how teachers form their judgements about student achievement. We used a simulated classroom environment (see picture below) in which participants had to choose items (ranging from very easy to very hard) and assign them to the simulated students. Each click was recorded and stored in a database.

Screenshot of the Simulated Classroom Software. Image by author.

Additionally, six student descriptions were presented, indicating the students’ names, age, gender, number of siblings, and their parents’ occupations, in order to provide both potentially relevant and less relevant information. Two example descriptions are presented in the figure below.

Image by author.

The clickstream data from the interaction with the simulated classroom is represented as networks in which each node represents a combination of two choices: one item and one student. Additionally, we calculated the rank-order correlation between the empirical student abilities and our participants’ ability estimates to create two groups: one group with high judgement accuracy (r >= .70) and one group with low judgement accuracy (r < .70). This results in n(high) = 25 and n(low) = 12 participants.

In the following I will walk you through my analysis and go into the details of specifying network parameters, estimating the models, and evaluating them.

Data Preparation

Loading and cleaning

First of all we need to load the required packages and the data. In this case we have one data frame per participant with each step recorded in one row.

By loading the data this way I end up with one data frame containing all the data, but it still needs to be cleaned. Via the .id argument of bind_rows() we keep an id variable to identify each participant in the data. This id variable contains the path as well as the file name, which is a unique participant number, so we only keep the number from the file name as an identifier. Also, we only need every second row, with a valid entry for both item and schueler (student).
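A minimal sketch of this loading and cleaning step could look like the following; the folder path, file format, and use of the tidyverse are assumptions, while item and schueler are the column names mentioned above:

```r
library(tidyverse)

# Assumed location and format of the raw files (one CSV per participant).
files <- list.files("data/raw", pattern = "\\.csv$", full.names = TRUE)

clicks <- files %>%
  set_names() %>%                           # name each element by its file path
  map(read_csv) %>%                         # one data frame per participant
  bind_rows(.id = "id") %>%                 # keep path/file name as id
  mutate(id = str_extract(id, "\\d+")) %>%  # keep only the participant number
  filter(!is.na(item), !is.na(schueler))    # keep rows with both choices filled
```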

Following these steps results in a data frame with only relevant entries and well formatted ids.

Turning row-wise data into matrices

Creating network objects requires the data to be in the form of an edge list or an adjacency matrix. You could go either way here, but I chose to turn the individual clickstream data into adjacency matrices. An adjacency matrix is always read from row to column: each cell entry indicates a transition from the node in the row to the node in the column. That entry can be 0 (not present), 1 (present), or even higher than 1 if the transition occurs multiple times.

In the simulated classroom we have 12 items and 6 students, resulting in 72 item-student combinations. For some network analyses, and especially for visualizations, it can be beneficial to have a separate start and end node to explicitly include those steps.

The matrix follows a basic structure: the first row and column represent the additional start node. Rows and columns 2 to 13 represent all 12 items being assigned to student 1, rows and columns 14 to 25 all items being assigned to student 2, and so on up to row and column 73. Row and column 74 represent the separate end node.
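A sketch of how a single participant’s ordered clicks could be mapped onto such a 74 x 74 matrix; the function name and the numeric codings of item (1–12) and schueler (1–6) are assumptions:

```r
# Turn one participant's ordered clicks into a 74 x 74 adjacency matrix:
# node 1 = start, nodes 2-73 = item-student combinations, node 74 = end.
clicks_to_adjacency <- function(df) {
  n_items <- 12
  # index of each item-student combination, offset by 1 for the start node
  node_id <- 1 + (df$schueler - 1) * n_items + df$item
  path    <- c(1, node_id, 74)              # prepend start node, append end node
  adj     <- matrix(0, nrow = 74, ncol = 74)
  for (i in seq_len(length(path) - 1)) {
    adj[path[i], path[i + 1]] <- adj[path[i], path[i + 1]] + 1
  }
  adj
}
```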

Creating a nested data structure

I mentioned two judgement accuracy groups in the introduction. To create the two group network objects we first have to create individual adjacency matrices. To keep the data easily manageable, we can turn our overall data frame into a nested data frame with id as the grouping variable. Then we just have to apply the function to every nested data frame in order to create one adjacency matrix per participant and store it in another variable of the overall data frame.
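A sketch of that nesting step, reusing the hypothetical clicks_to_adjacency() helper from above:

```r
# One row per participant, with the raw clicks and the resulting adjacency
# matrix stored in list columns.
clicks_nested <- clicks %>%
  group_by(id) %>%
  nest() %>%
  ungroup() %>%
  mutate(adj = map(data, clicks_to_adjacency))
```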

Creating two judgement groups

The calculations for the overall judgement accuracy are not shown in this project. I chose to create two index vectors containing the ids of participants with high and low judgement accuracy. Then we just need to add up all adjacency matrices within each group. Even though we introduced a separate start and end node in the adjacency matrices above, we don’t need those two nodes in the network objects for our network models. This results in two 72 x 72 adjacency matrices, one per judgement accuracy group.
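A sketch of the grouping and summing step; the two id vectors are placeholders, since the judgement accuracy calculations are not shown here:

```r
ids_high <- c("1", "2", "3")   # hypothetical ids with r >= .70
ids_low  <- c("4", "5", "6")   # hypothetical ids with r <  .70

sum_group <- function(nested, ids) {
  mats  <- nested %>% filter(id %in% ids) %>% pull(adj)
  total <- Reduce(`+`, mats)   # element-wise sum of all individual matrices
  total[2:73, 2:73]            # drop the separate start and end nodes
}

adj_high <- sum_group(clicks_nested, ids_high)  # 72 x 72
adj_low  <- sum_group(clicks_nested, ids_low)   # 72 x 72
```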

Preparing attribute objects

To be able to add network attributes we need to load that information first. I created a separate attributes data frame containing information on item, gender, student, and social status (ses) for each node. Actually, I forgot to add the student variable to the attributes file (let’s keep it real here), but we can just create a vector for that and add it as a variable to the attributes data frame. Both ways work equally well. Adding attributes to each node is the main reason why it is beneficial to drop the separate start and end nodes: we could hardly assign meaningful attributes to those two nodes, which would be a problem when we build a model. All connections of the start and end node could not be tied to any node attributes and would thus remain unattributed random variance.
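For illustration, the attribute table could also be built directly in code. The node order follows the adjacency matrix (student 1 with items 1–12, then student 2, and so on); the concrete gender and ses assignments below are assumptions:

```r
attributes <- tibble(
  student = rep(1:6, each = 12),
  item    = rep(1:12, times = 6),
  gender  = rep(c("male", "female"), each = 12, length.out = 72),  # assumed seating
  ses     = rep(c("high", "low"), each = 36)                       # assumed rows
)
```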

Creating Network Objects

Now we can finally create network objects and start building models. But as always, data preparation takes some time and is important.
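A sketch of creating the two network objects with the network package (part of statnet) and attaching the node attributes:

```r
library(network)

make_net <- function(adj, attrs) {
  # Non-zero matrix entries become (binary) directed edges.
  net <- network(adj, directed = TRUE, matrix.type = "adjacency")
  net %v% "student" <- attrs$student
  net %v% "item"    <- attrs$item
  net %v% "gender"  <- attrs$gender
  net %v% "ses"     <- attrs$ses
  net
}

net_high <- make_net(adj_high, attributes)
net_low  <- make_net(adj_low,  attributes)
```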

Model Specification & Interpretation

Creating a baseline

There are a few different packages for building network models. In this case I worked with statnet [2]. We start by building a baseline model including only an edges term. It won’t get us very far, but it allows us to see how our model improves when we add other variables. And again, we do that for both network objects. As an indicator of overall model performance across the different model versions, we start with an AIC of 4147 for the high and an AIC of 2729 for the low judgement accuracy group.
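A sketch of the baseline models; the object names are mine:

```r
library(ergm)

# Baseline: only an edges term (the network equivalent of an intercept).
model_high_0 <- ergm(net_high ~ edges)
model_low_0  <- ergm(net_low  ~ edges)

summary(model_high_0)
AIC(model_high_0)
```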

Model summary for high judgement accuracy model.

Adding variables

By including different nodefactor terms we can test if nodes with specific factor levels are connected more often than others. The interpretation of those nodefactor terms is analogous to variables in a logistic regression where one level is set as the baseline and the other levels are compared to the baseline level.
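A sketch of the models extended by nodefactor terms for the attributes set above; the baseline level (e.g. item 2) can be chosen via nodefactor()'s levels argument, which is omitted here for brevity:

```r
# Add nodefactor terms for the student- and item-related attributes.
model_high_1 <- ergm(net_high ~ edges +
                       nodefactor("student") + nodefactor("item") +
                       nodefactor("gender")  + nodefactor("ses"))
model_low_1  <- ergm(net_low ~ edges +
                       nodefactor("student") + nodefactor("item") +
                       nodefactor("gender")  + nodefactor("ses"))

summary(model_high_1)
```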

We start to see differences between the two models. Important to note here: model parameters cannot be compared across different models. We can only interpret the parameters in relation to other parameters within the same model!

We see that the parameters for the nodefactor terms of students 5 and 6 cannot be estimated. That is a sign that the model is not yet well specified.

Model summary with edges and different nodefactor terms.

First Differences

When we look at the student-related variables (student, gender, and social status), there is a significantly positive coefficient for nodefactor.ses.low in the high judgement accuracy group (Model 1). That means that nodes with a low social status attribute are generally connected more often than nodes with a high social status attribute.

One interpretation could be that participants in the high judgement accuracy group actually want to take a closer look at the students with low social status (as indicated by their parents’ jobs in the student descriptions at the beginning). It is well documented that social status has a negative effect on perceived ability, so it would be plausible to specifically focus on those students. Participants who make very accurate judgements may be aware of that effect and careful not to fall victim to it themselves. For the low judgement accuracy group we don’t find any significant effects.

Looking at the items, we don’t find a differentiated item choice in Model 2 (low judgement accuracy group): none of the item coefficients differ significantly from 0 when item 2 (which has a medium difficulty) is chosen as the baseline category. In Model 1 (high judgement accuracy group) we see quite the opposite, with almost all coefficients being significant in comparison to item 2. This supports the interpretation that participants in the high judgement accuracy group proceed more thoughtfully: they seem to focus on some students more than others and also seem to choose their items more deliberately.

Looking at the AIC we see only minor changes: for the high group the AIC drops slightly to 4144, while for the low group it rises to 2748. So even though we included additional variables, neither model improved substantially.

Adding behavioral variables

In the final models I added variables that function as a proxy for how participants might navigate through the simulated classroom. In network terminology they are nodematch terms. Generally speaking, nodematch terms test whether nodes have a higher probability of being connected to other nodes that share the same attribute. I added nodematch terms for student, item, gender, and social status (ses), which are labelled Assessment Strategies in the summary table below. Technically, those are not behavioral variables per se, but in this case they capture effects that represent behavioral patterns very well.
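A sketch of the full models with the additional nodematch terms:

```r
# Add nodematch terms ("assessment strategies") on top of the previous model.
model_high_2 <- ergm(net_high ~ edges +
                       nodefactor("student") + nodefactor("item") +
                       nodefactor("gender")  + nodefactor("ses") +
                       nodematch("student")  + nodematch("item") +
                       nodematch("gender")   + nodematch("ses"))
model_low_2  <- ergm(net_low ~ edges +
                       nodefactor("student") + nodefactor("item") +
                       nodefactor("gender")  + nodefactor("ses") +
                       nodematch("student")  + nodematch("item") +
                       nodematch("gender")   + nodematch("ses"))

summary(model_high_2)
```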

Full model summary with all variables.

We have highly significant coefficients for both the nodematch student (same student) and nodematch item (same item) terms in both models. That is largely due to the way the simulated classroom is designed: participants had to choose items and students in alternation. If you don’t do that completely at random, there is no way to avoid choosing an item or a student twice (or more often) in a row.

In both models we find a significant negative coefficient for the nodematch gender term (same gender). That appears to be an artefact of the simulated classroom’s architecture: boys sit next to girls and vice versa. Connecting students of different gender could reflect an intentional assessment process or could simply mean "ask students who sit next to each other in a row". The latter interpretation is supported by the positive coefficient for nodematch ses (same social status) in the low judgement accuracy group. This means participants chose students in a row who share the same social status. The students in the simulated classroom were set up so that all students in the upper row have a high social status and all students in the lower row a low social status. So again, it seems likely that this nodematch term captures an effect of how participants navigate through the simulated classroom: closely along the seating arrangement.

Interestingly, the nodefactor ses low (low social status) term, which was significant in the model without behavioral variables, is no longer significant.

Looking at the AIC values for both models tells us that both actually improved this time. For the high judgement accuracy group we have an AIC value of 3066 and for the low group we have an AIC value of 2212.

Where things get really network-y

If we look at the model specifications, we are not too far away from a logistic regression so far. That changes by adding a triangle term. The triangle term is an indicator of local clustering in the networks. For the high judgement accuracy group we find a small negative coefficient, meaning the network has a slight tendency against forming local clusters. This again could be an indicator of a more thoughtful and planned assessment procedure in the simulated classroom.
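A sketch of the final specification with the triangle term added for the high judgement accuracy group; triangle terms can make ERGM estimation unstable, so this is illustrative only:

```r
model_high_3 <- ergm(net_high ~ edges +
                       nodefactor("student") + nodefactor("item") +
                       nodefactor("gender")  + nodefactor("ses") +
                       nodematch("student")  + nodematch("item") +
                       nodematch("gender")   + nodematch("ses") +
                       triangle)

summary(model_high_3)
```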

Model Evaluation

So far I have only talked about AIC as an indicator of model performance. Assessing the goodness-of-fit of network models is very much a visual task. The basic concept is to take the estimated coefficients, sample new networks from them, and measure a certain set of parameters describing those networks. Those parameters from the sampled networks are then compared to the parameters of our empirically observed network (for which we built our model). If they match, we can be reasonably confident that we have built a model that captures the data-generating process well. If there are larger discrepancies, we would have to go back to theorizing and think about how the models could be made more precise.

The btergm package [3] provides a nice function to estimate and plot the goodness of fit. Visually inspecting the graphic below shows that there are no substantial differences between the empirical network statistics and those of the sampled networks (shown in the boxplots). That is a good sign that we are on the right track.
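The analogous gof() function from the ergm package works the same way on ergm fits; a minimal sketch could look like this:

```r
# Simulate networks from the fitted model and compare their statistics
# (degree, shared partners, geodesic distances, ...) to the observed network.
gof_high <- gof(model_high_3)
plot(gof_high)
```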

Goodness of fit plots for the high judgement accuracy model. Source: Author

Outlook

A major limitation of ERGMs is that they can only handle binary edges. Since I decided to group multiple participants and only look at their data on an aggregated level, I lost some information: it would certainly make a difference whether a transition between two sequential actions appears once or five times. But there is a solution to that problem: Generalized Exponential Random Graph Models (GERGMs) [4]. This generalized form of ERGMs can handle weighted edges and thus use the full information. But they are computationally very demanding: estimating a GERGM for a network of moderate size with 20–30 nodes can easily take days, if not weeks, on a regular machine. Choosing a different likelihood estimation method (maximum pseudolikelihood instead of Markov chain Monte Carlo maximum likelihood estimation) can speed things up significantly but introduces the risk of biased estimates.

I will explore that in my next post. If you have questions or ideas for new models to build, feel free to reach out on LinkedIn or Twitter (@MMuckshoff). The complete code is also on GitHub.

References

[1] G. Robins, P. Pattison, Y. Kalish & D. Lusher, An introduction to exponential random graph (p*) models for social networks (2007), Social Networks, 29(2)

[2] M. S. Handcock, D. R. Hunter, C. T. Butts, S. M. Goodreau & M. Morris, statnet: Software tools for the representation, visualization, analysis and simulation of network data (2008), Journal of Statistical Software, 24(1)

[3] P. Leifeld, S. J. Cranmer & B. A. Desmarais, Temporal Exponential Random Graph Models with btergm: Estimation and Bootstrap Confidence Intervals (2017), Journal of Statistical Software 83(6)

[4] J. D. Wilson, M. J. Denny, S. Bhamidi, S. Cranmer & B. Desmarais, Stochastic weighted graphs: Flexible model specification and simulation (2017), Social Networks, 49

