Evaluating the success of a game-based math program
Background: ST Math is a game-based digital math program for students in grades preschool through 8. Developers and educators invested four years and $3M to evaluate how 50 minutes of engagement per week over the course of one school year would impact players’ math skills. An academic research team looked at players’ standardized test scores, comparing math proficiency pre-ST Math and post-ST Math across treatment and control groups – and found zip-a-dee-doo-dah.
Finding no statistically significant results is never a good look. But for a company trying to sell its software to schools, this kind of outcome can be a death knell.
Additionally, this kind of outcome is ambiguous, like a black box or a silent sphinx. Clearly, we didn’t get it "right." But were we partially right? On the right track? What, if anything, needed to be optimized? Overhauled? Left alone? A universal "null" provides no clues. We needed to pop the lock on the black box, beguile the sphinx – tease out more data so we could understand WHY we didn’t find the results we expected and decide HOW to move forward.
The Challenge: Discover what was going on inside players’ ST Math experiences.
Luckily we had additional information!

When creativity meets data…
Other researchers had looked for effects on overall standardized math scores (and found none). A student’s overall score was the average across all of their math subscores, or "reporting clusters." Each reporting cluster generated its own score, and the type and number of clusters varied by grade. Reporting clusters describe specific math skills: number sense; algebra & functions; measurement & geometry; and statistics, data analysis & probability.
A Eureka! moment
What if we dove deeper into the overall score and analyzed each reporting cluster? Maybe the effects of ST Math were localized to one or a handful of clusters, and these significant outcomes were "drowned out" by other clusters’ scores…
It could be that players improved a lot on measurement & geometry skills but didn’t improve at all on the other reporting clusters. Average all of that together and you might still get an overall null result. It’s like adding a few drops of water to a large bucket: a big gain in one cluster barely changes the overall level.
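To make the dilution concrete, here is a toy calculation. The scores below are made up for illustration, not ST Math data: a ten-point jump in a single reporting cluster shrinks to a 2.5-point bump once it is averaged with three flat clusters.

```python
# Toy illustration (hypothetical scores, not ST Math data): a large gain in
# one reporting cluster nearly vanishes from the overall average.
pre = {
    "number sense": 50.0,
    "algebra & functions": 50.0,
    "measurement & geometry": 50.0,
    "statistics & probability": 50.0,
}
# Same scores at post, except a ten-point gain in measurement & geometry.
post = dict(pre, **{"measurement & geometry": 60.0})

overall_pre = sum(pre.values()) / len(pre)
overall_post = sum(post.values()) / len(post)
diff = overall_post - overall_pre

print(diff)  # 2.5 -- a quarter of the cluster-level gain, easy to miss
```

A standardized test’s measurement noise can easily swamp a shift that small, which is exactly how a real, localized effect can read as an overall null.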
We compared the pre and post scores across treatment and control groups for every reporting cluster – and we were right! At post, treatment group players’ number sense scores were significantly higher than they had been at pre.
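The cluster-level comparison can be sketched roughly like this. Everything here is illustrative: the gain scores are simulated, the effect is planted in number sense by construction, and a plain Welch t statistic stands in for whatever model the research team actually used.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b))

random.seed(0)
n = 200  # students per group (hypothetical)
clusters = ["number sense", "algebra & functions",
            "measurement & geometry", "statistics & probability"]

results = {}
for cluster in clusters:
    # Simulate gain scores (post minus pre); only number sense
    # gets a real effect in this toy setup.
    effect = 5.0 if cluster == "number sense" else 0.0
    treatment = [random.gauss(effect, 10.0) for _ in range(n)]
    control = [random.gauss(0.0, 10.0) for _ in range(n)]
    results[cluster] = welch_t(treatment, control)

for cluster, t in results.items():
    # With ~400 observations, |t| > ~1.97 is significant at p < .05.
    print(f"{cluster}: t = {t:+.2f}")
```

Run per cluster instead of on the overall average, this kind of test can surface a localized effect that the pooled score hides.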

Now what?
Our next task was to figure out why. Why were players’ number sense scores skyrocketing? And why didn’t students’ scores increase across other reporting clusters?
Number sense is a construct – a group of skills that allow individuals to work with numbers. These skills include (but are not limited to) understanding concepts like more and less, understanding symbols that represent quantities, and arranging numbers in order. Number sense is often described as a fundamental building block for math learning.
What if we looked at every game in the ST Math program, identifying which skills they introduced and how they scaffolded players’ mastery? Maybe we could understand how ST Math supported number sense. Then we could recommend that developers replicate effective methods. This would cut down on time spent spinning wheels in future game development.

Qualitative coding of the games
Four researchers (and lots of coffee!) coded ST Math’s 1000+ levels of curriculum by skill. For example, when a game represented numbers as objects, we checked the "number sense" box because that supports players’ understanding that symbols represent quantities. When a game provided a number line, we checked the "number sense" box because that supports players’ ability to arrange numbers in order.
In the end, we found that every game in the program, including games intended to support other reporting clusters, cultivated number sense skills. So that’s how ST Math supported number sense – by integrating it into every facet of the program.
And the results!
By digging deeper into our client’s data and product, we were able to uncover actionable insights – and reframe an initially dismal narrative of zero efficacy.
In fact, our project was such a success that the Institute of Education Sciences mentioned us in its announcement of funding for replication studies. That third-party recommendation brought a lot of visibility to ST Math and, crucially, conferred what the previous evaluation’s results denied: legitimacy.
Looking at specific design elements in games can tell you a lot about how your product works and whether it is having its intended effect. Pairing quantitative and qualitative methods like these can save you a lot of time and headache.
Onward!
Katerina Schenke, PhD, is Founder and Principal at Katalyst Methods and cofounder of EdTech Recharge, where she works with educational media companies to design and evaluate games, software, and assessments. She also works with organizations that care about learning, like Facebook, the Connected Learning Lab, and UNICEF, to run research projects that help them improve educational policy and practice. Learn more at katalystmethods.com.