10 Years of Data Science Visualizations

Ben Weber
Towards Data Science
6 min read · Nov 4, 2017


My career in data science started a decade ago, when I took my first machine learning course at UC Santa Cruz. Since then, I’ve worked on a variety of teams and used a number of different visualization tools. While much of my experience in industry has focused on business and product metrics, I wanted to share some of the visualizations I’ve put together over the past 10 years. I’ve picked a single visualization from each year, with the goal of showing how my preference for different tools has changed over time. These aren’t necessarily the most impactful visualizations from each year, because that list would be too Excel-focused and might expose sensitive information. Here’s the timeline for my visualization tool preference over the past 10 years:

2007

Back before deep learning dominated the task of image classification, AdaBoost was a popular approach for machine learning tasks such as face detection. During my first year of graduate school at UC Santa Cruz, I took Manfred Warmuth’s machine learning course. His course introduced me to a variety of tools, including MATLAB and Mathematica. For my term project, I implemented Viola and Jones’ object detection algorithm in MATLAB and applied it to the task of detecting aircraft in images, with the results shown in the image below. The classifier didn’t perform well, mostly due to a lack of quality training data and a limited set of features (Haar basis functions).

Aircraft Image Detection — MATLAB — UC Santa Cruz (CMPS 242)

The banner image for this blog post was generated as part of this project. The image shows detected faces for some of the validation data set images.
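The mechanics that make Viola–Jones fast are worth a quick sketch: Haar-like features are differences of rectangle sums, and an integral image makes each rectangle sum a four-lookup operation. Here’s a minimal Python/NumPy sketch of that idea (the original project was in MATLAB; these function names are my own, not from the project):

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, zero-padded for easy indexing."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)), mode="constant")

def rect_sum(ii, r, c, h, w):
    """Sum of the h-by-w rectangle with top-left corner (r, c), in O(1)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def haar_two_rect(ii, r, c, h, w):
    """A two-rectangle (left vs. right) Haar-like feature response."""
    half = w // 2
    return rect_sum(ii, r, c, h, half) - rect_sum(ii, r, c + half, h, half)

# quick check on a small ramp image
img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
```

AdaBoost then builds a strong classifier by weighting many such one-feature weak learners, which is why evaluating the features cheaply matters so much.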

2008

For my dissertation project, I wanted to build a system that learns from demonstration, and was inspired by professional StarCraft players. Before the Brood War API was released, making it possible to write bots for StarCraft, it was possible to analyze data from replays using tools such as the LM replay browser. The image below shows some of the preliminary work from my analysis of professional replays. I wrote a Java application that drew dots and lines corresponding to troop movements during a StarCraft game.

StarCraft: Brood War Gameplay Traces — Java — UC Santa Cruz

2009

After collecting and mining thousands of replays, I trained various classification algorithms to detect different build orders in StarCraft 1. The results from this analysis are shown in the Excel chart below. One of the interesting results was that some of the algorithms outperformed the ruleset classifier used to label the replays. I published this result and other findings at IEEE CIG 2009.

StarCraft 1 Build Order Prediction — Excel — UC Santa Cruz (EISBot)

A few months ago, I wrote a blog post about how I would have approached this analysis today, given my current preference for scripting and visualization tools.

2010

During my internship at Electronic Arts in fall 2010, I had the opportunity to data mine terabytes of data collected from Madden NFL 11. One of the analyses I performed looked at different player archetypes in Madden, based on preferred gameplay modes. I used Weka to perform this task, and the results are shown in the PCA visualization below. During my internship at EA, I also worked on modeling player retention and pulled data on some interesting business metrics.

Player Archetypes in Madden NFL 11 — Weka — Electronic Arts
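Weka handled the projection here, but the underlying PCA step is simple enough to sketch. This is a generic Python/NumPy version, not the Weka pipeline, and the random matrix stands in for the per-player gameplay-mode features, which I can’t share:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)            # center each feature column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T    # coordinates in principal-component space

# toy stand-in for a players-by-features matrix of mode-preference stats
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
coords = pca_project(X, 2)             # 2-D coordinates, ready to scatter-plot
```

Plotting the two columns of `coords` as a scatter plot, colored by cluster assignment, gives the kind of archetype visualization shown above.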

2011

One of the challenges I faced in building a StarCraft bot was dealing with the fog of war, which restricts a player’s visibility of the map. I created a particle model to track enemy units that were previously observed but no longer within the bot’s vision. I wrote a script in Java to create the video below, which visualizes the bot’s observations. The results were published at AIIDE 2011.

State Estimation in StarCraft — Java — UC Santa Cruz (EISBot)
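The core idea of the particle model can be sketched briefly: when a unit leaves vision, spawn a cloud of particles at its last known position, diffuse them each frame, and use the cloud’s mean as the position estimate. This Python sketch uses a plain random-walk motion model as a placeholder, not the motion model from the published work, and the class name is my own:

```python
import numpy as np

class LastSeenTracker:
    """Track a unit that left vision using a cloud of position particles."""

    def __init__(self, last_pos, n_particles=200, step_sigma=1.5, seed=0):
        self.rng = np.random.default_rng(seed)
        # every particle starts where the unit was last observed
        self.particles = np.tile(np.asarray(last_pos, dtype=float),
                                 (n_particles, 1))
        self.step_sigma = step_sigma

    def step(self):
        """Diffuse particles by a random walk (placeholder motion model)."""
        self.particles += self.rng.normal(0.0, self.step_sigma,
                                          self.particles.shape)

    def estimate(self):
        """Best guess of the hidden unit's position: the particle mean."""
        return self.particles.mean(axis=0)

tracker = LastSeenTracker((10.0, 20.0))
```

A fuller version would also prune particles that fall inside the bot’s current vision, since the unit evidently isn’t there.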

2012

The last visualization I’ll present from my dissertation project is a graph showing the actions in the agent’s behavior tree, which it uses for decision making. The bot is authored in the ABL reactive planning language, and I wrote a script in Java to parse the code and output a graph format. I then used the Protovis library to create a rendering of the graph.

EISBot’s Behavior Tree — Protovis — UC Santa Cruz (EISBot)

2013

While working on the Studio User Research team at Microsoft, I helped the team build visualizations for playtesting of Ryse: Son of Rome and Sunset Overdrive. Deborah Hendersen presented some of the results of this work at GDC 2017. The image below shows enemy death locations for a playtest session visualized using Tableau. We found that including visualizations in addition to feedback from participants resulted in game studios paying more attention to the reports.

Ryse: Son of Rome Playtesting — Tableau — Microsoft Studios User Research

2014

At Sony Online Entertainment, I continued using Tableau for reporting business metrics and for putting together visualizations of player activity in our MMOs. Some of the most impactful visualizations I put together were player funnels, which showed where players were dropping off during gameplay. I also made a bunch of heatmaps, which didn’t yield as many actionable findings, but did get more attention from game teams.

Activity heatmap in DC Universe Online — Tableau — Sony Online Entertainment

2015

At Electronic Arts, I made the switch to R for the majority of my analysis, and started exploring a variety of visualization tools in the R ecosystem, including Shiny and htmlwidgets. One of the projects I worked on at EA was a server for running the analytics team’s R scripts. I wrote a Shiny script to create a web app that visualized server load, shown in the image below. I presented the RServer project at useR! 2016. However, my current recommendation would be for teams to use a more mature project such as Airflow.

RServer Load — R(Shiny) + dygraphs package — Electronic Arts (Global Analytics & Insights)

2016

The science team at Twitch does a fair amount of A/B experimentation to determine which features to roll out to the user base. We also encountered a situation where A/B testing was not practical, and instead used a staged rollout to analyze the impact of a significant app rewrite. I used the CausalImpact package in R to evaluate the impact of the staged rollout on the number of app sessions, with the results shown below. I previously discussed this approach in more detail here.

Staged rollout impact on mobile sessions — R + CausalImpact package — Twitch
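CausalImpact fits a Bayesian structural time-series model to the pre-rollout period and compares the post-rollout actuals to the forecast counterfactual. A much cruder Python stand-in, just to show the shape of the analysis, is to extrapolate a linear pre-period trend and measure the gap (the function name and synthetic data below are mine, not from the Twitch analysis):

```python
import numpy as np

def estimate_lift(series, rollout_idx):
    """Compare post-rollout values to a linear-trend counterfactual
    fit on the pre-rollout period (a crude stand-in for CausalImpact)."""
    t = np.arange(len(series))
    pre_t, pre_y = t[:rollout_idx], series[:rollout_idx]
    slope, intercept = np.polyfit(pre_t, pre_y, 1)   # OLS trend on pre-period
    counterfactual = slope * t[rollout_idx:] + intercept
    actual = series[rollout_idx:]
    return (actual - counterfactual).mean()          # average daily lift

# synthetic daily sessions: steady growth, +50 per day after rollout on day 30
days = np.arange(60)
sessions = 1000.0 + 5.0 * days + np.where(days >= 30, 50.0, 0.0)
lift = estimate_lift(sessions, 30)
```

The real value of CausalImpact over a sketch like this is the posterior interval it puts around the counterfactual, which tells you whether the measured lift is distinguishable from noise.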

2017

I’m continuing to use R for visualizations, even though I recently switched from the gaming industry to finance technology. One change I have made is doing more of my analysis in Jupyter notebooks rather than RStudio. In the coming months I’ll be writing more blog posts about the type of analysis I’m doing at Windfall Data. For now, here’s a US population heatmap generated from census data.

US Population Map — R + ggplot2 package — Windfall Data

I’ve used a variety of different tools over the past decade, and have found that there are trade-offs with each of these. My focus has shifted over the years to tools that are open source and enable reproducible research.
