The Office — Season 5, Episode 17 — Golden Ticket

DATA VISUALIZATION

A designer’s guide to visualize a text dataset

Create a data-driven visualization with Python and R

Ruta Gokhale
5 min read · Sep 13, 2019


In this article, I explore one way of visualizing a text dataset using Python and R code. As a user experience and information designer, I wanted to build on my technical skills. I did this by combining my interests and picking an unusual yet interesting dataset.

After a lot of searching, I came across this wonderful interactive dataviz project, ‘The Office’ Dialogue in Five Charts by The Pudding. This project was a great starting reference and it also pointed me to an awesome dataset that I could visualize. This dataset is essentially a collection of all the lines spoken in every episode of The Office.

Great! I had found data, and now I could finally begin!

Step I

First, I decided to set up some goals and constraints for this project based on my technical abilities and the skills I wished to learn.

I needed to know the WHY of my own project.

Why did I start this project in the first place? What did I hope to learn? What did I want to accomplish?

My goals for this project were:

  • Keep the visualization simple enough to code, yet capable of creating interesting patterns.
  • Use existing Python knowledge.
  • Learn graphics rendering in RStudio.
  • Develop a design mindset for non-numeric data.
  • Manage a side project while working full-time.

Step II

Next, to guide my design, I started thinking about the objectives of this visualization. My approach was to start asking questions that could be explored.

The goal wasn’t to answer all the questions, but to use them to brainstorm visualization concepts and ideas.

For example, some questions were:

  • What is the journey of each character?
  • What is the screen time for each character?
  • How do all the characters interact with each other?
  • Which are the main characters in each episode?

Step III

I brainstormed some ideas and viable solutions based on the available data and my goals. My ideas revolved around representing numerical values derived from a text-based dataset.

I decided to sketch out my final design before beginning to code. The sketch proved to be a useful artifact for figuring out the math and the logic.

Here, I have visualized a linear conversation in a non-linear manner.

Every line of dialogue spoken in an episode is drawn as one line segment. To convey the continuous nature of a conversation, all the segments are connected end to end. However, each connection is a right angle: where one segment ends, the next begins with a 90° pivot.

As the design suggests, I encoded data in the following manner:

  • dialog length (number of words spoken) → length of a line segment
  • character → color

And the beauty of a non-linear data visualization is this:

A really really really really (!) long piece of text can be contained within a finite space.

Step IV

I needed to convert this dataset into a more defined structure to be able to do something useful with it. My idea was to use Python to clean the data and R to render it into a visual form.

The output of the Python program became the input to the R program.

So, what output did I expect from the Python program?

A CSV file, importable into RStudio, that contains the coordinates and the color of each line segment.

I started with one random episode: Office Olympics!
(It also happens to be one of my favorites!)

I began by copying the dialogs from the website and pasting them into a .txt file. Python code could presumably be written to scrape the necessary data from the website as well. Data cleanup meant getting rid of any text that wasn’t dialog. On this website, the storyline of an episode (that is, the actions of each character) was enclosed in square brackets, so I removed all bracketed text.
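The cleanup step might look something like the sketch below. It is not the original program, and it rests on two assumptions: that each spoken line in the .txt file has the form "Speaker: dialog", and that stage directions are the only text in square brackets. The function name parse_transcript and the sample lines are made up for illustration.

```python
import re

# Assumed transcript format: one "Speaker: dialog" entry per line, with
# stage directions in [square brackets]. The real page layout may differ.
BRACKETED = re.compile(r"\[[^\]]*\]")

def parse_transcript(text):
    """Return a list of (speaker, word_count) pairs, one per spoken line."""
    dialogs = []
    for raw in text.splitlines():
        line = BRACKETED.sub("", raw).strip()  # drop stage directions
        if ":" not in line:
            continue  # blank line or pure stage direction
        speaker, spoken = line.split(":", 1)
        words = spoken.split()
        if speaker.strip() and words:
            dialogs.append((speaker.strip(), len(words)))
    return dialogs

# Made-up sample text, only to show the expected input shape.
sample = """Michael: Good morning, everyone! [waves at the camera]
[Jim looks at the camera]
Jim: Morning.
"""
print(parse_transcript(sample))
```

Counting words with a plain split() is a simplification; punctuation stays attached to words, which does not matter when only the count is needed.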

The next task was figuring out the math. Following the design, I computed the color and coordinates of each line segment from the character who spoke the dialog and the dialog’s length.
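To make the math concrete, here is a minimal sketch of one way to compute the segments. It is an illustration under stated assumptions rather than the original code: the article does not say which way the pivot turns, so this version always turns clockwise and starts heading east, and the palette, the fallback color, and the function names build_segments and to_csv are hypothetical.

```python
import csv
import io

# Hypothetical palette; the article does not list the actual colors.
PALETTE = {"Michael": "#1f77b4", "Jim": "#ff7f0e"}
DEFAULT_COLOR = "#999999"

# Assumption: every pivot is a clockwise quarter turn, starting eastward.
DIRECTIONS = [(1, 0), (0, -1), (-1, 0), (0, 1)]  # east, south, west, north

FIELDS = ["x", "y", "xend", "yend", "character", "color"]

def build_segments(dialogs):
    """Turn (speaker, word_count) pairs into connected line segments."""
    x, y = 0, 0
    rows = []
    for i, (speaker, n_words) in enumerate(dialogs):
        dx, dy = DIRECTIONS[i % 4]  # rotate 90 degrees for each new dialog
        x_end, y_end = x + dx * n_words, y + dy * n_words
        rows.append({
            "x": x, "y": y, "xend": x_end, "yend": y_end,
            "character": speaker,
            "color": PALETTE.get(speaker, DEFAULT_COLOR),
        })
        x, y = x_end, y_end  # the next segment starts where this one ends
    return rows

def to_csv(rows):
    """Serialize the segments as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(build_segments([("Michael", 3), ("Jim", 1), ("Pam", 2)])))
```

The column names x, y, xend, and yend were chosen because they match the aesthetics that ggplot2's geom_segment expects, which should keep the R side short.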

Step V

After running the Python program, I had the necessary input. I used the CSV file it generated to draw a line-based data visualization in RStudio. Even though the R code was small and relatively easy to write, it was a useful way to learn the ggplot2 library.
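The rendering in this project was done in R with ggplot2, and that code is not reproduced here. Purely as an illustration of the same idea in Python, the sketch below swaps in a different technique: it reads a segment CSV and writes the lines as SVG using only the standard library. The column names (x, y, xend, yend, color), the function name segments_to_svg, and the sample rows are assumptions.

```python
import csv
import io

def segments_to_svg(csv_text, padding=5):
    """Render segment rows (x, y, xend, yend, color) as an SVG string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return '<svg xmlns="http://www.w3.org/2000/svg"></svg>'
    xs = [float(r[k]) for r in rows for k in ("x", "xend")]
    ys = [float(r[k]) for r in rows for k in ("y", "yend")]
    min_x, max_y = min(xs), max(ys)
    width = max(xs) - min_x + 2 * padding
    height = max_y - min(ys) + 2 * padding
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'viewBox="0 0 {width:g} {height:g}">']
    for r in rows:
        # SVG's y axis points down, so flip y around the top of the drawing.
        x1 = float(r["x"]) - min_x + padding
        y1 = max_y - float(r["y"]) + padding
        x2 = float(r["xend"]) - min_x + padding
        y2 = max_y - float(r["yend"]) + padding
        color = r["color"]
        parts.append(f'<line x1="{x1:g}" y1="{y1:g}" x2="{x2:g}" '
                     f'y2="{y2:g}" stroke="{color}" />')
    parts.append("</svg>")
    return "\n".join(parts)

# Made-up sample rows in the assumed CSV layout.
sample_csv = (
    "x,y,xend,yend,character,color\n"
    "0,0,3,0,Michael,#1f77b4\n"
    "3,0,3,-1,Jim,#ff7f0e\n"
)
print(segments_to_svg(sample_csv))
```

Saving the returned string to a .svg file gives a vector image that can be opened in Illustrator for further styling.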

And voila! That’s how it’s done.

The FINAL RESULT

Below are some outputs from this project, with a few modifications made in Illustrator. I’ve tinkered with the colors to bring out different patterns.

The Office — Season 2, Episode 3 — Office Olympics
The Office — Season 1
The Office — Some of my favorite cold opens

Final comments

I assume this may not be the most brilliant or complex thing you have read. However, I do hope that it is one of the more interesting things you’ve come across today.

With this post, I want to encourage you to:

  • Start small, and then iterate.
    You don’t always have to go big on one “amazing” idea. It is better to create something through multiple tiny failure loops.
  • Think outside as well as inside the box.
    You don’t always need a new solution. Often, an innovative solution is a new way of doing the same old thing.
  • Dive into projects that are outside your comfort zone.
    Take on challenges to get comfortable with the uncomfortable.
