Visualization of 10 Years Twitter Data (Part 2 — Design)

Published in Towards Data Science

5 min read · Apr 17, 2017

This post is about my data visualization project http://tany.kim/twitter. Read the previous post about data cleaning and processing.

With a fairly massive data set of 102K tweets and 349 friends, I came up with two main views (Tweets and Friends). Each view has a series of interactive visualizations, some of which are linked and update in response to user interaction in one of the views. Here I describe the decision-making process: why I chose these visualization types and interactions. Discussing the design process inevitably includes insights found with the visualizations.

Stacked bar charts for tweets by categories

Stacked bar charts of tweets categorized with interaction type during a selected time range

For the analysis of tweets, I chose 4 categories (Interaction, Media, Language, Source). Tweets in each category belong to one of 2–3 types and are color-coded. Tweets that do not belong to any type are shown in light grey. To show the timeline of tweets, I counted tweets by month; the tweets of each month are represented as a group of stacked bars. With this stacked bar chart, I can see that I rarely retweet and that I started quoting only in late 2015. I also find that I mentioned others more (i.e., talked to friends) over the past year than before.
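The monthly counting behind such a stacked bar chart can be sketched as follows. This is a minimal illustration, not the project's actual code: the record layout, the type names (`mention`, `quote`, `retweet`), and the `monthly_stacks` helper are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical records: (timestamp, interaction type); None means "no type".
tweets = [
    (datetime(2015, 11, 3, 9, 15), "mention"),
    (datetime(2015, 11, 20, 22, 5), None),
    (datetime(2015, 12, 1, 8, 0), "quote"),
    (datetime(2015, 12, 9, 13, 30), "mention"),
    (datetime(2015, 12, 9, 18, 45), "retweet"),
]

def monthly_stacks(records):
    """Return {"YYYY-MM": Counter({type: count})}, one stack per month."""
    stacks = defaultdict(Counter)
    for created_at, kind in records:
        month = created_at.strftime("%Y-%m")
        # Untyped tweets get their own segment (the light grey one).
        stacks[month][kind or "none"] += 1
    return dict(stacks)

stacks = monthly_stacks(tweets)
for month in sorted(stacks):
    print(month, dict(stacks[month]))
```

Each month's counter maps directly to one group of stacked segments, so the same structure could be regrouped by hour or by weekday simply by changing the key.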

The same technique is used to show the number of tweets by hour or day. In the figure below, which shows the source of tweets, I can see a clear pattern: I tweet with my phone as soon as I wake up, switch to a computer as I start to work, and then the ratio of phone use increases again in the evening.

Stacked bars to show the source of tweets (green: big screen, orange: small screen)

Pie chart for the ratio of types

Pie charts of four categories. Selected Category is highlighted.

The stacked bar chart is useful for investigating how tweets change over time. In addition, I wanted to provide an easier way of showing the ratio of tweet types in a given time range. These four pie/donut charts aggregate all the tweets in the selected time range.

Matrix (Heat Map) view for day/hour

Number of all tweets (not categorized) by day and hour

I used the Matrix view only for “All Tweets” so that I could encode a single property of the dataset (number of tweets) into a single visual attribute (opacity of the area-filling color). In past projects I liked to use heat maps with a limited number of steps (usually 7 at most) between two colors. That way, each color represents a discrete group within the entire range. I like the aesthetics of such a distinctive use of colors; however, for this project I wanted to show the subtle differences among the 7x24 data points, so I used continuous opacity instead, which is also simpler to code.

Opacity is a great way to show relative differences between neighboring elements. But the exact value is hard to decode, even with a gradient legend of the data range. To support this, I added a mouseover interaction that triggers a tooltip. This function is not supported on smaller screens.
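The opacity encoding described above can be sketched like this. It is only an illustration under stated assumptions: the linear scaling against the busiest cell and the helper names `day_hour_matrix` and `to_opacity` are mine, and the sample timestamps are made up.

```python
from datetime import datetime

def day_hour_matrix(timestamps):
    """Count tweets in a 7x24 grid: rows are weekdays (Mon=0), columns are hours."""
    grid = [[0] * 24 for _ in range(7)]
    for ts in timestamps:
        grid[ts.weekday()][ts.hour] += 1
    return grid

def to_opacity(grid):
    """Scale each count linearly to 0..1 relative to the busiest cell."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0.0] * 24 for _ in range(7)]
    return [[count / peak for count in row] for row in grid]

# Hypothetical sample timestamps, not the real tweet data.
sample = [
    datetime(2017, 4, 17, 8, 5),   # a Monday, 08h
    datetime(2017, 4, 17, 8, 40),  # same Monday, 08h
    datetime(2017, 4, 18, 21, 0),  # Tuesday, 21h
]
opacity = to_opacity(day_hour_matrix(sample))
print(opacity[0][8], opacity[1][21])
```

A stepped heat map would instead bucket each count into one of a few discrete color classes; the continuous version skips that bucketing step, which is why it is the simpler choice to code.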

All connected visualizations for friend search

In contrast to the Tweets view, which has a conventional sub-menu content structure, the Friends view is a search-driven, single-page design.

Four components (a search/dropdown list of friends, line graphs, a force-directed network graph, and a scatterplot) function as the user interface through which a friend can be selected. When a friend is selected, all visualizations in the Friends view are updated to show the status of that friend.

Line graphs for mentions to friends

Early prototyping of line graphs for friends

This is the very first visualization I made for this project. The figure is a screenshot from an early phase of development. It is interesting but too cluttered, so I wanted to encode one more dimension of the dataset, ideally a categorical one so that colors could be used. The idea of tagging friends as real-world and/or online friends, which I explained earlier, originally came from this need. This is a great example of iterative data visualization design: datasets are not always complete before being handed to a designer, and ideas for interesting visualizations can trigger the creation of new data attributes.

With color coding, I can clearly see that I first talked with people I already knew in the real world, then expanded my network to people I had never met. Later I met some of them in person when I had a chance to visit the cities where they live. I have met some of them multiple times, and I appreciate the network that Twitter made possible.

Scatterplot

One of the simplest visualizations for encoding two-dimensional data is a scatterplot. I used it to encode the number of mentions (X axis) and the duration of conversation (Y axis). Each dot represents a single friend and is color-coded with the same category used in the line graphs.
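As a rough sketch of how each dot's coordinates could be derived, the snippet below computes both values from a per-friend list of mention dates. The data layout, the `scatter_point` helper, and the definition of duration as the number of days between the first and last mention are assumptions for illustration.

```python
from datetime import date

# Hypothetical mention log: friend handle -> dates on which I mentioned them.
mentions = {
    "friend_a": [date(2010, 1, 5), date(2012, 6, 1), date(2016, 1, 5)],
    "friend_b": [date(2015, 3, 1)],
}

def scatter_point(dates):
    """Return (x, y) = (number of mentions, days between first and last mention)."""
    return len(dates), (max(dates) - min(dates)).days

points = {name: scatter_point(dates) for name, dates in mentions.items()}
for name, (x, y) in points.items():
    print(name, x, y)
```

A friend mentioned only once lands on the X axis with a duration of zero, which is one reason a plot like this can show every friend individually rather than in bins.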

Force-directed graph

By tracing the accounts mentioned in each tweet, I analyzed how my friends are connected to one another. A common visualization form for this is a force-directed graph, and I started coding it from this example. Each node (circle) is color-coded the same way as in the line graphs and scatterplot, and the thickness of an edge (the line between two circles) represents the number of tweets in which the two friends appear together. When a friend is selected, the corresponding node and edges are highlighted.
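The edge weights described above can be sketched as a co-mention count. This is a minimal example with made-up account names; the `co_mention_edges` helper and the input format (one list of mentioned accounts per tweet) are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the accounts mentioned in each tweet.
tweet_mentions = [
    ["alice", "bob"],
    ["bob", "alice", "carol"],
    ["carol"],
    ["alice", "bob"],
]

def co_mention_edges(mention_lists):
    """Count, for each pair of accounts, the tweets that mention both."""
    edges = Counter()
    for accounts in mention_lists:
        # Sort and de-duplicate so (a, b) and (b, a) are the same edge.
        for pair in combinations(sorted(set(accounts)), 2):
            edges[pair] += 1
    return edges

edges = co_mention_edges(tweet_mentions)
for (source, target), weight in edges.items():
    print(source, target, weight)
```

The resulting weights can drive edge thickness, and the number of distinct partners per account gives a count of shared friends.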

Histograms & distribution curves & ranking

Before I made the scatterplot, I made two separate histograms to show the distribution of the number of mentions and that of the duration of communication. Before making these histograms, I had anticipated that they would look close to a normal distribution. The result was very surprising. Many of my Twitter friends (more than 250 out of 349) have received fewer than 100 mentions in total, whereas a few have received more than 2000. The duration graph was not as extreme as the count one, but it still showed a similar pattern. In fact, these skewed distributions led me to make the scatterplot, which displays every single friend while juxtaposing the two dimensions: count and duration. In addition, after parsing the network graph data, I thought showing the distribution of the number of shared friends would be fun too.

I put a distribution curve over each histogram, then listed the top 10 friends in each category. The histograms themselves are not interactive, but they show where a selected friend stands.
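The binning behind such a histogram can be sketched as follows. The values are invented to mimic a skewed distribution and are not the real data; the `histogram` helper and the bin width of 100 are assumptions.

```python
from collections import Counter

def histogram(values, bin_width):
    """Return {bin_start: count} using fixed-width bins."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))

# Hypothetical mention counts per friend, not the real data.
mention_counts = [3, 12, 40, 75, 99, 100, 180, 2300]
bins = histogram(mention_counts, 100)
for start, count in bins.items():
    print(f"{start}-{start + 99}: {count}")
```

With skewed data like this, most values pile into the first bin while a single outlier sits far to the right, which is the pattern that motivated plotting each friend individually.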