To see live versions of the charts included in this article, and the code used to create them, see my GitHub project repository here.
When working with text datasets, getting a handle on the "who, what, and where" of the content can be a challenge. This is especially true with large collections of online articles.
The unstructured text of these articles makes it difficult to examine the collection as a whole. Machine learning and data science techniques can instead parse out key information (titles, dates, keywords, names, locations) to create a structured dataset that is much easier to navigate.
In this article I’ll highlight five interactive charts used to visualize one of these structured datasets, with the goal of exploring the trends and relationships that exist within it.
The Dataset
The ISW dataset contains data from over 1,700 articles published by the Institute for the Study of War (ISW). ISW articles cover a wide range of political topics from around the world, and are used by intelligence professionals to keep up to date on current events.
The creation of the dataset is documented in another article, which describes the web scraping process used to collect the articles and the natural language processing techniques used to enrich them. See that article for details on exactly which fields the dataset contains.
Much of the important information found in ISW articles relates to the people, places, and dates of the newsworthy events being written about. The following charts aim to make this information discoverable in a simple, visual way.
Searchable Word Bubble
When exploring a large text dataset, a first goal is figuring out which words are used most frequently throughout it. Knowing this can give insight into the overall subject matter of the content and any patterns in reporting.
A searchable word bubble chart combines the visuals of a word cloud with the results of a keyword search:

How It Works
- Individual words are presented as circular nodes sized according to how often they appear in the collection of articles.
- The word and the number of times it appears are displayed within the node.
- New words are added to the chart by typing them into the search bar.
- Clicking a node will display its search results, showing each article and individual sentences where the word appears.
The chart is initially populated with words that were identified as keywords in the dataset, extracted from each ISW article using NLP techniques.
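The data behind a chart like this can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the chart above: the sample articles, the `title` and `text` field names, and the helper functions are all assumptions, and the sentence splitting is deliberately naive.

```python
import re
from collections import Counter

# Hypothetical sample data; the real ISW dataset fields may differ.
articles = [
    {"title": "Article A", "text": "Russia is mentioned here. This sentence is not relevant."},
    {"title": "Article B", "text": "Nothing to see at first. Then Russia appears again!"},
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_counts(articles):
    """Count how often each word appears across all articles (drives node size)."""
    counts = Counter()
    for article in articles:
        counts.update(tokenize(article["text"]))
    return counts

def search(articles, word):
    """Return (title, sentence) pairs for every sentence that contains the word."""
    word = word.lower()
    hits = []
    for article in articles:
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", article["text"]):
            if word in tokenize(sentence):
                hits.append((article["title"], sentence))
    return hits

counts = word_counts(articles)
results = search(articles, "Russia")
```

Here `counts` would set the size of each node, and `results` is the kind of list a click on a node could display.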
Just by looking at the chart above, it’s possible to start inferring trends in the ISW dataset: it concentrates on stories involving the Middle East, Russia is a dominant topic, and many conflict-related terms are used repeatedly.
With those first insights, it’s possible to dive deeper into a topic. Clicking on the ‘Russia’ node will display all articles where that word was mentioned, and the highlighted sentences give instant context to what the reporting is about.
It’s easy to see the benefit of this type of chart. It gives users the best of both worlds: initial clues about what the dataset contains, paired with the capability to search further on specific items of interest.
Co-occurrence Matrix
The ISW dataset contains names of people extracted from each article. A list of these names alone provides information about the subjects of the collection, but more interesting are the relationships between them.
A co-occurrence matrix visualizes how often two people are mentioned within the same article:

How It Works
- The matrix is laid out with person names along both the X and Y axes.
- Each colored cell represents the number of times the two corresponding people were mentioned within the same ISW article, with darker cells indicating a higher number.
- Clicking any of the cells will show the names and pictures of both people in the side menu, as well as a list of all the articles they appear in.
- Rows and columns can be reordered alphabetically or by total number of mentions.
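The counting behind such a matrix can be sketched as follows. This is a rough illustration under my own assumptions rather than the chart's actual code: the placeholder names, the `article_people` input shape, and both helper functions are hypothetical, and rows are reordered here by total co-occurrences as a stand-in for total mentions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the people extracted from each article (placeholder names).
article_people = [
    ["Person A", "Person B", "Person C"],
    ["Person B", "Person C"],
    ["Person A"],
]

def cooccurrence_counts(article_people):
    """Count the articles in which each pair of people appears together."""
    pairs = Counter()
    for people in article_people:
        # Deduplicate and sort so every pair has a single canonical key.
        for a, b in combinations(sorted(set(people)), 2):
            pairs[(a, b)] += 1
    return pairs

def to_matrix(pairs, names):
    """Build a symmetric matrix (list of lists) following the given name order."""
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for (a, b), n in pairs.items():
        matrix[index[a]][index[b]] = n
        matrix[index[b]][index[a]] = n
    return matrix

pairs = cooccurrence_counts(article_people)
names = sorted({p for people in article_people for p in people})  # alphabetical order
matrix = to_matrix(pairs, names)

# Alternative ordering: by total co-occurrences, highest first.
totals = {name: sum(matrix[i]) for i, name in enumerate(names)}
by_total = sorted(names, key=lambda name: -totals[name])
```

Each value in `matrix` would map to a cell's color intensity, and switching between `names` and `by_total` corresponds to reordering the rows and columns.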
Knowing how many times two people are written about in the same article can help to understand their relationship. Frequent mentions together may indicate they are connected by topic, or are routinely involved in the same news events.
It’s simple to trace the column of any one person to identify everyone they are associated with. Clicking each cell and reading the linked articles can give users better context for the nature of the relationship between the individuals.
People of interest, ones with connections to many others, stick out as densely filled columns and rows on the chart. For example, Vladimir Putin is written about extensively with others, and his cells on the chart are noticeably filled.
Network Link Graph
The co-occurrence matrix is great for showing one-to-one connections, but it may not be the best choice for following a string of connections that weaves through the group.
A network link graph visualizes the connections between multiple people:

How It Works
- Each person is represented as a circular node that includes their picture.
- A line connecting two nodes indicates that the two individuals appear in at least one ISW article together.
- Hover over a node to highlight its connections. Click and drag to pin a node in place. Double-click to unpin it.
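The structure behind a graph like this is an adjacency list built from the same per-article name lists. The sketch below is only an assumption about how that could be done, with placeholder names and hypothetical helper functions; it also shows how connections two links away can be derived.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: the people extracted from each article (placeholder names).
article_people = [
    ["Person A", "Person B"],
    ["Person B", "Person C"],
    ["Person C", "Person D"],
]

def build_graph(article_people):
    """Link two people if they appear in at least one article together."""
    graph = defaultdict(set)
    for people in article_people:
        for a, b in combinations(set(people), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def secondary_connections(graph, person):
    """People exactly two links away: neighbors of neighbors, minus direct links."""
    direct = graph.get(person, set())
    second = set()
    for neighbor in direct:
        second |= graph.get(neighbor, set())
    return second - direct - {person}

graph = build_graph(article_people)
direct = graph["Person A"]
second = secondary_connections(graph, "Person A")
```

In this toy data, `direct` is what hovering over a node would highlight, and `second` is the set reached by then moving on to each of those neighbors.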
A view where all the connections between everyone are displayed at once has its own benefits. Once a person of interest is found, seeing everyone they’re connected to is quick. Seeing secondary connections from that set of people is as easy as moving the mouse cursor over them.
Pinning the positions of the nodes allows users to start organizing the network spatially. Pulling key nodes towards the edge, or clustering groups of similar people can help parse through the information visually.
World Map
The ISW dataset also includes names of countries extracted from each article. Much of the content is concerned with where events are happening, so having the ability to search for articles by location is extremely helpful.
An interactive world map allows users to see exactly which countries are written about the most:

How It Works
- Countries on the map are color-coded by the number of times each was mentioned within the dataset, with darker shades indicating more mentions.
- Mouse over any country to see its name and number of mentions, and click it to display links and details of articles it was extracted from.
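The coloring logic can be sketched as a count followed by a mapping onto a small number of shades. This is a simplified illustration with made-up sample data; counting each country once per article, the five-level scale, and the function names are all my assumptions, not details taken from the chart itself.

```python
from collections import Counter

# Hypothetical input: the countries extracted from each article (illustrative only).
article_countries = [
    ["Russia", "Syria"],
    ["Syria", "Iraq"],
    ["Syria"],
]

def mention_counts(article_countries):
    """Count the articles that mention each country (once per article)."""
    counts = Counter()
    for countries in article_countries:
        counts.update(set(countries))
    return counts

def shade(count, max_count, levels=5):
    """Map a count to a shade index from 0 (lightest) to levels - 1 (darkest)."""
    if max_count == 0:
        return 0
    # Linear scaling with floor division; the largest count gets the darkest shade.
    return (count * (levels - 1)) // max_count

counts = mention_counts(article_countries)
max_count = max(counts.values())
shades = {country: shade(n, max_count) for country, n in counts.items()}
```

A linear scale is the simplest choice; if a few countries dominate the counts, a logarithmic or quantile scale may spread the shades more evenly.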
This chart is simple but effective. Without any prior knowledge of the dataset, it’s easy to see that most reporting is centered on the Middle East and Russia. Digging into the articles associated with a country of interest is only a few clicks away.
Binned Timeline
It’s one thing to know ‘who’ is being written about, but knowing ‘when’ they are written about can provide a lot of insight as well.
A binned timeline visualizes the years and months when people are written about:

How It Works
- The Y axis of the chart lists people’s names. The X axis shows a binned timeline, where each cell represents a single month of a year.
- Cells are colored by count, with darker colors indicating that the person was mentioned in more articles that month.
- Clicking any cell will show the name and picture of the selected person in the side menu, as well as a list of articles they appear in.
- Names can be ordered alphabetically, or by count.
Using the same names extracted from the articles, along with the publication dates of the articles they appear in, it’s possible to track trends in reporting throughout the years.
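Combining names with publication dates comes down to counting articles per person per month. The sketch below shows one way to do that; the sample records, the `date` and `people` field names, and the helper functions are assumptions for illustration, not the dataset's actual schema.

```python
from collections import Counter
from datetime import date

# Hypothetical records: publication date plus the people extracted from the article.
articles = [
    {"date": date(2016, 3, 14), "people": ["Person A", "Person B"]},
    {"date": date(2016, 3, 30), "people": ["Person A"]},
    {"date": date(2016, 6, 2), "people": ["Person A"]},
]

def monthly_counts(articles):
    """Count articles per (person, year, month) bin."""
    bins = Counter()
    for article in articles:
        published = article["date"]
        for person in set(article["people"]):
            bins[(person, published.year, published.month)] += 1
    return bins

def month_range(start, end):
    """Yield every (year, month) from start to end inclusive, so gaps become empty cells."""
    year, month = start
    while (year, month) <= end:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1

def timeline_row(bins, person, start, end):
    """Return one row of the chart: the person's article count for each month."""
    return [bins[(person, y, m)] for y, m in month_range(start, end)]

bins = monthly_counts(articles)
row = timeline_row(bins, "Person A", (2016, 3), (2016, 6))
```

Filling in every month between the first and last date, rather than only the months with articles, is what makes gaps in reporting show up as empty cells.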
Trends are immediately noticeable with a quick glance at the chart. For example, reporting on John Kerry primarily exists in 2016, when he served as Secretary of State. Sergey Lavrov has a two-year window from 2017 to 2019 when he was not written about at all.
For any given person, it’s apparent when reporting on them started and stopped, and where any gaps exist. This chart is beneficial to any users looking to explore longer-term trends within their dataset.
Takeaway
The above charts are all examples of visuals that aim to make data more accessible and easier to comprehend. They help transform a dataset that is too large to explore manually into smaller pieces that are more approachable.
Use these charts to discover information in your own text-based datasets. Take a look at the source code for each of them here.