Exploring the Meaning of AI, Data Science and Machine Learning with the latest Wikipedia Clickstream

Published in Towards Data Science · 10 min read · May 2, 2018

Terms such as data science, machine learning and artificial intelligence have found a well-deserved spot in the “pantheon” of tech buzzwords.

We hear and read about them almost daily. They are often used interchangeably, although preferences clearly vary by source (you might have noticed the fondness that marketing and media, say, have for the term AI).

https://medium.com/enabled-innovation/artificial-general-intelligence-too-much-or-too-little-too-soon-9c0dd7bd1c2d

The illustration above, adapted from PwC, is one of the many Venn diagrams on the internet that attempt to explain the relationship between the terms, but it is not very straightforward.

Can we use data and analytical methods to capture the meaning and semantic context of these terms? OK, let’s give it a go and try to fight fire with fire. The Wikipedia clickstream dataset will be of great help for this.

The Wikipedia clickstream

The Wikimedia Foundation has recently decided to publicly release its monthly clickstreams for several key languages. A typical English-language clickstream contains millions of distinct Wikipedia URLs, requested billions of times by internet users.

Having such a comprehensive resource available is great news. It enables several sophisticated types of analysis around online user behaviour over one of the world’s busiest websites. Indeed there are several published academic papers that study the clickstream dataset from various scientific perspectives, especially behavioural.

This article takes a practical approach using network analysis, not only to answer the question above but also to motivate the use of clickstream and network analysis in other applications and domains. Keyword research, content development and web traffic visualisation come to mind, and I am sure there are many others.

The Dataset

We will use the latest available dataset, from March 2018, which contains 23.3M pairs of referrer and target URLs in English, accounting for a total of 6.2B page requests to the Wikipedia servers.

The dataset makes it possible to visualise the relationship between different terms based on the way users navigate from one Wikipedia page to another, either by following links or by using the built-in search functionality, sometimes falling victim to the Wikipedia rabbit holes.
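As a rough illustration of what the referrer and target pairs look like in practice, here is a minimal Python sketch. It is not the code behind this article (that code is in R). It assumes the tab-separated layout of the public clickstream dumps, with columns prev, curr, type and n, which the article itself does not spell out. The sample rows and the helper name `read_clickstream` are made up for the example.

```python
import csv
import io
from collections import Counter

# Made-up sample rows in the ASSUMED layout: prev, curr, type, n (tab-separated).
SAMPLE_TSV = (
    "Data_science\tMachine_learning\tlink\t1200\n"
    "other-search\tData_science\texternal\t45000\n"
    "Data_mining\tData_science\tlink\t300\n"
    "Machine_learning\tData_science\tlink\t800\n"
)


def read_clickstream(handle):
    """Yield (prev, curr, link_type, n) tuples from a tab-separated stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for prev, curr, link_type, n in reader:
        yield prev, curr, link_type, int(n)


rows = list(read_clickstream(io.StringIO(SAMPLE_TSV)))

# Number of distinct (referrer, target) rows and the total requests they carry.
pair_count = len(rows)
total_requests = sum(n for _, _, _, n in rows)

# Requests split by how the user arrived (internal link, external source, ...).
requests_by_type = Counter()
for _, _, link_type, n in rows:
    requests_by_type[link_type] += n

print(pair_count, total_requests, dict(requests_by_type))
```

On a real monthly file, the same two aggregates correspond to the "pairs" and "page requests" figures quoted above; swapping the in-memory sample for an opened file handle should be the only change needed.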

Nodes, Edges and Neighbourhoods

The dataset is practically a gigantic network of nodes (the Wikipedia articles) which are connected via edges to other nodes (related articles) based on the sequence of pages requested during the user session.

The size of the edges can be thought of as a weight factor reflecting the traffic between two nodes. One of the benefits of working with network analysis is that its outputs can be graphed. This allows us to see the overall view of what all those clicks and page transitions might mean.
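To make the node, edge and weight idea concrete, the sketch below builds a small weighted, directed graph as a dictionary of dictionaries, where each edge weight is the traffic between two articles. It is a Python illustration under stated assumptions, not the R code used for the graphs: the rows are invented, and `build_graph` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up (referrer, target, requests) rows; not taken from the real dataset.
ROWS = [
    ("Data_science", "Machine_learning", 1200),
    ("Data_science", "Statistics", 400),
    ("Machine_learning", "Data_science", 800),
    ("Data_science", "Machine_learning", 300),  # repeated pair, weights add up
]


def build_graph(rows):
    """Return {referrer: {target: total_requests}} for the given rows."""
    graph = defaultdict(dict)
    for prev, curr, n in rows:
        graph[prev][curr] = graph[prev].get(curr, 0) + n
    return dict(graph)


graph = build_graph(ROWS)

# The heaviest edge is the page transition that carries the most traffic.
heaviest = max(
    ((p, c, w) for p, targets in graph.items() for c, w in targets.items()),
    key=lambda edge: edge[2],
)
print(heaviest)
```

A plain dictionary keeps the example dependency-free; a dedicated graph library would be the more natural choice for anything beyond a toy example.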

NOTE: The graphs that follow have been configured to enable decent viewing on mobile devices. However they are best viewed on a large screen.

Next we will examine clickstream graphs of the following terms:

Data mining and data science
Machine learning
Artificial intelligence and artificial general intelligence.

Warm-up with data mining

As a warm-up, and in order to explain the method of analysis, we’ll start with data mining, a term that was heavily used in the past.

This initial graph isolates the part of the network that involves “data mining” and its neighbourhood of adjacent nodes (articles). The edge between two nodes represents the traffic between them, and the colour and size of each node reflect the number of neighbours it is connected to. This is already useful to some extent. As a second step, to make more sense of the related terms, the nodes were grouped into relevant categories so that the broader relationships can be seen more clearly.

The downside of this (beyond my own bias in applying the labels) is that the graph can get overloaded with text and colours. Moreover, in some cases, such as AI, it can be really hard to keep the number of groups at a manageable level, and in the case of artificial general intelligence the nodes are not homogeneous enough to create meaningful categories. In both cases the nodes were deliberately left ungrouped. Note also that, to make the graphs clearer, some nodes are not shown, either because of their low traffic or because they have very few neighbours.
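The neighbourhood step can be sketched as follows: keep the focus article and every article directly connected to it, then count how many distinct neighbours each remaining node has. This is a Python sketch under assumptions, not the R code used for the graphs. The edge list is invented, and `ego_network` and `neighbour_counts` are hypothetical helper names.

```python
from collections import defaultdict

# Made-up (referrer, target, requests) edges; not taken from the real dataset.
EDGES = [
    ("Data_mining", "Machine_learning", 900),
    ("Data_mining", "Cluster_analysis", 400),
    ("Statistics", "Data_mining", 250),
    ("Machine_learning", "Cluster_analysis", 300),
    ("Machine_learning", "Deep_learning", 700),
    ("Philosophy", "Logic", 100),
]


def ego_network(edges, focus):
    """Return the edges among `focus` and its direct neighbours."""
    neighbours = {focus}
    for prev, curr, _ in edges:
        if prev == focus:
            neighbours.add(curr)
        elif curr == focus:
            neighbours.add(prev)
    return [(p, c, n) for p, c, n in edges if p in neighbours and c in neighbours]


def neighbour_counts(edges):
    """Count distinct neighbours per node, ignoring edge direction."""
    adjacent = defaultdict(set)
    for prev, curr, _ in edges:
        adjacent[prev].add(curr)
        adjacent[curr].add(prev)
    return {node: len(others) for node, others in adjacent.items()}


sub_edges = ego_network(EDGES, "Data_mining")
degrees = neighbour_counts(sub_edges)
print(degrees)
```

In a plot, the values in `degrees` would drive node size and colour, while the third element of each edge would drive edge width.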
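The pruning described here can be sketched as a two-threshold filter: drop low-traffic edges first, then drop nodes with too few neighbours. The thresholds, the sample edges and the helper name `prune` are all assumptions made for this Python illustration; the values actually used for the graphs are not stated in the article.

```python
from collections import defaultdict

# Made-up (referrer, target, requests) edges; not taken from the real dataset.
EDGES = [
    ("Artificial_intelligence", "Machine_learning", 2000),
    ("Artificial_intelligence", "Turing_test", 1500),
    ("Machine_learning", "Turing_test", 50),
    ("Artificial_intelligence", "Glossary_of_artificial_intelligence", 20),
    ("Machine_learning", "Deep_learning", 1800),
    ("Artificial_intelligence", "Deep_learning", 900),
]


def prune(edges, min_traffic, min_neighbours):
    """Keep edges with enough traffic whose endpoints have enough neighbours.

    Single pass: neighbour counts are computed once, after the traffic
    filter, and are not recomputed after nodes are removed.
    """
    busy = [(p, c, n) for p, c, n in edges if n >= min_traffic]
    adjacent = defaultdict(set)
    for p, c, _ in busy:
        adjacent[p].add(c)
        adjacent[c].add(p)
    keep = {node for node, others in adjacent.items() if len(others) >= min_neighbours}
    return [(p, c, n) for p, c, n in busy if p in keep and c in keep]


# Hypothetical thresholds chosen only to make the toy example interesting.
pruned = prune(EDGES, min_traffic=100, min_neighbours=2)
visible_nodes = sorted({node for p, c, _ in pruned for node in (p, c)})
print(visible_nodes)
```

Raising `min_neighbours` is the kind of adjustment that stops a very dense neighbourhood from overwhelming the plot, at the cost of hiding less connected articles.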

Data Science

One thing to note from the start is that the Wikipedia entry for data science is relatively limited. It has the fewest hyperlinks of all the terms reviewed in this article. The content, however, has been growing substantially: since November 2017, when the first dataset was released, the number of links on the page has almost doubled.

Looking at the graph, we can say that from a Wikipedia perspective data science revolves around machine learning (primarily) and statistics (secondarily), with a touch of computer science as well (which becomes more obvious when the long tail of nodes is also considered).
In practice I think that most would agree that data science is broader than that. As the article content keeps growing hopefully there will be more evidence of this soon.
To the commonly asked question of whether machine learning should be considered an essential component of data science, the answer from a Wikipedia clickstream perspective is clearly yes.

(Fun fact about data science: one of the pages associated with it is actually the Wikipedia article for buzzword, even though its vertex degree was not sufficient to get it into the graph).

Data science — data mining comparison

The article for data mining is much more comprehensive as it has been around for much longer.

That said, there is still a fair amount of similarity between the types of associated terms. Most of the terms in the data science graph appear in the data mining one too. This could give some ground to those suggesting that data science is a rebranded version of data mining as used in the ’90s and ’00s.

For the moment, however, data science lacks the strong associations that data mining has with business-related terms such as business intelligence, analytics and perhaps OLAP, but those and other associations might emerge as the data science article develops.

Machine Learning

Machine learning seems to be the most straightforward case of all. It is for the most part associated with terms referring to different scientific methods for knowledge discovery or prediction (labelled as machine or statistical learning methods).

We see many nodes of the following two types: classical methods like statistical classification and logistic regression and other more modern ones like Support Vector Machines, Random Forests and Artificial Neural Nets.
The concept of machine learning is primarily associated with methods, models and techniques. Many, but not all, aim to predict an outcome given a set of observations. In fact, the list of ML-associated nodes could easily be turned into the table of contents for a book on ML.
The presence of several “x learning” types, such as deep learning or reinforcement learning, is also impressive. It is no longer just about supervised and unsupervised learning. This suggests that ML itself might be evolving into an umbrella term that hosts an increasing number of learning families.
Does machine learning “need” stats? Just like before, the Wikipedia answer is yes, as statistics or statistical learning is present in many of the ML-associated methods.

AI

This is by far the broadest and most diverse term, but also the hardest one to convert into a graph. AI by itself is associated with more than 450 nodes, and I had to set a high threshold for the number of neighbours so that the graph wouldn’t look like a massive ball of hair. Even so, the diversity of the nodes made it hard to define meaningful categories.

Many of the nodes are of a very general nature: names of sciences (Psychology, Philosophy) or scientific fields (Logic, Game theory etc.), which is evidence of how multidisciplinary AI is.
Most of the related nodes are essentially references to some particular aspects of AI e.g. applications of AI, glossary of AI, timeline, history etc. This is something that we didn’t observe in the previous terms.
Likewise, here for the first time we see a company, DeepMind, which may reflect the role this organisation has played in applying AI to real-world problems.
Finally, there are two nodes about Alan Turing, his own page and that of the eponymous test. It’s just impressive how the Turing test dating back to 1950 has stood the test of time and is still relevant in the context of AI today.

Artificial General Intelligence

To stay current with the spirit of the times the last clickstream network graph is for artificial general intelligence.

The nodes of the graph are not homogeneous enough to create multiple meaningful clusters. However, between nodes bearing names such as “mind uploading”, and “intelligence explosion” we can see two themes that editors and users seem to be interested in.

Let’s call the first cluster “AI for evil”, including pages like

Existential risk from artificial general intelligence
AI takeover
Global catastrophic risk

Let’s close with the “AI for good” cluster that includes:

Friendly artificial intelligence
Philosophy of artificial intelligence and
Ethics of artificial intelligence

The clusters by themselves obviously cannot define artificial general intelligence, but they are indicative of the popular content associated with it.

Artificial neural net ubiquity

We have explored the five terms, and artificial neural net (ANN) was the only associated term that was both present and prominent in all five.
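One way to check this kind of observation programmatically is to intersect the neighbour sets of the five terms. The Python sketch below does that on invented data that was deliberately constructed to mirror the finding, so it demonstrates the technique and does not reproduce the result. The helper name `shared_neighbours` is hypothetical.

```python
# Made-up neighbour sets, built by hand to mirror the observation above.
# They are NOT derived from the real clickstream.
NEIGHBOURS = {
    "Data_mining": {"Artificial_neural_network", "Cluster_analysis", "Statistics"},
    "Data_science": {"Artificial_neural_network", "Statistics", "Machine_learning"},
    "Machine_learning": {"Artificial_neural_network", "Deep_learning", "Statistics"},
    "Artificial_intelligence": {"Artificial_neural_network", "Turing_test", "Deep_learning"},
    "Artificial_general_intelligence": {"Artificial_neural_network", "AI_takeover"},
}


def shared_neighbours(neighbours_by_term):
    """Return the articles that appear in every term's neighbourhood."""
    sets = list(neighbours_by_term.values())
    return set.intersection(*sets) if sets else set()


common = shared_neighbours(NEIGHBOURS)
print(common)
```

Note that "prominent" is a stronger condition than "present": a fuller check would also compare traffic or neighbour counts within each graph.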

It’s hard to accept that this is a coincidence. ANNs combined with modern computational advances are highly associated with recent progress in the field of AI. In some ways they helped an AI comeback after many years of AI winter.

Current state-of-the-art deep learning models are based on ANNs. TensorFlow, Google’s most famous open source project, is one of the most popular libraries for ANN applications in terms of GitHub interactions and Stack Overflow traffic. Taking everything into consideration, it seems likely that ANNs are here to stay and that we are going to hear more and more about them (my prediction).

Code

The code to generate the graphs, written in R, is available in this GitHub repo, where some of the parameter and filter choices (e.g. number of edges filter, order, degree) are exposed.

You can use the easy start script to quickly experiment with how other concepts of your choice relate to each other.

Motivation

This article was motivated by a combination of pieces of content, to which I was exposed in recent times and which I recommend strongly.

Mikhail Popov’s blog post on creating the net neutrality diagram
Mick Cooney’s training materials on network analysis with the Dublin data science meetup group
David Robinson’s blogpost and WinVector’s article explaining the difference between AI, ML, Data science (and company) from a practical point of view

I’d also like to stress how helpful open source libraries like ggplot2 and ggraph are when working with network data visualisation.

Final comments

As you are now judging the content you have just read, and maybe already having some objections, remember: the goal wasn’t to define any term but rather to look at them through the perspective of Wikipedia editors’ and users’ behaviour. From my end it was a bit like working on a jigsaw puzzle.

Of course there is an inherent bias, in that articles reflect the biases of their authors. On the positive side, thanks to Wikipedia’s editorial model and control processes, we can count on the wisdom of the crowd to mitigate strong biases and extreme or misinformed views. As an example, the Wikipedia article for AI has editors and watchers numbering in the thousands.

To conclude, an important benefit of performing network analysis with the Wikipedia clickstream is that it can shed light on what both users and contributors of the world’s largest encyclopedia consider important, based on either how they write or how they click.

What do you think: are there any other terms worth exploring with the Wikipedia clickstream? Do you see any other interesting uses for this dataset? Data science practitioners: is there a way to automate the grouping of the nodes in the graphs using, for example, NLP?

Alex Papageorgiou

About me

I am an independent consultant in marketing analytics and data science, helping conversion-driven digital businesses to make informed marketing decisions. I share my stories about digital, marketing and data analytics (often combined) on my blog and via Twitter and LinkedIn.