The world’s leading publication for data science, AI, and ML professionals.

The Secret Network of Owls

A data-based tribute to the International Owl Awareness Day

Did you know that the 4th of August is International Owl Awareness Day? Me neither, until idle browsing of cute owl memes led me to this website. Then, as we recently found a lovely owl family in our garden, I thought I would check them out on Wikipedia – and was stunned to see that there are 254 owl species recorded on Wiki, our favorite free open-access knowledge source. The only logical next step was to turn this into a data visualization to better understand the international owl landscape.

Namely, I will automatically download the full list of owl species and then their Wikipedia profiles. Then, I will use text matching and the NetworkX graph analytics library to extract the similarity network of the owl species, which I will then visualize. This way, we will have a visual representation of the owl species, which makes it much easier to interpret the relationship between different species.

Additionally, while the topic of owls is timely here, the methods and steps are easily adaptable to any other topic we would like to cover and turn into a knowledge graph, relying on the publicly available Wikipedia database.

All images were created by the author.

1. Collect the list of owl species

The first step was to quickly collect the names of all listed owl species, which were neatly uploaded into this table on Wikipedia:

Source: https://en.wikipedia.org/wiki/List_of_owl_species

To get the content of this table, I first downloaded the HTML source code using the urllib request module, and then quickly extracted all relevant information using Beautiful Soup as follows. As the output of the code cell shows, I was able to get the names of all 254 owl species.

# import the necessary libraries
from urllib.request import urlopen
import bs4 as bs

# take the url from wikipedia
url = 'https://en.wikipedia.org/wiki/List_of_owl_species'

# download and parse the html source
sauce = urlopen(url).read()
soup  = bs.BeautifulSoup(sauce,'lxml')

# extract the owl names from the relevant tags within the table
# note: the 'td' tags will also contain the binomial names, which I
# managed to filter out by simply dropping all elements that contain any
# year-looking string (since no owl species in the list were described
# in the 2000s, filtering for '1' is enough)
tags = soup.find_all('td')
tags = [t for t in tags if 'href' in str(t)]
owls = []

for t in tags:
    if '1' not in t.text:
        owls.append(t.text)

len(owls)

This cell should output 254 for the length of the list containing all owl names. Note: here, I only extracted their common names since, as you will see in the next section, that’s just enough to download their Wikipedia profile.

2. Download the Wiki profile of each owl species

Now that we have the complete list of owl species, each described by their common names, let’s use the Wikipedia API to download their profile pages and save their textual contents for later use.

Let’s get set up and do an example query:

# importing the api
import wikipediaapi

# setting up the api connection
wiki_wiki = wikipediaapi.Wikipedia('owl-miniproject')

# preparing a folder to save our collected data
import os
folderout = 'wiki_page_content' 

if not os.path.exists(folderout):  
    os.makedirs(folderout)

# a test sample
page = wiki_wiki.page('Greater sooty owl')
page.text[0:1000]

The output is the first 1000 characters of the profile page of the Greater sooty owl:

For comparison, you may also find this on Wikipedia:

Source: https://en.wikipedia.org/wiki/Greater_sooty_owl

Now, let’s download all profiles:

for idx, owl in enumerate(owls):

    # print progress every 10 owls
    if idx % 10 == 0:
        print(idx)

    # skip owls whose profile has already been downloaded
    if not os.path.exists(folderout + '/' + owl):
        page = wiki_wiki.page(owl)
        with open(folderout + '/' + owl, 'w') as fout:
            fout.write(page.text)

This will result in 254 textual files, each containing the full textual profiles of the corresponding owl species. Now, in the last section, let’s turn this into a graph.
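As a quick sanity check (a small sketch, not part of the original pipeline; the folder name matches the one used above), one can count how many profile files actually landed on disk before moving on:

```python
import os

folderout = 'wiki_page_content'
os.makedirs(folderout, exist_ok=True)  # no-op if the download step already ran

# count the saved profile files, ignoring hidden files such as .DS_Store
saved = [f for f in os.listdir(folderout) if not f.startswith('.')]
print(len(saved), 'profiles on disk')
```

If the count falls short of 254, re-running the download loop above is safe, since it skips files that already exist.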

3. Build the owl network

Here, first, we will parse the previously downloaded textual data containing the owl profiles.

Then, in two for loops, we compare each pair of owl species and compute their similarity, the connection strength in our graph, as the total number of cross-references between the profiles of the two species. This is measured by the number of times the first species is mentioned in the second one's profile, and the other way around. To build the graph, we use NetworkX.

# let's store the textual profile of each owl in the following dictionary
names_texts = {}
for name in [f for f in os.listdir(folderout) if '.DS' not in f]:
    with open(folderout + '/' + name) as myfile:
        names_texts[name] = myfile.read().split('External links')[0]

# import networkx and create an empty graph object
import networkx as nx
G = nx.Graph()

# now build the graph by pair-wise comparing each owl profile
for name1, text1 in names_texts.items():
    for name2, text2 in names_texts.items():

        if name1 != name2:

            weight = text1.count(name2) + text2.count(name1)
            if weight > 0:
                G.add_edge(name1, name2, weight = weight)

# finally, let's show the size of the graph we built
print(G.number_of_nodes(), G.number_of_edges())

Next, I used Matplotlib and NetworkX to quickly visualize this graph:

import matplotlib.pyplot as plt

# Define the node size proportional to the node degree
node_sizes = [100 * nx.degree(G, node) for node in G.nodes()]

# Define the edge width proportional to the edge weight
edge_widths = [G[u][v]['weight'] for u, v in G.edges()]

# Get positions for the nodes using a force layout
pos = nx.spring_layout(G, k = 0.6)

# Draw the nodes with labels
f, ax = plt.subplots(1,1,figsize=(10,10))

nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color='lightblue', ax = ax)
nx.draw_networkx_labels(G, pos, font_size = 5, ax = ax)

# Draw the edges with widths proportional to the weights
nx.draw_networkx_edges(G, pos, width=edge_widths, ax = ax)

ax.axis('off')

However, as quick as graphs are to draw in Python (with just a few lines of code), this one is quite a mess, so I gave it another shot and visualized it in Gephi (a tutorial on how to do that is coming soon). The result looks like this:
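To hand the graph over to Gephi, one option (a minimal sketch; the output filename is my choice, and the toy graph stands in for the real one) is to export it in GEXF format, which Gephi opens natively:

```python
import networkx as nx

# small stand-in graph for the owl network built above
G = nx.Graph()
G.add_edge('Eurasian eagle-owl', 'Snowy owl', weight=2)
G.add_edge('Eurasian eagle-owl', 'Tawny owl', weight=1)

# GEXF preserves node names and edge weights, so Gephi can size nodes
# and scale edges without any extra preprocessing
nx.write_gexf(G, 'owl_network.gexf')
```

The same `nx.write_gexf(G, ...)` call works unchanged on the full owl graph.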

The figure we got is actually quite interesting and revealing in terms of how these great owl species form larger groups – even if we are not ornithologists. The largest one is centered on the Eurasian eagle owl and mostly contains species living in Europe and Asia. We may see another larger cluster with various scops owls in it – it turns out scops owls are small to medium-sized owls. We also see a separate graph component with various boobooks, owls native to Australia and New Zealand, while the network cluster on the bottom right mostly contains pygmy owls, typically found in forests, from tropical rainforests to dry forests and mountain regions.
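The groups Gephi surfaces visually can also be recovered programmatically, for instance with NetworkX's greedy modularity community detection (a sketch on a toy graph standing in for the owl network; on the real `G` the call is the same):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# toy stand-in for the owl graph: two weighted triangles joined by one weak edge
G = nx.Graph()
G.add_edge('A', 'B', weight=3)
G.add_edge('B', 'C', weight=2)
G.add_edge('A', 'C', weight=2)
G.add_edge('X', 'Y', weight=3)
G.add_edge('Y', 'Z', weight=2)
G.add_edge('X', 'Z', weight=2)
G.add_edge('C', 'X', weight=1)

# partition the graph into communities, respecting the edge weights
communities = greedy_modularity_communities(G, weight='weight')
for c in communities:
    print(sorted(c))
```

On this toy graph, the two triangles come out as separate communities; on the owl network, the detected communities should roughly match the clusters visible in the Gephi layout.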

By conducting this quick knowledge graph analysis, I certainly did not become an owl expert. However, I learned a great deal more about owls than I knew before.

This also highlights the power of network analytics and knowledge graphs for quickly exploring literally any domain, from science to art, from business to HR, by combining data processing and visualization, while also opening the space for more advanced analytical steps and domain-expert collaborations.

