Using Network Science to explore hashtag culture on Instagram

A practical walk-through of how to scrape hashtags from Instagram and model their relationships with one another using NetworkX

Sam Ho
Towards Data Science


Abstract

This project proposes that Network Science and graph theory can be used to effectively analyse hashtag culture on Instagram. The relational characteristics of Instagram hashtags can be realised through these techniques and this allows for deeper understanding of conversations and themes found across hashtags.

Approach

To demonstrate this idea I have written a practical and technical walkthrough of how to do this using Python and NetworkX.

Around 37,000 Instagram posts all containing at least one mention of #happiness were scraped from Instagram. All unique hashtags in the corpus (other than #happiness) represented nodes in the graph. Edges were formed between hashtags if they were mentioned together in the same post.

This yielded an undirected ‘Happiness’ graph which was then analysed using community detection and various centrality measures.

Findings

Community detection identified three clusters, interpreted as:

  • Aspects of happiness that seem to be about what people do and experience (e.g. #photography, #summer, #travel, #family)
  • Aspects of happiness that broadly seem to be about how people think and feel (e.g. #life, #motivation, #inspiration, #quotes)
  • A third very distinct cluster was entirely about #weddings, #celebrations and #parties

Conclusion & Applications

This approach has shown that graph theory can be a very intuitive and useful way to explore social media metadata such as hashtags. Its application could be in two immediate areas:

1. We might want to graph thematic entities, e.g. people posting about sustainability, Brexit or Coronavirus

2. We could use graphs to model and understand the conversation around specific brands, events and places

(We need to bear in mind the nature of content on Instagram as this will guide what could be a suitable application)

All code for the functionality to replicate this task can be found on my GitHub: https://github.com/kitsamho/Instagram_Scraper_Graph

Contents

Photo by Joanna Kosinska on Unsplash

1) Background & Context

  • Instagram is huge
  • Hashtag culture: making life easier for a Data Scientist!
  • A picture is worth a thousand words ….or a handful of hashtags?
  • What does ‘happiness’ mean to people on Instagram?

2) Network Science & Graph Theory

  • What is a Graph?
  • What aspects of Graph Theory can we use in our analysis?
  • Further Reading on Graphs

3) Building Functionality in Python

  • Search for hashtags on Instagram using Selenium WebDriver
  • Capture unique Instagram URLs
  • Parse hashtags and other content from html using multi-thread processing
  • Feature Generation & Exploratory Analysis
  • Data Selection
  • Graph Building
  • Visualisation / Analysis & Interpretation

1) Background & Context

Photo by Noiseporn on Unsplash

Instagram is huge

Instagram began life in 2010 as a niche place for creative expression. Early adopters flocked to the app keen to share their pictures of urban graffiti, soy lattes and bowls of ramen. iPhone + Instagram was a stamp of coolness, a golden era for the digitally hip.

Alas, this didn’t last. Instagram’s gentrification was inevitable and when Facebook bought Instagram in 2012, they swiftly monetised it through advertising and turned Instagram into a cash cow of gigantic proportions.

In 2020 Instagram is no longer cool, it’s a commercial behemoth.

Source: https://www.statista.com/statistics/253577/number-of-monthly-active-instagram-users/

With an estimated 1 billion active monthly users and 500 million daily stories, the rise and reach of Instagram is nothing short of impressive.

Consequently, Instagram has in many ways become a reliable barometer of mainstream consumer culture and this means that it lends itself as a potentially useful tool to find stuff out about people.

Whilst it would be reasonable to assume that Instagram’s images and videos would be the primary ‘go-to’ sources of data, I’ve found that analysing Instagram’s hashtag culture is a more fruitful way to get under the skin of themes and topics that occur on Instagram. Coupling this with analytical techniques borrowed from Network Science means that the relationships between themes and topics can be more easily understood.

Hashtag culture: making life easier for a Data Scientist!

Photo by Brina Blum on Unsplash

One of the more important parts of a Data Scientist’s job is being able to commit the time to ensure that our data is fit for purpose and this involves thorough cleaning of our data. This can take up a lot of time, especially with social media data that tends to be extremely unstructured and frequently littered with ‘noise’.

Without sufficient pre-processing, the natural language processing techniques we use with social media data can lead to outputs that are hard to interpret and light on insight.

There are various methods we can use to clean and organise the raw language we find on social media but it’s made infinitely easier when users do it for us. Hashtag culture on Instagram — which by its very nature is organised, normalised and less noisy — is a cultural phenomenon that does all this hard work for us.

A picture is worth a thousand words ….or a handful of hashtags?

Photo by George Pagan III on Unsplash

“A hashtag — introduced by the octothorpe symbol (#) — is a type of metadata tag used on social networks such as Twitter and other microblogging services. It lets users apply dynamic, user-generated tagging that helps other users easily find messages with a specific theme or content”

People often end an Instagram post with a series of hashtags. Hashtags can provide a succinct summary of what that person was thinking, feeling and observing in that moment. Hashtags provide context and meaning that the image alone is unable to.

Let’s find an example to illustrate this.

What does ‘happiness’ mean to people on Instagram?

Photo by Tim Mossholder on Unsplash

Social media is increasingly likely to get bashed as being something that makes people miserable… and sadly there is truth in this.

Let’s be contrarian, flip this truth and explore HAPPINESS on Instagram.

The following Instagram post contains #happiness in the body of the post. What can we interpret from this post by the image alone?

https://www.instagram.com/thatdesivegangal/ — public post containing #happiness

Not much. Maybe something to do with music?

Other than the obvious there is little to go on from the image alone. What else were they thinking/feeling when they posted this?

Now when we look at all the hashtags this person included alongside their post, we learn a lot more about this person and what they associated with #happiness at that moment in time.

https://www.instagram.com/thatdesivegangal/ — public post containing #happiness

For this person, happiness has an association with positivity, motivation, change and weight loss.

Whilst Instagram images can tell us something, the hashtags people use tell us considerably more.

Scaling up the hashtag analysis

Now, this is just one Instagram post containing #happiness. What happens when we look at the surrounding hashtags for other people’s posts mentioning #happiness?

What happens if we scale this analysis up and look at thousands upon thousands of posts containing #happiness?

Instagram posts all containing #happiness

Pretty soon we would be in a position where we have a vast number of hashtags to navigate. What scalable approach will allow us to make sense of all this data?

This is where we can use Network Science and Graph Theory to help us.

2) Network Science & Graph Theory

Photo by Anastasia Dulgier on Unsplash

Network science is a thriving and increasingly important cross-disciplinary domain that focuses on the representation, analysis and modelling of complex social, biological and technological systems as networks or graphs

Source: http://tuvalu.santafe.edu/~aaronc/courses/5352/

I like to think of Network Science and graph theory as methods that allow us to understand how things are connected. We can borrow some basic principles from Network Science and graph theory to understand how hashtags on Instagram are connected.

The subject matter relating to Network Science and graph theory is incredibly dense and would take a very long time to cover. This would distract from the practical focus of the project. There are however some simple but fundamental concepts that need explaining.

What is a Graph?

A graph (G) is the abstract representation of a network. Graphs are made up of vertices (V) and edges (E) where 𝐺=(𝑉,𝐸)

Source: https://www.computerhope.com/jargon/n/node.htm

  • Nodes (V) in a graph represent the unique data points that exist in our data. In the case of hashtag analysis, it makes sense for hashtags to represent the nodes in our network. So if our #happiness dataset only had two hashtags, #A and #B, we would have two nodes.

Two nodes representing two hashtags

  • Edges (E) are connections that represent some kind of relationship between nodes. In the case of analysing hashtags, it makes sense to represent this relationship as co-occurrence, i.e. if the hashtag #A was mentioned in the same post as #B, we would assume that there is a relationship between #A and #B and therefore we would create an edge between these two nodes.

An edge is formed between two hashtags if they are mentioned together in the same post

As we add hashtags from other posts to the graph and model their relationships with all previous posts we begin to see structure form.

As more hashtags are added to the graph and we see where there are connections, we can begin to see structure form

What aspects of Graph Theory can we use in our analysis?

  • Community Detection. We can use algorithms to identify and label clusters of topics/themes that are associated with #happiness. In this example we have 14 hashtags, all connected in various ways but with distinct clusters forming.

Identifying communities of hashtags in the graph

  • Degree Centrality / Betweenness Centrality. We can calculate which hashtags in the network are particularly important in linking the whole network. Much like Heathrow Airport links up a lot of the world, which hashtags link up the #happiness ‘landscape’?

Identifying which hashtags have a central role in the graph

  • Visualisation. Plotting the network as a scatter plot is a compelling way to convey a huge amount of information about #happiness that would otherwise be cumbersome to present.
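
To make the centrality point concrete, here is a minimal sketch on a made-up toy graph (the hashtags and edges are invented for illustration and are not taken from the #happiness data). It uses NetworkX's `degree_centrality` and `betweenness_centrality` to contrast a hashtag that has many connections with one that bridges two clusters.

```python
import networkx as nx

# Toy hashtag graph (invented data): two tight triangles joined by one bridge.
G = nx.Graph()
G.add_edges_from([
    # cluster 1: things people do
    ("travel", "summer"), ("travel", "beach"), ("summer", "beach"),
    # cluster 2: things people think and feel
    ("quotes", "motivation"), ("quotes", "mindset"), ("motivation", "mindset"),
    # 'smile' is the only link between the two clusters
    ("beach", "smile"), ("smile", "mindset"),
])

# Degree centrality: share of the other nodes each hashtag is connected to.
degree = nx.degree_centrality(G)

# Betweenness centrality: share of shortest paths that pass through a hashtag.
betweenness = nx.betweenness_centrality(G)

for tag in sorted(G.nodes, key=betweenness.get, reverse=True):
    print(f"{tag:<12} degree={degree[tag]:.2f} betweenness={betweenness[tag]:.2f}")
```

In this toy graph 'smile' has only two edges, so its degree centrality is lower than that of 'beach', yet it has the highest betweenness because every path between the two clusters runs through it. That is the "Heathrow" role described above.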

Further Reading on Graphs

Maël Fabien has written a whole series which covers the content around graph theory really well. I strongly recommend you have a read of this at some point if you find this topic interesting.

Let’s now look at the practical steps we need to take to build our network.

3) Building Functionality in Python

Photo by Kevin Ku on Unsplash

I have built two Python classes that handle all the processes needed to go from capturing the data all the way through to building, visualising and analysing the #Happiness graph. The key pipelines within each class are outlined below.

class InstagramScraper()
  • Search for hashtags on Instagram using Selenium WebDriver
  • Capture unique Instagram URLs
  • Parse hashtags and other content from post html using multi-thread processing
class InstagramGraph()
  • Feature generation & exploratory analysis
  • Data selection
  • Graph building
  • Visualisation

You can access the full code for these classes and their relevant readme documentation here:

https://github.com/kitsamho/Instagram_Scraper_Graph

class InstagramScraper()

Source: https://hackernoon.com/how-to-scrape-a-website-without-getting-blacklisted-271a605a0d94

This stage uses a combination of automated web browsing and html parsing to scrape content from Instagram posts. I have built a custom Python class InstagramScraper that contains a pipeline of scraping and data extraction methods that returns a Pandas DataFrame of scraped Instagram data.

I will walk you through the main processes below with a few example code blocks to highlight the main steps involved.

Search for hashtags on Instagram using Selenium WebDriver

There is functionality on Instagram where a user can search for specific hashtags and Instagram returns all the posts that contain that hashtag. This is our first step in getting the data we need: get Instagram to do the work for us using Selenium WebDriver.

Search for Instagram posts containing #Happiness hashtags on Instagram

Capture unique Instagram URLs

Use Selenium WebDriver to capture unique Instagram URLs

Every post on Instagram has its own unique URL. After we have searched for all the posts containing a specific hashtag, the next step is to capture all of these unique URLs for posts that contain that hashtag.

I found the best way to do this is through using a combination of Selenium WebDriver and parsing the html on the fly, extracting each post’s href attribute to a data structure as we scroll through all the content.

As this part of the Instagram site is highly dynamic requiring consistent scrolling to load new images, using some kind of browser based automation such as Selenium WebDriver is the only feasible way of capturing the data. This means it can take a little while to get all the links.

Capture unique Instagram URLs: Example Code

Use Selenium WebDriver & Beautiful Soup to capture unique Instagram URLs. Source: https://github.com/kitsamho/Instagram_Scraper_Graph
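
The repository code is not reproduced here, so the following is only a rough sketch of the link-capturing idea. It assumes each scroll of the browser yields a snapshot of the page html (for example from Selenium's `driver.page_source`), and that post links have hrefs beginning with `/p/`. The standard library's `html.parser` stands in for Beautiful Soup, and the names `PostLinkParser` and `collect_post_links` are hypothetical.

```python
from html.parser import HTMLParser


class PostLinkParser(HTMLParser):
    """Collect href values from <a> tags that look like post links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: post links have the form /p/<shortcode>/
        if href and href.startswith("/p/"):
            self.links.append(href)


def collect_post_links(page_snapshots):
    """Return unique post links, in first-seen order, across html snapshots."""
    unique = {}  # a dict keeps insertion order and drops duplicates
    for html in page_snapshots:
        parser = PostLinkParser()
        parser.feed(html)
        for href in parser.links:
            unique.setdefault(href, None)
    return list(unique)


# Stand-in snapshots: in practice each would come from the browser after a scroll.
snapshots = [
    '<a href="/p/AAA111/">post</a><a href="/explore/">explore</a>',
    '<a href="/p/AAA111/">post</a><a href="/p/BBB222/">post</a>',
]
links = collect_post_links(snapshots)
print(links)
```

Because the same post can appear in several snapshots as the page scrolls, deduplicating while preserving order keeps the list of URLs clean without losing the order in which posts were loaded.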

Parse hashtags and other content

Once you have a list of all the links, you can then use a combination of urlopen and html parsing (I use Beautiful Soup) to get the data for each Instagram post.

Once you have the html for each post and parse its content, you will find that data for Instagram posts has a consistent structure resembling a dictionary / JSON style format.

Parsing hashtags and other content — Example Code

Each Instagram post contains data in a dictionary / JSON type structure. We use individual functions to iterate over each post extracting the data we want. Source: https://github.com/kitsamho/Instagram_Scraper_Graph
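
As an illustration of the "individual functions over a JSON-like structure" idea, here is a small sketch. The html snippet, the `post-data` script id and all field names are invented stand-ins, not Instagram's real markup, and the helper names are hypothetical.

```python
import json
import re

# Invented stand-in for the html of one post; real Instagram markup differs.
SAMPLE_HTML = """
<html><body>
<script type="application/json" id="post-data">
{"owner": {"username": "example_user", "is_verified": false},
 "caption": "Sunday walks #happiness #family",
 "likes": 12}
</script>
</body></html>
"""


def extract_post_json(html):
    """Pull the embedded JSON blob out of the page and decode it."""
    match = re.search(
        r'<script[^>]*id="post-data"[^>]*>(.*?)</script>', html, re.DOTALL
    )
    if match is None:
        return None
    return json.loads(match.group(1))


# One small function per field keeps each extraction easy to test and replace.
def get_username(post):
    return post.get("owner", {}).get("username")


def get_verified(post):
    return post.get("owner", {}).get("is_verified", False)


def get_caption(post):
    return post.get("caption", "")


post = extract_post_json(SAMPLE_HTML)
record = {
    "username": get_username(post),
    "verified": get_verified(post),
    "caption": get_caption(post),
}
print(record)
```

A list of such records, one per post, is the kind of structure that can then be loaded into a Pandas DataFrame.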

InstagramScraper Output

The main output of the InstagramScraper class is a Pandas DataFrame containing a host of variables by post.

Pandas DataFrame output from InstagramScraper class

Most of these variables are self explanatory and offer all kinds of options for exploratory analysis beyond hashtags.

Now that we have our scraped data and have a nice and tidy Pandas DataFrame we can begin to think about processing our data in preparation for building our network and visualising our graph, which is where things get really fun.

class InstagramGraph()

Source: https://techcrunch.com/wp-content/uploads/2017/12/instagram-hashtags.png?w=730&crop=1

I have built a second Python class called InstagramGraph which also contains a pipeline of methods that allow you to analyse an Instagram dataset and ultimately model the hashtag data as an instance of a NetworkX graph object, which is then visualised using Plotly.

The key processes within this class are:

  • Feature Generation & Exploratory Analysis
  • Data Selection
  • Graph Building
  • Visualisation / Analysis & Interpretation

Feature Creation & Exploratory Analysis

Photo by Vadim Sherbakov on Unsplash

The code block below outlines a pipeline from the InstagramGraph class which contains a suite of methods that create new features in our data.

A series of methods generating new features in our scraped Instagram data. Source: https://github.com/kitsamho/Instagram_Scraper_Graph

There are several methods here, some are linguistic (e.g. determining the language of a post) although most of the processes within this method generate extra descriptive metrics such as median post count by user, total hashtag count by post and median hashtag use by user.

Whilst some of these new features/metrics aren’t directly related to graph building it’s always worth generating more data points where you can as it offers a chance to do some exploratory analysis of the subject matter which might provide more context about what our graph model reveals. It also might inspire some other ideas for analysis that you hadn’t initially considered.

Let’s explore some of these.

Language

The langdetect library in Python, a port of Google’s language-detection library, gives us the ability to identify the language of our text.

Note: language detection can run quite slowly when processing a large volume of data. For the purposes of example, this is fine; if you want to productionise this you may want to think about a faster alternative.

The distribution of languages among posts containing #happiness

As we can see the majority of the Instagram posts for #happiness are in English — over 80% of our data set.

User Post Frequency

A look at some distributions shows that the data on user post frequency is extremely positively skewed; there are a handful of outlier accounts that post very frequently (up to 140 times) whereas the majority of posters are posting only once.

The frequency of posting among users in our data — normalised to 1

Extracting Hashtags

To extract our hashtags I have created two methods that do the job for us. One method takes a string input and returns a list of hashtags. If people have failed to include a space between their string of hashtags, the code can account for this and ‘unpack’ them into individual hashtags.

Extracting hashtags from a string. Source: https://github.com/kitsamho/Instagram_Scraper_Graph
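
A minimal sketch of this kind of extraction, using a regular expression rather than the repository's own method (the function name and the choice to lower-case the tags are assumptions made for illustration):

```python
import re

# '#' followed by one or more word characters; matching stops at the next '#',
# so hashtags written without spaces between them are still split apart.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the hashtags in a caption, lower-cased and without the '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]


print(extract_hashtags("Best day ever #Happiness#love #family!"))
# ['happiness', 'love', 'family']
```

Lower-casing means #Happiness and #happiness end up as the same node later on, which is usually what we want when counting co-occurrences.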

Analysis

Analysis of hashtag frequency by post indicates that whilst frequency by post tends to follow a uniform distribution, there are some outlier posts that contain an unusually high number of hashtags. Further exploration shows that this partially seems to be driven by people who have posted multiple times across the dataset (high frequency posters). So broadly speaking, those that post more often also tend to use more hashtags in their posts.

Whilst the distribution of hashtag use is mainly uniform, high frequency posters use more hashtags in their posts

Lemmatising Hashtags

A second method then takes this list of hashtags and looks to lemmatise the input, where appropriate. For those unfamiliar with lemmatisation, this is the process of reducing a word to its dictionary form (its lemma). For instance if we have the hashtags #happiest and #happier, these are both inflections of the adjective #happy. Therefore lemmatisation reduces these inflections back to #happy. I’ve used an instance of a spaCy model to perform the lemmatisation.

Lemmatising Hashtags — Example Code

Lemmatising hashtags using spaCy. Source: https://github.com/kitsamho/Instagram_Scraper_Graph

Data Selection

Photo by Victoriano Izquierdo on Unsplash

Ok, so the next step is to select the data we want to model. The code outlines a pipeline from the InstagramGraph class which contains a suite of methods that allow us to select the data we want.

Select subsets of the data before we build the graph. Source: https://github.com/kitsamho/Instagram_Scraper_Graph

Select Language

Whilst Instagram is an international platform, for simplicity we might want to consider filtering out non-English posts. As we saw previously, over 80% of the data for #happiness is in English so it makes sense to focus on this subset of the data.

Removing verified users

Verified users tend to be either brands, celebrities or online shops i.e. ‘people’ who likely don’t represent the average Instagram user. Depending on your objectives for this analysis, you might want to consider removing verified users from your data.

Removing high frequency posters

A cursory inspection of high frequency posters reveals that these are sometimes accounts that are using Instagram to sell stuff under a non-verified status. Do we feel this represents the audience we are interested in understanding? If so we can leave them in; if not, we can filter them out prior to building our graph.

Use lemmatised hashtags?

I thought it made sense to have the option to lemmatise the scraped hashtags where possible as it reduces the amount of data needed to build the graph — we don’t necessarily need all that extra data.
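
The selection steps above can be sketched as a single filter. This is an illustration in plain Python rather than the Pandas pipeline in the repository; the field names (`user`, `language`, `verified`) and the `select_posts` function are hypothetical.

```python
from collections import Counter


def select_posts(posts, language="en", drop_verified=True, max_posts_per_user=None):
    """Keep posts matching the language, verification and frequency rules."""
    posts_per_user = Counter(post["user"] for post in posts)
    selected = []
    for post in posts:
        if language is not None and post["language"] != language:
            continue  # wrong language
        if drop_verified and post["verified"]:
            continue  # brands, celebrities, shops
        if max_posts_per_user is not None and posts_per_user[post["user"]] > max_posts_per_user:
            continue  # high frequency poster
        selected.append(post)
    return selected


# Invented example data
posts = [
    {"user": "a", "language": "en", "verified": False},
    {"user": "b", "language": "es", "verified": False},
    {"user": "c", "language": "en", "verified": True},
    {"user": "d", "language": "en", "verified": False},
    {"user": "d", "language": "en", "verified": False},
    {"user": "d", "language": "en", "verified": False},
]
selected = select_posts(posts, max_posts_per_user=2)
print([post["user"] for post in selected])
```

Keeping each rule behind its own argument makes it easy to rebuild the graph with and without a given filter and compare the results.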

Graph Building

Photo by Randy Fath on Unsplash

Now that we have selected the data we want to model, we can build the graph. The pipeline we use for this is below:

Method in class InstagramGraph containing a pipeline of processes that allow for the creation of a graph object containing edge and node data from a hashtag dataset. Source: https://github.com/kitsamho/Instagram_Scraper_Graph

There are three main processes that take place in this pipeline:

Compile a list of lists

This simply extracts hashtags from our DataFrame into a list of lists, each sub-list representing a post from our #Happiness data set. By having the input as a simple list of lists, we can easily re-purpose this graph building approach to other sources of data in future, wherever we can wrangle the raw data into a similarly structured list of lists.

Compile a list of lists — Example Code

Compiles a list of n lists where each list contains the hashtags from an individual post and n is the total number of Instagram posts. Default stop words are removed with the option of adding more. Source: https://github.com/kitsamho/Instagram_Scraper_Graph

Compile a list of lists — Example Output

Extract a list of lists from a series of hashtags
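
A minimal sketch of this step, assuming the hashtags for each post are already available as a list. The default stop word set below is an assumption: it only drops the searched hashtag itself, and the function name is hypothetical.

```python
# Assumption: the searched hashtag appears in every post, so it is dropped by default.
DEFAULT_STOP_TAGS = {"happiness"}


def compile_hashtag_lists(hashtags_by_post, extra_stop_tags=()):
    """Return one list of hashtags per post, with stop tags removed."""
    stop_tags = DEFAULT_STOP_TAGS | set(extra_stop_tags)
    return [
        [tag for tag in tags if tag not in stop_tags]
        for tags in hashtags_by_post
    ]


hashtags_by_post = [
    ["happiness", "love"],
    ["happiness"],
    ["travel", "happiness", "instagood"],
]
hashtag_lists = compile_hashtag_lists(hashtags_by_post, extra_stop_tags={"instagood"})
print(hashtag_lists)
# [['love'], [], ['travel']]
```

Posts left with no hashtags are kept as empty lists here so that the output still has one entry per post; they simply contribute no nodes or edges later.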

Identify edges and nodes

The next step is to take our list of lists and use this to generate the nodes and edges that exist across the entire data set. Remember the nodes are the unique hashtags in the dataset and an edge is created if any two hashtags are mentioned together in the same post.

Identify edges and nodes — Example Code

Calculate all the nodes and edges that exist in the hashtag data — several DataFrames are retained as attributes. Source: https://github.com/kitsamho/Instagram_Scraper_Graph
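
The counting behind this step can be sketched with the standard library alone. This is not the repository's implementation (which keeps its results in DataFrames); the function name is hypothetical and the data is invented.

```python
from collections import Counter
from itertools import combinations


def count_nodes_and_edges(hashtag_lists):
    """Count hashtag occurrences (nodes) and co-occurring pairs (edges)."""
    node_counts = Counter()
    edge_counts = Counter()
    for tags in hashtag_lists:
        # Sorting the unique tags gives every pair a single canonical order,
        # so ('love', 'smile') and ('smile', 'love') count as the same edge.
        unique_tags = sorted(set(tags))
        node_counts.update(unique_tags)
        edge_counts.update(combinations(unique_tags, 2))
    return node_counts, edge_counts


hashtag_lists = [["love", "family", "smile"], ["love", "smile"], ["travel"]]
node_counts, edge_counts = count_nodes_and_edges(hashtag_lists)
print(node_counts)
print(edge_counts)
```

Here the pair ('love', 'smile') is counted twice because it appears in two posts, while 'travel' becomes a node with no edges because it never appears alongside another hashtag.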

Identify edges and nodes — Example Output

The method saves two generated DataFrames as attributes — a DataFrame containing all the nodes and a DataFrame containing all the edges.

Node DataFrame

The node DataFrame — each row represents a unique hashtag that exists in #Happiness dataset

Edge DataFrame

The edge DataFrame — each row represents a unique edge that exists in the #Happiness dataset. Edge frequency is a count of how many times that pairing exists in the #Happiness dataset

Build the graph

The next stage is to add these nodes and edges into an instance of a NetworkX graph object (G).

Build the graph — Example Code

Edge and Node data is taken and added to an instance of a NetworkX graph object. Source: https://github.com/kitsamho/Instagram_Scraper_Graph
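
As a hedged sketch of this step, the snippet below adds invented node and edge counts to an undirected NetworkX graph. The attribute names `frequency` and `weight` are choices made for this example, not necessarily those used in the repository.

```python
import networkx as nx

# Invented counts, in the shape produced by a nodes-and-edges counting step.
node_counts = {"love": 2, "smile": 2, "family": 1, "travel": 1}
edge_counts = {("family", "love"): 1, ("family", "smile"): 1, ("love", "smile"): 2}

G = nx.Graph()  # undirected: co-occurrence has no direction

# Each hashtag becomes a node carrying how often it was used.
for tag, frequency in node_counts.items():
    G.add_node(tag, frequency=frequency)

# Each co-occurring pair becomes an edge weighted by how often it was seen.
for (tag_a, tag_b), frequency in edge_counts.items():
    G.add_edge(tag_a, tag_b, weight=frequency)

print(G.number_of_nodes(), G.number_of_edges())
print(G["love"]["smile"]["weight"])
```

Adding the nodes explicitly, rather than relying on the edges to create them, keeps hashtags such as 'travel' in the graph even when they have no connections.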

Build the graph — Example Output

Once we have created our NetworkX graph object — we can then use various calculated graph attributes to enhance our node data even further.

  • Betweenness Centrality
  • Adjacency Frequency
  • Clustering Coefficient
  • Community

Node DataFrame Enhanced with Graph Attributes

Node DataFrame enhanced with new metrics from NetworkX
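
The sketch below shows one way such metrics could be attached to each node. The graph is a toy example, and `greedy_modularity_communities` is used here only as one of the community detection algorithms NetworkX offers; the article does not state which algorithm the repository uses.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph (invented data): two triangles joined by a single edge.
G = nx.Graph([
    ("travel", "summer"), ("travel", "beach"), ("summer", "beach"),
    ("quotes", "motivation"), ("quotes", "life"), ("motivation", "life"),
    ("beach", "life"),
])

betweenness = nx.betweenness_centrality(G)
clustering = nx.clustering(G)  # how close each node's neighbours are to a clique

# Map every hashtag to the index of the community it was assigned to.
communities = greedy_modularity_communities(G)
community_of = {
    tag: index for index, members in enumerate(communities) for tag in members
}

# One row per hashtag, ready to be turned into a DataFrame.
node_table = [
    {
        "hashtag": tag,
        "adjacency_frequency": G.degree(tag),  # number of neighbouring hashtags
        "betweenness_centrality": betweenness[tag],
        "clustering_coefficient": clustering[tag],
        "community": community_of[tag],
    }
    for tag in G.nodes
]

for row in node_table:
    print(row)
```

If Pandas is available, `pd.DataFrame(node_table)` would give a table in the same spirit as the enhanced node DataFrame described above.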

Visualisation

Photo by russn_fckr on Unsplash

Now we have an instance of a NetworkX graph object that has been built with our node and edge data that we extracted from our #Happiness dataset.

Graph Plot

We can now use Plotly to visualise the hashtag graph as a connected scatter plot.

Plotting the hashtag graph as a scatter plot in Plotly with various arguments that allow customisation of what data is shown. Source: https://github.com/kitsamho/Instagram_Scraper_Graph

There are a few arguments that I have set up that allow for a degree of customisation to the visualisation.

def plotGraph(node_size='adjacency_frequency', colorscale='Viridis')

I think it makes sense to use node size to convey some other variable. The default is ‘adjacency_frequency’, i.e. how many other nodes a node has edge connections with. In this instance a smaller node would represent a hashtag with fewer edges and conversely a larger node would have more edges. We can emphasise this point even further by applying a colour scale that is correlated with node size.

def plotGraph(layout=nx.kamada_kawai_layout)

NetworkX has a few different graph layouts. The Kamada-Kawai layout is a force-directed layout that positions nodes so that the distance between them on the plot reflects how far apart they are in the graph. This has a tendency to arrange nodes in a way that conveys clusters and keeps overlapping edges down, and most importantly it tends to be the easiest layout to interpret. There are alternatives such as the circular layout or random layout but these are harder to make sense of in my experience.

def plotGraph(light_theme=True)

There are two colour themes, light and dark.

Call the plotGraph function…

Network map visualisation of the #Happiness graph: node size is positively correlated with the number of edges a node has to other nodes. Source: https://github.com/kitsamho/Instagram_Scraper_Graph

We can see quite easily that there are several well connected hashtags in the #happiness graph (hashtags that have more node connections). Smile, life, family, motivation, party… these are all hashtags that are frequently mentioned across the network.

As we can see, the force directed layout has a tendency to push groups of well connected hashtags together. If we plot the graph again but colour nodes using their community label we can better visualise how communities fall out.

def plotGraph(community_plot=True)

Network map visualisation of the #Happiness graph: the colour of nodes represents communities as calculated using NetworkX and the node size represents adjacency frequency. Source: https://github.com/kitsamho/Instagram_Scraper_Graph

Sunburst Plot

There seem to be three easily identifiable communities in the #happiness network. We can explore which specific hashtags contribute to each of the communities using sunburst visualisations.

Interactive sunburst visualisation of communities detected in the #Happiness graph. Source: https://github.com/kitsamho/Instagram_Scraper_Graph

Community detection identified three clusters interpreted as:

  • Aspects of happiness that seem to be about what people do and experience (e.g. #photography, #summer, #travel, #family) (Segment 0)
  • Aspects of happiness that broadly seem to be about how people think and feel (e.g. #life, #motivation, #inspiration, #quotes) (Segment 1)
  • A third very distinct cluster was entirely about #weddings, #celebrations and #parties (Segment 3)

Closing Notes

Photo by Artem Beliaikin on Unsplash

I hope you enjoyed this practical walk through on how to extract hashtag data from Instagram and analyse it using principles borrowed from Network Science.

My main aims with this project were:

  1. To try and demonstrate that there is an abundance of accessible and potentially insightful content buried in social media data.
  2. To show you that there are almost always interesting patterns in places you might not have expected.

I chose #happiness as a case study because hopefully this is something that we can all relate to. However the code works for any hashtag. As long as you can gather sufficient volumes of data to model — there are observable relationships for anything. You could consider brands, places, events and even people.

You don’t even need to use Instagram; the code can easily be repurposed to use all kinds of metadata beyond Instagram… Twitter, LinkedIn and even TikTok could be a possibility.

As I mentioned before, the code is available on my GitHub so please feel free to take, use, evolve and improve. I’d love to see what people do with this.

Until next time! 🚀

Cheers
Sam
