Exploring My LinkedIn Journey Through Data Analysis

Uncovering Patterns in My Posts and Engagement – A data-related one-year journey

Hashtags network graph visualization – pic by author

Introduction

The leading professional networking platform today is LinkedIn. I began my journey there several years ago, sharing information about my work and job title. Over the past year, however, I decided to focus more intensely on creating content related to my new work experience in data & analytics. Specifically, I have been posting and sharing stories about leadership, team development, and geospatial analytics, including data visualization and graph theory.

From LinkedIn (LI), you can extract various statistics such as impressions, interactions, and daily follower growth. Additionally, there is an LI API that can be used to obtain more detailed statistics. Over the past year, I have collected data on my own LI posts, with the aim of demonstrating how data analytics can be applied to such datasets. In this article, I will share what I have learned through one year of tracking my LI activity.

In the first part, I will discuss soft factors such as audiences, measurements, data collection, tools, and standards. Then, I will provide a more detailed descriptive analysis with several data-oriented outcomes. How does a post perform over the weeks, and how can one find out how hashtags work? These will be the topics of the last two sections. If you find this interesting, please consider clapping, following, or sharing it on Medium.


Audience – Interaction – Measurement

On LI, you can measure a post’s success through metrics such as passive impressions (i.e., how many times your post has been displayed to others) and active engagement metrics like likes, comments, and shares. As an example, in the past year I shared a post about code quality and readability, which you can see in the following screenshot. The LI algorithm affects how many people will see your post, but the number of likes you receive depends on your audience. To better understand this algorithm and my audience’s preferences, I have collected my own dataset over the past year and analyzed it to identify patterns and trends. Let me now describe this dataset in more detail.

Screenshot of a post about code readability with a picture generated by midJourney.ai – pic by author.

Data Description

Starting point: I have collected panel data from my personal social media posts, which includes information such as time and date of each post, the number of followers I had at that time, the content of that post (e.g. what happened in the first hour, the main topic category, whether a picture was included), and my subjective impression of how successful the post was. This dataset is presented in a table format for analysis.

In addition to this static data, I also gathered timestamps, likes, comments, and impressions at various times throughout the year. In total, the dataset consists of 155 posts over the course of approximately one year. For the dynamic data, I collected 589 rows of information on these posts, including columns for weekday, engagement, cumulative impressions, cumulative followers, and more. Using Python’s pandas library, I imported this data and performed basic feature engineering operations to prepare it for analysis. The result is one dataframe with the static features described above and a second dataframe that tracks the evolution of these features over time. The two dataframes can be connected via the ind_num feature.

Snapshot of pandas DataFrame including all posts of the past year with static and initial information— pic by author
Snapshot of pandas DataFrame including dynamical data – pic by author

Data Exploration and Visualization

To begin exploring the data, let’s apply some simple statistical methods and data transformations, beginning with a simple cumulation:

614,692 impressions | 5,337 reactions | 939 comments

That is 3.9K impressions per post, 34 reactions per post, and 6 comments per post, and every fourth post is shared. These totals were obtained by applying the sum function to the corresponding columns of the main dataframe. Static parameters like these always raise more questions than they answer.
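As a minimal sketch of this cumulation, assuming a dataframe with one row per post: the column names `Impressions`, `Reactions`, and `Comments` and the toy numbers below are hypothetical and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for the main dataframe: one row per post.
# Column names and values are illustrative assumptions only.
df = pd.DataFrame(
    {
        "Impressions": [1200, 5400, 800, 9100],
        "Reactions": [10, 48, 7, 71],
        "Comments": [1, 9, 0, 14],
    }
)

# Column-wise totals over all posts (the "simple cumulation").
totals = df[["Impressions", "Reactions", "Comments"]].sum()

# Dividing the totals by the number of posts gives per-post averages.
per_post = totals / len(df)

print(totals.to_dict())
print(per_post.round(1).to_dict())
```

The same two lines, `sum()` followed by a division by `len(df)`, would apply unchanged to the real dataframe once the column names are adjusted.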

To get a better understanding of the data, let’s create histograms using Seaborn. The picture below shows histogram plots for impressions and reactions, with the median and mean marked in each. My LI posts seem to follow a skewed distribution. An example code snippet that creates a histogram plot along with median and mean can be found below.

Histogram plots created by seaborn showing distribution of impressions and reactions – pic by author.

There is much more to analyze. In this article, I want to dive deeper into the following three aspects of my LI posts:

  1. Time-dependency: The posts above were condensed to a single point in time – let’s investigate in detail how time influences posts.
  2. Feature inspection: Are there specific features of a post that make it more successful than others?
  3. Graph investigation: Can we use graph theory as a tool to explore how posts depend on different hashtags?

Time-dependency of posts

To further analyze the data, we take the initial 153 posts and left-join all time-stamps resulting in a data frame named df_t. We visualize impressions and reactions in the chart below. It showcases how an evolution can be visually presented by piecewise linear functions. Again, an example code snippet for this chart is provided below.

Impressions and reactions increasing over time – pic by author.
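The left join and the piecewise linear chart might look roughly like the sketch below. Only `ind_num` and `df_t` are names from the article; the other column names, the two toy tables, and the use of plain matplotlib for the lines are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Static table: one row per post (toy data, hypothetical columns).
df_posts = pd.DataFrame(
    {
        "ind_num": [1, 2],
        "PostDate": pd.to_datetime(["2023-01-02", "2023-01-09"]),
    }
)

# Dynamic table: several measurements per post (toy data).
df_dyn = pd.DataFrame(
    {
        "ind_num": [1, 1, 1, 2, 2],
        "Timestamp": pd.to_datetime(
            ["2023-01-02", "2023-01-03", "2023-01-15", "2023-01-09", "2023-01-21"]
        ),
        "Impressions": [0, 900, 1100, 0, 2500],
    }
)

# Left join: every post is kept and receives all of its timestamps.
df_t = df_posts.merge(df_dyn, on="ind_num", how="left")

# One piecewise linear curve per post: measurements joined by straight segments.
fig, ax = plt.subplots(figsize=(7, 4))
for ind_num, grp in df_t.sort_values("Timestamp").groupby("ind_num"):
    ax.plot(grp["Timestamp"], grp["Impressions"], marker="o", label=f"post {ind_num}")
ax.set_xlabel("Timestamp")
ax.set_ylabel("Cumulative impressions")
ax.legend()
```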

A visualization can sometimes serve as a conclusion to a story about data. In this case, however, it raises more questions, which is why we develop new techniques to explore the data further. Let me first explain the visualization itself. The breaks on the x-axis around Christmas 2022 and July highlight my personal breaks, during which I did not post on LI. You can easily identify outliers in this visualization, just as in the histogram above. Interestingly, these outliers are not produced by the same posts, since they appear at different timestamps. What’s more, after an initial increase, every reaction or impression curve flattens to a steady plateau by the next timestamp. This observation prompts us to dive deeper into it.

The slope of the linear function between launching a post and the first measurement is always steep. Between the first and second measurements, however, the slope is much lower. To better understand this observation, we will present the findings in a table below. Unfortunately, the median time between measurements was 12 days, a limitation of my tracking that makes it difficult to determine how long a post remains ‘active’.

First learning: The way in which posts were measured was not sufficient to determine how long a post circulates on LinkedIn. The main action takes place in the first 12 days, and posts receive the highest attention during this period.

Second learning: LinkedIn re-posts some content after two weeks, but this behavior does not significantly impact growth rates compared to the first two weeks.

However, there is another feature that I have tracked, which provides valuable additional information: impressions and reactions after the first hour of a post. To examine this feature more closely, I created scatterplots of the impressions and reactions after the first hour of a post against the corresponding impressions and reactions at the last timestamp. I eliminated posts without a measurement during the first hour as well as posts that were extreme outliers. In the same graph, I added a simple linear regression line using Seaborn’s built-in function sns.regplot() and computed the R-squared score with scikit-learn.

Scatterplots of impressions (left) and reactions (right) after the first hour and in total after last time-stamp. Additional linear regression – pic by author

Et voilà! There is a correlation between the first hour’s reactions and impressions and how well the post will perform. The R-squared value for first-hour impressions vs. last measured impressions is 0.225, while the R-squared value for first-hour reactions vs. last measured reactions is 0.642. The closer the R-squared value is to 1, the larger the share of the variance in the final values that the linear model explains.
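To make the R-squared figure concrete, here is a minimal sketch that fits a straight line and computes R-squared by hand. It uses plain NumPy instead of scikit-learn, so it is a swapped-in equivalent rather than the code behind the numbers above; the function name `r_squared` and the toy data are hypothetical.

```python
import numpy as np


def r_squared(x, y):
    """R-squared of an ordinary least-squares line y ~ slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)        # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Toy data: reactions after the first hour vs. reactions at the last timestamp.
first_hour = [2, 4, 5, 8, 10]
final = [20, 35, 60, 70, 110]
print(round(r_squared(first_hour, final), 3))
```

For a simple linear regression with an intercept, this value equals the squared Pearson correlation between the two variables, which is why it always lies between 0 and 1.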

Third learning: If a post has a high reaction and impression rate in the first hour, it is more likely to perform well.

Now, let’s take a closer look at the features in the dataset.


Feature inspection

We have 22 features in total, but not all of them are relevant here. We can leave Title, Topic, and HashtagsIncluded aside because they are of string type, and others like SharedPost, PostedInGroup, or PicOthers simply do not have enough observations to analyze further. Additionally, we removed the features taken at the last time step, which describe how a post finally performs, since we have already found a linear correlation between them and the first hour’s impressions and reactions.

Let’s investigate the remaining features and apply Seaborn’s pairplot function to find the following picture:

Applying seaborn pairplot function to some features of the given dataframe – pic by author.

A pairplot is an easy-to-use visualization function in Seaborn. It returns a symmetric matrix of histograms and scatterplots. Let’s take a closer look at some points here:

  • NumFollowers doesn’t seem to have any effect on any other feature, or vice versa.
  • Having more impressions gives the audience more opportunity to react to a post. However, this is just an observation and not proof of causation.
  • Focusing on the row with FirstHourReactions, we find that having a picture of oneself in a post, tagging other people, tagging one’s company, and setting the post to focus on one’s profile seem to have a positive effect.
  • The feature SubjectiveSuccess is highly biased by my own perception and is not taken into account for now.
  • A pairplot of boolean-only features is not very informative, as it does not show the confusion matrix between these features. Let’s try another way of representing them.

Boolean confusion matrices

Let’s move on to visualizing the relationship between boolean features using confusion matrices. For two boolean features, the confusion matrix holds the number (or percentage) of rows where both features are ‘True’ and where both are ‘False’ on its main diagonal, while the other two fields hold the mixed cases. This technique allows you to find out whether boolean features are related. The higher the values on the main diagonal relative to the other fields, the stronger the positive association between the features. Such a matrix can be computed by a function that takes a pandas dataframe and two boolean features as input and returns a 2×2 numpy array.
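A minimal sketch of such a function is shown here. It returns counts rather than percentages, and the function name `bool_confusion` and the column names `PicOfMyself` and `TaggedPeople` are hypothetical.

```python
import numpy as np
import pandas as pd


def bool_confusion(df, feat_a, feat_b):
    """Return a 2x2 array of row counts for two boolean columns.

    Layout: [[both True,        a True / b False],
             [a False / b True, both False      ]]
    """
    a = df[feat_a].astype(bool)
    b = df[feat_b].astype(bool)
    return np.array(
        [
            [(a & b).sum(), (a & ~b).sum()],
            [(~a & b).sum(), (~a & ~b).sum()],
        ]
    )


# Toy data with hypothetical column names.
df = pd.DataFrame(
    {
        "PicOfMyself": [True, True, False, False],
        "TaggedPeople": [True, False, False, False],
    }
)
cm = bool_confusion(df, "PicOfMyself", "TaggedPeople")
print(cm)
```

Dividing the result by `len(df)` would turn the counts into shares of all rows, if percentages are preferred.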

Let’s use Seaborn’s heatmap to create a chart overview. Since there are six possible ways to pair the four features, we will create a 2×3 chart panel and loop through all features twice. Be careful when handling the three indices here! Two indices drive the loops, and a third counts the iterations so that each pair can be mapped to the right position in the chart panel via np.divmod. You will find the whole code below.

Interaction of boolean charts in a panel visual – pic by author.
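The panel logic with its three indices could be sketched as follows. The data is random and the four feature names are hypothetical; the confusion matrix is computed inline with NumPy so that the block is self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random boolean toy data; the feature names are illustrative assumptions.
rng = np.random.default_rng(1)
features = ["PicOfMyself", "TaggedPeople", "TaggedCompany", "FocusPost"]
df = pd.DataFrame(rng.random((100, 4)) < 0.4, columns=features)

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
labels = ["True", "False"]

counter = 0  # third index: counts the pairs that have been plotted so far
for i, feat_a in enumerate(features):
    for j, feat_b in enumerate(features):
        if j <= i:
            continue  # skip self-pairs and mirrored duplicates
        # Map the running counter to (row, column) in the 2x3 panel.
        row, col = np.divmod(counter, 3)
        a = df[feat_a].to_numpy()
        b = df[feat_b].to_numpy()
        cm = np.array(
            [
                [np.sum(a & b), np.sum(a & ~b)],
                [np.sum(~a & b), np.sum(~a & ~b)],
            ]
        )
        sns.heatmap(
            cm, annot=True, fmt="d", cbar=False,
            xticklabels=labels, yticklabels=labels, ax=axes[row, col],
        )
        axes[row, col].set_xlabel(feat_b)
        axes[row, col].set_ylabel(feat_a)
        axes[row, col].set_title(f"{feat_a} vs. {feat_b}")
        counter += 1

fig.tight_layout()
```

`itertools.combinations(features, 2)` together with `enumerate` would produce the same six pairs with a single loop; the nested version is kept here because it mirrors the description above.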

Great! The heatmap provides a clearer view of the interaction between the boolean features than the pairplot did earlier. However, there is still no clear pattern of interaction between these features. To further analyze the relationship between them, we could apply methods like Pearson correlation or Spearman correlation to find linear or monotonic relationships.

Another useful tool for visualizing and analyzing the data is ydata-profiling, which was previously known as pandas-profiling. This tool provides an easy overview of the data and can help identify patterns and trends.

In addition, we could try more advanced methods such as machine learning algorithms to identify complex patterns in the data. However, based on the information provided so far, it doesn’t seem like there is a clear and consistent pattern of interaction between these features.

Fourth learning: Setting a post to focus, mentioning your company, mentioning other people, and/or including a picture of oneself can have positive effects on your post’s success.


Graph investigation

In the previous section, we excluded hashtags, topics, and titles from our analysis. In this section, I would like to introduce a graph-based approach for tackling string-type features such as hashtags. A graph is a mathematical object that consists of nodes connected by edges. The nodes and edges can have different colors or sizes to represent different values or properties.

Before setting up the graph, we need to clean and prepare our data appropriately. The data was stored in a feature called HashtagsIncluded in a comma-separated way with spaces. You can see an example of the data in the picture below. The challenges here are that capital letters, spaces, and commas are included in the data. To address this, we normalize the strings and split them into single hashtags, which also lets us calculate the number of distinct hashtags used in each post. With this information, we start with summary statistics on hashtags: there are 357 distinct hashtags in total. In detail, #analytics has been used 65 times, followed by #1und1versatel, #data, #datascience, #python, and #team.

Leading 5 items of ‘HashtagsIncluded’ showing comma-separated hashtags with spaces— pic by author
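One possible way to handle capital letters, spaces, and commas is sketched here. Only the column name `HashtagsIncluded` comes from the article; the helper `split_hashtags`, the derived columns, and the toy strings are assumptions.

```python
from collections import Counter

import pandas as pd

# Toy data mimicking comma-separated hashtags with mixed case and stray spaces.
df = pd.DataFrame(
    {
        "HashtagsIncluded": [
            "Analytics, Data, Python",
            "analytics , team",
            None,  # a post without hashtags
            "DataScience,analytics",
        ]
    }
)


def split_hashtags(raw):
    """Lowercase, split on commas, and strip spaces (hypothetical helper)."""
    if not isinstance(raw, str):
        return []
    tags = [tag.strip().lower() for tag in raw.split(",")]
    return [tag for tag in tags if tag]


df["hashtag_list"] = df["HashtagsIncluded"].apply(split_hashtags)

# Number of distinct hashtags per post.
df["num_hashtags"] = df["hashtag_list"].apply(lambda tags: len(set(tags)))

# How many posts each hashtag appears in, across the whole dataset.
hashtag_counts = Counter(tag for tags in df["hashtag_list"] for tag in set(tags))

print(len(hashtag_counts), "distinct hashtags")
print(hashtag_counts.most_common(3))
```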

Let’s dive deeper and explore how to use a graph-based approach. One way is to treat every hashtag as a node in the graph and to connect two nodes with an edge whenever the corresponding hashtags are included in the same post. To construct such a graph, we can use libraries like pyvis and NetworkX in Python. We first prepare the hashtags for network construction by creating useful dictionaries that store, for example, how many times a hashtag appears across all posts. Then, we add nodes and edges to the network, passing parameters for the physics of the network.
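The graph construction itself can be sketched with NetworkX alone, as below. This covers only the node and edge bookkeeping, not the pyvis rendering or its physics parameters, and the toy posts and the attribute names `count` and `weight` are assumptions.

```python
from itertools import combinations

import networkx as nx

# Toy input: one list of cleaned hashtags per post.
posts = [
    ["analytics", "data", "python"],
    ["analytics", "team"],
    ["analytics", "data"],
]

G = nx.Graph()
for tags in posts:
    unique_tags = sorted(set(tags))

    # Node attribute "count": number of posts that contain the hashtag.
    for tag in unique_tags:
        if G.has_node(tag):
            G.nodes[tag]["count"] += 1
        else:
            G.add_node(tag, count=1)

    # Edge attribute "weight": number of posts in which both hashtags co-occur.
    for tag_a, tag_b in combinations(unique_tags, 2):
        if G.has_edge(tag_a, tag_b):
            G[tag_a][tag_b]["weight"] += 1
        else:
            G.add_edge(tag_a, tag_b, weight=1)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```

A graph built this way can then be handed to pyvis for interactive display, for instance via the `from_nx` method of its `Network` class.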

At first glance, we can see that hashtags like #analytics and #data are central nodes in the graph, corresponding to their frequency of use (see the annotated picture below). However, there may be more insightful ways to analyze this graph, such as examining the success of each post cumulatively for every hashtag. For every hashtag, we store information about the reaction rate in the first hour by looping through every post.

Annotated graph-based model – hashtags connected by posts – pic by author

When it comes to coloring in visualizations, you may often find yourself needing to create dynamic lists of colors. One convenient way to do this is to use a matplotlib.pyplot function to obtain a colormap, such as Blues, and then map its colors to a list of hexadecimal values. You can see an example of this in the code below. Beforehand, we calculate the relative value of each node by dividing the cumulative sum of the first hour’s reactions by the number of posts that contain the hashtag. The final dictionary containing each hashtag and a color from the colormap is stored in the variable dict_hashtag_react_col.
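A rough sketch of both steps, the relative value per hashtag and the mapping to hexadecimal colors, is given here. Only `dict_hashtag_react_col` is a name from the article; the toy posts, the other variable names, and the scaling by the maximum value are assumptions.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

# Toy posts: cleaned hashtags plus reactions after the first hour.
posts = [
    {"hashtags": ["analytics", "data"], "first_hour_reactions": 4},
    {"hashtags": ["analytics", "event"], "first_hour_reactions": 12},
    {"hashtags": ["event"], "first_hour_reactions": 20},
]

# Accumulate first-hour reactions and post counts per hashtag.
sum_reactions, num_posts = {}, {}
for post in posts:
    for tag in set(post["hashtags"]):
        sum_reactions[tag] = sum_reactions.get(tag, 0) + post["first_hour_reactions"]
        num_posts[tag] = num_posts.get(tag, 0) + 1

# Relative value: average first-hour reactions of the posts using the hashtag.
dict_hashtag_rel = {tag: sum_reactions[tag] / num_posts[tag] for tag in sum_reactions}

# Scale the values to [0, 1] and look each one up in the "Blues" colormap.
cmap = plt.get_cmap("Blues")
max_value = max(dict_hashtag_rel.values())
dict_hashtag_react_col = {
    tag: to_hex(cmap(value / max_value)) for tag, value in dict_hashtag_rel.items()
}

print(dict_hashtag_react_col)
```

Hashtags with a higher average end up with a darker shade of blue, and the hex strings can be passed directly as node colors.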

Again, we build the graph in the same way as described above.

A snapshot of the corresponding graph can be found in the picture below. We observe that the formerly highlighted nodes such as #analytics, #data, and #datascience do not come in dark blue but rather in lighter shades of blue. Meanwhile, 90% of all nodes are grayish and of little importance according to this measure. However, there are some smaller nodes that appear much darker.

Modified graph with coloring by reactions in the first hour – pic by author

These darker nodes refer to hashtags that were mentioned in higher-ranked posts. We can see, for example, #sankeyplot, #event, or #b2runduesseldorf. By using a graph as a tool, we can uncover more complex relationships that cannot be easily displayed in two-dimensional tabular data. This graph raises the question of which hashtags are most relevant for success. Based on the graph alone, my initial guess is that hashtags related to events or visualizations tend to be ranked higher than others. To investigate this further, we can sort the dictionary by relative value to identify the top items.

Zooming into graph to find that not only the strongly connected hashtags are successful – pic by author
Relative values of hashtags measured by first hour reactions – pic by author
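Sorting a dictionary by its relative values takes a single call to `sorted`. The dictionary name and the numbers in this sketch are hypothetical placeholders and not the measured values.

```python
# Hypothetical relative values (average first-hour reactions per hashtag).
dict_hashtag_rel = {
    "analytics": 8.0,
    "data": 4.0,
    "event": 16.0,
    "sankeyplot": 21.0,
}

# Sort by value in descending order and keep the top items.
top_items = sorted(dict_hashtag_rel.items(), key=lambda item: item[1], reverse=True)[:3]

for tag, value in top_items:
    print(f"#{tag}: {value:.1f}")
```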

The sorted values and the corresponding snapshot of the top items support the hypothesis mentioned earlier:

Fifth learning: Posts about data events and social events, including collaborations, are likely to be more successful.

One could argue about the measurement here, as success can be difficult to define. However, simply looking at a measure such as the number of appearances can provide some insight, but it may not fully capture the complexity of success. As an example, even the #analytics hashtag includes successful posts, and this metric only reveals hashtags with high relevance but probably few appearances. I will not delve deeper into this topic here.

Alternatively, one could use this graph technique to visualize other features, such as the Topic or Title feature, incorporating some techniques borrowed from natural language processing. You can find more on this in another article that I have written in the past, see article on graphs.


Summary

The goal of this article was to conduct a deep dive into a dataset that I collected over the past year, 2023. Here are my key findings:

  • First learning: The way I measured circulation time is not sufficient to determine how long a post remains popular on LinkedIn. The main attention-grabbing period occurs within the first 12 days, after which posts tend to receive less attention.
  • Second learning: LinkedIn re-posts certain content after two weeks, with little impact on growth rates compared to the initial two weeks.
  • Third learning: If a post receives high levels of engagement and impressions within the first hour, it is more likely to perform well.
  • Fourth learning: Including personal elements such as mentioning my company, mentioning other people, or including a picture of myself can have positive effects on post success.
  • Fifth learning: Posts related to data events and social events, including collaborations, tend to perform better on LinkedIn.

I used various techniques throughout the analysis, including linear regression, pair plotting, visualization in general, and graph visualization. I employed classical Python libraries such as pandas, numpy, and seaborn, as well as more specialized packages like pyvis and networkx.

Since I have continued writing posts and articles on LinkedIn, there is a chance to verify these findings in the future. If you have any additions, ideas, or suggestions, feel free to reach out to me either here or on LinkedIn.

