Instagram Data Analysis Using Panoply and Mode
Background
This project is built on top of the data challenge that Panoply released in April 2019. Panoply is a cloud data warehouse that lets you gather data from different sources (e.g., AWS S3, Google Analytics) into one place and then connect to business intelligence tools (e.g., Chartio, Mode) for analytics and insights.
Panoply recently integrated its data warehouse with the Instagram API to collect data. This challenge is about using Panoply as an ETL tool to explore Instagram data for marketing purposes (e.g., promotion, segmentation).
In this challenge, participants are asked to set up a Panoply account, connect either their own Instagram data or data provided by Panoply, perform analyses, draw insights, and build visualizations for storytelling. If you have an Instagram account, you can use your own data. If you prefer the data provided by Panoply, you can choose from two accounts:
- Kirsten Alana: A top Instagram Influencer.
- Shinesty: An innovative and edgy apparel brand.
You can then use the BI tool of your choice for data visualization. The final deliverables of the challenge are a write-up in English, data visualization(s) communicating your findings, and the SQL queries you used.
Project Design
The steps of my project were:
- Explore Panoply, using the resources it provides for the challenge and the documentation on its website.
- Create a free-trial Panoply account, connect to the provided data source, and connect to a BI tool.
- Explore the datasets, and understand the schemas and relationships.
- Research and come up with questions/metrics/KPIs that an Instagram account owner would want to track, and answer them so the owner can gain insights to grow their brand.
- Use Mode to build visualizations and submit to the challenge.
Tools
The tools and technologies I used in this project are Panoply, Instagram data, SQL, and Mode.
Process
I started by reading the resources Panoply provided and exploring their website to better understand what Panoply is and how it fits into the data analytics process. Then I created a free-trial Panoply account and followed their documentation to connect to Amazon S3 and collect the provided Instagram data.
Create a Panoply account
Below is the flow for signing up for a Panoply account:
1. Go to https://panoply.io.
2. Sign up for a free 21-day trial account with an email address.
Connect to data sources and collect data
Below is the flow for connecting to Amazon S3 and collecting the Instagram data via the API. In this challenge, I used the data from Shinesty, an innovative and edgy apparel brand, as my analysis target.
1. Go to the Data Sources tab.
2. Select the data source you want to connect to; for this project I used Amazon S3.
3. Enter your credentials and start collecting data from your source into the Panoply data warehouse.
4. Your tables are ready to go.
Tables for analysis
After collecting the data, I found there are 31 tables in total. After exploring and examining them, I identified the 5 most useful tables and used them to draw meaningful insights. Below are the tables and the columns I used.
Metrics
Through my research, I identified some metrics and questions that an Instagram account owner would want answered in order to grow their account's awareness, given the data the API provides:
- Engagement per hashtag.
- Performance over time.
- Performance by day of week, based on when users comment.
- Performance by hour of day, based on when users comment.
- Performance by hour of day, based on when a post is created.
- Performance by media type.
- Performance by filter for each media type.
- Performance by location tag.
- Performance difference between posts with a location tag and posts without one.
- Top active commenters.
- Cohort analysis.
Connect Mode for visualization
1. Go to https://mode.com and create a free non-business account.
2. From your account dashboard, click the down arrow on the left, next to your name.
3. In the drop-down menu that appears, select the Connect a Database tab.
4. Connect to your database using your credentials.
For this project I connected to Amazon Redshift, because that is the database Panoply uses to store the tables. After you enter your credentials and connect to your Panoply database, Mode will start syncing all the tables into your account's database in Mode. Once that is done, you can start using SQL to analyze your data in Mode.
Analysis
Below are the queries I used to answer the questions in the Metrics section.
- Finding engagement for each hashtag used:
WITH t AS (
    SELECT
        value AS hashtag,
        likes_count AS likes,
        comments_count AS comments
    FROM public.shinestyinstagram_instagram_media m
    LEFT JOIN public.shinestyinstagram_instagram_media_tags mt
        ON m.id = mt.instagram_media_id
)
SELECT
    hashtag,
    AVG(likes) AS avg_likes,
    AVG(comments) AS avg_comments
FROM t
WHERE hashtag IS NOT NULL
GROUP BY 1
The visual below shows the average likes for each hashtag; #houstonstrong and #theperfectcrime have the highest average likes.
The visual below shows the average comments for each hashtag used; #buttstuff and #macrobrews perform best.
The visual below combines the two above.
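The join-then-aggregate pattern in the hashtag query can be sketched in plain Python. The posts and hashtags below are made up purely for illustration; only the shape of the logic (LEFT JOIN media to tags, then GROUP BY hashtag with AVG) mirrors the SQL:

```python
from collections import defaultdict

# Toy stand-ins for the media and media_tags tables (hypothetical values).
media = {1: {"likes": 120, "comments": 10},
         2: {"likes": 80,  "comments": 4},
         3: {"likes": 200, "comments": 25}}
tags = [(1, "#buttstuff"), (2, "#buttstuff"), (3, "#macrobrews")]

# JOIN media to tags, collecting the posts behind each hashtag.
grouped = defaultdict(list)
for media_id, hashtag in tags:
    grouped[hashtag].append(media[media_id])

# GROUP BY hashtag with AVG(likes) and AVG(comments).
engagement = {h: {"avg_likes": sum(p["likes"] for p in posts) / len(posts),
                  "avg_comments": sum(p["comments"] for p in posts) / len(posts)}
              for h, posts in grouped.items()}

print(engagement["#buttstuff"])   # {'avg_likes': 100.0, 'avg_comments': 7.0}
```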
- Finding performance over time:
SELECT
DATE_TRUNC('week', created_time)::DATE as week,
SUM(comments_count) AS total_comments,
AVG(comments_count) AS avg_comments,
SUM(likes_count) AS total_likes,
AVG(likes_count) AS avg_likes,
count(distinct id) as nums_of_post
FROM
public.shinestyinstagram_instagram_media
GROUP BY
1
ORDER BY
1
- Finding performance by day of week, based on when users comment:
select
TO_CHAR(created_time, 'DY') as day,
COUNT(distinct media_id) AS nums_of_post_got_commented,
COUNT(distinct from_username) AS nums_of_commenter,
ROUND((nums_of_commenter/ cast(nums_of_post_got_commented as FLOAT)), 0) as average_commenter_per_post
from
public.shinestyinstagram_instagram_comments
group by
1
order by
1
The visual below shows that Thursday and Friday are when users comment the most.
- Finding performance by hour of day, based on when users comment:
select
TO_CHAR(created_time, 'HH24') as hour,
COUNT(distinct media_id) AS nums_of_post_got_commented,
COUNT(distinct from_username) AS nums_of_commenter,
ROUND((nums_of_commenter/ cast(nums_of_post_got_commented as FLOAT)), 0) as average_commenter_per_post
from
public.shinestyinstagram_instagram_comments
group by
1
order by
1
- Finding performance by hour of day, based on when the post was created:
SELECT
TO_CHAR(created_time, 'HH24') as hour,
SUM(comments_count) AS total_comments,
AVG(comments_count) AS avg_comments,
SUM(likes_count) AS total_likes,
AVG(likes_count) AS avg_likes,
count(distinct id) as nums_of_post
FROM
public.shinestyinstagram_instagram_media
GROUP BY
1
ORDER BY 1
I include this query to show why I think it would be inappropriate to recommend posting between 7am and 4pm based on it: it attributes likes and comments to the hour a post was created. In contrast, the previous visual uses the time when users actually commented, which I think is more accurate. So I would suggest @shinestythreads post between 11pm and 2am to get more comment engagement. Unfortunately, the API does not provide the equivalent timestamp for likes, which I would have liked to use to measure like engagement the same way.
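The distinction argued above, bucketing by comment time versus post-creation time, can be shown on a toy example where the same activity yields different "best hours" depending on which timestamp you group by. All timestamps here are invented for illustration:

```python
from collections import Counter
from datetime import datetime

# One hypothetical post created at 09:00, with comments arriving late at night.
post_created = datetime(2019, 4, 1, 9, 0)
comment_times = [datetime(2019, 4, 1, 23, 5),
                 datetime(2019, 4, 1, 23, 40),
                 datetime(2019, 4, 2, 0, 15),
                 datetime(2019, 4, 2, 1, 10)]

# Grouping by the post's creation hour attributes all engagement to 09:00...
by_post_hour = Counter([post_created.hour] * len(comment_times))
# ...while grouping by each comment's own hour shows the real late-night activity.
by_comment_hour = Counter(c.hour for c in comment_times)

print(by_post_hour)      # Counter({9: 4})
print(by_comment_hour)   # Counter({23: 2, 0: 1, 1: 1})
```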
- Finding performance by types of media:
SELECT
type,
SUM(likes_count) as total_likes,
AVG(likes_count) as avg_likes,
SUM(comments_count) as total_comments,
AVG(comments_count) as avg_comments,
COUNT(distinct id) as nums_of_post
FROM
public.shinestyinstagram_instagram_media
GROUP BY
1
There are three types of media that Instagram offers right now: image, video, and carousel. The visuals below show that video gets the most average comments and carousel the most average likes, and that overall video appears to be the best media type for engagement.
- Finding performance for different media by filter:
SELECT
filter,
type,
SUM(likes_count) as total_likes,
AVG(likes_count) as avg_likes,
SUM(comments_count) as total_comments,
AVG(comments_count) as avg_comments,
COUNT(distinct id) as nums_of_post
FROM
public.shinestyinstagram_instagram_media
GROUP BY
1, 2
In the visual below, we can see that the filter named Crema performs best for image media, aside from the Normal filter.
In the visual below, we can see that the filter named Ashby performs best for video media, aside from the Normal filter.
We could do the same for carousel media, but I did not, because @shinestythreads only uses the Normal filter for carousels.
- Finding performance by location tag used:
SELECT
location,
SUM(likes_count) as total_likes,
AVG(likes_count) as avg_likes,
SUM(comments_count) as total_comments,
AVG(comments_count) as avg_comments
FROM
(SELECT
name as location,
m.likes_count,
m.comments_count
FROM
public.shinestyinstagram_instagram_media_location l
LEFT JOIN public.shinestyinstagram_instagram_media m
ON l.instagram_media_id = m.id
) as t
GROUP BY
1
This visual shows that, among all posts with a location tag, Augusta National Golf Club has the highest average likes and comments engagement.
- Finding the performance difference between posts with a location tag and posts without one:
WITH t AS (
    SELECT
        m.id,
        m.likes_count,
        m.comments_count,
        l.name AS location
    FROM public.shinestyinstagram_instagram_media m
    LEFT JOIN public.shinestyinstagram_instagram_media_location l
        ON m.id = l.instagram_media_id
),
w AS (
    SELECT
        *,
        (CASE WHEN location IS NULL THEN 0 ELSE 1 END) AS have_location
    FROM t
)
SELECT
    have_location,
    SUM(likes_count) AS total_likes,
    AVG(likes_count) AS avg_likes,
    SUM(comments_count) AS total_comments,
    AVG(comments_count) AS avg_comments
FROM w
GROUP BY 1
Posts with a location tag have more average likes but slightly fewer average comments compared to posts without one.
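The CASE WHEN null-flag trick in the query above boils down to splitting posts on whether the joined location is null. A plain-Python sketch on made-up post rows:

```python
# Hypothetical posts; location is None when no location tag was used.
posts = [{"likes": 150, "comments": 8,  "location": "Augusta National Golf Club"},
         {"likes": 90,  "comments": 12, "location": None},
         {"likes": 110, "comments": 6,  "location": "Denver"}]

def avg(xs):
    return sum(xs) / len(xs)

# Equivalent of CASE WHEN location IS NULL THEN 0 ELSE 1 END, then GROUP BY.
tagged = [p for p in posts if p["location"] is not None]
untagged = [p for p in posts if p["location"] is None]

print(avg([p["likes"] for p in tagged]))      # 130.0
print(avg([p["likes"] for p in untagged]))    # 90.0
```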
- Finding the top active commenters:
SELECT
*
FROM
(SELECT
from_username as username,
COUNT(media_id) as nums_of_comments,
RANK() OVER(ORDER BY nums_of_comments DESC) as rank
FROM
public.shinestyinstagram_instagram_comments
GROUP BY
1
ORDER BY
2 DESC
) as t
WHERE
rank >1 and rank <=15
This visual shows the top active commenters, excluding the account owner @shinestythreads (who ranks first).
- Cohort analysis:
WITH t AS (
    SELECT
        media_id,
        from_username AS username,
        DATE_TRUNC('week', created_time) AS week
    FROM public.shinestyinstagram_instagram_comments
),
w AS (
    SELECT
        username,
        MIN(week) AS first_time_commenting
    FROM t
    GROUP BY 1
)
SELECT
    x.cohort::DATE AS week,
    MAX(x.week_number) OVER (PARTITION BY x.cohort) AS total_nums_of_week,
    x.week_number,
    MAX(x.nums_of_commenter) OVER (PARTITION BY x.cohort) AS nums_of_new_commenter,
    x.nums_of_commenter,
    x.nums_of_commenter / MAX(x.nums_of_commenter) OVER (PARTITION BY x.cohort)::FLOAT AS retention_rate
FROM (
    SELECT
        w.first_time_commenting AS cohort,
        FLOOR(EXTRACT('day' FROM t.week - w.first_time_commenting) / 7) AS week_number,
        COUNT(DISTINCT t.username) AS nums_of_commenter
    FROM t
    LEFT JOIN w
        ON t.username = w.username
    GROUP BY 1, 2
) AS x
ORDER BY 1, 2, 3
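To make the cohort query easier to follow, here is the same idea in plain Python: each commenter is assigned to the week of their first comment (the cohort), and retention in a later week is the share of that cohort still commenting. The usernames and weeks below are invented for illustration:

```python
from collections import defaultdict

# (username, week_number) pairs for hypothetical comments; week 0 is the first week.
comments = [("ann", 0), ("bob", 0), ("cat", 0),
            ("ann", 1), ("bob", 1),
            ("ann", 2)]

# Cohort = the week of each user's first comment (the MIN(week) step in the query).
first_week = {}
for user, week in comments:
    first_week.setdefault(user, min(week, first_week.get(user, week)))

# Count distinct commenters per (cohort, weeks since first comment).
active = defaultdict(set)
for user, week in comments:
    cohort = first_week[user]
    active[(cohort, week - cohort)].add(user)

# Retention for the week-0 cohort: fraction of the cohort active in each later week.
cohort_size = len(active[(0, 0)])
retention = {wk: len(users) / cohort_size
             for (cohort, wk), users in sorted(active.items()) if cohort == 0}
print(retention)   # week 0 -> 1.0, week 1 -> 2/3, week 2 -> 1/3
```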
Above are all the analyses I did for this project. I did not upload the data to my GitHub for confidentiality reasons. If you are interested in the resources I used for this project, please visit my GitHub repo.
If you have any questions, feel free to comment below. Thank you so much for reading!