What do Google and Facebook know about you? You can download and look for yourself.

Introspection on one’s virtual footprint

or, a little data psychoanalysis

M Carlisle

Published in Towards Data Science, Sep 16, 2019

Are you worried about the data collected about you online?

If you’re reading this post, I figure you’ve heard this before: you’re already aware that there is a lot of data about you online, dystopian Orwellian Kafkaesque yadda yadda yadda, and that you’re powerless to control it.

The GDPR is a major advance in European protections of consumer data; while the USA has multiple protections in place (COPPA, Privacy Shield, etc.) nothing is as singular and encompassing as GDPR.

You can, at least, track some of the things that are known about yourself, and limit their use to target you.

As is my wont (bespoke data science 4ever), I want to briefly examine my own data. We’ll take a quick glance through what Google, Instagram, and Facebook offer and see what we can tease out easily.

Jupyter notebooks to replicate the questions asked and graphs plotted here can be found on my GitHub.

Google

Google offers the ability to download your data through Takeout.

Follow the directions given, and a few hours later, Google will email you with links to download what will likely be several gigabytes of personal information, depending on the level of privacy you’ve already told Google to keep for you. My full archive came in at over 20 GB (reasonable, I think, considering that most of that is Google Photos and Mail).

If you’re interested in simply browsing your data, then the format you choose may not matter to you. HTML might be the easiest to read with human eyes, but if you’re interested in analyzing your data, you may wish to drill down into each service and select JSON whenever possible, to make data manipulation easiest. This choice is most important with My Activity, a meta-product where records of activity across products are stored.

There are obvious products, like Search and Location History, whose records you want to dig into, but there are likely “products” you may not be aware of. The one I’m going to examine here is Purchases & Transactions.

I was not aware that Google was scraping my Gmail for purchase info and collecting it in JSON format.

Purchases & Transactions contains information that was scraped from other Google services. If Google detected that a piece of information (email, usually) regarded a purchase, reservation, subscription, delivery, or other kind of financial transaction (not, seemingly, including bank transactions), that content was extracted and placed in a JSON file. Each transaction gets its own file (as opposed to the My Activity files, which contain all actions for a particular product).
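Since each transaction lives in its own file, a first step is simply reading a folder of JSON files into one list. Here is a minimal sketch of that step, not the code from the notebooks mentioned above. The folder, file names, and the `merchant` field are made up for the demo; a real Takeout archive will have its own paths and field names.

```python
import json
import tempfile
from pathlib import Path


def load_transactions(folder):
    """Read every *.json file in `folder` into a list of dicts, one per transaction."""
    transactions = []
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            transactions.append(json.load(handle))
    return transactions


# Demo with two invented files standing in for a real Takeout folder.
# The "merchant" key is hypothetical, not a documented Takeout field.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "order_1.json").write_text(
        json.dumps({"merchant": "Example Cafe"}), encoding="utf-8"
    )
    Path(demo_dir, "order_2.json").write_text(
        json.dumps({"merchant": "Example Books"}), encoding="utf-8"
    )
    demo_transactions = load_transactions(demo_dir)

print(len(demo_transactions))
```

To try this on your own archive, point `load_transactions` at wherever the Purchases & Transactions folder ends up after you unzip it.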

For example, I was surprised (but should I have been?) to see that Google had parsed the contents of a GrubHub receipt email from an order I made in November 2017.

… and part of the JSON extracted from this email.

I was curious how much of this kind of content Google had extracted from emails I’d received, and from other financial services I might have been unaware were connected to Google.

Turns out, a fair amount.

I also wanted to know the “how” behind the creation of this JSON file. Did Google in fact extract this content, or did GrubHub sell it to Google? I don’t know.

Without entering details of other individual transactions, it appears that Google had extracted/acquired, over the course of my use of various Google products, data from 521 transactions (as of the time of my Takeout archive), going back to at least May 2008. (There were 42 that did not have dates.)

Of these 521 transactions, 390 had a “SUBTOTAL” field. Summing these values (ignoring any line items) shows that Google had indexed over $11,000 worth of my transaction data over the past 11 years.
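The counting and summing can be sketched in a few lines. This is only an illustration under a loud assumption: it pretends each transaction is a flat dict with an optional numeric `SUBTOTAL` key and an optional `date` key. The real Takeout files are likely more nested, so the lookups would need adjusting, and the sample values are invented.

```python
def summarize_subtotals(transactions):
    """Return (number with a subtotal, sum of those subtotals, number without a date).

    Assumes a hypothetical flat layout: {"date": ..., "SUBTOTAL": ...}.
    """
    with_subtotal = [t for t in transactions if "SUBTOTAL" in t]
    total = sum(float(t["SUBTOTAL"]) for t in with_subtotal)
    undated = sum(1 for t in transactions if not t.get("date"))
    return len(with_subtotal), total, undated


# Invented sample records, not real transaction data.
sample = [
    {"date": "2017-11-03", "SUBTOTAL": 23.50},
    {"date": "2013-05-20", "SUBTOTAL": 10.00},
    {"date": "2015-01-09"},   # no subtotal field
    {"SUBTOTAL": 4.25},       # no date field
]

count, total, undated = summarize_subtotals(sample)
print(count, round(total, 2), undated)
```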

The real surprise is, that’s not actually a lot. It looks like they really started picking up the pace on transaction scraping in 2013, if my data is representative. (That, or that’s when I started to spend more online using this particular Gmail address. Or both.)

Instagram

To collect your Instagram data, follow these directions, under the heading “Downloading a copy of your data on Instagram”.

This data is only offered in JSON format.

We’ll run a simple query here: what does the breakdown of likes I gave to other profiles look like, from month to month?
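A query like this boils down to grouping likes by month and counting per profile. Below is a minimal sketch, assuming the likes have already been pulled out of the export as (ISO timestamp, profile name) pairs. That pair format and the profile names are assumptions for illustration, not the layout of the Instagram export itself.

```python
from collections import Counter, defaultdict
from datetime import datetime


def likes_by_month(likes):
    """Map 'YYYY-MM' to a Counter of liked profiles, given (ISO timestamp, profile) pairs."""
    monthly = defaultdict(Counter)
    for timestamp, profile in likes:
        month = datetime.fromisoformat(timestamp).strftime("%Y-%m")
        monthly[month][profile] += 1
    return monthly


# Invented sample likes with placeholder profile names.
sample_likes = [
    ("2012-12-01T10:00:00", "profile_a"),
    ("2012-12-03T18:30:00", "profile_a"),
    ("2012-12-09T09:15:00", "profile_b"),
    ("2014-06-21T12:00:00", "profile_c"),
]

monthly = likes_by_month(sample_likes)
ranked = monthly["2012-12"].most_common()  # profiles for one month, most liked first
print(ranked)
```

Plotting the counts in `ranked` as a bar chart gives one month's distribution.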

I started using Instagram in mid-2012. After a few months of little activity, it is fascinating to see how power laws seem to emerge from the simple act of choosing which pictures got a click on the little heart icon: an unintentional popularity contest, to say the least. (Profile names are, of course, hidden.)

Also note that the number of ranked profiles increases; as the number of accounts I follow increases, the variance of photos available to like increases, but the rough power law structure is maintained even though the likes themselves have become more diffuse.

a sample of my Instagram likes distributions: 12/2012, 6/2014, 6/2016, 6/2018

Facebook

As before, follow these directions, under the question “How do I download a copy of my information on Facebook?” You’ll notice that, like Google’s My Activity, Facebook data is available in both JSON and HTML formats.

We’ll look at two aspects of the Facebook data: reactions and post sentiment.

Reactions

Reactions (likes and other kinds) can be counted much like Instagram likes, once the recipient of a reaction has been extracted from sentence-structured posts.

Collecting reaction data from 11+ years of my own Facebook feed drove home how many people have left the platform; not wanting to discard those reactions without an object (“Michael Carlisle reacted to a post.”), I found, in ranking my top 25 reaction recipients, that #5 was the aggregate-of-all-who-left-Facebook placeholder “NO_NAME”.
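Pulling the recipient out of a sentence-structured reaction can be sketched with a regular expression. The title patterns below (“… likes X’s post.”, “… reacted to X’s photo.”) are assumptions about how the sentences are worded; only the object-less form “reacted to a post.” comes from my data as quoted above. Names other than my own are placeholders.

```python
import re
from collections import Counter

# Capture the name between the verb and the possessive "'s" (straight or curly apostrophe).
RECIPIENT = re.compile(r"(?:likes|reacted to) (.+?)['’]s ")


def reaction_recipient(title):
    """Return the recipient named in a reaction sentence, or "NO_NAME" if there is none."""
    match = RECIPIENT.search(title)
    return match.group(1) if match else "NO_NAME"


# Hypothetical titles; the exact wording in a real export may differ.
titles = [
    "Michael Carlisle likes Jane Doe's post.",
    "Michael Carlisle reacted to Jane Doe's photo.",
    "Michael Carlisle reacted to John Roe’s comment.",
    "Michael Carlisle reacted to a post.",
]

recipient_counts = Counter(reaction_recipient(t) for t in titles)
print(recipient_counts.most_common())
```

Falling back to a placeholder instead of dropping the record keeps reactions to departed accounts in the ranking.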

power laws emerging here make sense too, although a power law appears cleaner with “likes” alone.

As with the Instagram data, the “popularity” effect of likes and reactions appears as a kind of “wealth distribution”, governed by a power law.
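One rough way to check for power-law structure is to rank the counts, take logs of both rank and count, and fit a straight line: a power law shows up as a roughly constant negative slope. The sketch below does that with ordinary least squares. It is an informal diagnostic under that assumption, not a rigorous power-law fit, and the data here is synthetic.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), largest count ranked first."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic counts that follow count = 1000 * rank^(-1.5) exactly.
synthetic = [1000 * rank ** -1.5 for rank in range(1, 51)]
slope = loglog_slope(synthetic)
print(round(slope, 2))
```

Real like counts are integers with many ties in the tail, so expect a noisier line than this synthetic example gives.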

Post Sentiment

Running a sentiment analysis on my Facebook posts was a bit more intriguing. VADER, which has been incorporated into NLTK (the “Natural Language Toolkit”), offers a breakdown of document-level sentiment as “positive”, “negative”, “neutral”, and “compound”.

First, we’ll check how many posts were made each year.

Next, we’ll generate box plots of the sentiment of these posts per year. I’m a relatively neutral speaker (hi, mathematician), so we can see the medians are pretty much solidly neutral, and almost all negative-sentiment posts past 2010 (when Facebook’s posts became more sentence-like, instead of the “PERSON is _____” format of the late 2000s) are considered outliers.
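The numbers behind a per-year box plot can be computed directly. This sketch assumes each post has already been given a compound sentiment score and is available as a (year, score) pair. It uses the common 1.5 × IQR whisker rule to flag outliers, which may not match the settings behind my plots, and the scores are invented.

```python
from collections import defaultdict
from statistics import median, quantiles


def yearly_box_stats(scored_posts):
    """Per year: post count, median score, and scores outside 1.5 * IQR of the quartiles."""
    by_year = defaultdict(list)
    for year, compound in scored_posts:
        by_year[year].append(compound)

    stats = {}
    for year, scores in sorted(by_year.items()):
        q1, _, q3 = quantiles(scores, n=4, method="inclusive")
        spread = 1.5 * (q3 - q1)
        outliers = [s for s in scores if s < q1 - spread or s > q3 + spread]
        stats[year] = {"posts": len(scores), "median": median(scores), "outliers": outliers}
    return stats


# Invented (year, compound score) pairs, not real post data.
sample_scores = [
    (2012, 0.0), (2012, 0.1), (2012, 0.2), (2012, 0.1), (2012, 0.0), (2012, -0.9),
    (2013, 0.0), (2013, 0.2), (2013, 0.0),
]

box_stats = yearly_box_stats(sample_scores)
for year, row in box_stats.items():
    print(year, row)
```

The `posts` entry doubles as the posts-per-year count from the previous step.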

This barely scratches the surface of what can be uncovered and what you can learn about yourself in this data. The examples given above don’t really get into the detail known in the individual transactions and posts involved, where a more nuanced analysis can be done.

And if you can learn about yourself through this data, then the companies holding the data can learn as well. And use it to market to you. And possibly sell it to others to market to you.

Are you okay with that?