Analyzing your Friends’ iMessage Wordle Stats Using Python

A step-by-step guide to analyzing all those Wordle results you’ve shared with friends.

William Foote
Towards Data Science

--

Since Wordle took off in my friend group, we’re guaranteed thirteen messages a day: a “Good morning, I love you so much” text, a “good night <3” text, and eleven Wordle-results texts.

Is the optimal Wordler the one who finishes it fastest and in the least guesses? (Image by Author)

Almost every day, almost everyone does the Wordle and sends their results in our group chat — sometimes for better and sometimes for worse. Within the group, people have different routines for when they do the Wordle, but the expectation remains steadfast: everyone finishes the Wordle and sends their results in the group chat.

For some of us, we’ve added completing the daily Wordle to the list of items we have to do between waking up and getting out of bed — in addition to other post-sleep, pre-productiveness necessities like checking Instagram and Snapchat.

The competitive members of the group have taken it to an even higher level. For them, completing the Wordle, like the New York Times Daily Mini, is a test of pure willpower and speed. The task is simple: finish the Wordle as soon as possible after the clock strikes midnight, when the new Wordle is released. Psh, who cares how many guesses it takes?

However, what does a single win on a single day, or a string of successes spread over seven, even mean? And so, I set out on a journey of summarizing our results to finally and decisively declare the best Wordlers in the group.

Once and for all, we can answer the big questions. Who completes the Wordle in the least guesses? More importantly, though, whose name is synonymous with speed?

In this piece I'll outline the results of my own data analysis using Python and SQL, as well as a step-by-step walkthrough of the methods so that you can analyze your own population of interest. The full Jupyter Notebook with all my code can be found on my GitHub.

Technology Note: My group chat resides in Apple's iMessage, which guides the specific methods I use for analysis. If you do not have the blue messages, perhaps this specific analysis is not for you. Still, certain techniques I employ, such as regular expressions, might help guide different implementations of the analysis regardless of the make and model of your phone.

Ethics Note: When dealing with private, sensitive data such as text messages, tread with caution. Please ask your friends if they are okay with you analyzing their data before proceeding.

How can we extract data from messages?

You may be asking: we all have our own Wordle statistics on our devices, but how can we access the statistics of others in our group chats? In this article, I'll outline a step-by-step process for accessing where messages are stored on your Mac (yes, unfortunately, I can only outline the process for Mac users as of now) and how to get meaningful data from a text message with Wordle results in it (what I refer to as a Wordle stat-text), like in the screenshot above.

The general idea is that a Wordle stat-text can tell us a person's Wordle score for a given day. For example, what if we whittled the text "Wordle 290 5/6…" down to the number '5' and repeated this process for every stat-text ever sent in our group chats?

Luckily, it is possible to access one's text messages on macOS in most cases. Once you can access all of your text messages, you can create some easy-to-work-with dataframes using the pandas package in Python.

Using pandas' filtering capabilities and regular expressions from the "re" package in Python, we can achieve the goal above of getting every score from every stat-text.

We can also pair this information with other variables that are stored in the iMessage files including who sent the text and when they sent it using tools such as the datetime Python package.

I provide a step-by-step of my coding process in this article. My code is also consolidated in this Jupyter Notebook if you'd rather skip ahead. Nonetheless, this is far from the only way someone could do a similar or better analysis, so I'd love to hear any suggestions you might have to make the code more efficient or the analysis richer!

Step 1A: FAQs and Requirements for doing this analysis

FAQ #1: What if I don’t have an iPhone/Mac combo?

This analysis relies on the process of accessing iMessages that are stored on a Mac. And so, the two cases of:

  1. I have an iPhone but my computer is a PC.
  2. I have an Android phone and a Mac.

are outside the scope of my work. In general, my suggestion for replicating this process if one of the above cases applies to you is to figure out a way to export your messages to a .CSV file, or another way to directly access where your messages are stored.

This Reddit thread outlines some options (some of which are paid) for analyzing iMessage data on a PC, but it was unclear whether there is a way to access the data in the way I go about doing so below.

FAQ #2: What if my group message has an Android user or two?

Green messages? Have no fear. As long as the messages are accessible in your Messages app, they will be included in the data we collect in further steps.

Requirement: Messages preferences and storage timeframe

To proceed, you must be able to access your iMessage files in the first place. As stated above, this tutorial is only suited for Mac users at this time, in which case you can likely access your iMessage files on your computer. The only limitation within this framework is how long you have chosen to store messages on your computer: you can choose to keep messages for 30 days, one year, or forever.

To see what setting you’ve selected, open the Messages app on Mac and select Preferences > General. If you’ve selected Forever (like below), then you should have Wordle results as long as you’ve been sending them in your iMessages.

In the Messages Preferences, the General tab has a dropdown called Keep messages. I have the option “Forever” selected.
I have forever selected, which will allow me to look at all Wordle stats I’ve ever sent or received. (Image by Author)

Step 1: Locate the chat.db file and connect to the chat.db database using SQL

Once you’ve verified that you keep your iMessages stored on your Mac, you need to find where those messages are stored. On Macs, iMessage files are located in a database file called chat.db.

Yorgos Askalidis has a very instructive, easy-to-follow guide on locating the chat.db file and turning it into a helpful pandas dataframe, a guide which I adapted for this article.

The chat.db file should be located at '/Users/USERNAME/Library/Messages/chat.db'. Once you've found the specific file, you can connect to this database using Python.

The database you access through the chat.db file is akin to a Google Sheets file with multiple tables spread across different sheets. We'd like to access aspects of the chat.db file including the text message itself as well as associated metadata, including who sent the text and when, which are stored in different tables (the sheets in the analogy) in the database (the Google Sheets file).

The following code chunk provides the specific commands used to connect to the chat.db database using sqlite3, a package for implementing Structured Query Language commands in Python. SQL allows us to interact with the database in the chat.db file using specific commands (called “queries” in SQL) to grab the data tables (called “Selecting” in SQL) we’d like to analyze. More on this in step 2.
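The original code chunk isn't reproduced here, so below is a minimal sketch of the connection step. The path shown is the standard chat.db location with USERNAME as a placeholder; to keep the sketch runnable on any machine, it actually connects to an in-memory database, which you would swap for the real file path on your Mac.

```python
import sqlite3

# Standard chat.db location on macOS; USERNAME is a placeholder
# you would replace with your own account name.
chat_db_path = "/Users/USERNAME/Library/Messages/chat.db"

# For portability, this sketch connects to an in-memory database.
# On your own Mac, pass chat_db_path here instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
```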

Step 1: Connect to the chat.db database with SQL

Note: If you get a "can't connect" error when connecting to chat.db, you may need to change your privacy settings to give Python and Terminal "Full Disk Access." This Reddit thread was helpful with the issue.

Step 2: Using Pandas to create an easy-to-work-with dataframe

A little blurb on SQL

Now that you’ve connected to the chat.db database, it’s time to select the tables that contain the data of interest. We’ll use pandas to run the SQL queries (though you might recall we used sqlite3 to establish the connection).

I won’t get too deep into the code below, but a basic SQL query would look something like this:
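Sketched below in Python against a toy in-memory table (example_table is a made-up name, not part of chat.db):

```python
import sqlite3
import pandas as pd

# Toy table so the query has something to run against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (name TEXT, score INTEGER)")
conn.execute("INSERT INTO example_table VALUES ('Robin', 3)")

# "*" selects every column; ";" conventionally ends the query.
query = "SELECT * FROM example_table;"
df = pd.read_sql_query(query, conn)
```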

Example: A simple SQL query (“;” indicates the end of a query conventionally).

This would select all the columns from the table called "table" ("*" indicates all columns, but you can also specify particular columns, separated by commas). The code below doesn't get much more complicated than that, other than in query_1, which includes some extra data manipulation (queries 2 and 3, like my example, are pure data-collection queries).

The middle part of query_1 might be daunting, but it is basically just taking the date column in the message table, changing how the data is formatted, and renaming it "date_utc." The end result of the SQL query is a table with all the columns from message as they were, plus date reformatted as date_utc.

Using Pandas to take the SQL tables and make them easy to work with in Python

If we were using SQL in the wild, it would output a table stored like a dataframe: a two-dimensional table with labeled rows and columns. But, since we're in a Jupyter Notebook, it's much easier to work with a capital-D-F DataFrame from the pandas library, a package that turns data into easy-to-use data structures, making data analysis easier and more organized.

Among pandas' functionality is the ability to take a SQL query and an established SQL connection (like the one we created with sqlite3) and turn the result into a pandas dataframe. This is useful because a lot of the data cleaning and analysis tools are easiest to implement within the coziness of the infrastructure provided by pandas dataframes.

Step 2A: Create the initial dataframe
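A runnable sketch of the query_1 idea is below. The message table here is synthetic: recent macOS versions store the date column as nanoseconds since Apple's epoch of 2001-01-01, and the conversion shown is a commonly used formulation (the article's exact query may differ, e.g. by also applying a 'localtime' adjustment).

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for the real chat.db connection

# Synthetic message table; recent macOS versions store `date` as
# nanoseconds elapsed since the Apple epoch, 2001-01-01.
conn.execute("CREATE TABLE message (ROWID INTEGER PRIMARY KEY, text TEXT, date INTEGER)")
conn.execute("INSERT INTO message (text, date) VALUES ('Wordle 290 5/6', 0)")

# query_1-style conversion: rescale nanoseconds to seconds, shift to
# the Unix epoch, and render a readable timestamp named date_utc.
query_1 = """
SELECT *, datetime(date / 1000000000 + strftime('%s', '2001-01-01'),
                   'unixepoch') AS date_utc
FROM message;
"""
df_message = pd.read_sql_query(query_1, conn)
```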

Step 2A: Turn SQL queries into Pandas dataframes.

Step 2B: Clean the dataframe for DateTime analysis

Currently, the dates are stored in the dataframes as a character data type. The words "I like ice cream" could also be a value in a dataframe of the character data type. But we know that "2021-10-14 09:12:21" and "I like ice cream" should not be treated the same way.

In a way, dates and times represent a numerical value. Midnight is earlier in the day than noon, and March 1, 2022 comes before March 25, 2022. The datetime package in Python implements this idea, allowing us to do mathematical operations on dates or otherwise represent the numerical relationship between dates.

The code below takes the date columns in the various dataframes and converts them into the datetime format.
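A minimal sketch of the conversion, using a couple of made-up timestamps in place of the real date_utc column:

```python
import pandas as pd

# Hypothetical character-typed timestamps standing in for date_utc.
df = pd.DataFrame({"date_utc": ["2021-10-14 09:12:21", "2022-03-01 00:05:10"]})

# Convert the text timestamps into a true datetime column so that
# dates can be compared and subtracted like numbers.
df["date_utc"] = pd.to_datetime(df["date_utc"])
```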

Step 2B: Clean date values using the datetime package.

Step 2C: Finally, merge the dataframes

Information about a single text message is currently stored in different tables. For the sake of ease, we merge the dataframes to make one dataframe with all of the information.

This process is implemented through Pandas’ merge function. It takes two dataframes, X and Y, related to each other by an identification column and matches related rows. There are different types of merges, but the code I implement is called a left join.

A left join is unique in that it takes dataframes X and Y, joins them on the identifier column, and keeps all the rows of X.

A left join keeps all elements of the left table regardless of whether a matching element is in the right table. A right join keeps all elements of the right table regardless of whether a matching element is in the left table. An inner join keeps only the elements that match in both tables. A full join keeps all elements of both tables regardless of matching.
The four basic types of SQL joins. (Image by Author)

The end result is a dataframe called df_messages which includes the data of your iMessages. Now it’s time to hone in on the texts that contain Wordle stat-texts!
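A left-join sketch with two hypothetical mini-tables (the column names here are illustrative; the real chat.db tables have their own schemas):

```python
import pandas as pd

# Hypothetical mini-tables standing in for chat.db tables:
# messages (left) and a handles table mapping sender ids to numbers.
messages = pd.DataFrame({"text": ["Wordle 290 5/6", "good night <3"],
                         "handle_id": [1, 2]})
handles = pd.DataFrame({"handle_id": [1],
                        "phone_number": ["+15551234567"]})

# Left join: every row of `messages` survives; unmatched rows get NaN.
df_messages = messages.merge(handles, on="handle_id", how="left")
```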

Step 2C: Merge the separate dataframes using left join.

Step 3: Filter the dataframe using Regular Expressions

Now that we have a workable dataframe, the next step is to filter by messages that contain Wordle data.

This is done in two steps, using the .loc method and the str.contains() method.

The DataFrame.loc method allows a user to subset parts of the dataframe satisfying a certain condition in the form DataFrame.loc[condition].

We specify what those conditions are using str.contains. The code below will generate a True or False for every row in the dataframe based on if that row has the word “basketball” in it.
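A runnable sketch with a tiny made-up dataframe (na=False is my addition, so that empty messages count as non-matches rather than errors):

```python
import pandas as pd

# Hypothetical texts; None mimics an attachment-only message.
df = pd.DataFrame({"text": ["we talked basketball", "Wordle 290 5/6", None]})

# One True/False per row, depending on whether "basketball" appears.
mask = df["text"].str.contains("basketball", na=False)
```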

Example: Search for rows where the text message includes the word “basketball.”

The .loc method then returns the rows for which the condition is True (i.e. the text message includes the word "basketball") and ignores the rows for which the condition is False (i.e. it does not include the word "basketball"). The code below puts the steps together. Note: I used [7:8] to only see the 8th row of the output.

Example: This code selects only the 8th row of my search from the example above.
Picture of resulting DataFrame from the code chunk above. Contains 11 columns and 1 row.
In this example, I sent the text message “we talked basketball” on 3/11/2022. (Image by Author)

Step 3A: Get rows where the message includes the term “Wordle”

While the result is far from done, filtering based on texts that contain “Wordle” using the same process above should include all messages with Wordle scoring data in them (as well as other non-useful texts which we will further filter next).
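The same pattern, applied to "Wordle" on a toy dataframe (note that chatter like "I hate Wordle" still passes this filter; later steps remove it):

```python
import pandas as pd

# Hypothetical messages standing in for df_messages.
df_messages = pd.DataFrame({"text": ["Wordle 290 5/6", "I hate Wordle",
                                     "good night <3"]})

# Keep only rows whose text mentions "Wordle".
wordle_df = df_messages.loc[df_messages["text"].str.contains("Wordle", na=False)]
```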

Step 3A: Search for rows with “Wordle” in the text message.

Step 3B: Remove duplicates due to “Reactions”

The problem is that the above query will include rows with duplicated data: when people "react" to a message, the reaction is stored as a separate message from the original Wordle stat-text.

Two examples of how a duplicate Wordle stat-text would show up if Person B Laughed at Person A’s Wordle stat-text.
Kudos to this person for sending their lowlights in the group chat. #Integrity (Image by Author)

Those with the audacity to laugh at people’s failure would (without further filtering) have the X/6 counted for the laugher’s stats when we know it should only count for the person being laughed at.

Recall that the .loc method is typically used to subset a dataframe into the rows that satisfy a given condition. In the code below, I use the same process with one addition: the ~. The ~ in front of the condition essentially means "don't." That is, I want to subset the rows of the dataframe where words like "Laughed" or "Emphasized" are not in the text message.
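A sketch of the negated filter. The alternation pattern below lists iMessage's English reaction verbs; extend it with locale-specific words (e.g. "Exclamó") if anyone in your chat uses a non-English device language.

```python
import pandas as pd

# Hypothetical stat-text plus two "reaction" duplicates of it.
wordle_df = pd.DataFrame({"text": [
    "Wordle 290 5/6",
    'Laughed at "Wordle 290 5/6"',
    'Emphasized "Wordle 290 5/6"',
]})

# ~ negates the mask: keep rows that do NOT look like reactions.
reaction_words = "Laughed|Emphasized|Loved|Liked|Disliked|Questioned"
wordle_df = wordle_df.loc[~wordle_df["text"].str.contains(reaction_words, na=False)]
```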

Step 3B: Filter “reaction” texts.

Step 3C: Remove rows where the text does not contain helpful data

A separate problem is that the original query will also include text messages like "I hate Wordle," which don't include any helpful data.

To fix this, I looked for an aspect of a Wordle stat-text that is always present. I landed on assuming texts containing helpful data include at least one correctly-guessed letter. In Wordle, this is represented by a green box, and in the Wordle stat-text by the green box emoji, 🟩.

I have yet to see someone fail to complete a Wordle with zero correct letters. If that were to happen, I believe it is logical to assume the person was likely either 1) messing around or 2) not messing around and too ashamed to send those results in the group chat.
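A one-condition sketch of the emoji filter on toy data:

```python
import pandas as pd

# Hypothetical rows: a real stat-text with result-grid emoji, and chatter.
wordle_df = pd.DataFrame({"text": ["Wordle 290 5/6\n⬛🟨🟩🟩🟩",
                                   "I hate Wordle"]})

# A genuine stat-text should contain at least one green box emoji.
stat_texts = wordle_df.loc[wordle_df["text"].str.contains("🟩", na=False)]
```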

Step 3C: Select only rows with the green box emoji.

Step 3D: Filter extra edge cases (if applicable)

For me, the above steps filtered the dataset down to rows containing only Wordle stat-texts, except in one case. One of my friends has their phone's language set to Spanish. As such, the pesky "reaction" edge case was not filtered out: instead of their emphasizing a Wordle stat-text showing up as "Emphasized 'Wordle XYZ…,'" it showed up as "Exclamó por 'Wordle XYZ….'"

Step 3D: Extra filtering for edge cases.

This may not be the case for you, but if it is, the code above should do the trick. Another helpful check-your-work exercise is, as I did, exporting the dataframe to a .CSV file. You can then check the "text" column to make sure it only contains actual Wordle stat-texts.

Note: you can do this without exporting, but for me it was easier to see all the rows of the dataframe after opening the .CSV in Excel than it was in my Jupyter Notebook.

If you still have pesky cases, I would recommend exporting the results to a .CSV, finding what the problem is, and including a few more lines of code to further clean the dataframe.

Step 4: Mine Wordle Scores from the Wordle stat-texts using Regular Expressions

Now that we have a dataframe in which the texts include only meaningful Wordle stat-texts, we need to extract the corresponding actual Wordle score as an integer value.

General Process (Ideas)

The general process is to take a Wordle stat-text like "Wordle 243 2/6 …," 1) select only the fraction "2/6," and then 2) remove the "/6" part. We would then be left with just a 2.

Note: In the case someone failed the Wordle (i.e. X/6 in the stat-text), I replace X with 7 for the sake of being able to make the variable an integer.

To implement this process, I used for-loops and regular expressions (regex for short), two techniques I’ll explain below.

What are for-loops and regular expressions?

For those who are unfamiliar with for-loops, a for-loop is essentially an automation method that repeats a process on every row of data. The code below iterates through every row of the data and appends (appending means adding a new item to the end of a list) the corresponding score to the list I called "score."

I'll also be using some regular expressions, which are a family of characters used to locate a search pattern in a given text. If we wanted to select all rows containing "Wordle," we could implement this in Python with .loc and str.contains(). Regular expressions are helpful when we need to search for a more complex pattern.

Regular expressions enhance search patterns by giving certain characters special functions. For example, "." is a wild-card: it will match any letter, number, or symbol. But if you had the sentence "I like ice cream.", you could also search for the period itself using the regex pattern "\.", which uses "\", the escape character.

General Process — But a little closer to actual coding

In the code below, I use the re.findall() function to find a wild-card followed by /6. This shortens "Wordle 243 2/6 …" to "2/6." Then, I use the re.sub() function to take the previous output and replace "/6" with nothing. I added another re.sub() call to replace X with 7 if the person failed the Wordle. Finally, I change "2" to 2 (character data type to integer data type) and append this to the score list.
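A minimal sketch of that loop over two hypothetical stat-texts (the second is a failed Wordle):

```python
import re

# Hypothetical stat-texts; X/6 means the person failed that day.
texts = ["Wordle 243 2/6", "Wordle 244 X/6"]

score = []
for text in texts:
    fraction = re.findall(r"./6", text)[0]  # "." is a wild-card -> "2/6"
    digit = re.sub(r"/6", "", fraction)     # strip "/6", leaving "2"
    digit = re.sub(r"X", "7", digit)        # count a failed Wordle as a 7
    score.append(int(digit))                # "2" (str) -> 2 (int)
```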

Step 4: Turn Wordle stat-texts into numerical Wordle score variable.

When I completed the process for the second time on April 4th, I was able to collect 533 Wordle scores from my iMessages.

Step 5: Put this list into a column of the cleaned dataframe

The code below takes the list of Wordle scores and puts it as a new column in the “cleaned” dataframe.

Step 5: Add scores to the dataframe.

Step 6: Change the phone_number variable into a person’s name

When creating group graphs, it’s a lot easier to identify people by name than phone number. I’ve changed the numbers to protect my friends’ privacy.
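One way to sketch this (the numbers and nicknames below are made up; build your own lookup dictionary from your chat):

```python
import pandas as pd

# Hypothetical cleaned dataframe with (fake) phone numbers.
df = pd.DataFrame({"phone_number": ["+15551234567", "+15559876543"],
                   "score": [3, 4]})

# Made-up number-to-nickname lookup; map() swaps each number for a name.
name_map = {"+15551234567": "Robin", "+15559876543": "Bobbie"}
df["phone_number"] = df["phone_number"].map(name_map)
```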

Step 6: Set phone numbers to people’s names.

Step 7: Final cleaning: subset by the chat_id of interest

You may notice that your “cleaned” dataframe has multiple values in the chat_id column. I deduced that these corresponded to different threads of messages. When I texted Person A, our messages were identified by chat_id = 1.

My group message of interest resides in chat_id = 40 (our group chat is called “The Girls”), but you may need to scroll through the .CSV file you made above to figure out what your chat_id of interest is.
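The subsetting itself is a one-condition .loc, sketched here on toy data (40 is my chat_id; yours will differ):

```python
import pandas as pd

# Hypothetical rows from two different threads.
df = pd.DataFrame({"chat_id": [1, 40, 40], "score": [4, 3, 5]})

# Keep only the thread of interest.
group_df = df.loc[df["chat_id"] == 40]
```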

Actually Analyzing Data

As with many other data science projects, cleaning the data is most of the battle. From here, I created various plots using Python's Seaborn and Matplotlib.pyplot packages.

You should by now have a dataframe of interest that you can make a plethora of plots from. As I referenced in the introduction, I was focused mostly on group accuracy and speed metrics. The Wordle is your oyster though! Investigate at will, and thank you for reading this far.

A quick aside: in the graphs below, you'll notice that two of my friends' stats have very small sample sizes. Robert and Robby, as I've nicknamed them, each had separate reasons for this. Robert was playing on a knock-off Wordle app, unaware we had all been playing on the website (which is now owned by the New York Times). Robby, on the other hand, was an avid Android-user-turned-Appler after a hiking incident "forced" his switch to the blue-messages side.

Step 8 (optional): Make a Kernel Density Estimate graph of accuracy by person

I won't get into the nitty-gritty statistics behind a kernel density estimate (KDE), but at the most basic level it can help create a smooth trendline for my friends' individual Wordle-score distributions. For those more statistically inclined, a KDE is a non-parametric estimation of the probability density function (PDF) of a random variable.

An important note while we're on the topic: the data we're dealing with is discrete (e.g. the integers 1, 2, 3, …, as opposed to continuous data like 1.2, 2.3, 3.4, …), so neither a KDE nor a PDF totally makes sense; we should really be estimating a probability mass function (PMF).

Nonetheless, the goal of this exploratory data analysis (EDA) is to see general trends which can guide further research into the topic. Perhaps the KDE will show us a trend that we can further investigate using statistical methods specifically for discrete data. In my opinion, that’s the goal of good EDA.

I use the savefig function below to export the graph as a .PNG after I made it using Seaborn (aliased as sns). I recommend reading through the Seaborn documentation for further learning.

Step 8: A KDE plot of Wordle scores by person using Seaborn.
The resulting graph estimates the distribution of Wordle Scores by person. (Image by Author)

Step 9 (Optional): Create individual scoring distributions

Technically, the data is discrete, so "histogram" isn't the proper title for this graph. In the following code, I've turned such a histogram into a quasi-barchart.

Step 9: Individual distributions of Wordle scores by person using Seaborn.
This graph shows the distribution of scores by person. Each person has their own plot. (Image by Author)

Step 10 (optional): Create accuracy summary statistics

Now that we have a visual representation of an individual's accuracy, I wanted to look numerically at the score data. This was a pretty easy process, involving grouping by (i.e. .groupby) the phone_number (which is now a person's name), selecting score (i.e. ['score'] selects the score column), and then using the .describe method to get summary statistics such as the mean, min, and max.

To organize the resulting output, I sorted by the mean in ascending order (i.e. sort_values('mean', ascending=True)) and reset the index (i.e. reset_index) so that each row number corresponds to that person's place in the mean-score standings (e.g. first place is labeled row 0).

An underrated summary statistic in this output is count. For my group chat, it gave insight into the data, such as underreporting. In the resulting dataframe, sorted by lowest mean score, there seemed to be a correlation between score and how often someone reported their scores. Perhaps you can use this to call out those in your group chat who aren't sending their lowlights. But, alas, I shan't condone violence. So I definitely wouldn't recommend doing that.
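The chained version of that process, sketched on toy data (lower mean score = better Wordler, so best finishers come first):

```python
import pandas as pd

# Hypothetical per-text scores for two people.
df = pd.DataFrame({"phone_number": ["Robin", "Robin", "Bobbie", "Bobbie"],
                   "score": [2, 4, 4, 6]})

# count/mean/std/min/max etc. per person, best mean score first.
summary = (df.groupby("phone_number")["score"]
             .describe()
             .sort_values("mean", ascending=True)
             .reset_index())
```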

Step 10: Wordle score summary statistics by person.
A screenshot of the dataframe created using the code above. (Image by Author)

Step 11 (optional): Investigate speed data

Step 11A: A Little Bit More Cleaning of Date-Time Data

Wanting to gather speed data is one thing; actually implementing this goal proved to be another beast. The process I settled on uses the datetime package (imported earlier) to take the timestamp variable from the dataframe and turn it into new variables.

How to categorize time visually is a subjective question and might vary based on your take on the problem. That is, should a faster time be represented by a shorter bar on a barplot? Or a taller bar?

Given that a Wordle is available from 12:00 AM to 11:59 PM, I saw two ways to frame the speed question. For one, we might ask “How much time does it take on average for someone to finish the Wordle, starting at 12:00AM?” On the other hand, we could ask “How much time is left on average until the next Wordle is available once a person finishes the current Wordle?”

If one chose the first question, shorter bars in the resulting barplot would indicate faster Wordlers. On the contrary, if one chose the second stance, taller bars would indicate faster Wordlers. I chose the second, because in my mind taller is better (but maybe that’s because I’m 5’6”).

Coding-wise, I used the for-loop technique and appended the results to two lists, time_from_mid and time_til_mid. time_from_mid is time after midnight while time_til_mid is time until midnight (recall I chose to graph the second variable). These values are stored as floats (i.e. continuous numerical values) on 24-hour time, so a Wordle completed at 10:15 AM would have time_from_mid = 10.25 and time_til_mid = 13.75.
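The core of that loop can be sketched like this, with a couple of hypothetical timestamps standing in for the dataframe's cleaned date column:

```python
from datetime import datetime

# Hypothetical completion timestamps.
timestamps = [datetime(2022, 4, 4, 10, 15), datetime(2022, 4, 4, 0, 30)]

time_from_mid, time_til_mid = [], []
for ts in timestamps:
    # Hours since midnight as a float on 24-hour time.
    hours = ts.hour + ts.minute / 60 + ts.second / 3600
    time_from_mid.append(hours)
    time_til_mid.append(24 - hours)  # hours until the next Wordle drops
```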

Step 11A: Date data cleaning and manipulation for graphing Wordle speed.

Step 11B: Make the Graph

There isn't much to do after the data-cleaning process to make the desired barplot. Each row corresponding to a different Wordle stat-text now has a corresponding time_til_mid value. I wanted to see each person's average time_til_mid, which is easy to do using a process similar to step 10.

You can group by person and then take the mean time_til_mid value and plot this easily with matplotlib.pyplot.
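A sketch of that grouping and barplot on made-up data (the Agg backend and filename are my additions; the red line marks the fastest possible completion):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up speed data; substitute the dataframe built in step 11A.
df = pd.DataFrame({"phone_number": ["Robin", "Robin", "Bobbie"],
                   "time_til_mid": [23.5, 22.5, 14.0]})

# Average hours-remaining per person; taller bars = faster Wordlers.
means = df.groupby("phone_number")["time_til_mid"].mean()
plt.bar(means.index, means.values)
plt.axhline(24, color="red")  # fastest possible Wordle completion
plt.ylabel("Avg. hours until the next Wordle")
plt.savefig("wordle_speed.png")
```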

Step 11B: Graph average Wordle speed by person.
A bar touching the red line indicates the fastest possible Wordle completion. Recall Robert and Robby both have N < 15. (Image by Author)

Step 12 (optional): Make a helpful graph using the time_from_mid variable:

While plotting average speed is most visually appealing (in my opinion) using the time_til_mid variable, using the time_from_mid variable is better when making a scatter plot of time vs. date. This makes sense to me because time_from_mid is equivalent to the time of day. In this graph, every individual score is a separate data point on the scatter plot.

Seaborn has good functionality for making a grid of subplots and builds on the functionality of matplotlib.pyplot. There are a lot of arguments in the code below, but most of them are just for making the graph easier to understand.

What I find interesting about this graph is that it gives an idea of how fast someone is, but also what their daily Wordle habits look like. For example, you might look at my graph and see three clusters (early, morning, and later in the day), with most data points happening in the morning.

This reaffirms what I know about my Wordle habits: when I'm with my speedster friends, I finish the Wordle when it first drops. But typically I solve the Wordle right after I wake up, which usually happens anywhere from 8 AM to 12 PM.

This graph is powerful because it can give hints to what people’s habits are like and how likely they are to stick to those habits. Bobia, Bobby, and Boba have distributions that would hint at them finishing the Wordle at their own pace. I’d guess for them it’s a leisurely activity that they do when they want to.

On the flipside, I’d guess that Robin and Bobbie, characters with little variance in their data, are more dedicated to the self-imposed grind. Perhaps they see the Wordle more competitively and live by the mantra “the early bird gets the Wordle.”

Step 12: Create a scatterplot of Wordle speed data by person using Seaborn.
This is my favorite of the graphs because it shows people’s habits and how they’ve changed. (Image by Author)

Congrats! Where do we go from here?

Throughout this tutorial, you’ve learned to clean and process iMessage data (another big thanks to Yorgos Askalidis, whose code can be found on their Github) to turn daily Wordle stat-texts into dataframes rich with insights.

In my analysis, I focused mostly on the speed and accuracy of my Wordler friends, but there is much more analysis one can do. As the sample size grows, perhaps statistical methods (such as hypothesis testing) can tell us whether our speed or accuracy rates are statistically significantly different.

Are people getting better at Wordle the more they play? Are speedsters like Robin maintaining momentum or running out of gas as time goes on? These are questions I’d love to see answered (by myself or you the reader) in future analyses.

Moving forward, I’m also interested to see how our culture values Wordle as part of our daily lives. Will there be a day we stop posting our Wordle scores on Twitter or sending them to a friend? Or is this 5-letter-word game here to stay?

Thanks for reading and let me know in the comments what routes for analysis you’d like to take!
