Analyzing iMessage with SQL

Analyzing iMessage with SQL

·

9 min read

SQLite is an often overlooked flavor of SQL engines. Some have suggested it is the most prolific SQL engine in existence due to its highly flexible nature and ability to run on almost any platform with limited resources. Unlike other SQL engines like MySQL, PostgreSQL, MSSQL, or Oracle, SQLite runs without a server. SQLite does not rely on a data directory or a constantly running daemon: a database is encapsulated in a single file.

SQLite and iMessage

iMessage is one of the most popular messaging platforms today, largely because it is built into iOS and Mac devices. Since its release, it has evolved significantly. But, at its core, it is simply an instant messaging platform. iMessage uses SQLite in the background to store relational data about messages, conversations, and their participants.

As a long-time Apple user, I have backed up and transferred my iPhone data since my first time using an iPhone, which was November 10, 2009. Because I have been digitally hoarding my text data for so long, my iMessage database is nearly 1GB in size.

Until a few years ago, the built-in search feature for iMessage was very limited and buggy. Although it has recently improved significantly, it is, like nearly any end-user tool, very limited in how you can query it. Those of us who frequently work with data that is trapped behind a limited front-end often wish we could get direct access to the SQL database. Fortunately, the iMessage database is not inaccessible - in fact, it is very easy to access.

Finding the iMessage SQL Database

On Your Mac

If you have iMessage enabled on your Mac as well as your iPhone, you have 2 different databases from which to choose. The database on your Mac is very easy to find, as it is simply under ~/Library/Messages/chat.db. If you do not use your Mac for iMessage, or, as in my case, your Mac iMessages do not go as far back, you can extract your iPhone's database by performing a backup to your Mac.

On Your iPhone

Follow these instructions to extract your iPhone's iMessage database:

  1. Open Finder and select your iPhone under "Locations".
  2. Find the "Backups" section and select "Back up all of the data on your iPhone to this Mac", then press Back Up Now to immediately create a new backup. This process may take a while.
  3. Once it is complete, you will find the SQLite file under /Users/[username]/Library/Application Support/MobileSync/Backup/[backup name]/3d/3d0d7e5fb2ce288813306e4d4636395e047a3d28.
  4. If you plan to open this database with Arctype, you'll want to copy and rename the file with a .db extension to indicate that it is a SQLite file.

Getting Started with SQLite

Unlike most SQL servers, you do not need a connection string, host, or username to connect to a SQLite database. All you need to do is point your SQL client to the database file. Arctype makes it simple and convenient to load in SQLite databases within the same workspace as your other connections.

With Arctype

Add new SQLite connection

  1. Under the Connections dropdown, select "Add new data source"
  2. Select "SQLite"
  3. Find the SQLite database file. The file must have a .sqlite3 or .db extension for Arctype to open it.

More detailed instructions can be found in the Arctype Docs.

With Command Line

From a UNIX terminal, type sqlite3 [filename].

iMessage Schema

One of my favorite parts about Arctype is how easy it is to analyze database schema. I'm a long-time user of command-line tools and old-school editors, but sometimes having a more visually interactive tool is a lifesaver. Let's dig into the schema Apple has created for iMessage. Today we will focus on the chat, message, and handle tables, as well as a few join tables to connect related records.

Key iMessage Tables

Note that I have created a custom view called handle2 which adds a field id2 that obfuscates the phone numbers and email addresses of my personal contacts, and you will see this view referenced in the examples in this article.

Digging Into iMessage

Let's write some queries and makes some observations that would not be possible without direct SQL access.

Pique Your Nostalgia with Old Messages

To get started, let's begin with a simple query to view your first 50 messages. If you have chat threads that go back years and years, there is no easy way to access early messages from your iPhone or Mac.

The interface on both platforms requires you to scroll back by about 25 messages at a time. This is prohibitively time-consuming and can result in a crash or reset if the user sends you a new message while you're scrolled back.

Fortunately, we have custom SQL to save us:

select
   h.id2 as sender_name,
   m.text as message_body
from
   message m
   join handle2 h on h.rowid = m.handle_id
order by
   m.date
limit
   50;

Use SQL to view old messages.

handle.id represents the readable identifier for the user. It will be either a phone number or an email address.

Rate Your Friendships with SQL

Let's use SQL to find out who our best friends are. Assuming you view the quality of friendships as a function of the quantity of sent text messages, this should be very accurate!

First, let's divide the number of messages that are from_me by those that are not to produce a reply ratio. This query shows the top 10 people we have been messaging by the total amount of messages, as well as the reply ratio.

Multiplication by 1.0 casts to the REAL data type to avoid integer division, which would result in 1 or 0 instead of a decimal. You can use the link here to see the rules for integer division in SQLite.

select
    h.id2,
    count(1) as cnt,
    round(
        sum(
            case
                when m.is_from_me then 1
                else 0
            end
        ) * 1.0 / count(1) * 100.0,
        2
    )
from
    message m
    join handle2 h on h.rowid = m.handle_id
group by
    h.id
order by
    cnt desc
limit
    10;

Reply ratio between contacts.

One issue with this analysis is that fewer sent messages does not necessarily imply fewer words sent. Let's add some more fields to get a better insight.

Here we can see the total amount of characters sent and received, the average length of text message sent and received, the total ratio of characters sent and received, and the reply ratio. In my case, people from whom I tend to receive more messages also send longer messages than me.

select
    h.id,
    count(1) as cnt,
    sum(length(m.text)) as chars,
    sum(length(m.text)) filter (where m.is_from_me) as chars_sent,
    sum(length(m.text)) filter (where not m.is_from_me) as chars_received,

    round(avg(length(m.text)) filter (where m.is_from_me)) as avg_length_sent,
    round(avg(length(m.text)) filter (where not m.is_from_me)) as avg_length_received,

    round((sum(length(m.text)) filter (where m.is_from_me) * 1.0 / sum(length(m.text)) filter (where not m.is_from_me)), 2) as characters_sent_ratio,
    round((count(1) filter (where m.is_from_me)) * 1.0 / (count(1) filter (where not m.is_from_me)), 2) as reply_ratio
from
    message m
    join handle h on h.rowid = m.handle_id
group by
    h.id
order by
    cnt desc
limit
    10;

Friend iMessage statistics.

This query makes heavy use of aggregate filters. Aggregate filters allow you to use an aggregate function on only a part of the data by specifying a WHERE clause to filter out unwanted records.

Examining iMessage Reactions

There are 2 newer iMessage features whose implementations, in the context of their schema design, are interesting to look into. Recently an announcement was made that Android phones will be able to show iMessage "reactions" properly. Historically, if you send an iMessage reaction to a non-Apple device, it will show up as a textual addition instead of an icon.

iMessage reactions show up as messages on an Android phone

With the announcement of the new compatibility with Android devices, I was curious to learn how the current implementation of the feature works.

I SELECTed a few records with and without a reaction and compared the results. I discovered that the associated_message_type column was usually set to 0, but in messages with a reaction, it was an integer value between 2000-2005. I also noticed that associated_message_guid was present. Apple appears to be using 2000-2005 for its 5 reaction types, 3000-3005 for when a user removed a reaction, and 3 for an Apple Pay request.

Rows with associated_message_guid

From this investigation, it appears that reactions are sent as iMessages with the reaction's textual equivalent appended and a foreign key relation to the parent message. This allows the messages to seamlessly be sent and received by non-Apple devices.

If the message is sent over SMS, the metadata linking the reaction to the message it references is simply lost. If the device is iMessage capable, Apple devices will ignore the text part of the message, find the associated message and add the proper reaction as a visual overlay.

Note that the message table includes both a ROWID and a guid. ROWID is a typical auto-increment integer id field, which is useful for joining on within the local database. However, the auto-incremented primary key will not be the same for the same message across devices. The GUID is globally unique, generated by the author of the message, and sent to all of its recipients. This allows foreign key references across different databases, devices, and users. For more information about the utility of GUIDs, check out this article.

Group chats are stored in the chat table. Join tables chat_handle_join and chat_message_join are used to associate users and messages, respectively, with group chats. Here's a query to find out most used group chats (chat with > 1 members) and the identities of their participants.

select
    group_concat(distinct h.id2) as participants,
    count(m. "ROWID") as message_count
from
    chat c
    join chat_handle_join chj on chj.chat_id = c."ROWID"
    join handle2 h on h. "ROWID" = chj.handle_id
    join chat_message_join cmj on cmj.chat_id = c."ROWID"
    join message m on m. "ROWID" = cmj.message_id
group by
    c."ROWID"
having
    count(distinct h.id) > 1
order by
    message_count desc
limit
    10

Find your most popular group chats.

visualizing the result in arctype

The group_concat function, which is familiar from MySQL by the same name and familiar to PostgreSQL users as string_agg, is an aggregate function that concatenates strings together. See more on how it can be used within SQLite here.

The HAVING clause is similar to a WHERE clause but operates on aggregate functions. If you've wanted to write a query conditional on an aggregate but are not able to inside of your WHERE clause, HAVING is there for you.

Conclusion

SQLite is a powerful tool whose prolific reach across devices and numerous use cases make it one of the most impressive software projects around. If you're curious about what's behind the scenes, SQLite's source code is well known to be well-organized and fun (well, to some of us) to peek into.

iMessage is just one of many pieces of software that rely on SQLite and are used by millions of end-users. You can try out a SQL client like Arctype for free and start exploring the databases that power the tools you use daily!