Wednesday, December 28, 2016

Load Hangouts Messages into BigQuery

I use Google Hangouts to talk to my girlfriend. I like it because it's part of the Google stack and it's cross-platform. It also supports some decent voice and video calling. Just like with any data you have in Google's system, you can download it in bulk. I decided to take the JSON data you can get from Hangouts and put it into BigQuery to run some interesting analytics on it. Let's see what we can do!

The first thing you have to do is take out your data. This is done at https://takeout.google.com/settings/takeout. Click "Select None" at the top of the list, then scroll down and turn on Hangouts.


Then, when you click next, you can determine output settings. I left mine at the defaults, but this is all up to you.

Next, I waited for Google to email me the link with my archive in it. This could take anywhere from a few minutes to overnight, but there's not much you can do while you're waiting.

When the email comes from Google, download your archive, find the JSON file containing all of your Hangouts messages, and take a look at it. I'm not even going to try to replicate it here - it's deep. The schema they use is quite flexible and contains a lot of info we don't need. It's also not in a format BigQuery can load: the export is one giant JSON document, while BigQuery wants newline-delimited JSON with one record per line. So let's change that.

I've written a converter in JavaScript and PHP that will ingest the downloaded file and output something that you can upload into BigQuery. It's available in this GitHub repo. It's literally one file. I downloaded it to my Cloud Shell, added my JSON file from Google Cloud Storage, and ran it like so:
php hangoutsparser.php
This file takes Hangouts.json and outputs hangoutsparsed.json. It does not do it elegantly, so if your Cloud Shell runs out of memory, try running it on another machine. The file it creates should be ready for BigQuery.
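If you'd rather roll your own converter, the transformation itself is simple: walk each conversation, build a participant-name lookup, and emit one flat record per chat message. Here's a minimal Python sketch of that idea. The field names (`conversation_state`, `gaia_id`, `fallback_name`, etc.) match the Takeout export as I saw it in late 2016 - treat them as an assumption and adjust to whatever your own export actually contains.

```python
import json
from datetime import datetime, timezone

def parse_hangouts(takeout):
    """Flatten a Hangouts Takeout export into rows matching the table schema below.

    Field names are assumptions based on the late-2016 Takeout format.
    """
    rows = []
    for conv in takeout.get("conversation_state", []):
        state = conv.get("conversation_state", {})
        # Map participant gaia_ids to display names. Senders who have left
        # the Hangout won't appear here, which is why sender_name is nullable.
        names = {}
        for p in state.get("conversation", {}).get("participant_data", []):
            gaia = p.get("id", {}).get("gaia_id")
            if gaia:
                names[gaia] = p.get("fallback_name")
        for event in state.get("event", []):
            segments = (event.get("chat_message", {})
                             .get("message_content", {})
                             .get("segment", []))
            text = "".join(s.get("text", "") for s in segments)
            sender = event.get("sender_id", {}).get("gaia_id", "")
            # Hangouts timestamps are microseconds since the Unix epoch.
            micros = int(event.get("timestamp", 0))
            ts = datetime.fromtimestamp(micros / 1e6, tz=timezone.utc)
            rows.append({
                "conversation_id": event.get("conversation_id", {}).get("id", ""),
                "sender_id": sender,
                "sender_name": names.get(sender),
                "message": text or None,
                "timestamp": ts.strftime("%Y-%m-%d %H:%M:%S"),
            })
    return rows

def write_ndjson(rows, path):
    # BigQuery wants newline-delimited JSON: one object per line.
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

The whole export is loaded into memory at once here, just like the PHP version, so the same out-of-memory caveat applies.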

Next, we have to create the table in BigQuery. To do that, we specify the following schema:
[
    {
        "mode": "REQUIRED",
        "name": "conversation_id",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "sender_id",
        "type": "STRING"
    },
    {
        "mode": "NULLABLE",
        "name": "sender_name",
        "type": "STRING"
    },
    {
        "mode": "NULLABLE",
        "name": "message",
        "type": "STRING"
    },
    {
        "mode": "NULLABLE",
        "name": "timestamp",
        "type": "TIMESTAMP"
    }
]
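Because the import step tolerates bad records (more on that below), it's handy to sanity-check the converted file against this schema before uploading. A small sketch - the `MODES` dict just restates the schema above, and the checks are my own, not anything BigQuery provides:

```python
import json

# name -> mode, restating the table schema above.
MODES = {
    "conversation_id": "REQUIRED",
    "sender_id": "REQUIRED",
    "sender_name": "NULLABLE",
    "message": "NULLABLE",
    "timestamp": "NULLABLE",
}

def check_line(line):
    """Return a list of problems with one newline-delimited JSON record."""
    problems = []
    row = json.loads(line)
    for name, mode in MODES.items():
        if mode == "REQUIRED" and row.get(name) in (None, ""):
            problems.append("missing required field: " + name)
    for name in row:
        if name not in MODES:
            problems.append("unknown field: " + name)
    return problems
```

Running every line of the output file through `check_line` gives you a rough count of how many records the import will reject.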
Finally, we can upload our file to Cloud Storage and import it from there. Set your error tolerance high; some records may be inconsistent (I've tried so hard and fought so long). The converter will complain about an unknown sender if that person has left a Hangout: they won't be in the participant list, but they'll still have sent messages. That's an unavoidable problem, which can be worked around by not caring if sender_name is null.

Start the import job and let BigQuery do the rest. Once it's done, you can run some queries on it. I'm no data scientist, so I've only provided one example query below. Comment some cool things you've done with your personal dataset!

  • SELECT REGEXP_EXTRACT(message,r'\[!!!Image (.*)]') AS url FROM dataset WHERE conversation_id="conversation_id" AND message CONTAINS "!!!Image" extracts all of the images shared in a conversation.
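The `[!!!Image URL]` marker that query matches on appears to be how the converter flags image attachments in the message text. If you want to test the regex outside BigQuery before burning query quota, the same extraction works in plain Python:

```python
import re

# Mirrors REGEXP_EXTRACT(message, r'\[!!!Image (.*)]') from the query above.
IMAGE_RE = re.compile(r"\[!!!Image (.*)]")

def extract_image_url(message):
    """Return the image URL embedded in a message, or None if there isn't one."""
    match = IMAGE_RE.search(message)
    return match.group(1) if match else None
```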

