The Bounce

Digital Analytics, Product & Marketing Analytics and other unimportant things


Firestore Tip 6: Automatically Sync Firestore Documents to BigQuery

Lukas Oldenburg · Published in The Bounce · 7 min read · Nov 26, 2023

Firestore is great for handling application logic, BigQuery is great for data analysis. The third episode of my series with Firestore tips shows a convenient way to get both.

To the left, Firestore. To the right, BigQuery. The little boats are waiting for Firestore document updates to then carry them over the river.

I have this tool, the “Adobe Analytics Component Manager for Google Sheets”. Clients across the whole world use it to make their Admin work with Adobe Analytics more efficient, such as cleaning up thousands of unused dashboards and segments. You can even schedule report exports to Google Sheets with it.

The tool runs on Google Apps Script (in Google Sheets) + Google Cloud Platform (GCP), with Cloud Functions, Firestore, Cloud Storage, Pub/Sub, and Secret Manager being the most important components. In the Google Sheet, the user can start scripts, e.g. “Suggest Duplicate Components to Harmonize”. Those scripts then trigger a Cloud Function via Pub/Sub. This Cloud Function then runs a script (= a Python module) of a particular name, e.g. harmonize_duplicates.
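That dispatch step (Pub/Sub message in, script run out) can be sketched roughly like this. This is a hypothetical illustration, not the Component Manager's actual code: a real Cloud Function would receive a base64-encoded Pub/Sub payload and import each script as a module, while a plain dict registry keeps the sketch self-contained:

```python
import json

# Registry mapping script names to callables. In the real tool each script
# is a Python module of that name; a dict keeps this sketch runnable.
SCRIPTS = {
    "harmonize_duplicates": lambda payload: f"harmonized for {payload['sheet_id']}",
}


def handle_message(message: dict) -> str:
    """Dispatch a Pub/Sub-style message to the script it names."""
    payload = json.loads(message["data"])
    script = SCRIPTS[payload["script"]]  # KeyError -> unknown script name
    return script(payload)


# Example message as a Cloud Function might see it (base64 decoding omitted):
msg = {"data": json.dumps({"script": "harmonize_duplicates", "sheet_id": "abc123"})}
print(handle_message(msg))  # -> harmonized for abc123
```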

How to start a script in the Adobe Analytics Component Manager for Google Sheets

Each script run gets its own “Script Run ID”, which is a must-have if you want to follow the logs belonging to the same run in Cloud Logging; otherwise it is nearly impossible to isolate the logs of a particular run from all the other runs that happen at the same time.
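One simple way to stamp every log line with such a run ID (sketched here with Python's standard logging; the actual Cloud Logging setup may differ) is a LoggerAdapter that injects the ID into each record:

```python
import io
import logging
import uuid

# Illustrative sketch: generate a Script Run ID and attach it to every
# log line so a single run can be filtered out later.
run_id = uuid.uuid4().hex[:12]

buf = io.StringIO()  # stand-in for the real logging sink
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("[run:%(run_id)s] %(message)s"))

logger = logging.getLogger("component_manager")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# LoggerAdapter injects run_id into every record emitted through it
log = logging.LoggerAdapter(logger, {"run_id": run_id})
log.info("harmonize_duplicates started")

print(buf.getvalue().strip())
```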

When a script starts, it also creates a Firestore database entry for its Script Run ID with some context data (which client company started it, for which Google Sheet, etc.). When the script completes or fails, the Firestore entry is updated with some further info (how long it took, what the error was, etc.).
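The lifecycle of such a run document could look roughly like this. The field names are assumptions based on the description above, not the tool's exact schema, and the actual Firestore calls are indicated only in comments to keep the sketch self-contained:

```python
import time


def create_run_entry(run_id: str, client: str, sheet_id: str) -> dict:
    """Create the run document when a script starts.

    In production this would be something like:
    firestore.Client().collection("script_runs").document(run_id).set(entry)
    """
    return {
        "run_id": run_id,
        "client": client,
        "sheet_id": sheet_id,
        "status": "running",
        "started_at": time.time(),
    }


def finish_run_entry(entry: dict, error=None) -> dict:
    """Update the same document when the script completes or fails
    (in production: an .update() on the same Firestore document)."""
    entry.update({
        "status": "failed" if error else "completed",
        "error": error,
        "duration_s": round(time.time() - entry["started_at"], 3),
    })
    return entry


run = create_run_entry("run-001", "Feld M demo", "sheet-abc")
run = finish_run_entry(run)
print(run["status"])  # -> completed
```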

Example Firestore entry for a script run. This client (my Feld M demo account) activated a schedule for a regular Adobe Analytics report export.

Firestore is ideal for this. Multiple reads and updates of a document in the span of microseconds? No strict schema structure to adhere to? No problem.

Where Analysis starts is where Firestore ends

These script run documents are, in theory, also helpful for debugging: Which scripts failed, and what were the errors? And they are also a good foundation for analysis: Which scripts are the most popular? Which scripts are taking the longest on average? Which clients are using the Component Manager the most?

But, analysis based on Firestore data is limited, as I pointed out in the first part of this series when comparing Firestore with BigQuery. So, I have long wanted to send the completed scripts to BigQuery regularly, but was too lazy for that. Then, Stas Litouka, at Analytics Camp Zurich, told me about this Firebase Extension that auto-syncs Firestore documents to BigQuery.

So I took a look at it.

“But I’m not using a Firebase app!”

First, I did not get why this is a “Firebase” extension as I am not using a Firebase app, nor was I planning to (too many nightmares still from Firebase GA4 mobile app debugging sessions)… But Firebase “Extensions” are simply pre-packaged solutions that are not limited to Firebase mobile app development. They can be used in a variety of contexts, including web applications, backend services, and even in scenarios where Firebase-specific features aren't being used.

Alright, I gave it a try.

Video: From Installation to Output

First, you need to create a Firebase app and link it to your GCP project, select the billing plan etc. No worries, no costs arise until you actually do something with the project.

Then, go to the Extension page and click the fat blue “Install in Firebase console” button. If you want to mumble along while I install it live, with lots of ‘uhms’, check out the video:

If you prefer a more concise manner, keep reading:

Before we go through the individual fields though, let’s review how this Extension works:

So what does this Extension do exactly?

The Extension links a Firestore Collection to a BigQuery dataset (which it creates if it does not exist); I call mine “script_monitoring”. Whenever a document in the Firestore collection is created, updated, or deleted, the content of the document is inserted into a “{your_prefix}_raw_changelog” table in that dataset. So you also get a sort of Firestore “version control” with this Extension, because with this table, you can view a Firestore document’s state at any point in time. I don’t have a need for this, but it might be helpful for debugging one day.

But that also means:

Do not naïvely plug this Extension into just about every one of your Firestore collections! Especially don’t use it if your collection’s Firestore documents get updated a lot!

As an example, I have other Firestore collections for other purposes where the documents can get hundreds of updates within a minute or less. Each of these updates would trigger a new BigQuery insertion. That would make the table unnecessarily large, and thus, the queries on top of it unnecessarily costly. In these cases, you are better (cheaper) off with a self-built process (e.g. via a Cloud Function) that syncs Firestore documents to BigQuery once you know they won’t get updated anymore.
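A minimal sketch of that self-built alternative, assuming a status field marks a document as final: only completed or failed documents are turned into one BigQuery row each, instead of one row per Firestore update (the actual insert call is left as a comment):

```python
import json

# Documents in these states won't change anymore, so each one needs
# exactly one BigQuery row (assumed field names, for illustration).
FINAL_STATES = {"completed", "failed"}


def rows_for_bigquery(docs: dict) -> list:
    """Build insert-ready rows from final Firestore documents only."""
    rows = []
    for doc_id, doc in docs.items():
        if doc.get("status") not in FINAL_STATES:
            continue  # still running -> it will change again, skip for now
        rows.append({"document_id": doc_id, "data": json.dumps(doc)})
    # In production, something like:
    # bigquery.Client().insert_rows_json("project.dataset.table", rows)
    return rows


docs = {
    "run-1": {"status": "completed", "script": "harmonize_duplicates"},
    "run-2": {"status": "running", "script": "export_report"},
}
print(len(rows_for_bigquery(docs)))  # only run-1 qualifies
```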

But back to what the Extension actually does: It also creates a “{your_prefix}_raw_latest” view, which gives you what I will work with: the latest state of each Firestore document, as you see it in Firestore.

In the video above, I examine these tables in more detail, but you can see there are a couple of fields. The most important ones explained:

  • document_id: The Firestore document ID
  • operation: One of CREATE, UPDATE, IMPORT, or DELETE
  • data: The full document data in JSON after the operation (null if operation is DELETE)
  • old_data: The document data before the operation. Comparing this against data will give you the changes.
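For illustration, comparing old_data against data could look like this in Python (the table stores both as JSON strings; the field names in the sample are made up):

```python
import json


def changed_fields(old_data: str, data: str) -> dict:
    """Return {field: {"old": ..., "new": ...}} for fields an UPDATE changed."""
    old = json.loads(old_data) if old_data else {}  # null on CREATE
    new = json.loads(data) if data else {}          # null on DELETE
    return {
        key: {"old": old.get(key), "new": new.get(key)}
        for key in set(old) | set(new)
        if old.get(key) != new.get(key)
    }


old_data = json.dumps({"status": "running", "client": "Feld M demo"})
data = json.dumps({"status": "completed", "client": "Feld M demo"})
print(changed_fields(old_data, data))  # only "status" differs
```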

As Firestore documents can accommodate highly nested JSON-like structures, the only sensible generic way to store this in BigQuery is to use a JSON string as well — which is what the Extension does (you can also link it to a custom Cloud Function for more customized mapping though). This is what the data looks like:

JSON is not ideal for many destinations. But luckily, BigQuery supports powerful JSON functions (thanks to David Hermann for introducing me to them). With those, we can turn that JSON into a “flat” format. My best friend here is the JSON_EXTRACT_SCALAR function.

This is what I do for my script_runs_latest_flat view:

Turning nested JSON into flat tables. All pretty straightforward, but converting Firestore’s “timestamp in nanoseconds” into a BigQuery-compatible timestamp was quite a headache and still looks awful. Feel free to suggest better solutions.
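As a rough Python mirror of what the view does (the real view uses JSON_EXTRACT_SCALAR in BigQuery SQL; the field names, and the assumption that the timestamp arrives as epoch nanoseconds, are purely illustrative):

```python
import json
from datetime import datetime, timezone


def flatten(data: str) -> dict:
    """Flatten one changelog row's JSON `data` into scalar columns."""
    doc = json.loads(data)
    ns = int(doc["started_at"])  # assumed: epoch time in nanoseconds
    return {
        "script": doc.get("script"),
        "status": doc.get("status"),
        # integer division to seconds avoids float drift; in BigQuery SQL
        # the analogue would be TIMESTAMP_SECONDS(DIV(ns, 1000000000))
        "started_at": datetime.fromtimestamp(ns // 10**9, tz=timezone.utc),
    }


row = flatten(json.dumps({
    "script": "harmonize_duplicates",
    "status": "completed",
    "started_at": 1700000000_000000000,  # nanoseconds since epoch
}))
print(row["started_at"].isoformat())  # -> 2023-11-14T22:13:20+00:00
```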

This gives us a nice, classic flat table:

We can now easily access this table for analytical purposes.

What is this good for?

So do I do anything of practical use with it? Absolutely! I have built a “Component Manager Monitoring Dashboard” on top of it in Looker Studio. It is the fanciest dashboard ever, with only gray tables and zero visualizations (yet). But it gives me a good overview of which clients are most active, which scripts are more popular than others, and what was run (and maybe failed) lately. It is so gray that I added a blue frame to the screenshot so your eyes can have some excitement 😅:

The output: An ugly, but useful monitoring dashboard. Note that the stats you see in these images are not representative of the actual Component Manager usage, because I just recently set up this integration, and I did not do a backfill.

Getting transparency into how active each client is was a stretch in the past. Now it is a lot easier.

So far, so good. Let’s look at the not-so-cool side of this Extension:

Bugs (or me being too stupid)

I had to uninstall and re-install the Extension multiple times, and I saw the truly-not-helpful “unknown error” way too often. For example, while I was writing this article, I saw that an Extension update was available. But when I tried to update, the “unknown error” kept reappearing. So, once more, the only option was to uninstall and then re-install again.

When you start with this Extension, bring some tolerance for “unknown errors” with you.

This article originally also featured a particularly annoying bug where the Extension failed to recognize the Firestore timestamp field type as a source for the BigQuery time partitioning field. But this issue has since been fixed. :)

So that’s about it. All in all, the Extension is useful for this particular use case. Now go ahead and try it yourself!

That’s it for my third tips & tricks episode with Firestore. Check out part 1 on how to play God in Firestore or part 2 about must-have properties in Firestore documents and handling concurrent updates.

Do you want to read my content right in your mailbox, without the Medium reading restrictions for non-paying users, immediately after I publish it? Subscribe! I will not use your contact for anything but this purpose.




Written by Lukas Oldenburg

Digital Analytics Expert. Owner of dim28.ch. Creator of the Adobe Analytics Component Manager for Google Sheets: https://bit.ly/component-manager
