Ocado Technology open sources GCP Census, a BigQuery metadata collector
Ocado Technology uses a wide range of Google Cloud Platform (GCP) big data products for data-driven decisions and machine learning. Notably, we use Google BigQuery as the main storage solution for data analytics in the Ocado Smart Platform, our proprietary solution designed to power all aspects of a grocery retailer’s e-commerce business.
Because BigQuery is so central to the Ocado platform, we wanted an easy way to get a bird’s eye view of the data stored in it. More specifically, we wanted to know:
- Which datasets and tables are the largest or the most expensive?
- How many tables and partitions do we have?
- How often tables and partitions are updated over time?
- How our datasets and tables/partitions are growing over time?
- Which tables and datasets are stored in specific location?
These questions inspired us to build GCP Census, an open source Google App Engine app written in Python which regularly collects metadata about BigQuery tables and stores it back in BigQuery for further analysis.
Our BigQuery domain
We store petabytes of data in BigQuery, divided into multiple GCP projects and hundreds of thousands of tables. BigQuery has many useful features for enterprise cloud data warehouses, especially in terms of speed, scalability and reliability. One example is partitioned tables rather than daily tables, which we recently adopted for their numerous benefits. At the same time, partitioned tables increased the complexity and scale of our BigQuery environment, and BigQuery offers limited ways of analysing metadata:
- overall data size per project (from billing data)
- single table size (from BigQuery UI or REST API)
- __TABLES_SUMMARY__ and __PARTITIONS_SUMMARY__ provide only basic information, like list of tables/partitions and last update time.
GCP Census architecture
Here is how GCP Census works:
- A daily run is triggered by the Google App Engine cron
- GCP Census iterates over all projects/datasets/tables where it has access
- A task is created for each table and queued for execution in App Engine Task Queue
- A task worker retrieves table metadata using the REST API and streams it into the metadata tables. In case of partitioned tables, GCP Census also retrieves the partitions’ summary by querying the partitioned table and stores the metadata in partition_metadata table,
- The user can query metadata using BigQuery’s UI or the API
- Optionally, you can create a Data Studio dashboard based on the metadata.
GCP Census is highly scalable as it can easily scan millions of table and partitions. It’s also easy to set up: after GCP Census scans the resources to which it has IAM access, it automatically creates the relevant tables and views. Finally, it provides a secure cloud native solution with Google App Engine Firewall and fine grained IAMs. You also get all of the Google App Engine features on top!
Using GCP Census
There are several benefits to using GCP Census. For example, the questions listed above can be easily answered using a BigQuery query; here are some examples how you can query GCP Census metadata:
- Count all data to which GCP Census has access
SELECT sum(numBytes) FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
- Count all tables and partitions
SELECT count(*) FROM `YOUR-PROJECT-ID.bigquery_views.table_metadata_v1_0`
SELECT count(*) FROM `YOUR-PROJECT-ID.bigquery_views.partition_metadata_v1_0`
- Select top 100 largest datasets
SELECT projectId, datasetId, sum(numBytes) as totalNumBytes
GROUP BY projectId, datasetId ORDER BY totalNumBytes DESC LIMIT 100
- Select top 100 largest tables
SELECT projectId, datasetId, tableId, numBytes
ORDER BY numBytes DESC LIMIT 100
- Select top 100 largest partitions
SELECT projectId, datasetId, tableId, partitionId, numBytes
ORDER BY numBytes DESC LIMIT 100
Optionally, you can create a Data Studio dashboard based on the metadata. We used Data Studio because of the ease and simplicity in creating dashboards with the BigQuery connector. Splitting data by project, dataset or label and diving into the storage costs is now a breeze, and we have multiple Data Studio dashboards that help us quickly dive into the largest project, dataset or table.
Below you can find a screenshot of how these dashboards look (all real data has been redacted):
With GCP Census, we’ve learned some of the characteristics of our data; for example, we now know which data is modified daily or which historical partitions have been modified recently. We were also able to identify potential cost optimization areas—huge temporary tables that no one uses but that were incurring significant storage costs. All in all, we’ve learned a lot about our operations, and saved a bunch of money!
The great news is that GCP Census is now available on Github at https://github.com/ocadotechnology/gcp-census – we recommend you download it for your own use or contribute to the project. You will also find the information needed for installation and setup. If you’re already a GCP Census user or plan to be one, we look forward to your feedback!