Data Backup

Whether you interact with your data in BigQuery via the Console, the API or scripted SQL statements, it is possible to delete the wrong table or dataset by accident. Whilst in some instances the data can be recovered (e.g. using Time Travel), it is good practice to back up your data and code in a manner which enables quick recovery if something goes wrong.
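
For reference, a Time Travel recovery looks something like the following sketch (the table names are hypothetical). It only works within the configured time travel window (seven days by default), so it complements rather than replaces backups.

-- Recover the table contents as they were one hour ago into a new table.
-- This requires the original table to still exist and the timestamp to
-- fall within the time travel window.
CREATE TABLE `project_id.dataset_name.table_name_recovered` AS
SELECT *
FROM `project_id.dataset_name.table_name`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);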

Inbound Data Backup

If your data is coming into BigQuery via native Google services like Firebase or Google Analytics, it is streamed into BigQuery and is not necessarily stored anywhere else. This means that without a backup, accidental deletion of the dataset may be unrecoverable.

Backup Destinations

There are a number of different options for where to store your data backups; however, they are all relatively simple to execute, schedule and customize. Live functions are currently available for backing up to:

  • BigQuery Table (in different Dataset)
  • Google Cloud Storage Bucket (Parquet, Uncompressed)
  • Google Cloud Storage Bucket (JSON)

Parquet is the preferred file format because the schema is included in the file header, which makes recovery more reliable.
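
For example, because Parquet files carry their own schema, a backed-up shard can be reloaded without declaring the table schema by hand. The following is a minimal sketch with hypothetical bucket and table names, using the standard LOAD DATA statement rather than any flowfunctions routine:

-- Reload a backed-up shard; the table schema is read from the Parquet files
-- if the destination table does not already exist.
LOAD DATA INTO `project_id.dataset_name.events_20220701`
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-gcs-bucket/events_20220701_*.parquet']
);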

BigQuery

BigQuery backups check for the existence of specific partitions and shards in the backup table, and then insert into the backup table any data which has not yet been backed up. This check is done via the table metadata, so it does not require full table scans.
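
The underlying idea can be illustrated with a simplified metadata query (this is not the flowfunctions implementation, and the dataset names are hypothetical):

-- List date shards present in the source dataset but missing from the
-- backup dataset, using metadata only (no table data is scanned).
SELECT table_name
FROM `project_id`.analytics_123456789.INFORMATION_SCHEMA.TABLES
WHERE table_name LIKE 'events_%'
  AND table_name NOT IN (
    SELECT table_name
    FROM `project_id`.analytics_backup.INFORMATION_SCHEMA.TABLES
  );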

Google Cloud Storage

GCS backups write into a single folder for each shard, using a wildcard to split the data into multiple files in case the shard is larger than the 1GB per-file export limit.
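
This behaviour can be reproduced with BigQuery's EXPORT DATA statement and a wildcard URI; the following is a minimal sketch with hypothetical names, not the flowfunctions implementation:

-- Export one shard to Parquet; the * is replaced by a sequential file
-- number if the export is split across multiple files.
EXPORT DATA OPTIONS (
  uri = 'gs://my-gcs-bucket/events_20220701_*.parquet',
  format = 'PARQUET',
  overwrite = true
) AS
SELECT *
FROM `project_id.analytics_123456789.events_20220701`;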

Source Backups

Firebase

Firebase data comes directly into BigQuery on a daily basis as shards, which can be efficiently backed up to either GCS or BigQuery.

Firebase to BigQuery

This function checks for the existence of date shards in the destination_dataset_ref, and inserts the data when a new shard is detected in the source data. This queries table metadata, so it can be run on a regular basis (e.g. hourly) and incurs only negligible cost.

CALL flowfunctions.backup.backup_firebase_table (
        source_dataset_ref, -- STRING
        destination_dataset_ref -- STRING
        );
CALL flowfunctionseu.backup.backup_firebase_table (
        source_dataset_ref, -- STRING
        destination_dataset_ref -- STRING
        );
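
As a concrete illustration (the dataset refs below are hypothetical), backing up Firebase analytics shards into a dedicated backup dataset in the same project might look like this:

CALL flowfunctions.backup.backup_firebase_table (
        'project_id.analytics_123456789', -- source_dataset_ref
        'project_id.analytics_backup' -- destination_dataset_ref
        );
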
Firebase to GCS

This function backs up table shards from the source_dataset_ref into separate files, with the shard date included in the filename in the format YYYYMMDD, in the destination bucket defined by export_gcs_bucket_name. The offset parameters start_days_offset and end_days_offset determine which days to export, as offsets from today (i.e. CURRENT_DATE() in SQL, where 0 = today).

CALL flowfunctions.backup.backup_firebase_table_to_gcs (
        source_dataset_ref, -- STRING
        export_gcs_bucket_name, -- STRING
        start_days_offset, -- INT64
        end_days_offset -- INT64
        );
CALL flowfunctionseu.backup.backup_firebase_table_to_gcs (
        source_dataset_ref, -- STRING
        export_gcs_bucket_name, -- STRING
        start_days_offset, -- INT64
        end_days_offset -- INT64
        );
CALL flowfunctions.backup.backup_firebase_table_to_gcs (
        'project_id.dataset_name.table_name',
        'my-gcs-bucket', 1, 3
        );

The example will export table shards from the project_id.dataset_name.table_name table into files (with the shard date included in the format YYYYMMDD) in the gs://my-gcs-bucket GCS bucket for the days from yesterday back to three days ago, overwriting any files which already exist. For example, the shard for 2022-07-01 would be saved as gs://my-gcs-bucket/events_20220701_00000000.parquet, and if the data to export were larger than 1GB, the second file would be gs://my-gcs-bucket/events_20220701_00000001.parquet.
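
To sanity-check which shard dates a given offset range covers, the mapping can be previewed with a plain SQL query (purely illustrative, not part of the backup function):

-- Shard suffixes covered by start_days_offset = 1 and end_days_offset = 3.
SELECT FORMAT_DATE('%Y%m%d', shard_date) AS shard_suffix
FROM UNNEST(GENERATE_DATE_ARRAY(
  DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY),
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
)) AS shard_date;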

DDL-Based Backups

Data Definition Language (DDL) backups leverage the fact that the DDL required to recreate resources is available via the INFORMATION_SCHEMA. This means that backing up the DDL into a BigQuery dataset provides a mechanism to recover any views, external tables or routines.
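
The DDL itself can be read directly from the metadata views; for example (with a hypothetical dataset ref):

-- Tables, external tables and views (views appear with table_type = 'VIEW').
SELECT table_name, ddl
FROM `project_id`.reporting.INFORMATION_SCHEMA.TABLES;

-- Routines (functions and procedures).
SELECT routine_name, ddl
FROM `project_id`.reporting.INFORMATION_SCHEMA.ROUTINES;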

The following functions will back up all view, routine and external table DDL to a set of date-sharded tables in the destination dataset defined by destination_dataset_ref. Scheduling this statement on a daily basis will give you a daily backup of all DDL in the datasets defined in backup_dataset_refs.

All Resources

CALL flowfunctions.backup.backup_resource_metadata (
        backup_dataset_refs, -- ARRAY<STRING>
        destination_dataset_ref -- STRING
        );
CALL flowfunctionseu.backup.backup_resource_metadata (
        backup_dataset_refs, -- ARRAY<STRING>
        destination_dataset_ref -- STRING
        );
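
As an illustration of how the ARRAY<STRING> argument is passed (the dataset refs are hypothetical):

CALL flowfunctions.backup.backup_resource_metadata (
        ['project_id.reporting', 'project_id.transformation'], -- backup_dataset_refs
        'project_id.ddl_backups' -- destination_dataset_ref
        );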

Individual resource types can also be backed up using the following sub-functions.

Routines

CALL flowfunctions.backup.backup_info_schema_routines (
        dataset_refs, -- ARRAY<STRING>
        destination_dataset_ref -- STRING
        );
CALL flowfunctionseu.backup.backup_info_schema_routines (
        dataset_refs, -- ARRAY<STRING>
        destination_dataset_ref -- STRING
        );

Views

CALL flowfunctions.backup.backup_info_schema_views (
        dataset_refs, -- ARRAY<STRING>
        destination_dataset_ref -- STRING
        );
CALL flowfunctionseu.backup.backup_info_schema_views (
        dataset_refs, -- ARRAY<STRING>
        destination_dataset_ref -- STRING
        );

Tables

CALL flowfunctions.backup.backup_info_schema_tables (
        dataset_refs, -- ARRAY<STRING>
        destination_dataset_ref -- STRING
        );
CALL flowfunctionseu.backup.backup_info_schema_tables (
        dataset_refs, -- ARRAY<STRING>
        destination_dataset_ref -- STRING
        );

Note that when recovering tables, they will be created as empty tables with the exact same schema as the source table. In the context of a Transformation Flow this is not problematic as the tables will be filled with data when the flow is run and the inbound data should be backed up separately.
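
Recovery from a DDL backup then amounts to re-running the stored statements. The following is a minimal sketch, assuming the backed-up DDL sits in a ddl column of a date-sharded backup table (the table and column names are hypothetical and should be adjusted to the actual backup output):

-- Re-run each backed-up DDL statement from a single backup shard.
-- Statements for resources which already exist will fail unless the
-- stored DDL uses CREATE OR REPLACE.
FOR stmt IN (
  SELECT ddl
  FROM `project_id.ddl_backups.info_schema_tables_20220701`
) DO
  EXECUTE IMMEDIATE stmt.ddl;
END FOR;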