Architecture

Flow Configuration

Flow configuration takes place inside the main orchestrator function for simplicity, although it could technically reside in a BigQuery table, external table (e.g. Google Sheet) or anywhere else the data can be accessed from BigQuery (e.g. JSON files in a GCS bucket or even an external database).

There are two configurations required: flow_input_config and flow_output_config.

Input Configuration

The flow input configuration (flow_input_config) defines each source table (source_table_ref), its partitioning column (source_table_partitioning) and the view (input_view_ref) that serves as the entry point of the source data subset into the transformation flow. It is represented by an ARRAY of STRUCTs with the following schema:

ARRAY<STRUCT<source_table_ref STRING, source_table_partitioning STRING, input_view_ref STRING>>
DECLARE flow_input_config 
ARRAY<STRUCT<source_table_ref STRING, source_table_partitioning STRING, input_view_ref STRING>> 
DEFAULT [
('project_id.firebase_dataset.events_*', '_TABLE_SUFFIX', 'project_id.events_flow.001_inbound_events')
]; 

Note that since the example is a Firebase sharded table, the configuration declaration sets source_table_partitioning to _TABLE_SUFFIX. For a partitioned source table we would simply use the partition column name, and for a non-partitioned table we pass an empty string ('').
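To illustrate the other two cases, a hypothetical configuration covering a date-partitioned table and a non-partitioned table (the table and view names here are illustrative, not part of the actual flow) could look like:

```sql
-- Hypothetical flow_input_config entries; table/view refs are illustrative.
DECLARE flow_input_config
ARRAY<STRUCT<source_table_ref STRING, source_table_partitioning STRING, input_view_ref STRING>>
DEFAULT [
  -- date-partitioned source table: pass the partition column name
  ('project_id.sales_dataset.orders', 'order_date', 'project_id.orders_flow.001_inbound_orders'),
  -- non-partitioned source table: pass an empty string
  ('project_id.sales_dataset.customers', '', 'project_id.orders_flow.002_inbound_customers')
];
```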

Output Configuration

The flow output configuration (flow_output_config) defines the target output tables (output_table_ref), partitioning (output_table_partitioning) and the source table or view from which to feed the output table (output_source_ref). It is represented by an ARRAY of STRUCTS with the following schema.

ARRAY<STRUCT<output_table_ref STRING, output_table_partitioning STRING, output_source_ref STRING>>
DECLARE flow_output_config 
ARRAY<STRUCT<output_table_ref STRING, output_table_partitioning STRING, output_source_ref STRING>>
DEFAULT [
('project_id.events_flow._bi_events_with_flags', 'event_date','project_id.events_flow.402_events_with_flags'),
('project_id.events_flow._bi_summary_by_week', '','project_id.events_flow.601_summary_by_week')
];

This output configuration feeds two tables which the downstream dashboards can use as data sources: one partitioned by event_date and one unpartitioned summary. Note that in this case all output tables are prefixed with _bi, a naming convention chosen to identify the tables to which downstream Business Intelligence tools should connect.

Also note that to leave output_table_partitioning undefined, you pass an empty string ('') as the parameter value.
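One way an orchestrator could act on such an entry is to build the output DDL dynamically, omitting the PARTITION BY clause when the partitioning string is empty. This is a minimal sketch of that idea, not the function's actual implementation; the variables stand in for one flow_output_config entry:

```sql
-- Sketch only: dynamic DDL for a single output config entry.
-- In the real flow these values would come from flow_output_config.
DECLARE output_table_ref STRING DEFAULT 'project_id.events_flow._bi_events_with_flags';
DECLARE output_table_partitioning STRING DEFAULT 'event_date';
DECLARE output_source_ref STRING DEFAULT 'project_id.events_flow.402_events_with_flags';

EXECUTE IMMEDIATE CONCAT(
  'CREATE OR REPLACE TABLE `', output_table_ref, '` ',
  -- an empty partitioning string simply drops the PARTITION BY clause
  IF(output_table_partitioning != '',
     CONCAT('PARTITION BY ', output_table_partitioning, ' '),
     ''),
  'AS SELECT * FROM `', output_source_ref, '`');
```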

Flow Orchestrator Function

To keep management and operation as simple and consistent as possible, a single function is used to run the flow. It takes one integer parameter specifying how many days before today should be processed in addition to today:

CALL `project_id.firebase_dataset.run_transformation_flow`(last_n_days_to_replace INT64)
CALL `project_id.firebase_dataset.run_transformation_flow`(3)
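To make the parameter's meaning concrete, a sketch of how last_n_days_to_replace might translate into a replacement date range (assumed logic for illustration, not the function's actual body):

```sql
-- Sketch: deriving the replacement window from the function parameter.
-- With last_n_days_to_replace = 3, today plus the 3 prior days are reprocessed.
DECLARE last_n_days_to_replace INT64 DEFAULT 3;
DECLARE replace_from DATE;
SET replace_from = DATE_SUB(CURRENT_DATE(), INTERVAL last_n_days_to_replace DAY);
-- downstream steps would then delete and re-insert data where
-- event_date >= replace_from
```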

Flow Orchestrator Logic

The flow orchestrator function run_transformation_flow executes the flow according to the following logical sequence:

Top-Level Logic

Flow Input Logic

Flow Output Logic
