Note that this site is now deprecated and will be removed shortly. The latest information and documentation on the
flowfunctions library and framework can be found at the flowfunctions.io site.
Flow configuration takes place inside the main orchestrator function for simplicity, although it could technically reside in a BigQuery table, external table (e.g. Google Sheet) or anywhere else the data can be accessed from BigQuery (e.g. JSON files in a GCS bucket or even an external database).
There are two configurations required:
The flow input configuration (flow_input_config) defines each source table (source_table_ref), the source table partitioning column (source_table_partitioning) and the view ref (input_view_ref) of the entry point of the source data subset into the transformation flow. It is represented by an ARRAY of STRUCTS with the following schema:
ARRAY<STRUCT<source_table_ref STRING, source_table_partitioning STRING, input_view_ref STRING>>
DECLARE flow_input_config ARRAY<STRUCT<source_table_ref STRING, source_table_partitioning STRING, input_view_ref STRING>> DEFAULT [
  ('project_id.firebase_dataset.events_*', '_TABLE_SUFFIX', 'project_id.events_flow.001_inbound_events')
];
Note that as the example is a Firebase sharded table, in the configuration declaration we set the partitioning column to _TABLE_SUFFIX. For partitioned source tables we would simply use the partition column name, and for non-partitioned tables we pass an empty string ('').
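As a sketch of the other two cases, the input configuration for a date-partitioned source table and a non-partitioned source table might look as follows (the project, dataset, table and column names here are illustrative, not part of the library):

```sql
DECLARE flow_input_config ARRAY<STRUCT<source_table_ref STRING, source_table_partitioning STRING, input_view_ref STRING>> DEFAULT [
  -- Date-partitioned source table: pass the partition column name
  ('project_id.sales_dataset.transactions', 'transaction_date', 'project_id.sales_flow.001_inbound_transactions'),
  -- Non-partitioned source table: pass an empty string
  ('project_id.sales_dataset.products', '', 'project_id.sales_flow.002_inbound_products')
];
```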
The flow output configuration (flow_output_config) defines the target output tables (output_table_ref), their partitioning (output_table_partitioning) and the source table or view from which to feed each output table (output_source_ref). It is represented by an ARRAY of STRUCTS with the following schema:
ARRAY<STRUCT<output_table_ref STRING, output_table_partitioning STRING, output_source_ref STRING>>
DECLARE flow_output_config ARRAY<STRUCT<output_table_ref STRING, output_table_partitioning STRING, output_source_ref STRING>> DEFAULT [
  ('project_id.events_flow._bi_events_with_flags', 'event_date', 'project_id.events_flow.402_events_with_flags'),
  ('project_id.events_flow._bi_summary_by_week', '', 'project_id.events_flow.601_summary_by_week')
];
This output configuration will feed two tables which the downstream dashboards can use as data sources: one partitioned by event_date and one unpartitioned summary. Note that in this case all output tables are prefixed with _bi, a naming convention chosen to identify tables to which downstream Business Intelligence tools should connect.
Also note that to leave output_table_partitioning undefined, you need to pass an empty string ('') as the parameter value.
Flow Orchestrator Function
In order to keep management and operation as simple and consistent as possible, a single function is used to run the flow. It takes one integer parameter specifying the number of days, in addition to today, which are to be processed:
CALL `project_id.firebase_dataset.run_transformation_flow`(last_n_days_to_replace INT64)
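For example, assuming the function is deployed at the path shown above, the following call would process today plus the previous three days of data:

```sql
-- Reprocess today and the three preceding days
CALL `project_id.firebase_dataset.run_transformation_flow`(3);
```

Passing 0 would therefore process today's data only.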
Flow Orchestrator Logic
The flow orchestrator function run_transformation_flow executes the flow according to the following logical sequence: