Skip to content

Overview

What is Transformation Flow?


Summary

Transformation Flow is a lightweight, open source framework for planning, developing, implementing, documenting, managing and monitoring SQL-based in-warehouse data transformations.


Motivation

The ability to build data transformations in SQL and have the results immediately available to downstream tools, models and consumers is a modern data superpower, but with great power comes great responsibility.

Structure

Unfortunately the responsibility to write clean, clear, well-structured code is often overlooked, especially when the transformations begin life as ad-hoc data explorations. This means that the resulting flows can quickly grow into complicated interconnected networks of datasets, tables and views, which can be extremely difficult to manage and debug.

Simplicity

The objective of this framework is to enable rapid development of robust data transformations within the data warehouse, wherever you want to write your SQL, without requiring any additional tools or installation. It is designed to help humans convert their data transformation objectives into simple, well-structured code, which is quick to build and deploy, and easy to manage, monitor and debug.

Benefits

Aside from the obvious benefits of rapid, clear, readable, maintainable code, using this approach reduces the amount of code written by humans significantly, aims to eliminate potential copy-paste errors and ensures that the resulting code is structured to be easier to manage in the future. It also helps with step-by-step testing to ensure that the data transformation flow stages work as expected, with the aim of preventing 'needle in a haystack-sized2 mound of dirty spaghetti' debugging endeavours.

Functions

Common patterns are abstracted into powerful functions to transform human intent into sometimes verbose SQL structures. By leveraging the information schema1, user defined functions (UDF) and scripting capabilities, these new functions enable dynamic development of patterns which would typically require hard-coding of column-names into human-error-prone structures such as long case statements or column name string concatenations.

Management

In addition to the code becoming easier to reproduce and more portable (as you can simply apply the exact same function with a few argument changes to deploy transformations to different source data), this approach has built-in data lineage, and is structured to support testing, debugging and automatic documentation of transformation flows.

Approach

The framework is intended to simplify and accelerate the workflow for translating human intent into robust, maintainable SQL code, leveraging the power of functional encapsulation to build SQL in real time from an extensible set of functions and their parameters.

Workflow Simplification

Data workflows can be extremely complicated, requiring repeated context switching, multiple systems and large-scale data management, which impacts productivity by limiting the opportunity to get into a flow state for creative problem solving. By keeping the entire workflow inside one platform and mode of thinking, the objective is to maximise creative productivity.

Open Source Code

The code base is open source, with the source function code (including arguments and detailed descriptions) soon to be made available at transformationflow/flowfunctions, and currently useable by any authenticated BigQuery user in the public flowfunctions project.

Alternatives

This framework was born because the alternatives did not meet the requirements for our specific use-cases, where implementation of additional external tools would have added unneccesary complexity to a data transformation process which is completely manageable in pure SQL (plus additional innovstions in scripting, functions, and Data Definition Language).

Modern Tools

For more complex situations where hundreds or thousands of interdependant transformations need to be developed, deployed, orchestrated and managed, or collaboration is required across a large number of people or teams, then it might make more sense to use an external tool.

DBT

DBT is the industry standard tool for data transformation, leveraging Jinja templating inside SQL to enable analytics engineers to model data and deploy/manage data transformations. It is written in Python.

Dataform

Dataform is a BigQuery-specific platform for transformation collaboration, using Javascript and SQLX (a templated SQL variant).

Legacy Tools

There are also a number of legacy tools in this space, however they are typically tailored towards the enterprise and come with complex sales processes and high licence fees.


  1. the various INFORMATION_SCHEMA views contain metadata about datasets, columns, tables, views and other assets 

  2. in this context 'haystack-sized' is of course completely subjective, however in the data world I will spuriously define this as multi-gigabyte 

Back to top