Data Fusion: A Code-Free Pipeline for Your Google Cloud Data Warehouse

Set up an enterprise-grade end-to-end data platform in days rather than months without technical skills

Guillaume Dupont
Towards Data Science

--

Despite the plethora of cloud-based big data platforms and applications released over the past five to ten years, creating a data warehouse generally remains a difficult undertaking that requires specific technical skills to design, set up, and operate daily. By adding Data Fusion to its portfolio of SaaS solutions, Google Cloud aims to overcome these challenges and give every user the power to create and manage an enterprise-grade data platform.

This is timely: for many companies, Covid-19 has not only wreaked havoc on existing data sources, business intelligence operations, and dashboards, but also created additional pressure to put data to use quickly. At the same time, economic uncertainty has slashed budgets and resources. Simpler solutions accessible to almost everyone are needed today more than ever, and thanks to its intuitive drag-and-drop user interface, Data Fusion could be of great help to many organizations.

Photo by Sigmund on Unsplash

Use case

There is still little public or community content available about Data Fusion, even though it has been generally available on Google Cloud Platform for more than a year. Although the product is still in its infancy, I want to share my impressions and explain how it kept its promise. That is,

How Data Fusion made it possible to set up an end-to-end data warehouse in days rather than months,

while satisfying the following requirements:

Photo by Victor Garcia on Unsplash

1. Ingest a few hundred gigabytes of compressed data produced by a cluster of single-tenant servers every day;

2. Produce PDF reports for each entity on a regular basis;

3. Provide a business intelligence tool to some team members who can modify existing dashboards, create new ones, and query and visualize the latest data through a simple user interface;

4. Keep the door open for machine learning in the future.

It also satisfied aggressive constraints for a project of that nature, including

5. A consumption-based cost model with zero capex and a low opex;

6. As much SaaS and self-service as possible; and

7. A minimal amount of coding or highly technical operations.

Example Architecture

Data Fusion essentially coordinates all pipeline steps, from the ingestion of data from third parties down to loading transformed data into analytics databases. Data Studio then builds business intelligence reports from this transformed data.

Architecture diagram of the data platform leveraging Data Fusion, Cloud Storage, BigQuery, and Data Studio. Diagram by author.

Note that Data Fusion integrates seamlessly with most of Google’s portfolio, so several solutions mentioned here could be replaced by other Google Cloud products. For instance, this use case uses Cloud Storage as a data lake and BigQuery as a data warehouse, but many alternatives exist. In this example, the pipeline goes through the following steps:

1. Ingestion into Google Cloud Storage

Blob storage is the obvious candidate for a data lake storing raw and unstructured data, and this is where the ingestion step stores the data first. Data Fusion offers connectors to most data sources, including storage accounts, other public clouds, FTP servers, databases such as Oracle, and more. There are also many ways to send raw files to Google Cloud Storage, from highly automated solutions and transfer utilities down to more manual approaches relying on virtual machines.
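As one illustration of the more scripted route, the official Node.js client can push daily export files into the landing bucket that Data Fusion later reads from. This is only a sketch: the bucket name, folder layout, and local file path are illustrative placeholders, not part of the original setup.

```javascript
// Minimal sketch: push a compressed export file into the Cloud Storage landing
// bucket that the Data Fusion pipeline reads from. Bucket, folder, and file
// names are illustrative placeholders.
const { Storage } = require('@google-cloud/storage');

const storage = new Storage(); // relies on Application Default Credentials

async function uploadRawFile(localPath) {
  const fileName = localPath.split('/').pop();
  await storage.bucket('my-data-lake-raw').upload(localPath, {
    // Partition the landing zone by ingestion date
    destination: `incoming/${new Date().toISOString().slice(0, 10)}/${fileName}`,
  });
  console.log(`Uploaded ${fileName}`);
}

uploadRawFile('/exports/server-01/2021-03-01.csv.gz').catch(console.error);
```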

2. Data Fusion workflow

Data Fusion comes with an easy-to-use drag-and-drop user interface for creating data pipelines, called Integrate. Integrate lets users create pipelines with three types of elements:

  1. Data sources - such as Google Cloud Storage
  2. Transformations, which sit between inputs and outputs
  3. Sinks, which are outputs - such as BigQuery, Google’s well-known SaaS data warehouse solution dedicated to analytics workloads
Basic workflow of Data Fusion: Data input (source) - Transformation - Data output (sink). Screenshot by author.

Every pipeline must contain at least a source and a sink (and Data Fusion does not allow “empty” arrows).

Example

Our first ETL pipeline is split into two paths after reading all matching file names from Google Cloud Storage:

The first one processes the content of the CSV file with

  • A first line-by-line transformation written in JavaScript,
  • A “wrangler” to parse the CSV file,
  • A sink loading the result into BigQuery

The second one counts the number of rows using

  • A Group By and Count to count the number of records in the file,
  • A JavaScript transformation to add timestamps,
  • A sink loading the counts into BigQuery,
  • And finally a file move to archive the file in a different Google Cloud Storage directory after it has been successfully loaded.
Example workflow for Data Fusion receiving data from Cloud Storage and inserting transformed data into BigQuery. Screenshot by author.

Data Fusion relies on strongly typed schemas throughout, so that every transformation receives a predictable input and produces a predictable output. For example, the Google Cloud Storage reader passes two attributes to the next transformation for each line it reads: a string “body” containing the content of the line, and a string “path” containing the file path. An XML reader is also available when input files need to be read record by record instead of line by line.
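To give a sense of what a line-by-line JavaScript transformation works with, here is a minimal sketch of a transform consuming those “body” and “path” attributes. The header prefix and the output field names are assumptions for illustration and would have to match the output schema defined for the transform.

```javascript
// Sketch of a line-by-line JavaScript transform in Data Fusion. It consumes the
// "body" and "path" attributes set by the Cloud Storage reader, drops header
// lines, and tags each record with its source file name. The header prefix and
// the output field names ("line", "filename") are assumptions.
function transform(input, emitter, context) {
  // Skip CSV header lines so they never reach the wrangler downstream
  if (input.body.indexOf('timestamp,server_id,') === 0) {
    return;
  }
  emitter.emit({
    'line': input.body,                      // raw CSV line, parsed by the wrangler later
    'filename': input.path.split('/').pop()  // keep only the file name, not the full path
  });
}
```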

Transformations

The most powerful tools Data Fusion provides are the data transformations, and among these, the wrangler.

The wrangler lets users parse flat files through an intuitive graphical user interface. Starting from a sample raw data file loaded line by line, users can parse columns, rename and change the type of variables, add or delete columns, extract and explode data, and so on. All operations are tracked in a list of transformations called a “recipe”.

Data Fusion wrangler used to define a recipe made of 14 transformation steps on a flat file. Screenshot by author.

Once the wrangling steps have been defined, the list of instructions in the recipe is applied by the corresponding transformation in the pipeline. When executed, the transformation converts lines matching its input schema into the output schema. Data Fusion does not always update the output schema of the wrangler transformation directly, but both recipes and schemas can be exported as JSON files and imported elsewhere in Data Fusion. The recipe also shares the same language as Google Dataprep, a slightly more powerful tool that requires a separate license.

Data Fusion provides many other transformations that will be indispensable for more complex pipelines: normalizing/denormalizing, XML parsing, and so on. While not always as generic as their Airflow equivalents, Data Fusion transformations provide comprehensive functional coverage. One transformation that will inevitably prove useful is the JavaScript one. As Data Fusion is still a relatively new solution, at one point or another it will lack a feature, and this is where the JavaScript transformation comes in handy. Need to add timestamps but there is no transformation to manage timestamps or file names? Not a problem, a three-line script can take care of this.
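As an illustration, the timestamp case mentioned above can be handled with a transform along these lines (a sketch only; the “ingested_at” field name is illustrative and must be declared in the output schema):

```javascript
// Sketch: pass each record through unchanged and add an ingestion timestamp.
// The "ingested_at" field is an assumption and must exist in the output schema.
function transform(input, emitter, context) {
  input.ingested_at = new Date().toISOString();
  emitter.emit(input);
}
```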

3. BigQuery

Outputs of the transformations can be sent to various data stores and databases, among which BigQuery, which will be the foundation of the data warehouse. The Data Fusion BigQuery sink turns data produced by the transformations into rows appended to BigQuery tables. Note that BigQuery does not require an explicitly defined schema when a table is created; in that case, the schema is recorded when the first rows are inserted. While convenient, this can make schema changes between Data Fusion and BigQuery difficult to manage, especially in the early stages of building the data warehouse.
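One way to reduce this friction, sketched below under assumed dataset, table, and field names, is to create the destination table with an explicit schema before the first pipeline run, for example with the BigQuery Node.js client.

```javascript
// Sketch: create the destination table with an explicit schema before the first
// pipeline run, so that Data Fusion and BigQuery agree on column types from the
// start. Dataset, table, and field names are illustrative.
const { BigQuery } = require('@google-cloud/bigquery');

const bigquery = new BigQuery();

async function createServerLogsTable() {
  await bigquery.dataset('warehouse').createTable('server_logs', {
    schema: {
      fields: [
        { name: 'event_time', type: 'TIMESTAMP', mode: 'REQUIRED' },
        { name: 'server_id', type: 'STRING', mode: 'REQUIRED' },
        { name: 'message', type: 'STRING', mode: 'NULLABLE' },
        { name: 'filename', type: 'STRING', mode: 'NULLABLE' },
      ],
    },
    // Daily partitioning keeps query costs predictable as the table grows
    timePartitioning: { type: 'DAY', field: 'event_time' },
  });
}

createServerLogsTable().catch(console.error);
```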

BigQuery data warehouse to store transformed data. Screenshot by author.

Data Fusion can in turn read from BigQuery to transform the data further and create more complex pipelines. An alternative for relatively simple transformations is BigQuery scheduled queries: these run at scheduled times, select data from given tables, and write the output to new tables.
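The sketch below shows the kind of SQL such a scheduled query could run, aggregating daily row counts per server into a small reporting table. In practice this statement would be configured as a scheduled query in the BigQuery console; here it is simply submitted through the Node.js client, and all table and column names are illustrative.

```javascript
// Sketch of a scheduled-query-style aggregation: rebuild a daily_counts table
// from the raw server_logs table. Table and column names are illustrative.
const { BigQuery } = require('@google-cloud/bigquery');

const bigquery = new BigQuery();

const sql = `
  CREATE OR REPLACE TABLE warehouse.daily_counts AS
  SELECT
    DATE(event_time) AS day,
    server_id,
    COUNT(*) AS row_count
  FROM warehouse.server_logs
  GROUP BY day, server_id
`;

bigquery
  .query({ query: sql })
  .then(() => console.log('daily_counts refreshed'))
  .catch(console.error);
```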

4. Data Studio

Data Studio is Google’s SaaS data visualization tool. It is free and tightly integrated with BigQuery, so that

  • Visualizations can be created and updated instantly with data retrieved from BigQuery tables using SQL
  • Conversely, query outputs from the BigQuery GUI can be instantly exported and visualized in Data Studio

Data Studio expects relatively structured input data and provides fewer filtering and aggregation capabilities than competing tools like Power BI, so having the right SQL scripts and, in some cases, preprocessed tables is important. Once this point is understood, Data Studio provides an extremely efficient framework to almost instantly retrieve, edit, publish, and export reports to PDF. In addition, report parameters can be referenced directly within the SQL queries, making it easy to produce reports for a variety of configurations or for each entity.

Data Studio queries the BigQuery data warehouse and builds business intelligence reports for users. Screenshot by author.

Conclusion

A code-free data pipeline: Data Fusion keeps its promise

By providing a graphical user interface that lets non-technical users create data pipelines in a matter of days, Data Fusion elegantly handles a business problem that would otherwise require a much more complex tech stack*. The Google Cloud solutions mentioned in this article integrate almost seamlessly with each other, making it easy for any user to combine the different components without technical configuration or a single line of code.

Data Fusion is a relatively new solution with areas for improvement

One exception, though, is the first few days on Data Fusion, which unfortunately bring their share of cryptic errors and usability glitches. It starts right after the first execution: Data Fusion’s default resource requirements exceed Google Cloud trial account quotas, and this can only be understood after digging deep into log files. Perhaps more importantly, most user inputs are not validated in the GUI but only during pipeline preview. Basic errors such as typos will most often result in vague messages reported only in fine-level logs. Google must improve this quickly to ensure new users do not get frustrated, especially as competing solutions on Azure and AWS tend to handle this better. Finally, maintaining separate environments and versions is also difficult: without dedicated features in Data Fusion, most of these tasks must be handled manually using exports and imports. The same goes for governance and access control: users are free to change things as they wish, but this may come at the cost of stability and complexity.

Data Fusion looks promising for many companies

The list price for the Data Fusion Enterprise edition is about USD 3,000 per month, in addition to Dataproc (Hadoop) costs charged for each pipeline execution. It is unclear how many customers are using Data Fusion yet, but it addresses a genuine business problem that many companies face and should therefore have a promising future.

*[Airflow (Cloud Composer), Hadoop (Dataproc), etc.]

--


Principal Architect in the airline industry. Digital Transformation consultant. Microsoft Azure certified architect.