
Superglue – Journey of Lineage, Data Observability & Data Pipelines


Democratizing data using self-serve data platform tools

Image via shutterstock

Data plays a critical role in business decisions, AI/ML, product evolution and much more. Timeliness, accuracy, and reliability are the key foundational data requirements for every organization. For a data-driven organization, it’s important to make data easily available for discovery, exploration, processing, governance, and consumption by users like data engineers, analysts, and data scientists. This requires significant investments in building platform tools that democratize data for data users.

Our journey to democratize data at Intuit started with two objectives:

1) reduce the amount of time users spend on building data pipelines (time-to-build)

2) reduce the amount of time users spend on detecting/resolving data issues (time-to-debug)

We built Superglue at Intuit to help users build, manage, and monitor data pipelines. There are four core aspects to Superglue that I’ll cover in this blog: Lineage, Observability, Pipelines, and Personalization. If you are a data leader, architect, or platform engineer, this blog will help you learn a pattern for building lineage at scale and how to monitor data and pipelines with the help of lineage. Let me provide some background before we dive deeper.


Petabytes of diverse data, thousands of jobs, and layers of dependencies

Intuit has petabytes of diverse data collected from its products, applications and third parties. Thousands of Hive, Spark and Massively Parallel Processing (MPP) jobs use these data sets every day to produce hundreds of reports that provide operational and business insights. Similarly, ML workflows use these data sets for feature engineering and model training.

Insights and features are generated through multiple ingestion, processing, and analytical layers, using frameworks that are owned and managed by multiple teams. For example, one of the key reports depends on data from 18 different sources that goes through 20+ levels of processing.

With such scale and complexity, when there are metric inaccuracies (as depicted below), identifying the root causes becomes extremely challenging.

Users would spend hours, and in some cases days, getting to the root cause of issues. And the causes of such data issues could be many.

Where do you start to look for failures when an issue is reported? You are looking at thousands of running jobs that are owned by hundreds of users. These jobs use thousands of tables that are processed through many frameworks owned by multiple teams. This is where Superglue’s self-serve debugging experience, built on the foundation of Lineage and Observability, comes in to help detect the root causes of data issues.


Lineage

To get to the root cause of failures that could be in upstream data pipelines, we had to first get visibility into end-to-end lineage. The enterprise scheduler gave us dependencies that users specified when scheduling their jobs. Interestingly, 90% of analytical jobs were scheduled without any job dependencies. Because of this, upstream delays or failures did not prevent downstream jobs from running; these jobs went ahead anyway and caused operational issues, metric inconsistencies, and data holes.

We therefore decided to build lineage tracking based on source code in Git for the data processing and data movement frameworks running Hive, Spark, and MPP workloads. We use open-source and custom SQL parsers to derive relationships between jobs, scripts, and input/output tables. Similar parsing is done for BI reports and homegrown data movement frameworks to find the associated tables.
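To make the idea concrete, here is a toy sketch of extracting input and output tables from a SQL script. This is not Superglue’s actual parser (which relies on open-source and custom SQL parsers rather than regular expressions), and the table names are hypothetical; it only illustrates how each script can be linked to the tables it reads and writes.

```python
import re

# Toy illustration only: real scripts need a full SQL parser, not regexes.
OUTPUT_RE = re.compile(r"INSERT\s+(?:OVERWRITE|INTO)\s+TABLE\s+([\w.]+)", re.IGNORECASE)
INPUT_RE = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def extract_lineage(sql_script: str) -> dict:
    """Return the tables a script writes to and reads from."""
    outputs = set(OUTPUT_RE.findall(sql_script))
    # Anything referenced in FROM/JOIN that is not also an output is an input.
    inputs = set(INPUT_RE.findall(sql_script)) - outputs
    return {"inputs": sorted(inputs), "outputs": sorted(outputs)}

script = """
INSERT OVERWRITE TABLE analytics.daily_revenue
SELECT o.order_date, SUM(o.amount)
FROM raw.orders o
JOIN raw.payments p ON o.order_id = p.order_id
GROUP BY o.order_date;
"""

print(extract_lineage(script))
# {'inputs': ['raw.orders', 'raw.payments'], 'outputs': ['analytics.daily_revenue']}
```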

Using this metadata, we "glue" the end-to-end lineage, which includes three key entity types: jobs, tables, and reports (a.k.a. dashboards). Users can search for these entities and land on their lineage view.

Users can search for tables, jobs or reports

Here is an example of table lineage for the job selected from the search page. Jobs are represented as ovals and tables as rectangles. The color of a job indicates whether it has failed (red), completed successfully (green), or is active (light green).

Table lineage

Here is the scheduler lineage for the same job, based on the user-specified job dependencies in the enterprise scheduler.

Scheduler lineage

And here is an example of lineage for a report, represented as a circle.

Report lineage

Dependency Recommendation

With table lineage based on source code and job lineage based on dependencies specified in the enterprise scheduler, we have visibility into which tables feed into which jobs, which tables are produced by which jobs and which jobs depend on which other jobs. This helped us build dependency recommendation as a feature in Superglue. Using this feature, we are able to pinpoint job dependencies that are missing. It’s like saying – "This job depends on these two tables which are created from these two other jobs, but you haven’t specified these two other jobs as dependencies. Please add them as dependencies in the scheduler."
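The recommendation logic can be thought of as a set difference: the jobs that produce a job’s input tables (from code-derived table lineage) minus the jobs already declared as dependencies in the scheduler. A minimal sketch, with hypothetical job and table names:

```python
def recommend_dependencies(job_inputs, table_producers, declared_deps):
    """For each job, flag upstream jobs that produce its input tables
    but are not declared as dependencies in the scheduler."""
    recommendations = {}
    for job, inputs in job_inputs.items():
        producers = {table_producers[t] for t in inputs if t in table_producers}
        missing = producers - declared_deps.get(job, set()) - {job}
        if missing:
            recommendations[job] = sorted(missing)
    return recommendations

# Hypothetical lineage metadata
job_inputs = {"build_revenue_report": {"analytics.daily_revenue", "analytics.refunds"}}
table_producers = {"analytics.daily_revenue": "load_daily_revenue",
                   "analytics.refunds": "load_refunds"}
declared_deps = {"build_revenue_report": {"load_daily_revenue"}}

print(recommend_dependencies(job_inputs, table_producers, declared_deps))
# {'build_revenue_report': ['load_refunds']}
```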

Lineage APIs

Along with backward lineage, we also made forward lineage available. Forward lineage helps with use cases that need to assess the impact of source and schema changes to downstream pipelines and/or reports. Lineage APIs enabled engineering automation to detect the impact of such changes.
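Conceptually, forward lineage is a graph traversal from a changed source to everything downstream of it. The sketch below is not the actual Superglue API; the entity names and adjacency map are hypothetical, and it simply walks a producer-to-consumer graph to list impacted downstream entities.

```python
from collections import deque

def downstream_impact(entity, downstream_edges):
    """Breadth-first walk of the lineage graph from a changed entity
    to every table, job, or report that depends on it."""
    impacted, queue = set(), deque([entity])
    while queue:
        current = queue.popleft()
        for child in downstream_edges.get(current, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# Hypothetical lineage graph: entity -> entities that consume it
downstream_edges = {
    "raw.orders": ["load_daily_revenue"],
    "load_daily_revenue": ["analytics.daily_revenue"],
    "analytics.daily_revenue": ["build_revenue_report"],
    "build_revenue_report": ["revenue_dashboard"],
}

print(downstream_impact("raw.orders", downstream_edges))
```

Backward lineage is the same traversal over the reversed edges, which is what makes root-cause analysis from a broken report back to its raw sources tractable.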

Lineage APIs and data quality frameworks also played a key role when we moved thousands of analytical pipelines, tables and reports to the public cloud. Using forward lineage APIs, we were able to detect which pipelines and reports could be tested when the raw source data was ready in the cloud. Similarly, when we found metric issues in the cloud reports during migration, we could use backward lineage APIs to identify sources of data issues in raw tables.


Data Observability

Our next step was to build a debugging experience for users, with the objective of reducing mean-time-to-detect and mean-time-to-restore for data issues from hours to minutes (the time-to-debug metric). With lineage as the backbone, we overlaid the following features to enable Data Observability:

  • Job execution stats and logs: Integration with the scheduler to capture start time, end time, run time, execution attempts, failures, logs and job dependencies
  • Table stats: Integration with custom data ingestion frameworks and Massively Parallel Processing (MPP) platforms to capture row counts and table sizes. In some cases, we were able to tap into MPP system tables to get rich table/column profiling stats.
  • Report stats: Integration with Business Intelligence (BI) tools to capture report SQLs, execution stats and refresh logs
  • Change tracking details: Integration with Git to capture changes to Hive, Spark and MPP jobs

These features made Superglue a single platform for lineage and debugging. Here is an example of the job details page that appears when clicking on a job on the lineage canvas.

Job stats page

Similar details/views are made available for tables as well as reports.

Anomalies

With job stats and table stats available, we were able to add anomaly detection that proactively flags anomalous job runs (example below) and data changes in tables. We also introduced alert subscriptions/notifications for anomalies, along with alerts for failed and delayed runs (more details in the personalization section below).
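As an illustration of the idea (Superglue’s actual detection logic isn’t described here), one simple way to flag an anomalous job run is to compare the latest run time against a rolling baseline of recent runs; the job history and threshold below are hypothetical.

```python
from statistics import mean, stdev

def is_anomalous_run(history_minutes, latest_minutes, threshold=3.0):
    """Flag the latest run if it deviates from the recent baseline
    by more than `threshold` standard deviations (z-score)."""
    if len(history_minutes) < 5:
        return False  # not enough history to establish a baseline
    baseline, spread = mean(history_minutes), stdev(history_minutes)
    if spread == 0:
        return latest_minutes != baseline
    return abs(latest_minutes - baseline) / spread > threshold

# Hypothetical run times (in minutes) for a daily job
history = [42, 45, 44, 43, 46, 44, 45]
print(is_anomalous_run(history, latest_minutes=95))  # True  - likely anomaly
print(is_anomalous_run(history, latest_minutes=47))  # False - within normal range
```

The same pattern applies to table stats, such as flagging a sudden drop in daily row counts.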

Job ETA Service

Another important feature we introduced was the ETA Service. With lineage and job execution stats (start time, end time, duration), we could proactively estimate the ETA for SLA-bound jobs. The ETA Service was integrated with Slack to provide frequent ETAs for highly critical pipelines (example below).
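A simplified way to think about the ETA computation (a sketch under assumed inputs, not the actual service): walk the job’s upstream dependency chain, take the longest remaining path of average historical run times, and add it to the current time. The job names and durations below are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def estimate_eta(job, upstream, avg_minutes, completed, now=None):
    """Estimate completion time for an SLA-bound job by following the
    longest chain of not-yet-completed upstream jobs (by average duration)."""
    now = now or datetime.now(timezone.utc)

    def remaining(j):
        if j in completed:
            return 0.0
        upstream_wait = max((remaining(u) for u in upstream.get(j, [])), default=0.0)
        return upstream_wait + avg_minutes[j]

    return now + timedelta(minutes=remaining(job))

# Hypothetical dependency chain and average run times
upstream = {"build_revenue_report": ["load_daily_revenue"],
            "load_daily_revenue": ["ingest_orders"]}
avg_minutes = {"ingest_orders": 30, "load_daily_revenue": 45, "build_revenue_report": 20}
completed = {"ingest_orders"}

print(estimate_eta("build_revenue_report", upstream, avg_minutes, completed))
# current time + 65 minutes (45 for load_daily_revenue + 20 for build_revenue_report)
```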


Data Pipelines

Having improved the time-to-debug metric, our next goal was to reduce the time users spent building and testing pipelines from hours to minutes (the time-to-build metric). Our objectives for enabling data pipelines were multi-fold:

  • Build a simplified user experience
  • Provide out-of-the-box lineage and data observability
  • Enable data processing on Spark runtimes through QuickETL, a homegrown configuration-driven framework to define and execute Spark ETL workflows. It provides dependency chaining, query-level metrics, monitoring, and the ability to apply circuit breakers before and after each step of the workflow. A rich pipeline definition grammar enforces a quick, consistent way to define transformation logic and dependencies (a hypothetical configuration sketch follows this list).
  • Abstract away data engineering complexities from the hands of data analysts. This included setting up sandbox environments to enable pipeline testability.
  • Enable data movement from Data Lake to MPP platforms that are used as data serving layers for BI
  • Integrate with BI tools to enable report refresh as part of the pipeline
  • Improve user experience for scheduling and orchestration through integrations with the enterprise scheduler
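To make the "configuration-driven" idea concrete, here is a hypothetical pipeline definition and a minimal executor that runs steps in dependency order and applies a circuit-breaker check after each step. This is not QuickETL’s actual grammar or API; the step names, SQL, and breaker expressions are illustrative only.

```python
# Hypothetical pipeline definition; QuickETL's real grammar is not shown here.
pipeline = {
    "name": "daily_revenue_pipeline",
    "steps": [
        {"name": "load_orders", "type": "spark_sql",
         "sql": "INSERT OVERWRITE TABLE stage.orders SELECT * FROM raw.orders",
         "depends_on": [], "circuit_breaker": "row_count > 0"},
        {"name": "build_revenue", "type": "spark_sql",
         "sql": ("INSERT OVERWRITE TABLE analytics.daily_revenue "
                 "SELECT order_date, SUM(amount) FROM stage.orders GROUP BY order_date"),
         "depends_on": ["load_orders"], "circuit_breaker": "row_count > 0"},
        {"name": "refresh_dashboard", "type": "report_refresh",
         "report": "revenue_dashboard", "depends_on": ["build_revenue"]},
    ],
}

def run_pipeline(pipeline, execute_step, check_breaker):
    """Run steps once all of their dependencies have completed,
    stopping the pipeline if a circuit breaker fails."""
    done = set()
    pending = list(pipeline["steps"])
    while pending:
        ready = [s for s in pending if set(s["depends_on"]) <= done]
        if not ready:
            raise RuntimeError("Cyclic or unsatisfiable dependencies")
        for step in ready:
            execute_step(step)
            if "circuit_breaker" in step and not check_breaker(step):
                raise RuntimeError(f"Circuit breaker tripped after {step['name']}")
            done.add(step["name"])
            pending.remove(step)

# Dry run with stubbed execution and breaker checks
run_pipeline(pipeline,
             execute_step=lambda s: print(f"running {s['name']}"),
             check_breaker=lambda s: True)
```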

These capabilities enabled users to build data pipelines that run Hive and Spark data processing jobs, move data to MPP platforms, and refresh BI reports, as well as to test these pipelines in sandbox environments and schedule them on the enterprise scheduler. Once the pipelines were built and scheduled, users relied on Superglue’s data observability capabilities to manage their daily data operations.

Here are a few snapshots of the pipeline development flow:

The pipeline steps above are shown as an example; in a real-life pipeline you could imagine multiple data processing, data movement, and/or report refresh steps. The image below shows the created pipeline with controls to edit pipelines, rerun pipelines, view execution logs, and see lineage.

My Pipelines page

Personalization

Enabling pipelines along with lineage and observability transformed Superglue into a platform tool to build, manage, and monitor analytical pipelines.

With 250+ users with varied journey maps and personas, it became important to provide a personalized experience. In fact, personalization was driven by inputs from our user base. We wanted to make it easy for users to land on Superglue and start their journey with the things that mattered most to them. Here are a few features we introduced:

  • Enabled SSO authentication
  • Added Personalized views – My Reports, My Pipelines, My Jobs and My [Alert] Subscriptions
  • Enabled Org hierarchy using data from HR Systems and introduced org level artifacts and metrics
  • Enabled pipeline shareability & transferability
  • Added easy alert subscriptions for failures, delays and anomalies (including subscriptions to assets owned by others)
  • Added Self-help with video tutorials, FAQs and onboarding pages in the product

Here is an example snapshot of the "My Reports" landing page:

My Reports page

Open Source

A portion of Superglue that includes the SQL parser, table lineage, and UI/search capabilities has been open sourced at https://github.com/intuit/superglue. It will give you a head start in deriving, persisting, and visualizing lineage. If you are interested in contributing to Superglue, please check out our contribution guidelines.


What’s Next

We have been working on Superglue for the last couple of years and our journey to democratize data across business units and functional groups at Intuit continues. We are working towards enabling Superglue for advanced users by adding features around testability, debuggability and support for multiple processing runtimes.


Team

The team that made this wonderful journey possible – Anand Elluru, Shradha Ambekar, Sooji Son, Yang Zhou, Rama Arvabhumi, Veena Bitra and Sunil Goplani, with support from Intuit Data Engineering leadership.

