Companion GitHub repository: Hands-on Great Expectations with Spark
Introduction
At Mediaset, the Data Lake is a fundamental tool used daily by everyone who wants to gain company insights or activate data.
By definition, a Data Lake is "a centralized repository to store all your structured and unstructured data at any scale. You can store data natively from the source without having to transform them on the run" (AWS). This allows you to keep a vast amount of raw data that you can activate later with different types of analytics.
The number of Mediaset employees who access the Data Lake is rising, and so is the number of products and the amount of data we ingest and persist. As users and the volume of archived data increase, the complexity of managing the Data Lake grows.
This is a critical point: if Data Quality and Data Governance systems are not implemented, the Data Lake can turn into a Data Swamp, "a data store without organization and precise metadata to make retrieval easy" (Integrate.io).
Data Swamps can form quickly and create problems for data-driven companies that want to implement advanced analytics solutions. If, instead, data is closely governed and its health status is checked constantly, the Data Lake has the potential to give the company an accurate and game-changing business insights tool.
Having full control over the data persisted in the Data Lake, and knowing what to expect from it, therefore becomes more critical every day in order to prevent it from turning into a Data Swamp.
For this purpose, we implemented a Data Quality Workflow to support different business units in the demanding task of data validation, keeping track of data health, drifts, and anomalies. It helps people constantly evaluate both ingested and transformed data, enhancing data trustworthiness and the general quality of data-driven products.
The workflow is built around the open-source framework [Great Expectations](https://greatexpectations.io/) (GE), which defines itself as "a shared, open standard for data quality" that helps data teams eliminate pipeline debt through data testing and documentation, so they always know what to expect from new data. GE is a Python package that enables us to write tests and evaluate data quality. Its simplicity and customizability allowed us to easily integrate it with our ETL pipelines to verify that what we expect from both input and output data is satisfied.
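As a minimal illustration of what an Expectation looks like (the DataFrame, column names, and the legacy `SparkDFDataset` API used here for brevity are illustrative, not Mediaset's actual suites):

```python
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset  # legacy dataset API, used here for brevity

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("e1", "play", 120), ("e2", "pause", 35), ("e3", "play", -4)],
    ["event_id", "action", "watch_time_seconds"],
)

ge_df = SparkDFDataset(df)

# Each call declares a rule and immediately evaluates it against the DataFrame.
print(ge_df.expect_column_values_to_not_be_null("event_id").success)                         # True
print(ge_df.expect_column_values_to_be_in_set("action", ["play", "pause", "stop"]).success)  # True
print(ge_df.expect_column_values_to_be_between("watch_time_seconds", 0, 86400).success)      # False: -4 is out of range
```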
In this article, we present the architecture of the Data Quality Workflow built at Mediaset, the critical points that led us to build it, and the entire technology stack involved.
Mediaset’s Data Lake: a practical use case
Mediaset is the leading private TV publisher in Italy and Spain, a true audience leader with five public networks and more than 30 free and paid theme channels.
Every day, millions of people watch videos on demand and live streams, read articles, and listen to radio shows offered by Mediaset services. People interact with Mediaset properties from different devices: smartphones, web browsers, smart TVs, and internet TVs. All those video views, page views, and clicks are ingested by Mediaset systems to understand what our clients (those who consented to be profiled) love to watch, read, and listen to, and to enhance the general quality of the products.
In recent years, Mediaset has invested resources in designing and building a data lake to store all the data generated by users' interactions with their devices across the several platforms and services the company offers.
To frame the general environment and the orders of magnitude Mediaset works with, here is some summary information provided by Mediaset Business Digital (MBD), the Business Unit in charge of client development, data ingestion, and the first layer of data transformation:
- Mediaset data lake stores ~100 gigabytes of native data every day, generated by the clients and their interactions with Mediaset properties and platforms (streaming platforms, news websites, blogs, and radios);
- several types of data are ingested: click-streams, page views, video views, video player interactions, marketing campaign results, help desk tickets, social media feedback, and many others;
- the data lake stores both raw data ingested from the clients and the output of the ETL pipelines, used by business and data analysts for data exploration tasks and by data scientists for machine learning model training;
- from day one, when a new platform, service, or property comes out, its data is stored in the data lake and activated later, once business requirements are precise.
Big data, big responsibilities
With the evolution of the data lake, the increasing volume of data stored, and the number of people who daily query and activate the data, the necessity to keep the data lake health status under control began to arise.
More specifically, we encountered four critical points:
- Verify that the published clients (mobile apps, smart TV apps, and websites) correctly track all the events (clicks, views, and actions), and therefore that all the data stored in the data lake are correct and trustworthy. If not, the development team must be notified of any discovered issue so that it can be resolved as soon as possible.
- Extend the ETL pipelines' unit tests, checking that the implemented transformations return the expected output.
- Monitor that all the pipelines and services are up and running, tracking daily data volume variations.
- Detect data drift or data degradation in datasets involved in machine learning models, preventing wrong forecasts or misleading predictions.
These points brought the Mediaset Business Digital unit to develop a Data Quality workflow capable of monitoring and controlling the data lake health status.
The primary tool involved in this process is Great Expectations (GE), which, thanks to its support for Spark as an Execution Engine, fits the current Mediaset data stack perfectly. GE is also an open-source project with strong community support, designed to be extensible and fully customizable.
Data Quality workflow overview
Note: this is not an "Introduction to Great Expectations" tutorial but an overview of the architecture we implemented to monitor the Data Lake Health Status and of how we use GE to accomplish this task. Check the companion repository if you are looking for a Great Expectations hands-on using Spark as the Execution Engine, with practical examples of how to implement Custom Expectations.
The workflow has been developed and integrated with the current Mediaset data stack based on Amazon Web Services (AWS). The primary services involved are:
- Amazon S3: "an object storage service that offers industry-leading scalability, data availability, security, and performance" (AWS). This is the central Data Lake storage, where all the data ingested and transformed are stored.
- AWS Glue Data Catalog: "provides a unified metadata repository across a variety of data sources and data formats. It provides out-of-the-box integration with Amazon S3, Amazon Athena and Amazon EMR" (AWS).
- Amazon EMR: "a managed cluster platform that simplifies running big data frameworks, such as Apache Spark, on AWS to process and analyze vast amounts of data" (AWS). Amazon EMR represents the leading AWS service for the ETL jobs adopted by the MBD unit; this, combined with Airflow as Data Orchestrator, allowed us to develop and schedule daily transformation routines.
- Amazon Athena: "an interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL" (AWS).
The implemented Data Quality workflow (see the architecture presented above) comprises five steps.
1. Data Quality Suite Development
Everything starts here, in the local development environment, which allows you to develop what GE calls an Expectation Suite: a set of rules (Expectations) that describe what you expect from your dataset. The environment consists of a Docker image that runs a Jupyter notebook instance with PySpark and all the Python packages you need. Once you get a subset of the data you want to evaluate (raw or processed), you are ready to code in the notebook all the Expectations your data have to satisfy and to generate the Expectation Suite as a JSON file.
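A minimal sketch of this step, assuming the legacy `SparkDFDataset` API for compactness (the path, column names, and serialization helper may differ slightly across GE versions and from the companion repository's code):

```python
import json

from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.getOrCreate()

# Subset of data (raw or processed) to evaluate while designing the rules; the path is hypothetical.
sample_df = spark.read.parquet("s3://datalake/raw/page_views/dt=2021-10-01/")
ge_df = SparkDFDataset(sample_df)

# Encode what we expect from the dataset.
ge_df.expect_table_columns_to_match_ordered_list(
    ["event_id", "user_id", "page_url", "event_timestamp"]
)
ge_df.expect_column_values_to_not_be_null("event_id")
ge_df.expect_column_values_to_match_regex("page_url", r"^https?://")

# Export the rules as an Expectation Suite JSON file; failed expectations are kept
# so they can be reviewed and refined in the notebook.
suite = ge_df.get_expectation_suite(discard_failed_expectations=False)
with open("page_views_suite.json", "w") as f:
    json.dump(suite.to_json_dict(), f, indent=2)
```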
The Docker-based development environment also supports you during Custom Expectations development: you can code and test all the expectations in your favorite IDE, using the Docker image as a remote Python interpreter.
The possibility to develop Custom Expectations is one of the most important features GE provides for our workflow. These expectations allowed us to test even the most complicated multicolumn conditions, granting meticulous data validation.
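As a sketch of what a single-column Custom Expectation can look like with GE's modular API and Spark as the Execution Engine (the names are invented and import paths vary slightly across GE versions; see the companion repository for the actual, tested implementations):

```python
from great_expectations.execution_engine import SparkDFExecutionEngine
from great_expectations.expectations.expectation import ColumnMapExpectation
from great_expectations.expectations.metrics import (
    ColumnMapMetricProvider,
    column_condition_partial,
)


class ColumnValuesAreValidWatchTime(ColumnMapMetricProvider):
    """Metric: a row-level condition marking which values are considered valid."""

    condition_metric_name = "column_values.valid_watch_time"

    @column_condition_partial(engine=SparkDFExecutionEngine)
    def _spark(cls, column, **kwargs):
        # Returns a boolean pyspark Column: True where the watch time is plausible.
        return (column >= 0) & (column <= 86400)


class ExpectColumnValuesToBeValidWatchTime(ColumnMapExpectation):
    """Expectation that wires the metric above into the GE validation machinery."""

    map_metric = "column_values.valid_watch_time"
    success_keys = ("mostly",)
```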
Bonus: check the repository for the code of three Custom Expectation types: single-column, column-pair, and multicolumn. It also includes unit tests and the steps to run them with the provided Docker image using PyCharm.
2. Suite Deployment
Once the Suite is ready (documented and tested), a commit to the master branch of the git repository triggers a Continuous Integration pipeline that copies the compiled Expectation Suites and all the Custom Expectations Python modules to an S3 bucket. The CI pipeline is also responsible for generating up-to-date GE Data Documentation and storing it in a dedicated S3 bucket.
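A rough sketch of what the deployment step does (the CI tool itself is not specified in this article; the bucket name and folder layout below are hypothetical):

```python
"""Upload compiled Expectation Suites and Custom Expectation modules to S3 (illustrative)."""
from pathlib import Path

import boto3

ARTIFACTS_BUCKET = "mediaset-dq-artifacts"  # hypothetical bucket name

s3 = boto3.client("s3")

# Suites (JSON) and Custom Expectation Python modules are versioned in the repo
# and copied as-is to S3, where the EMR validation jobs pick them up.
for local_dir, prefix in [("expectation_suites", "suites"), ("custom_expectations", "plugins")]:
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(local_dir)}"
            s3.upload_file(str(path), ARTIFACTS_BUCKET, key)
            print(f"uploaded s3://{ARTIFACTS_BUCKET}/{key}")

# Data Docs are generated separately (e.g., with `great_expectations docs build`)
# and synced to a dedicated bucket in the same pipeline.
```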
3. Data Validation Run
The time has come to actually evaluate how good the data are. As mentioned above, we adopt Airflow as a data orchestrator, where we implemented several Data Quality DAGs. In detail, for raw data, we have dedicated DAGs that exclusively run Validation jobs (see the picture below for the sequence of Operators and Sensors involved in this kind of DAG);
while for processed data, we attach Validation Airflow tasks to the Transformation tasks (see the picture below) to give consistency to the entire processing layer: if a transformation generates qualitatively poor data, we receive an alarm so that we can interrupt all the downstream tasks (see step 4).
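A condensed sketch of what a raw-data validation DAG can look like (operator import paths depend on the Airflow Amazon provider version; the cluster id, step definition, and S3 paths are illustrative, not our production code):

```python
"""Illustrative raw-data validation DAG: submit an EMR step and wait for it to finish."""
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

VALIDATION_STEP = [{
    "Name": "validate_page_views",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit", "--deploy-mode", "cluster",
            "s3://mediaset-dq-artifacts/jobs/run_validation.py",            # validation module (hypothetical)
            "--suite", "s3://mediaset-dq-artifacts/suites/page_views_suite.json",
            "--data-path", "s3://datalake/raw/page_views/dt={{ ds }}/",
        ],
    },
}]

with DAG(
    "dq_page_views",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_validation = EmrAddStepsOperator(
        task_id="submit_validation",
        job_flow_id="{{ var.value.emr_cluster_id }}",  # assumes the cluster id is stored as an Airflow Variable
        steps=VALIDATION_STEP,
    )
    wait_validation = EmrStepSensor(
        task_id="wait_validation",
        job_flow_id="{{ var.value.emr_cluster_id }}",
        step_id="{{ task_instance.xcom_pull(task_ids='submit_validation')[0] }}",
    )
    submit_validation >> wait_validation
```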
Validation jobs get the previously developed and deployed Expectation Suites and Custom Expectations as input. The Data Validation output is a JSON file that, after being stored in its raw format on S3, is parsed through the Airflow PythonOperator and persisted in the Data Lake as a partitioned parquet file. Finally, the metadata in the AWS Glue Data Catalog is updated with the new partitions, making the Data Quality Validation Results ready to be queried with Amazon Athena.
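One possible shape for this parsing task (we use the awswrangler library here purely as an illustration; the article does not prescribe a specific library, and the paths, output schema, and Glue names are hypothetical):

```python
"""Illustrative parsing step run by an Airflow PythonOperator."""
import json

import awswrangler as wr
import boto3
import pandas as pd


def parse_validation_result(bucket: str, key: str, run_date: str) -> None:
    # Read the raw Great Expectations validation result from S3.
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    validation = json.loads(body)

    # Flatten the result into one row per executed Expectation.
    rows = []
    for res in validation["results"]:
        cfg = res["expectation_config"]
        rows.append({
            "expectation_type": cfg["expectation_type"],
            "column": cfg["kwargs"].get("column"),
            "success": res["success"],
            "unexpected_count": res.get("result", {}).get("unexpected_count"),
            "element_count": res.get("result", {}).get("element_count"),
            "run_date": run_date,
        })

    # Persist as a partitioned parquet dataset and register the new partition in the Glue Data Catalog.
    wr.s3.to_parquet(
        df=pd.DataFrame(rows),
        path="s3://datalake/data_quality/validation_results/",  # hypothetical location
        dataset=True,
        mode="append",
        partition_cols=["run_date"],
        database="data_quality",      # hypothetical Glue database
        table="validation_results",   # hypothetical Glue table
    )
```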
Bonus: in the companion repository, you can find a ready-to-use "validate your data" job that you can quickly run to simulate what we achieve with Amazon EMR in this step (i.e., a Docker container running a spark-submit command to execute the data validation Python module).
4. Data Validation Results Analysis
We decided to integrate Apache Superset into the Data Quality workflow to explore the Validation Results and make their visualization as impactful as possible. Superset is an open-source data exploration and visualization platform capable of connecting to modern databases (including Amazon Athena), easy to master, and rich in built-in visualization plots. Thanks to its simplicity, Superset allowed us to focus on core tasks, such as designing custom metrics and visualizing insights with interactive dashboards to evaluate our data quality, rather than on the tool itself (see the screenshot below for a sample Superset Data Quality Dashboard).
Examples of metrics used to evaluate data are:
- Data volumes
- Unsuccessful Expectations Percentage (see the query sketch after this list)
- List of all the columns with at least one Unsuccessful Expectation
- Percentage of wrong records for each column
- First n Unexpected Values for each column with an Unsuccessful Expectation
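As a sketch of how a metric such as the Unsuccessful Expectations Percentage can be derived from the parsed results, assuming the hypothetical table and column names used in step 3 (awswrangler is again just one convenient way to run the Athena query from Python; in Superset the equivalent SQL is used directly):

```python
"""Illustrative metric query against the parsed Validation Results table in Athena."""
import awswrangler as wr

sql = """
SELECT
    run_date,
    100.0 * SUM(CASE WHEN NOT success THEN 1 ELSE 0 END) / COUNT(*) AS unsuccessful_expectations_pct
FROM validation_results
GROUP BY run_date
ORDER BY run_date
"""

metrics_df = wr.athena.read_sql_query(sql, database="data_quality")
print(metrics_df.tail())
```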
Finally, Superset provides two key features useful for the workflow:
- Alerts are triggered when a SQL condition is met, sending a notification on Slack or via e-mail. You can set a threshold for a KPI you want to monitor (e.g., "the variation in the number of Unsuccessful Expectations between today's and yesterday's runs must be lower than 10%"), and when the condition is not satisfied, an alarm notification is sent.
- Reports are sent on a schedule and allow you to analyze the last week's validation results and communicate them in a much more organized and clear way.
5. Data Documentation
The workflow’s final step is publishing the Data Documentation website, previously generated by Great Expectations when the CI pipeline is triggered. Data Docs contains a complete list of all the Expectations belonging to each column and a description of the implemented rules, organized by table and suite. In addition to providing documentation for each Expectation Suite, Great Expectations Data Docs allowed us to share the checks developed for each table with non-technical people, making it easier to discuss possible future Suite enhancements.
Conclusion
In this article, we presented the workflow based on Great Expectations that gives Mediaset the capability to constantly monitor the Data Lake health status. This prevents the Data Lake from turning into a Data Swamp, keeping data trustworthiness high and enhancing the quality of all the projects that involve data.
Once the Data Quality Workflow is deployed, the developers of the different Mediaset business units only have to focus on developing the Expectation Suites and (obviously) on analyzing the data quality alerts and reports.
The local development environment provides all the business units with a comfortable place to develop their Suites and, therefore, to be independent in monitoring the data within their competence. Each business unit has been able to translate its own needs into Great Expectations through native and custom expectations.
The development environment, combined with the power and elasticity of the cloud, with services like EMR, allows us to apply expectations at any scale, both on raw data and on processed data, proactively monitoring what we ingest and persist daily in our beloved data lake. The cost of the EMR clusters depends on the size of the dataset you want to evaluate, the number of expectations contained in the Suites, and the complexity of the custom expectations that have been implemented.
Organizing the checks as one suite per table, together with the automatic generation of Data Docs, provides all users with an easy way to check the list of tables covered by an Expectation Suite and to consult the logic implemented for each column.
Finally, the visualization tool democratizes the analysis of the Validation Results, making them accessible even to non-technical people who wish to check the status of a table, or of the Data Lake, before using the data.
Thank you for reading our work! If you have questions or feedback, feel free to reach out.
Acknowledgements
Thanks to Nicola, Daniele, Fabio from Mediaset, Jacopo from Coveo, and Danny from Tecton for comments and suggestions on the article and the Github repository.
References
- What is a Data Lake? – Amazon Web Services (AWS). https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
- Turning Your Data Lake Into a Data Swamp | Integrate.io. https://www.integrate.io/blog/turning-your-data-lake-into-a-data-swamp/
- What is Amazon S3? – Amazon Web Services (AWS). https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
- Upgrading to the AWS Glue Data Catalog – Amazon Athena. https://docs.aws.amazon.com/athena/latest/ug/glue-faq.html
- What is Amazon EMR? – Amazon Web Services (AWS). https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html
- What is Amazon Athena? – Amazon Web Services (AWS). https://docs.aws.amazon.com/athena/latest/ug/what-is.html