How to Test PySpark ETL Data Pipeline

Validate big data pipeline with Great Expectations

Edwin Tan
Towards Data Science
6 min read · Dec 6, 2022

Photo by Erlend Ekseth on Unsplash

Introduction

"Garbage in, garbage out" is a common expression used to emphasize the importance of data quality for tasks such as machine learning, data analytics and business intelligence. With an increasing amount of data being created and stored, building high-quality data pipelines…
