
Understanding Apache Parquet

Understand why Parquet should be used for warehouse/lake storage

Atharva Inamdar
Towards Data Science
4 min read · Sep 26, 2020


Apache Parquet is a columnar storage format available to any project […], regardless of the choice of data processing framework, data model or programming language.
https://parquet.apache.org/

This description sums the format up well. This post walks through the features of the format and why they benefit analytical queries in a data warehouse or lake.

Row-oriented storage: data are stored row after row, with each row containing all columns/fields (image: https://www.ellicium.com/parquet-file-format-structure/)

The first feature is the columnar nature of the format: data is encoded and stored column by column rather than row by row. This lets analytical queries read only the subset of columns they need across all rows. Within a file, Parquet organises data into row groups; inside each row group every column is stored as a column chunk, which is further divided into pages. Reading only the required column chunks keeps disk I/O to a minimum.
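To make the column-pruning benefit concrete, here is a minimal sketch using pyarrow (an assumed dependency, not mentioned above); the file name and column names are hypothetical.

import pyarrow.parquet as pq

# Read only the two columns the query needs; the other column chunks
# are never fetched from disk.
table = pq.read_table("events.parquet", columns=["user_id", "event_time"])
print(table.num_rows, table.column_names)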

The second feature to mention is the data schema and types. Parquet is a binary format with typed, encoded values. Unlike text formats such as CSV, it stores data with explicit physical types: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE and BYTE_ARRAY (plus FIXED_LEN_BYTE_ARRAY), with logical types layered on top for strings, decimals, timestamps and so on. This allows clients to serialise and deserialise the data easily and efficiently when reading and writing the Parquet format.
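As an illustration of typed round-tripping, the following sketch (again assuming pyarrow, with made-up column names) declares an explicit schema, writes it and reads it back.

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("id", pa.int64()),
    ("price", pa.float64()),
    ("in_stock", pa.bool_()),
    ("name", pa.string()),  # stored as BYTE_ARRAY with a UTF-8 logical type
])
table = pa.table(
    {"id": [1, 2], "price": [9.99, 4.5], "in_stock": [True, False], "name": ["a", "b"]},
    schema=schema,
)
pq.write_table(table, "products.parquet")
print(pq.read_table("products.parquet").schema)  # the declared types round-trip exactly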

In addition to the data types, the Parquet specification defines metadata at three levels: file, column chunk and page header. The footer of each file contains the file metadata, which records the following (a quick way to inspect it is sketched after the list):

  • Version (of Parquet format)
  • Data Schema
  • Column metadata (type, number of values, location, encoding)
  • Number of row groups
  • Additional key-value pairs
Parquet Metadata https://parquet.apache.org/documentation/latest/
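One way to see this metadata is sketched below with pyarrow, using the hypothetical products.parquet file from the earlier example; only the footer is read, no data pages.

import pyarrow.parquet as pq

pf = pq.ParquetFile("products.parquet")   # opens the file and reads its footer
meta = pf.metadata
print(meta.num_rows, meta.num_row_groups, meta.created_by)
print(meta.row_group(0).column(0))        # per-column-chunk type, encodings, size, offsets
print(meta.metadata)                      # additional key-value pairs, if any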

The metadata is always written in the footer of the file, which allows a single-pass write. In plain English, the data is written first, and the metadata can then be written accurately because the locations, sizes and encodings of the written data are all known. Many formats put their metadata in a header instead, which requires multiple passes since the data is written after it. The footer design makes both the metadata and the data efficient to read and write. Another advantage is that output can be split into files of any desired size; for example, when writing from Spark or loading into Amazon Redshift, a target file size of around 1 GB is common for efficient loading.
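If you want to control how the output is chunked when writing with pyarrow, one knob is the row group size; the sketch below is illustrative, and the value of 100,000 rows is an arbitrary assumption to be tuned per workload.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1_000_000))})               # toy data
pq.write_table(table, "ids.parquet", row_group_size=100_000)   # ten row groups
print(pq.ParquetFile("ids.parquet").metadata.num_row_groups)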

A separate metadata file is also part of the specification, allowing multiple Parquet files to be referenced. So a dataset can be arbitrarily large or small and span a single file or many. This is particularly advantageous when working with gigabytes or terabytes of data in a data lake. Most of today's applications and warehouses allow parallel reads, and splitting a dataset across multiple files means the data can be read in parallel to speed up execution.
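As a sketch of reading a multi-file dataset (the directory path is hypothetical, and pyarrow is again assumed), the dataset API treats a folder of Parquet files as one logical table and scans the files in parallel:

import pyarrow.dataset as ds

dataset = ds.dataset("lake/events/", format="parquet")
table = dataset.to_table(columns=["user_id"])  # scans all files, reading only one column
print(table.num_rows)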

Combining the schema and metadata with splittable files makes Parquet a flexible format whose schema can evolve over time. For example, if a field/column is added to the dataset, it is simply encoded in the new column chunks and files, and the metadata records which files and row groups contain it. Schema evolution and merging can therefore happen easily: files written before the change simply lack the new field, and when a mix of old and new files is read together, null values stand in for the missing column.
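The null-filling behaviour can be sketched with pyarrow as follows; the two file names and the country column are invented for the example.

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Two files: the newer one adds a "country" column.
pq.write_table(pa.table({"id": [1, 2]}), "part-old.parquet")
pq.write_table(pa.table({"id": [3, 4], "country": ["NZ", "IN"]}), "part-new.parquet")

# Read both under a unified schema; rows from the old file get nulls for the new column.
unified = pa.unify_schemas([pq.read_schema("part-old.parquet"),
                            pq.read_schema("part-new.parquet")])
table = ds.dataset(["part-old.parquet", "part-new.parquet"],
                   format="parquet", schema=unified).to_table()
print(table.to_pydict())  # {'id': [1, 2, 3, 4], 'country': [None, None, 'NZ', 'IN']}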

Lastly, the format supports compression natively within its files. Compression and encoding are applied per column chunk, so columns with low to medium cardinality compress extremely well (dictionary and run-length encodings do much of the work), while high-cardinality columns can use a different codec or encoding within the same file. This per-column variability across fields and data types is another reason the data can be read efficiently and with high throughput.
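For instance, pyarrow lets you pick the codec (and dictionary encoding) per column within a single file; the column names and codec choices below are assumptions for illustration.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"status": ["ok", "ok", "error"],       # low cardinality
                  "payload": ["a9f3", "41c7", "7bb2"]})   # high cardinality
pq.write_table(table, "mixed.parquet",
               compression={"status": "snappy", "payload": "zstd"},
               use_dictionary=["status"])  # dictionary-encode only the low-cardinality column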

Overall, Parquet’s columnar layout, together with its schema and typed data, makes it well suited to analytical workloads. It provides further benefits through compression, encoding and a splittable layout that supports parallel, high-throughput reads, and its metadata enables all of this while leaving room for flexible storage and schema evolution. Parquet has become a de facto format for analytical data lakes and warehouses, supported by many tools and frameworks such as Hadoop, Spark, Amazon Redshift and the Databricks platform. Databricks also builds on Parquet with its Delta Lake format, which adds data versioning and “time travel”. (Teaser: there may be an upcoming post on the Delta format.)
