
Parquet Best Practices: Discover your Data without loading it

Metadata, Statistics on Row Groups, Partitions discovery, and Repartitioning


Photo by Jakarta Parquet on Unsplash

This article is the next installment in a series on Parquet. If you don’t have any Parquet knowledge yet, you should read the previous Parquet article before this one; it is also a good refresher for more advanced readers. If you want to reproduce the input data for this article, the code is at the end.


Apache Parquet is a columnar storage format for big data frameworks such as Apache Hadoop and Apache Spark. It is designed to improve the performance of big data processing by storing data column by column, in a compressed and efficient way.

Parquet adoption continues to increase as more and more organizations turn to big data technologies to process and analyze large datasets. With this continuous development, it is important that everyone learns some best practices and how to navigate through Parquet files.

In this tutorial, we will show you how to gain maximum insight into your Parquet data as a Parquet user, without resorting to the common brute-force approach of loading everything just to understand it.

The case study

To do this, we will work through a case study: a Data Engineer has handed you loan applicants’ data, and you need to build predictive models with it. But first, you need to "technically" discover the data. And it is huge.

Indeed, the Data Engineer who prepared the data tells you that the Parquet folder is 1TB large (for educational purposes only; this is not the case in our example), so if you try to load everything, you will run into a memory error on your machine.

Don’t worry, we will show you the most efficient way to understand big Parquet data without even loading it into memory.

That means answering the questions below:

  • What do the Parquet files in this folder look like?
  • What variables are inside? With what types? Some statistics?
  • How is the data partitioned?

We will also teach you how to reformat partitions if you notice that something is wrong with the way the data has been partitioned.

Reading the first Parquet file

The imports that you will need for this tutorial:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os

First of all, we want to get an idea of what the folder 'APPLICATIONS_PARTITIONED' contains; this is where the data is stored.

Since you don’t know how the data is partitioned, you cannot just load the whole folder blindly: that would mean reading all the Parquet files (remember, 1TB). What you want instead is an overview of your data.

Here, I give you a function get_first_parquet_from_path() that will return the first Parquet file that is in a directory. The function will scan through each directory and subdirectory until it finds a Parquet file and will return the complete path of this single file.

def get_first_parquet_from_path(path):
    # Walk the directory tree and stop at the first Parquet file found
    for (dir_path, _, files) in os.walk(path):
        for f in files:
            if f.endswith(".parquet"):
                first_pq_path = os.path.join(dir_path, f)
                return first_pq_path

Looks like a cool function, let’s put it into practice.

path = 'APPLICATIONS_PARTITIONED'
first_pq = get_first_parquet_from_path(path)
first_pq
#Output : APPLICATIONS_PARTITIONED/NAME_INCOME_TYPE=Commercial associate/CODE_GENDER=F/6183f182ab0b47c49cf56a3e09a3a7b1-0.parquet

We notice from the path that the data is partitioned by NAME_INCOME_TYPE and CODE_GENDER, good to know.

To read this file and get the number of rows, the number of columns, and also the precious schema, here is what you can do:

first_ds = pq.read_table(first_pq)
first_ds.num_rows, first_ds.num_columns, first_ds.schema

It took less than a second to run. The reason is that the read_table() function reads a Parquet file and returns a PyArrow Table object, which represents your data as an optimized data structure developed by Apache Arrow.

Now we know that there are 637,800 rows and 17 columns (plus 2 partition columns coming from the path), and we have an overview of the variables and their types.

Wait, I told you before that we wouldn’t need to load anything into memory to discover the data. So here is a method to do it without reading any table.

Metadata

I am partially tricking you because we won’t be loading any data but we will be loading something called metadata.

In the context of the Parquet file format, metadata refers to data that describes the structure and characteristics of the data stored in the file. This includes information such as the data types of each column, the names of the columns, the number of rows in the table, and the schema.

Let’s use both the read_metadata() and read_schema() functions from pyarrow.parquet:

ts = pq.read_metadata(first_pq)
ts.num_rows, ts.num_columns, pq.read_schema(first_pq)

This gives you the same output as read_table().

However, there is a big difference in execution time: here it is close to instantaneous. And this is no surprise, because reading the metadata means reading only a very small part of the Parquet file, one that nevertheless contains everything you need for an overview of the data.
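If you want to check the difference yourself, a quick timing comparison is enough. Here is a minimal sketch using Python’s standard time module on the same first_pq path:

import time

# Time a full read of the file into a PyArrow Table
start = time.perf_counter()
pq.read_table(first_pq)
print(f"read_table:    {time.perf_counter() - start:.4f} s")

# Time a read of the footer metadata only
start = time.perf_counter()
pq.read_metadata(first_pq)
print(f"read_metadata: {time.perf_counter() - start:.4f} s")

On a typical file the metadata read is much faster, since only the footer is fetched.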

Statistics

Now let’s say I want to know a little more about the columns. What can I do?

You could read the statistics from the first Row Group of a file.

A Row Group in the Parquet file format is a collection of rows that are stored together as a unit and divided into smaller chunks for efficient querying and processing.
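Before reading those statistics, you may want to know how many Row Groups the first file actually contains. Here is a minimal check on the same first_pq file (the variable name pf_metadata is just for illustration):

pf_metadata = pq.ParquetFile(first_pq).metadata
# Number of Row Groups in this single file, and number of rows in the first one
pf_metadata.num_row_groups, pf_metadata.row_group(0).num_rows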

parquet_file = pq.ParquetFile(first_pq)
ts = parquet_file.metadata.row_group(0)  # metadata of the first Row Group
for nm in range(ts.num_columns):
    print(ts.column(nm))

The code above will give you an ugly output; here is some code to format it into a beautiful DataFrame:

beautiful_df = pd.DataFrame()
for nm in range(ts.num_columns):
    path_in_schema = ts.column(nm).path_in_schema
    compressed_size = ts.column(nm).total_compressed_size  # in bytes
    stats = ts.column(nm).statistics
    min_value = stats.min
    max_value = stats.max
    physical_type = stats.physical_type
    # One column of the summary DataFrame per Parquet column
    beautiful_df[path_in_schema] = [physical_type, min_value, max_value, compressed_size]
df = beautiful_df.T
df.columns = ['DTYPE', 'Min', 'Max', 'Compressed_Size_(bytes)']

In the DataFrame you have the type of each column, its minimum, its maximum, and its compressed size. A few learnings from this file:

  • String columns are stored as BYTE_ARRAY.
  • The Min and Max of String columns follow alphabetical order.
  • The compressed size of the Boolean column is not much smaller than that of the BYTE_ARRAY columns.
  • The youngest applicant is 21 years old, and the oldest is 68 years old.

Be careful not to generalize these statistics: they come from the first Parquet file only!
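If you do need folder-wide minima and maxima, you can still get them from metadata alone by looping over every file and every Row Group. Below is a minimal sketch of that idea; the helper aggregate_min_max() is not part of pyarrow, and it assumes all files share the same schema and that statistics were written for the column you ask for:

def aggregate_min_max(path, column_name):
    # Collect the min/max of one column across all Parquet files,
    # reading only footers and Row Group statistics (no data pages)
    mins, maxs = [], []
    for (dir_path, _, files) in os.walk(path):
        for f in files:
            if not f.endswith(".parquet"):
                continue
            meta = pq.ParquetFile(os.path.join(dir_path, f)).metadata
            for rg in range(meta.num_row_groups):
                rg_meta = meta.row_group(rg)
                for nm in range(rg_meta.num_columns):
                    col = rg_meta.column(nm)
                    if col.path_in_schema != column_name:
                        continue
                    stats = col.statistics
                    if stats is not None and stats.has_min_max:
                        mins.append(stats.min)
                        maxs.append(stats.max)
    return min(mins), max(maxs)

# Usage (replace 'SOME_COLUMN' with an actual column name from the schema):
# aggregate_min_max(path, 'SOME_COLUMN')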

Great, now we have a good understanding of the data: info on columns, types, schema, and even statistics. But aren’t we missing something?

Partitions

Yes, we don’t know the partitions of the data! As said before, we could guess at least the partitioning columns from the file path: the data is partitioned on NAME_INCOME_TYPE and CODE_GENDER. But we don’t know the other partition values. Suppose we want to look at another NAME_INCOME_TYPE?

But I’ll provide you with some code so you can get the partition columns in a more systematic way, along with all their possible values:

def get_all_partitions(path):
    # Walk the folder tree layer by layer: each layer of subdirectories
    # is named 'PARTITION_COLUMN=value'
    partitions = {}
    for (_, partitions_layer, _) in os.walk(path):
        if len(partitions_layer) > 0:
            key = partitions_layer[0].split('=')[0]
            partitions[key] = sorted(d.split('=')[1] for d in partitions_layer)
        else:
            # We reached a folder that contains only Parquet files
            break
    return partitions

Let’s run this function. It returns a dictionary whose keys are the partition columns and whose values are the lists of partition values associated with each column.

ps = get_all_partitions(path)
ps.keys(), ps.values()

We now know that the Data Engineer partitioned the data first by NAME_INCOME_TYPE and then by CODE_GENDER, and the dictionary gives us all the possible values for each partition column.

Now that we have the partition columns and values knowledge, we can read another partition of our interest.

Suppose that we want to read all the data of ‘Pensioner’ regardless of Gender.

From the last tutorial, we know that we can do that by reading the Parquet folder 'APPLICATIONS_PARTITIONED/NAME_INCOME_TYPE=Pensioner':

df_pensioner = pd.read_parquet('APPLICATIONS_PARTITIONED/NAME_INCOME_TYPE=Pensioner/')

Reformat the partitions

Actually, we have no interest in splitting the data by gender, and the data is small enough that we can read both genders without excessive runtime.

It is important not to over-partition your data because, in general, execution time grows with the number of partitions within a folder. So keep in mind that partitioning has a potential downside, even if it makes the data more functionally readable. (According to the official documentation, 512MB – 1GB is the optimal size for a partition.)
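To check whether your partitions land in a reasonable size range, you can measure each first-level partition folder on disk without opening a single file. A minimal sketch; the helper partition_sizes() is just for illustration:

def partition_sizes(path):
    # Sum the file sizes (in bytes) under each first-level partition folder
    sizes = {}
    for entry in os.scandir(path):
        if entry.is_dir():
            total = 0
            for (dir_path, _, files) in os.walk(entry.path):
                total += sum(os.path.getsize(os.path.join(dir_path, f)) for f in files)
            sizes[entry.name] = total
    return sizes

partition_sizes('APPLICATIONS_PARTITIONED')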

Here, after checking the data, let’s assume that the gender subfolders are small enough and that the functional split by gender is not useful. We decide to reformat the dataset to be partitioned only by NAME_INCOME_TYPE:

pq_table = pq.read_table('APPLICATIONS_PARTITIONED')
pq.write_to_dataset(pq_table, 'APPLICATIONS_REPARTITIONED', partition_cols=['NAME_INCOME_TYPE'])

We just read the data into a PyArrow Table object and then wrote a Parquet dataset partitioned only on NAME_INCOME_TYPE and no longer on CODE_GENDER. If we now run the function get_all_partitions() on the new folder:

partitions = get_all_partitions('APPLICATIONS_REPARTITIONED')
partitions.keys(), partitions.values()

We notice that we don’t have the partitioning by Gender anymore.
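One caveat: pq.read_table() materializes the whole dataset in memory before rewriting it, which would not fly in the 1TB scenario described at the start. For data that does not fit in memory, here is a sketch with the pyarrow.dataset API (assuming hive-style partition folders, as in our example) that lets the rewrite proceed batch by batch instead:

import pyarrow.dataset as ds

# Lazily scan the existing hive-partitioned dataset (no full load)
source = ds.dataset('APPLICATIONS_PARTITIONED', format='parquet', partitioning='hive')

# Rewrite it partitioned only on NAME_INCOME_TYPE, streaming record batches
ds.write_dataset(
    source,
    'APPLICATIONS_REPARTITIONED',
    format='parquet',
    partitioning=ds.partitioning(
        pa.schema([('NAME_INCOME_TYPE', pa.string())]), flavor='hive'
    ),
)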

In conclusion, you have just seen how to navigate through Parquet files to learn everything about the data before loading it: column names, sizes, schemas, statistics, and partition names and values. You also found out how to reformat the partitions to make them more technically and functionally sound.

Thanks for reading and see you in the next story!


The full code to generate the input data we used:


Continue your learning with my other Parquet articles:

Simple Parquet Tutorial and Best Practices

Parquet Best Practices: The Art of Filtering


