Data Preprocessing

2 New Pandas Methods Provided by Terality You May Not Know

A tutorial on how to use the to_csv_folder() and to_parquet_folder() methods provided by Terality

Angelica Lo Duca
6 min read · Mar 9, 2022


Photo by NASA on Unsplash

Recently, I read an interesting article by Bex T. introducing Terality, a Python package that provides the same methods as the well-known Pandas, while also being able to manipulate huge datasets, up to terabytes of data.

Since I also found myself having to manipulate large datasets, I decided to test the Terality library. From what I have seen, Terality is practically interchangeable with Pandas. The only difference is that you must first register an account on the Terality platform; after that, you can use it just as you would Pandas.

Looking at the Terality documentation, I found that, compared to Pandas, it provides two additional methods: to_csv_folder() and to_parquet_folder(). In this article, I describe these two methods through a practical example.

The article is organized as follows:

  • Overview of Terality
  • Setup of the scenario
  • Splitting the dataset into multiple files

1 Overview of Terality

Before starting to use the Terality package, you need to register an account on the Terality website. Registration is simple and quick, and in just a few seconds you can start using the package.

Once you have created your account, log in to the system, open a terminal, and run the following command to install Terality:

pip install --upgrade terality

Then, you go back to your account in the Terality dashboard, and click on the Generate new API Key button, as shown in the following figure:

Image by Author

You can copy the generated key, open a new Python script, and paste the following code:

import terality as te

te.configure(
    email="YOUR EMAIL",
    api_key="YOUR API KEY"
)

The previous piece of code creates a directory named .terality in your home directory, which contains all your configuration parameters, including your email and API key. Thus, you need to run the previous code only the first time you use Terality; there is no need to repeat it in later projects, because Terality will read the information stored in the .terality directory.

Now, you are ready to use Terality as you usually do with Pandas.

In addition to the classical functions provided by Pandas, Terality also provides two extra methods: to_csv_folder() and to_parquet_folder(). The two methods are very similar; they differ only in the output format. In this example, we use to_csv_folder(), but the syntax is the same for to_parquet_folder().

These two methods let you split your original dataset into multiple files. This is particularly useful when you want to split your dataset into chunks and analyze each of them separately.

According to the official Terality documentation, the to_csv_folder() and the to_parquet_folder() functions provide the following parameters:

  • path — the path to the output files. The path must include an *, which is replaced by the progressive number of each file.
  • num_files — the number of files to produce in output.
  • num_rows_per_file — the number of rows each file should include.
  • in_memory_file_size — the size in megabytes that each file should occupy in memory.
  • with_leading_zeros — whether the progressive numbers should be zero-padded (01, 02, and so on).
  • other parameters — the same parameters as the to_csv() function provided by Pandas.

Note that you can specify only one of num_files, num_rows_per_file, and in_memory_file_size.
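To make the splitting semantics concrete, here is a minimal pure-Python sketch of the chunking rule. This is not Terality code: it simply reproduces, with plain lists, the file sizes we will observe later in this article, and the ceiling-division rule is my assumption about how the library behaves, not the official specification.

```python
import math

def chunk_sizes(total_rows, num_files=None, num_rows_per_file=None):
    """Return the row count of each output file.

    Mimics the behaviour observed with Terality's to_csv_folder():
    either a fixed number of files (rows split with ceiling division)
    or a fixed number of rows per file. Exactly one criterion may be
    given (an assumption based on the documentation described above).
    """
    if (num_files is None) == (num_rows_per_file is None):
        raise ValueError("specify exactly one splitting criterion")
    size = num_rows_per_file or math.ceil(total_rows / num_files)
    full, rest = divmod(total_rows, size)
    return [size] * full + ([rest] if rest else [])

# 8,124 rows into 5 files -> four files of 1,625 rows and one of 1,624
print(chunk_sizes(8124, num_files=5))
# [1625, 1625, 1625, 1625, 1624]
```

The same helper predicts the row-per-file splits shown in the next sections, e.g. `chunk_sizes(8124, num_rows_per_file=500)` yields 16 files of 500 rows plus one of 124.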

2 Setup of the scenario

In this example, we use the Mushrooms dataset, available on Kaggle under the CC0 Public Domain license. The dataset contains 8,124 rows and 23 columns, so it is very small. However, its size is sufficient to illustrate the methods provided by Terality.

First, we load the dataset as a Terality DataFrame:

import terality as te
df = te.read_csv("mushrooms.csv")

Note that we have used the read_csv() function, which is practically the same as that provided by Pandas. By running the following command:

df.head()

we obtain an overview of the first lines of the dataset:

Image by Author

3 Splitting the Dataset into Multiple Files

Now we can split our dataset into multiple files, by using the exclusive functions provided by Terality. Pandas does not provide this feature.

3.1 Splitting by number of files

For example, we can decide to split our dataset into 5 sub-datasets, using the num_files parameter:

n = 5
df.to_csv_folder('output/A/mushroom_*.csv', num_files=n)

The output directory contains the following files:

Image by Author

We can calculate the size of each dataset by defining the following function, which loads each file and prints its shape:

def print_shape(folder, n):
    for i in range(1, n + 1):
        df_temp = te.read_csv(f"output/{folder}/mushroom_{i}.csv")
        print(df_temp.shape)

We note that all the datasets but the last one contain 1,625 records; the last dataset contains 1,624 records.
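These counts are what ceiling division predicts: 8,124 rows across 5 files gives 1,625 rows per file, with the last file taking the remainder.

```python
rows, files = 8124, 5
per_file = -(-rows // files)           # ceiling division: 1625
last = rows - (files - 1) * per_file   # remainder: 1624
print(per_file, last)
# 1625 1624
```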

3.2 Splitting by number of rows

As an alternative, we may decide to split our dataset into files of 500 rows each, using the num_rows_per_file parameter:

df.to_csv_folder('output/B/mushroom_*.csv', num_rows_per_file=500)

The output directory contains the following files:

Image by Author

To retrieve from Python the number of files contained in the directory, we can run the following code:

import os

files = os.listdir('output/B')
n = len(files)

and then we can call the print_shape() function previously defined, to print the number of rows of each file:

print_shape('B', n)

As expected, all the files contain 500 rows, except the last one, which contains 124 rows.
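Again, the arithmetic checks out: integer division of 8,124 by 500 gives 16 full files, with 124 rows left over for a seventeenth file.

```python
full, rest = divmod(8124, 500)
print(full, rest)
# 16 124  -> 16 files of 500 rows, plus one file of 124 rows
```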

3.3 Splitting by memory size

Finally, we can decide to split our dataset depending on memory size. This is particularly useful when memory is limited. In the following example, we limit each file to 1 megabyte in memory:

df.to_csv_folder('output/C/mushroom_*.csv', in_memory_file_size=1)

The output folder looks like the following figure:

Image by Author

We use the following code to calculate the size of each dataset:

files = os.listdir('output/C')
n = len(files)
print_shape('C', n)

The first 6 datasets contain 739 rows. The last 5 datasets contain 738 rows.
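As a sanity check, those 11 files account for every row of the original dataset: 6 files of 739 rows plus 5 files of 738 rows sum to 8,124.

```python
sizes = [739] * 6 + [738] * 5
print(len(sizes), sum(sizes))
# 11 8124
```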

Summary

Congratulations! You have just learned how to use the to_csv_folder() and to_parquet_folder() methods provided by Terality! These methods are particularly useful when you have big datasets and want to analyze them in chunks.

For more details on Terality, you can check its official documentation and try it!

If you have read this far, for me it is already a lot for today. Thanks! You can read more about me in this article.


Angelica Lo Duca

Researcher | +50k monthly views | I write on Data Science, Python, Tutorials, and, occasionally, Web Applications | Book Author of Comet for Data Science