
Introduction to Big Data with Vaex – A Simple Code to Read 1.25 Billion Rows

Efficiently read and visualize 1.25 billion rows of galaxy simulation data with Python

Making Sense of Big Data

People describe the present age in many ways. Some say that we are in the Disruption Era. To understand it, we can borrow a term from Schwartz (1999) in his book Digital Darwinism. The term describes an era in which businesses cannot adapt fast enough to the evolution of technology and science. Digital platforms and globalization change customers' paradigms and their needs.

Others say that we are entering the Big Data Era. Almost every discipline has experienced a data boom, and astronomy is one of them. Astronomers worldwide have realized that they need to build ever bigger telescopes and observatories, often as consortia, to collect more data. For example, in the early 2000s, an all-sky survey called the Two Micron All Sky Survey (2MASS) cataloged about 470 million objects. In 2018, Gaia, a space-based telescope, published its second data release, which contains about 1.7 billion objects. How do astronomers handle it?

In this article, we will discuss a practical way to deal with big data. We will use galaxy simulation data from the Gaia Universe Model Snapshot (GUMS), which has about 3.5 billion objects. You can access it here. For this example, we will read only 1.25 billion rows.


Read 1 billion rows with Vaex

First, I must thank Maarten Breddels, who built Vaex, a Python module for reading and visualizing big data. It can read 1 billion rows in a second. You can read the documentation here.

Vaex reads files in the HDF5 and Apache Arrow formats efficiently. Here, we will use several HDF5 files. To read a file in Vaex, you can use the following code:

import vaex as vx
df = vx.open('filename.hdf5')

If you want to read several HDF5 files at once, use this code:

df = vx.open_many(['file1.hdf5', 'file2.hdf5'])

The following figure shows an example.

Read 1.2 billion rows (Image by author)

The variable named file is a list of all of the HDF5 files in the working directory. As the figure shows, reading all of the files takes just 1 second. In the next line, I check the total number of rows: 1.25 billion.
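The cell in the figure can be approximated as follows. This is a minimal sketch, not the exact notebook code: the helper names list_hdf5_files and open_all are hypothetical, and it assumes that every .hdf5 file in the directory belongs to the same dataset.

```python
import time
from pathlib import Path


def list_hdf5_files(directory="."):
    """Return the sorted paths of all .hdf5 files in `directory` as strings."""
    return sorted(str(path) for path in Path(directory).glob("*.hdf5"))


def open_all(directory="."):
    """Open every HDF5 file in `directory` as a single Vaex DataFrame.

    Returns the DataFrame and the elapsed wall-clock time in seconds.
    """
    import vaex  # imported here so list_hdf5_files works without Vaex

    files = list_hdf5_files(directory)
    start = time.perf_counter()
    df = vaex.open_many(files)  # memory-maps the files instead of loading them
    elapsed = time.perf_counter() - start
    return df, elapsed


# Usage (assumes HDF5 files exist in the working directory):
# df, elapsed = open_all(".")
# print(f"{len(df):,} rows opened in {elapsed:.2f} s")
```

Opening is fast because Vaex memory-maps HDF5 files, so data is read from disk only when a computation needs it.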


Arithmetical operations and virtual columns in Vaex

You can also apply arithmetic operations to all of the rows quickly. Here is an example.

Arithmetical operation in Vaex (Image by author)

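Vaex can apply an arithmetic operation to all 1.25 billion rows in a matter of milliseconds. This is possible because Vaex evaluates expressions lazily. A column assigned from an expression is a virtual column: Vaex stores only the expression and computes its values when they are needed, so it takes no extra memory. If you want to assign a column in the Vaex data frame, you can use this code: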

df['col_name'] = some_operation

Here is an example.

Arithmetical operation in Vaex (Image by author)

If you read it carefully, the code is similar to the Pandas API. Yes, as far as I know, Vaex follows much of the same API as Pandas. You can also select columns as you do in Pandas, like this:

df_selected = df[['ag', 'age', 'distance']]

Back to the figure above: in line 14, a new column named 'new_col' is added. It takes only about 2.07 milliseconds.
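To try the idea without the GUMS files, the sketch below builds a small synthetic data frame and attaches a virtual column to it. It is an illustration under my own assumptions: the column names x, y, and r and the helper names make_demo_frame and add_radius are hypothetical and do not come from the figure.

```python
import numpy as np


def make_demo_frame(n_rows=1_000_000, seed=0):
    """Build a small in-memory Vaex DataFrame with two synthetic columns."""
    import vaex  # imported lazily so this module can be loaded without Vaex

    rng = np.random.default_rng(seed)
    return vaex.from_arrays(
        x=rng.normal(size=n_rows),
        y=rng.normal(size=n_rows),
    )


def add_radius(df):
    """Attach a virtual column 'r' holding sqrt(x**2 + y**2)."""
    df["r"] = np.sqrt(df.x**2 + df.y**2)  # stores the expression, not the values
    return df


# Usage:
# df = add_radius(make_demo_frame())
# print(df.virtual_columns)   # maps 'r' to its expression
# print(df.mean(df.r))        # the values are computed only at this point
# subset = df[["x", "r"]]     # Pandas-style column selection
```

The assignment returns almost immediately because nothing is computed yet. The cost is paid later, when a statistic or a plot actually needs the values.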


1D and 2D Binning

Do you need to apply a binning statistic to your big data quickly? Vaex is the answer here, too. To do 1D binning and plot the result, you can use the following code:

df.plot1d(x_axis, shape=number_of_bins)

Here is an example with our data.

1D binning with Vaex (Image by author)

We plot the column distance in 1D (as a histogram) with 25 bins. It will produce a figure like this. Vaex takes about 2 seconds to visualize a 1D plot.

1D plot (histogram) in Vaex (Image by author)
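The binning behind such a histogram can also be computed without plotting. The sketch below is a hedged illustration rather than the code from the figure: it relies on Vaex's df.count with its binby, limits, and shape arguments, while the helper name histogram_counts and the requirement to pass explicit limits are my own assumptions.

```python
import numpy as np


def histogram_counts(df, column, n_bins, limits):
    """Count the rows of a Vaex DataFrame per bin along `column`.

    `limits` is [low, high]. Returns the counts (length n_bins) and the
    bin edges (length n_bins + 1).
    """
    counts = df.count(binby=column, limits=limits, shape=n_bins)
    edges = np.linspace(limits[0], limits[1], n_bins + 1)
    return counts, edges


# Usage (assumes a DataFrame `df` with a 'distance' column and limits
# chosen by you):
# counts, edges = histogram_counts(df, "distance", 25, [low, high])
```

The returned counts are a plain NumPy array, so you can pass them to any plotting library or save them for later.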

To apply 2D binning and visualize it, you can use this code:

df.plot(x_axis, y_axis, shape=(shape_x, shape_y))

Here is an example applied to our data.

2D binning with Vaex (Image by author)

It takes about 4 minutes to visualize the 2D plot, which is very acceptable. Imagine how long another module would need to visualize 1.2 billion rows as a 2D density plot. It will give you a figure like this.

2D density plot in Vaex (Image by author)
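The grid behind a density plot like this can be computed directly as well. The following is a minimal sketch under my own assumptions, not the code from the figure: the helper name density_grid is hypothetical, and the limits are supplied by the caller.

```python
def density_grid(df, x, y, limits, shape=(256, 256)):
    """Count the rows of a Vaex DataFrame on a 2D grid over columns `x` and `y`.

    `limits` is [[x_low, x_high], [y_low, y_high]]. The result has the given
    shape, with the first axis following `x` and the second following `y`.
    """
    return df.count(binby=[x, y], limits=limits, shape=shape)


# Usage (assumes a DataFrame `df` with columns 'x' and 'y'):
# grid = density_grid(df, "x", "y", [[-10, 10], [-10, 10]])
# To draw it with Matplotlib, transpose first so that x runs horizontally:
# plt.imshow(grid.T, origin="lower", extent=[-10, 10, -10, 10])
```

Only the grid of counts is kept in memory, not the individual rows, which is why this approach scales to billions of rows.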

Conclusion

The future is here, and so is the big data era. We must prepare the techniques it requires. One such preparation is building a Python library that can read and visualize big data efficiently, and Vaex offers a solution. It is claimed that Vaex can read 1 billion rows in a second. I hope you can learn from the examples in this story and apply them to your own data.

If you liked this article, here are some other articles you may enjoy:

Python Data Visualization with Matplotlib – Part 1

5 Powerful Tricks to Visualize Your Data with Matplotlib

Matplotlib Styles for Scientific Plotting

Creating Colormaps in Matplotlib

Customizing Multiple Subplots in Matplotlib

That's all. Thanks for reading this story. Comment and share if you like it. I also recommend following my account to get a notification when I post a new story.

References:

[1] Schwartz, E. I., Digital Darwinism (1999), Broadway Books

[2] Zhang, Y. and Zhao, Y., Astronomy in the Big Data Era (2015), Data Science Journal

[3] Skrutskie, M. F. et al., The Two Micron All Sky Survey (2MASS) (2006), The Astronomical Journal, 131:1163–1183

[4] Gaia Collaboration, T. Prusti, et al., The Gaia Mission (2016), Astronomy and Astrophysics 595, A1

[5] Robin, A. C. et al., Gaia Universe model snapshot (2012), Astronomy and Astrophysics 543, A100

