How Big Is Big Data?

A survey about big data “sizes” for some of the most prominent big data sources

Luca Clissa
Towards Data Science

--

Ever wondered how big Big Data actually is? This article attempts an up-to-date comparison of the data generated by some of the most renowned data producers.

CERN Tier-0 data center. Image credit: CERN

We are witnessing an ever-increasing production of digital data, so much so that our epoch has earned the title of Big Data era. Multiple and diverse players contribute to this growth, ranging from tech companies to traditional industries, media agencies, institutions and research labs. Moreover, even everyday objects can collect data nowadays, thus turning ordinary people into data producers as well.

By and large, the modern trends in data production are driven by two main factors: the digital services offered by several stakeholders from different sectors and their use at scale by millions of users.

Although reconstructing the amounts of generated data is very hard due to the lack of official information, an attempt can be made by integrating multiple sources of different kinds.

Big Data sizes in 2021

Image by the author. For the interactive version please visit: datapane.com/lclissa/big-data-2021/

The interactive version linked in the caption features clickable links to all the sources consulted for the estimation.

The first thing that catches the eye is surely the huge bubble in the top-right corner. It represents the data detected by the electronic equipment of the Large Hadron Collider (LHC) experiments at CERN. In one year, the amount of processed information is around 40k ExaBytes (EB), that is 40 thousand billion gigabytes. To get an idea, that is roughly 80 times the estimated amount of data ever stored on Amazon S3!

However, not all of these data are of interest. Hence, the information actually recorded shrinks to a few hundred PetaBytes (PB) per year (160 PB of real and 240 PB of simulated data), and it is expected to grow by nearly an order of magnitude with the upcoming High Luminosity (HL-LHC) upgrade.

Commercial big data sources involve volumes of a similar order. The Google Search index has an estimated size of somewhat less than a hundred PB, with its sister service YouTube being roughly 4 times as big. A similar ratio holds between the yearly photo uploads of Instagram and Facebook.

Turning to storage services proper, like Dropbox, the sizes increase to a few hundred PB, similar to the projections for the HL-LHC upgrade of the LHC experiments.

However, streaming data account for a substantial share of the big data market. Services like Netflix and electronic communications generate traffic one to two orders of magnitude greater than the data producers discussed so far.
Remarkably, the scientific community plays an important role in this respect as well. For instance, the traffic generated by circulating LHC data through the Worldwide LHC Computing Grid (WLCG) amounts to thousands of petabytes per year.

Fermi estimation process

...but where do these figures come from?

Precisely reconstructing the amounts of data produced by an organization, especially a large one, is very hard, if possible at all. Indeed, not all companies track this information in the first place. Even when they do, such data are typically not shared with the general public.

Nevertheless, a tentative estimate can be made by breaking down the data production process into its atomic components and making educated guesses for each of them (Fermi estimation). In fact, retrieving information about the amount of content produced in a given time window is easier when targeting specific data sources. Once we have that, we can deduce the total data volumes through reasonable guesstimates of per-unit sizes, e.g. the average e-mail or picture size, the average data traffic for one hour of video, and so on. Of course, even small errors make a big difference when propagated at scale. However, the resulting estimates can still be indicative of the orders of magnitude involved.
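
In code, the whole recipe boils down to multiplying a count of atomic units by an assumed per-unit size. Here is a minimal sketch (the helper and the example numbers are mine, purely for illustration):

```python
PB = 1e15  # decimal petabyte


def fermi_estimate_pb(n_units: float, unit_size_bytes: float) -> float:
    """Total data volume in PB from a count of atomic units and an assumed per-unit size."""
    return n_units * unit_size_bytes / PB


# Purely illustrative numbers: one billion photos at ~2 MB each.
print(fermi_estimate_pb(1e9, 2e6))  # -> 2.0 PB
```

The per-source calculations below are little more than variations on this one line.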

Let’s dive into the sources mined during the estimation process.

Google. Despite the endless quantity of web pages on the internet, recent indications suggest that not all of them are indexed by search engines. In fact, such indexes should be kept as small as possible to ensure meaningful results and timely responses. A recent analysis estimates that the Google search index contains 30 to 50 billion web pages (for a current estimate, see worldwidewebsize.com). Considering that an average page weighs roughly 2.15 MB according to the annual Web Almanac, the total size of the Google search index should be approximately 62 PetaBytes (PB) as of 2021.
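
As a quick sanity check, here is the back-of-the-envelope arithmetic in a few lines of Python (my own sketch, using decimal units and the figures quoted above):

```python
PB = 1e15                     # decimal petabyte
avg_page_bytes = 2.15e6       # ~2.15 MB average page weight (Web Almanac)
for n_pages in (30e9, 50e9):  # estimated range of indexed pages
    print(f"{n_pages / 1e9:.0f} billion pages -> {n_pages * avg_page_bytes / PB:.1f} PB")
# -> 64.5 PB and 107.5 PB: the same order of magnitude as the ~62 PB quoted above
#    (small differences come from rounding and unit conventions).
```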

YouTube. According to Backlinko, 720k hours of video were uploaded daily on YouTube in 2021. Assuming an average size of 1 GB per hour (standard definition), this amounts to roughly 263 PB of new video over the year.
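
The same arithmetic for YouTube, with the 1 GB per hour figure being the assumption stated above:

```python
PB = 1e15
hours_per_day = 720e3  # hours of video uploaded daily (Backlinko)
bytes_per_hour = 1e9   # assumed ~1 GB per hour of standard-definition video
yearly_pb = hours_per_day * 365 * bytes_per_hour / PB
print(f"~{yearly_pb:.0f} PB of new video per year")  # -> ~263 PB
```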

Facebook and Instagram. Domo’s Data Never Sleeps 9.0 report estimates that the number of pictures uploaded every minute on the two platforms in 2021 amounted to 240k and 65k, respectively. Assuming an average size of 2 MB per picture, this makes a total of approximately 252 and 68 petabytes.
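
Again in Python, with the 2 MB per picture assumption made explicit:

```python
PB = 1e15
minutes_per_year = 365 * 24 * 60
photo_bytes = 2e6  # assumed ~2 MB per picture
for platform, photos_per_minute in [("Facebook", 240e3), ("Instagram", 65e3)]:
    yearly_pb = photos_per_minute * minutes_per_year * photo_bytes / PB
    print(f"{platform}: ~{yearly_pb:.0f} PB per year")
# -> Facebook: ~252 PB per year, Instagram: ~68 PB per year
```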

Dropbox. Although Dropbox does not actually produce data, it offers a cloud storage solution to host users’ content. In 2020, the company declared 100 million new users, 1.17 million of which were paid subscriptions. Assuming that free users fill 75% of their 2 GB quota and paid users 25% of their 2 TB quota, the amount of new storage required by Dropbox users in 2020 can be estimated at around 733 PB.
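
Spelling the guesstimate out (the 75% and 25% occupancy figures are the assumptions stated above):

```python
PB = 1e15
new_users = 100e6    # new users declared in 2020
paid_users = 1.17e6  # of which paid subscriptions
free_users = new_users - paid_users
free_pb = free_users * 0.75 * 2e9 / PB   # free users filling 75% of a 2 GB quota
paid_pb = paid_users * 0.25 * 2e12 / PB  # paid users filling 25% of a 2 TB quota
print(f"~{free_pb:.0f} PB + ~{paid_pb:.0f} PB = ~{free_pb + paid_pb:.0f} PB")
# -> ~148 PB + ~585 PB = ~733 PB
```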

Image by the author.

E-mails. According to Statista, nearly 131k billion electronic communications were exchanged from October 2020 to September 2021 (71k billion e-mails and 60k billion spam). Assuming average sizes of 75 KB and 5 KB for standard and junk e-mails, respectively, we can estimate the total traffic to be in the order of 5.7k PB.
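
In code, using the average sizes assumed above:

```python
PB = 1e15
regular_bytes = 71e12 * 75e3  # ~71k billion e-mails at ~75 KB each
spam_bytes = 60e12 * 5e3      # ~60k billion spam messages at ~5 KB each
print(f"~{(regular_bytes + spam_bytes) / PB:,.0f} PB per year")
# -> ~5,625 PB, consistent with the ~5.7k PB order of magnitude above
```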

Netflix. The company’s penetration has skyrocketed in recent years, also in response to the changed daily routines imposed by the pandemic. Domo estimates that Netflix users consumed 140 million hours of streaming per day in 2021, which makes a total of roughly 51.1k PB when assuming 1 GB per hour (standard definition).
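
The corresponding arithmetic, assuming 1 GB per hour of standard-definition streaming as above:

```python
PB = 1e15
hours_per_day = 140e6  # daily streaming hours (Domo)
bytes_per_hour = 1e9   # assumed ~1 GB per hour of standard-definition video
yearly_pb = hours_per_day * 365 * bytes_per_hour / PB
print(f"~{yearly_pb / 1e3:.1f}k PB per year")  # -> ~51.1k PB
```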

Amazon. According to Amazon Web Services (AWS) chief evangelist Jeff Barr, over 100 trillion objects were stored in Amazon Simple Storage Service (S3) as of 2021. Assuming an average object size of 5 MB, this puts the total size of files ever stored in S3 at roughly 500 ExaBytes (EB).
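
And the S3 figure, with the 5 MB average object size being the assumption above:

```python
EB = 1e18
n_objects = 100e12      # objects ever stored in S3 (Jeff Barr)
avg_object_bytes = 5e6  # assumed ~5 MB per object
print(f"~{n_objects * avg_object_bytes / EB:.0f} EB ever stored")  # -> ~500 EB
```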

LHC. More detailed information is available in this case, thanks to the controlled experimental conditions and the publication of scientific results.
In the last run (2018), the LHC generated roughly 2,400 million particle collisions per second for each of the four main experiments, namely ATLAS, ALICE, CMS and LHCb. Each collision delivers about 100 MegaBytes (MB) of data, which projects the yearly yield to roughly 40k EB of raw data.

However, storing such a tremendous amount of data is unattainable with current technology and budget. In addition, only a tiny fraction of that data is actually of interest, so there is no need to record all of that information. Consequently, the vast majority of raw data is discarded straight away using hardware and software trigger selection systems. Hence, the amount of data recorded is lowered to roughly 1 PB per day, which gives 160 PB during the last data taking in 2018.
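
The recorded-data figure can be reproduced with one multiplication; note that the ~160 days of data taking is my reading of the text, implied by the 1 PB/day and 160 PB figures:

```python
PB_PER_DAY = 1          # data kept after the trigger selections
DATA_TAKING_DAYS = 160  # implied by the 160 PB figure for 2018 (assumption)
print(f"~{PB_PER_DAY * DATA_TAKING_DAYS} PB recorded in 2018")  # -> ~160 PB
```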

Besides the real data collected by the LHC, physics analyses also require comparing experimental results with Monte Carlo data simulated according to current theories. This produces roughly 1.5 times as much data again (about 240 PB of simulated data on top of the 160 PB of real recorded information).

Furthermore, the CERN community is already working on enhancing the Large Hadron Collider’s capabilities for the so-called High Luminosity LHC (HL-LHC) upgrade. In this new regime, the generated data are expected to increase by a factor of 5 or more, producing an estimated 800 PB of new data each year by 2026.

In addition, the collected data are continuously transferred through the Worldwide LHC Computing Grid (WLCG) to allow researchers across the world to conduct their studies, generating a yearly traffic of 1.9k PB in 2018.

For a more detailed explanation of LHC data production see [1].

Conclusions

Although the lack of official data hampers a precise estimation of the big data volumes produced by individual organizations, reasoning about atomic units of data production and their sizes can still provide interesting insights.

For example, streaming data (e.g. Netflix, e-mails and WLCG transfers) already account for a significant slice of the big data market, and they will presumably continue to increase their impact in coming years due to the extensive adoption of smart everyday objects able to generate and share data.

However, what struck me the most is the fact that scientific data also play an important role in the big data phenomenon, experiencing volumes comparable to some of the most renowned commercial players!

Wrapping up, the data production rate is higher than ever and will keep growing in the coming years. It will be interesting to see how these estimates evolve over time and whether the relative contributions of the different organizations change in a few years.

Now I’m really interested in knowing what you think and what caught your attention!

Do the presented figures sound reasonable to you? How closely do they reflect your personal experience?

Are there important big data sources or organizations you think should be added to the comparison? Feel free to fork the code used for the visualization and integrate them yourself!

References

[1] L. Clissa, M. Lassnig, L. Rinaldi, How big is Big Data? A comprehensive survey of data production, storage, and streaming in science and industry (2023), Frontiers in Big Data
[2] L. Clissa, Survey of Big Data sizes in 2021 (2022), arXiv
