The Data Science Experience
128 Petabytes. Not Terabytes, Petabytes. That’s how much storage space the Hadoop Distributed File System (HDFS) had when I first started working with Big Data.
We have all heard the term big data, but I believe we never fully understand it until we get hands-on experience with it.
I first learnt about it during my sophomore year at university and even worked on "big data" in my first job.
I was responsible for setting up a pipeline to process 10 MB of generated data every day, which would amount to around 3.65 GB of data per year.
We even decided not to use Hadoop, since the overhead of loading the data from HDFS was greater than simply reading the data directly from the file system.
Now that I have come across actual big data, I realise that what I did back then did not come remotely close to what I would call big data.
Here are 5 things I learned from my first experience with big data.
Everything is Big
When I said big, I meant really big. Like, big big.
Before this experience, I had trouble grasping just how big the scale of big data really is.
When I saw 128 Petabytes, it was the first time I had seen the unit PB used in practice. Sure, I had heard of Petabytes before, the step above Terabytes. However, seeing it for the first time was like seeing Andrew Ng sitting across the room in a Starbucks.
It was the same story for the RAM capacity. 64 TB. Sixty-four Terabytes! I had never even used a machine with 64 TB of storage before, let alone that much RAM.
Then it turned out those server specifications were just for the recommendation engine team.
Other departments had their own servers, too. Our server specification was still considerably smaller than those of other departments, especially the ones processing images and videos.
I had seen the storage space for one of those teams reach 1 EB, with RAM capacity at 512 TB.
Every day, the data for my team would grow by an average of 30 GB, depending on customer usage. That’s roughly 8x the data generated in an entire year by my first "big data" project.
Everything is not Big Enough
Although we had a huge amount of storage and RAM available, they were not unlimited.
In fact, they could be considered very limited.
At that time, we "only" used 60 out of 128 PB of storage and 36 out of 64 TB of RAM on average.
However, we followed strict guidelines to make sure these resources were used as efficiently as possible by all 20 team members.
We had to keep our personal Hadoop directories as lean as possible. Generated data had to be moved out of HDFS and saved in the server’s file system, and unused data had to be deleted.
Every month, we would get a notification from the system engineers to free up space if our personal directory exceeded 10 GB.
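To make the routine concrete, here is a minimal housekeeping sketch, assuming the standard `hdfs dfs` CLI is available; the paths and the way the threshold is checked are my own illustration, not our actual tooling.

```python
import subprocess

HDFS_HOME = "/user/yourname"       # hypothetical personal HDFS directory
LOCAL_ARCHIVE = "/data/archive"    # hypothetical spot on the server's file system
LIMIT_BYTES = 10 * 1024**3         # the 10 GB guideline mentioned above

def hdfs_usage_bytes(path: str) -> int:
    """Return the space used under `path`, via `hdfs dfs -du -s`."""
    out = subprocess.check_output(["hdfs", "dfs", "-du", "-s", path], text=True)
    return int(out.split()[0])

def archive_and_remove(hdfs_path: str) -> None:
    """Copy a finished dataset to local disk, then delete it from HDFS."""
    subprocess.check_call(["hdfs", "dfs", "-get", hdfs_path, LOCAL_ARCHIVE])
    subprocess.check_call(["hdfs", "dfs", "-rm", "-r", hdfs_path])

if __name__ == "__main__":
    used = hdfs_usage_bytes(HDFS_HOME)
    print(f"{HDFS_HOME}: {used / 1024**3:.1f} GB used")
    if used > LIMIT_BYTES:
        # In practice you pick which datasets to move; this just shows the shape of it.
        archive_and_remove(f"{HDFS_HOME}/generated/old_run")
```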
For memory usage, we were told to keep it as low as possible. Every MapReduce job had a default limit of 2.5 GB of RAM per worker. You could increase the limit up to 16 GB, but you had to prepare a damn good explanation for why you needed that much RAM when asked.
To be fair, if you are running a basic MapReduce job and your script needs more than 2.5 GB of memory, then there is probably something wrong with your code.
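To illustrate what keeping memory low usually looks like in practice, here is a sketch of a Hadoop Streaming style reducer; the tab-separated key/count format is an assumption. Because the input arrives sorted by key, it only ever holds one key’s running total in memory instead of loading everything at once.

```python
#!/usr/bin/env python3
# Sketch of a memory-lean streaming reducer: aggregate one key at a time.
# Assumes each input line is "key<TAB>count", already sorted by key.
import sys

current_key = None
running_total = 0

for line in sys.stdin:                       # stream line by line, never load it all
    key, _, value = line.rstrip("\n").partition("\t")
    if key == current_key:
        running_total += int(value)
    else:
        if current_key is not None:
            print(f"{current_key}\t{running_total}")
        current_key, running_total = key, int(value)

if current_key is not None:                  # flush the last key
    print(f"{current_key}\t{running_total}")
```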
It was also quite common for the Hadoop job queue to be so long and crowded that a job that usually takes 2 hours would need 10+ hours to finish.
Sometimes we would also get a notification that an urgent or high-priority Hadoop job was about to be run on the server.
We would be asked to schedule our jobs around that particular job, and those who failed to do so would often have their jobs terminated.
So yeah, even when everything is seemingly huge, sometimes it is still not enough.
If you see it from a cost perspective, it makes sense to have just enough resources instead of having more and only using around 30% of them at all times.
Performance, Performance, Performance
Performance is crucial when you are working with big data.
Imagine you are working with 1,000 data points and processing one item takes 1 ms on average. In total, you would need 1 second to process all of them.
Let’s say you are able to make your code more efficient, and now it takes 0.5 ms to process one item. In turn, you would only need 0.5 seconds to process all of them.
The difference between 1 second and 0.5 seconds is barely noticeable, but multiply the data by 1,000 and you are looking at a 500-second (~8 minute) difference.
Scale it up by another factor of 10, and you would see around an 83-minute difference between the original and the more efficient code.
Simple optimisations can lead to much better performance
One real example I have is when one of my MapReduce jobs took 8 hours to complete. It was a fairly simple job, so I asked my team lead at the time why it took so long.
He pointed out that I had initialised a regex pattern inside a loop, and asked me to try putting it outside the loop, since it’s the same pattern for every item.
I reran the job and it only took 2 hours to finish.
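My job wasn’t in exactly this form, but the change looked roughly like this Python sketch (the log format and field name are made up). Python’s `re` module caches recently used patterns, so the gap there is smaller than what I saw, but the principle of hoisting per-record setup out of the loop is the same.

```python
import re

# Slow version: the pattern is recompiled (or at least looked up again) on every record.
def extract_users_slow(lines):
    users = []
    for line in lines:
        match = re.search(r"user=(\d+)", line)   # per-item pattern work
        if match:
            users.append(match.group(1))
    return users

# Faster version: compile the pattern once and reuse it for every record.
USER_PATTERN = re.compile(r"user=(\d+)")

def extract_users_fast(lines):
    users = []
    for line in lines:
        match = USER_PATTERN.search(line)        # reuses the precompiled pattern
        if match:
            users.append(match.group(1))
    return users
```

Regex compilation, opening connections, parsing config: anything that is the same for every record is usually the first thing to pull out of the loop when a simple job is mysteriously slow.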
No Room for Errors
When a job takes hours or even days to complete, having errors in the code is a big no. And not just code errors: producing a wrong result is often not acceptable either.
Syntax errors might be caught by the IDE, but runtime errors are much harder to detect.
I usually write down the steps on paper and map out all the possible input values, just to make sure the code can handle every kind of input without error.
This usually works as expected for simpler tasks, but it can eat up a lot of your time on more complicated ones.
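The result of all that paper-mapping tends to look something like this sketch of a defensive streaming mapper; the tab-separated record format is an assumption, but the habit of counting and skipping anything unexpected instead of crashing a multi-hour job is the point.

```python
#!/usr/bin/env python3
# Sketch of a defensive streaming mapper: handle every record shape mapped out
# beforehand, and count-and-skip anything unexpected instead of raising.
# Assumes records look like "timestamp<TAB>user_id<TAB>amount".
import sys

skipped = 0

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:            # truncated or malformed record
        skipped += 1
        continue
    timestamp, user_id, amount = fields
    try:
        value = float(amount)       # amount might be empty or non-numeric
    except ValueError:
        skipped += 1
        continue
    if value < 0:                   # one of the cases mapped out on paper beforehand
        continue
    print(f"{user_id}\t{value}")

# stderr shows up in the task logs, so bad input is visible without killing the job
print(f"skipped malformed records: {skipped}", file=sys.stderr)
```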
Test runs are your best friends.
If I’m not sure the code is correct and I’m too sleep-deprived to check it thoroughly, I will usually run a modified version of the code on a fraction of the data to see if it works.
I mostly change the reducer script to drop the filters that only make sense on the full dataset, or lower the thresholds significantly, so the sample data doesn’t all get filtered out.
These kinds of jobs might still take 20–40 minutes because of the Hadoop queue, which is not that long considering they help us figure out whether the full job will produce a useful result in the end.
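One way to do that kind of test run entirely offline is to pipe a small sample through the mapper, a sort, and the reducer locally, the way Hadoop Streaming would. The sketch below assumes the file names and a MIN_COUNT environment variable for the reducer’s threshold; both are my own illustrative conventions, not my actual pipeline.

```python
#!/usr/bin/env python3
# Local "test run" harness: sample -> mapper -> sort -> reducer, emulating the
# shuffle that Hadoop performs between the map and reduce stages.
import os
import subprocess

SAMPLE = "sample_1percent.tsv"          # a small slice pulled from the full dataset

with open(SAMPLE) as sample_file:
    mapped = subprocess.run(
        ["python3", "mapper.py"],
        stdin=sample_file, capture_output=True, text=True, check=True,
    )

# Hadoop sorts mapper output by key before the reducer sees it; emulate that here.
shuffled = "".join(sorted(mapped.stdout.splitlines(keepends=True)))

# Relax the threshold so the tiny sample doesn't get filtered away, assuming the
# reducer reads it from an environment variable.
test_env = {**os.environ, "MIN_COUNT": "1"}

reduced = subprocess.run(
    ["python3", "reducer.py"],
    input=shuffled, capture_output=True, text=True, check=True, env=test_env,
)

print(reduced.stdout[:2000])            # eyeball the first chunk of the output
```

On the cluster itself, the same relaxed threshold can be passed to the scripts through Hadoop Streaming’s -cmdenv option.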
Do not Wait Around

We have all heard the jokes about data scientists and machine learning engineers fooling around while waiting for their models to train.
The same thing applies to those working with big data, because a MapReduce job can run for hours, or even days for bigger tasks.
I usually spend that time either working on the next step of the current task or analysing the data generated by my other tasks that have already finished.
If you don’t have anything on your plate, then I suggest asking for more work to keep you occupied, because having more items to work on will force you to be more productive.
I knew coworkers who used their spare time to rest because they had only slept 2 hours the night before due to work, and one of them even used his spare time to study for interviews because he was planning to apply to another company.
Anyway, how you use your time is up to you, but in my experience, browsing Reddit and watching YouTube is not a good way to spend your waiting time.
Working with big data is generally only possible at big companies or mature start-ups.
Smaller companies and newer start-ups will not come close to the scale of data being generated and analysed by their bigger counterparts.
Dealing with big data should be the job of big data engineers. However, most companies have their own definition of a data scientist, so it doesn’t hurt to have a little knowledge of or experience with big data.
The easiest way to get this experience is by doing an internship as a big data engineer, machine learning engineer, or even data scientist.
You can roughly tell whether you would be working on big data from the job description or requirements. If they mention anything related to working with Hadoop or performing MapReduce, then you are more likely to be working on big data.
Keep in mind that smaller companies might have a smaller scale of big data and fewer resources compared to bigger ones.
The Data Science Experience is a series of stories about my personal experience and views from working as a data scientist and machine learning engineer. Follow me to get regular updates on new stories.