
A day in the life of a Big Data Engineer

Working for an ad tech giant in Paris

Facts: I work with approximately 200 PB of data

According to Gizmodo:

1 PB is equal to 20 million four-drawer filing cabinets filled with texts or roughly 13 years of HD-TV videos.

50 PB is the entire written works of mankind from the beginning of recorded history in all languages

Here’s what our datacenters look like. Photo by Manuel Geissinger from Pexels

I wake up at 7 A.M.

I make myself the first cup of coffee. While scrolling through tech newsfeeds, I enable email notifications. You should not open emails before you start working, but anything can happen to our data at any time, and we must stay informed.

I arrive at the office around 8:30 A.M.

Nobody’s here yet, as I’m one of the earliest birds in the open space. After settling down, I take a look at our data monitoring tools: resource consumption graphs, storage bar charts, anomaly alerts. If something looks strange in the data, I’ll have to check it out. Some data can go missing or have unusual values. Sometimes the data transactions crash and we end up with no data at all. If one of those bad things happens, I’ll need a second cup of coffee.
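
To make "missing data" and "unusual values" concrete, here is a minimal sketch of the kind of check a monitoring tool might run on daily row counts. It is not our actual tooling: the function names, the sample counts, and the 50% tolerance are all assumptions chosen for illustration.

```python
from datetime import date, timedelta
from statistics import median


def find_missing_days(counts, start, end):
    """Return every day in [start, end] that has no entry in counts."""
    total_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(total_days))
    return [day for day in all_days if day not in counts]


def find_unusual_days(counts, tolerance=0.5):
    """Return days whose row count is further than tolerance * median from the median."""
    typical = median(counts.values())
    return sorted(
        day for day, rows in counts.items()
        if abs(rows - typical) > tolerance * typical
    )


# Hypothetical daily row counts: January 3 is absent and January 5 looks far too low.
daily_counts = {
    date(2020, 1, 1): 1000,
    date(2020, 1, 2): 1020,
    date(2020, 1, 4): 980,
    date(2020, 1, 5): 12,
}

missing = find_missing_days(daily_counts, date(2020, 1, 1), date(2020, 1, 5))
unusual = find_unusual_days(daily_counts)
print("Missing days:", missing)
print("Unusual days:", unusual)
```

A median-based threshold is only one simple choice; a real anomaly alert would likely account for weekly seasonality and traffic growth.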

Photo by Stephen Dawson on Unsplash

Next, I check on our data schedulers. We ingest hundreds of terabytes a day, so we cannot move all that data from one pipeline to another by hand. We need data schedulers with which we can execute queries, copy data between databases and, most importantly, run those commands on a timer. A scheduler is a vital part of any data pipeline where the amount of data exceeds what humans can handle manually. All our data schedulers are built in-house because we engineers love crafting.

AN EXAMPLE INSTRUCTION FOR A SCHEDULER
1. Create table if not exists
- Schema:
  + Column 1: name + data type
  + Column 2: name + data type
- Storage format: text
- Table name: Table A
2. SQL query
INSERT INTO Table A
SELECT Column 1, Column 2
FROM Table B
WHERE $CONDITIONS
3. Frequency = Daily
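
The instruction above can be sketched in a few lines of Python. This is only an illustration, not our in-house scheduler: the `ScheduledJob` class, the table and column names, and the `column_2 > 0` filter standing in for `$CONDITIONS` are all hypothetical, and an in-memory SQLite database replaces the real storage layer (so the "text" storage format is not modeled).

```python
import sqlite3
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class ScheduledJob:
    """Hypothetical job description mirroring the three parts of the instruction."""
    table: str                  # 1. target table name
    schema: Dict[str, str]      # 1. column name -> data type
    query: str                  # 2. INSERT ... SELECT statement
    frequency: timedelta        # 3. how often the job should run
    last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        # A job that has never run is due; otherwise wait one full period.
        return self.last_run is None or now - self.last_run >= self.frequency

    def run(self, conn: sqlite3.Connection, now: datetime) -> None:
        # Step 1: create the target table if it does not exist yet.
        columns = ", ".join(f"{name} {dtype}" for name, dtype in self.schema.items())
        conn.execute(f"CREATE TABLE IF NOT EXISTS {self.table} ({columns})")
        # Step 2: run the query that fills the table.
        conn.execute(self.query)
        conn.commit()
        self.last_run = now


# Assumed source table standing in for "Table B".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_b (column_1 TEXT, column_2 INTEGER)")
conn.executemany("INSERT INTO table_b VALUES (?, ?)", [("a", 1), ("b", -2), ("c", 3)])

job = ScheduledJob(
    table="table_a",
    schema={"column_1": "TEXT", "column_2": "INTEGER"},
    query=(
        "INSERT INTO table_a "
        "SELECT column_1, column_2 FROM table_b WHERE column_2 > 0"
    ),
    frequency=timedelta(days=1),  # Frequency = Daily
)

now = datetime(2020, 1, 1, 6, 0)
if job.is_due(now):
    job.run(conn, now)

rows = conn.execute(
    "SELECT column_1, column_2 FROM table_a ORDER BY column_1"
).fetchall()
print(rows)
```

In a real scheduler, a loop or a cron-like service would call `is_due` for every registered job and handle retries and failures, which this sketch leaves out.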

I have my daily meeting with my teammates at 10:30 every morning. We tell each other what we did the day before, what we are going to do that day, and whether anything is blocking us. We try to speak concisely to save everybody’s time because we hate long meetings. I find these short meetings helpful because they maintain a minimum of interaction among all team members.

I come back to my daily tasks by doing some code reviews. Your code needs to pass your peers’ review before it is merged into production. They can approve it, reject it, or adjust parts of it. I find it hard to assess somebody’s code without talking to them in person, so I usually walk over to their desk and make sure we’re on the same page. You don’t want to become an "approval slut": an engineer who always endorses their teammates’ work without examining it. Appraise their code the same way you want yours to be evaluated.

Photo by Kevin Ku on Unsplash

At noon, my team has lunch together in the company cafeteria. We chat about life, jobs, and technology issues. At the table, everyone can discuss anything without pretense. Our lunch break does not last long, and we soon gather in the kitchen for another coffee, my third one.

Around 2 P.M., I resume my workday with some software development. We data engineers still work on data-related pieces of software. I read SQL queries, run tests in the Hadoop ecosystem, verify that the change won’t put too much pressure on the storage, and finally commit it. Development can get stuck for all kinds of reasons: the code won’t compile, the production and test environments don’t match, or there are typos in the queries. I refill my cup of coffee. I have lost count of them.

Later in the afternoon, I typically get a request to perform a production release. We may want to update tables, refresh the queries, or kick off a brand new pipeline. Working at an engineering-focused company, I am happy to see that we automate every software procedure. I merely have to press some buttons and let the scripts handle the whole thing. If things go well, we’ll have a new version of the data system up and running.

6 P.M., time to wrap up. I make sure the data schedulers are still running as expected, the monitoring dashboard shows no warnings, and no one has complained about missing or incorrect data. We collaborate with co-workers all around the world, so I am not surprised to receive error messages in the middle of the night. For now, all seems fine, so I grab my computer and say goodbye to my colleagues.

Photo by Lilly Rum on Unsplash

After turning off all work-related notifications, I enjoy the last moments of the day reading a non-technical book, which concludes a typical day in the life of a Big Data engineer.

