6 Bad Data Engineering Practices You Shouldn’t Apply as a Data Scientist

And what you should do instead

Photo by Yan Krukov from Pexels

Data science jobs are becoming more competitive, and having both data science and data engineering skills will give you an advantage over data scientists who have only one of them. In order to learn the best practices as a data engineer, I also had to learn the bad ones and deal with the consequences along the way. These are the takeaways from my time as a data engineer that made my life easier when I became a data scientist, and I hope they’ll help you too.


1. Unnecessary infrastructure

The head of the team decided to standardize file names because programmers were apt to use file names that didn’t reflect what the code actually did. Each file name was to be recorded in a lookup table alongside a description of the code’s purpose. We were told to use sequential alphanumeric names such as a1, a2, a3, and so on; each time a new piece of code needed a name, we had to find the last name in the sequence and use the next one.

Thankfully this system never made it to production and was only implemented for a project that was canceled midway. I shudder to think of my days having to find the purpose of a1 in a lookup table.

Takeaway: Don’t overcomplicate data engineering infrastructure. The department head could’ve established a standard file naming convention that was self-explanatory for the code or another protocol that didn’t involve a lookup table.
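A self-explanatory convention can even be enforced automatically. Here’s a minimal sketch, assuming a hypothetical `<domain>_<action>_<object>.py` pattern (the pattern itself is an illustration, not the one from the story):

```python
import re

# Hypothetical convention: <domain>_<action>_<object>.py, e.g. sales_load_orders.py.
# A name like this describes the code's purpose without needing a lookup table.
NAME_PATTERN = re.compile(r"^[a-z]+_[a-z]+_[a-z]+\.py$")

def is_valid_name(filename: str) -> bool:
    """Return True if the file name follows the self-explanatory convention."""
    return bool(NAME_PATTERN.match(filename))

print(is_valid_name("sales_load_orders.py"))  # True: descriptive name passes
print(is_valid_name("a1.py"))                 # False: opaque lookup-style name fails
```

A check like this could run in a pre-commit hook or CI so the convention enforces itself instead of relying on a shared table.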

2. Difficult debugging

My team inherited legacy code that generated code on the fly and deleted it after execution. The unfortunate task of debugging this code fell to me. It was a nightmare: to troubleshoot a problem, I first had to edit the program to write the generated code out to disk so it wouldn’t disappear before I could inspect it.

Takeaway: Don’t make code harder to troubleshoot for yourself and others. Add comments when necessary to explain what the code is doing. Format the code to make it easier to read. This will reduce troubleshooting time and free up time to work on tasks more interesting than debugging (unless debugging is actually what you love to do all day).
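The fix I had to retrofit can be built in from the start: persist every generated snippet before executing it. A minimal sketch, assuming the generated code is a Python string (the log directory and function names are illustrative):

```python
import tempfile
from pathlib import Path

# In practice this would be a persistent, dated log directory, not a temp dir.
GENERATED_LOG = Path(tempfile.mkdtemp())

def run_generated_code(code: str) -> dict:
    """Write dynamically generated code to disk before executing it,
    so the source survives for debugging even if execution fails."""
    snapshot = GENERATED_LOG / f"snippet_{len(list(GENERATED_LOG.glob('*.py')))}.py"
    snapshot.write_text(code)

    namespace: dict = {}
    exec(code, namespace)  # execute only after the snapshot is safely on disk
    return namespace

ns = run_generated_code("result = 2 + 2")
print(ns["result"])  # 4
```

If a run fails, the exact code that ran is still sitting in the log directory, so you debug the snapshot instead of editing the generator.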

3. Lack of data QA

As a data engineer, I felt I didn’t know enough about the data to verify the quality. Analysts would often ask us to fix common issues that could’ve been caught with a few QA checks before the data engineer passed the data to analysts for review. This increased the time for analysts to sign off before we could put the ETL changes into production.

Takeaway: Establish a list of common QA checks for a data engineer or yourself to review before putting ETL changes into production. Two checks we established were to look for duplicates and missing values in primary key fields; these alone reduced the number of QA issues analysts came back with.
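Those two checks are easy to automate. Here’s a minimal sketch using sqlite3 as a stand-in database (the table and column names are illustrative):

```python
import sqlite3

def qa_check(conn: sqlite3.Connection, table: str, key_col: str) -> list:
    """Run two basic QA checks on a primary key column:
    duplicate keys and missing (NULL) keys."""
    issues = []
    dup_groups = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT {key_col} FROM {table} "
        f"GROUP BY {key_col} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    if dup_groups:
        issues.append("duplicate keys")
    null_rows = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {key_col} IS NULL"
    ).fetchone()[0]
    if null_rows:
        issues.append("missing keys")
    return issues

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (2, 20.0), (None, 5.0)])
print(qa_check(conn, "orders", "order_id"))  # ['duplicate keys', 'missing keys']
```

Running a checklist like this before handing data to analysts catches the cheap problems so their review time goes to the hard ones.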

4. No backups

When I first became a data engineer, we didn’t have source control. If you accidentally deleted a file, or changed code and wanted to revert it, your only option was to ask IT to pull a copy from the backup tapes. This delayed our work because we had to wait for IT to retrieve the file.

An analyst once changed data in a table we didn’t have a backup for. We only had the changelog and I had to manually update the table back to the original values. Needless to say, this took time that wouldn’t have been necessary if there had been a backup.

Takeaway: Before a major code change, make sure the latest version is saved in source control or just copy the original files into a backup folder. If you decide to modify data in the table, copy the data you’re planning to change into another table as a backup in case something goes wrong.
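Backing up just the rows you’re about to change takes one statement. A sketch of the idea with sqlite3 (table names and the dated-backup naming scheme are illustrative):

```python
import sqlite3
from datetime import date

def backup_rows(conn: sqlite3.Connection, table: str, where_clause: str) -> str:
    """Copy the rows about to be modified into a dated backup table,
    so the original values can be restored if something goes wrong."""
    backup_table = f"{table}_backup_{date.today():%Y%m%d}"
    conn.execute(f"DROP TABLE IF EXISTS {backup_table}")
    conn.execute(
        f"CREATE TABLE {backup_table} AS SELECT * FROM {table} WHERE {where_clause}"
    )
    return backup_table

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "active"), (2, "inactive")])

# Back up the rows first, then it's safe to run the UPDATE.
backup_name = backup_rows(conn, "customers", "status = 'inactive'")
conn.execute("UPDATE customers SET status = 'archived' WHERE status = 'inactive'")
```

If the update turns out to be wrong, the original values are still in the backup table and can be restored with a single `UPDATE … FROM` or insert-back.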

5. Not deleting original data before incremental updates

As a data engineer, there were typically two kinds of table updates we had pipelines for in production. The first was a full refresh where all table records were deleted and a new set of records were inserted with the latest data. The second was an incremental update where data was updated for the previous day or a specified time period such as the last seven days.

When pipelines were rerun, duplicate records were inserted for tables with incremental updates because the ETL didn’t delete data for the same time period being updated before inserting. This caused many issues with reporting and downstream processes especially when it wasn’t caught until later on.

Takeaway: Always add code to delete records for the same period before inserting into tables with incremental updates. This will prevent duplicates and incorrect reporting if the duplicates aren’t caught early.
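The delete-then-insert pattern makes the load idempotent: rerunning it produces the same result as running it once. A minimal sketch with sqlite3 (the `daily_sales` table and date range are illustrative):

```python
import sqlite3

def incremental_update(conn: sqlite3.Connection, rows: list,
                       start_date: str, end_date: str) -> None:
    """Idempotent incremental load: delete the target period first,
    then insert, so reruns never create duplicate records."""
    with conn:  # one transaction: delete and insert succeed or fail together
        conn.execute(
            "DELETE FROM daily_sales WHERE sale_date BETWEEN ? AND ?",
            (start_date, end_date),
        )
        conn.executemany("INSERT INTO daily_sales VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, amount REAL)")
rows = [("2022-01-01", 100.0), ("2022-01-02", 250.0)]

incremental_update(conn, rows, "2022-01-01", "2022-01-02")
incremental_update(conn, rows, "2022-01-01", "2022-01-02")  # rerun: no duplicates
print(conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0])  # 2
```

Wrapping the delete and insert in one transaction also means a failed rerun can’t leave the table half-updated.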

6. Deploy it and forget it

You might think there’s no need to check on code that’s passed QA and been deployed to production, but that’s where you’d be wrong. Processes that run in development with sample files and development databases often don’t mimic the real world, and those differences aren’t accounted for in the code. I’ve had pipelines fail in production even though they passed QA because I didn’t consider a use case that didn’t exist in the development environment.

Takeaway: Always check on the data output of an ETL pipeline after it’s been deployed in production, especially the first couple of days to confirm everything is as expected. You don’t want to realize something is wrong when you actually need to use the data.
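The post-deploy check can itself be a small script run for the first few days. A sketch of two basic sanity checks, non-empty output and fresh data, using sqlite3 (table, column, and the one-day staleness threshold are illustrative assumptions):

```python
import sqlite3
from datetime import date, timedelta

def post_deploy_check(conn: sqlite3.Connection, table: str, date_col: str) -> list:
    """Sanity-check a freshly deployed pipeline's output: the table should
    contain rows, and the most recent date should not be stale."""
    problems = []
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if row_count == 0:
        problems.append("table is empty")
    else:
        latest = conn.execute(f"SELECT MAX({date_col}) FROM {table}").fetchone()[0]
        yesterday = (date.today() - timedelta(days=1)).isoformat()
        if latest < yesterday:  # ISO dates compare correctly as strings
            problems.append(f"latest data is stale: {latest}")
    return problems

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, amount REAL)")
conn.execute("INSERT INTO daily_sales VALUES (?, ?)", (date.today().isoformat(), 99.0))
print(post_deploy_check(conn, "daily_sales", "sale_date"))  # []: fresh data, no problems
```

A few minutes spent reviewing a report like this each morning after a deploy is far cheaper than discovering bad data when someone actually needs it.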


Final Thoughts

Whether you’re a data scientist who works with data engineers or one who spans both roles, knowing the bad practices will also help you learn the good ones. I wasn’t able to escape these bad practices completely unscathed, but I hope knowing them makes your job as a data scientist a little easier.
