
If you’re a data scientist, at one point in your career you’ll have to troubleshoot ETL issues. If you’re new to ETL troubleshooting and unclear on the best place to start, these are the common causes of issues I’ve encountered as a data engineer and how to deal with them.
1. Is all the source data available?
A typical ETL job loads source data either from raw files or extracts data from another system into a staging table. In the example ETL pipeline below, three data files are transformed, loaded into a staging table, and finally aggregated into a final table. A common issue for ETL failures is missing data files for the latest day’s run.
How to handle this:
If the data comes from an external source, check with the provider and confirm whether the files are running late. If the data is internal, such as application events or company website activity, confirm with the responsible team whether there were issues that could have caused delayed or missing data. Once you have the missing data, your ETL issue is resolved. If the data is delayed, you may need to manually rerun the ETL once it becomes available.
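Before reaching out, a quick script can tell you exactly which of today's expected files never arrived. This is a minimal sketch: the landing directory, file names, and date-partitioned layout are assumptions you'd swap for your own pipeline's conventions.

```python
from pathlib import Path
from datetime import date

# Hypothetical landing directory and expected file names -- swap in your own.
LANDING_DIR = Path("/data/landing")
EXPECTED_FILES = ["events.csv", "users.csv", "orders.csv"]

def missing_source_files(run_date: date) -> list[str]:
    """Return the expected files that have not arrived for run_date,
    assuming a /data/landing/YYYY-MM-DD/ directory layout."""
    day_dir = LANDING_DIR / run_date.isoformat()
    return [name for name in EXPECTED_FILES if not (day_dir / name).exists()]

missing = missing_source_files(date.today())
if missing:
    print(f"Delayed or missing source files: {missing}")
```

If nothing is missing, move on to the next check.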

2. Did the upstream ETL dependencies finish running?
You’ll likely be alerted to an ETL issue when a database table is missing data for the latest day. If it’s an aggregate table, you’ll most likely have multiple jobs that need to complete before the table is updated.
How to handle this:
Find the ETL pipeline responsible for updating the table. Using the example ETL pipeline shown above, work backward from the aggregate table to the staging table and confirm each step finished successfully. Check if the data files for the latest day are available if you didn’t already verify this in step one. Once you’ve identified the stage that failed, rerun from that point to confirm the tables are updated downstream.
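The backward walk can be automated if you can query each table's latest load date. Below is a hedged sketch with hard-coded values; in practice, the `latest_load` dates would come from something like `SELECT MAX(load_date)` against each table, and the stage names are hypothetical.

```python
from datetime import date

# Hard-coded example values -- in practice, query each table for its
# MAX(load_date). Stage names here are hypothetical.
latest_load = {
    "staging_events": date(2024, 6, 2),
    "daily_agg": date(2024, 6, 1),  # one day behind: this step failed
}
pipeline_order = ["staging_events", "daily_agg"]  # upstream first

def first_stale_stage(expected: date):
    """Walking the pipeline in order, return the first stage whose data
    is older than the expected run date -- the point to rerun from."""
    for table in pipeline_order:
        if latest_load[table] < expected:
            return table
    return None
```

Here `first_stale_stage(date(2024, 6, 2))` points at `daily_agg` as the stage to rerun.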
3. Were the data files loaded into the database completely?
Occasionally, data is not fully loaded into the database for reasons I can’t fathom 😫.
How to handle this:
In these cases, compare the record count of the raw files with the staging table to verify whether the load is incomplete. I normally run this check when the row count between yesterday’s and today’s data changes by more than the historical average percentage. A rerun of the ETL pipeline from the beginning to reload the raw files should fix the issue.
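Both checks can be expressed as small helpers. This is a sketch under assumptions: the equality test presumes a one-to-one load with no intentional filtering, and the trigger heuristic follows the average-percentage rule described above.

```python
def load_is_complete(file_rows: int, staging_rows: int) -> bool:
    """The staging table should hold exactly as many rows as the raw file
    (assuming the load does no filtering)."""
    return file_rows == staging_rows

def change_exceeds_norm(today_rows: int, yesterday_rows: int,
                        past_pct_changes: list[float]) -> bool:
    """True when today's day-over-day change is larger than the average
    historical percentage change -- the trigger for the count comparison."""
    pct_change = abs(today_rows - yesterday_rows) / yesterday_rows * 100
    return pct_change > sum(past_pct_changes) / len(past_pct_changes)
```

If `change_exceeds_norm` fires and `load_is_complete` is false, reload the raw files.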
4. Does the data file contain unexpected values?
If a rerun of the ETL continues to fail, it’s time to examine the log to check whether it shows the same error message as the original failure or a new one. If you’re lucky, the log file may indicate which row number or field in the file caused the loading error.
How to handle this:
The most common unexpected values I’ve encountered are characters in a numeric field, non-ASCII characters, and field values that exceed the field length defined in the table schema. How you deal with the bad data depends on your database and the tools available, but here are a few options.
For non-ASCII characters, you can delete the bad rows from the raw file or remove them from the files with terminal commands and rerun the ETL job.
For data exceeding the field length, you can alter the table schema to increase the field length if you have database admin privileges.
For characters in a numeric field, you can either increase the maximum number of rows to skip because of bad data or create a new staging table with a character instead of a numeric field to clean or filter the bad data before loading.
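All three cases can be screened with a single row-level filter before loading. This is only a sketch: the field names, schema lengths, and the decision to drop rather than repair bad rows are assumptions you'd adapt to your own table.

```python
def row_is_clean(row: dict, max_lengths: dict, numeric_fields: set) -> bool:
    """Screen a parsed row for the three problems above. max_lengths maps
    field name -> schema length; numeric_fields names the numeric columns."""
    for field, value in row.items():
        if not value.isascii():
            return False  # non-ASCII characters
        if field in numeric_fields:
            try:
                float(value)
            except ValueError:
                return False  # stray characters in a numeric field
        if len(value) > max_lengths.get(field, float("inf")):
            return False  # exceeds the defined field length
    return True
```

Rows that fail the filter can be logged and reviewed instead of silently skipped.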
5. Is the data file incomplete?
Check if the row count for the data file is similar to past files. If it’s lower than expected, the data may be incomplete for that day. This can happen due to outages where events stopped emitting to the backend or scheduled system maintenance.
How to handle this:
Check with the data provider to confirm whether the data is truly missing or a new file with all the data needs to be sent. If a new file is needed, rerun the ETL after the data is updated.
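A simple threshold against the trailing average makes the “lower than expected” judgment concrete. The 80% cutoff below is an assumed starting point, not a standard; tune it to your data’s normal day-to-day variance.

```python
import statistics

def file_is_suspiciously_small(today_rows: int, past_rows: list[int],
                               threshold: float = 0.8) -> bool:
    """Flag a file whose row count falls well below the recent average.
    The 0.8 cutoff is an assumption -- tune it to your data."""
    return today_rows < threshold * statistics.mean(past_rows)
```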
6. Does the data file contain the expected format?
Data formats can change unexpectedly, especially with external data. Compare the delimiters and field order the ETL code expects against the latest file format.
How to handle this:
Confirm with the data provider whether the format has changed. If not, wait for a corrected file and rerun the existing ETL. If it has, modify the production code yourself or create a ticket for Data Engineering to handle the new format.
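The delimiter and field-order comparison can be done programmatically with Python's `csv.Sniffer`. The expected delimiter and field names below are hypothetical stand-ins for whatever your ETL code assumes.

```python
import csv
import io

EXPECTED_DELIMITER = ","                      # what the ETL code assumes
EXPECTED_FIELDS = ["user_id", "event", "ts"]  # hypothetical field order

def format_matches(sample: str) -> bool:
    """Sniff the delimiter and header of a file sample and compare them
    to what the ETL expects."""
    dialect = csv.Sniffer().sniff(sample, delimiters=",|;\t")
    header = next(csv.reader(io.StringIO(sample), dialect))
    return dialect.delimiter == EXPECTED_DELIMITER and header == EXPECTED_FIELDS
```

Run it on the first few lines of the latest file before kicking off a full rerun.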
7. Are there duplicates?
This doesn’t happen often, but duplicates can make their way into your data and cause a big headache.
How to handle this:
Using the same example ETL pipeline above, work backward and check for duplicates in the aggregate table, the staging table, and then the data files. Remove the duplicates and rerun the ETL from that point.
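A duplicate check over the key columns can be sketched like this; the SQL equivalent is `GROUP BY key HAVING COUNT(*) > 1`. The key tuples are illustrative.

```python
from collections import Counter

def find_duplicate_keys(rows):
    """Return any key tuples that appear more than once. `rows` stands in
    for the key columns pulled from a table or file."""
    counts = Counter(rows)
    return [key for key, n in counts.items() if n > 1]
```

Run it at each stage, working backward, to find where the duplicates first appear.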
Final Thoughts
When I first became a data engineer I had no idea how to troubleshoot ETL issues. After much trial and error, I came up with a list of common causes to check before moving on to more unusual reasons. This list has accounted for the majority of my problems in the past, and now that you know what to look for, I hope many of your issues will be resolved too.