When I first became a data engineer, I was lucky (or unlucky, depending on how you look at it) to have a manager who wrote development guidelines outlining database table design and file naming conventions. It even went as far as setting standards for coding syntax. I didn’t appreciate the restrictions at the time, but years later I was grateful to have learned these best practices that helped me as a data scientist.
1. File names
As a data engineer, I had to troubleshoot ETL code written by other data engineers. We didn’t use Git for source control and having file-naming conventions made it easy to find the source code that needed debugging.
I’ve since been in companies where file names were all over the map and code was decentralized across multiple locations. The disorganization caused delays in troubleshooting that could’ve been prevented.
How you can apply this as a data scientist:
- Standardize your file names to make them easy to find for yourself and others. For example, a Python script that cleans and aggregates data for a customer churn model can be named customer_churn_data_prep.py. I built more than one hundred models as a data scientist. Having standard naming conventions helped me find my model code quickly when a stakeholder had questions.
- Prefix file names with numbers to indicate the run order. If you need to run SQL queries, process the data with Python, and then train the model, you can have three files: 1_customer_churn_pull_data.sql, 2_customer_churn_agg_prep.py, and 3_customer_churn_train.py. I’ve put models into production with six steps before scoring, and the numerical prefixes helped me order the ETL process correctly. A minimal run-order sketch follows below.
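As a rough illustration, here is a minimal Python sketch that picks up numbered Python steps and runs them in prefix order. The file pattern and the plain subprocess calls are assumptions for the example, not part of any required convention, and a SQL pull step would run through your database client instead:

```python
# A minimal sketch: run the numbered Python steps of a pipeline in prefix order.
# The file pattern is illustrative -- adjust it to your own naming convention.
import glob
import subprocess

# Sort by the numeric prefix so the steps run in the intended order
steps = sorted(
    glob.glob("[0-9]*_customer_churn_*.py"),
    key=lambda f: int(f.split("_")[0]),
)

for step in steps:
    print(f"Running {step}")
    subprocess.run(["python", step], check=True)  # stop if any step fails
```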
2. Table names
As data engineers, we followed table naming standards to make data easy for our users to find in the absence of any documentation.
For example, all email-related tables started with email followed by the type of data. A table with email open and click data was named email_open_click.
Aggregate tables derived from raw data, such as aggregated email opens and clicks by email campaign, had an agg suffix, for example email_campaign_open_click_agg.
View names started with vw and lookup tables started with lookup.
Temporary tables started with temp followed by the user’s first initial and last name, i.e. temp_vyu, to make it easy to find the table owner if we needed to delete tables when the database was close to maximum storage capacity.
All of these naming standards reduced the number of questions to data engineers because the table names were self-explanatory and helped with maintenance when we needed to delete old tables.
How you can apply this as a data scientist:
- Agree on table naming standards with data engineers to make it easy for you and other data scientists to figure out the table contents. If you also function as the data engineer, decide on naming conventions to make it easier for yourself and others to find the right data.
- If you create temporary tables in the database, use a temp_ prefix even if it’s not required, because one day you may be asked to help delete old tables in the database and this will make it easy for you to create a list quickly (see the sketch below).
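As a rough sketch of how that list might be pulled together, the query below assumes a Postgres-style information_schema; the connection details and example table names are placeholders, so adapt them to your own database and client:

```python
# A minimal sketch, assuming a Postgres-style information_schema.
# The connection string is a placeholder -- swap in your own credentials/client.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=me")  # placeholder connection

query = r"""
    SELECT table_name
    FROM information_schema.tables
    WHERE table_name LIKE 'temp\_%' ESCAPE '\'
    ORDER BY table_name;
"""

with conn.cursor() as cur:
    cur.execute(query)  # every table that follows the temp_ prefix convention
    temp_tables = [row[0] for row in cur.fetchall()]

print(temp_tables)  # e.g. ['temp_vyu_email_backfill', ...]
```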
3. Column names
Not only should table names be self-explanatory, but column names should be as well. When I first started as a data engineer, I worked on an ETL rewrite of an entire database. This allowed us to rename identification fields so they had the same name across all tables. For example, a customer ID field can be named custID, cust_id, customer_id, and so on depending on your creativity. Having the same name across tables meant anyone querying customer ID automatically knew the field name without having to look at the table schema.
How you can apply this as a data scientist:
- Work with data engineering (or, if you’re also the data engineer, decide yourself) to use the same field names across tables for common identification fields such as customer ID and email address. This will make the fields self-explanatory and easy to find across tables.
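Here is a tiny illustration of the payoff, with made-up data and column names standing in for two tables that share the same key:

```python
# Toy data standing in for two tables that share the same key name.
import pandas as pd

customers = pd.DataFrame(
    {"customer_id": [1, 2, 3],
     "signup_date": ["2023-01-05", "2023-02-11", "2023-03-20"]}
)
email_open_click = pd.DataFrame(
    {"customer_id": [1, 3],
     "open_count": [4, 7]}
)

# Because both tables use customer_id, no one has to guess whether the key
# is custID, cust_id, or customer_id before writing the join.
merged = customers.merge(email_open_click, on="customer_id", how="left")
print(merged)
```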
4. Code changes
I’ve worked in companies without source control. In those cases, I would save a copy of the code in another location, along with a readme.txt file noting the date and version of the code I was saving, before making further edits.
This practice saved me a huge amount of time as a data engineer.
Once I accidentally deleted a piece of code I had spent the day updating, but luckily I had saved an older version and didn’t need to rewrite it from scratch.
Another time I pushed code to production and it crashed, even though I had no issues during testing. I couldn’t figure out the problem until I compared the new code to my backup and found the cause.
How you can apply this as a data scientist:
- Make a habit of backing up your code as you make changes if you don’t have or don’t want to use source control. Having an older version to compare with can help you find issues if you experience unexpected errors with your current code.
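If you do go the manual route, a small helper like the sketch below keeps dated copies you can diff against later; the script and folder names are just examples:

```python
# A minimal sketch of the manual backup habit when you have no source control.
# The script and backup folder names are illustrative -- use your own.
import shutil
from datetime import date
from pathlib import Path

src = Path("customer_churn_data_prep.py")
backup_dir = Path("backups")
backup_dir.mkdir(exist_ok=True)

# Keep a dated copy so you can diff against it later if something breaks
dst = backup_dir / f"{src.stem}_{date.today():%Y%m%d}{src.suffix}"
shutil.copy2(src, dst)
print(f"Backed up {src} to {dst}")
```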
5. QA
Each time I made an update to an existing ETL job, I had to compare the new output to the original one to confirm that only the expected values changed. We caught many issues before they reached our end users. This forced me to learn how to look at the data and find issues, a habit I still benefit from today.
How you can apply this as a data scientist:
- Always compare results if you are making changes to an existing process. You never know what unexpected issues you’ll catch. It’s not worth your reputation to deliver wrong results.
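Here is a rough sketch of what that comparison can look like; the file names, key column, and score column are placeholders for your own outputs:

```python
# A rough sketch of comparing old vs. new output before releasing a change.
# File and column names are placeholders -- point them at your own outputs.
import pandas as pd

old = pd.read_csv("churn_scores_old.csv")
new = pd.read_csv("churn_scores_new.csv")

# Align the two runs on the shared key, then flag rows whose values changed
both = old.merge(new, on="customer_id", suffixes=("_old", "_new"))
changed = both[both["churn_score_old"] != both["churn_score_new"]]

print(f"{len(changed)} of {len(both)} rows changed")
print(changed.head())
```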
6. Production
As a data engineer, I always wrote code expecting it would eventually go into production. We rarely had ad hoc tasks that didn’t end up having to be automated.
How you can apply this as a data scientist:
- Keep a production mindset and write code expecting it will one day be automated. Don’t hardcode values; use variables or parameters where possible, keeping in mind you may have to rerun the code at a later point with different values (see the sketch after this list).
- Be aware of package and software versions if your model needs to go into production. I once built a model in Python 3.7, but production only had Python 2.7 installed, so I had to spend extra time retraining my model on the lower version before deploying it. Another data scientist on my team needed a Python package that wasn’t available and had to wait for IT to install it before the model could go into production. If your model is on a deadline, account for potential delays from missing packages and mismatched Python versions ahead of time.
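Here is a minimal sketch of the "no hardcoded values" idea from the first point above, using command-line parameters so the same script can be rerun later with different dates; the argument names and defaults are illustrative:

```python
# A minimal sketch of parameterizing a run instead of hardcoding values.
# Argument names and defaults are illustrative -- adapt them to your own job.
import argparse

parser = argparse.ArgumentParser(description="Score the customer churn model")
parser.add_argument("--start-date", required=True, help="e.g. 2023-01-01")
parser.add_argument("--end-date", required=True, help="e.g. 2023-01-31")
parser.add_argument("--output-path", default="churn_scores.csv")
args = parser.parse_args()

print(f"Scoring customers from {args.start_date} to {args.end_date}")
# ...pull data, score the model, and write results to args.output_path...
```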
Now that you know how to apply data engineering best practices as a data scientist, I hope this helps you as much as it helped me.