
Learning Data Science is like learning how to play a musical instrument – you must develop good habits and get the foundations straight to succeed.
Just as a musician must practice scales, arpeggios, and rhythm exercises before playing concertos, a data scientist needs to ingrain key practices to develop their potential.
Avoiding detrimental habits and cultivating productive ones allows you to shift your mental focus from the mechanics to the artistry of your work.
Developing data science habits like using virtual environments and tracking experiments transforms your workflow from a struggle to a smooth-flowing creative process.
In this article, we’ll explore six everyday bad habits that can quietly undermine your effectiveness as a data scientist and provide tips to help boost your productivity.
Using the system interpreter
A virtual environment is a siloed Python installation separate from your system environment. It lets you install packages and libraries for a specific project without affecting your system Python setup. Neglecting to use virtual environments can lead to dependency hell.
For example, in one of my first data science projects, I was building a machine learning model for image classification. I installed TensorFlow 2.0 globally to get started. A few weeks later my colleague gave me some code that required TensorFlow 1.x. Installing this caused all kinds of conflicts with my first project’s dependencies! I spent hours debugging before realizing I should have used virtual environments to avoid this mess. I couldn’t get the inherited code working until I set up a virtual environment to match my colleague’s original setup.

A virtual environment neatly sidesteps this issue by giving each project its own sandboxed space. Each environment has a dedicated Python interpreter, pip, and set of libraries.
You can install libraries safely, knowing they won’t affect other projects. No more worrying about breaking your Python installation!
Tools like Anaconda and virtualenv make creating and managing virtual environments a breeze. Spending a minute to activate an environment for each new project avoids hours of frustration down the line. It’s one habit that offers enormous time savings.
To create a virtual environment with conda, use the following command.
conda create -n env_name python=3.x
And then, activate the environment.
conda activate env_name
These two lines of code will save you hours of time and free you from dependency hell.
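If you don’t use conda, Python’s built-in venv module offers the same protection. A minimal sketch:
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install pandas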
Over-Relying on Notebooks
Jupyter notebooks are beloved in data science for their intuitive workflow and ability to interweave code, visualizations, and text.
However, using notebooks for everything has considerable downsides for collaboration, reproducibility, and project structure.
Notebooks are challenging to version control and lack features like testing frameworks and linting present in other environments. Simultaneous editing rapidly leads to merge conflicts. It’s also tempting to use notebooks as a scratchpad, leading to disorganized, sprawling documents.

Once past initial exploration and prototyping, shift your workflow to .py scripts and modules to benefit from software engineering best practices.
Move the notebook code into organized functions and files. Use notebooks for presenting findings, establishing narratives, and sharing reproducible results.
Combining notebooks with scripts gives you the best of both worlds.
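For example, a data-cleaning cell can graduate from a notebook into a module. Here is a sketch using hypothetical names and paths:
# src/features.py (hypothetical module extracted from a notebook)
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop duplicate rows and fill missing numeric values with column medians
    df = df.drop_duplicates()
    return df.fillna(df.median(numeric_only=True))
Back in the notebook, "from src.features import clean_data" keeps exploration interactive while the logic becomes testable, lintable, and easy to version control.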
Using the mouse too much
The mouse seems intuitive, but excessive clicking and menu hunting wastes precious time. The keyboard offers much more efficient ways to navigate and manipulate your workspace.
You can use the keyboard to:
- Jump directly to a specific line number instead of scrolling.
- Select and edit multiple lines together rather than line-by-line.
- Comment and uncomment blocks of code.
- Replace multiple occurrences of a word.
- Navigate directly to functions and variables.
- Instantly format messy code instead of manually fixing indentation and spacing.
- Refactor code safely by renaming variables and functions across files in one step.

The keyboard has shortcuts to speed up nearly everything you do daily in the code editor.
Forget the mouse and keep your hands on the keyboard to code faster.
Skipping data versioning
I was building a text classification model and wanted to try different preprocessing methods on the training data. I created several variations, but when I found the best performing version, I realized I had forgotten the exact steps I used, and I couldn’t reproduce the results. It was incredibly frustrating.
Code can be easily version-controlled using Git. But data is often neglected when it comes to versioning best practices. That’s a huge missed opportunity for experimentation and reproducibility.
Versioning your data lets you track how datasets change over time. You are free to experiment, creating many variations to explore how changing the data improves your model, knowing you can quickly revert if needed.
It is important to store metadata, such as dataset descriptions, preprocessing steps, and intended usage, to prevent data decay. Document what each version of your data represents to prevent reproducibility issues.
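A lightweight way to do this is to commit a small metadata file next to each dataset. A hypothetical example:
# data/processed/train_v2.meta.yaml (hypothetical layout)
description: Training data after text normalization and deduplication
source: data/raw/train.csv
preprocessing:
  - lowercased text and stripped punctuation
  - removed duplicate rows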
Version control gives you the confidence to enhance datasets iteratively and rapidly explore modeling ideas, knowing you have an escape hatch if things go wrong.
My favorite tool is Data Version Control (DVC) because it’s easy to set up and can connect to different cloud storage services, including Google Drive.
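Getting started takes only a few commands (the file path and Google Drive folder ID below are placeholders):
dvc init
dvc add data/raw/train.csv        # creates train.csv.dvc, a small pointer file
git add data/raw/train.csv.dvc data/raw/.gitignore
git commit -m "Version raw training data"
dvc remote add -d storage gdrive://<your-folder-id>
dvc push                          # uploads the data itself to the remote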
Not Tracking Your Experiments
When I started studying data science, I was manually tracking everything. Every detail, from features to model architecture, was scattered across messy spreadsheets with inconsistent file names.
This led to huge confusion and made results very difficult to reproduce. I was wasting time manually recording parameters that experiment tracking tools can log automatically.
Enter experiment tracking. Machine learning experiments involve many moving parts – data samples, feature engineering code, model configurations, performance metrics, etc. An experiment is a configuration of parameters that leads to a specific result, usually captured during a training run.

MLflow, Comet, and Weights & Biases (W&B) provide handy ways to log all the pieces of your ML workflow – code versions, datasets, parameters, and metrics. These systems capture the end-to-end flow in an organized, searchable structure.
Review your experiment tracking reports to identify successes worth pursuing and valuable lessons from failures.
Build on what worked rather than reinventing the wheel each time.
Not Using a Code Assistant
Become an early adopter of AI coding tools to maximize your productivity as a data scientist. The ability to quickly translate ideas into code gives you an edge in bringing data science projects to fruition faster.
GitHub Copilot is an AI tool that suggests whole lines and functions inside your editor while you code. It helps you write code faster by suggesting contextually relevant code snippets in real time. Copilot improves your productivity by reducing time spent on boilerplate code and debugging.
A recent study by Microsoft, GitHub, and MIT researchers investigated the impacts of AI tools on programming efficiency. They designed an experiment utilizing Copilot to test completion times on a coding task between assisted and unassisted developers.
Sampling participants from freelancing platforms, they divided 95 volunteers into treated and control groups. The treated group received a Copilot demo before both groups coded an HTTP server. Performance was measured by success in passing tests and completion time. Remarkably, the Copilot group finished over 55% faster on average than the controls: just 71 minutes compared to 161 minutes!

However, integrating AI tools into your workflow shouldn’t mean mindless copy-pasting of AI-generated code. Don’t use them without first understanding what your code is doing and why. Questions also remain about how reliance on AI affects code quality. Carefully evaluate these tools to ensure they support, rather than replace, human problem-solving and creativity.
Setting up a Data Science project
Let’s walk through setting up a structured project for binary classification on the Titanic dataset, using MLflow and DVC.
First, create a project and Git repo:
mkdir titanic-project
cd titanic-project
git init
Create a Conda environment and install the libraries we’ll need:
conda create -n titanic python=3.11
conda activate titanic
pip install pandas scikit-learn jupyter mlflow dvc
Organize the project structure:
mkdir data models notebooks figures src
cd data
mkdir raw processed
cd ../src
mkdir features models visualization
In src/models/train_model.py, use MLflow to log parameters and metrics. This is a very minimal example that should give you the idea: it tracks a Random Forest model with n_estimators and the cross-validation accuracy.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

def train_random_forest(n_estimators):
    # Load data (assumes the processed, numeric feature set created by
    # build_features.py below; "Survived" is the Titanic target column)
    df = pd.read_csv("data/processed/titanic_transformed.csv")
    X = df.drop(columns=["Survived"])
    y = df["Survived"]

    # Model training with 5-fold stratified cross-validation
    rf = RandomForestClassifier(n_estimators=n_estimators)
    scores = cross_val_score(rf, X, y, cv=StratifiedKFold(5))
    rf.fit(X, y)
    return rf, scores

if __name__ == "__main__":
    n_estimators = 100
    with mlflow.start_run():
        rf, scores = train_random_forest(n_estimators)
        # Log parameter, metric, and model to MLflow
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_metric("cv_accuracy", scores.mean())
        mlflow.sklearn.log_model(rf, "model")
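After a training run, launch the tracking UI from the project root to compare runs (it serves on http://localhost:5000 by default):
mlflow ui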
In src/features/build_features.py, use DVC to version the data:
import os
import pandas as pd

# Feature engineering function
def feature_engineering(df):
    # Placeholder function, add your feature engineering steps here
    return df

# Load the raw dataset (assumes titanic.csv is in data/raw;
# paths are relative to the project root)
titanic = pd.read_csv("data/raw/titanic.csv")

# Apply feature engineering
titanic_transformed = feature_engineering(titanic)

# Save transformed dataset
titanic_transformed.to_csv("data/processed/titanic_transformed.csv", index=False)

# Track both datasets with DVC, then commit the small .dvc pointer files to Git
os.system("dvc add data/raw/titanic.csv")
os.system("dvc add data/processed/titanic_transformed.csv")
os.system("git add data && git commit -m 'Ran feature engineering'")
Now your data versions are tracked alongside model parameters and metrics. Remember to commit to git and DVC every time you change something.
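If an experiment goes wrong, reverting the data is a matter of checking out the matching Git commit and syncing DVC (the commit hash is a placeholder):
git checkout <commit-hash> -- data/processed/titanic_transformed.csv.dvc
dvc checkout data/processed/titanic_transformed.csv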
Modular code, MLflow, and DVC provide end-to-end experimentation and reproducibility. Versioning data and models lets you improve your workflow iteratively.
Once organized, you should have something that looks like this:
├── data
│   ├── raw
│   └── processed
├── models
├── notebooks
│   └── eda.ipynb
├── figures
└── src
    ├── features
    │   └── build_features.py
    ├── models
    │   ├── train_model.py
    │   └── predict_model.py
    └── visualization
        └── visualize.py
Source code and outputs are separated to facilitate tracking and versioning. This is just a template; customize it to your needs.
The Next Level of Productivity
Data science productivity relies not just on technical knowledge but also on developing the proper habits. Avoid common pitfalls like neglecting virtual environments, overusing notebooks, and not tracking experiments. Instead, cultivate practices like leveraging keyboard shortcuts, versioning data, tracking experiments systematically, and offloading repetitive work to AI.
Enjoyed this article? Get weekly data science interview questions delivered to your inbox by subscribing to my newsletter, The Data Interview.
Also, you can find me on LinkedIn.