You’re a data scientist. As data science matures, software engineering practices are creeping in. You are forced to venture out of your local Jupyter notebooks and meet other data scientists in the wild to build a great product.
To help you out with this grand mission, you can rely on Git, a free and open-source distributed version control system, to keep track of what everyone is coding.
Table of Contents
1. Git commands for setting up a remote repository
2. Git commands for working on a different branch
3. Git commands for joining in collaboration
4. Git commands for coworking
5. Resolving merge conflicts
Wrapping Up
To be more concrete, let’s work with an actual project (see the end product here). And to minimize the hassle of creating one, we’ll use the famous Cookiecutter Data Science template. Install cookiecutter and create a project template locally.
Fill in the prompt accordingly. In our case, it’s as follows.
project_name [project_name]: Data Science Project Example
repo_name [example_project_name_here]: ds-project-example
author_name [Your name (or your organization/company/team)]: Albers Uzila
description [A short description of the project.]: A simple data science project, template by cookiecutter
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 (1, 2, 3) [1]: 1
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 (1, 2) [1]: 1
Change your working directory to the ds-project-example folder, for example by running cd ds-project-example.
1. Git commands for setting up a remote repository
You now have a local project in ds-project-example. You need to push your local project to GitHub to collaborate with other data scientists.
To do that, initialize an empty Git repo using git init. You can confirm the repo is ready by observing that there is a hidden folder named .git in your working directory or by running git status.
Your local:
⬀ main*
As you can see, you’re working on a branch called main and have many files that Git does not track yet. You can use git add . to add all of these files to the index, also known as the "staging area" between the files you have in your working directory and your commit history.
To record changes in the index to the local repo, use git commit. Add a message like "Set up repo with cookiecutter".
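The steps so far can be sketched as follows. This is a minimal sketch that runs in a throwaway directory instead of your real project; the scratch path, the two placeholder files, and the user name and email are assumptions made only so the commands can run anywhere.

```shell
# Scratch copy of the project so the sketch is safe to run anywhere (assumption).
scratch=$(mktemp -d)
mkdir -p "$scratch/ds-project-example/src"
cd "$scratch/ds-project-example"
echo "# Data Science Project Example" > README.md   # placeholder content
touch src/__init__.py

git init                                  # creates the hidden .git folder
git symbolic-ref HEAD refs/heads/main     # name the first branch main, whatever Git's default is
git config user.name "Example User"       # placeholder identity, stored in this repo only
git config user.email "user@example.com"

git status                                # the files are listed as untracked
git add .                                 # stage all files in the index
git commit -m "Set up repo with cookiecutter"
git status                                # nothing left to commit
```

In your own project you would skip the scratch setup and run the git commands from inside ds-project-example.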
Your local:
⬀───⬀ main*
Now, create a remote repo at https://github.com/new and name it ds-project-example. Before pushing the local repo to remote, you need to register the remote repo in your local repo using the git remote add command.
The git remote add command takes two arguments:
- A remote name, for example, origin
- A remote URL, in our case, https://github.com/dwiuzila/ds-project-example.git
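After running the git remote add command, the remote named origin is recorded in .git/config, and you can list it with git remote -v. Once you have pushed, you will also see in the .git/refs folder that you have local branches under heads and remote-tracking branches under remotes/origin.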
Now, to push commits made on your local branch to the remote repo, use git push. This command takes two arguments:
- A remote name, for example, origin
- A branch name, for example, main
To summarize:
Your local:
⬀───⬀ main*
      origin/main
Remote:
⬀───⬀ main
The -u flag in git push sets the branch you are pushing to (origin/main) as the upstream, or remote-tracking, branch of the branch you are pushing (main), so Git knows what you want to do when you push or pull in the future.
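Here is a minimal sketch of adding the remote and pushing with -u. To keep it runnable offline, a local bare repository stands in for the empty GitHub repo; that stand-in, the scratch paths, and the placeholder identity are assumptions, not part of the original setup.

```shell
# Scratch area and placeholder identity (assumptions so the sketch runs anywhere).
scratch=$(mktemp -d)
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Stand-in for the empty repo created at https://github.com/new.
git init -q --bare "$scratch/remote.git"

# A local repo with one commit on main, as in the steps above.
git init -q "$scratch/ds-project-example"
cd "$scratch/ds-project-example"
git symbolic-ref HEAD refs/heads/main
echo "# Data Science Project Example" > README.md   # placeholder content
git add .
git commit -q -m "Set up repo with cookiecutter"

# With GitHub, the URL would be https://github.com/dwiuzila/ds-project-example.git
git remote add origin "$scratch/remote.git"
git remote -v               # lists origin for fetch and push
git push -u origin main     # push main and remember origin/main as its upstream
git branch -vv              # shows which upstream each local branch tracks
```

After the first push with -u, a plain git push or git pull on main should be enough, since Git already knows the upstream.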
After doing all this, your project is now set up on GitHub:
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
2. Git commands for working on a different branch
Your main branch should represent the stable history of your code. Create other branches to experiment with new things and implement them, and when they have matured enough you can merge them back into main.
Now, to create a new branch from local main and switch to it, use git checkout with the -b flag followed by the new branch name. You can use git branch to see all available branches and which branch you are currently on.
Your local:
⬀───⬀ main
      origin/main
      make_dataset*
Remote:
⬀───⬀ main
You’ve made a new local branch named make_dataset and checked out this branch. After adding some code on make_dataset, you’re ready to add, commit, and push the changes to a new remote branch also called make_dataset, with remote-tracking branch origin/make_dataset. The only change you want to push is in the src/data/make_dataset.py file.
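The branch workflow above might look like the sketch below. As before, a local bare repository stands in for GitHub so it can run offline; the scratch paths, placeholder identity, file contents, and the second commit message are assumptions.

```shell
# Scratch area and placeholder identity (assumptions so the sketch runs anywhere).
scratch=$(mktemp -d)
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Scratch setup: a repo with main already pushed to a stand-in remote.
git init -q --bare "$scratch/remote.git"
git init -q "$scratch/ds-project-example"
cd "$scratch/ds-project-example"
git symbolic-ref HEAD refs/heads/main
mkdir -p src/data
echo "# placeholder script" > src/data/make_dataset.py
git add .
git commit -q -m "Set up repo with cookiecutter"
git remote add origin "$scratch/remote.git"
git push -q -u origin main

# The branch workflow itself.
git checkout -b make_dataset           # create the branch from main and switch to it
git branch                             # the asterisk marks the current branch

echo "print('building dataset')" >> src/data/make_dataset.py   # placeholder edit
git add src/data/make_dataset.py       # stage only the file you want in the commit
git commit -m "Update make_dataset.py" # hypothetical commit message
git push -u origin make_dataset        # creates remote make_dataset, tracked as origin/make_dataset
```

Staging the file by path instead of using git add . keeps anything else in the working directory out of the commit.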
Your local:
⬀───⬀ main
    │ origin/main
    │
    └──⬀ make_dataset*
         origin/make_dataset
Remote:
⬀───⬀ main
    │
    └──⬀ make_dataset
You can now merge remote make_dataset into remote main by first clicking the "Compare & pull request" button on your GitHub repo page, then following the steps.
After successfully merging, you will see something like this.
Your local:
⬀───⬀ main
    │ origin/main
    │
    └──⬀ make_dataset*
         origin/make_dataset
Remote:
⬀───⬀──────⬀ main
    │     │
    └──⬀──┘
3. Git commands for joining in collaboration
You have another contributor for your project. Let’s say his name is Hiro. To get started, Hiro cloned your remote repo using git clone just before you merged remote make_dataset into remote main. He also checked out his own local branch called train_model from the cloned repo.
Your local:
⬀───⬀ main
    │ origin/main
    │
    └──⬀ make_dataset
         origin/make_dataset
Hiro's local:
⬀───⬀ main
      origin/main
      train_model*
Remote:
⬀───⬀──────⬀ main
    │     │
    └──⬀──┘
After adding src/configs/config.py and editing it along with src/models/train_model.py, Hiro generates:
- four trained models in the models directory, and
- a JSON file containing the performance of the ensembled model on the train and validation splits in the reports directory.
Just to make sure, Hiro runs git status.
Just as you did before, Hiro adds, commits, and pushes the changes in his local branch to remote. However, the models directory is not included since the trained models take up a lot of space.
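Hiro's side of the story could be sketched like this. The local bare repository again stands in for GitHub, and the file names inside models and reports, the file contents, and the commit message are hypothetical placeholders.

```shell
# Scratch area and placeholder identity (assumptions so the sketch runs anywhere).
scratch=$(mktemp -d)
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Scratch setup: your repo with main pushed to a stand-in remote.
git init -q --bare "$scratch/remote.git"
git -C "$scratch/remote.git" symbolic-ref HEAD refs/heads/main
git init -q "$scratch/you"
cd "$scratch/you"
git symbolic-ref HEAD refs/heads/main
mkdir -p src/models
echo "# placeholder script" > src/models/train_model.py
git add .
git commit -q -m "Set up repo with cookiecutter"
git remote add origin "$scratch/remote.git"
git push -q -u origin main

# Hiro joins: clone, branch, work, then push everything except models/.
git clone "$scratch/remote.git" "$scratch/hiro"   # on GitHub this would be the repo URL
cd "$scratch/hiro"
git checkout -b train_model
mkdir -p src/configs models reports
echo "n_splits = 5" > src/configs/config.py               # placeholder value
echo "print('training')" >> src/models/train_model.py     # placeholder edit
echo "binary blob" > models/model_1.pkl                   # hypothetical file name
echo "{}" > reports/metrics.json                          # hypothetical file name
git status                       # review what changed before staging
git add src reports              # stage code and report, leave models/ out
git commit -m "Add training config and report"
git push -u origin train_model
git status --short               # models/ is still listed as untracked
```

If the models directory should never be committed, adding it to .gitignore is a common alternative to staging paths by hand.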
Your local:
⬀───⬀ main
    │ origin/main
    │
    └──⬀ make_dataset
         origin/make_dataset
Hiro's local:
⬀───⬀ main
    │ origin/main
    │
    └──⬀ train_model*
         origin/train_model
Remote:
    ┌──⬀ train_model
    │
⬀───⬀──────⬀ main
    │     │
    └──⬀──┘
4. Git commands for coworking
You want to add something to Hiro’s work. However, you have already been busy with another task for a while: moving some parts of the code in src/data/make_dataset.py into src/features/build_features.py. So, let’s talk about that first.
What you did first was pull all changes from remote main into local main using git pull, so that the new branch build_features is checked out from the most recent version of main.
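A minimal sketch of that pull-then-branch step follows. The scratch setup recreates the situation where remote main is one merge ahead of your local main; a second clone plays the role of the pull request merge on GitHub. The paths, identity, and file contents are assumptions.

```shell
# Scratch area and placeholder identity (assumptions so the sketch runs anywhere).
scratch=$(mktemp -d)
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Scratch setup: main and make_dataset pushed to a stand-in remote.
git init -q --bare "$scratch/remote.git"
git -C "$scratch/remote.git" symbolic-ref HEAD refs/heads/main
git init -q "$scratch/you"
cd "$scratch/you"
git symbolic-ref HEAD refs/heads/main
echo "# Data Science Project Example" > README.md
git add . && git commit -q -m "Set up repo with cookiecutter"
git remote add origin "$scratch/remote.git"
git push -q -u origin main
git checkout -q -b make_dataset
echo "# placeholder script" > make_dataset.py
git add . && git commit -q -m "Update make_dataset.py"
git push -q -u origin make_dataset

# Stand-in for merging the pull request on GitHub.
git clone -q "$scratch/remote.git" "$scratch/github"
git -C "$scratch/github" merge -q --no-ff --no-edit origin/make_dataset
git -C "$scratch/github" push -q origin main

# The commands from the paragraph above.
git checkout main
git pull origin main             # bring the merge commit into local main
git checkout -b build_features   # branch off the freshly updated main
git log --oneline --graph -3     # inspect the recent history
```

Pulling before branching means build_features starts from the same commit your collaborators see on remote main.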
Your local:
⬀───⬀──────⬀ main
    │        origin/main
    │        build_features*
    │
    └──⬀ make_dataset
         origin/make_dataset
Hiro's local:
⬀───⬀ main
    │ origin/main
    │
    └──⬀ train_model
         origin/train_model
Remote:
    ┌──⬀ train_model
    │
⬀───⬀──────⬀ main
    │     │
    └──⬀──┘
In the middle of editing the build_features branch, you want to see Hiro’s progress. But you still have 2 files in the branch that haven’t been staged for commit.
So, you stash these changes away from the dirty working directory using git stash. Then you can:
- create a local train_model branch checked out from local main,
- set the upstream of local train_model to origin/train_model so it can track remote train_model, and
- pull from the remote train_model that Hiro has created.
It’s all well and good until a problem appears in the third step above. Since:
- Hiro checked out his local train_model from local main before you merged your remote make_dataset into remote main (see Section 3), and
- you pulled from remote main to your local main so you have the most recent version of main (see the beginning of Section 4),
your local main is more up to date than Hiro’s (it is several commits "ahead"). Hence you need a more elaborate way to pull the remote train_model (hint: git pull is just git fetch followed by git merge).
Your local:
    ┌──────⬀ origin/train_model
    │       ╲
    │        ⬀ train_model*
    │       ╱
⬀───⬀──────⬀ main
    │        origin/main
    │        build_features --> stash
    │
    └──⬀ make_dataset
         origin/make_dataset
Hiro's local:
⬀───⬀ main
    │ origin/main
    │
    └──⬀ train_model
         origin/train_model
Remote:
    ┌──⬀ train_model
    │
⬀───⬀──────⬀ main
    │     │
    └──⬀──┘
Now, after merging the latest local main with your local train_model, you’re ready to push the changes to remote and pop your stashed changes back onto build_features.
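The whole detour (stash, fetch, merge, push, then back to your own work) can be sketched as below. It spells out git pull as an explicit git fetch followed by git merge, which is one way to read the hint above and not necessarily the exact commands used in this story. The local bare repository standing in for GitHub, the single uncommitted file, and all file contents are assumptions.

```shell
# Scratch area and placeholder identity (assumptions so the sketch runs anywhere).
scratch=$(mktemp -d)
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Scratch setup: remote with main, plus Hiro's train_model branched from the old main.
git init -q --bare "$scratch/remote.git"
git -C "$scratch/remote.git" symbolic-ref HEAD refs/heads/main
git init -q "$scratch/you"
cd "$scratch/you"
git symbolic-ref HEAD refs/heads/main
echo "# Data Science Project Example" > README.md
git add . && git commit -q -m "Set up repo with cookiecutter"
git remote add origin "$scratch/remote.git"
git push -q -u origin main
git clone -q "$scratch/remote.git" "$scratch/hiro"
git -C "$scratch/hiro" checkout -q -b train_model
echo "n_splits = 5" > "$scratch/hiro/config.py"            # placeholder value
git -C "$scratch/hiro" add config.py
git -C "$scratch/hiro" commit -q -m "Add training config"
git -C "$scratch/hiro" push -q -u origin train_model
echo "More docs" >> README.md                              # stand-in for the merged make_dataset work
git commit -q -am "Update README" && git push -q origin main
git checkout -q -b build_features
echo "work in progress" >> README.md                       # uncommitted change on build_features

# The detour itself.
git stash                                        # park the uncommitted changes
git checkout -b train_model main                 # local train_model from your up-to-date main
git fetch origin                                 # download Hiro's branch as origin/train_model
git branch --set-upstream-to=origin/train_model  # make it the upstream of local train_model
git merge --no-edit origin/train_model           # the merge half of git pull
git push origin train_model                      # publish the merged result
git checkout build_features
git stash pop                                    # restore the parked changes
```

Because your train_model and Hiro's have diverged, running fetch and merge separately makes it explicit that a merge commit is being created.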
Your local:
    ┌──────⬀
    │       ╲
    │        ╲
    │         ⬀ train_model
    │        ╱  origin/train_model
    │       ╱
⬀───⬀──────⬀ main
    │        origin/main
    │        build_features*
    │
    └──⬀ make_dataset
         origin/make_dataset
Hiro's local:
⬀───⬀ main
    │ origin/main
    │
    └──⬀ train_model
         origin/train_model
Remote:
    ┌──────⬀
    │       ╲
    │        ⬀ train_model
    │       ╱
⬀───⬀──────⬀ main
    │     │
    └──⬀──┘
You create and edit another file, src/configs/config.py, then stage all 3 files, commit, and push to remote.
Your local:
    ┌──────⬀
    │       ╲
    │        ╲
    │         ⬀ train_model
    │        ╱  origin/train_model
    │       ╱
⬀───⬀──────⬀ main
    │      │ origin/main
    │      │
    │      └──⬀ build_features*
    │
    └──⬀ make_dataset
         origin/make_dataset
Hiro's local:
⬀───⬀ main
    │ origin/main
    │
    └──⬀ train_model
         origin/train_model
Remote:
    ┌──────⬀
    │       ╲
    │        ⬀ train_model
    │       ╱
⬀───⬀──────⬀ main
    │     ││
    └──⬀──┘└──⬀ build_features
5. Resolving merge conflicts
After everything has been pushed to remote, we won’t use local repos anymore. So let’s focus on the remote repo. Merge train_model into main.
After opening a pull request and merging train_model into main, here’s what we have so far.
Remote:
    ┌──────⬀
    │       ╲
    │        ⬀ main
    │       ╱
⬀───⬀──────⬀
    │     ││
    └──⬀──┘└──⬀ build_features
Now, merge build_features into main. This time, the two can’t be merged automatically. But don’t worry, you can still create the pull request.
It turns out build_features has conflicts that must be resolved, and the culprit is src/configs/config.py.
You see the problem? Hiro added n_splits and max_features to this file on the train_model branch, which has been merged into main. However, you also added loss and learning_rate to the same file on the build_features branch. Git cannot decide on its own which changes to keep.
We want to keep all the variables since they are all useful in our project pipeline. Let’s do just that and delete the leftover lines, such as the conflict markers.
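The story resolves the conflict on GitHub, but the same situation can be reproduced and resolved locally. The sketch below is a local stand-in under stated assumptions: a single scratch repo, placeholder values for the four variables, and hypothetical commit messages.

```shell
# Scratch area and placeholder identity (assumptions so the sketch runs anywhere).
scratch=$(mktemp -d)
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

git init -q "$scratch/repo"
cd "$scratch/repo"
git symbolic-ref HEAD refs/heads/main
echo "# Data Science Project Example" > README.md
git add . && git commit -q -m "Set up repo with cookiecutter"
git branch build_features                 # both feature branches start from this commit

# Hiro's branch adds config.py and gets merged into main first.
git checkout -q -b train_model
mkdir -p src/configs
printf 'n_splits = 5\nmax_features = 0.8\n' > src/configs/config.py          # placeholder values
git add . && git commit -q -m "Add training config"
git checkout -q main
git merge -q --no-edit train_model

# Your branch adds its own version of the same file.
git checkout -q build_features
mkdir -p src/configs
printf 'loss = "squared_error"\nlearning_rate = 0.1\n' > src/configs/config.py  # placeholder values
git add . && git commit -q -m "Add feature config"

# Merging now stops with a conflict in src/configs/config.py.
git checkout -q main
git merge build_features || echo "Automatic merge failed, resolve the conflict by hand"
cat src/configs/config.py                 # both versions, wrapped in conflict markers

# Resolve by keeping all four variables and dropping the marker lines.
printf 'n_splits = 5\nmax_features = 0.8\nloss = "squared_error"\nlearning_rate = 0.1\n' > src/configs/config.py
git add src/configs/config.py             # mark the conflict as resolved
git commit -m "Merge build_features into main"
```

Here the resolved file is rewritten with printf only to keep the sketch non-interactive; in practice you would edit the file in your editor or in GitHub's conflict editor.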
After merging build_features into main, here’s the commit history that we have on the remote repo.
Remote:
    ┌──────⬀
    │       ╲
    │        ⬀───┐
    │       ╱    │
⬀───⬀──────⬀     ├──⬀ main
    │     ││     │
    └──⬀──┘└──⬀──┘
We are done!
Wrapping Up
I hope you learned a lot from this story. You’ve been introduced to several essential Git commands and used them in a real-case scenario of building a data science project. Here are some of the most common ones (not ordered in any way):
$ git add
$ git branch
$ git checkout
$ git clone
$ git commit
$ git fetch
$ git init
$ git merge
$ git pull
$ git push
$ git remote
$ git stash
$ git status
With these git commands, you can create/clone new repos, navigate through them or their branches, and collaborate with anyone on the opposite side of the world.