A Real-World Case Study of Using Git Commands as a Data Scientist

Complete with Branch Illustration

You’re a data scientist. As data science matures, software engineering practices are creeping in. You are forced to venture out of your local Jupyter notebooks and meet other data scientists in the wild to build a great product.

To help you out with this grand mission, you can rely on Git, a free and open-source distributed version control system, to keep track of what everyone is coding.

Table of Contents

1. Git commands for setting up a remote repository
2. Git commands for working on a different branch
3. Git commands for joining in collaboration
4. Git commands for coworking
5. Resolving merge conflicts
Wrapping Up

To be more concrete, let’s work with an actual project (see the end product here). And to minimize the hassle of creating one, we’ll use the famous Cookiecutter Data Science. Install cookiecutter and create a project template locally.

Fill in the prompt accordingly. In our case, it’s as follows.

project_name [project_name]: Data Science Project Example
repo_name [example_project_name_here]: ds-project-example
author_name [Your name (or your organization/company/team)]: Albers Uzila
description [A short description of the project.]: A simple data science project, template by cookiecutter
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 (1, 2, 3) [1]: 1
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 (1, 2) [1]: 1

Change your working directory to the ds-project-example folder by running cd ds-project-example.

1. Git commands for setting up a remote repository

You now have a local project in ds-project-example. You need to push your local project to GitHub to collaborate with other data scientists.

To do that, initialize an empty Git repo using git init. You can confirm the repo is ready by observing that there is a hidden folder named .git in your working directory or by running git status.
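A minimal sketch of this step follows. It runs in a throwaway folder created with mktemp, which stands in for your real ds-project-example folder, and the two files are placeholders rather than the full cookiecutter template.

```shell
# Throwaway stand-in for the ds-project-example folder (hypothetical path).
project="$(mktemp -d)/ds-project-example"
mkdir -p "$project/src/data"
touch "$project/README.md" "$project/src/data/make_dataset.py"
cd "$project"

git init -b main   # -b needs Git 2.28 or newer; it names the first branch main
ls -a              # the hidden .git folder is now listed
git status         # reports the current branch and the untracked files
```

In your own project you would only need the last three commands, run from inside ds-project-example.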

Your local:

⬤ main*

The output of git status shows that you’re working on a branch called main and have many files that Git does not track yet. You can use git add . to add all of these files to the index, also known as the "staging area", which sits between the files in your working directory and your commit history.

To record changes in the index to the local repo, use git commit. Add a message like "Set up repo with cookiecutter".
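Staging and committing can be sketched as below. The sandbox folder, the single README file, and the "Demo User" identity are assumptions made so the snippet runs on its own; the commit message is the one suggested above.

```shell
# Throwaway repo with one file, standing in for the cookiecutter project.
project="$(mktemp -d)/ds-project-example"
git init -q -b main "$project" && cd "$project"
echo "# Data Science Project Example" > README.md

# Identity for this sandbox only; skip it if you already set one globally.
git config user.name "Demo User"
git config user.email "demo@example.com"

git add .                                      # working directory -> index
git status --short                             # staged files are marked with A
git commit -m "Set up repo with cookiecutter"  # index -> local commit history
git log --oneline                              # one commit on main
```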

Your local:

⬤───⬤ main*

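Now, create a remote repo at https://github.com/new and name it ds-project-example. Before pushing the local repo to remote, you need to register the remote repo from inside your local repo’s directory, using the git remote add command.

The git remote add command takes two arguments:

  • A remote name, for example, origin
  • A remote URL, for example, the URL of the GitHub repo you just created

After running the git remote add command, the remote is recorded under the name origin in .git/config. Once you push, the .git/refs folder holds both your local branches (in heads) and the remote-tracking branches of origin (in remotes).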

Now, to push commits made on your local branch to the remote repo, use git push. This command takes two arguments:

  • A remote name, for example, origin
  • A branch name, for example, main

To summarize:

Your local:

⬤───⬤ main*
        origin/main

Remote:

⬤───⬤ main

The -u flag in git push sets the branch you are pushing to (origin/main) as the remote-tracking branch of the branch you are pushing (main), so Git knows what you want to do when you push/pull branches in the future.
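Here is a rough sketch of adding the remote and pushing with -u. To keep it runnable offline, a local bare repository (remote.git, a made-up name) stands in for the empty GitHub repo; with GitHub you would pass the repo URL to git remote add instead.

```shell
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"   # stand-in for the empty GitHub repo
git init -q -b main "$sandbox/ds-project-example"
cd "$sandbox/ds-project-example"
git config user.name "Demo User"           # sandbox-only identity
git config user.email "demo@example.com"
echo "# Data Science Project Example" > README.md
git add . && git commit -q -m "Set up repo with cookiecutter"

git remote add origin "$sandbox/remote.git"  # arguments: remote name, remote URL
git remote -v                                # origin is listed for fetch and push
git push -u origin main                      # arguments: remote name, branch name
git branch -vv                               # main is shown tracking origin/main
```

Because of -u, a plain git push or git pull on main later knows to talk to origin/main.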

After doing all this, your project is now set up on GitHub:

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

2. Git commands for working on a different branch

Your main branch should represent the stable history of your code. Create other branches to experiment with new things, implement them, and when they have matured enough you can merge them back to main.

Now, to create a new branch from local main and switch to it, use git checkout with the -b flag. You can use git branch to see all available branches and which branch you are currently on.
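As a small sketch, the branch can be created like this. The sandbox repo with a single empty commit is an assumption that replaces your real project; the branch name make_dataset is the one used in this story.

```shell
project="$(mktemp -d)/ds-project-example"
git init -q -b main "$project" && cd "$project"
git config user.name "Demo User"           # sandbox-only identity
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Set up repo with cookiecutter"

git checkout -b make_dataset   # create the branch from main and switch to it
git branch                     # the asterisk marks the branch you are on
```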

Your local:

⬤───⬤ main
        origin/main
        make_dataset*

Remote:

⬤───⬤ main

You’ve made a new local branch named make_dataset and checked out this branch. After adding some code on make_dataset, you’re ready to add, commit, and push the changes to a new remote branch also called make_dataset, with remote-tracking branch origin/make_dataset. The only change you want to push is in the src/data/make_dataset.py file.
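The add, commit, and push sequence could look like the sketch below. A local bare repo stands in for GitHub, and the edit to make_dataset.py and the commit message are made up for illustration.

```shell
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"   # stand-in for the GitHub repo
git init -q -b main "$sandbox/ds-project-example"
cd "$sandbox/ds-project-example"
git config user.name "Demo User"           # sandbox-only identity
git config user.email "demo@example.com"
mkdir -p src/data && touch src/data/make_dataset.py
git add . && git commit -q -m "Set up repo with cookiecutter"
git remote add origin "$sandbox/remote.git"
git push -q -u origin main
git checkout -q -b make_dataset

echo "print('building dataset')" >> src/data/make_dataset.py  # placeholder edit
git add src/data/make_dataset.py            # stage only this file
git commit -m "Add dataset creation step"   # hypothetical message
git push -u origin make_dataset             # creates remote make_dataset and tracks it
```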

Your local:

⬤───⬤ main
      │ origin/main
      │
      └──⬤ make_dataset*
            origin/make_dataset

Remote:

⬤───⬤ main
      │
      └──⬤ make_dataset

You can now merge remote make_dataset into remote main by first clicking the "Compare & pull request" button on GitHub, then following the steps.

After successfully merging, you will see something like this.

Your local:

⬤───⬤ main
      │ origin/main
      │
      └──⬤ make_dataset*
            origin/make_dataset

Remote:

⬤───⬤──────⬤ main
      │      │
      └──⬤──┘

3. Git commands for joining in collaboration

You have another contributor for your project. Let’s say his name is Hiro. To get started, Hiro has already cloned your remote repo using git clone just before you merged remote make_dataset to remote main. He also checked out his own local branch called train_model from the cloned repo.
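Hiro’s first two commands might look like the sketch below. The seed repo and the local bare remote.git are assumptions that replace your GitHub repo so the snippet runs offline; Hiro would pass the GitHub URL to git clone instead.

```shell
sandbox="$(mktemp -d)"
# Stand-in remote with one commit on main.
git init -q -b main "$sandbox/seed"
git -C "$sandbox/seed" -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q --allow-empty -m "Set up repo with cookiecutter"
git clone -q --bare "$sandbox/seed" "$sandbox/remote.git"

# What Hiro runs on his machine.
mkdir "$sandbox/hiro" && cd "$sandbox/hiro"
git clone "$sandbox/remote.git" ds-project-example
cd ds-project-example
git checkout -b train_model   # his own branch, created from the cloned main
git branch -a                 # local branches plus remotes/origin/main
```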

Your local:

⬤───⬤ main
      │ origin/main
      │
      └──⬤ make_dataset
            origin/make_dataset

Hiro's local:

⬤───⬤ main
        origin/main
        train_model*

Remote:

⬤───⬤──────⬤ main
      │      │
      └──⬤──┘

After adding src/configs/config.py and editing it along with src/models/train_model.py, Hiro generates:

  1. four trained models in the models directory, and
  2. a JSON file in the reports directory containing the performance of the ensemble model on the train and validation splits.

Just to make sure, Hiro runs git status.

Just as you did before, Hiro adds, commits, and pushes the changes in his local branch to remote. However, the models directory is not included because the trained models take up a lot of space.
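One way to stage everything except the models directory is to name the folders you do want, as in this sketch. The file names model_1.pkl and metrics.json and the commit message are hypothetical, and the push is left out to keep the snippet short.

```shell
project="$(mktemp -d)/ds-project-example"
git init -q -b main "$project" && cd "$project"
git config user.name "Demo User"           # sandbox-only identity
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Set up repo with cookiecutter"
git checkout -q -b train_model
mkdir -p src/configs src/models models reports
touch src/configs/config.py src/models/train_model.py
touch models/model_1.pkl reports/metrics.json   # hypothetical file names

git status --short                     # models/, reports/ and src/ are untracked
git add src reports                    # stage everything except models
git status --short                     # models/ is still listed with ??
git commit -m "Train ensemble model"   # hypothetical message
```

If the models should never be committed, listing models/ in .gitignore is a common alternative.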

Your local:

⬤───⬤ main
      │ origin/main
      │
      └──⬤ make_dataset
            origin/make_dataset

Hiro's local:

⬤───⬤ main
      │ origin/main
      │
      └──⬤ train_model*
            origin/train_model

Remote:

      ┌──⬤ train_model
      │
⬤───⬤──────⬤ main
      │      │
      └──⬤──┘

4. Git commands for coworking

You want to add something to Hiro’s work. However, you have been busy with another task for a while: moving some parts of the code in src/data/make_dataset.py into src/features/build_features.py. So, let’s talk about that first.

You started by pulling all changes from remote main into local main using git pull, so that the new branch build_features is checked out from the most recent version of main.
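The pull-then-branch step can be sketched as follows. The seed repo and the local bare remote.git are stand-ins for GitHub, and the extra commit pushed from seed imitates the merged pull request.

```shell
sandbox="$(mktemp -d)"
git init -q -b main "$sandbox/seed" && cd "$sandbox/seed"
git config user.name "Demo User"           # sandbox-only identity
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Set up repo with cookiecutter"
git clone -q --bare . "$sandbox/remote.git"              # stand-in for GitHub
git clone -q "$sandbox/remote.git" "$sandbox/ds-project-example"
git commit -q --allow-empty -m "Merge make_dataset"      # imitates the merged PR
git push -q "$sandbox/remote.git" main

cd "$sandbox/ds-project-example"
git config user.name "Demo User"
git config user.email "demo@example.com"
git checkout main                # make sure you are on local main
git pull origin main             # fast-forward local main to remote main
git checkout -b build_features   # branch off the freshly updated main
```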

Your local:

⬤───⬤──────⬤ main
      │         origin/main
      │         build_features*
      │
      └──⬤ make_dataset
            origin/make_dataset

Hiro's local:

⬤───⬤ main
      │ origin/main
      │
      └──⬤ train_model
            origin/train_model

Remote:

      ┌──⬤ train_model
      │
⬤───⬤──────⬤ main
      │      │
      └──⬤──┘

While you are in the middle of editing on the build_features branch, you want to see Hiro’s progress. But you still have 2 files in the branch that haven’t been staged for commit.

So, you stash away the changes in your dirty working directory using git stash. Then you can:

  1. create a local train_model branch checked out from local main,
  2. set the upstream of local train_model to origin/train_model so it can track remote train_model, and
  3. pull from the remote train_model that Hiro has created.

It’s all well and good until a problem appears in step 3 above. Since:

  1. Hiro checked out his local train_model from local main before you merged your remote make_dataset to remote main (see Section 3), and
  2. you pulled from remote main to your local main so you have the most recent version of main (see the beginning of Section 4),

your local main is more up to date than Hiro’s (it is several commits "ahead"). Hence you need a more elaborate way to pull the remote train_model (hint: git pull is just git fetch followed by git merge).
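Following the hint, the whole sequence might look like this sketch, with the pull written out as git fetch plus git merge. Everything above the last comment is sandbox setup under assumed names (seed, you, hiro, remote.git) that recreates the situation offline, with single-file placeholder commits.

```shell
sandbox="$(mktemp -d)"
ident() { git config user.name "$1"; git config user.email "$1@example.com"; }

# Sandbox setup: a stand-in remote, your clone, and Hiro's clone.
git init -q -b main "$sandbox/seed" && cd "$sandbox/seed" && ident seed
echo "v1" > make_dataset.py && git add . && git commit -q -m "Set up repo"
git clone -q --bare "$sandbox/seed" "$sandbox/remote.git"
git clone -q "$sandbox/remote.git" "$sandbox/you"
git clone -q "$sandbox/remote.git" "$sandbox/hiro"

cd "$sandbox/hiro" && ident hiro     # Hiro branches from the older main
git checkout -q -b train_model
echo "fit" > train_model.py && git add . && git commit -q -m "Train model"
git push -q -u origin train_model

cd "$sandbox/you" && ident you       # your main moves ahead of Hiro's
echo "v2" > make_dataset.py && git commit -q -am "Merge make_dataset"
git push -q origin main
git checkout -q -b build_features
echo "wip" >> make_dataset.py        # uncommitted work in progress

# The steps from the text.
git stash                                        # shelve the uncommitted edits
git checkout -b train_model main                 # step 1
git fetch origin                                 # download origin/train_model
git branch --set-upstream-to=origin/train_model  # step 2
git merge --no-edit origin/train_model           # step 3, the merge half of a pull
git log --oneline --graph                        # merge commit joining both histories
```

The --no-edit flag only keeps Git from opening an editor for the merge message, which is handy in a script.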

Your local:

      ┌──────⬤ origin/train_model
      │        ╲
      │         ⬤ train_model*
      │        ╱
⬤───⬤──────⬤ main
      │         origin/main
      │         build_features --> stash
      │
      └──⬤ make_dataset
            origin/make_dataset

Hiro's local:

⬤───⬤ main
      │ origin/main
      │
      └──⬤ train_model
            origin/train_model

Remote:

      ┌──⬤ train_model
      │
⬤───⬤──────⬤ main
      │      │
      └──⬤──┘

Now, after merging the latest local main into your local train_model, you’re ready to push the changes to remote and restore the stashed changes on build_features.
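A sketch of pushing and then restoring the stash is below. The sandbox is simplified: a local bare repo plays the remote, and an empty commit on train_model stands in for the real merge commit.

```shell
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"   # stand-in for GitHub
git init -q -b main "$sandbox/you" && cd "$sandbox/you"
git config user.name "Demo User"           # sandbox-only identity
git config user.email "demo@example.com"
echo "v1" > build_features.py && git add . && git commit -q -m "Set up repo"
git remote add origin "$sandbox/remote.git" && git push -q -u origin main
git checkout -q -b build_features
echo "wip" >> build_features.py && git stash -q       # stashed work in progress
git checkout -q -b train_model main
git commit -q --allow-empty -m "Merge main into train_model"  # stand-in merge

git push origin train_model    # publish the merged train_model
git checkout build_features    # go back to your own task
git stash pop                  # reapply the stashed edits and drop the stash entry
git status --short             # build_features.py is modified again
```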

Your local:

      ┌──────⬤
      │        ╲
      │         ╲
      │          ⬤ train_model
      │         ╱   origin/train_model
      │        ╱
⬤───⬤──────⬤ main
      │         origin/main
      │         build_features*
      │
      └──⬤ make_dataset
            origin/make_dataset

Hiro's local:

⬤───⬤ main
      │ origin/main
      │
      └──⬤ train_model
            origin/train_model

Remote:

      ┌──────⬤
      │        ╲
      │         ⬤ train_model
      │        ╱
⬤───⬤──────⬤ main
      │      │
      └──⬤──┘

You create and edit another file src/configs/config.py, stage all 3 files, commit, and push to remote.

Your local:

      ┌──────⬤
      │        ╲
      │         ╲
      │          ⬤ train_model
      │         ╱   origin/train_model
      │        ╱
⬤───⬤──────⬤ main
      │       │ origin/main
      │       │
      │       └──⬤ build_features*
      │
      └──⬤ make_dataset
            origin/make_dataset

Hiro's local:

⬤───⬤ main
      │ origin/main
      │
      └──⬤ train_model
            origin/train_model

Remote:

      ┌──────⬤
      │        ╲
      │         ⬤ train_model
      │        ╱
⬤───⬤──────⬤ main
      │      ││
      └──⬤──┘└──⬤ build_features

5. Resolving merge conflicts

After everything has been pushed to remote, we won’t use local repos anymore. So let’s focus on the remote repo. Merge train_model into main.

After opening a pull request and merging train_model into main, here’s what we have so far.

Remote:

      ┌──────⬤
      │        ╲
      │         ⬤ main
      │        ╱
⬤───⬤──────⬤
      │      ││
      └──⬤──┘└──⬤ build_features

Now, merge build_features into main. This time, the two can’t be merged automatically. But don’t worry, you can still create the pull request.

It turns out build_features has conflicts that must be resolved, and the culprit is src/configs/config.py.

You see the problem? Hiro added n_splits and max_features in this file for the train_model branch, which has been merged into main. However, you also added loss and learning_rate in the same file for the build_features branch. Git cannot decide on its own which changes to keep.

We want to keep all of the variables since they are all useful in our project pipeline. So let’s do that and delete the leftover lines, such as the conflict markers.
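The story resolves the conflict in GitHub’s web editor; the sketch below reproduces the same kind of conflict and resolution on the command line instead. The variable values are invented for illustration, and the repo is a throwaway sandbox.

```shell
repo="$(mktemp -d)/ds-project-example"
git init -q -b main "$repo" && cd "$repo"
git config user.name "Demo User"           # sandbox-only identity
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Set up repo"

git checkout -q -b train_model             # Hiro's side
mkdir -p src/configs
printf 'n_splits = 5\nmax_features = "sqrt"\n' > src/configs/config.py
git add . && git commit -q -m "Add n_splits and max_features"

git checkout -q -b build_features main     # your side, same file path
mkdir -p src/configs
printf 'loss = "squared_error"\nlearning_rate = 0.1\n' > src/configs/config.py
git add . && git commit -q -m "Add loss and learning_rate"

git checkout -q main
git merge -q train_model                      # merges cleanly (fast-forward)
git merge --no-edit build_features || true    # stops and reports a conflict
cat src/configs/config.py                     # shows <<<<<<<, ======= and >>>>>>> markers

# Resolve by keeping all four variables and dropping the marker lines.
printf 'n_splits = 5\nmax_features = "sqrt"\nloss = "squared_error"\nlearning_rate = 0.1\n' \
    > src/configs/config.py
git add src/configs/config.py   # mark the conflict as resolved
git commit --no-edit            # conclude the merge
```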

After merging build_features into main, here’s the commit history that we have on the remote repo.

Remote:

      ┌──────⬤
      │        ╲
      │         ⬤───┐
      │        ╱     │
⬤───⬤──────⬤      ├──⬤ main
      │      ││      │
      └──⬤──┘└──⬤──┘

We are done 🙂

Wrapping Up

I hope you learned a lot from this story. You’ve been introduced to several essential Git commands and used them in a real-world scenario of building a data science project. Here are some of the most common ones (in no particular order):

$ git add
$ git branch
$ git checkout
$ git clone
$ git commit
$ git fetch
$ git init
$ git merge
$ git pull
$ git push
$ git remote
$ git stash
$ git status

With these Git commands, you can create or clone repos, navigate through them and their branches, and collaborate with anyone on the opposite side of the world.

