6 GitHub Scenarios for Data Scientists

Start contributing to your company’s codebase from Day 1

Wen Yang
Towards Data Science


When I first started working as a data scientist, I was very intimidated by committing code to GitHub, especially to repos that other people manage. The idea that a clumsy action of mine might break other people’s code terrified me.

Now that I have finally overcome that fear, I thought I would share what my GitHub workflow looks like in my daily job.

Photo by Roman Synkevych on Unsplash

Scenario 1: You are assigned a new task with a Jira or Clubhouse ticket number, DS-1234.

A good habit I learned from my developer coworkers is to always start a new branch named after your ticket number. This way, the branch points back to the background, the story, and the scope of the work. This is very helpful not only for your code reviewer and your PM (who might ask, in a friendly way, what you have shipped), but most importantly for your future self.

# Create and switch to a branch named after the Jira ticket
$ git checkout -b DS-1234

You can check which branch you are currently on with the command below; the current branch is marked with an asterisk.

$ git branch

Now you can start working on your branch DS-1234. You can add new code or edit existing code, and once you are happy with it, you can stage, commit, and push your work like this:

$ git add new_algo.py # a new script called new_algo.py
$ git commit -m "added a new algo"
$ git push origin DS-1234

Now your code is updated on both the local DS-1234 branch and the remote DS-1234 branch on GitHub.
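If you would like to rehearse this workflow before touching a real repo, here is a minimal sketch that runs entirely in a throwaway directory. It makes a few assumptions that are not part of the workflow above: a local bare repository stands in for the GitHub remote, the user name and email are made up, and the file contents are placeholders.

```shell
set -eu

# Throwaway sandbox: a bare repository stands in for the GitHub remote (assumption).
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Demo User"          # hypothetical identity for the demo
git config user.email "demo@example.com"
git remote add origin "$sandbox/remote.git"

# An initial commit so there is some history to branch from.
echo "# demo" > README.md
git add README.md
git commit -q -m "initial commit"

# The Scenario 1 workflow: branch, edit, stage, commit, push.
git checkout -q -b DS-1234
echo "print('hello')" > new_algo.py
git add new_algo.py
git commit -q -m "DS-1234: add new_algo.py"
git push -q origin DS-1234

# Check that the local and remote branches point at the same commit.
current_branch="$(git rev-parse --abbrev-ref HEAD)"
local_tip="$(git rev-parse DS-1234)"
remote_tip="$(git ls-remote origin refs/heads/DS-1234 | cut -f1)"
echo "on $current_branch: local=$local_tip remote=$remote_tip"
```

When the two commit hashes printed at the end match, the remote branch has everything your local branch has.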

Scenario 2: After your first commit, it’s a good idea to create a Pull Request (PR) and tag your coworker for code review.

You can create a PR for branch DS-1234 on GitHub by visiting:

https://github.com/your-project-folder/pull/new/ds-1234

GitHub also prints this link in the terminal output the first time you push the branch with git push origin DS-1234.

Say your coworker gives you some feedback and comments, you make some small edits to the README directly on GitHub, and now you want to update new_algo.py further on your local machine.

First, pull the README changes from GitHub by running the command below in your local terminal:

$ git pull origin DS-1234

Then, after updating and committing new_algo.py, you can push your updated code to the remote branch on GitHub by running:

$ git push origin DS-1234
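The pull-then-push round trip can be rehearsed locally too. The sketch below is only an approximation of the situation above: a bare repository plays the role of GitHub, and a second clone (called web here, a made-up name) stands in for the README edit made in the GitHub web UI. The identities and file contents are placeholders.

```shell
set -eu

sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"   # stand-in for GitHub (assumption)

# Your local clone, with DS-1234 already pushed.
git init -q "$sandbox/local"
cd "$sandbox/local"
git config user.name "Demo User"
git config user.email "demo@example.com"
git remote add origin "$sandbox/remote.git"
echo "readme v1" > README.md
echo "print('v1')" > new_algo.py
git add README.md new_algo.py
git commit -q -m "initial commit"
git checkout -q -b DS-1234
git push -q origin DS-1234

# A second clone simulates the README edit made directly on GitHub.
git clone -q --branch DS-1234 "$sandbox/remote.git" "$sandbox/web"
cd "$sandbox/web"
git config user.name "Web Edit"
git config user.email "web@example.com"
echo "readme v2" > README.md
git commit -q -am "update readme"
git push -q origin DS-1234

# Back in your local clone: pull first, then edit, commit, and push.
cd "$sandbox/local"
git pull -q origin DS-1234
echo "print('v2')" > new_algo.py
git commit -q -am "DS-1234: address review comments"
git push -q origin DS-1234

readme_after_pull="$(cat README.md)"
local_tip="$(git rev-parse DS-1234)"
remote_tip="$(git ls-remote origin refs/heads/DS-1234 | cut -f1)"
echo "README is now: $readme_after_pull"
```

Pulling before you edit keeps your local branch from diverging from the remote one, so the final push goes through without a merge.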

Once you have confirmed that your coworker has merged your updated DS-1234 branch into the master branch, you can safely delete DS-1234 both locally and remotely.

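# delete the local branch (Git refuses if the branch has unmerged commits)
$ git branch -d DS-1234
# or force the deletion, even if the branch has unmerged commits
$ git branch -D DS-1234
# delete the remote branch using push --delete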
$ git push origin --delete DS-1234
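The difference between -d and -D is easiest to see on a branch that was never merged. The following is a small sketch in a throwaway repo (the identity and file contents are made up): -d refuses to delete the branch, while -D deletes it anyway.

```shell
set -eu

sandbox="$(mktemp -d)"
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Demo User"          # hypothetical identity for the demo
git config user.email "demo@example.com"
echo "# demo" > README.md
git add README.md
git commit -q -m "initial commit"
git branch -M master                      # make sure the main branch is called master

# A branch with one commit that was never merged into master.
git checkout -q -b DS-1234
echo "draft" > new_algo.py
git add new_algo.py
git commit -q -m "DS-1234: draft"
git checkout -q master

# -d refuses, because DS-1234 has a commit that master does not contain.
if git branch -d DS-1234 2>/dev/null; then
  safe_delete="deleted"
else
  safe_delete="refused"
fi

# -D deletes the branch regardless of its merge status.
git branch -q -D DS-1234
remaining="$(git branch --list DS-1234)"
echo "git branch -d: $safe_delete; branches left named DS-1234: '$remaining'"
```

This is why -d is the safer default: it only succeeds once the work on the branch is already contained in the branch you are on.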

Scenario 3: If your coworker trusts you enough and asks you to merge the branch yourself, or if you need to merge other people’s branches into master, you can do the following:

# switch to master and bring it up to date first
$ git checkout master
$ git pull origin master
# merge DS-1234 into the master branch
$ git merge --no-ff -m "merged DS-1234 into master" DS-1234
# or, to merge someone else's bug fix branch bugfix-234
$ git merge --no-ff -m "merged bugfix-234 into master" bugfix-234

Note that the --no-ff flag prevents git merge from performing a “fast-forward” when it detects that your current HEAD is an ancestor of the commit you’re trying to merge. Git creates a merge commit instead, so the history shows that a branch was merged.
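To see what --no-ff changes in the history, here is a minimal sketch in a throwaway repo (the identity, file names other than new_algo.py, and file contents are made up). It merges DS-1234 with --no-ff and then, for contrast, merges bugfix-234 with a plain fast-forward, counting the parents of the resulting commit each time.

```shell
set -eu

sandbox="$(mktemp -d)"
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Demo User"          # hypothetical identity for the demo
git config user.email "demo@example.com"
echo "# demo" > README.md
git add README.md
git commit -q -m "initial commit"
git branch -M master

# Feature branch with one commit; master has not moved in the meantime.
git checkout -q -b DS-1234
echo "print('algo')" > new_algo.py
git add new_algo.py
git commit -q -m "DS-1234: add new_algo.py"

# Merge with --no-ff: Git records a merge commit with two parents.
git checkout -q master
git merge -q --no-ff -m "merged DS-1234 into master" DS-1234
no_ff_parents=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))

# For contrast, a fast-forward merge just moves master to the branch tip.
git checkout -q -b bugfix-234
echo "print('fix')" > fix.py              # hypothetical file
git add fix.py
git commit -q -m "bugfix-234: fix"
git checkout -q master
git merge -q --ff bugfix-234
ff_parents=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))

echo "parents after --no-ff merge: $no_ff_parents"
echo "parents after fast-forward:  $ff_parents"
```

The merge commit created by --no-ff gives you a single point in the history to inspect or revert for the whole ticket, which a fast-forward does not.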

Scenario 4: You want to remove a file.

$ git rm test.py # remove test.py file
# commit your change
$ git commit -m "remove test.py file"

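What if you changed your mind or deleted the wrong file? Not to worry. Since the deletion above is already committed, you can restore the file from the commit just before it (if you have not committed the deletion yet, use HEAD instead of HEAD~1):

$ git checkout HEAD~1 -- test.py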

Now test.py is back!
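Which commit you restore from depends on whether the deletion has been committed yet. The sketch below walks through both cases in a throwaway repo; the identity and the content of test.py are made up for the demo.

```shell
set -eu

sandbox="$(mktemp -d)"
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Demo User"          # hypothetical identity for the demo
git config user.email "demo@example.com"
echo "print('test')" > test.py
git add test.py
git commit -q -m "add test.py"

# Case 1: the deletion is staged by git rm but not committed yet.
git rm -q test.py
git checkout -q HEAD -- test.py           # restore from the latest commit
if [ -f test.py ]; then restored_before_commit="yes"; else restored_before_commit="no"; fi

# Case 2: the deletion is already committed.
git rm -q test.py
git commit -q -m "remove test.py file"
git checkout -q HEAD~1 -- test.py         # restore from the commit before the deletion
restored_content="$(cat test.py)"

echo "restored before commit: $restored_before_commit"
echo "restored after commit:  $restored_content"
```

In the second case the restored file comes back staged, so a follow-up commit is still needed to record it in the history again.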

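Scenario 5: Remove untracked files

Sometimes you create a bunch of new files that you do not want to keep and that Git is not tracking yet. You can run the commands below to remove them all at once. Note that git clean only touches untracked files; it does not undo edits to files that are already tracked.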

# dry run to see which untracked files and directories would be removed
$ git clean -d -n
# actually remove them (this cannot be undone)
$ git clean -d -f
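Because git clean deletes files for good, it is worth trying once in a sandbox. Here is a small sketch in a throwaway repo (the identity and all file and directory names are made up) showing that the dry run lists only untracked items, and that edits to a tracked file survive the cleanup.

```shell
set -eu

sandbox="$(mktemp -d)"
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Demo User"          # hypothetical identity for the demo
git config user.email "demo@example.com"
echo "original" > tracked.py
git add tracked.py
git commit -q -m "add tracked.py"

# One edit to a tracked file, plus an untracked file and an untracked directory.
echo "edited" >> tracked.py
echo "tmp" > scratch.txt
mkdir tmp_out
echo "a,b" > tmp_out/result.csv

# Dry run: lists what would be removed without deleting anything.
preview="$(git clean -d -n)"
echo "$preview"

# Real run: removes the untracked file and directory.
git clean -q -d -f
```

After the real run, scratch.txt and tmp_out are gone, while tracked.py still contains the uncommitted edit.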

Scenario 6: [Advanced Scenario] I found myself making changes on the master branch before creating the DS-1234 branch. Actually, this has happened to me multiple times 😂.

We have two solutions.

Solution 1: git stash

# step 1: save the changes you made on master branch
$ git stash
# step 2: create and switch to DS-1234 branch
$ git checkout -b DS-1234
# step 3: transfer the changes using stash pop
$ git stash pop
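Here is the stash approach end to end in a throwaway repo, as a sketch rather than a recipe. The identity and file contents are made up, and the -u flag is an addition of mine: a plain git stash saves only changes to tracked files, while -u also stashes brand-new untracked files.

```shell
set -eu

sandbox="$(mktemp -d)"
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Demo User"          # hypothetical identity for the demo
git config user.email "demo@example.com"
echo "# demo" > README.md
git add README.md
git commit -q -m "initial commit"
git branch -M master

# Oops: work started directly on master.
echo "notes for DS-1234" >> README.md     # edit to a tracked file
echo "print('wip')" > new_algo.py         # brand-new, untracked file

# step 1: stash the changes (-u also includes untracked files)
git stash -q -u
clean_after_stash="$(git status --porcelain)"

# step 2: create and switch to the ticket branch
git checkout -q -b DS-1234

# step 3: re-apply the stashed changes on the new branch
git stash pop -q

current_branch="$(git rev-parse --abbrev-ref HEAD)"
stash_left="$(git stash list)"
echo "on $current_branch with changes: $(git status --porcelain | wc -l) path(s)"
```

After the pop, the changes sit uncommitted on DS-1234, ready to be staged and committed there, and master is left as it was.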

Solution 2: cherry-pick 🍒

This is a trick I learned from my previous manager. It’s similar to git cherry-pick, but instead of replaying a commit it checks out a specific file from another branch. In my opinion it is more intuitive, since it shows the path very clearly.

Now, imagine that on your DS-1234 branch you have worked on many pieces of code, and tomorrow is the deadline to merge your code into the release branch. It looks like only one piece of code is ready to merge. You can pick just this one file by running the commands below:

# step 1: pull all recent changes from remote
$ git pull
# step 2: check out your branch, then the release branch
$ git checkout DS-1234 # pick from this branch
$ git checkout release-2021-07-01 # to this target branch
# step 3: check out the file that is ready to merge from DS-1234
$ git checkout DS-1234 -- new_algo.py
# step 4: stage the file on the release branch (git checkout already stages it, so this step is optional)
$ git add new_algo.py
# step 5: commit
$ git commit -m "cherry pick changes from DS-1234 to release"
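The file-level pick can be tried out in a throwaway repo as well. In the sketch below, the identity, the file contents, and draft_algo.py (a second, unfinished file) are all made up; the point is only to show that one file moves to the release branch and the other stays behind.

```shell
set -eu

sandbox="$(mktemp -d)"
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Demo User"          # hypothetical identity for the demo
git config user.email "demo@example.com"
echo "# demo" > README.md
git add README.md
git commit -q -m "initial commit"
git branch -M master
git branch release-2021-07-01             # release branch starts from master

# On DS-1234: one file is ready, one is not (draft_algo.py is hypothetical).
git checkout -q -b DS-1234
echo "print('ready')" > new_algo.py
echo "print('not ready')" > draft_algo.py
git add new_algo.py draft_algo.py
git commit -q -m "DS-1234: work in progress"

# On the release branch: take only new_algo.py from DS-1234.
git checkout -q release-2021-07-01
git checkout -q DS-1234 -- new_algo.py    # copies the file and stages it
git commit -q -m "cherry pick changes from DS-1234 to release"

release_files="$(git ls-files)"
echo "files on release branch:"
echo "$release_files"
```

Keep in mind that this copies the file as it currently is on DS-1234, not an individual commit, so the commit history of the file on DS-1234 does not come along.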

Closing Thoughts

  1. Git and GitHub are powerful collaboration tools, but they can be intimidating at first for new data scientists who are used to working in notebooks.
  2. As a data scientist, you definitely don’t have to learn the most obscure Git commands to start collaborating with your developer coworkers.
  3. By familiarizing yourself with common GitHub workflow situations, you can start pushing code to your organization's codebase with confidence from Day 1 (OK, maybe Week 1 😊).
