Using Git in Data Science: the Solo Master

Pan Wu
Towards Data Science
7 min read · Dec 3, 2019


The solo tree in Tongva Park, Santa Monica, CA (I took the photo during a lunch break back in 2014)

Git is a very popular version control system for tracking changes in files and coordinating work on those files among multiple people. It is widely used in Data Science projects to keep track of code and to support parallel development. Git can be used in very complicated ways; for data scientists, however, we can keep it simple. In this post, I am going to walk through the main use cases if you are a “Solo Master”.

What is “Solo Master”?

When you use Git simply to keep your code safe (so that you don’t go crazy after your laptop breaks or gets stolen) and all changes go to the “master” branch only, you are in “solo master” mode. In this mode, things are relatively easy, and there are six main scenarios:

Credit. https://vignette.wikia.nocookie.net/character-stats-and-profiles/images/a/a9/Neo_%28BOS%29.png
  1. One working space, nothing goes wrong
  2. One working space, mistake before “git add”
  3. One working space, mistake before “git commit”
  4. One working space, mistake before “git push”
  5. One working space, mistake after “git push”
  6. Multiple working spaces

Note. There are many awesome resources out there explaining what Git is and what its basic concepts are. I would refer you to “Getting Started - Git Basics” on the official Git website.

Now we can start a cool project! First, go to GitHub and create an empty project, then configure it properly on your local laptop.

Create the Git repository in GitHub
git clone git@github.com:PanWu/git-solo-master.git
cd git-solo-master/
git config user.name "Your Name"
git config user.email yourname@gmail.com
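If you would rather rehearse the cases below without touching GitHub, a local bare repository can stand in for the remote. This is only a minimal sketch and not part of the original setup: the temporary folder and the name “remote.git” are assumptions chosen for illustration.

```shell
set -e                                   # stop at the first failing command
sandbox=$(mktemp -d)                     # throwaway folder (assumed location)
cd "$sandbox"

# A bare repository plays the role of the empty GitHub project (assumption)
git init -q --bare remote.git
git -C remote.git symbolic-ref HEAD refs/heads/master

# Clone and configure it the same way as the real project
git clone -q remote.git git-solo-master
cd git-solo-master
git symbolic-ref HEAD refs/heads/master  # name the first branch "master" regardless of local defaults
git config user.name "Your Name"
git config user.email yourname@gmail.com

# Check what was configured
git config --get remote.origin.url       # path of the stand-in remote
git config --get user.name               # the name that will be recorded in commits
```

Everything lives under the temporary folder, so deleting that folder removes the experiment completely.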

Case 1. One working space, nothing goes wrong

This is the ideal and simplest situation: all you need to do is stage your files with “git add”, commit the code, and push to the remote master branch. Life is easy in this situation.

echo 'GOOD CODE' > feature_1.py
git add feature_1.py
git commit -m 'add feature'
git commit --amend -m 'add one feature'
git push origin master
# after this, the local "master" and "origin/master" point to the same commit
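The “git commit --amend” step above replaces the last commit instead of adding a second one. The sketch below shows this in a throwaway local repository; the folder name “demo” is hypothetical, and no remote is involved.

```shell
set -e                                # stop at the first failing command
sandbox=$(mktemp -d)                  # throwaway folder (assumed location)
cd "$sandbox"
git init -q demo                      # "demo" is a hypothetical repository name
cd demo
git config user.name "Your Name"
git config user.email yourname@gmail.com

echo 'GOOD CODE' > feature_1.py
git add feature_1.py
git commit -q -m 'add feature'
git commit -q --amend -m 'add one feature'   # rewrites the commit just made

git log --format=%s                   # lists commit subjects: only 'add one feature'
git rev-list --count HEAD             # prints 1: the amend did not add a commit
```

Because amending rewrites the commit, it is best done before “git push”, as in the sequence above.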

Case 2. One working space, mistake before “git add”

This happens all the time: you start playing with an idea, add some draft code to a file, quickly figure out that the idea does not work, and now you want a clean slate again. How do you do that? Fortunately, if you haven’t run “git add” on the changed file, this is very easy.

For more details, please refer to “Git checkout”.

echo 'BAD CODE' > feature_1.py
git checkout -- feature_1.py
# Now, feature_1.py file will contain only "GOOD CODE"
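Before discarding anything, it can help to look at what is about to be thrown away. The sketch below adds “git status” and “git diff” in front of the checkout; it runs in a throwaway local repository whose name (“demo”) is hypothetical.

```shell
set -e                                # stop at the first failing command
sandbox=$(mktemp -d)                  # throwaway folder (assumed location)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Your Name"
git config user.email yourname@gmail.com
echo 'GOOD CODE' > feature_1.py
git add feature_1.py
git commit -q -m 'add one feature'

echo 'BAD CODE' > feature_1.py        # the mistake, not yet staged
git status --short                    # " M feature_1.py": modified but not staged
git diff                              # shows exactly what would be discarded

git checkout -- feature_1.py          # discard the change, as above
# On Git 2.23 or newer, "git restore feature_1.py" does the same job
cat feature_1.py                      # GOOD CODE
```

Note that a change discarded this way was never recorded by Git, so it cannot be brought back afterwards.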

Case 3. One working space, mistake before “git commit”

You thought the idea was going to work, so you added a few files, made some changes, and ran “git add” a few times, only to find out that the result is not right. Now you want to get rid of the mess and go back to the nice, correct, old code.

For more details, please refer to “Git reset”.

echo 'BAD CODE' >> feature_1.py
echo 'BAD CODE' > feature_2.py
git add feature_1.py feature_2.py
git reset
git checkout -- feature_1.py
# Now, feature_1.py file will contain only "GOOD CODE"
# and feature_2.py will be considered as "Untracked files" in the folder
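To see what “git reset” changes at each step, the same sequence can be run with “git status --short” in between. This is a sketch in a throwaway local repository (the name “demo” is hypothetical); the final “rm” is an addition that also cleans up the untracked file.

```shell
set -e                                # stop at the first failing command
sandbox=$(mktemp -d)                  # throwaway folder (assumed location)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Your Name"
git config user.email yourname@gmail.com
echo 'GOOD CODE' > feature_1.py
git add feature_1.py
git commit -q -m 'add one feature'

echo 'BAD CODE' >> feature_1.py
echo 'BAD CODE' > feature_2.py
git add feature_1.py feature_2.py
git status --short                    # "M  feature_1.py" and "A  feature_2.py": both staged

git reset -q                          # unstage everything; files on disk are untouched
git status --short                    # " M feature_1.py" and "?? feature_2.py"

git checkout -- feature_1.py          # discard the edit to the tracked file
rm feature_2.py                       # untracked files must be deleted separately
git status --short                    # prints nothing: the working tree is clean
```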

Case 4. One working space, mistake before “git push”

You went even further this time: not only did you run “git add”, but the modification took a few hours and you also made a few commits! Ah, another huge mistake. What to do?!

For more details, please refer to “Git reset”.

# if there is only 1 incorrect "git commit"
echo 'BAD CODE' > feature_1.py
git add feature_1.py
git commit -m 'add an unnecessary feature'
git reset HEAD~
git checkout -- feature_1.py

# if there is more than 1 incorrect "git commit"
echo 'BAD CODE' >> feature_1.py
echo 'BAD CODE' > feature_2.py
git add feature_1.py
git commit -m 'add first unnecessary feature'
git add feature_2.py
git commit -m 'add second unnecessary feature'
# now run "git log" and find the commit that "origin/master" and "origin/HEAD"
# point to (here d317a62a12481a850be4c3cf5bc9a7bf45c094b7);
# "HEAD -> master" is now 2 commits ahead of "origin/HEAD"
git log
git reset d317a62a12481a850be4c3cf5bc9a7bf45c094b7
git checkout -- feature_1.py
# if you run "git log" again, you will see that "HEAD -> master" is now the same
# as "origin/master"
Git commits history: when you made a mistake and committed it to the branch
Git commits history: after you reset the HEAD to the previous “good” commit
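Instead of copying the hash out of “git log”, the remote-tracking name “origin/master” can be used directly, since it marks the last pushed commit. The sketch below assumes a local bare repository as a stand-in for GitHub; the folder names “remote.git” and “work” are hypothetical.

```shell
set -e                                     # stop at the first failing command
sandbox=$(mktemp -d)                       # throwaway folder (assumed location)
cd "$sandbox"
git init -q --bare remote.git              # local stand-in for the GitHub project
git init -q work
cd work
git symbolic-ref HEAD refs/heads/master    # name the first branch "master"
git config user.name "Your Name"
git config user.email yourname@gmail.com
git remote add origin "$sandbox/remote.git"
echo 'GOOD CODE' > feature_1.py
git add feature_1.py
git commit -q -m 'add one feature'
git push -q origin master

# Two mistaken commits that have not been pushed
echo 'BAD CODE' >> feature_1.py
git add feature_1.py
git commit -q -m 'add first unnecessary feature'
echo 'BAD CODE' > feature_2.py
git add feature_2.py
git commit -q -m 'add second unnecessary feature'

git log --oneline origin/master..HEAD      # lists the two unpushed commits
git reset -q origin/master                 # move master back; file changes stay on disk
git checkout -- feature_1.py               # discard the leftover edit
git log --oneline origin/master..HEAD      # prints nothing: nothing left to push
```

As in the original commands, “feature_2.py” remains in the folder as an untracked file after the reset.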

Case 5. One working space, mistake after “git push”

You pushed the code to production, and other team members found a big mistake/bug in it. Now you need to revert the code to where it was.

For more details, please refer to “Git revert”.

echo 'BAD CODE' >> feature_1.py
echo 'BAD CODE' > feature_2.py
git add feature_1.py feature_2.py
git commit -m 'add unnecessary features'
git push origin master
git revert HEAD
# or first use "git log" to find current head
# then run "git revert 66cda7e93661df0c81c8b51fec6eec50cf1e5477"
# either way, you need to edit and save the revert message
git push origin master
# now although the master/HEAD gets to where it was, your mistake is forever
# recorded in Git system :( so pay attention to your push!
Git commits history: it is recorded on the GitHub server!
Git commits history: after revert, the mistake is no longer there. However, you leave a permanent log in the server.
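The following sketch replays the revert against a local bare repository (an assumed stand-in for GitHub, with hypothetical folder names) to show that the files return to their old state while both the mistake and its revert stay in the history. It uses “--no-edit” to accept the default revert message instead of opening an editor.

```shell
set -e                                     # stop at the first failing command
sandbox=$(mktemp -d)                       # throwaway folder (assumed location)
cd "$sandbox"
git init -q --bare remote.git              # local stand-in for the GitHub project
git init -q work
cd work
git symbolic-ref HEAD refs/heads/master    # name the first branch "master"
git config user.name "Your Name"
git config user.email yourname@gmail.com
git remote add origin "$sandbox/remote.git"
echo 'GOOD CODE' > feature_1.py
git add feature_1.py
git commit -q -m 'add one feature'
git push -q origin master

# The mistake is committed and pushed
echo 'BAD CODE' >> feature_1.py
echo 'BAD CODE' > feature_2.py
git add feature_1.py feature_2.py
git commit -q -m 'add unnecessary features'
git push -q origin master

git revert --no-edit HEAD                  # new commit that undoes the last one
git push -q origin master

git log --format=%s                        # three subjects: the revert, the mistake, the first commit
ls                                         # only feature_1.py is left
```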

Case 6. Multiple working spaces

You have two working spaces: one on your laptop and one on your desktop workstation. You develop feature 2 in one working space and feature 3 in the other.

# SPACE 1
git clone git@github.com:PanWu/git-solo-master.git
cd git-solo-master/
git config user.name "Your Name"
git config user.email yourname@gmail.com
echo 'NEW CODE 2' > feature_2.py
git add feature_2.py
git commit -m 'add feature 2'

# SPACE 2
git clone git@github.com:PanWu/git-solo-master.git
cd git-solo-master/
git config user.name "Your Name"
git config user.email yourname@gmail.com
echo 'NEW CODE 3' > feature_3.py
git add feature_3.py
git commit -m 'add feature 3'

# In SPACE 1: we pushed successfully
git push origin master

# In SPACE 2: the same command will fail
git push origin master
# error: failed to push some refs to 'git@github.com:PanWu/git-solo-master.git'
# hint: Updates were rejected because the remote contains work that you do
# hint: not have locally. This is usually caused by another repository pushing
# hint: to the same ref. You may want to first integrate the remote changes
# hint: (e.g., 'git pull ...') before pushing again.
# hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Now you see the problem, and the solution is to run “git pull” first.
“git pull” = “git fetch” + “git merge”, or, with “--rebase”, “git fetch” + “git rebase”.
For details, refer to “Git pull”. Remember that the remote branch now contains the “add feature 2” commit pushed from space 1.

# In SPACE 2
# first, pull the current most up-to-date HEAD
git pull
# this equals "git fetch" + "git merge"
# then, edit and save the merge message
# you may also try "git pull --rebase" to see the difference
git push origin master
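The whole two-space scenario can be rehearsed on one machine. The sketch below is an illustration under assumptions, not the original setup: a local bare repository stands in for GitHub, and the folders “space1” and “space2” are hypothetical names for the two working spaces. It asks for a merge explicitly with “--no-rebase” and skips the message editor with “--no-edit”.

```shell
set -e                                     # stop at the first failing command
sandbox=$(mktemp -d)                       # throwaway folder (assumed location)
cd "$sandbox"
git init -q --bare remote.git              # local stand-in for the GitHub project
git -C remote.git symbolic-ref HEAD refs/heads/master

# SPACE 1: create the project and push the first commit
git init -q space1
cd space1
git symbolic-ref HEAD refs/heads/master
git config user.name "Your Name"
git config user.email yourname@gmail.com
git remote add origin "$sandbox/remote.git"
echo 'GOOD CODE' > feature_1.py
git add feature_1.py
git commit -q -m 'add one feature'
git push -q origin master

# SPACE 2: a second clone of the same project
cd "$sandbox"
git clone -q remote.git space2
git -C space2 config user.name "Your Name"
git -C space2 config user.email yourname@gmail.com

# SPACE 1 commits feature 2 and pushes first
cd "$sandbox/space1"
echo 'NEW CODE 2' > feature_2.py
git add feature_2.py
git commit -q -m 'add feature 2'
git push -q origin master

# SPACE 2 commits feature 3; its push is rejected
cd "$sandbox/space2"
echo 'NEW CODE 3' > feature_3.py
git add feature_3.py
git commit -q -m 'add feature 3'
if git push -q origin master 2>/dev/null; then
    echo 'unexpected: push succeeded'
else
    echo 'push rejected, as expected'
fi

# Integrate the remote work with an explicit merge, then push again
git pull -q --no-rebase --no-edit origin master
git push -q origin master
git log --oneline --graph                  # shows the merge joining both features
```

Replacing “--no-rebase --no-edit” with “--rebase” in the pull gives a linear history without the merge commit, which is one way to see the difference between the two variants.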

Now, as long as you develop each feature in its own working space, this process works without problems. This is considered better practice than working on the same feature in different working spaces, because if the same file is modified in different spaces, the merge may produce many conflicts, and resolving them would be a big deal for “solo masters”.

Great! After these simple case studies, you are a real “solo master” of Git in Data Science. You will not lose code that has been pushed to the cloud, and you need not worry about code inconsistency across working spaces (as long as “git pull” is used correctly).

Enjoy using Git!

Note. This article was originally posted on my personal blog and re-posted on Medium.
