
Git commands for Data Scientists in a Collaborative Workspace

Minimalistic Git survival guide – Explaining the rationale & situation for each Git command

Ah yes, Git: every programmer’s worst nightmare. There is no perfect way to learn it, but rather than memorising each Git command by brute force, I thought that sharing a compact list of the Git commands I use on a day-to-day basis, together with the situation each one is meant for, would be more intuitive for fellow Data folks 🙃

Note: master branch = main branch. Basically it’s the codebase which is deployed onto the server. To avoid confusion, I shall refer to this as the main branch in this article.

I. git branch & git checkout

Screenshot by Author | From (1) – (4), "git branch" creates a new branch, i.e. a snapshot of the main codebase, on your local machine, and "git checkout" switches to it. The switch is confirmed by "(myBranch)" shown in the terminal prompt.

(3) If you have forgotten the name of the branch you created, simply run git branch to list the branches that currently exist on your local machine.

(4) Conversely, if you have created too many redundant branches, the command git branch -d <your branch name> deletes a branch you specify.
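
To make the above concrete, here is a minimal sketch of the same steps run inside a throwaway repository. The temporary folder, the placeholder identity and the branch name scratch are my own assumptions for illustration; only myBranch comes from the screenshot.

```shell
# Throwaway repository so the sketch does not touch a real project.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main     # name the first branch "main" on any Git version
git config user.name "Demo User"          # placeholder identity, stored in this repo only
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit"

git branch myBranch        # create a branch from the current commit
git checkout -q myBranch   # switch to it
git branch                 # (3) list local branches; the current one is starred
current_branch="$(git rev-parse --abbrev-ref HEAD)"

git branch scratch         # a redundant branch (hypothetical name)...
git branch -d scratch      # (4) ...deleted again
```

Note that git branch -d only deletes a branch whose commits are already merged, which is a useful safety net against deleting unfinished work.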

II. git add & git commit

git add <file 1> <file 2>
git commit -m "<your commit message>"

As you make changes to the codebase, it is wise to log each significant edit with a commit. As a project goes through several iterations, reviewing the git commits gives the team a clear picture of how the project has progressed. Before committing, git add lets you specify which file(s) to stage, i.e. which changes go into the next commit. If you wish to stage all changes, use:

git add .

instead to make your life easier.
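
As a rough sketch of how the two ways of staging differ, the snippet below commits two files by name and then sweeps up the rest with git add . in a throwaway repository. The file names, commit messages and identity are made up for illustration.

```shell
# Throwaway repository; file names below are hypothetical.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Demo User"          # placeholder identity, stored in this repo only
git config user.email "demo@example.com"

echo "import pandas as pd" > clean_data.py
echo "import pandas as pd" > train_model.py
echo "scratch notes" > notes.txt

git add clean_data.py train_model.py      # stage only these two files
git commit -q -m "Add cleaning and training scripts"

git add .                                 # stage everything that is left
git commit -q -m "Add project notes"

commit_count="$(git rev-list --count HEAD)"
git log --oneline                         # one line per commit, newest first
```

Staging by name keeps unrelated edits out of a commit, whereas git add . is handy when everything in the folder belongs together.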

III. git rebase

git rebase -i main

Long story short, the above command replays your commits on top of the latest main branch, which is the point where any conflicts between your code and the main branch surface and get resolved.

In a collaborative setting, expect at least one other co-worker to be working on the same codebase.

So recall that you were working on a SNAPSHOT of the codebase. Between the moment you "branched out" and the moment you are ready to deploy your code, co-workers may have made changes which are not reflected in your local codebase. Running git rebase -i main takes each commit you made on your branch and re-applies it, one at a time, on top of the most recent commit on the main branch (so make sure your local main branch is up to date first, for example with git pull). There are 2 possible scenarios:

Scenario (1) – Code resolves itself perfectly, no action required

This part is self-explanatory: Git automatically combines your commits with the commits made before them. Unfortunately, while Git is fairly smart, it cannot decide for you when someone else's changes touch the same lines of code as yours, so more often we are met with:

Scenario (2) – Code conflicts occur. Time for manual changes to be made.

Image by Author | A sample view after running git rebase -i master. | "i" stands for interactive, meaning git goes into interactive mode which is precisely what is happening above.

While at first glance the output in the terminal seems intimidating, it is simply Git’s way of signalling 2 things: (1) it is currently in interactive mode; (2) here is the list of commits which you have made (and which are about to be replayed).

Proceed to key in :x (a colon (:) followed by the letter ‘x’) and press the [Enter/Return] key to let Git do its thing. This assumes the list opened in Vim, where :x saves the file and exits.

At some point, expect output similar to the following to be displayed:

Image by Author | For conflict type I, Git would proceed to state the list of file(s) which were modified by others after you had branched out from the main branch

and over on your IDE/Text Editor, for Conflict Type I, Git would specify which lines of code are in direct conflict via the following markings:

Illustration by Author | Git outputs the above markings to compare an alternative version of the code with your current code side-by-side

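Explanation: During a rebase, the lines of code between "<<<<<<< HEAD" and "=======" belong to the other version (the code already on the main branch), while the lines between "=======" and ">>>>>>>" come from your own commit that is being replayed. To combine the two, edit the code so that it keeps the previous features while implementing your new changes, then delete the marker lines themselves.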

⚠ Very Important: Please run the application at least once to ensure the code is resolved without issues. If your manual changes have broken the code, it will be a lot harder to fix the glitch ("debug") once you have exited rebase mode.

Finally, after you are done making the changes, stage the resolved file(s) with git add <file> and proceed to run:

git rebase --continue

and depending on how many commits need to be resolved, the above cycle repeats until all of your commits are synchronised with the commits made by the rest of your team 🤗
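
For anyone who would like to rehearse Scenario (2) safely, below is a minimal sketch that manufactures a conflict in a throwaway repository and then resolves it. The file config.py, its contents and the commit messages are assumptions for illustration. Because a script cannot type :x, I set GIT_SEQUENCE_EDITOR=true to accept the interactive list unchanged and GIT_EDITOR=true to accept the commit message as-is.

```shell
# Throwaway repository; config.py and its contents are hypothetical.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main     # name the first branch "main" on any Git version
git config user.name "Demo User"          # placeholder identity, stored in this repo only
git config user.email "demo@example.com"

echo "threshold = 0.5" > config.py
git add config.py
git commit -q -m "Add config"
git branch myBranch                       # branch out from main

# A co-worker changes the line on main...
echo "threshold = 0.7" > config.py
git commit -q -am "Raise threshold on main"

# ...while you change the same line on your branch.
git checkout -q myBranch
echo "threshold = 0.9" > config.py
git commit -q -am "Raise threshold on myBranch"

# Replay your commit on top of main; the rebase stops at the conflict.
GIT_SEQUENCE_EDITOR=true git rebase -i main || echo "Rebase paused on a conflict"
conflict_view="$(cat config.py)"          # file now contains the conflict markers
printf '%s\n' "$conflict_view"

# Resolve by hand (here: keep your value), stage the file, then continue.
echo "threshold = 0.9" > config.py
git add config.py
GIT_EDITOR=true git rebase --continue
```

If things go wrong halfway, git rebase --abort returns the branch to the state it was in before the rebase started.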

IV. git merge

git checkout main
git merge <your branch name>

So if you have made it to this stage, give yourself a pat on the back because you just made it through the toughest part. After switching to the main branch by running git checkout main, git merge myBranch brings the main branch up to date with your rebased code. Because your branch already sits on top of the latest main commit, Git simply moves main forward to your newest commit (a "fast-forward").
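
Here is a small sketch of that merge in a throwaway repository, assuming a branch that is already ahead of main (as it would be right after a rebase). The file model.py and the commit messages are hypothetical.

```shell
# Throwaway repository; model.py is a hypothetical file.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main     # name the first branch "main" on any Git version
git config user.name "Demo User"          # placeholder identity, stored in this repo only
git config user.email "demo@example.com"

echo "print('baseline')" > model.py
git add model.py
git commit -q -m "Add baseline model"

git branch myBranch
git checkout -q myBranch
echo "print('tuned')" >> model.py
git commit -q -am "Tune model"            # myBranch is now one commit ahead of main

git checkout -q main
git merge myBranch                        # main moves forward to myBranch's latest commit
```

After the merge, both branch names point at the same commit, which is why no extra merge commit appears in the history.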

V. git push

To publish the latest code on the main branch to the shared remote repository (named origin by default), from where it can be deployed, run the following:

git push origin main

And at this point we have just completed 1 round of code versioning with Git 😀
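
The push step can also be tried without a real server. In the sketch below, a local bare repository stands in for origin; the folder names and the placeholder identity are assumptions for illustration only.

```shell
# A bare repository on disk plays the role of the shared server.
work_dir="$(mktemp -d)"
git init -q --bare "$work_dir/remote.git"

git init -q "$work_dir/local"
cd "$work_dir/local"
git symbolic-ref HEAD refs/heads/main     # name the first branch "main" on any Git version
git config user.name "Demo User"          # placeholder identity, stored in this repo only
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit"

git remote add origin "$work_dir/remote.git"
git push -q origin main                   # upload main's commits to origin

# Ask the remote which commit its main branch points at.
remote_head="$(git ls-remote origin refs/heads/main | cut -f1)"
```

In a real project the origin remote is normally set up for you when you clone the repository, so only the git push line is needed.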


Two Essential Git Commands for Code Inspection

Command 1 – git status

git status

Rule of thumb: When you feel lost, the above command is the first one you should run in order to know the current state of your code versioning in most situations.

Screenshot by Author | Example of using git status. The command summarises the latest status quo of your code's version control.
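
As a quick illustration of what git status reports, the sketch below edits one tracked file and creates one new file in a throwaway repository. The file names are hypothetical, and the --short flag is an optional extra that prints a compact one-line-per-file summary.

```shell
# Throwaway repository; analysis.py and results.csv are hypothetical files.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Demo User"          # placeholder identity, stored in this repo only
git config user.email "demo@example.com"

echo "print('v1')" > analysis.py
git add analysis.py
git commit -q -m "Add analysis script"

echo "print('v2')" >> analysis.py         # edit a file Git already tracks
echo "a,b" > results.csv                  # create a file Git has never seen

git status                                # full report: modified vs untracked files
status_summary="$(git status --short)"    # compact form: " M" = modified, "??" = untracked
printf '%s\n' "$status_summary"
```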

Command 2 – gitk

For a visual overview, run:

gitk

Surprisingly, this is one of the least mentioned Git commands within the community.

Screenshot by Author | A UI popup which is displayed after gitk is run in the terminal

– End of Git Commands –

FYI and disclaimer: While these are the minimal Git commands I personally feel are crucial, there is no perfect answer to this so do bear in mind that this list is only sufficient to get by on a day-to-day basis (emphasis on ‘day-to-day’).

Ultimately it’s all about finding the Git flow which relates to your setting the most!


Many thanks for reading and please follow me on Medium if you have found this article helpful


If the above Git commands are not very applicable then here are other articles which you could refer to instead 😀 :

By Author Soner Yildirim:

8 Must-Have Git Commands for Data Scientists

By Author Dr. Varshita Sher:

Git commands data scientists use on a day-to-day basis

