
Git Cheat Sheet for Data Scientists


Getting started

Photo by Roman Synkevych on Unsplash

Git is a free and open-source version control system. Most programmers and data scientists interact with Git on a daily basis. So what is version control? Version control is how we as programmers track changes to our code and collaborate with other programmers. It lets us look back at all the changes we’ve made over time, see what we changed and when, and revert to a previous version of the code if needed.

You may have heard of GitHub before and may wonder what the difference is between Git and GitHub. It’s important to note that they are not the same. GitHub is a cloud-based service that hosts and manages Git repositories and builds on Git’s basic functionality. Besides GitHub, there are many other services, such as Bitbucket, GitLab, Codebase, and Launchpad. In this article, I’ll share some common Git commands along with some comparisons and their use cases.

Basic Overview of how Git works:

  1. Create a "repository" (project) with a Git hosting tool (like GitHub), and initialize one locally
git init

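Make sure you are in the root folder of your project when you type git init; otherwise, Git will treat every file under the current folder as part of the repository (for example, your entire home directory), which slows everything down. If you accidentally run git init in the wrong place, you can undo it by running rm -rf .git in that folder. This deletes only the repository data, not your files.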
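Here is a minimal sketch of checking where a repository lives before you start adding files. It assumes a throwaway folder created with mktemp (the folder is hypothetical, not part of any real project) so nothing on your machine is touched.

```shell
set -e

# Hypothetical project folder; mktemp keeps the demo away from real work.
project=$(mktemp -d)
cd "$project"

# Initialize the repository in the project root only.
git init -q

# Ask Git which folder it considers the top of the repository.
toplevel=$(git rev-parse --show-toplevel)
echo "Tracking: $toplevel"

# Undo an accidental init: deleting .git removes the repository data
# (including any history) but leaves the files in the folder alone.
rm -rf .git
```

Running git rev-parse --show-toplevel right after git init is a quick sanity check that you initialized the folder you meant to.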

  2. Connect your local repository to the remote repository, or copy (clone) the remote repository to your local machine
//add the remote repository
▶️ git remote add origin <HTTPS/SSH>
// clone the repository
▶️ git clone <HTTPS/SSH>
Use the Code drop-down to find the HTTPS/SSH URL (screenshot by Author). Reference: https://github.com/academic/awesome-datascience
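To try both options without a GitHub account, you can use a local bare repository as a stand-in for the hosted one. This is only a sketch: the hosted.git, existing, and cloned folder names are made up for the demo, and in real use you would paste the HTTPS/SSH URL instead of a local path.

```shell
set -e
work=$(mktemp -d)

# Stand-in for the hosted repository: a local bare repo instead of an HTTPS/SSH URL.
git init -q --bare "$work/hosted.git"

# Option A: connect an existing local repository to the remote.
git init -q "$work/existing"
cd "$work/existing"
git remote add origin "$work/hosted.git"
remote_url=$(git remote get-url origin)    # the URL that "origin" points to

# Option B: clone, which creates the folder and sets up "origin" for you.
# (Cloning an empty repository prints a warning, silenced here.)
git clone -q "$work/hosted.git" "$work/cloned" 2>/dev/null
cd "$work/cloned"
cloned_remote=$(git remote)                # lists the configured remotes

echo "existing repo -> $remote_url"
echo "cloned repo has remote: $cloned_remote"
```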
  3. Create a "branch" (optional but recommended)
//create a new branch and switch to it at the same time
▶️ git checkout -b <branch-name>
▶️ git switch -c <branch-name>
//simply switch to an existing branch
▶️ git checkout <branch-name>
▶️ git switch <branch-name>

git switch does not add new functionality; it is a dedicated command for one of the jobs of the overloaded git checkout command. git checkout can both switch branches and restore working tree files, which can be confusing. To separate the two, the Git community introduced git switch in Git 2.23.
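The split is easier to see side by side. The sketch below assumes Git 2.23 or newer and uses a temporary repository with a made-up identity, file name, and branch name. It also uses git restore, the companion command that took over the file-restoring half of git checkout.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"           # hypothetical identity for the demo
git config user.email "demo@example.com"

echo "v1" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
start_branch=$(git rev-parse --abbrev-ref HEAD)

# Branch operations: git switch
git switch -q -c feature/cleanup           # create a branch and switch to it
git switch -q "$start_branch"              # switch back to an existing branch

# File operations: git restore
echo "scratch edit" > notes.txt
git restore notes.txt                      # same effect as: git checkout -- notes.txt
restored=$(cat notes.txt)
echo "notes.txt after restore: $restored"
```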

  4. To keep your feature branch up to date with the latest changes in the master branch, use rebase
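//fetch the latest changes for the current branch and merge them in
▶️ git pull
//fetch the remote master branch and rebase your local commits on top of it
▶️ git pull --rebase origin master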

Conflicts often show up in this step. Resolving them here keeps your feature branch history clean and makes the final merge easier.

While git pull and git rebase are closely connected, they are not interchangeable. git pull fetches the latest changes of the current branch from a remote and applies them to your local copy of the branch. By default, this is done with git merge, i.e. the remote changes are merged into your local branch. So git pull is similar to running git fetch followed by git merge.

git rebase lets us apply our changes on top of the remote master branch, which gives us a cleaner, linear history. It is an alternative to merging: the local commits you made are replayed on top of the remote changes instead of being merged with them.
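To see the "cleaner history" claim in practice, here is a small simulation. Everything in it is an assumption made for the demo: a local bare repository plays the role of the remote, and two clones (named alice and bob) play two teammates. After the rebase, the feature branch has no merge commits and the feature commit sits on top of the latest master.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/origin.git"      # local stand-in for the remote

# First clone: create the initial commit on master and publish it.
git clone -q "$work/origin.git" "$work/alice" 2>/dev/null
cd "$work/alice"
git config user.name "Alice"; git config user.email "alice@example.com"
git symbolic-ref HEAD refs/heads/master    # make the branch name explicit
echo "base" > base.txt
git add base.txt && git commit -q -m "Base commit"
git push -q origin master

# Second clone: start a feature branch from master.
git clone -q -b master "$work/origin.git" "$work/bob"
cd "$work/bob"
git config user.name "Bob"; git config user.email "bob@example.com"
git checkout -q -b feature
echo "feature" > feature.txt
git add feature.txt && git commit -q -m "Feature work"

# Meanwhile, master moves forward on the remote.
cd "$work/alice"
echo "more" > more.txt
git add more.txt && git commit -q -m "New work on master"
git push -q origin master

# Back on the feature branch: replay local commits on top of the remote master.
cd "$work/bob"
git pull -q --rebase origin master
merges=$(git rev-list --merges --count HEAD)
top_commit=$(git log -1 --format=%s)
git --no-pager log --oneline
```

With a plain git pull origin master in the last step, the same history would instead contain an extra merge commit.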

  5. Add a file or make some changes in your local repo, then put them into the staging area when you are ready to save the changes
//Add one file
▶️ git add <file-name>
//Add all the new/modified/deleted files in the whole repository to the staging area (-A is shorthand for --all)
▶️ git add -A
//Stage new/modified/deleted files in the current directory and its subdirectories (Git 2.x), whereas git add -A covers the whole repository regardless of where you run it.
▶️ git add .
//Stage new and modified files only. Unlike the previous commands, this does not stage the removal of files that no longer exist in the project.
▶️ git add --ignore-removal .
//Stage all modified and deleted files, but not new (untracked) files (-u is shorthand for --update)
▶️ git add -u
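The differences between these flags are easy to mix up, so here is a small experiment you can run. It assumes a temporary repository with made-up file names and makes one change of each kind (modified, deleted, new), then records what each command stages.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"; git config user.email "demo@example.com"

mkdir sub
echo "a" > a.txt
echo "b" > sub/b.txt
git add -A && git commit -q -m "Initial files"

# One change of each kind: modified, deleted, and brand new.
echo "a2" > a.txt
rm sub/b.txt
echo "n" > new.txt

# Helper: print the staged file names on one line.
staged() { git diff --cached --name-only | sort | xargs; }

git add -u                      # modified + deleted, tracked files only
staged_u=$(staged)

git reset -q                    # unstage everything again
git add --ignore-removal .      # new + modified, deletions skipped
staged_no_removal=$(staged)

git reset -q
git add -A                      # new + modified + deleted
staged_all=$(staged)

echo "add -u:               $staged_u"
echo "add --ignore-removal: $staged_no_removal"
echo "add -A:               $staged_all"
```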
  6. "Commit" (save) the changes

git commit -m "message about the changes you've made"

  7. "Push" your changes to your branch

git push origin <branch-name>

The --set-upstream option (-u for short) lets you set the default remote branch for your current local branch: git push -u origin <branch-name>. This makes <branch-name> on origin the upstream of your local branch, so you can push without specifying where you are pushing to. After setting an upstream, you can simply type git push the next time you push changes to the remote server.
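A quick way to confirm the upstream was set is to ask Git for it with @{u}. The sketch below again uses a local bare repository as a stand-in for the remote, and the branch and file names are invented for the demo.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/origin.git"      # local stand-in for the hosted repo
git clone -q "$work/origin.git" "$work/local" 2>/dev/null
cd "$work/local"
git config user.name "Demo User"; git config user.email "demo@example.com"

echo "# demo" > README.md
git add README.md && git commit -q -m "Initial commit"

git checkout -q -b feature/report
echo "draft" > report.md
git add report.md && git commit -q -m "Add report draft"

# First push: -u records origin/feature/report as the upstream.
git push -q -u origin feature/report
upstream=$(git rev-parse --abbrev-ref "@{u}")
echo "upstream: $upstream"

# Later pushes need no arguments.
echo "final" > report.md
git commit -q -am "Finish report"
git push -q
```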

  8. Open a "pull request" (aka "PR") and request to merge the changes into the main branch

A pull request is a feature of hosting services such as GitHub that makes it easier for developers to collaborate. Once a developer has created a pull request, the rest of the team can review the code and then merge the new changes into the master branch.

Once you push the new changes in the feature branch, you can now create a pull request (screenshot by Author)
  9. "Merge" your branch into the main branch
Then a maintainer of the repository will see this and merge your pull request once the code has been reviewed. (screenshot by Author)

Useful Git Commands

  1. git status – show what files are staged, unstaged and untracked.
  2. git log – display the entire commit history. One thing to note here is that the yellow highlighted string is the commit ID. The commit ID is a SHA-1 hash of all the data in the commit. A collision, where two different commits get the same ID, is possible in theory but extremely unlikely.
git log (screenshot by Author)
  3. git diff – compare changes. git diff can be used to compare commits, branches, files, and more. You can copy the first few characters (at least 4) of a commit ID, and Git will figure out which commit you are referring to as long as the prefix is unambiguous. Using the image above, we can use 9b0867 and 51a7a.
//Show difference between working directory and last commit.
▶️ git diff HEAD
//Show difference between staged changes and last commit
▶️ git diff --cached
//Show difference between 2 commits
//To see what new changes I've made after the first 51a7a commit: 
▶️ git diff 51a7a 9b0867
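If the three comparisons blur together, this sketch may help. It assumes a temporary repository and a made-up config.txt file whose single line changes at each stage, so you can see exactly which versions each git diff form compares.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"; git config user.email "demo@example.com"

echo "rows=10" > config.txt
git add config.txt && git commit -q -m "First commit"
first=$(git rev-parse --short HEAD)        # abbreviated commit ID

echo "rows=20" > config.txt
git commit -q -am "Second commit"
second=$(git rev-parse --short HEAD)

# Compare two commits by abbreviated ID (older first, newer second).
git --no-pager diff "$first" "$second"

echo "rows=30" > config.txt
git add config.txt                         # staged change
echo "rows=40" > config.txt                # further unstaged change on top

# Keep only the added line from each diff to see what is being compared.
staged_line=$(git diff --cached | grep '^+rows')   # last commit vs staging area
unstaged_line=$(git diff | grep '^+rows')          # staging area vs working directory
head_line=$(git diff HEAD | grep '^+rows')         # last commit vs working directory

echo "--cached: $staged_line"
echo "plain:    $unstaged_line"
echo "HEAD:     $head_line"
```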
  4. git branch – list all of the branches in your repo and mark the one you are on. Remember to check this before you push your code, so you don’t accidentally push to the master branch or another branch.
  5. git branch -m <new-branch-name> – rename the current branch
//Switch to the branch you need to rename
▶️ git checkout <old-branch-name>
//Rename the branch locally
▶️ git branch -m <new-branch-name>
//Delete the old branch from the remote and push the renamed one
▶️ git push origin :<old-branch-name> <new-branch-name>
//(optional) Reset the upstream branch for the new branch name
▶️ git push origin -u <new-branch-name>
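Here is the whole rename sequence run end to end, as a sketch. It assumes a local bare repository standing in for the remote and uses the placeholder branch names old-name and new-name. At the end, the remote only knows about the new name.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/origin.git"      # local stand-in for the hosted repo
git clone -q "$work/origin.git" "$work/local" 2>/dev/null
cd "$work/local"
git config user.name "Demo User"; git config user.email "demo@example.com"

echo "# demo" > README.md
git add README.md && git commit -q -m "Initial commit"

# Publish a branch under its old name.
git checkout -q -b old-name
echo "x" > file.txt
git add file.txt && git commit -q -m "Work on the branch"
git push -q -u origin old-name

# Rename locally, then fix up the remote.
git branch -m new-name                     # renames the branch you are on
git push -q origin :old-name new-name      # delete old remote branch, push new one
git push -q -u origin new-name             # point the upstream at the new name

remote_branches=$(git ls-remote --heads origin | awk '{print $2}' | xargs)
upstream=$(git rev-parse --abbrev-ref "@{u}")
echo "remote branches: $remote_branches"
echo "upstream:        $upstream"
```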
  6. git revert <commit> – create a new commit that undoes all of the changes made in <commit>, then apply it to the current branch. This can only be done at the "commit level".
  7. git reset – this can be done at either the "commit" or "file" level. At the commit level, git reset discards commits in a private branch or throws away uncommitted changes. At the file level, git reset removes a file from the staging area.
//Reset staging area to match most recent commit, but leave the working directory unchanged.
▶️ git reset
//Move the current branch tip backward to <commit>, reset the staging area to match, but leave the working directory alone.
▶️ git reset <commit>
//Same as previous, but resets both the staging area & working directory to match. Deletes uncommitted changes, and all commits after <commit>.
▶️ git reset --hard <commit>
//Reset staging area and working directory to match most recent commit and overwrites all changes in the working directory. 
▶️ git reset --hard
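The practical difference between git revert and git reset --hard is what happens to the history. This sketch, which assumes a temporary repository with three made-up commits, counts the commits on the branch after each command.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"; git config user.email "demo@example.com"

# Build a small history of three commits.
for n in 1 2 3; do
  echo "line $n" >> log.txt
  git add log.txt
  git commit -q -m "Commit $n"
done

# revert: history grows by one commit that cancels "Commit 3".
git revert --no-edit HEAD > /dev/null
after_revert=$(git rev-list --count HEAD)

# reset --hard: the branch tip moves back, dropping the revert and "Commit 3".
git reset -q --hard HEAD~2
after_reset=$(git rev-list --count HEAD)
tip=$(git log -1 --format=%s)

echo "commits after revert: $after_revert"
echo "commits after reset:  $after_reset (tip: $tip)"
```

Because revert only adds commits, it is the safer choice on branches other people have already pulled, while reset --hard is best kept to private branches.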
  8. git stash – takes your uncommitted changes (both staged and unstaged), saves them away for later use, and then reverts them from your working copy. By default, Git won’t stash untracked or ignored files. This means that Git will not stash new files you have never run git add on, or files that match your ignore rules.
//Stash your work: once you've stashed your work, you're free to make changes, create new commits, switch branches, and perform any other Git operations; then come back and re-apply your stash when you're ready.
▶️ git stash
// re-apply stashed changes
▶️ git stash pop
// list stack-order of stashed file changes
▶️ git stash list
//discard the changes from top of stash stack
▶️ git stash drop
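To check the default behavior around untracked files yourself, you can run the sketch below. It assumes a temporary repository with invented file names (model.py is tracked, scratch.txt is untracked) and also shows the -u (--include-untracked) option, which tells git stash to take untracked files along.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"; git config user.email "demo@example.com"

echo "v1" > model.py
git add model.py && git commit -q -m "Add model"

echo "v2" > model.py              # change to a tracked file
echo "tmp" > scratch.txt          # untracked file (never added)

# Default stash: only the tracked change is put away.
git stash -q
tracked_after=$(cat model.py)     # back to the committed version
untracked_kept=$(cat scratch.txt) # still sitting in the working directory
git stash pop -q                  # bring the change back

# Stash with -u: the untracked file is put away too.
git stash -q -u
if [ -e scratch.txt ]; then scratch_present=yes; else scratch_present=no; fi
git stash pop -q

echo "default stash: model.py=$tracked_after, scratch.txt=$untracked_kept"
echo "with -u: scratch.txt present while stashed? $scratch_present"
```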
  9. git fetch <remote> <branch> – fetch a specific <branch> from the remote repository without merging it into your local branch. Leave off <branch> to fetch all remote refs.
  10. git rm <file> – remove the file from the working directory and stage the removal. Removing a file with git rm does not remove it from history: earlier commits still contain it, so the file keeps "living" in the repository until the history itself is rewritten.
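The point about history is worth seeing once. In this sketch (a temporary repository with a made-up draft.txt), the file is removed and the removal is committed, yet its contents can still be read from the earlier commit.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"; git config user.email "demo@example.com"

echo "secret draft" > draft.txt
git add draft.txt && git commit -q -m "Add draft"

# Remove the file from the working directory and stage the deletion.
git rm -q draft.txt
git commit -q -m "Remove draft"

# The file is gone from the current snapshot...
if [ -e draft.txt ]; then in_worktree=yes; else in_worktree=no; fi

# ...but the previous commit still contains it.
old_content=$(git show HEAD~1:draft.txt)

echo "in working directory: $in_worktree"
echo "recovered from history: $old_content"
```

This is why deleting a file with git rm is not enough if you committed something sensitive, such as a password or an API key.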

Summary

Now that you understand the basic Git commands, it’s time to put them to use and start building your Data Science portfolio with GitHub!

