
Get-git: A primer on Git

Say it fast enough and it sounds like get it…

Git has long held its place as the backbone of collaborative software development. As a developer I interact with Git on a daily basis: committing, pushing to the remote and the occasional hair pulling when things go wrong (which explains my hair loss). Recently, while musing on "how does Git actually work?", I realized how little I knew about this deeply rooted part of software development. It goes without saying that many of the blunders that kept me guessing for hours could’ve been avoided had I known the structure and functionality that underpin Git.

Git is an essential part of a software development project. It offers:

  • Tracking – Git maintains a history, which is essentially a series of snapshots of your code base over time
  • Feature isolation – When you develop a feature you create your own branch, so you don’t break the working/stable version
  • Collaboration – Any number of developers can work on different features in their local copies and push to a single central remote repository when ready
  • Peer reviews – Git lends itself to peer reviews: when developer A has finished feature X, before merging it to the master branch, they can ask developer B to review it (in the form of a pull request)

In this article, you’ll learn:

  • Several handy Git commands, such as git add, git commit, git push, git fetch, git merge, git pull, git checkout and git reset.
  • The stages your code goes through as you run these commands: the working directory, the index, the local repository and the remote repository.

I don’t find it interesting to sweep through hundreds of pages of Git documentation and get bombarded with terminology from all directions. Therefore, I created a visual Git cheat sheet and shared it on LinkedIn. That LinkedIn post generated quite a lot of interest, so let this be a more detailed discussion of what I wanted to convey in it.

Tour in your local neighbourhood…

It all starts with a string of bits …

Yes it does, but we’re not going to start there. We are going to start with a code repository that you have locally. It could have been created in your local environment using git init, or cloned from a remote repository using git clone <url to git repo>.

To grok the concepts, meet Dave! Dave is like any other developer and hates 9am meetings as much as you and I do. He has created a code directory (let’s call it, very specifically, code) which has three files: a.py, b.txt and c.py.

The local codebase is called the working tree or the working directory

Dave understands the value of Git and runs git init. This creates a local Git repository for the codebase. You can see what makes up this repository in the .git folder created inside the code folder. By tracking his code in a Git repository, he knows when and where he introduced various changes.

First, Dave would like to add the three files a.py, b.txt and c.py to his local repository. It’s a two-step process. He first calls git add a.py b.txt c.py. This adds the files to what is known as the index (also called the index tree, the cache or the staging area). This is an interim space between your current code and what is in your local repository. In other words, your code is not recorded in the repository (yet). Think of this as the backstage of a theater. You can make as many changes as you like (actors/actresses, props, script, etc.) backstage, before going on stage.

The index is a space where you prepare/stage what you need to commit to the local repository.

Once you’re happy with the staged changes, you commit the files to your repository by calling git commit -m <message>. This records a snapshot of the tracked files in the repository, using the versions that are currently staged in the index.

Once you commit, your current version of the code is recorded in the local repository as a snapshot with a timestamp and an ID
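To make the two-step flow concrete, here is a minimal Python sketch of the three local areas. It is only a toy model and not how Git is implemented: the ToyRepo class, its dictionaries and the "v1"/"v2" file contents are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy model of the working directory, the index and the local repository."""
    working_dir: dict = field(default_factory=dict)  # file name -> content on disk
    index: dict = field(default_factory=dict)        # file name -> staged content
    commits: list = field(default_factory=list)      # (message, snapshot) pairs

    def add(self, *names):
        # Like `git add`: copy the current content of each file into the index.
        for name in names:
            self.index[name] = self.working_dir[name]

    def commit(self, message):
        # Like `git commit`: the snapshot is taken from the index.
        self.commits.append((message, dict(self.index)))


repo = ToyRepo(working_dir={"a.py": "v1", "b.txt": "v1", "c.py": "v1"})
repo.add("a.py", "b.txt", "c.py")
repo.working_dir["a.py"] = "v2"  # edited after staging, but not staged again
repo.commit("first commit")

message, snapshot = repo.commits[-1]
print(snapshot["a.py"])  # v1: the commit holds what was staged, not the later edit
```

The point of the last line is that a commit captures the backstage (the index), so an edit made after git add stays out of the snapshot until it is staged again.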

You can use git status to see the state of your files at any point. It lists the files that are staged, the files that are tracked but have unstaged changes, and the files that are untracked.

Great, we have done our first commit. But we’re just beginning to have fun. Let’s say Dave makes some changes to a.py and b.txt. Then Dave decides he only wants to add a.py to the index and not b.txt (i.e. he executes git add a.py). This means the new version of b.txt in the working directory will not be part of the index. In other words, a.py is tracked and staged for the commit, while b.txt is tracked but not staged for the commit in this turn.
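The three groups that git status reports can be sketched as a comparison between the last commit, the index and the working directory. The function below is a simplified assumption of that logic (it compares made-up content strings and ignores deleted files), and notes.txt is a hypothetical extra file added only to show the untracked group.

```python
def toy_status(head, index, working_dir):
    """Group files roughly the way `git status` does (toy model, deletions ignored)."""
    # Staged: the index differs from the last commit.
    staged = sorted(n for n in index if index[n] != head.get(n))
    # Tracked but unstaged: the file on disk differs from the index.
    unstaged = sorted(
        n for n in index if n in working_dir and working_dir[n] != index[n]
    )
    # Untracked: on disk, but the index has never heard of it.
    untracked = sorted(n for n in working_dir if n not in index)
    return {"staged": staged, "unstaged": unstaged, "untracked": untracked}


head = {"a.py": "v1", "b.txt": "v1", "c.py": "v1"}         # the first commit
working_dir = {"a.py": "v2", "b.txt": "v2", "c.py": "v1",  # Dave edited a.py and b.txt
               "notes.txt": "scratch"}                     # hypothetical new file
index = dict(head)
index["a.py"] = working_dir["a.py"]                        # git add a.py

status = toy_status(head, index, working_dir)
print(status)
# {'staged': ['a.py'], 'unstaged': ['b.txt'], 'untracked': ['notes.txt']}
```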

Additionally, Dave thinks c.py is not very useful and decides to remove that file from the index. He can do this by calling git rm c.py or git rm --cached c.py. The difference between the two is that git rm c.py also deletes the file from the working directory, whereas git rm --cached c.py removes it only from the index (and keeps the physical copy).

Git sees everything …

Dave has done two commits and they are recorded in the local repository. As I mentioned earlier, one of the advantages of Git is the ability to track changes over time. This means the local repository maintains a history. The history looks like the figure below.

Each commit we make is recorded as a single node in the history. Each commit is distinguished by a unique ID generated with a hashing function (it’s an alphanumeric ID that looks like 47970bc91aa7daec9def3...). You start with a single branch (master) and, as you develop new features, you fork out and create branches. For example, if you’re building a car, you might have a branch called feature/eject_seats. But once you realize ejector seats are a bad idea, you might drop that branch and come back to the master branch. Then you develop feature/doors and, once the feature is complete and tested, you merge the feature to the master branch.
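If you are curious where such an ID comes from, the sketch below reproduces the scheme Git uses by default: a SHA-1 hash over a small header (object type and size, then a NUL byte) followed by the object’s content. It is shown for a file (a "blob"), which is the simplest case; commit IDs are derived the same way from the commit’s own content. The function name git_object_id is mine, not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 over '<type> <size>' + NUL byte + content, as Git does by default."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# An empty file always gets the same well-known ID.
print(git_object_id("blob", b""))
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Any change in content, however small, produces a completely different ID.
print(git_object_id("blob", b"print('vroom')\n"))
print(git_object_id("blob", b"print('vroom!')\n"))
```

Because the ID depends only on the content, two identical files share one ID, and you can never change a recorded snapshot without its ID changing too.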

From local to remote …

Dave knows quite a lot about maintaining his own local repository. But Git is about collaboration. Dave needs to share the codebase, with all its changes over time, with the rest of the team. For that you typically maintain a central remote repository. You can let Git know about the remote repository using:

git remote add origin https://github.com/smart-vroom/_car_.git

Then Dave pushes the changes in the local repository to a remote repository. Say that Dave is working on the default branch (master), then Dave calls git push origin master. This will push the changes in the local repository to the remote repository. Now any developer can see the changes introduced by Dave.

The remote repository is a central online repository that is seen by several collaborators. The remote repository has a history just like the local repository

John has been a bad developer …

Dave is part of a team. In other words, Dave contributes to the code, and so do John, Anna and Tim. The more developers there are, the more problems there are in maintaining the remote codebase. For example, John didn’t tell Dave that he was also working on the master branch, and he pushed to master just before Dave did. Now Dave can’t push to the remote repo, because he doesn’t have the latest version. Dave has to go back, update his local repository and working directory, and resolve any conflicts that occurred because of John’s update. Don’t worry, Git has answers for that too.

Now Dave will go through five steps:

  • Dave tries to push, but it fails with an error, as his local repo doesn’t have the update John pushed
  • Dave needs to update the local repository to be in sync with the remote repo (git fetch) (shown in orange in the Figure below)
  • Dave needs to integrate those updated changes in the local repo into the working directory (git merge). This can result in merge conflicts if John has edited a file Dave has edited (shown in orange in the Figure below)
  • Dave needs to resolve the conflicts by manually editing the conflicted files (shown in blue in the Figure below) – Discussed further below.
  • Dave stages the files, commits them and pushes them to the remote (shown in blue in the Figure below)
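Why the first push fails and the last one succeeds can be sketched with a tiny history graph. The rule assumed here is that a plain push is accepted only when the remote’s latest commit is already part of your local history. The commit names ("c1", "john", "dave", "merge") are made-up labels standing in for real commit IDs.

```python
def is_ancestor(parents, ancestor, commit):
    """True if `ancestor` is reachable from `commit` by following parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


def can_push(parents, local_tip, remote_tip):
    # Accepted only if the local history already contains the remote's tip.
    return is_ancestor(parents, remote_tip, local_tip)


# c1 is the shared starting point; John and Dave each committed on top of it.
parents = {"c1": [], "john": ["c1"], "dave": ["c1"]}
print(can_push(parents, local_tip="dave", remote_tip="john"))   # False: rejected

# After git fetch + git merge, Dave has a merge commit with both as parents.
parents["merge"] = ["dave", "john"]
print(can_push(parents, local_tip="merge", remote_tip="john"))  # True: accepted
```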

You may also come across a command called git pull. It simply performs git fetch and git merge in one go. This means git pull is intrusive and will modify your working directory.

git pull = git fetch + git merge

What happens in the third step is circled in gray to the left of the Figure. When Dave runs git merge, and John has edited the same part of a file that Dave has also edited, the result is a merge conflict. Say John and Dave have both edited b.txt. Now b.txt has a conflict that needs to be resolved. After a merge conflict, the conflicted file will look like it came from another world.

Alright people, put down your flamethrowers, this is quite normal. All it’s saying is that the part between <<<<<<< HEAD and ======= is what you have in your local branch, while the part between ======= and >>>>>>> AB123CDE is what came from the remote. Here AB123CDE identifies the commit being merged in (the latest commit fetched from the remote). Now Dave has three options:

  • Pick John’s update (keep only the yyyy)
  • Keep his own update (keep only the xxxx)
  • Combine the two

Which one he picks depends on what Dave needs. But in the end, it must result in a valid file without the conflict markers that Git inserted. After you resolve the conflict, you can stage the files, commit them and push them to the remote repo. Phew! That was close. Hopefully, John will be more vocal about his work in the future!
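Dave’s three options can be sketched as a small Python function that walks through a conflicted file and keeps one side, the other, or both. This is an illustration of what you do by hand in an editor, not a Git feature; the function name resolve_conflicts and the two "shared" lines in the sample are made up, while xxxx, yyyy and AB123CDE follow the example above.

```python
def resolve_conflicts(text, keep):
    """Resolve conflict hunks in `text`; keep is 'ours', 'theirs' or 'both'."""
    resolved, section = [], None  # section: None (outside a hunk), 'ours' or 'theirs'
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            section = "ours"       # start of the local (HEAD) side
        elif line.startswith("=======") and section == "ours":
            section = "theirs"     # start of the incoming side
        elif line.startswith(">>>>>>>") and section == "theirs":
            section = None         # end of the hunk
        elif section is None or keep in (section, "both"):
            resolved.append(line)
    return "\n".join(resolved)


conflicted = """shared line
<<<<<<< HEAD
xxxx
=======
yyyy
>>>>>>> AB123CDE
another shared line"""

print(resolve_conflicts(conflicted, keep="theirs"))
# shared line
# yyyy
# another shared line
```

Passing keep="ours" keeps only the xxxx and keep="both" keeps xxxx followed by yyyy. In all three cases the marker lines themselves are gone, which is the part that matters.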

The car is getting smarter

Dave’s boss wants him to work on an important feature (feature/self_drive) and has explained all about it. The prospect of several sleepless nights dawns on Dave, as this is a huge thing. Dave also needs a separate branch to work on this feature, because the car can still be released without self_drive.

For that, Dave creates a new branch with git checkout -b feature/self_drive (-b tells Git it’s a new branch; if the branch already exists, you can do git checkout feature/self_drive). A branch is a stream of commits in the Git history that takes its own path. It branches out from an existing branch (say master) and you can add a series of commits to it without affecting the parent branch. Then, once you implement the feature and test it, you can merge it back to the parent (resolving conflicts if there are any).

In Git, branches are maintained in the repo. You can have a look at all the branches you have got by going to the .git/refs/heads directory. It has a file for each branch, and each file simply records the latest commit of that branch.
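Since a branch is just a small file holding a commit ID, listing branches boils down to reading a directory. The sketch below builds a stand-in for the refs/heads folder in a temporary directory so it runs anywhere; the commit IDs in it are placeholders, not real hashes. One caveat I should flag: Git can also keep refs in a single packed-refs file, so in a real repository you may not find a loose file for every branch.

```python
import tempfile
from pathlib import Path


def list_branches(git_dir):
    """Map branch name -> commit ID by reading the files under refs/heads."""
    heads = Path(git_dir) / "refs" / "heads"
    return {
        path.relative_to(heads).as_posix(): path.read_text().strip()
        for path in sorted(heads.rglob("*"))
        if path.is_file()
    }


# Build a stand-in for a .git folder; the IDs are made-up placeholders.
with tempfile.TemporaryDirectory() as tmp:
    heads = Path(tmp) / "refs" / "heads"
    (heads / "feature").mkdir(parents=True)
    (heads / "master").write_text("1" * 40 + "\n")
    (heads / "feature" / "self_drive").write_text("2" * 40 + "\n")
    branches = list_branches(tmp)

for name, commit_id in branches.items():
    print(name, commit_id)
```

Note how a branch name with a slash, like feature/self_drive, simply becomes a file inside a feature folder.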

Once your local developments are ready on the feature/self_drive branch, you can push them to the remote repo with git push origin feature/self_drive. This will create a new branch in the remote repo (if it doesn’t already exist).

Dave made a boo boo …

Dave is back on the master branch and he’s been working on a few bugs in a.py. But he’s gone down a rabbit hole and can’t get out of it. "To hell with the sunk cost fallacy", he thinks, "I’m going to reset my working directory with the file from my last commit in the repo".

All Dave needs to do is run git checkout master -- a.py. It will restore a.py in the working directory from the snapshot of a.py on master in the local repo. If you think there’s a later version of a.py on the remote, you can do a git fetch first and then check the file out from the remote-tracking branch instead (git checkout origin/master -- a.py).

Impress your co-workers with these moves

You can also use git checkout to move between branches. Simply say git checkout feature/eject_seats or git checkout feature/self_drive. You can also check out a specific commit with git checkout <commit ID> (e.g. git checkout 47970bc91a).

If you specify a commit with git checkout, your HEAD points directly at that commit instead of at a branch ref, so HEAD and the branch are pointing at two different things. This is called a detached HEAD.
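You can tell the two situations apart by looking at the .git/HEAD file: normally it contains a line starting with "ref: " that names a branch, and in the detached case it contains a commit ID directly. The helper below is a small sketch of that check; the function name describe_head is mine and the 40-character ID is a made-up placeholder.

```python
def describe_head(head_file_content):
    """Interpret the content of a .git/HEAD file."""
    content = head_file_content.strip()
    if content.startswith("ref: "):
        # Normal case: HEAD names a branch, and the branch names a commit.
        return ("attached", content[len("ref: "):])
    # Detached case: HEAD holds a commit ID directly.
    return ("detached", content)


print(describe_head("ref: refs/heads/master\n"))
# ('attached', 'refs/heads/master')

print(describe_head("1" * 40 + "\n")[0])
# detached
```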

Another useful command is git reset. git reset has three modes:

  • --soft – Affects only the local repo (moves HEAD and the current branch to the given commit; the index and the working tree are untouched)
  • --mixed – Affects the local repo and the index (this is the default mode)
  • --hard – Affects all of the local repo, the index and the working tree

Say Dave needs to reset the index/staging area to remove some changes he has staged, without changing the working directory. He can do this with git reset --mixed HEAD. Here HEAD refers to the latest commit on the branch you have checked out.
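The three modes differ only in how many of the areas they overwrite, which the following toy model tries to capture. It is an assumption-laden sketch: the state is just three dictionaries with made-up "v1"/"v2" contents, and it ignores untracked files, which real Git leaves alone.

```python
def toy_reset(mode, target_snapshot, state):
    """Toy model of `git reset --<mode> <commit>` over three dict-based areas."""
    if mode not in ("soft", "mixed", "hard"):
        raise ValueError(f"unknown mode: {mode}")
    new_state = dict(state)
    new_state["head"] = dict(target_snapshot)             # every mode moves HEAD
    if mode in ("mixed", "hard"):
        new_state["index"] = dict(target_snapshot)        # index is reset too
    if mode == "hard":
        new_state["working_dir"] = dict(target_snapshot)  # files on disk as well
    return new_state


# Dave has edited a.py and staged the edit, but not committed it.
state = {
    "head": {"a.py": "v1"},
    "index": {"a.py": "v2"},
    "working_dir": {"a.py": "v2"},
}

after = toy_reset("mixed", state["head"], state)  # like git reset --mixed HEAD
print(after["index"])        # {'a.py': 'v1'}: the staged change is gone
print(after["working_dir"])  # {'a.py': 'v2'}: the edit on disk survives
```

With mode "hard" the last line would print v1 as well, which is why --hard is the one to be careful with.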

And here are some other commands that will come in handy:

  • git stash – If you want to temporarily stash your updates in the working tree
  • git diff – Shows the differences between two versions of your files (e.g. working tree vs index)
  • git rebase – If you want to move the start of your branch to a later commit that happened after you branched out, this is the command to use
  • git reflog – Shows the history of where HEAD and your branch refs have pointed (git log shows the commit history itself)

Conclusion

Let’s look back at what we have learned. We learned:

  • Why Git is important
  • The differences between the working directory, the index, the local repository and the remote repository
  • Basic commands like git add, git commit, git push
  • Resolving merge conflicts with git fetch and git merge
  • Git branches and their purpose
  • git checkout and git reset to correct mistakes

I hope that, having gone through this article, you’ll find reading the Git documentation (e.g. git-scm) on how Git works way easier.


If you enjoy the stories I share about Data Science and machine learning, consider becoming a member!

Join Medium with my referral link – Thushan Ganegedara

