How Git truly works

A deep dive into the internals to understand and master Git

Alberto Prospero

Published in Towards Data Science, May 24, 2022

#Introduction

Git is undoubtedly one of the cornerstones of modern software development. It is the must-have tool for coordinating work among developers, and over the years it has become a fundamental engine of the open-source movement. To give a sense of scale, as of November 2021 GitHub, the largest hosting platform for Git repositories, reported having over 73 million developers and more than 200 million repositories.

Many programmers deal with Git every day and routinely apply its key concepts. In this article, we are going to take the next step by diving into the internals and exploring Git’s basic foundations. What is a branch? What is HEAD? What does it mean to merge a branch? Today, we are going to answer these and other questions.

Before we begin, I would like to give special thanks to Raju Gandhi, who helped create this article through his wonderful lecture on “Git next steps”, which can be found on O’Reilly. The clarity and completeness of his explanations were a source of inspiration for me.

#The foundations

Blobs, trees, and commits are the main components of Git’s data structure. Just as a house is built of bricks, or a graph is formed by edges and nodes, these elements form Git’s foundations.

To understand them, let us start with an example. Assume we create an empty repository. When we run the command git init, Git automatically creates a hidden folder named .git, which it uses to store its internals.

Blobs

Now, suppose we create a file named myfile.txt and add it to our repository with the command git add myfile.txt.

When we perform this operation, Git creates a blob: a file located in the sub-folder .git/objects that stores the content of myfile.txt, without any related metadata (such as the file name, the creation timestamp, or the author). Hence, creating a blob is like taking a snapshot of the content of the file.

The name of the blob is derived from the hash of its content (SHA-1 by default, computed over the content preceded by a short header with the object type and size). Once the content is hashed, the first two characters of the hash are used to create a sub-folder in .git/objects, while the remaining characters constitute the name of the blob.

In summary, when adding a file to Git, the following steps occur:

Git takes the content of the file and hashes it
Git creates a blob within the .git/objects folder. The first two characters of the hash are used to create a sub-folder in this path. Within it, Git creates the blob having a name formed by the remaining characters of the hash.
Git stores the content of the original file (in zlib-compressed form) within the blob.
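The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than Git's actual implementation: it assumes the default SHA-1 object format, and the function name blob_object is made up for this example.

```python
import hashlib
import zlib


def blob_object(content: bytes):
    """Return (hash, path inside the repository, compressed bytes) for a blob."""
    # Git hashes a short header ("blob <size>" plus a NUL byte) followed by the content.
    store = b"blob " + str(len(content)).encode() + b"\x00" + content
    digest = hashlib.sha1(store).hexdigest()
    # The first two hex characters name the sub-folder, the remaining 38 name the file.
    path = ".git/objects/" + digest[:2] + "/" + digest[2:]
    # The file on disk holds the zlib-compressed header and content.
    return digest, path, zlib.compress(store)


digest, path, compressed = blob_object(b"hello\n")
print(digest)
print(path)
```

Nothing about the file name enters the computation, which is why the blob carries only the content.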

Description of the process that Git carries out when creating a blob (image by Author)

Note that if we have a file named myfile.txt and another file named ourfile.txt, and both of them share the same content, they have the same hash, and so they are stored in the same blob.

Also notice that if we slightly modify myfile.txt and re-add it to the repository, Git carries out the same process, and since the content has changed, the hash changes too and a new blob is created.
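Both observations follow directly from naming blobs by the hash of their content. The sketch below uses a plain dictionary as a hypothetical stand-in for .git/objects (the names objects and add are assumptions for this example, not part of Git).

```python
import hashlib


def blob_hash(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical in-memory stand-in for .git/objects: hash -> content.
objects = {}


def add(content: bytes) -> str:
    h = blob_hash(content)
    objects[h] = content  # identical content lands in the same slot
    return h


my_hash = add(b"shared text\n")       # myfile.txt
our_hash = add(b"shared text\n")      # ourfile.txt, same content
edited_hash = add(b"shared text!\n")  # myfile.txt after a small edit

print(my_hash == our_hash)      # the two files share one blob
print(edited_hash == my_hash)   # the edited file gets a new blob
print(len(objects))             # number of distinct blobs stored
```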

Trees

Assume now we create a sub-folder in our repository named mysubfolder. Let us also create a file named yourfile.txt in this sub-folder and add it to the repository. In so doing, Git creates a new blob for yourfile.txt following the process we described in the previous section.

Git hashes the second file named yourfile.txt, which is stored in the folder .git/objects (image by Author)

At this point, we commit both myfile.txt and yourfile.txt with the command git commit. When doing this, Git takes two steps:

It creates a root tree of the repository
It creates the commit

Let us focus on the first step. So, what is a root tree? A root tree stores the structure of files and folders of the entire repository. It is a file containing a reference to every blob and sub-folder at the top level of the repository, and each sub-folder is described in turn by a tree of its own, so the whole structure is built recursively.

Each row of the root tree references a blob or a sub-tree, together with its name and file mode, and each sub-tree references other blobs or sub-trees in the same way. Hence, the tree is the equivalent of a directory: just as we can access files and sub-folders from a directory, so we can access blobs and sub-trees from a tree.

Content of the root tree and the sub-tree related to mysubfolder (image by Author)

Once Git has created the root tree and all related sub-trees, it performs the same hashing and storing operations we described above. More precisely, it hashes each tree and uses the first two characters of the hash to create a sub-folder in .git/objects, while the remaining characters form the name of the saved file. Hence, this process produces as many new files as there are trees in the data structure.
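To make the recursion concrete, here is a hedged sketch of how the two trees in our example could be built and hashed. It assumes the SHA-1 object format and the usual tree entry layout (mode, name, NUL byte, raw 20-byte hash); the helper names hash_object and tree_body and the file contents are made up for this illustration.

```python
import hashlib


def hash_object(kind: bytes, body: bytes) -> str:
    """Hash an object the way blobs are hashed: header, NUL byte, then the body."""
    header = kind + b" " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize tree rows given as (mode, name, hex hash) tuples."""
    def sort_key(entry):
        mode, name, _ = entry
        # Rows are ordered by name; directories compare as if they ended with "/".
        return (name + "/").encode() if mode == "40000" else name.encode()

    body = b""
    for mode, name, hex_hash in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_hash)
    return body


# Placeholder file contents, chosen only for this example.
my_blob = hash_object(b"blob", b"my content\n")
your_blob = hash_object(b"blob", b"your content\n")

# The sub-tree for mysubfolder references the blob of yourfile.txt.
sub_body = tree_body([("100644", "yourfile.txt", your_blob)])
sub_tree = hash_object(b"tree", sub_body)

# The root tree references the blob of myfile.txt and the sub-tree.
root_body = tree_body([
    ("100644", "myfile.txt", my_blob),
    ("40000", "mysubfolder", sub_tree),
])
root_tree = hash_object(b"tree", root_body)

print(sub_tree)
print(root_tree)
```

Because the root tree embeds the hash of the sub-tree, any change to yourfile.txt changes the blob hash, then the sub-tree hash, and finally the root tree hash.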

Git hashes the root tree and the sub-tree related to mysubfolder, and both are stored in the folder .git/objects (image by Author)

Commits

When running the command git commit, the second step is the creation of the commit. The commit content is stored in a file containing the hash of the root tree, the hash of the parent commit (if any), and some metadata such as the name and e-mail of the author and committer, the timestamp, and the commit message.

Once the commit file is created, Git hashes its content and uses the hash to name the file that stores it, exactly as above (the first two characters form the sub-folder name in .git/objects, while the remaining part of the hash constitutes the actual file name).
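As a rough sketch, the text of a commit and its hash could be produced as follows. The author, timestamp, and messages are placeholders, the helper names are made up, and the tree hash used is the well-known hash of an empty tree, chosen only to keep the example self-contained.

```python
import hashlib


def commit_body(tree, parents, who, when, message):
    """Assemble the text of a commit: tree, parents, author, committer, message."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]  # the first commit has no parent line
    lines.append("author " + who + " " + when)
    lines.append("committer " + who + " " + when)
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


def commit_hash(body: bytes) -> str:
    header = b"commit " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


# Placeholder values, assumed for illustration only.
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who = "Jane Doe <jane@example.com>"
when = "1653350400 +0000"

first_body = commit_body(tree, [], who, when, "First commit")
first = commit_hash(first_body)

# The second commit records the first one as its parent.
second_body = commit_body(tree, [first], who, when, "Second commit")
second = commit_hash(second_body)

print(first_body.decode())
print(second_body.decode())
```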

Structure of all the trees, commits, and blobs created so far (image by Author)

And that is it! Congratulations, you now understand how Git is structured. With these concepts in hand, it is extremely simple to define the notions of branch, tag, HEAD, and merge!

#The bricks

Branches

Branches are named references to a commit. When we create a new branch named mybranch (for example with the command git checkout -b mybranch), Git generates a new file named mybranch in the path .git/refs/heads. The content of this file is the hash of the commit from which the branch is created.

Initially both master and mybranch point to the same commit (image by Author)

Then, when we commit on mybranch, Git performs the operations described previously (it creates the root tree and the commit file) and then updates the branch file with the new commit hash.

A new commit is performed and the file mybranch is updated with its hash. The file mybranch now points to the new commit (image by Author)

Hence, branches are files tracking commits, and the content of these files is updated at every commit we perform.
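The sketch below mimics this behavior on a throwaway directory instead of a real repository. The commit hashes are placeholders, and the model assumes every branch lives in its own file under refs/heads (real repositories may also keep some references packed into a single file).

```python
import tempfile
from pathlib import Path

# A scratch folder standing in for the .git directory.
git_dir = Path(tempfile.mkdtemp()) / ".git"
heads = git_dir / "refs" / "heads"
heads.mkdir(parents=True)

first_commit = "a" * 40   # placeholder hash of the commit master points to
second_commit = "b" * 40  # placeholder hash of a new commit made on mybranch

# master points to the first commit.
(heads / "master").write_text(first_commit + "\n")

# Creating mybranch writes a new file holding the same commit hash.
(heads / "mybranch").write_text((heads / "master").read_text())

# Committing on mybranch rewrites only the file of that branch.
(heads / "mybranch").write_text(second_commit + "\n")

print((heads / "master").read_text().strip())
print((heads / "mybranch").read_text().strip())
```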

HEAD

HEAD performs a few tasks in Git:

It is how Git knows which commit is checked out, so when we run git branch, Git looks at HEAD to know which branch we are on.
It references the parent of the next commit, so the commit that HEAD points to will be the parent of the next commit. Recall that when we perform a commit, the parent commit is stored in the commit file.

If we are on branch master, HEAD is referencing this branch. If we open the HEAD file we see “ref: refs/heads/master”. Instead, if we switch to the branch mybranch and open the HEAD file in the .git folder we see: “ref: refs/heads/mybranch”. Hence, HEAD does not point to a commit directly, but rather to a branch which in turn points to the latest commit on that branch. In this way, Git tracks which commit is currently checked out.
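This two-step lookup can be sketched as follows, using a dictionary as a stand-in for the files in the .git folder. The function name resolve_head and the commit hashes are assumptions made for this illustration.

```python
def resolve_head(files):
    """Return (branch reference or None, commit hash) for the current HEAD.

    files maps paths inside the .git folder to their text content.
    """
    head = files["HEAD"].strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]        # e.g. refs/heads/mybranch
        return ref, files[ref].strip()   # follow the branch file to the commit
    return None, head                    # HEAD holds a commit hash directly


files = {
    "HEAD": "ref: refs/heads/mybranch\n",
    "refs/heads/master": "a" * 40 + "\n",    # placeholder commit hashes
    "refs/heads/mybranch": "b" * 40 + "\n",
}

branch, commit = resolve_head(files)
print(branch, commit)

# Switching to master only rewrites the HEAD file.
files["HEAD"] = "ref: refs/heads/master\n"
print(resolve_head(files))
```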

We are on branch mybranch. HEAD points to the file mybranch which in turn points to a specific commit. The file master, related to branch master, is pointing to another commit (image by Author)

When we are on a branch and perform a commit, Git reads the content of the HEAD file, resolves the commit it references, and records that commit as the parent of the new commit. In this sense, HEAD provides (indirectly) the parent of the next commit.

Content of a commit file. HEAD (indirectly) provides the parent commit (image by Author)

Now, in Git, we can check out a previous commit and start making changes from there. This is called the “detached HEAD” state. In this situation, HEAD points directly to a commit, and not to a branch. Note that this can be dangerous because we risk losing new commits. In fact, if we make a commit and then check out a branch, no branch references the new commit, so we can only get back to it if we remember its hash or find it in the reflog, and Git may eventually garbage-collect it. This is why it is always good practice to create a new branch before committing any change in the detached HEAD state!
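The risk is easy to see in a toy model. In the sketch below, a dictionary stands in for the files in the .git folder, and the commit function is a simplified assumption: it derives a stand-in hash from the parent and the message rather than computing a real Git commit hash.

```python
import hashlib


def commit(files, message):
    """Create a toy commit and move whatever HEAD currently designates."""
    head = files["HEAD"].strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]
        parent = files[ref].strip()
    else:
        ref, parent = None, head  # detached: HEAD holds the hash itself
    # Stand-in hash, not the real Git commit hash.
    new = hashlib.sha1((parent + message).encode()).hexdigest()
    if ref is not None:
        files[ref] = new + "\n"     # on a branch: the branch file moves
    else:
        files["HEAD"] = new + "\n"  # detached: only HEAD moves
    return new, parent


old_commit = "a" * 40  # placeholder hash of a previous commit
files = {
    "HEAD": old_commit + "\n",              # detached HEAD on the old commit
    "refs/heads/master": "b" * 40 + "\n",
}

orphan, parent = commit(files, "work done while detached")

# Checking out master rewrites HEAD, and nothing names the new commit anymore.
files["HEAD"] = "ref: refs/heads/master\n"
print(orphan in [value.strip() for value in files.values()])
```

Creating a branch before committing would add a file under refs/heads that keeps naming the new commit after HEAD moves away.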

Merge

Merging joins two or more lines of development. There are two types of merge:

The first kind occurs when the two branches have diverged. Git creates a new commit, called a merge commit, which has two parents. The first parent is the latest commit of the branch we are on, while the second parent is the latest commit of the branch being merged. The commit file lists both parents, and the current branch (and HEAD through it) moves to the new commit.
The second kind occurs when the two branches have not diverged, and one branch is simply a continuation of the other. This is called a fast-forward merge, and it is not a real merge because there is nothing to reconcile and no new commit is created. Git just moves the current branch (and HEAD through it) to the commit pointed to by the branch being merged.
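The choice between the two kinds of merge comes down to an ancestry check on the commit graph. Here is a minimal sketch under simplifying assumptions: commits are plain labels, parents maps each commit to the list of its parents, and the content of the files is ignored entirely. The extra "already up to date" case covers the situation where the other branch has nothing new to offer.

```python
def is_ancestor(parents, old, new):
    """True if commit old can be reached from commit new by following parents."""
    stack, seen = [new], set()
    while stack:
        current = stack.pop()
        if current == old:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def merge(parents, branches, current, other):
    ours, theirs = branches[current], branches[other]
    if is_ancestor(parents, theirs, ours):
        return "already up to date"
    if is_ancestor(parents, ours, theirs):
        branches[current] = theirs  # no new commit, the branch just moves
        return "fast-forward"
    merged = "merge(" + ours + "," + theirs + ")"  # toy label for the new commit
    parents[merged] = [ours, theirs]  # first parent: ours, second: theirs
    branches[current] = merged
    return "merge commit"


# Diverged history: B and C both descend from A.
parents = {"A": [], "B": ["A"], "C": ["A"]}
branches = {"master": "B", "mybranch": "C"}
diverged_result = merge(parents, branches, "master", "mybranch")

# Linear history: mybranch is a continuation of master.
linear_parents = {"A": [], "B": ["A"]}
linear_branches = {"master": "A", "mybranch": "B"}
linear_result = merge(linear_parents, linear_branches, "master", "mybranch")

print(diverged_result, branches["master"], parents[branches["master"]])
print(linear_result, linear_branches["master"])
```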

That’s it. Congratulations on making it this far! I hope you enjoyed the article. By now, you should have a good grasp of how Git works. Please feel free to comment if you have any questions!

See you around, stay gold! :)