Why Git And How To Use Git As A Data Scientist

Admond Lee
Towards Data Science
8 min readFeb 23, 2019

--

Perhaps you’ve heard of Git somewhere else.

Perhaps someone told you that Git is only for software developers and being a data scientist simple couldn’t care less about this.

If you’re a software engineer turned data scientist, this topic is something very familiar to you.

If you’re as aspiring data scientist from different background hoping to get into this field, this topic is something that you’ll be interested in — be it now or in the near future.

If you’re already a data scientist, then you’ll know what and why I’m writing about here.

At the end of the article, I hope that my sharing of experience in Git will let you understand the importance of Git and how to use that in your data science work as a beginner in data science.

Let’s get started!

So… What is Git?

Git is a distributed version-control system for tracking changes in source code during software development

— Wikipedia

Looking at this definition given by Wikipedia, I was once in your position before thinking that Git is made for software developers. And me as a data scientist has nothing to do with that to somehow comfort myself.

In fact, Git is the most widely used modern version control system in the world today. It is the most recognized and popular approach to contribute to a project (open source or commercial) in a distributed and collaborative manner.

Beyond distributed version control system, Git has been designed with performance, security and flexibility in mind.

Now that you’ve understood what Git is, the next question in your mind might be, “How does it relate to my work if I’m the only one doing my data science project?”

And it’s understandable to not be able to grasp the importance of Git (like what I did last time). Not until I started working in real world environment that I was so grateful for learning and practising Git even when I worked on my personal projects alone — which you’ll know why in the later section.

For now, keep reading.

Why Git?

Let’s talk about the WHY.

Why Git?

A year ago, I decided to learn Git. I shared and published my simulation codes which I did for my final year thesis project at CERN for the very first time on GitHub.

While having hard time to understand the terms that are commonly used in Git (git add, commit, push, pull etc.), I knew that this is important in data science field and being a part of the contributors towards open source codes made my data science work much more fulfilled than ever.

So I continued learning, and kept “committing”.

My experience in Git came in handy when I joined my current company where Git is the main way of code development and collaboration across different teams.

Even more, Git is especially useful when your organization follows agile software development framework where the distributed version control of Git makes the whole development workflow much efficient, faster and easily adaptable to changes.

I’ve talked about version control quite a few times. So what exactly is version control?

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

Say for example you are a data scientist working with a team where you and another data scientist working on the same function to build a machine learning model. Cool.

If you make some changes on the function and uploaded to a remote repository and the changes are merged with the master branch, your model now becomes version 1.1 (just an example). Another data scientist also makes some changes on the same function with the version 1.1 and the new changes are now merged with the master branch. Now the model becomes version 1.2. At any point in time, if your team finds out that version 1.2 has some bugs during the release they can always recall the previous version 1.1.

And that’s the beauty of version control

— Me

How to use Git as a Data Scientist

(Source)

We’ve talked about what Git is and its importance.

The question now boils down to: How to use Git as a data scientist?

You don’t need to be an expert in Git to be a data scientist, neither am I. The key here is to understand the workflow of Git and how to use Git in your day to day work.

Bear in mind that you won’t be able to memorize all the Git commands. Feel free to google for that whenever needed, as what everyone else does. Be resourceful.

I’ll focus on using Git in Bitbucket (FREE for use). Of course, the workflow here is applicable to GitHub as well. To be exact, the workflow that I’m using here is Git Feature Branch Workflow, which is commonly used by open source and commercial projects.

If you want to learn more about the terminology used here, this is a great place to start.

Git Feature Branch Workflow

The Feature Branch Workflow assumes a central repository, and master represents the official project history.

Instead of committing directly on their local master branch, developers create a new branch every time they start work on a new feature.

Feature branches can (and should) be pushed to the central repository. This makes it possible to share a feature with other developers without touching any official code — master branch in this case.

Before you start doing anything, type git remote -v to make sure your workspace is pointing to the remote repository that you want to work with.

1. Start with the master branch and create a new branch

git checkout master
git pull
git checkout -b branch-name

Provided that the master branch is always maintained and updated, you switch to the local master branch and pull the latest commits and code to your local master branch.

Let’s assume that you want to create a local branch to add a new feature to the code and upload the changes later to the remote repository.

Once you get the latest code to your local master branch, let’s create and checkout a new branch called branch-name and all changes will be made on this local branch. This means your local master branch will not be affected whatsoever.

2. Update, Add, Commit and Push your changes to the remote repository

git status
git add <your-files>
git commit -m 'your message'
git push -u origin branch-name

Okay. There is a lot of stuff going on here. Let’s break it down one by one.

Once you’ve made some updates to add the new feature to your local branch-name and you want to upload the changes to the remote branch in order to be merged to the remote master branch later.

git status will therefore output all the file changes (tracked or untracked) made by you. You will decide what files to be staged by using git add <your-files> before you commit the changes with messages using git commit -m 'your message'.

At this stage your changes only appear in your local branch. In order to make your changes appear in the remote branch on Bitbucket, you need to push your commits using git push -u origin branch-name.

This command pushes branch-name to the central repository (origin), and the -u flag adds it as a remote tracking branch. After setting up the tracking branch, git push can be invoked without any parameters to automatically push the new-feature branch to the central repository on Bitbucket.

3. Create a Pull Request and make changes to the Pull Request

Great! Now that you’ve successfully added a new feature and pushed the changes to your remote branch.

You’re so proud of your contribution and you want to get feedback from your team members before merging the remote branch with the remote master branch. This gives other team members an opportunity to review the changes before they become a part of the main codebase.

You can create a pull request on Bitbucket.

Now your team members have taken a look at your code and decided to require some other changes from you before the code can be merge into the main codebase — master branch.

git status
git add <your-files>
git commit -m 'your message'
git push

So you follow the same steps as before to make changes, commits and finally push the updates to the central repository. Once you’ve used git push , your updates will be automatically shown in the pull request. And that’s it!

If anyone else has made changes in the destination to the same code you touched, you will have merge conflicts, and this is common in the normal workflow. You can see here on how to resolve merge conflicts.

Once everything is done without problems, your updates will be finally merged with the central repository into the master branch. Congratulations!

Final Thoughts

(Source)

Thank you for reading.

When I first started learning Git and I felt very frustrated as I still didn’t really understand the workflow — the big picture — despite my understanding of the terminology in general.

This is one of the main reasons of writing this article to really break down and explain the workflow to you on a higher level of understanding. Because I believe that having a clear understanding of what’s going on in the workflow will make the learning process more effective.

I hope this sharing is beneficial to you in some ways.

As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn. Till then, see you in the next post! 😄

About the Author

Admond Lee is currently the Co-Founder/CTO of Staqthe #1 business banking API platform for Southeast Asia.

Want to get free weekly data science and startup insights?

Join Admond’s email newsletter — Hustle Hub, where every week he shares actionable data science career tips, mistakes & learnings from building his startup — Staq.

You can connect with him on LinkedIn, Medium, Twitter, and Facebook.

--

--

Co-Founder & CTO @ Staq | Building the universal API to help fintech companies access financial data from SMEs across Southeast Asia 🚀