Hello world! I’m writing my very first article on Medium and decided to share how I introduced GitHub to my data team to promote collaboration.
I’ll talk about why GitHub can help a team (more than two data folks) collaborate more effectively, and how you can integrate GitHub into your day-to-day data analytical workflows. This should help you establish some form of collaborative framework within your team.
In this article, I will focus on the following topics:
- Why GitHub
- What you need before starting
- Branching Strategy and Directories
- What else you can add
Why GitHub
If you work in the data space and have worked with at least one other teammate, chances are you have run into collaboration issues: making sure you are working on the latest version of the code, tracking who edited what, or trying to recall the purpose of a piece of code that was last edited XX days ago.
These are just some of the pain points I have heard from fellow data folks and experienced myself, though I suspect they are merely the tip of the iceberg.
So how exactly can GitHub tackle these?
GitHub hosts projects that are tracked with Git, a version control system (VCS), so analysts can work on the same data project together. Any changes made to the project can be recorded, tracked and reviewed by fellow analysts, and earlier versions of the work can be recovered.
While GitHub is most commonly used in software development, we can take a leaf out of that book and incorporate some of its practices into data analytical workflows.
What you need before starting
Before introducing GitHub to your team, it’s important to first understand the basics of Git and GitHub: how to add, commit and push files, and how to create pull requests. I will not go into these basics, as there are already plenty of resources and articles that provide comprehensive guides to mastering them.
Once you have a basic grasp of how Git works, the next thing to decide is what type of branching strategy to employ for your team.
Branching Strategy and Directories
What exactly is a branching strategy? It’s essentially the set of rules that a team, typically software developers, follows to write, merge and ship code in a VCS like Git. Choosing a suitable model matters because it determines when and how changes are made and committed back to the codebase.
Personal Branches
The simplest form of branching strategy is to have the main branch serve as the main source of the codebase, and have each team member create a branch off main with their name as the branch name (e.g. person-A, person-B). All of the personal branches merge into main. This works when the team is small (fewer than three people) and the work involves little complexity (e.g. new projects are infrequent).
As the projects scale up, this approach may be insufficient to handle the various commits and could lead to confusion.
Here’s a snippet showing what it looks like.
First, create a new repository:
## git template for creating a new repository on the command line
echo "# playground" >> README.md
git init
git add README.md
git commit -m "first commit"
git branch -M main
git remote add origin <git url>
git push -u origin main
Now that you have a main branch that tracks the remote branch main from origin, you and your team can create new branches from main.
# new branches for each teammate, each one starting from main
git checkout -b "person-A" main
git checkout -b "person-B" main
git branch # list local branches; the current one is marked with *

BAM. Now you have two branches, one per user, where each person can work on their projects separately and merge into main once part or all of a project is completed.
Project-based Branching
A more efficient yet lightweight approach is to adopt a project-based branching model, which is essentially a simplified version of Git Flow. The main branch serves as the main source of the codebase, and project branches are used to develop new projects that are later merged into main.
# create each project branch from main
git checkout -b 'project-A' main
git checkout -b 'project-B' main
git branch # list local branches

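If there is more than one user working on the same project, more downstream branches can be created from the main project branch, e.g. project-A-person-A, project-A-person-B. Avoid slash-separated names such as project-A/person-A here: Git stores branch names like file paths, so a branch called project-A/person-A cannot exist alongside a branch called project-A.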
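To see why slash-separated names can collide with the project branch, here is a minimal sketch in Python. The helper find_ref_conflicts is hypothetical (it is not part of Git or of any library); it only mimics the rule that a branch name cannot also be a "folder" for another branch, so you can sanity-check a naming scheme before creating the branches.

```python
def find_ref_conflicts(branches):
    """Return (parent, child) pairs where child is nested under an existing branch name.

    Hypothetical helper: it imitates Git's rule that a branch such as
    "project-A" cannot coexist with "project-A/person-A".
    """
    names = set(branches)
    conflicts = []
    for name in sorted(names):
        parts = name.split("/")
        # Check every proper prefix of the name, e.g. "a/b/c" -> "a", "a/b".
        for i in range(1, len(parts)):
            prefix = "/".join(parts[:i])
            if prefix in names:
                conflicts.append((prefix, name))
    return conflicts
```

Calling it with main, project-A and project-A/person-A reports one conflicting pair, while the hyphenated name project-A-person-A reports none.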
Double BAM.
How it works
Let’s take a look at how a change is incorporated, using project-based branching and a team member, Dave, working on a project titled "analytics101" as an example.
These are some of the branches that will exist:
- main
- analytics101
- analytics101-dave (present only if more than one person is working on the same project; a slash-separated name such as analytics101/dave would clash with the existing analytics101 branch)
- etc.
Once Dave has completed part or all of the project within the analytics101 branch, he commits his changes there and pushes the branch, so the work can then be merged into main for everyone else to reference.
git add -A # stage all the changes
git commit -m "Add analytics101 analysis"
git push -u origin analytics101 # first push: also sets the upstream branch
He can then create a Pull Request on GitHub to have these changes reviewed and merged into the main branch.
Directory Structure
It is important to note that a branch is not the same as a directory. The directories within your Git repository should be set up in a way that makes sense for the team, much like how files are organised into categories and levels of folders, whether locally or on a shared enterprise cloud.
However, I do recommend keeping one folder per project to avoid complicating things, with each folder containing a README.md that describes the project.
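As a rough illustration of the one-folder-per-project layout, the sketch below creates a project folder with a starter README.md. The function name scaffold_project and the README wording are assumptions made for this example, not a prescribed convention.

```python
from pathlib import Path


def scaffold_project(repo_root, project_name, description=""):
    """Create <repo_root>/<project_name>/README.md if it does not exist yet.

    Hypothetical helper for illustration; returns the path to the README.
    """
    project_dir = Path(repo_root) / project_name
    project_dir.mkdir(parents=True, exist_ok=True)
    readme = project_dir / "README.md"
    # Never overwrite a README that a teammate has already written.
    if not readme.exists():
        readme.write_text(f"# {project_name}\n\n{description}\n", encoding="utf-8")
    return readme
```

For example, scaffold_project(".", "analytics101", "Exploratory analysis.") run from the repository root would create analytics101/README.md, ready to be committed on the project branch.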
What else you can add
Once you have GitHub and a branching strategy in place, you can add on more tools that can help the team to collaborate further!
Environments
A common programming tool that data analysts use is Python, an object-oriented language well suited to data wrangling, visualisation and machine learning. Most analysts interact with Python via Jupyter Notebook, either launched directly from a Conda environment or through an IDE such as Visual Studio Code. One issue with using Python and notebooks when collaborating with several teammates is the risk of code breaking because teammates have different versions of its dependencies installed.
This is where environments and Git come in handy. For each new project, simply create and activate a new environment and store all the dependencies in a file (requirements.txt). Anyone can then go to the project subdirectory, activate a fresh environment and install the required dependencies to run the code successfully.
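To make the version problem concrete, here is a small sketch that compares the packages installed in the active environment against the pins in a requirements.txt. It assumes the file only contains simple name==version lines (no extras, markers or URLs), and the helper names parse_requirements and find_mismatches are made up for this example.

```python
from importlib import metadata


def parse_requirements(text):
    """Parse simple 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

A teammate could read the project's requirements.txt, pass its text to parse_requirements, and check that find_mismatches returns an empty dict before running a notebook.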
Inserting Credentials
If you find yourself having to query data into your local machine, one issue you might face is having to manually insert your credentials into an engine. There are numerous ways around this; one is to use environment variables via the python-dotenv package.
This method works well with Git: simply add the .env file to your .gitignore, and the file containing all your credentials will stay on your local machine only and never be committed.
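If you would rather not edit .gitignore by hand, the sketch below appends .env to it only when the entry is missing. The helper ensure_ignored is hypothetical and deliberately simple: it looks for an identical line and does not try to interpret wildcard patterns.

```python
from pathlib import Path


def ensure_ignored(gitignore_path, entry=".env"):
    """Append entry to the .gitignore file unless an identical line already exists.

    Hypothetical helper; returns True if the file was changed.
    """
    path = Path(gitignore_path)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if entry in (line.strip() for line in lines):
        return False
    lines.append(entry)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True
```

Keep in mind that .gitignore only affects untracked files. If a .env file has already been committed, it also needs to be removed from tracking (for example with git rm --cached .env), and the credentials in it should be treated as exposed.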
Once you have your credentials stored in a .env file, you can load the variables.
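# load env variables (the % lines are IPython magics, so run this in a notebook)
import os
%load_ext dotenv
%dotenv
# the DB_ prefix avoids clashing with the USERNAME variable that Windows already sets
USERNAME = os.getenv("DB_USERNAME")
PASSWORD = os.getenv("DB_PASSWORD")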
Now that you have your credentials loaded as variables, you can simply pass them to your SQL engine to extract the relevant data.
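As one possible way to hand the credentials to an engine, the sketch below assembles a database URL from environment variables. Everything specific here is an assumption for illustration: the variable names (DB_USERNAME, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME), the PostgreSQL scheme and the default host and port. Adapt them to your own database.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=None):
    """Build a connection URL from environment variables (names are assumptions)."""
    env = os.environ if env is None else env
    user = quote_plus(env["DB_USERNAME"])
    # Escape characters such as @ or / so they cannot break the URL.
    password = quote_plus(env["DB_PASSWORD"])
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env["DB_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```

A URL in this form can be passed to a library such as SQLAlchemy (create_engine accepts a URL string), so the password never has to be typed into a notebook cell.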
Triple BAM.
Hopefully, with all of these in place, anyone can get started with Git and incorporate it into their data analytical workflows.
Please leave comments 💬 if there’s more to add and I will be happy to include them in an edit!
Thanks for reading! 🙂