Contributing to open source for beginners

Or how I went from barely knowing how to use GitHub to 4 accepted pull requests in just a few weeks

A quick word on open source

Let’s start with what open source is: code that is made freely available for modification and redistribution, and that is usually developed collaboratively.

A lot of tools used in data science are open source, from Python packages like pandas and matplotlib to programs like Hadoop, and the fact that anyone can use them for free has greatly widened access to the field. Knowing how widely used and effective some of these tools are, I was surprised to hear that volunteers made them available and still regularly maintain them in their spare time.

Source control allows for multiple people to work on the same code base at once. Photo by Yancy Min on Unsplash

How can I get started?

I first got the idea of contributing to open source while listening to the Linear Digressions podcast. Before that, I thought contributing to open source was reserved for software engineers, but one of their episodes covered the data science open source ecosystem, and I realized that a lot of the tools I use are part of it.

However, at the time it was still hard to imagine how all the pieces fit together. I had looked at a few GitHub repositories and they seemed a bit counter-intuitive, so I parked the idea for later, but kept it in the back of my mind as something I’d like to do someday.

Then came September, and I saw a tweet about Hacktoberfest, an event where, over the course of a month, you make 4 contributions to open source and learn along the way. It seemed like the perfect opportunity to try it out in an organized manner, so I signed up.

Photo via DigitalOcean

The basic resources

The first thing I did was follow a GitHub tutorial. Sure, I had uploaded some code before, but I didn’t know the ins and outs of pull requests or merge conflicts, or even what a branch was.

There are many GitHub tutorials out there, but I followed the one on the event’s main page and it was really useful. You can find it here.

After that, I needed to find a few projects that needed help. What helped me was finding out that many repositories mark the issues that beginners can help with using the label ‘good first issue’ or something similar. You can find examples in pandas, NumPy and matplotlib.

To generalize, you can go to any GitHub repository, switch to the ‘Issues’ tab, and filter by the appropriate label.
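
If you would rather run that search from a script, here is a minimal sketch that asks GitHub’s REST API for the open issues carrying a given label. It is not something I used during Hacktoberfest. The helper names (`build_issues_url`, `unclaimed`, `fetch_issues`) are made up for illustration, and the default label text is an assumption, since each repository names its beginner label a little differently.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com"


def build_issues_url(owner, repo, label="good first issue", per_page=20):
    """Return the REST API URL that lists open issues carrying `label`.

    The label default is an assumption; check the repository's own labels.
    """
    query = urlencode(
        {"labels": label, "state": "open", "per_page": per_page},
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    return f"{API_ROOT}/repos/{owner}/{repo}/issues?{query}"


def unclaimed(issues):
    """Keep real issues (not pull requests) that nobody is assigned to.

    The issues endpoint also returns pull requests; those items carry a
    "pull_request" key, so they are skipped here.
    """
    return [
        issue
        for issue in issues
        if "pull_request" not in issue and not issue.get("assignees")
    ]


def fetch_issues(owner, repo, label="good first issue"):
    """Download one page of matching issues (unauthenticated, rate-limited)."""
    request = Request(
        build_issues_url(owner, repo, label),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urlopen(request) as response:
        return json.load(response)
```

Calling `unclaimed(fetch_issues("pandas-dev", "pandas"))` would give you a list of issue dictionaries, each with fields such as `number` and `title`. An empty assignee list does not guarantee that nobody has started on an issue, so it is still worth reading the comments before picking one.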

Matplotlib list of open issues. Source: github.com

My experience with pandas and matplotlib

I’m not a software engineer. If you found this tutorial and aren’t one either, you’re in luck: there’s still a lot we can do! Some easy things to start with are fixing documentation issues or working on a specific function you’ve used and have experience with.

For my first contribution to the data space, I fixed code style in the pandas documentation. You can see the open issue [here](https://github.com/pandas-dev/pandas/pull/36802): the docs were using coding style conventions that pandas no longer follows, so I had to run some Python files through a tool to re-format them and merge the result back into the main branch. The final pull request I made can be seen here. It required some knowledge of Python formatting, plus downloading and running a tool from GitHub, so if you know how to open a Python file and how to install something, you could easily do something similar!

My second issue was from matplotlib, and involved retiring the glossary from the documentation page along with all references to it. I thought the task would be easy, but it turned out to involve building the docs locally and writing some HTML. Luckily, the maintainers were super helpful and guided me along the way. My second pull request with all the details can be found here.

What did I learn on the way?

There are a few things I want to leave you with before you click away from this article.

  1. Open Source maintainers are incredible people. They are volunteers, and yet they are incredibly helpful and responsive. Everyone I interacted with was happy to help out any time they could, be it on a late evening or during the weekend, even though there was no pressing deadline or company earnings call driving the urgency. These folks are truly passionate about data, and they inspire me to do the same.
  2. Read the contribution guidelines. Most repos have a README file explaining how they prefer to receive contributions. Be aware that an issue might already be assigned or someone might have started working on it, so check the comments on the issue page before starting, and let the maintainers know that you want to work on it.
  3. Do not spam. Open source contributors and maintainers are very busy as it is, and I’ve seen a lot of back and forth on Twitter about the usefulness of Hacktoberfest and similar events, where there is a tendency to spam small improvements just to fulfill a quota of pull requests. Please make sure your open source contribution is useful, or that it responds to one of the issues listed on the Issues tab.
  4. Have fun in the process! I definitely enjoyed contributing to open source, and will do it again as soon as I find the next issue I can help with.
