
5 Things I’ve Learned as an Open Source Machine Learning Framework Creator

If you're an aspiring creator or maintainer of open source machine learning frameworks, you might find these tips helpful.

Notes from Industry

Photo taken by author (Bryce Canyon, UT, on my 2021 road trip)

Creating a successful open source project is difficult, especially in the data science/machine learning/deep learning space. A large number of open source projects never get used and are quickly abandoned. As the creator of Flow Forecast, an open source deep learning framework for time series forecasting, I’ve had my fair share of both successes and pitfalls. Here is a compilation of the tips I have for aspiring creators and maintainers of open source machine learning frameworks.

1. Documentation, documentation, and documentation

Having good documentation, tutorials, and getting-started guides is probably the most important aspect of any open source framework. In a work setting, knowledge of the codebase is usually shared through meetings, mentoring, and pair programming sessions; with open source projects, the primary way people learn is through documentation, tutorials, and examples. Moreover, data scientists are often extremely busy, and if your documentation or getting-started guide isn’t clear they will often move on to a framework that is easier to use. This is often true regardless of the promised performance of your models.

To help people start using your framework easily, I recommend having a simple, no-frills getting-started tutorial. Beyond that, you will also need a place for more detailed documentation of methods and classes, which is useful for contributors and advanced users alike. I’ve found that ReadTheDocs works well for this purpose. For data science projects (particularly ML-heavy ones) you might also want a site that contains conceptual information and notes on model performance. For FF in particular we decided to use Confluence from Atlassian. This provides the perfect place for more conceptual information and model results without cluttering the code documentation.

Documentation, of course, is quite time consuming to write and maintain, so you should devote ample time to it. A trick (if you can call it that) I discovered is that it is easiest to write the documentation as you write the code. The more time passes, the foggier your recollection of what the code does becomes. Eventually you and your fellow maintainers won’t even remember that a certain feature exists in the codebase. So write your docs immediately!

Another practice I found useful is to start a design document before making any design changes. The design document then becomes a living memory of the changes to the project and the reasons for making them. Another trick is to use extensions like Python DocString Generator in VSCode. These automatically generate doc-strings in the proper format for rendering on your online documentation site.
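For a sense of what this looks like, here is the kind of Google-style doc-string such extensions scaffold for you (the function itself is a hypothetical example, not FF code):

```python
def train_model(model, data_loader, epochs=10):
    """Train a forecasting model on the supplied data.

    Args:
        model: The PyTorch model to train.
        data_loader: A DataLoader yielding (features, targets) batches.
        epochs: Number of passes over the training data. Defaults to 10.

    Returns:
        The trained model, ready for evaluation.
    """
    ...
```

Doc-strings in a consistent format like this render cleanly on ReadTheDocs, so writing them as you code pays off twice.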

  1. "If you build it they will NOT come"

Unfortunately, creating a good open source data science framework will not automatically result in people using it. Many times the framework that becomes the most popular isn’t necessarily the best, but the one with the most marketing, branding, and keyword optimization. While many ML developers would prefer to just focus on the technical side of things, marketing and promotion are a necessary evil.

This is something we struggled with early on at Flow Forecast. Despite being the first time series framework to support transformers, we did little promotion. Other frameworks quickly emerged and gained more traction on social media. Another challenge was our framework’s name. Although FF pays homage to the repository’s roots, it sometimes confuses people into thinking we are just a framework for forecasting river flows rather than a general purpose time series forecasting framework. Therefore I’d recommend choosing a fairly generic name from the beginning.

To market your repository well, I recommend regularly writing articles on Medium and posting tutorial videos. This can help drive traffic to your framework and build up stars. You can also (infrequently) announce major releases on the appropriate subreddits. Speaking at local meetups is a great way to spread awareness of your framework in the DS community; most meetups are looking for new speakers and are very supportive of open source projects. Finally, tutorial notebooks that use your framework on popular Kaggle datasets and competitions can also help.

Although stars really shouldn’t mean that much (I think watchers and forks are better indicators), they play a central role in GitHub’s recommendation algorithm and in how others perceive your repo. Therefore it is important to build up stars on the repository over time. If you can get enough stars in a short period of time, your repository will also start trending, which leads to even more stars. Remember to add appropriate topic tags to your repository on GitHub as well, since repositories are ranked within each tag based on stars.

Hosting periodic events can also get more people involved in contributing to your repository. In 2020 we hosted a FF sprint as part of the greater PyData Global events. This year at FF we are planning on hosting a Hacktoberfest Sprint (click for details). During these events it is important to respond to PRs swiftly and be ready to answer questions developers might have. Incentives like T-shirts can also motivate participants to contribute more relevant features.

3. Testing

Testing is challenging, particularly in the machine learning space, as most ML model training is not deterministic. It is best to start out easy and utilize the standard types of unit tests (e.g. test that the proper output shape is returned, test that the training loop runs end-to-end). Make sure to include both unit and integration tests. Tests should be run with tools like CircleCI or Travis CI. You can use CodeCov or other tools to automatically track your code coverage. Once you have basic code coverage and ensure these tests pass, you can start on the trickier tests.
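As a minimal sketch of what such shape and smoke tests might look like (the DummyForecaster model here is hypothetical, not part of FF):

```python
import unittest
import torch


class DummyForecaster(torch.nn.Module):
    """Tiny stand-in model used only for testing (hypothetical)."""

    def __init__(self, n_features: int, forecast_len: int):
        super().__init__()
        self.linear = torch.nn.Linear(n_features, forecast_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, n_features); predict from the last step
        return self.linear(x[:, -1, :])


class TestModelBasics(unittest.TestCase):
    def test_output_shape(self):
        model = DummyForecaster(n_features=3, forecast_len=5)
        batch = torch.rand(8, 20, 3)  # (batch, seq_len, features)
        self.assertEqual(model(batch).shape, (8, 5))

    def test_training_loop_runs(self):
        # Smoke test: a few optimizer steps complete without errors or NaNs
        model = DummyForecaster(n_features=3, forecast_len=5)
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        x, y = torch.rand(8, 20, 3), torch.rand(8, 5)
        for _ in range(3):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
        self.assertTrue(torch.isfinite(loss))


if __name__ == "__main__":
    unittest.main()
```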

For testing things like the validity of test and evaluation loops, you might need to create a dummy deterministic model. You can then run these functions with the dummy model and a small dataset and check that the calculations are indeed accurate. For models, you can write relatively inexpensive convergence "tests" to make sure models converge after several epochs. With these tests you can have peace of mind that your code is functioning properly.
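To illustrate the idea, a deterministic dummy model makes the metric math hand-checkable (this ConstantModel and test are illustrative, not FF code):

```python
import torch


class ConstantModel(torch.nn.Module):
    """Deterministic dummy: always predicts a fixed value, so any metric
    computed over its output has a known, hand-checkable answer."""

    def __init__(self, value: float):
        super().__init__()
        self.value = value

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.full((x.shape[0], 1), self.value)


def test_evaluation_metric_exact():
    model = ConstantModel(2.0)
    x = torch.zeros(4, 10, 3)  # inputs are irrelevant to the output
    y = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
    mae = torch.mean(torch.abs(model(x) - y)).item()
    # |2-1| + |2-2| + |2-3| + |2-4| = 4, so the mean is exactly 1.0
    assert abs(mae - 1.0) < 1e-6
```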

Perhaps one of the trickiest and most time consuming forms of testing is ensuring that models you port to your framework match the original results reported in the paper. This often requires running an experiment with the exact hyper-parameters and pre-processing from the paper. These tests cannot run as a normal part of CI, as they would take too long. Instead they should be run only once in a while to ensure model performance doesn’t change with major framework revisions.
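One way to keep these expensive runs out of the normal CI cycle is a custom pytest marker; a rough sketch, where the helper function, config path, and threshold are all hypothetical:

```python
import pytest

PAPER_REPORTED_MAE = 0.35  # illustrative number, not from any real paper


@pytest.mark.paper_repro  # custom marker (registered in pytest.ini);
# excluded from normal CI and run on demand with: pytest -m paper_repro
def test_ported_model_matches_paper():
    """Train with the paper's exact hyper-parameters and compare metrics."""
    # run_paper_experiment is a hypothetical helper that runs the full experiment
    mae = run_paper_experiment("configs/paper_exact.json")
    # Allow a small tolerance, since training is not fully deterministic
    assert mae <= PAPER_REPORTED_MAE * 1.05
```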

4. Good management of dependencies and backwards compatibility is essential

Dependencies are a constant pain to deal with when maintaining an open source project. My number one recommendation is to pin the versions of your dependencies in your requirements.txt file. Otherwise, whenever a dependency is updated it may break your code, and you’ll receive a string of messages from users wondering why their code suddenly doesn’t work. Dependabot is good at opening PRs that automatically bump dependencies, and you can then easily see whether they pass CI. See my other article for more info.
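A pinned requirements.txt simply uses exact `==` versions rather than open-ended ranges (the packages and version numbers below are illustrative only):

```
torch==1.9.0
numpy==1.21.2
pandas==1.3.3
wandb==0.12.2
```

With pins like these, Dependabot's bump PRs become your controlled upgrade path: each one runs through CI before you merge it.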

Another thing to keep track of is the versioning of your framework on PyPI and your release cycles. There is an easy-to-use GitHub workflow that automates the push to PyPI upon release. I generally try to make a release at the beginning of every month (or, if I’m running behind, every other month).
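A minimal sketch of such a workflow, assuming the official pypa/gh-action-pypi-publish action and an API token stored in your repository secrets:

```yaml
name: Publish to PyPI

on:
  release:
    types: [published]

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - name: Build distributions
        run: |
          pip install setuptools wheel
          python setup.py sdist bdist_wheel
      - name: Upload to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          password: ${{ secrets.PYPI_API_TOKEN }}
```

Triggering on the published release event means cutting a release in the GitHub UI is the only manual step.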

5. Choose a system to prioritize issues/new features

Time is incredibly scarce for many open source maintainers, yet the number of bugs and requested features often grows exponentially. Therefore it is important to track progress and manage open issues with a clear prioritization system. At FF we have experimented with various project management tools (e.g. JIRA), but found in the end that simple GH issues and project boards worked best.

Another difficulty open source maintainers face is delegating work. Oftentimes developers will volunteer to implement a feature with the best of intentions but become bogged down with other work and are unable to complete the issue. You should check in with developers regularly, particularly on important issues. If an issue is critical, you should, as politely as possible, reassign it to someone else to get it done.

I hope you found this guide useful! Feel free to leave any questions or comments. Please also check out my other articles!

