The Golden Age of Open Source in AI Is Coming to an End

NC, SA, GPL, and other acronyms you don’t want to see in the open source license of the model you are using

Clemens Mewald
Towards Data Science


Image by author (modified from source)

A (biased) history of open sourcing AI libraries and models

I joined the Google Brain team in 2015 right as TensorFlow was open sourced. Contrary to popular belief, TensorFlow was not the secret sauce behind Google’s success at that point in time. Only a handful of researchers had used it, and it took several years before it transformed Alphabet’s most important properties in a material way.

However, TensorFlow’s impact on the open source community was almost immediate. It ushered in an era of community-driven innovation that has directly contributed to the breakneck pace of AI advancements in the last couple of years. To be fair, TensorFlow was not the first open source deep learning library (e.g. Caffe was released in 2014), but it was the first that was backed by the level of credibility (and developer advocacy budget) of a company like Google.

But TensorFlow is just a library. Critically, you still need to provide your own data to actually train predictive models. To predict future housing prices, for example, you need a dataset of historical housing prices and use TensorFlow to train a model on it. The model that comes out the other end now encodes the aggregate knowledge of your data. A few years after open sourcing TensorFlow, Google took another fateful step that accelerated the path towards “free-for-all” AI. The decision to open source the BERT model in 2018 helped trigger an avalanche of large language models. Shortly thereafter, in 2019, OpenAI (still a non-profit at that point in time) open sourced their GPT-2 model. And just like that, open sourcing trained models became a thing.

The escalation from open sourcing a library like TensorFlow to a fully-trained model like BERT shouldn’t be underestimated. TensorFlow is simply a set of instructions, BERT is the result of a costly training process of applying those instructions to a large amount of data. To use a powerful analogy: If TensorFlow was a biology textbook on human reproduction, BERT would be a college graduate. Someone read the textbook, applied its instructions, and spent a lot of time and money to raise an offspring who is now a fully educated adult ready to enter the workforce (or grad school).

“If TensorFlow was a biology textbook on human reproduction, BERT would be a college graduate.”

Open Sourcing Decisions

How did we end up there? I attribute many of the open sourcing decisions during that period to the growing prominence and relative bargaining power of research scientists. Researchers prefer jobs that allow them to publish at prestigious venues (like NeurIPS), and those publications are more relevant (and credible) if they come with open sourced code. That’s one of the reasons why more secretive companies like Apple find it hard to attract talent. Researchers who are allowed to talk publicly about their work are widely recognized in the industry, and their market value becomes directly correlated with their publications at top-tier venues. Since rockstar AI researchers are scarce, they command annual compensation packages north of $1M.

Not a lot of economic scrutiny was applied to these open sourcing decisions, let alone the more important aspect of licensing (more on that later). What was the value of the IP Google gave up with BERT? How would Google’s competitors use these libraries and models against them? Thankfully, open source infrastructure projects and their corporate maintainers provide an informative case study on companies wising up to the implications of open source licensing decisions.

A Sea Change in Open Source

Around the time of TensorFlow’s rise, and foreshadowing what was yet to come in open source AI, enterprise software went through an open source licensing crisis. Largely thanks to AWS, which had mastered the craft of taking open source infrastructure projects and building commercial services around them, many open source projects exchanged their permissive licenses for “Copyleft” or “ShareAlike” (SA) alternatives.

Not all open source is created equal. Permissive licenses (like Apache 2.0 or MIT) allow anyone to take an open source project and build a commercial service around it. “Copyleft” licenses (like the GPL), similar to Creative Commons’ “ShareAlike” terms, are one way to protect against this. They are sometimes referred to as a “poison pill” because they require any derivative work to be licensed the same way. If AWS launched a service based on an open source project under a strong copyleft license (such as the AGPL, which extends copyleft obligations to software offered as a network service), AWS would have to open source the service itself under the same license.
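The license categories above can be sketched as a toy triage function. The bucket assignments below are illustrative assumptions for a few common identifiers, not legal advice — always read the actual license text.

```python
# Toy license triage: buckets a few common SPDX-style identifiers into the
# categories discussed above. The assignments are illustrative assumptions.
PERMISSIVE = {"apache-2.0", "mit", "bsd-3-clause"}
COPYLEFT = {"gpl-3.0", "agpl-3.0", "cc-by-sa-4.0"}
NON_COMMERCIAL = {"cc-by-nc-4.0", "cc-by-nc-sa-4.0"}

def triage(license_id: str) -> str:
    """Return a coarse category for a license identifier."""
    lid = license_id.lower()
    if lid in PERMISSIVE:
        return "permissive"      # free to build commercial services on
    if lid in COPYLEFT:
        return "copyleft"        # derivatives must carry the same license
    if lid in NON_COMMERCIAL:
        return "non-commercial"  # commercial use is off the table
    return "unknown"             # read the actual license text

print(triage("Apache-2.0"))    # -> permissive
print(triage("GPL-3.0"))       # -> copyleft
print(triage("CC-BY-NC-4.0"))  # -> non-commercial
```

Anything that doesn’t match a known bucket falls through to “unknown”, which is the honest answer for the custom licenses that have become common in AI.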

So, partially in response to competing cloud services, the corporate creators and maintainers of open source projects like MongoDB and Redis switched their licenses to less permissive alternatives. This led to a painful but entertaining back-and-forth between AWS and those companies on the principles and merits of open source, which has since calmed down a bit.

Note that this change in licensing makes the open source ecosystem deceptively healthy-looking: there are still a lot of new open source projects being announced, but the licensing implications of what can and cannot be done with those projects are more complicated than most people realize.

Turning Tides in Open Source AI

At this point you should be asking yourself: If the corporate maintainers of open source infrastructure projects realized that others were reaping more of the commercial benefits than themselves, shouldn’t the same be happening with AI? Isn’t this an even bigger deal for open source AI models, which hold the aggregate value of compute and data that went into creating them? The answers are: Yes and yes.

Although there seems to be a Robin Hood-esque movement around open source AI, the data points in a different direction. Large corporations like Microsoft are changing the licensing of some of their most popular models from permissive to non-commercial (NC) licenses, and Meta has started to use non-commercial licenses for all of their recent open source projects (MMS, ImageBind, and DINOv2 are all CC-BY-NC 4.0, and LLaMA’s weights are restricted to non-commercial research use). Even popular projects from universities like Stanford’s Alpaca are only licensed for non-commercial use (a restriction inherited from the non-permissive terms of the dataset they used). Entire companies change their business models in order to protect their IP and rid themselves of the obligation to open source as part of their mission — remember when a small non-profit called OpenAI transformed itself into a capped-profit? Notice that GPT-2 was open sourced, but GPT-3.5 and GPT-4 were not?

More generally speaking, the trend towards less permissive licenses in AI, though easy to miss, is noticeable. Below is an analysis of model licenses on Hugging Face. The share of permissive licenses (like Apache, MIT, or BSD) has been in persistent decline since mid-2022, while non-permissive licenses (like the GPL) and restrictive licenses (like OpenRAIL) are becoming more common.

Source: Analysis by author
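The kind of tally behind an analysis like the one above can be sketched as follows. On the real Hugging Face Hub the tag lists would come from the API (e.g. `HfApi().list_models()` in the `huggingface_hub` library, where licenses appear as `license:*` tags); the sample data here is hard-coded and hypothetical to keep the sketch self-contained.

```python
from collections import Counter

def tally_licenses(models: list[list[str]]) -> Counter:
    """Count license:* tags across a list of models' tag lists.
    Models without an explicit license tag are counted as 'undeclared'."""
    counts = Counter()
    for tags in models:
        licenses = [t.split(":", 1)[1] for t in tags if t.startswith("license:")]
        if licenses:
            counts.update(licenses)
        else:
            counts["undeclared"] += 1
    return counts

# Hard-coded sample tag lists standing in for real Hub metadata.
sample = [
    ["pytorch", "license:apache-2.0"],
    ["license:cc-by-nc-4.0"],
    ["text-generation"],  # no declared license
    ["license:gpl-3.0"],
]
print(tally_licenses(sample))
```

The “undeclared” bucket matters: as noted below, many derivative models never state a license at all, which is not the same as being freely usable.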

To make things worse, the recent frenzy around large language models (LLMs) has further muddied the waters. Hugging Face maintains an “Open LLM Leaderboard” which aims to highlight “the genuine progress that is being made by the open-source community”. To be fair, all of the models on the board are indeed open source. However, a closer look reveals that almost none are licensed for commercial use*.

Source: Analysis by author

*Between the writing of this post and its publication, the license for the Falcon models changed to the permissive Apache 2.0 license. The overall observation still holds.

If anything, the Open LLM Leaderboard highlights that innovation from big tech (LLaMA was open sourced by Meta with a non-commercial license) dominates all other open source efforts. The bigger problem is that these derivative models are not as forthcoming about their licenses. Almost none declare their license explicitly, and you have to do your own research to find out that the models and data they are based on don’t allow for commercial use.

The Future of Open Source AI

There is a lot of virtue-signaling in the community, mostly by well-meaning entrepreneurs and VCs who hope that there is a future that is not dominated by OpenAI, Google, and a handful of others. It is not obvious why AI models should be open sourced — they represent hard-earned intellectual property that companies develop over years, spending billions on compute, data acquisition, and talent. Companies would be defrauding their shareholders if they just gave everything away for free.

“If I could invest in an ETF for IP lawyers I would.”

The trend towards non-permissive licenses in open source AI seems clear. Yet the overwhelming volume of news fails to point out that the cumulative benefit of this work accrues almost entirely to academics and hobbyists. Investors and executives alike should be more aware of the implications and exercise more care. I have a strong feeling that most startups in the emerging LLM cottage industry are building on top of non-commercially licensed technology. If I could invest in an ETF for IP lawyers I would.

My prediction is that the value capture for AI (specifically for the latest generation of large generative models) will look similar to other innovations that require significant capital investment and accumulation of specialized talent, like cloud computing platforms or operating systems. A few major players will emerge that provide the AI foundation to the rest of the ecosystem. There will still be ample room for a layer of startups on top of that foundation, but just as there are no open source projects dethroning AWS, I consider it very unlikely that the open source community will produce a serious competitor to OpenAI’s GPT and whatever comes next.

Opinions expressed in this post are my own and not the views of my employer.

Clemens Mewald leads the product team at Instabase. Previously he worked on open source AI projects like MLflow at Databricks and TensorFlow at Google.


Clemens is an entrepreneurial product leader who spent the last 8+ years bringing AI to developers and enterprises.