Should I Open-Source My Model?
Best practices for deciding whether to open-source a Machine Learning model
I have worked for a long time on the tension between open-sourcing Machine Learning and the sensitivity of data and models, especially in disaster response contexts: when is it right or wrong to release data or a model publicly? This article is a list of frequently asked questions, the answers that represent best practice today, and some examples of where I have encountered them.
OpenAI created a storm this week when they released new research showing state-of-the-art results for many AI tasks, but went against general practice in the research community when they decided not to release their Machine Learning model:
“Due to our concerns about malicious applications of the technology, we are not releasing the trained model.”
The criticism of OpenAI’s decision included how it limits the research community’s ability to replicate the results, and how the decision itself feeds the media’s currently hyperbolic fear of AI.
It was this tweet that first caught my eye. Anima Anandkumar has a lot of experience bridging the gap between research and practical applications of Machine Learning. We were colleagues at AWS and recently spoke together on the problem of taking Machine Learning from PhD-to-Product (https://vimeo.com/272703924).
Stephen Merity summed up the social media response well when he lamented that the Machine Learning community has little experience in this area:
My experience in both Machine Learning and Disaster Response is rare. I worked in post-conflict development in Sierra Leone and Liberia before I moved to the US to complete a PhD in Natural Language Processing at Stanford, and I have worked in both industry and disaster response since then. It’s fair to say that most people I’ve worked with have never had to consider the sensitivity of data and models before, so this article hopes to bridge that gap.
This article is intended as a stand-alone resource for anyone asking these questions. If you want to see the article that sparked it, here’s OpenAI’s post, which outlines their concerns, such as the ability to generate fake-news-like content and to impersonate people online.
And here’s the tweet that caused so much discussion; you can read the replies that are still coming in at the time I’m writing this article.
Was OpenAI right or wrong to hold back their model, and instead only give it to journalists in advance? I’ll leave that for you to decide. For me (as I argue below), OpenAI failed in two areas that would have mitigated the problems: investigating whether the fake content could be detected, and releasing models in many languages to combat the bias towards English.
Here are some of the questions that you can use when evaluating your own decisions about whether or not to release a model or data set:
Should I question whether or not to open source my model?
Yes. If your model is built on private data, it could be reverse engineered to extract that private data.
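If you want a quick sense of how exposed your training data might be, one simple signal is whether the model behaves very differently on data it was trained on versus data it has never seen. Here is a minimal sketch of that kind of check (the dataset and classifier are placeholders I chose for illustration; this is not a full membership-inference audit):

```python
# Minimal sketch of a leakage check, not a full membership-inference audit:
# if a model is much more confident on its own training examples than on
# unseen examples, it has memorized details of the training data, and
# releasing it carries more privacy risk.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X_train, X_unseen, y_train, _ = train_test_split(
    data.data, data.target, test_size=0.5, random_state=0
)

vectorizer = TfidfVectorizer(max_features=20000)
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(X_train), y_train)

def mean_confidence(texts):
    """Average probability the model assigns to its own top prediction."""
    probs = model.predict_proba(vectorizer.transform(texts))
    return probs.max(axis=1).mean()

gap = mean_confidence(X_train) - mean_confidence(X_unseen)
print(f"Confidence gap, training vs unseen: {gap:.3f}")
# A large gap is a warning sign that the model exposes its training data.
```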
My model is 100% from public data, so do I need to question whether or not to open source my model?
Yes. Published data can become sensitive if you republish it in new contexts, and aggregated data (including Machine Learning models) can become more sensitive than its individual data points.
During the Arab Spring, I saw a lot of people tweeting about their local conditions: road closures, refugees, etc. While they were “public” tweets, they were clearly written with only a handful of followers in mind, and the authors didn’t realize that reporting road closures would also help paint a picture of troop movements. As an example of what not to do, some of these tweets were copied to UN-controlled websites and re-published, with no mechanism for the original authors to remove them from the UN sites. Many actors within the Middle East and North Africa saw the UN as a negative foreign influence (or an invader), so the people tweeting were seen as collaborators, regardless of the fact that they had only intended to share information with a small number of followers.
So, you need to ask yourself: what is the effect of recontextualizing the data or model so that it is now published by myself or my organization?
It is also very common for aggregated data to be treated as sensitive even when the individual data points are not. This is standard practice for a lot of military organizations: when they aggregate data from one set of sources, they reevaluate that aggregated information for its level of sensitivity. The aggregations are typically the results of statistics or unsupervised machine learning, but the same concern applies equally to a supervised model built on that data. For a public example of how aggregation changed the sensitivity of data in the military, see the recent case where military personnel using Strava accidentally gave away the locations of their bases, because the app’s heat-maps showed where they ran the most.
Many organizations have chosen to adopt similar policies. Medium is one of them: in writing this article, I can’t “dox [expose the identity] someone including by exposing personal information or aggregating of public information”, per Medium’s curation guidelines. You should do the same.
So, you should always ask yourself: is the aggregation of data in my model more sensitive than the individual data points?
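To make the aggregation point concrete, here is a minimal sketch, with made-up coordinates, of how individually unremarkable location points become revealing once you count them on a grid, loosely analogous to the Strava heat-maps:

```python
# Minimal sketch of how aggregation changes sensitivity: each GPS point below
# is individually unremarkable, but counting points per rounded grid cell
# (a crude heat-map) immediately reveals the one location people return to
# again and again. All coordinates are made up for illustration.
from collections import Counter
import random

random.seed(0)

# Many runs looping around one hypothetical compound.
base = (8.4657, -13.2317)
points = [
    (base[0] + random.uniform(-0.002, 0.002),
     base[1] + random.uniform(-0.002, 0.002))
    for _ in range(500)
]
# Plus some scattered one-off points elsewhere.
points += [(random.uniform(7, 9), random.uniform(-14, -12)) for _ in range(50)]

# Aggregate: round to a coarse grid and count.
heatmap = Counter((round(lat, 2), round(lon, 2)) for lat, lon in points)
print(heatmap.most_common(3))
# The top cell stands out by an order of magnitude: the aggregate view is
# far more revealing than any single point.
```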
How do I evaluate risk?
Use the same model as for security: weigh the cost of misuse against the value it provides to a bad actor.
In security, you treat every strategy as ‘breakable’. The goal is to make the cost of breaking some security measure higher than the value of the data that you are protecting. Therefore, it is not worth the bad actor’s time to break it.
Is the cost to reproduce the models from your research papers worth the effort for someone who wants to use them for negative reasons? You should be explicit about this. This is then one factor that will go into your decision about whether to open source or not.
In OpenAI’s case, they might have published the risk profile of their decision along these lines: we think that not open-sourcing the model is enough of a barrier to stop most lone trolls on the internet from recreating it, but we acknowledge that it is not enough to stop a large group of (possibly state-sponsored) scientists from recreating the model using our research paper as a guide.
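To make that kind of risk statement concrete, here is a minimal sketch of the cost-versus-value reasoning; every number in it is made up, and the point is only the shape of the comparison:

```python
# Minimal sketch of the cost-versus-value framing above, with entirely
# made-up numbers: compare the cost of recreating your work against the
# resources of each class of bad actor you are worried about.
REPRODUCTION_COST_USD = 50_000  # hypothetical estimate: compute + engineering time

attacker_budgets_usd = {
    "lone internet troll": 1_000,
    "organized spam operation": 100_000,
    "state-sponsored group": 10_000_000,
}

for actor, budget in attacker_budgets_usd.items():
    deterred = budget < REPRODUCTION_COST_USD
    print(f"{actor}: {'likely deterred' if deterred else 'NOT deterred'} "
          f"by withholding the model")
# Withholding a model only raises the bar to the level of its reproduction
# cost; be explicit about which actors that bar actually stops.
```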
Should I trust journalists to make decisions about risk?
No. Or at least, not without question. Journalists have to sell content, and more sensational content tends to sell better. Consciously or unconsciously, a journalist might lean towards open-sourcing, as it is then easier for them to write about it. On the other hand, a decision not to open-source could lead to a sensationalist report about the dangers that went into that decision (as in the OpenAI case).
I’ve seen many cases of preventable deaths caused by bad journalism in disasters. In the 2014 Ebola outbreak in West Africa, I predicted:
For every person who contracts Ebola, ten people will die nearby from other preventable conditions.
I was an advisor to most major governments and aid organizations responding to the outbreak, because I had previously lived in the region and had separately worked in epidemic tracking for Ebola in East Africa. I warned almost every news organization, both local and international, about the hype around reporting and the deaths it could lead to.
A small group of press listened, but most did not. When the deputy head of Sierra Leone’s Health Ministry spoke at a conference in San Francisco after the end of the epidemic, she reported the same sad figures: they estimated that for every person who died directly from Ebola, ten people died from other diseases because they were avoiding clinics.
So, be very careful when talking to journalists about your research.
How do I ensure that journalists are responsibly talking about my Machine Learning research?
Get media training! Many organizations run media training and even just a few hours will help. I can’t summarize everything you need to know in this article, but here’s the most important thing I’ve found that generally works:
Ask the journalist what their story is about.
If they are writing about advances in Machine Learning research, then you are probably ok. If they are writing about “Dangers in AI”, or “Fake News”, or “Interference in Elections”, then you should be increasingly careful about how your interview might be skewed to fit their narrative.
Asking what a story is about has worked for me with only one exception that I can remember: a BBC reporter was writing an article about how English was dominating the internet. I gave an interview saying that, no, English’s share of the internet has been steadily declining: people prefer their primary language, and English is becoming a “second language of the internet”. But they reported me as saying that “English is becoming the language of the internet”. If this happens, there’s not much you can do; the BBC had broader reach than my tweet denying that I said this. You can ask the media organization to amend the article, or at least make a public statement that you were misquoted.
Should I trust governments to make decisions about risk?
No. Obviously, you shouldn’t break the law. But just because it is legal, that doesn’t mean it is ok. Governments are a group of people like any other, trying to get their heads around the real and not-so-real threats of Machine Learning.
Governments are also prone to media influence. The Liberian government shut its borders in response to the Ebola outbreak. Having worked on those borders, I know this was a complete farce: the borders are a series of connected villages, rivers, creeks, and forest trails that existed long before today’s official boundaries. “Closing” the borders and sharing data about the infections along them made the government seem decisive, but it would have protected almost no-one from Ebola, while driving people away from clinics for other treatable illnesses and making the situation much worse.
As with journalists, treat governments as important partners, but realize that you have different agendas: many of them will be aligned, but not all.
Should I investigate solutions to the negative use cases from my model?
Yes! This was one of the failings of OpenAI. If they believe that the output from their model could be used to create fake news, then this is testable:
Can you create a text-classification task that is able to distinguish between human-written content and the output of OpenAI’s models?
This is an experiment that OpenAI could run in a couple of days, and it would give us better insight into how much of a problem this really is.
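Here is a minimal sketch of what that experiment could look like, assuming you have one file of human-written text and one file of model-generated text; the file names and the simple bag-of-words detector are my own assumptions, not OpenAI’s setup:

```python
# Minimal sketch of the detectability experiment described above: train a
# binary classifier to separate human-written text from model-generated text
# and report how well it does on held-out examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def load_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

human_texts = load_lines("human_written.txt")         # hypothetical file
generated_texts = load_lines("model_generated.txt")   # hypothetical file

texts = human_texts + generated_texts
labels = [0] * len(human_texts) + [1] * len(generated_texts)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50000),
    LogisticRegression(max_iter=1000),
)
detector.fit(X_train, y_train)
print(classification_report(y_test, detector.predict(X_test)))
# High accuracy suggests the generated text is detectable (and therefore
# easier to combat); accuracy near chance would be more worrying.
```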
I was in long talks with Facebook recently about joining them in a role that would have been responsible for detecting fake news. Looking at this problem from the point of view of someone who is fighting it, this is the first thing I would want to know: can I detect this kind of model output programmatically, in order to combat it? I ultimately took my current position for reasons unrelated to the role itself; I still think that fighting fake news on Facebook is one of the most important things that anyone could do right now, and this kind of additional study from OpenAI would help. Even better, if you can create a pool of models that can identify generated content, it becomes harder to create generated content that defeats every model and gets past an automatic detection system.
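Extending the sketch above, a pool of different detectors can vote on each document, so generated text has to evade all of them at once. The specific models here are again my own choices, reusing the hypothetical texts and labels from the previous sketch:

```python
# Sketch of the "pool of detectors" idea: several different detectors vote,
# so generated text has to get past all of them. Reuses X_train/X_test and
# y_train/y_test from the previous sketch.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pool = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                                 LogisticRegression(max_iter=1000))),
        ("nb", make_pipeline(TfidfVectorizer(analyzer="char_wb",
                                             ngram_range=(3, 5)),
                             MultinomialNB())),
        ("svm", make_pipeline(TfidfVectorizer(), LinearSVC())),
    ],
    voting="hard",  # majority vote across the pool
)
pool.fit(X_train, y_train)
print(pool.score(X_test, y_test))
```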
If you can demonstrate, quantitatively, that a negative use case for the data is easier or harder to combat, then that will be one factor in your decision-making process.
Is this a new problem in Machine Learning?
No, and you can learn a lot from past experience.
In 2014–2015, I was approached by the Saudi Arabian government on three separate occasions to help them monitor social media for dissidents. At the time I was CEO of Idibon, a ~40-person AI company with the most accurate Natural Language Processing technology in a large number of languages, so we were naturally seen as the technology that could be the best fit for their use case. We were first approached directly by a Saudi Arabian ministry, and then indirectly, once through a boutique consulting company and once through one of the five biggest consulting companies in the world. In every case, the stated goal was to help the people complaining about the government. After careful consultation with experts on Saudi Arabia and Machine Learning, we concluded that a system that identified complaints would be used to identify dissidents. Because Saudi Arabia persecutes dissidents without trial, often violently, we declined to help.
If you are facing a similar dilemma, look for people who have the depth of knowledge to talk about the community who would be most affected (ideally people from within that community) and people who have faced similar Machine Learning problems in the past.
Is fake news a new problem?
No. Propaganda is probably as old as language itself.
In 2007, I was escorting journalists reporting on the elections in Sierra Leone when we kept hearing reports of violence. We would follow those reports only to find no actual violence. It turned out that a pirate radio station was broadcasting fake news, some of which was picked up by legitimate radio stations. The intention was to portray the supporters of one or more political parties as violent, and possibly to scare people away from voting altogether.
In the most recent elections in Sierra Leone, I saw messages going around social media with similar types of fake news about violence and election tampering. The people responsible for fighting fake news at large social media companies have all quietly admitted to me that they can’t identify fake news in the majority of the languages spoken in Sierra Leone, or in many other countries.
So, propaganda has been with us for a long time, and it has used every available technology to scale the distribution of its message. The biggest gap is in ways to fight that propaganda, and in most cases this means better AI outside of English.
Should I focus on balancing the bad use cases for Machine Learning with ones that are more clearly good?
Yes. It is easy to have a positive impact on the world by releasing models that have mostly positive application areas. It is difficult to have a positive impact on the world by limiting the release of a model with many negative application areas.
This is OpenAI’s other failing: their lack of diversity. More than any other research group, OpenAI publishes models and research that only apply to English and (rarely) a handful of other high-privilege languages. English makes up only about 5% of the world’s conversations each day. English is also an outlier in how strict its word order is, in how standardized its spellings are, and in how useful ‘words’ are as atomic units for Machine Learning features. OpenAI’s research relies on all three of these: word order, words as features, and consistent spellings. Would it even work for the majority of the world’s languages? We don’t know, because they didn’t test it. OpenAI’s research tells me that we need to worry about this kind of content generation for English, but it tells me nothing about the risk in the 100s of other languages where fake news circulates today.
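As a tiny illustration of the ‘words as features’ assumption, naive whitespace tokenization gives you usable features for English but returns one unsplit blob for a language written without spaces between words (the example sentences below are my own):

```python
# Tiny illustration of why "words as features" is an English-centric
# assumption: whitespace tokenization works reasonably for English but
# returns a single unsplit "token" for a language written without spaces
# between words, such as Chinese. Example sentences are my own.
english = "the borders were closed during the outbreak"
chinese = "疫情期间边境被关闭了"  # roughly the same meaning, no word delimiters

print(english.split())  # ['the', 'borders', 'were', ...] -- usable features
print(chinese.split())  # ['疫情期间边境被关闭了'] -- one blob, no word features
```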
To be frank, OpenAI’s diversity problems run deep. When I was among dozens of people who noted that an AI Conference featured 30+ speakers who were all men, and that OpenAI’s Chief Scientist was the first featured speaker, OpenAI ignored the complaints.
Despite several public and private messages from different people, I am not aware of any action that was taken by OpenAI to address this problem in diversity representation.
I personally decline all speaking invitations where I believe that the conference lineup is perpetuating a bias in the Machine Learning community, and I know that many people do the same. It is likely that OpenAI’s more relaxed attitude to diversity in general is leading to research that isn’t diverse. In practice, I generally don’t trust English-only results to hold for the other 95% of the world’s language use. There is a lot of good fundamental research at OpenAI, like how to make any model more lightweight and therefore usable in more contexts, but their English-language focus is limiting the positive use cases.
If you don’t want to step into the grey area of applications like fake news, then pick a research area that is inherently more impactful, like language models for health-related text in low resource languages.
How deeply do I need to consider sensitivity of the use case?
Down to the individual field level. When I was running product for AWS’s Named Entity Resolution service, we had to consider whether we wanted to identify street-level addresses as an explicit field, and potentially map coordinates to that address. We decided that this was inherently sensitive information and shouldn’t be productized in a general solution.
Consider this in any research project: are you identifying sensitive information in your models, implicitly or explicitly?
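Here is a minimal sketch of what that field-level review can look like in practice, with a hypothetical output format, made-up example entities, and an explicit list of field types you have decided are sensitive:

```python
# Minimal sketch of a field-level sensitivity review: given entity
# predictions from some extraction model (the format, types, and example
# entities are hypothetical), flag or drop the fields you have decided are
# inherently sensitive before anything is released or productized.
SENSITIVE_ENTITY_TYPES = {"STREET_ADDRESS", "GPS_COORDINATE", "PHONE_NUMBER"}

predictions = [
    {"text": "14 Siaka Stevens Street", "type": "STREET_ADDRESS"},
    {"text": "Freetown", "type": "CITY"},
    {"text": "Ministry of Health", "type": "ORGANIZATION"},
]

released = [e for e in predictions if e["type"] not in SENSITIVE_ENTITY_TYPES]
withheld = [e for e in predictions if e["type"] in SENSITIVE_ENTITY_TYPES]

print("released:", released)
print("withheld for review:", withheld)
# The important part is the explicit, reviewable list of sensitive field
# types, not this particular filtering code.
```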
Should I open source my model, just because everybody else does?
No. You should always question your own impact.
Whether or not you agree with OpenAI’s decision, they were right in making an informed decision rather than blindly following the trend of releasing the full model.
What else should I be worried about?
Probably many things that I haven’t covered here today! I wrote this article as a quick response to OpenAI’s announcement yesterday. I will spend more time sharing best practices if there is demand for it!