
GitHub and Pre-Trained Models: A Keyhole View

GitHub hosts pre-trained models that may carry inherent biases, and the absence of model scorecards will contribute to amplifying the resulting harms.

Setting the context

GitHub is a prominent internet hosting platform for software code and version control. It enables its 56 million users (organizations and individuals) to create repositories of their work for ease of access, version control, and sharing with or without a license (such as Apache License 2.0). For researchers, GitHub is a directory they can point others to for their work. And for large organizations that open source their work or solutions, GitHub makes it accessible to the community.

There are over 190 million public repositories on GitHub. These repositories contain source code, documentation, APIs, datasets, and other metadata relating to the source code. GitHub allows users to fork (create a copy of a repository), commit (record changes), and open a pull request (notify the original repository of proposed changes). This enables reuse of the existing content on the platform. In the case of machine learning, there are over 331,000 repositories that contain ML models, their source data, outcomes, metrics, and the analysis associated with them.

Transfer learning and inherent challenges

GitHub remains the largest platform for enabling transfer learning and for leveraging existing models or source code. Transfer learning is the process of applying the knowledge gained from solving one problem to another similar or different problem. It helps teams build on existing content effectively through collaboration and open sourcing. While such reuse amplifies the possibilities, it also amplifies the errors and biases in the source code, which can harm people and organizations.
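To make the mechanism concrete, here is a minimal transfer-learning sketch using PyTorch and torchvision; the model choice, class count, and learning rate are illustrative assumptions, not details from any particular repository. A publicly shared pretrained network is reused and only its final layer is retrained, so whatever the original weights encode, including any latent bias, is carried into the new application.

```python
# Minimal transfer-learning sketch (illustrative): reuse an ImageNet-pretrained
# ResNet from torchvision and retrain only the classification head for a
# hypothetical downstream task with `num_classes` labels.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 5  # hypothetical downstream task

model = models.resnet18(pretrained=True)  # weights learned on ImageNet

# Freeze the pretrained layers: their learned features (and any latent bias
# baked into them) are carried over unchanged to the new task.
for param in model.parameters():
    param.requires_grad = False

# Replace only the final layer and train that on the new data.
model.fc = nn.Linear(model.fc.in_features, num_classes)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```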

For instance, GitGuardian, in its report (2021 State of Secrets Sprawl on GitHub), notes that API keys, credentials, and other sensitive information sit in public GitHub repositories. This leaves the repositories vulnerable to attacks (not just cyber attacks, but also data poisoning attacks), thereby creating harm for organizations and people (here).
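As a rough illustration of why this sprawl matters, a few lines of Python can surface common credential patterns in a working copy before it is pushed. The patterns below (an AWS-style access key ID and a generic "api_key = ..." assignment) are simplified examples for the sketch, not GitGuardian's actual detection rules.

```python
# Illustrative pre-commit style scan for common secret patterns.
# The regexes are simplified examples, not a complete or official rule set.
import re
from pathlib import Path

PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(
        r"api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9]{16,}['\"]", re.I
    ),
}

def scan(root: str = ".") -> None:
    """Walk the working copy and flag files that look like they embed secrets."""
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                print(f"Possible {name} in {path} -- review before committing")

if __name__ == "__main__":
    scan()
```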

Similarly, bias can take the form of pre-existing bias (latent or otherwise), technical bias (feature choices and trade-off decisions), and emergent bias (a mismatch in values or knowledge between the design and the ground truth) in Natural Language Processing models on public repositories (for example, word embeddings). These models are trained on public datasets, which can also contribute to the harm. Research studies have established that latent bias in models or in public datasets has led to gender discrimination (here). It is pertinent to note that optimization choices and trade-off decisions made by data scientists can also contribute to bias in a model. A forked or adapted model could likewise contribute to, or sometimes amplify, the extent of harm, depending on the context and the way the model is deployed.
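The word-embedding case is easy to probe directly. The sketch below, which assumes gensim and one of its downloadable GloVe models, runs the familiar occupation-analogy check; the specific model name and word pairs are illustrative, and the exact neighbours returned will depend on the embedding used.

```python
# Probing latent gender bias in a publicly shared word embedding (illustrative).
# Requires: pip install gensim
import gensim.downloader as api

# A small, publicly available GloVe model (the choice of model is illustrative).
vectors = api.load("glove-wiki-gigaword-50")

# Direct similarity check: does "doctor" sit closer to "man" or to "woman"?
print(vectors.similarity("doctor", "man"), vectors.similarity("doctor", "woman"))

# The classic analogy probe: "man is to doctor as woman is to ...?"
print(vectors.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3))

# Any model fine-tuned on top of these vectors inherits whatever
# associations show up here.
```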

Underlying root cause

The underlying problem is that GitHub, as a repository, contains source code and models that are not validated before being brought onto the platform. Further, GitHub's approach is to let the community review, comment on, use, or share bugs and challenges (called 'issues') with each other on the respective repository. Such an approach may not be adequate where awareness of the underlying problems is not uniform across the community (specifically in areas of emerging research). In addition, GitHub does not mandate signing a code of conduct (it is recommendatory only) as part of user onboarding, the declaration process, or repository creation.

GitHub's community guidelines provide an opportunity to report abuse and report content while moderating a repository, through a form submission process. The current options for such submissions include "I want to report harmful code, such as malware, phishing, or cryptocurrency abuse". However, given the harms machine learning models can cause, there is a need for more options that refer specifically to bias and discrimination.

This is not an isolated challenge. Similar problems of bias and discrimination exist in image recognition models developed using the ImageNet dataset (here, here and here). If transfer learning is the way to adopt emerging technology as it becomes democratized, there is a need to understand that, in its current form, this democratization is not free of harms.

The Need: Scorecards for pre-trained models

There is a need to establish a scorecard for pre-trained models or code on GitHub. The scorecard would collate responses on whether the model or its underlying training data was tested for bias, cyber security issues, or adversarial attacks, among others. It would also gather information on the limitations of the model, the contexts in which it can and cannot be used, and the do's and don'ts associated with its use, before the repository is made public. While this may appear to place a lot of expectation on the GitHub user, it is relevant and is intended to contribute to the wellbeing of the community and people at large. Users should be required to update this information whenever they update the repository with code, and there should be notifications prompting users to review the repository against the scorecard and vice versa.
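As a sketch of what such a scorecard could capture, the structure below is a hypothetical schema; the field names and the publication gate are illustrative assumptions, not an existing GitHub feature.

```python
# Hypothetical scorecard schema for a pre-trained model repository (Python 3.9+).
# Field names and the publication check are illustrative, not a GitHub feature.
from dataclasses import dataclass, field

@dataclass
class ModelScorecard:
    model_name: str
    intended_use: str
    out_of_scope_use: str                       # contexts the model must not be used in
    training_data_sources: list[str] = field(default_factory=list)
    bias_tested: bool = False                   # e.g., demographic checks, embedding probes
    adversarial_robustness_tested: bool = False
    security_reviewed: bool = False             # secrets, dependencies, data poisoning surface
    known_limitations: list[str] = field(default_factory=list)

    def is_publishable(self) -> bool:
        """A minimal gate a platform could apply before a repository goes public."""
        return self.bias_tested and self.security_reviewed and bool(self.intended_use)
```

In this sketch, is_publishable() stands in for the kind of check the platform could run, or at least surface as a flag, when a repository containing a pre-trained model is made public.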

Repositories like GPT-3 carry a specific model card that covers expectations on model use, limitations of the model, and the possibility of underlying bias (here). This is one of the better practices, as similar references are not found in many other public repositories on GitHub. However, even such disclosures are not specific enough to convey what efforts the team made to validate the model against bias or adversarial attacks.

Conclusion

This is vitally important given the announcement of GitHub's 'Copilot', an AI tool trained on billions of lines of code on the platform. The tool is expected to generate complementary code for users' projects. Besides the question of the legality of such a tool (here), a pertinent question is: "Would you be comfortable using code or a model that has latent bias?" Scorecards are not the end, but a means of bringing a contextual view while democratizing transfer learning in machine learning.

