Photo by @maur1ts

Sharing is Caring with Algorithms

Open-Source Machine Learning & Sharing Code or Data

Alex Moltzau
Towards Data Science
8 min read · Oct 8, 2019


It is important to consider the openly licensed technologies and content that can be shared. There is a clear benefit if we can spread useful solutions to different parts of the world more rapidly. Although a solution cannot always be transplanted directly from one location to another, it is often possible to take the aspects relevant to the local context and adapt accordingly. I therefore decided to explore the currently popular open-source frameworks, as well as a few that may be less well known. I will start with several open-source solutions related to machine learning and then proceed to platforms for sharing code.

“Open-source software (OSS) is a type of computer software in which source code is released under a license in which the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose.”

“Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence.”

“The Apache License is a permissive free software license written by the Apache Software Foundation. It allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software under the terms of the license, without concern for royalties.”

This does not mean that the whole application or platform is open for change. In some cases it is developed collaboratively, although this is not always so. Most large open-source projects are also affiliated with one or several of the big technology companies, and some are owned outright by Google, Microsoft, Amazon and so on. In 2017 Google announced a clear focus on artificial intelligence, arguing it was going to become an ‘AI first’ company. It is therefore not unreasonable to begin exploring solutions relating to artificial intelligence with the open-source platforms available for machine learning techniques.

Open-Source Machine Learning

There are a variety of actors making products for the open-source market. Quite a few seem motivated by the possibility of building software that encourages the use of their cloud platforms, so having the most accessible and widespread solution appears to be a goal for many of them. Still, these frameworks are used extensively, have become industry standards, and are tools that many developers will be familiar with. If you are working in this area, it may therefore be an advantage to familiarise yourself with a few of them.

TensorFlow

TensorFlow is an open-source machine learning (ML) framework that is relatively easy to use. It is known for the ease with which it can be deployed across a variety of platforms, and it is one of the most extensively used frameworks for machine learning, perhaps also the most well-maintained. TensorFlow was developed by the Google Brain team for internal Google use and was released under the Apache License 2.0 on November 9, 2015. As of the 8th of October 2019, this Google-led project has 69,179 commits and 2,223 contributors on GitHub, and 58,095 stars on its models repository.
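As a taste of how little code a working model requires, here is a minimal sketch using TensorFlow's bundled Keras API to fit a straight line to a handful of points. It assumes TensorFlow 2.x is installed, and the toy data is made up purely for illustration:

```python
# Minimal sketch: learn y = 2x - 1 from six points (assumes TensorFlow 2.x).
import tensorflow as tf

xs = tf.constant([[-1.0], [0.0], [1.0], [2.0], [3.0], [4.0]])
ys = tf.constant([[-3.0], [-1.0], [1.0], [3.0], [5.0], [7.0]])

# A single dense layer is enough for a linear relationship.
model = tf.keras.Sequential([tf.keras.layers.Dense(units=1, input_shape=(1,))])
model.compile(optimizer="sgd", loss="mean_squared_error")
model.fit(xs, ys, epochs=200, verbose=0)

print(model.predict(tf.constant([[10.0]])))  # should approach 19.0
```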

Keras

Keras is an open-source software library designed to simplify the creation of deep learning models. It is written in Python and can run on top of other AI technologies such as TensorFlow, Microsoft Cognitive Toolkit, and Theano. François Chollet, the creator of Keras, currently works as an AI researcher at Google and remains at the core of Keras development. The Keras documentation describes it as: “… an API designed for human beings, not machines. Keras follows best practices for reducing cognitive load: it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear and actionable feedback upon user error.” The initial release was on the 27th of March 2015 and the first stable release on the 22nd of August 2019. Keras has 5,331 commits, 821 contributors and 44,571 stars.
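To illustrate the “consistent & simple APIs” the quote refers to, here is a sketch of defining and compiling a small classifier with standalone Keras. It assumes the keras package with a TensorFlow backend, and the layer sizes are arbitrary choices for illustration:

```python
# Minimal sketch of the standalone Keras API (assumes keras with a
# TensorFlow backend; layer sizes are arbitrary).
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(32, activation="relu", input_shape=(784,)),  # hidden layer
    Dense(10, activation="softmax"),                   # 10-class output
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints layer shapes and parameter counts
```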

Scikit-learn

Initially released in 2007, scikit-learn is an open-source library developed for machine learning. This traditional framework is written in Python and features several families of machine learning methods, including classification, regression, clustering, and dimensionality reduction. Scikit-learn began as a Google Summer of Code project by David Cournapeau in 2007. It has 24,582 commits, 1,439 contributors and 37,343 stars.
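Part of scikit-learn's appeal is its uniform estimator interface: every model exposes fit() and predict(). A minimal sketch on one of the library's bundled datasets, assuming scikit-learn is installed:

```python
# Minimal sketch of scikit-learn's estimator API on a bundled dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)          # every estimator exposes fit()
print(accuracy_score(y_test, clf.predict(X_test)))
```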

Microsoft Cognitive Toolkit

The Microsoft Cognitive Toolkit (CNTK) is an AI framework intended to take your machine learning projects to the next level. Microsoft says the open-source framework is capable of “training deep learning algorithms to function like the human brain.” It was initially released in 2016. On GitHub it has 16,110 commits, 198 contributors and 16,455 stars.
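As a rough sketch of what its Python API looks like, based on CNTK 2.x (the layer sizes and input values here are arbitrary illustrations, not anything prescribed by the toolkit):

```python
# Rough sketch of defining a small network with CNTK's Python API
# (assumes the cntk package, CNTK 2.x; sizes are arbitrary).
import numpy as np
import cntk as C

features = C.input_variable(2)                 # two input features
model = C.layers.Sequential([
    C.layers.Dense(4, activation=C.relu),      # hidden layer
    C.layers.Dense(1, activation=C.sigmoid),   # binary output
])(features)

# Evaluate the untrained network on a single example.
print(model.eval({features: np.array([[0.5, 1.5]], dtype=np.float32)}))
```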

Theano

Initially released in 2007, Theano is an open-source Python library that lets you build various machine learning models with relative ease. As one of the oldest libraries of its kind, it is regarded as an industry standard that has inspired later developments in deep learning. On GitHub Theano has 28,090 commits, 331 contributors and 8,933 stars.
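Theano's distinguishing idea, which later frameworks inherited, is symbolic computation: you declare variables, build an expression graph, and compile it into a callable function, with derivatives computed automatically. A minimal sketch, assuming Theano is installed:

```python
# Minimal sketch of Theano's symbolic style (assumes theano is installed).
import theano
import theano.tensor as T

x = T.dscalar("x")        # symbolic double-precision scalar
y = x ** 2 + 3 * x        # symbolic expression; nothing is computed yet
grad = T.grad(y, x)       # symbolic derivative: 2x + 3

f = theano.function([x], [y, grad])  # compile the graph into a function
print(f(2.0))             # [array(10.0), array(7.0)]
```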

Caffe

Caffe (Convolutional Architecture for Fast Feature Embedding) is a deep learning framework that focuses on expressiveness, speed, and modularity. Originally developed at the University of California, Berkeley, it is written in C++ and comes with a Python interface; its stable 1.0 release arrived in 2017. Caffe is open source under a BSD license. BSD licenses are a family of permissive free software licenses that impose minimal restrictions on the use and distribution of covered software. On GitHub Caffe has 4,154 commits, 265 contributors and 29,191 stars.
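In Caffe, network architectures live in .prototxt configuration files rather than in code, and the Python interface mainly loads and runs them. A minimal sketch of that workflow (the file names below are hypothetical placeholders, not files shipped with Caffe):

```python
# Minimal sketch of Caffe's Python interface. The .prototxt and
# .caffemodel file names are hypothetical placeholders.
import caffe

caffe.set_mode_cpu()                     # or caffe.set_mode_gpu()
net = caffe.Net("deploy.prototxt",       # network architecture definition
                "weights.caffemodel",    # trained weights
                caffe.TEST)              # run in inference mode
print(net.blobs["data"].data.shape)      # inspect the input blob's shape
```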

Platforms for Sharing Code

It is important to consider the possibilities that lie within sharing data. There is a clear advantage if we can manage to share code responsibly. It is also worth considering the relevance to the 2030 Agenda, a world-wide commitment to eradicate poverty and achieve sustainable development by 2030 while ensuring that no one is left behind. Code or repositories on their own will of course not achieve this by default. Yet this kind of collaboration can be an important piece of the puzzle if we wish to achieve a more equitable and sustained presence on planet Earth. In order of popularity, the largest platforms are as follows:

Adjusted spreadsheet with comparison of source-code-hosting facilities ranked by Alexa rank

Most platforms for sharing code use a distributed version-control system for tracking changes in source code during software development. This system is called Git, and it was created by Linus Torvalds (a Swedish-speaking Finnish-American developer) in 2005 for development of the Linux kernel, with other kernel developers contributing to its initial development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files.
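To make the basic workflow concrete, here is a small sketch that drives the core Git cycle (init, stage, commit, inspect history) from Python via subprocess. It assumes the git command-line tool is installed and on your PATH:

```python
# Sketch of the basic Git cycle driven from Python (assumes the git CLI).
import os
import subprocess
import tempfile

repo = tempfile.mkdtemp()  # throwaway working directory

def git(*args):
    subprocess.run(["git", *args], cwd=repo, check=True)

git("init")                                    # create an empty repository
git("config", "user.name", "Example")          # identity for the commit
git("config", "user.email", "you@example.com")
with open(os.path.join(repo, "hello.txt"), "w") as f:
    f.write("hello, version control\n")
git("add", "hello.txt")                        # stage the change
git("commit", "-m", "Initial commit")          # record a snapshot
git("log", "--oneline")                        # print the history
```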

GitHub

The American company GitHub provides hosting for software development version control using Git. The company is a subsidiary of Microsoft, which acquired it in 2018 for $7.5 billion. As of May 2019, GitHub reports having over 37 million users and more than 100 million repositories (including at least 28 million public ones). In 2014 it was estimated to be the largest host of source code in the world. Source code is the set of instructions and statements written by a programmer using a computer programming language. The service is led by Nat Friedman, reporting to Scott Guthrie, the executive vice president of Microsoft Cloud and AI. It should be noted that GitHub offers nonprofit accounts; however, these must be nongovernment, nonacademic, noncommercial and nonpolitical in nature, and have no religious affiliation.
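The repository statistics quoted throughout this article can also be fetched programmatically from GitHub's public REST API. A minimal sketch using the requests package (unauthenticated calls are rate-limited):

```python
# Minimal sketch: fetch a repository's statistics from the GitHub REST API
# (assumes the requests package; unauthenticated requests are rate-limited).
import requests

resp = requests.get("https://api.github.com/repos/keras-team/keras")
resp.raise_for_status()
repo = resp.json()
print(repo["stargazers_count"])   # stars
print(repo["forks_count"])        # forks
```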

SourceForge

SourceForge is a web-based service that offers software developers a centralised online location to control and manage free and open-source software projects. Founded in 1999 by VA Software, it was the first provider to offer this service without charge to open-source projects. Since 2012 the website has run on Apache Allura software (under the aforementioned Apache license), and it continues to offer free access to hosting and tools for developers of free and open-source software. However, a 2015 controversy, in which SourceForge bundled adware with installers for projects it had taken over, saw many users leave the platform or lose trust in it.

Google Repository

If we move beyond open source to the large technology companies: Google's monolithic software repository, which is used by 95% of its software developers worldwide, meets the definition of an ultra-large-scale system, providing evidence that the single-source-repository model can be scaled successfully. Google also has its own alternative to GitHub: Cloud Source Repositories.

United Nations Global Pulse

As a comparison on a completely different scale in terms of users, the United Nations Global Pulse is an initiative of the United Nations that attempts to “bring real-time monitoring and prediction to development and aid programs.” As an example of data sharing, there is a link at the bottom of one project page that leads to an email address, meaning you have to email one of the UN Global Pulse offices to request access to the code. Sharing the code is, however, not always the case on these pages.

Data Sharing Initiatives

Data is another resource that can be shared. A recent article on the UN Global Pulse blog, Data and AI for Progress During UNGA, highlights a few projects worth mentioning. The newly appointed Assistant Secretary-General (ASG) for Strategic Coordination, Volker Türk, met with several organisations to talk about the use of data from a development perspective.

Data for Now Initiative

At the launch of the Data for Now initiative, a new global effort to close the gaps in data for development, Deputy Secretary-General Amina J. Mohammed stated that: “… we can blaze a trail of success, by working together to unlock data, protect people’s privacy and to fight for inclusion.” During the Data for Now event, Google and the Global Partnership for Sustainable Development Data (GPSDD) announced an agreement focused on collaborating across platforms on earth observations for the SDGs.

Hunger Map Platform

The World Food Programme (WFP) and Alibaba recently presented their new Hunger Map platform, which combines different streams of information (such as weather, population size, conflict, hazards, food security, nutrition and market information) to monitor and predict the global food security situation in near real-time.

Global Data Commons

The Global Data Commons is, for now, an informal partnership of some 70 governments, organisations and companies working to create a roadmap for allowing rapid innovation and safe use of artificial intelligence for the SDGs. The meeting concluded with a plan to develop a common reference architecture and governance frameworks, and then to implement a working model based on concrete use cases.

This is day 126 of #500daysofAI. If you enjoy this article, please give me a response, as I do want to improve my writing and discover new research, companies and projects.
