Legal and ethical implications of data accessibility for public welfare and AI research advancement

This article has been co-written with Gabrielle Paris Gagnon, lawyer at Propulsio 360 Business Consultants LLP

Every organization that focuses on artificial intelligence wants and needs the same thing: more data to train its algorithms. Without a doubt, much of the success of today's deep learning systems is predicated on the availability and collection of large datasets, often supplied by users themselves in exchange for using these services for free.


Earlier in January, at the Strategic Forum on Artificial Intelligence organized by the Chamber of Commerce of Metropolitan Montreal, Valérie Bécaert, director of the research group at Element AI, addressed the business community on the importance of sharing data to make AI accessible to all organizations. At a time when big data is aggregated in the hands of a few powerful companies, this concentration perpetuates economic inequalities and undermines the use of AI for social causes, and the gap is bound to widen over time. Tech giants multiply their data collection applications and block access to this proprietary data even for the users who generated it in the first place (see the shadow profiles built by Facebook [1]).

Another positive outcome of a more open access policy to data would be the facilitation of research by the scientific community, startups and non-profit companies geared towards public welfare.

By 2030, experts predict that AI will contribute as much as US$15.7 trillion [2] to the world economy. If nothing is done and no standards are promulgated for data accessibility, these profits will go directly into the hands of a very few people. For the development of AI to be a vector of social mobility, these gains and the subsequent wealth creation have to be distributed in an equitable manner. We believe that accessibility to large datasets (which are crucial for training deep learning systems) can orient the collective use of this technology more towards public welfare.

Moreover, individuals, companies, and nonprofits, which are often the data generators for big tech companies, are at times at their mercy. Their lack of control over the data they generate and how it is used can also have negative outcomes. They find themselves powerless when there is a misalignment between what they derive from the use of big tech's products and the latter's business objectives. For example, the recent announcement that Facebook would alter its algorithm to give less prominence to publishers' posts is catastrophic for news providers and other businesses that have relied on Facebook for more than a decade to spread information.


Now let’s consider the following facts:

IDC, a research firm, predicts that the data captured in the digital ecosystem will reach 180 zettabytes (one zettabyte is 1 followed by 21 zeroes) by 2025 [3]
Amazon uses trucks pulling shipping containers to manage the amount of storage space their AWS clusters demand. [3]

This is a strong indicator that the amount of data being collected on users and their online behaviours is only going up. Much of these companies' profitability comes from their ability to monetize this data and subsequently sell it to advertisers via data brokers. [4]

Given such heavy reliance on data to create money-making products and services, equitable access to that data is all the more important: it would normalize the market and make it more competitive, which could lead to better outcomes from a user perspective.


Startups asking for access to public data

As big data collected by large organizations continues to have growing market value, startups are beginning to turn to the courts for authorization to gain access to this data.

These small companies allege that the control larger firms exert over publicly available data constitutes an anti-competitive practice.

In 2016, hiQ Labs, a startup that uses LinkedIn's publicly available data to build algorithms capable of predicting employee behaviours, such as when employees might quit and who should be promoted accordingly, received a cease and desist letter from LinkedIn stating that scraping its publicly available data violated the company's terms of use. hiQ Labs took the case to court because its business model was wholly dependent on the public data it acquires from LinkedIn. [5]

In August 2017, U.S. District Judge Edward Chen in San Francisco sided with hiQ Labs and ordered LinkedIn to remove within 24 hours any technology blocking hiQ Labs’ access to public profiles. In his opinion, asserting control by preventing hiQ from accessing LinkedIn public profiles could be a means of limiting competition, which violates California state law. Final oral arguments in this case are expected in March 2018.

Access to real-estate data in Canada

In Canada, the Competition Bureau filed a lawsuit against the Toronto Real Estate Board (TREB), a not-for-profit corporation that operates an online system for collecting and distributing real estate information to its members. TREB's policy restricted the communication and distribution by its members of some of the data it collected, such as sale prices.


The Competition Bureau pleaded that this restrictive distribution of digitized data prevented competition as well as deterred innovation and the emergence of new business models by prohibiting realtors from posting sales data on their websites. The Federal Court of Appeal sided with the Competition Bureau in December 2017 and ordered TREB to allow its members to share the sales histories of listed properties online. [6]

This decision is expected to have widespread ramifications in Canada for how organizations distribute data, and to fuel requests for more open data in the marketplace.

Thus having more open, public datasets could foster the development of goods and services that are more public-welfare oriented.

What we propose is the following: the creation of data sharing standards, driven by industry verticals, that enable researchers, young entrepreneurs, and others to build products and services for the public good. The General Data Protection Regulation (GDPR) [7], set to come into force on May 25, 2018, sets a loose precedent by requiring that companies be able to provide a user's data in a standardized format on demand. As it rolls into effect, it will be interesting to see how companies adopt standards that are interpretable across different data processors (the term used in the GDPR for the service providers a user could switch between).
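To make the portability idea concrete, here is a minimal sketch of what a standardized, machine-readable export might look like. The schema, field names, and functions below are illustrative assumptions on our part, not a format mandated by the GDPR or adopted by any real data processor.

```python
import json

# Hypothetical example of a portable user-data export, in the spirit of
# data portability requirements. One processor serializes a user's record
# into an agreed-upon JSON schema; another processor can parse it back.

def export_user_data(user_record: dict) -> str:
    """Serialize a user's record into a portable JSON document."""
    portable = {
        "schema_version": "1.0",          # lets the format evolve over time
        "subject_id": user_record["id"],  # the data subject being exported
        "profile": user_record.get("profile", {}),
        "activity": user_record.get("activity", []),
    }
    return json.dumps(portable, indent=2, sort_keys=True)

def import_user_data(document: str) -> dict:
    """A second data processor parses the same portable format."""
    return json.loads(document)

record = {
    "id": "user-42",
    "profile": {"name": "Alice", "language": "fr-CA"},
    "activity": [{"type": "post", "timestamp": "2018-01-15T10:00:00Z"}],
}

exported = export_user_data(record)
imported = import_user_data(exported)
```

Because both sides agree on the same schema, a user could take the document exported by one service and hand it to a competitor, which is exactly the kind of interoperability across data processors the regulation gestures at.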

Some potential downsides to public data sharing

Having more open data policies can also have chilling effects. Even sharing anonymized datasets publicly can be a challenge, as sophisticated statistical methods coupled with the mosaic effect (a technique that combines information from disparate sources to create a richer profile of the target) can reverse the anonymization process. Indeed, considering that 41% of Canadian companies had sensitive data stolen last year following security breaches, data sharing could jeopardize the privacy of users' sensitive information. [8]

This has been demonstrated many times in the past. In one case, the sexual orientation of a user was inferred from movie ratings in a publicly released Netflix dataset [9] by cross-referencing the ratings of certain rare movies with those publicly available on IMDb.
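A toy sketch of such a linkage attack, using entirely invented data, shows how little overlap is needed: matching an "anonymized" ratings release against a public source on a handful of rare titles can be enough to put a name back on an opaque token.

```python
# Toy illustration of a linkage ("mosaic effect") attack: an "anonymized"
# ratings release is joined against a public dataset that shares
# quasi-identifiers (here, ratings of rare movies). All data is invented.

# Anonymized release: user identities replaced by opaque tokens.
anonymized = {
    "token-a1": {"Obscure Film X": 5, "Rare Documentary Y": 4},
    "token-b2": {"Blockbuster Z": 3},
}

# Public dataset (e.g. reviews posted under real names on another site).
public = {
    "Alice": {"Obscure Film X": 5, "Rare Documentary Y": 4},
    "Bob": {"Blockbuster Z": 3, "Obscure Film X": 1},
}

def reidentify(anonymized, public, min_overlap=2):
    """Match tokens to named users when their ratings of the same
    (rare) titles coincide on at least `min_overlap` movies."""
    matches = {}
    for token, ratings in anonymized.items():
        for name, public_ratings in public.items():
            overlap = [
                title for title, score in ratings.items()
                if public_ratings.get(title) == score
            ]
            if len(overlap) >= min_overlap:
                matches[token] = name
    return matches

print(reidentify(anonymized, public))  # {'token-a1': 'Alice'}
```

In this sketch, "token-a1" is re-identified as Alice because two rare-movie ratings match exactly, while "token-b2" stays anonymous only because a single shared rating falls below the matching threshold. The real attacks cited above use far more robust statistical matching, but the underlying idea is the same.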

In another case, AOL released search queries to the public, and specific users were identified down to their home addresses, medical needs, and pet ownership, among other things. [10]

It must also be taken into account that although users may have accepted the collection and sharing of their personal data via terms and conditions, there is debate as to whether that consent is actually valid. There is also the risk of identity fraud that can result from the linking of different datasets when combined with deep learning methods [11]. To protect users' privacy, we would need both legal and technical mechanisms in place that encourage the sharing of these datasets while balancing it against the privacy of the individuals involved.

Another idea could be to allow for remuneration models along the lines of micro-payments, where a user's explicit consent is obtained to perform specific activities with their data. This can be incentivized by strong user demand for data privacy and the potential boycott of services that violate that right.

Do you agree with the notion that large datasets from corporations could help fuel research for public welfare? What would be the best way to go about it? Share your thoughts in the comments section and let us know.

For more information on the work that I do in the ethical development of AI, visit


[1] How Facebook figures out everyone you’ve ever met —

[2] AI Will Add $15.7 Trillion to the Global Economy —

[3] Data is giving rise to a new economy —

[4] Data Brokers: A Call for Transparency and Accountability —

[5] LinkedIn Cannot Block Startup From Scraping Public Profile Data, US Judge Rules —

[6] Appeal court upholds ruling ordering real estate agents to make home sale data public —

[7] EU GDPR —

[8] Security Breaches prove costly —

[9] Robust De-anonymization of Large Datasets by Arvind Narayanan and Vitaly Shmatikov —

[10] A Face Is Exposed for AOL Searcher №4417749 —

[11] The Evolution of Fraud: Ethical Implications in the Age of Large-Scale Data Breaches and Widespread Artificial Intelligence Solutions Deployment by Abhishek Gupta —