The Most Important Court Decision For Data Science and Machine Learning

Training algorithms on copyrighted data is not illegal, according to the U.S. Court of Appeals for the Second Circuit.

Matthew Stewart, PhD
Towards Data Science

This article discusses the Authors Guild v. Google case and the ramifications of its precedent for the field of artificial intelligence for the foreseeable future.

Overview of the Case

Authors Guild v. Google has easily set one of the most important precedents for the field of artificial intelligence, and for machine learning in particular. The case turned on whether Google had the legal right to use copyrighted books in its training database in order to train its Google Book Search algorithm. The Authors Guild alleged that the development of the Google Book Search database infringed the copyright of millions of books.

In the latter months of 2005, the Authors Guild and the Association of American Publishers both sued Google, claiming the company had committed “massive copyright infringement” through its use of copyrighted books for training a book search algorithm. Google argued that its project was a fair use of the data and that its implementation was the digital-age equivalent of a card catalog.

The Authors Guild and the Association of American Publishers teamed up against Google, and a settlement was proposed after several years of litigation. For various reasons, the court rejected the settlement on March 22, 2011. The Association of American Publishers later settled with Google, but the lawsuit with the Authors Guild continued.

In 2012 the Authors Guild’s proposed class was certified. Google appealed that decision, with a number of amici asserting the inadequacy of the class, and the Second Circuit rejected the class certification in July 2013, remanding the case to the District Court for consideration of Google’s fair use defense.

In November 2013, Judge Denny Chin, a U.S. Circuit Judge sitting by designation in the District Court, dismissed the lawsuit and held that the Google Books program meets all legal requirements for “fair use.”

The District Court Ruling

In his November 2013 ruling, Judge Chin wrote:

In my view, Google Books provides significant public benefits. It advances the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders.

Chin’s ruling analyzed the four traditional factors that determine whether the use of a copyrighted work qualifies as fair use under U.S. copyright law. He concluded that Google Books meets all legal requirements of fair use and is therefore not a violation of copyright, as the Authors Guild had alleged. The most important of these factors was the possible economic damage to the copyright owner. Chin stated that “Google Books enhances the sales of books to the benefit of copyright holders”, meaning that because the use does not harm the market for the original works, this factor weighs in favor of fair use.

However, the case was not over at this point, as the plaintiff (the suing party) had the opportunity to appeal to a higher court.

Appeal to the Second Circuit

On April 11, 2014, the Authors Guild appealed the District Court’s ruling to the U.S. Court of Appeals for the Second Circuit, a higher court with the power to overturn the District Court’s decision if it found the ruling to be in error. The Guild also began lobbying Congress to create a non-profit organization that would digitize and license books from authors to organizations choosing to pay subscriptions, fearing the impact that such a ruling could have on the publishing industry and individual authors.

Oral arguments were held on December 3, 2014, and on October 16, 2015, the Second Circuit unanimously affirmed the judgment in Google’s favor.

The court’s summary of its opinion is:

Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google’s commercial nature and profit motivation do not justify denial of fair use.

Google’s provision of digitized copies to the libraries that supplied the books, on the understanding that the libraries will use the copies in a manner consistent with the copyright law, also does not constitute infringement.

Nor, on this record, is Google a contributory infringer.

This effectively made the precedent even stronger, but the Authors Guild, still convinced that it was in the right, decided to take the case to the Supreme Court.

Supreme Court Petition

The Authors Guild was not happy with the outcome of this lawsuit, so on December 31, 2015, it filed a petition for a writ of certiorari with the Supreme Court, which is essentially a request that a higher court review the decision of a lower court, in this case the Second Circuit.

On April 18, 2016, the Supreme Court denied the petition for writ of certiorari, leaving the Second Circuit ruling in Google’s favor intact.

This does not imply that the Supreme Court endorsed or opposed the ruling; it merely means that fewer than four of the Supreme Court Justices voted to review the case.

This sets a precedent within the Second Circuit but is not binding in other circuits. If a different circuit reached a different conclusion on a similar question, the Supreme Court might then decide to review the issue.

Ramifications

The legal ramifications of this precedent could be far-reaching. The Second Circuit’s decision has given tech companies something of a green light to use copyrighted material in the development of deep learning algorithms, largely because such use does not directly affect the earnings of the individual works under copyright. If I had written one of the books that Google used to train its algorithm, I would suffer no adverse effects from that use.

Silicon Valley falls within the Ninth Circuit, meaning that this decision is not binding precedent there, but it does give additional confidence to companies that are thinking of using copyrighted data in their models.

One could then assume that this precedent would also extend to images, songs, and potentially any other data produced by individuals that is accumulated by tech conglomerates.

Things get more interesting when we go from search algorithms, which are discriminative, to generative algorithms.

A discriminative algorithm takes the original data and essentially tries to break it down into a single result — think of a classification algorithm taking a data point and putting it into a certain group.

A generative algorithm takes the original data and uses this to make new data. In this sense, it is a data-generating process. Deep generative models such as generative adversarial networks and variational autoencoders are commonly used for generating and manipulating image data.
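
To make the distinction concrete, here is a minimal sketch in plain Python. It is only an illustration and has nothing to do with how Google Book Search actually works: the toy dataset and the helper names (fit_threshold, classify, fit_gaussian, generate) are all hypothetical. The discriminative part learns a single boundary and maps each data point to a class, while the generative part fits a simple Gaussian to one class and samples new data points from it.

```python
import random
import statistics

# Toy 1-D dataset (hypothetical values): (feature, class label).
data = [(1.0, 0), (1.5, 0), (2.0, 0), (6.0, 1), (6.5, 1), (7.0, 1)]


def fit_threshold(samples):
    """Discriminative: learn only the boundary that best separates the labels."""
    xs = sorted(x for x, _ in samples)
    # Candidate boundaries are the midpoints between neighbouring points.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        # Count training points that land on the wrong side of boundary t.
        return sum((x > t) != bool(y) for x, y in samples)

    return min(candidates, key=errors)


def classify(x, threshold):
    """Reduce a data point to a single result: its predicted class."""
    return int(x > threshold)


def fit_gaussian(samples, label):
    """Generative: model how the data of one class is distributed."""
    xs = [x for x, y in samples if y == label]
    return statistics.mean(xs), statistics.stdev(xs)


def generate(mean, std, n, rng):
    """Draw brand-new data points from the fitted distribution."""
    return [rng.gauss(mean, std) for _ in range(n)]


threshold = fit_threshold(data)
mean1, std1 = fit_gaussian(data, label=1)
new_points = generate(mean1, std1, n=3, rng=random.Random(0))

print("learned threshold:", threshold)
print("predicted class of 5.2:", classify(5.2, threshold))
print("newly generated class-1 points:", new_points)
```

The classifier never produces anything that resembles its training data, only a label. The generative model, by contrast, outputs new values that look like the data it was fitted on, which is exactly why the copyright question becomes harder for generative models.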

The Google Book Search algorithm is clearly a discriminative model: it searches through a database in order to find the correct book. Does this mean that the precedent extends to generative models? It is not entirely clear, and the question was most likely not discussed because the legal parties in this case had little familiarity with the field.

This gets into some particularly complicated and dangerous territory, especially regarding images and songs. If a deep learning algorithm is trained on millions of copyrighted images, would the images it generates be subject to those copyrights? Similarly with songs, if I created an algorithm that could write songs like Ed Sheeran because I had trained it on his songs, would this infringe upon his copyright? Even with the precedent set in this case, the ramifications are not completely clear, but this result does make a compelling case that this would also be considered acceptable.

Of course, one could take a different view that using generative models and trying to commercialize these would directly compete with the copyrighted material, and thus could be argued to infringe upon their copyright. However, due to the black-box nature of most machine learning models, this would be extremely difficult to both prove and disprove, which leaves us in some form of limbo regarding the legality of such a case.

Until some brave soul goes out and tries generating movies, music, or images based on copyrighted material and tries to commercialize these, and is subsequently legally challenged on this, it is hard to speculate upon the legality of such an action. That being said, I am absolutely sure that this is not a matter of if, but when, this particular case will arrive.

The important things to take away from this case are:

  • Using copyrighted material in a dataset that is used to train a discriminative machine-learning algorithm (such as for search purposes) is perfectly legal.
  • Using copyrighted material in a dataset that is used to train a generative machine-learning algorithm has precedent on its side in any future legal challenge.

Final Comments

I hope you enjoyed this article discussing the Authors Guild v. Google case. Deep learning is a very recent and hot topic, and I believe we have not seen the end of legal cases regarding the use of copyrighted data for the purpose of training large-scale deep learning models. The field is moving so fast and is so new that deep generative models such as GANs did not even exist at the beginning of this legal case (they were proposed in 2014 by Ian Goodfellow). Clearly, the legality of data usage will become more and more important as an increasing number of companies choose to incorporate artificial intelligence into their business operations. Watch this space; we live in interesting times.

ML Postdoc @Harvard | Environmental + Data Science PhD @Harvard | ML consultant @Critical Future | Blogger @TDS | Content Creator @EdX. https://mpstewart.io