In recent months, rightsholders of all kinds have filed lawsuits against companies that develop AI models.
The list includes record labels, individual authors, visual artists, and even the New York Times. These rightsholders all object to the presumed use of their work without proper compensation.
AI researcher Shawn Presser created the Books3 dataset in 2020 by scraping the library of ‘pirate’ site Bibliotik. The general vision was that the plaintext collection of more than 195,000 books, nearly 37GB in size, could help AI enthusiasts build better models.
The vision wasn’t wrong; large text archives are great training material for Large Language Models, but many authors disapprove of their works being used in this manner, without permission or compensation.
Authors Sue, OpenAI Responds
In a lawsuit filed last June, authors Paul Tremblay and Mona Awad accused OpenAI of direct and vicarious copyright infringement, among other things. Soon after, writer/comedian Sarah Silverman teamed up with authors Christopher Golden and Richard Kadrey in an identical suit.
The complaints allege that the authors’ books were sourced from pirate sites. They specifically mention the controversial Books3 dataset, as well as data from other shadow libraries such as LibGen, Z-Library, and Sci-Hub.
“The books aggregated by these websites have also been available in bulk via torrent systems. These flagrantly illegal shadow libraries have long been of interest to the AI-training community...,” the authors wrote.
OpenAI didn’t deny these allegations directly but nevertheless disagreed that using books to train AI amounts to vicarious copyright infringement or violations of the DMCA.
In a motion to dismiss, OpenAI asked the California federal court to ‘trim’ the scope of the case. The only claim that should be able to survive is direct copyright infringement, but OpenAI said it expects to defeat that at a later stage.
Court Dismisses Copyright and DMCA Claims
After reviewing input from both sides, California District Judge Araceli Martínez-Olguín ruled on the matter. In her order, she largely sides with OpenAI.
The vicarious copyright infringement claim fails because the court doesn’t agree that all output produced by OpenAI’s models can be seen as a derivative work. To survive, the infringement claim has to be more concrete.
“Plaintiffs’ allegation that ‘every output of the OpenAI Language Models is an infringing derivative work’ is insufficient. Plaintiffs fail to explain what the outputs entail or allege that any particular output is substantially similar – or similar at all – to their books,” the order reads.
In addition to copyright infringement, the authors accused OpenAI of violating the DMCA by intentionally altering the copyright management information (CMI). Details such as the title, the author, and the copyright owner were allegedly stripped to “enable” or “conceal” infringement.
Judge Martínez-Olguín sees no evidence for the intentional removal of this copyright information. And, even if these allegations are true, there’s no evidence that it was done for nefarious reasons.
“Plaintiffs argue that OpenAI’s failure to state which internet books it uses to train ChatGPT shows that it knowingly enabled infringement, because ChatGPT users will not know if any output is infringing.
“However, Plaintiffs do not point to any caselaw to suggest that failure to reveal such information has any bearing on whether the alleged removal of CMI in an internal database will knowingly enable infringement.”
The authors further claimed that OpenAI distributed their works without CMI, which would also violate the DMCA. This argument fails too, the court ruled, as OpenAI didn’t distribute full copies of books.
“Instead, [the authors] have alleged that ‘every output from the OpenAI Language Models is an infringing derivative work’ without providing any indication as to what such outputs entail – i.e., whether they are the copyrighted books or copies of the books,” the order reads.
Direct Copyright Infringement Claim Remains
In addition to the vicarious copyright infringement and the DMCA violations, Judge Martínez-Olguín also dismissed the California Unfair Competition Law (UCL) claims for ‘unlawful business practice’, ‘fraudulent conduct’, ‘negligence’, and ‘unjust enrichment’. The UCL claim for ‘unfair practices’ can proceed.
This isn’t the end of the legal battle. The authors have the chance to file an amended complaint to correct any shortcomings, should they wish to proceed with the dismissed claims.
Finally, it’s worth reiterating that the direct copyright infringement claim wasn’t covered by OpenAI’s motion to dismiss, so that will move forward as well, as will many of the other AI copyright lawsuits.
A copy of California District Judge Araceli Martínez-Olguín’s order on the motion to dismiss is available here (pdf).
From: TF, for the latest news on copyright battles, piracy and more.