Duck Soup: The Unbelievable Scale of AI’s Pirated-Books Problem

When employees at Meta started developing their flagship AI model, Llama 3, they faced a simple ethical question. The program would need to be trained on a huge amount of high-quality writing to be competitive with products such as ChatGPT, and acquiring all of that text legally could take time. Should they just pirate it instead?

Meta employees spoke with multiple companies about licensing books and research papers, but they weren’t thrilled with their options. This “seems unreasonably expensive,” wrote one research scientist on an internal company chat, in reference to one potential deal, according to court records. A Llama-team senior manager added that this would also be an “incredibly slow” process: “They take like 4+ weeks to deliver data.” In a message found in another legal filing, a director of engineering noted another downside to this approach: “The problem is that people don’t realize that if we license one single book, we won’t be able to lean into fair use strategy,” a reference to a possible legal defense for using copyrighted books to train AI.

Court documents released last night show that the senior manager felt it was “really important for [Meta] to get books ASAP,” as “books are actually more important than web data.” Meta employees turned their attention to Library Genesis, or LibGen, one of the largest of the pirated libraries that circulate online. It currently contains more than 7.5 million books and 81 million research papers. Eventually, the team at Meta got permission from “MZ”—an apparent reference to Meta CEO Mark Zuckerberg—to download and use the data set.

This act, along with other information outlined and quoted here, recently became a matter of public record when some of Meta’s internal communications were unsealed as part of a copyright-infringement lawsuit brought against the company by Sarah Silverman, Junot Díaz, and other authors of books in LibGen. Also revealed recently, in another lawsuit brought by a similar group of authors, is that OpenAI has used LibGen in the past. (A spokesperson for Meta declined to comment, citing the ongoing litigation against the company. In a response sent after this story was published, a spokesperson for OpenAI said, “The models powering ChatGPT and our API today were not developed using these datasets. These datasets, created by former employees who are no longer with OpenAI, were last used in 2021.”)

Until now, most people have had no window into the contents of this library, even though they have likely been exposed to generative-AI products that use it; according to Zuckerberg, the “Meta AI” assistant has been used by hundreds of millions of people (it’s embedded in Meta products such as Facebook, WhatsApp, and Instagram). (...)

Meta and OpenAI have both argued in court that it’s “fair use” to train their generative-AI models on copyrighted work without a license, because LLMs “transform” the original material into new work. The defense raises thorny questions and is likely a long way from resolution. But the use of LibGen raises another issue. Bulk downloading is often done with BitTorrent, the file-sharing protocol popular with pirates for its anonymity, and downloading with BitTorrent typically involves uploading to other users simultaneously. Internal communications show employees saying that Meta did indeed torrent LibGen, which means that Meta could have not only accessed pirated material but also distributed it to others—well established as illegal under copyright law, regardless of what the courts determine about the use of copyrighted material to train generative AI. (Meta has claimed that it “took precautions not to ‘seed’ any downloaded files” and that there are “no facts to show” that it distributed the books to others.) OpenAI’s download method is not yet known.

Meta employees acknowledged in their internal communications that training Llama on LibGen presented a “medium-high legal risk,” and discussed a variety of “mitigations” to mask their activity. One employee recommended that developers “remove data clearly marked as pirated/stolen” and “do not externally cite the use of any training data including LibGen.” Another discussed removing any line containing ISBN, Copyright, ©, All rights reserved. A Llama-team senior manager suggested fine-tuning Llama to “refuse to answer queries like: ‘reproduce the first three pages of “Harry Potter and the Sorcerer’s Stone.”’” One employee remarked that “torrenting from a corporate laptop doesn’t feel right.”

It is easy to see why LibGen appeals to generative-AI companies, whose products require huge quantities of text. LibGen is enormous, many times larger than Books3, another pirated book collection whose contents I revealed in 2023. Other works in LibGen include recent literature and nonfiction by prominent authors such as Sally Rooney, Percival Everett, Hua Hsu, Jonathan Haidt, and Rachel Khong, and articles from top academic journals such as Nature, Science, and The Lancet. It includes many millions of articles from top academic-journal publishers such as Elsevier and Sage Publications.

by Alex Reisner, The Atlantic | Read more:
Image: Matteo Giuseppe Pani
[ed. Zuckerberg should have his own chapter in the Book of Liars (a notable achievement, given the competition). See also: These People Are Weird (WWL). But there's also some good news: First of its kind” AI settlement: Anthropic to pay authors $1.5 billion (ArsT):]

"Today, Anthropic likely breathes a sigh of relief to avoid the costs of extended litigation and potentially paying more for pirating books. However, the rest of the AI industry is likely horrified by the settlement, which advocates had suggested could set an alarming precedent that could financially ruin emerging AI companies like Anthropic."

Monday, September 8, 2025

The Unbelievable Scale of AI’s Pirated-Books Problem