Meta Accused of Plagiarism by Book Publishers

Key Takeaways

1. Meta is being sued by five publishers and one author for illegally using their protected works to train its AIs.

2. The complaint alleges that Meta extracted protected content from piracy sites to feed its Llama models.

3. A previous ruling recognized fair use for AI training, but class action lawsuits continue.

💡 Why it matters: These lawsuits could redefine the legal boundaries of using protected content for AI training.

Meta Accused of Plagiarism by Renowned Publishers

Meta is at the center of a new legal battle after five book publishers and an author filed a collective lawsuit against the company. The publishers, Macmillan, McGraw Hill, Elsevier, Hachette, and Cengage, along with author Scott Turow, accuse Meta of committing one of the largest copyright violations in history. They allege that Meta used their books and articles without permission to train its artificial intelligence models, including Llama. The New York Times reported on the case earlier.

Accusations of Piracy to Train AI

The lawsuit accuses Meta of knowingly extracting protected works from well-known piracy sites, such as LibGen, Anna’s Archive, Sci-Hub, and Sci-Mag, and of feeding that content into its AI models. The plaintiffs also claim that Llama was trained on the Common Crawl dataset, which they say contains numerous unauthorized copies of protected works. According to the complaint, this training allows Llama to produce text that is very similar, if not identical, to the original works.

Concrete Examples of Textual Reproduction

The publishers provided specific examples to support their accusations. For instance, they say that when Llama is prompted with a few sentences from James Stewart's textbook "Calculus: Early Transcendentals," 9th edition, published by Cengage, it reproduces the subsequent text word for word. This kind of verbatim reproduction raises questions about the use of protected works in the development of AI models.

A Legal Precedent and Similar Lawsuits

This is not the first time Meta has faced accusations of copyright infringement. In a previous lawsuit, a federal judge ruled in favor of Meta but emphasized that the decision did not mean Meta's use of copyrighted materials to train its language models was legal. Separately, a group of authors sued Anthropic over similar alleged copyright violations. Although a judge found that training on legally acquired books could be considered fair use, Anthropic settled the class action by paying $1.5 billion to the affected authors.

The Plaintiffs' Demands

The publishers and Scott Turow are seeking damages from Meta and want the court to order the company to cease its alleged illegal activities. They also demand that Meta provide a complete list of the books, journal articles, and other protected works used to train its Llama AI models.

Meta's Position

Meta is vigorously defending itself against these accusations. Dave Arnold, a company spokesperson, stated that artificial intelligence drives significant innovations and that courts have recognized that training AI on protected material can be considered fair use. Meta plans to aggressively contest the lawsuit, asserting that its practices comply with existing laws. Internal discussions at Meta, however, have revealed concerns about how to manage media coverage suggesting that the company used a pirated dataset.