Like many tech companies, Adobe has made significant strides in incorporating AI into its operations over the past few years. Since 2023, the company has rolled out various AI services, including Firefly, its AI-driven media generation suite. However, this enthusiastic adoption of AI may have backfired, as a new lawsuit alleges that Adobe used pirated books to train one of its AI models.
A proposed class-action lawsuit has been filed on behalf of Elizabeth Lyon, an author from Oregon, claiming that Adobe used pirated versions of numerous books, including her own, to develop the SlimLM program.
Adobe describes SlimLM as a series of small language models optimized for document assistance on mobile devices. The company states that SlimLM was pre-trained using SlimPajama-627B, an “open-source dataset” released by Cerebras in June 2023. Lyon, who has authored several guidebooks on non-fiction writing, claims her works were included in the dataset Adobe utilized.
Lyon’s lawsuit, first reported by Reuters, states that her writing was included in a processed derivative of an earlier dataset that Adobe built its program on. “The SlimPajama dataset was created by copying and manipulating the RedPajama dataset (including copying Books3),” the lawsuit claims. “As a derivative copy of the RedPajama dataset, SlimPajama contains the Books3 dataset, which includes the copyrighted works of Plaintiff and the Class members.”
“Books3” refers to a vast collection of roughly 191,000 books used to train generative AI systems and has been a source of ongoing legal disputes in the tech world. RedPajama has likewise figured in several lawsuits. In September, a lawsuit accused Apple of using copyrighted materials to train its Apple Intelligence models, citing the dataset and claiming the company copied protected works “without consent and without credit or compensation.” In October, Salesforce was similarly sued over its use of RedPajama in training.
Such lawsuits have become increasingly common in the tech industry. AI models rely on extensive training datasets, and plaintiffs have repeatedly alleged that these datasets contain pirated materials. Recently, Anthropic agreed to pay $1.5 billion to a group of authors who accused it of training its chatbot, Claude, on unauthorized copies of their work. That settlement is seen as a potential turning point in the ongoing legal battles over copyrighted material in AI training data.
