Britannica and Merriam-Webster sue OpenAI for copyright infringement

    Encyclopedia Britannica and Merriam-Webster have filed a copyright lawsuit against OpenAI, alleging the company used nearly 100,000 of their copyrighted online articles to train its large language models without permission. The suit is one of the more precisely scoped complaints filed against OpenAI to date, targeting not just the training data question that has dominated earlier lawsuits, but also OpenAI's retrieval-augmented generation workflow and a separate claim under the Lanham Act tied to AI-generated hallucinations.

    Britannica has been publishing reference content since 1768. Merriam-Webster's dictionary dates to 1828. Both companies have spent years building subscription-based digital businesses on the back of that content. The lawsuit is a direct statement that they consider OpenAI's use of their material to be unauthorized commercial exploitation, not a form of research or fair use.

    Britannica and Merriam-Webster take legal action against OpenAI over training data

    The three legal claims explained

    The lawsuit brings three distinct claims. The first is straightforward copyright infringement: the publishers allege OpenAI scraped approximately 100,000 articles from their websites and used that text to train its models without a license. This mirrors the structure of earlier suits filed by The New York Times and other media organizations, which also center on training data scraped without authorization.

    The second claim targets retrieval-augmented generation, known as RAG. RAG is a technique OpenAI uses where the model pulls in content from external sources at query time to generate more accurate or current answers. The publishers allege that this workflow reproduces Britannica content in OpenAI's outputs without permission. That claim goes beyond what most training data lawsuits cover, because RAG involves active retrieval and reproduction rather than passive influence from pre-trained weights.
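    The retrieval step at the heart of this claim can be illustrated with a toy sketch. This is not OpenAI's implementation (production systems use vector embeddings and a language model; the word-overlap scoring, corpus snippets, and function names here are invented for illustration), but it shows why RAG differs legally from training: retrieved source text is copied verbatim into the prompt at query time.

    ```python
    # Minimal RAG sketch. Illustrative only: real systems rank documents
    # with embedding similarity, not word overlap, and pass the prompt
    # to a language model rather than printing it.

    def retrieve(query, corpus, top_k=1):
        """Rank documents by word overlap with the query (a stand-in
        for embedding similarity) and return the best matches."""
        q_words = set(query.lower().split())
        scored = sorted(
            corpus,
            key=lambda doc: len(q_words & set(doc.lower().split())),
            reverse=True,
        )
        return scored[:top_k]

    def build_prompt(query, corpus):
        """Splice retrieved text into the model's prompt. The copyright
        claim turns on this step: source text is reproduced verbatim
        inside the response-generation context."""
        context = "\n".join(retrieve(query, corpus))
        return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    # Hypothetical reference snippets standing in for scraped articles.
    corpus = [
        "The Battle of Hastings took place in 1066.",
        "The French Revolution began in 1789.",
    ]
    print(build_prompt("When was the Battle of Hastings?", corpus))
    ```

    Because the retrieved passage appears word-for-word in the generation context, the publishers can argue reproduction occurs at the moment of each query, independent of whatever happened during training.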

    The third claim is under the Lanham Act, which governs false advertising and brand misrepresentation. The publishers allege that when ChatGPT produces hallucinated information and attributes it, even implicitly, to authoritative reference sources like Britannica, it creates a false impression about the origin and accuracy of that content. This is an unusual application of the Lanham Act in an AI context, and if it gains traction in court it could open a new liability category for AI companies whose models generate false information under the perceived authority of established publishers.

    Why the RAG claim is legally significant

    Most copyright suits against AI companies have focused on the training phase, arguing that ingesting copyrighted text to build a model constitutes infringement. OpenAI has consistently argued that training on publicly available text is transformative and qualifies as fair use under US copyright law. Courts have not yet settled that question definitively, but the argument has held up well enough in early procedural rulings to keep the fair use defense viable.

    The RAG claim is harder to deflect with a fair use argument. When a model retrieves a chunk of text from a Britannica article and incorporates it into a response, that looks more like direct reproduction than transformation. The publishers will likely argue that the commercial context (OpenAI charges subscribers for access to ChatGPT while that service pulls and displays their content) makes fair use particularly difficult to establish. Courts have historically given less weight to fair use claims when the defendant profits commercially from the copied material.

    The Lanham Act hallucination angle

    The Lanham Act claim deserves attention on its own. AI hallucinations, where a model generates confident but factually incorrect information, have been documented across all major language models. When those hallucinations involve topics covered by authoritative reference publishers, there is a question of whether the false output damages the publisher's reputation or misleads users about the source's reliability.

    Britannica's brand is built on accuracy. A user who asks ChatGPT about a historical event and receives a hallucinated answer that draws on or resembles Britannica-style encyclopedic content may wrongly attribute the error to Britannica, particularly if the model cites or gestures toward authoritative reference sources. The lawsuit argues that this constitutes a form of false designation of origin under the Lanham Act. Whether courts accept that framing will depend on how broadly they interpret the statute's language around commercial misrepresentation.

    Where this lawsuit fits in the broader legal picture

    OpenAI is currently defending copyright claims from The New York Times, which filed in December 2023, as well as suits from a group of authors including John Grisham and George R.R. Martin, and a separate action from Getty Images over image training data. The Britannica and Merriam-Webster suit adds two publishers with strong brand recognition and well-documented digital archives, which makes it easier to establish what was scraped and what the commercial value of that content is.

    OpenAI has reached licensing agreements with some publishers, including the Associated Press and Axel Springer, and was reported to be in talks with others. Britannica and Merriam-Webster apparently chose litigation rather than negotiation, at least at this stage. That decision may reflect dissatisfaction with the licensing terms OpenAI has offered other publishers, or a strategic judgment that the RAG and Lanham Act claims give them stronger legal leverage than a pure training data suit would provide.

    What OpenAI is likely to argue

    OpenAI will almost certainly invoke fair use as its primary defense on the training data claim. On the RAG claim, the company may argue that retrieving and summarizing content for a user query is analogous to how a search engine indexes and displays snippets, a practice that has survived copyright challenges in US courts, most notably in the long-running litigation over Google Books, which the Second Circuit ruled in Google's favor in 2015.

    The Lanham Act claim will be harder to dismiss at the pleading stage because the publishers can point to specific instances where ChatGPT produced inaccurate information on topics covered by their publications. OpenAI will likely argue that users understand AI outputs are not sourced from specific publishers and that no actual false designation of origin occurred. That argument may or may not hold once the publishers produce examples of hallucinated content that closely mirrors Britannica article topics and structure.

    The case was filed in a US federal court. No trial date has been set, and given the complexity of the claims and OpenAI's history of seeking early dismissals in other copyright suits, a ruling on the merits is likely at least two years away.


    Frequently Asked Questions

    Q: What is retrieval-augmented generation and why is it part of this lawsuit?

    Retrieval-augmented generation is a technique where an AI model pulls content from external sources at the moment a user submits a query, rather than relying solely on its pre-trained knowledge. The publishers allege that OpenAI's use of this technique to retrieve and reproduce Britannica content in ChatGPT responses constitutes active copyright infringement separate from the training data issue.

    Q: What does the Lanham Act have to do with AI hallucinations?

    The Lanham Act covers false advertising and commercial misrepresentation. The publishers argue that when ChatGPT produces hallucinated content on topics covered by their publications, it can mislead users about the accuracy or origin of the information, effectively damaging their reputations as reliable sources. This is a novel application of the statute that courts have not yet evaluated in an AI context.

    Q: Has OpenAI settled similar copyright claims with other publishers?

    OpenAI has reached licensing deals with the Associated Press and Axel Springer, among others. However, The New York Times, a group of authors including John Grisham, and now Britannica and Merriam-Webster have chosen to litigate rather than license, suggesting not all publishers find OpenAI's offered terms acceptable.

    Q: How many articles are at the center of the Britannica and Merriam-Webster lawsuit?

    The lawsuit alleges OpenAI scraped approximately 100,000 copyrighted online articles from the two publishers without authorization to use as training data for its large language models.

    Q: How long could this lawsuit take to reach a verdict?

    Given the legal complexity involved and OpenAI's pattern of seeking early dismissals in other copyright cases, a ruling on the merits is realistically at least two years away. No trial date has been set as of the filing.
