A new study has shaken one of the tech industry’s core defenses: that large language models don’t memorize copyrighted content. Researchers have found that Meta’s LLaMa 3.1 model has memorized about 42% of Harry Potter and the Philosopher’s Stone well enough to reproduce passages from it word for word, raising fresh legal and ethical concerns.
LLaMa 3.1 shows an unprecedented level of memorization
The researchers split each book into overlapping 100-token passages, prompted LLaMa 3.1 with the first 50 tokens of each passage, and checked whether the model assigned more than 50% probability to the exact 50 tokens that follow. Clearing that bar indicates not just pattern recognition but near-verbatim recall of the original text: for a 50-token continuation to be more likely than not, the model has to assign roughly 98.5% probability to each correct token on average. That so many passages clear it suggests LLaMa 3.1 has internalized a large portion of the book.
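The arithmetic behind that threshold can be sketched in a few lines of Python. This is an illustrative sketch, not the researchers' code: it assumes the per-token probabilities of the correct continuation have already been obtained from a model, and the function names (`continuation_probability`, `is_memorized`, `memorized_fraction`) are hypothetical. It shows that the probability of a whole continuation is the product of its per-token probabilities, which is why 50 tokens need about 98.6% each, in line with the roughly 98.5% figure cited above, to stay above 50% overall.

```python
import math

CONTINUATION_TOKENS = 50  # length of the continuation described in the article
THRESHOLD = 0.5           # "over 50%" probability for the whole continuation


def continuation_probability(token_probs):
    """Probability of the full continuation: the product of per-token probabilities."""
    if any(p <= 0.0 for p in token_probs):
        return 0.0
    # Sum logs rather than multiplying directly, to avoid underflow on long sequences.
    return math.exp(sum(math.log(p) for p in token_probs))


def is_memorized(token_probs, threshold=THRESHOLD):
    """True if the model is more likely than not to emit the exact continuation."""
    return continuation_probability(token_probs) > threshold


def memorized_fraction(passages, threshold=THRESHOLD):
    """Share of passages whose continuation clears the threshold."""
    if not passages:
        return 0.0
    hits = sum(is_memorized(p, threshold) for p in passages)
    return hits / len(passages)


# Per-token probability (geometric mean) needed for 50 tokens to stay above 50%.
required_per_token = THRESHOLD ** (1 / CONTINUATION_TOKENS)

if __name__ == "__main__":
    print(f"required per-token probability: {required_per_token:.4f}")
    # Made-up inputs: 99% per token clears the bar, 98% per token does not.
    print(is_memorized([0.99] * CONTINUATION_TOKENS))
    print(is_memorized([0.98] * CONTINUATION_TOKENS))
```

Even a small drop in per-token confidence compounds quickly over 50 tokens, which is why passages that pass this test are treated as memorized rather than merely plausible.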
Popular books are remembered far more than obscure ones
This behavior isn’t uniform across all texts. LLaMa 3.1 tends to memorize very popular titles like The Hobbit or 1984, but retains far less of lesser-known books. For example, it memorized just 0.13% of Sandman Slim, a 2009 novel by Richard Kadrey, who ironically is suing Meta over its training practices.
Legal risks are mounting for AI training practices
The findings could support the argument that AI models contain infringing material in their internal weights. The U.S. Copyright Office recently stated that if models reproduce “relevant portions” of protected works, those internal weights might constitute illegal copies. This undermines tech companies’ claims that memorization is marginal.
Meta faces fallout beyond legal pressure
Internally, Meta is already under strain: it has lost key engineers, delayed model launches, and committed 14 billion USD to data sourcing. The study’s findings add pressure as the company prepares to defend its methods in court.