OpenAI just rolled out something called Deployment Simulation ahead of the GPT-5 family launch. It’s a pre-release testing method meant to answer a pretty practical question: once a model is actually out in the world, how often is it likely to mess up?
The setup is different from the usual synthetic-prompt testing. OpenAI takes real, anonymized chats from an older model, replays them, and has the unreleased model write the next reply. According to the company, that approach got the direction of error trends right 92% of the time.
The idea is also supposed to cut down on the distortion you get when a model can, in effect, tell it’s being evaluated. Using de-identified conversation logs means the test is built around the kind of messy, unpredictable prompts people really type. OpenAI says that helped surface issues it hadn’t seen before, including something it calls “calculator hacking,” and shifted the emphasis toward how often failures are likely to happen in real use, not just whether a failure can happen in theory.
If you pay attention to AI safety, this one’s worth watching.
OpenAI is clear about the limits, too. Deployment Simulation isn’t meant to replace red-teaming or targeted evaluations, and the company says it should sit alongside both. Rare failures can still slip through. So can new attack techniques and odd behavior that doesn’t show up often enough to get caught this way. OpenAI researchers also ran the method on the public WildChat dataset so outside auditors would have a version they could use without private logs, though OpenAI says it’s probably less accurate than the internal-data version, especially as pressure keeps building from the US National Institute of Standards and Technology (NIST), new European Union rules, and safety institutes in other countries.
The full research paper is up on OpenAI’s website.