OpenAI and Anthropic have announced their collaboration with the governments of the United States and the United Kingdom with the aim of strengthening the security of their language models. Through a series of initiatives, the two companies are allowing government researchers to assess the vulnerability of their systems to potential cyber attacks.
A noble end that has more behind it than it seems
In recent posts on their blogs, OpenAI and Anthropic revealed that they have been working with the National Institute of Standards and Technology (NIST) and the UK AI Safety Institute. This cooperation includes access to models, classifiers, and training data, allowing independent experts to examine the resilience of these models against external attacks and their effectiveness in preventing ethically questionable uses.
OpenAI identified critical vulnerabilities that could allow sophisticated attackers to take control of computer systems and impersonate users, with a success rate of 50% in an AI hijacking method. Although engineers initially believed these vulnerabilities were irrelevant, the research showed that their combination with hijacking techniques could be effective.
Both OpenAI and Anthropic are implementing “red-teaming” processes to quickly identify and fix these vulnerabilities, aiming to prevent the misuse of their technology. However, some security experts have expressed concern that this collaboration could lead to a decrease in attention to technical security, due to increased competitiveness in the global market.
However, researchers like Md Raz, a PhD student at New York University, argue that the models are becoming more resilient and harder to breach with each new version, suggesting a more rigorous approach to security in the latest developments like GPT-5.