The group responsible for red teaming more than 100 generative AI products at Microsoft has concluded that the work of building safe and secure AI systems will never be complete.
In a paper published this week, the authors, including Microsoft Azure CTO Mark Russinovich, described some of the team’s work and provided eight recommendations designed to “align red teaming efforts with real world risks.”
Lead author Blake Bullwinkel, a researcher on the AI Red Team at Microsoft, and his 25 co-authors wrote in the paper, “as generative AI (genAI) systems are adopted across an increasing number of domains, AI red teaming has emerged as a central practice for assessing the safety and security of these technologies.”
At its core, they said, “AI red teaming strives to push beyond model-level safety benchmarks by emulating real-world attacks against end-to-end systems. However, there are many open questions about how red teaming operations should be conducted and a healthy dose of skepticism about the efficacy of current AI red teaming efforts.”
The paper noted that, when it was formed in 2018, the Microsoft AI Red Team (AIRT) focused primarily on identifying traditional security vulnerabilities and evasion attacks against classical ML models. “Since then,” it said, “both the scope and scale of AI red teaming at Microsoft have expanded significantly in response to two major trends.”
The first, it said, is that AI has become more sophisticated, and the second is that Microsoft’s recent investments in AI have resulted in the development of many more products that require red teaming. “This increase in volume and the expanded scope of AI red teaming have rendered fully manual testing impractical, forcing us to scale up our operations with the help of automation,” the authors wrote.
“[To achieve] this goal, we developed PyRIT, an open-source Python framework that our operators utilize heavily in red teaming operations. By augmenting human judgement and creativity, PyRIT has enabled AIRT to identify impactful vulnerabilities more quickly and cover more of the risk landscape.”
Based on their experiences, Bullwinkel and the team of authors shared eight lessons they have learned, and elaborated on them in the paper through detailed explanations and case studies. They included:
Understand what the system can do and where it is applied: The first step in an AI red teaming operation is to determine which vulnerabilities to target, they said. They suggested that “starting from potential downstream impacts, rather than attack strategies, makes it more likely that an operation will produce useful findings tied to real world risks. After these impacts have been identified, red teams can work backwards and outline the various paths that an adversary could take to achieve them.”
You don’t have to compute gradients to break an AI system: To prove the point, the paper points to a study on the gap between adversarial ML research and practice. The study found “that although most adversarial ML research is focused on developing and defending against sophisticated attacks, real-world attackers tend to use much simpler techniques to achieve their objectives.” Gradient-based attacks are powerful, the authors said, “but they are often impractical or unnecessary. We recommend prioritizing simple techniques and orchestrating system-level attacks because these are more likely to be attempted by real adversaries.”
AI red teaming is not safety benchmarking: The two, the authors said, are distinct yet “both useful and can even be complementary. In particular, benchmarks make it easy to compare the performance of multiple models on a common dataset. AI red teaming requires much more human effort but can discover novel categories of harm and probe for contextualized risks.” Novel harms resulting from new capabilities in AI systems may not be fully understood, so the team must define them and build tools to measure them.
Automation can help cover more of the risk landscape: According to the authors, the “complexity of the AI risk landscape has led to the development of a variety of tools that can identify vulnerabilities more rapidly, run sophisticated attacks automatically, and perform testing on a much larger scale.” Automation plays a critical role in AI red teaming, which led the team to develop PyRIT, an open-source framework.
The human element of AI red teaming is crucial: Automation may be important, but the authors emphasized that, while “automation like PyRIT can support red teaming operations by generating prompts, orchestrating attacks, and scoring responses,” humans are needed for their cultural and subject matter knowledge, and for their emotional intelligence. They noted, “these tools are useful but should not be used with the intention of taking the human out of the loop.”
Responsible AI (RAI) harms are pervasive but difficult to measure: RAI harms are more ambiguous than security vulnerabilities, which the authors attributed to “fundamental differences between AI systems and traditional software.” Most AI safety research, the authors noted, focuses on adversarial users who deliberately break guardrails, when in truth, they maintained, benign users who accidentally generate harmful content are as important or more so.
LLMs amplify existing security risks and introduce new ones: The integration of generative AI models into a variety of applications has introduced novel attack vectors and shifted the security risk landscape. The authors wrote, “we therefore encourage AI red teams to consider both existing (typically system-level) and novel (typically model-level) risks.”
The work of securing AI systems will never be complete: The idea that it is possible to guarantee or ‘solve’ AI safety through technical advances alone is unrealistic and overlooks the roles that can be played by economics, break-fix cycles, and regulation, they stated. With that in mind, the paper pointed out that “in the absence of safety and security guarantees, we need methods to develop AI systems that are as difficult to break as possible. One way to do this is using break-fix cycles, which perform multiple rounds of red teaming and mitigation until the system is robust to a wide range of attacks.”
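The break-fix cycle described in the final lesson can be illustrated with a minimal Python sketch. It is not taken from the paper or from PyRIT: the names `red_team`, `break_fix_cycle`, `toy_is_vulnerable`, and `toy_mitigate`, the per-round `budget`, and the attack labels are all hypothetical, and the "system" is just a set of blocked attack names.

```python
from typing import Callable, List, Set


def red_team(attacks: List[str],
             is_vulnerable: Callable[[str], bool],
             budget: int) -> List[str]:
    """Hypothetical red teaming round: report at most `budget` working attacks."""
    found: List[str] = []
    for attack in attacks:
        if len(found) == budget:
            break
        if is_vulnerable(attack):
            found.append(attack)
    return found


def break_fix_cycle(attacks: List[str],
                    is_vulnerable: Callable[[str], bool],
                    mitigate: Callable[[str], None],
                    budget: int = 2,
                    max_rounds: int = 5) -> List[List[str]]:
    """Alternate red teaming and mitigation; return the findings of each round."""
    history: List[List[str]] = []
    for _ in range(max_rounds):
        found = red_team(attacks, is_vulnerable, budget)
        history.append(found)
        if not found:
            # No known attack succeeded this round. This only covers the
            # attacks that were tried; it is not a safety guarantee.
            break
        for attack in found:
            mitigate(attack)
    return history


# Toy system (an assumption for illustration): an attack works until blocked.
blocked: Set[str] = set()


def toy_is_vulnerable(attack: str) -> bool:
    return attack not in blocked


def toy_mitigate(attack: str) -> None:
    blocked.add(attack)


attacks = ["jailbreak", "prompt injection", "data exfiltration"]
history = break_fix_cycle(attacks, toy_is_vulnerable, toy_mitigate)
for round_number, found in enumerate(history, start=1):
    print(f"round {round_number}: {found}")
```

In this toy run the loop stops once a round finds nothing, but that only shows robustness against the three listed attacks. A new attack added to the list would restart the cycle, which is the sense in which the authors say the work is never complete.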
The authors of the paper concluded that AI red teaming is a nascent and rapidly evolving practice for identifying safety and security risks posed by AI systems. But they also raised a number of questions.
“How should we probe for dangerous capabilities in LLMs such as persuasion, deception, and replication?” they asked. “Further, what novel risks should we probe for in video generation models and what capabilities may emerge in models more advanced than the current state-of-the-art?”
Secondly, they asked how red teams can adjust their practices to accommodate different linguistic and cultural contexts. Thirdly, they asked in what ways red teaming practices should be standardized to make it easier for teams to communicate their findings.
They also stated, “as companies, research institutions, and governments around the world grapple with the question of how to conduct AI risk assessments, we provide practical recommendations based on our experience red teaming over 100 genAI products at Microsoft. … We encourage others to build upon these lessons and to address the open questions we have highlighted.”