We Need to Integrate and Unify for AI Security

Posted by Sven Cattell 2025-02-12

This is the first part of a series. Later installments will include more details and references, especially with findings.

Introduction

An LLM like DeepSeek is a good demonstration of technical talent, but it’s unusable for most commercial applications. Model reliability is needed for LLMs to become commercially viable. If we want agents to help manage our calendars or write code we need them to be secure and reliable. If we want customer service chatbots we need to know they’re not going to expose deployers to liability by insulting their customer or offering to sell them a truck for $1. Managing this risk is different to traditional security as the attack surface is nearly infinite. Preventing a black box no one really understands from misbehaving when adversaries are controlling the inputs is impossible. However, the ML security community has over 20 years of experience with AI risk management and has a track record of securing AI models against persistent adversaries. Mature teams focus on discovering and minimizing the impact of attacks once they’ve reached a suitable level of robustness. And it works. Established AI models are far more reliable, just look at the hallucination rate of the latest model release from Google. However, there’s public mistrust in AI, and as we deploy these systems we will find more flaws that need to be addressed. The challenge we face with LLMs is proving to customers and the public that these models are ready to use in their applications.

The solution people turned to was AI Red teaming. This basically meant that the risk assessment of the model systems would be done by a third party. After running the first two Generative Red Teams at DEF CON 31 and 32, I believe that the focus on AI red teaming is missing the forest for the trees. A company’s traditional software reliability is proven by the Coordinated Vulnerability Enumeration (CVE) and other Vulnerability Disclosure Programs (VDP). A penetration test report from a consulting firm that red teamed some software is an indicator they’ve done the work to make their software secure, but it’s all for naught if a major vulnerability is discovered after release. Requiring submission to a single gate keeper that blocks the release until they’ve done an assessment is just red tape if their report isn’t comprehensive. While there will always be an edge case that the assessor missed, the effective way to prove security is to do your best before release and then effectively respond to your mistakes. The ecosystem and practices of disclosure is how companies prove that they build secure systems.

This is effective because it doesn’t stop innovation. The prospect of public documentation incentivizes an investment in security where appropriate. Folks who want to move fast and break things can. Institutions with a reputation to uphold will invest in security. No one tells you what to do, they just record when you screw up. This notification system, and the culture that surrounds it, is extremely efficient at securing our systems over the long term. It helps downstream developers and consumers who need to know about vulnerabilities to mitigate their effect. It lowers the costs for vendors because best efforts are all that’s needed. And best of all, it doesn’t impact innovation.

We need to bring this to AI.

We tested AI vulnerability reporting at DEF CON 32’s second GRT. Before the event we identified two major problems which we addressed. First, identifying reportable AI system issues is difficult, which we addressed by defining model intent in the model card. Second, these are statistical beasts which makes proving and documenting AI errors difficult. The solution involved single-topic reports using the UK AISI Inspect framework. We paid bounties for good reports of violations of model card statements that were well supported by an Inspect dataset. There was a language barrier between the data scientists reviewing submissions and DEF CON attendees, but once we overcame this, the feedback we received was overwhelmingly positive. Hackers enjoy discovering idiosyncrasies in the model’s behavior and building arguments as much as they enjoy other puzzles.

This approach revealed its own problems. Creating the model card was challenging due to a non-existent standard, and the evaluations that were used to support the intent statements didn’t always align well with researchers’ goals. Even when the evaluations did align they were incomplete. In particular, Harmbench had gaps in areas like malicious code tasks and copyright violations. Going forward we need more test coverage with smaller focused evaluations that can be combined.

However the most important finding is: the idea of taking reports of flaws against a model using a documented “contract” supported with evaluations doesn’t scale. Reports can impact several different aspects of this proposed process and individual companies handling them manually would be onerous. A report could indicate that a broad category of evaluations missed a vital subcategory. This would impact the model card and evaluation system, but not the model. For example, a new class of vulnerabilities is discovered and the old ‘malicious code tasks’ evaluation category does not include them. Another report could indicate that an evaluation made a mistake and needs some additional samples to appropriately test the models. This is most likely discovered by a flaw in a model and would impact the evaluation and model but not the model card. Appropriately directing the creation of new evaluations and updating the model card standard needs to be handled at a higher level.

Fortunately, security has already solved some of these problems. The scope of “vulnerability” is very broad and we deal with this through the Common Weakness Enumeration (CWE). It is a taxonomy of all known weaknesses in software, and is essential to handling CVEs. It is updated through a transparent process managed by the CWE committee regularly. For AI models we don’t need to document weaknesses, but uses and restrictions. We already create evaluations with uses and restrictions in mind, so a taxonomy of use is a natural place to start. These need to be tied to evaluations that are in standard, yet flexible, formats. Those evaluations need to be red teamed, and bounties need to be awarded against them. Model vendors then can choose what uses and restrictions they want their model to support, and ignore the rest. This could be made as simple as a menu of checkboxes that automatically creates a model card that the public can use. This is boiling the ocean, but there are great potential benefits for having a unified reporting ecosystem. Having this ecosystem be robust and coordinated means releases like DeepSeek would be immediately evaluated for trust and security and found to be lacking.

We need to iterate on the GRT at a small scale one more time. A collaborative live bug bash at DEF CON or NeuRIPS that tests the next version of these ideas is the best crucible to refine these processes to the point where we can set up an AI disclosure ecosystem.