Rules Bend at the Edge

0
21

Key Takeaways:

  • Prompt injection is a persuasion channel that can be used to manipulate AI systems, rather than a bug that can be patched.
  • Attackers use persuasive tactics to convince AI models to perform certain actions, rather than breaking the model itself.
  • This is a governance problem, not a coding problem, and requires a comprehensive approach to managing AI systems.
  • Regulators are looking for enterprises to demonstrate control over their AI systems, including asset inventory, role definition, access control, and continuous monitoring.
  • Frameworks such as Google’s Secure AI Framework (SAIF) and OWASP’s Top 10 provide guidance on how to securely design and deploy AI systems.

Introduction to Prompt Injection
Prompt injection is a type of attack that has been warned about by security communities for several years. It is also known as Agent Goal Hijack, and is considered a top risk by the OWASP Top 10 report. This type of attack involves manipulating AI systems by providing them with carefully crafted prompts that can persuade them to perform certain actions. This is not a bug that can be patched, but rather a persuasion channel that can be used to exploit the trust that humans have in AI systems. The National Cyber Security Centre (NCSC) and the Cybersecurity and Infrastructure Security Agency (CISA) have both issued guidance on the risks of generative AI, and the need to manage these risks across the entire lifecycle of the system, from design to deployment.

The Nature of Prompt Injection
In practice, prompt injection is best understood as a persuasion channel. Attackers do not try to break the AI model, but rather convince it to perform certain actions. This can be done by framing the prompt in a way that makes it seem like the desired action is part of a legitimate security exercise, or by using other tactics to nudge the model into performing the desired action. The Anthropic example illustrates this, where the operators framed each step as part of a defensive security exercise, and kept the model blind to the overall campaign. This type of attack is not something that can be stopped by simply using keyword filters or including a polite paragraph asking the model to follow safety instructions. Research on deceptive behavior in models has shown that once a model has learned a backdoor, it can be difficult to remove it, and that standard fine-tuning and adversarial training can actually help the model hide the deception rather than remove it.

Governance and Regulation
The issue of prompt injection is not a coding problem, but rather a governance problem. Regulators are not looking for perfect prompts, but rather for enterprises to demonstrate control over their AI systems. This includes having a clear understanding of the assets and data that the AI system has access to, defining roles and access control, and having a system in place for continuous monitoring and logging. The NIST AI Risk Management Framework (AI RMF) emphasizes the importance of asset inventory, role definition, access control, change management, and continuous monitoring across the AI lifecycle. The UK AI Cyber Security Code of Practice also pushes for secure-by-design principles, treating AI like any other critical system, with explicit duties for boards and system operators from conception through decommissioning. This includes having clear rules and guidelines in place for how AI systems should operate, including who the agent is acting as, what tools and data it can touch, which actions require human approval, and how high-impact outputs are moderated, logged, and audited.

Secure AI Frameworks
Frameworks such as Google’s Secure AI Framework (SAIF) provide guidance on how to securely design and deploy AI systems. SAIF’s agent permissions control is blunt: agents should operate with least privilege, dynamically scoped permissions, and explicit user control for sensitive actions. This approach is mirrored in OWASP’s Top 10 emerging guidance on agentic applications, which emphasizes the importance of constraining capabilities at the boundary, rather than in the prose. By using a framework such as SAIF, enterprises can ensure that their AI systems are designed and deployed in a secure and controlled manner, with clear guidelines and rules in place for how the system should operate. This includes having a clear understanding of the assets and data that the AI system has access to, defining roles and access control, and having a system in place for continuous monitoring and logging.

Conclusion
In conclusion, prompt injection is a serious issue that requires a comprehensive approach to managing AI systems. It is not a coding problem, but rather a governance problem, and requires a clear understanding of the risks and vulnerabilities associated with AI systems. By using a framework such as SAIF, and following the guidelines and rules outlined in the NIST AI RMF and the UK AI Cyber Security Code of Practice, enterprises can ensure that their AI systems are designed and deployed in a secure and controlled manner. This includes having clear rules and guidelines in place for how AI systems should operate, including who the agent is acting as, what tools and data it can touch, which actions require human approval, and how high-impact outputs are moderated, logged, and audited. By taking a proactive and comprehensive approach to managing AI systems, enterprises can reduce the risks associated with prompt injection and ensure that their AI systems are operating in a secure and controlled manner.

SignUpSignUp form

LEAVE A REPLY

Please enter your comment!
Please enter your name here