Threat Types
- Jailbreak / Prompt Injection: Detects attempts to override the AI agent’s built-in restrictions through jailbreak or prompt-injection attacks. This applies both to user input and to data retrieved or used by the agent.
- Malicious Code: Identifies harmful or unsafe code, in either user input or the AI-generated response, that could lead to unintended execution or introduce vulnerabilities.
- Harmful Content: Detects hate speech, violent rhetoric, and harmful misinformation in both user input and the AI-generated response.
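To make the categories concrete, here is a minimal sketch of how they might be modeled in client code. All of the names below (`ThreatType`, `Surface`, `Detection`) are hypothetical illustrations, not part of any published API.

```python
from dataclasses import dataclass
from enum import Enum


class ThreatType(Enum):
    # The three threat categories described above (hypothetical values).
    JAILBREAK_PROMPT_INJECTION = "jailbreak_prompt_injection"
    MALICIOUS_CODE = "malicious_code"
    HARMFUL_CONTENT = "harmful_content"


class Surface(Enum):
    # Where the scanned content came from.
    USER_INPUT = "user_input"
    RETRIEVED_DATA = "retrieved_data"
    AGENT_RESPONSE = "agent_response"


@dataclass
class Detection:
    threat_type: ThreatType
    surface: Surface
    snippet: str  # the offending excerpt, kept for later review
```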
Policies and Actions
Every policy specifies its target agents and an enforcement action (a configuration sketch follows this list). A policy might apply to all agents and be set to:
- “Block and fail” if a prompt injection is detected
- “Flag for review” to log the event without failing the run
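A hedged sketch of what such a policy definition could look like, building on the `ThreatType` enum from the previous sketch. The `Policy` and `Action` names and the `"*"` wildcard for all agents are assumptions, not a documented schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    BLOCK_AND_FAIL = "block_and_fail"    # stop the agent run on detection
    FLAG_FOR_REVIEW = "flag_for_review"  # log the event; let the run continue


@dataclass
class Policy:
    name: str
    threat_type: ThreatType  # from the sketch above
    action: Action
    # Agent IDs this policy targets; "*" is an assumed wildcard for all agents.
    target_agents: list[str] = field(default_factory=lambda: ["*"])


# Example: block any agent run in which a prompt injection is detected.
block_injection = Policy(
    name="block-prompt-injection",
    threat_type=ThreatType.JAILBREAK_PROMPT_INJECTION,
    action=Action.BLOCK_AND_FAIL,
)
```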
Findings Dashboard
When a policy triggers, a finding is created in the Governance dashboard; the enforcement flow is sketched after this list.
- For a block policy, the agent run is stopped and a finding is logged
- For a flag policy, the run continues and a finding is logged
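Tying the pieces together, a minimal sketch of this flow under the same assumptions as the sketches above. `record_finding` and `PolicyViolation` are hypothetical stand-ins for whatever the platform actually does when it writes a finding or fails a run.

```python
class PolicyViolation(Exception):
    """Raised to fail the agent run when a block policy triggers."""


def record_finding(policy: Policy, detection: Detection) -> None:
    # Stand-in for writing a finding to the Governance dashboard.
    print(f"finding: policy={policy.name} "
          f"threat={detection.threat_type.value} "
          f"surface={detection.surface.value}")


def enforce(policy: Policy, detection: Detection) -> None:
    record_finding(policy, detection)  # a finding is logged in both cases
    if policy.action is Action.BLOCK_AND_FAIL:
        # Block policy: stop the run by raising.
        raise PolicyViolation(f"{policy.name}: {detection.threat_type.value}")
    # Flag policy: fall through; the run continues.
```

Recording the finding before the action branch mirrors the behavior described above: both block and flag policies produce a finding, and only the block path stops the run.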