AI Security protects AI agents from malicious prompts, jailbreaks, and hidden commands. It monitors both user prompts and data retrieved by agents to detect and mitigate these threats. Administrators can create and configure policies that determine which agents are monitored and what action to take when a rule is triggered.

Threat Types

  • Jailbreak / Prompt Injection: Detects attempts to override the AI agent’s built-in restrictions through prompt injection or jailbreak attacks. This applies to both user input and data retrieved or used by the agent.
  • Malicious Code: Identifies harmful or unsafe code in user input and the AI-generated response that could lead to unintended execution or vulnerabilities.
  • Harmful Content: Detects hate speech, violent rhetoric, and harmful misinformation in both user input and the AI-generated response.

Policies and Actions

Every policy specifies its target agents and an enforcement action. A policy might apply to all agents and be set to:
  • “Block and fail” if a prompt-injection is detected
  • “Flag for review” to log the event without failing

Findings dashboard

When a policy triggers, a finding is created in the Governance dashboard.
  • For a block policy, the agent run is stopped and a finding is logged
  • For a flag policy, the run continues and a finding is logged
Admins can review these findings in the UI dashboard or via BigQuery.