AI Security Overview

AI Security protects AI agents from malicious prompts, jailbreaks, and hidden commands. It monitors both user prompts and data retrieved by agents to detect and mitigate these threats.

Administrators can create and configure policies that determine which agents are monitored and what action to take when a rule is triggered.

Threat Types

  • Jailbreak / Prompt Injection: Detects attempts to override the AI agent's built-in restrictions through prompt injection or jailbreak attacks. This applies to both user input and data retrieved or used by the agent.
  • Malicious Code: Identifies harmful or unsafe code in user input and the AI-generated response that could lead to unintended execution or vulnerabilities.
  • Harmful Content: Detects hate speech, violent rhetoric, and harmful misinformation in both user input and the AI-generated response.

Policies and Actions

Every policy specifies its target agents and an enforcement action. For example, a policy might apply to all agents and be set to:

  • “Block and fail” if a prompt injection is detected
  • “Flag for review” to log the event without failing

Findings Dashboard

When a policy triggers, an issue is created in the Findings dashboard.

  • For a block policy, the agent run is stopped and an issue is logged
  • For a flag policy, the run continues and an issue is logged
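The two outcomes above share one behavior (an issue is logged) and differ in one (whether the run continues). A minimal sketch, using a hypothetical `enforce` helper rather than any documented API:

```python
# Hypothetical enforcement semantics: both actions log an issue to the
# findings list; only a block policy stops the agent run.
def enforce(action: str, findings: list[dict], threat: str) -> bool:
    """Logs an issue and returns True if the agent run may continue."""
    findings.append({"threat": threat, "action": action, "status": "Open"})
    return action != "block_and_fail"
```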

Administrators can triage these issues in the dashboard through a complete workflow that includes status tracking (Open, In Progress, Rejected, Allowed) and assignment management. Issues can also be investigated in BigQuery for deeper analysis.