Data Analysis: Technical Overview

This document provides an overview of the Data Analysis feature’s architecture and request flow. For additional context, please refer to our Data Analysis overview and Security Whitepaper.

Architecture

All data analysis requests are processed as standard /chat requests through the Query Endpoint (QE). The data analysis flow is triggered when a user submits an analytical question about an uploaded or tagged spreadsheet (an .xlsx, .xls, or .csv file). Data analysis is implemented as a Glean Action that iteratively generates and executes Python code to determine the answer. The action uses LLMs to generate the code and a dedicated Python sandbox to execute it.
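The generate-and-execute loop described above can be sketched as follows. This is only an illustration under stated assumptions: `generate_code` and `execute` are hypothetical stand-ins for the LLM call and the sandbox call, `max_steps` is an assumed budget, and stopping at the first error-free run is an assumed stopping rule rather than the documented behavior of the Glean Action.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepResult:
    """Outcome of running one generated snippet (hypothetical shape)."""
    stdout: str
    error: Optional[str] = None


def run_analysis(
    question: str,
    generate_code: Callable[[str, List[StepResult]], str],
    execute: Callable[[str], StepResult],
    max_steps: int = 5,
) -> Optional[str]:
    """Generate and run code until one step succeeds or the budget runs out."""
    history: List[StepResult] = []
    for _ in range(max_steps):
        # The generator sees earlier results, so it can react to errors.
        code = generate_code(question, history)
        result = execute(code)
        history.append(result)
        if result.error is None:
            return result.stdout  # assumed rule: first clean run is the answer
    return None  # no step succeeded within the assumed budget
```

Passing the history back to the generator is what makes the loop iterative: a failed step becomes input for the next attempt instead of ending the request.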

Components

Cloud SQL

Stores uploaded files. For detailed information about file storage, see the File Upload Feature in Assistant - Technical Details document.

Query Endpoint

A Glean Kubernetes service that handles /chat requests. Data analysis is triggered when:

The conversation contains uploaded or tagged spreadsheets
The user query is determined to be an analytical question requiring data analysis
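The two trigger conditions can be expressed as a small predicate. This is a sketch, not the Query Endpoint's actual logic: the function names are hypothetical, and `is_analytical` stands for the outcome of whatever classification step decides that a query is analytical, which this document does not describe.

```python
from pathlib import PurePath
from typing import Iterable

# File types named in the Architecture section.
SPREADSHEET_EXTENSIONS = {".xlsx", ".xls", ".csv"}


def has_spreadsheet(filenames: Iterable[str]) -> bool:
    """True if any uploaded or tagged file has a spreadsheet extension."""
    return any(
        PurePath(name).suffix.lower() in SPREADSHEET_EXTENSIONS
        for name in filenames
    )


def should_run_data_analysis(filenames: Iterable[str], is_analytical: bool) -> bool:
    """Both conditions must hold for the data analysis flow to start."""
    return is_analytical and has_spreadsheet(filenames)
```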

Sandbox

A dedicated environment for executing Python code generated during data analysis. Key characteristics:

Each chat session uses a dedicated sandbox instance
Uploaded files are copied into the sandbox for code execution
Isolated execution environment prevents data leakage between sessions

Sandbox Orchestrator

Manages the provisioning and lifecycle of sandboxes.

Sandboxes for Data Analysis

The sandbox orchestrator and sandboxes are deployed as Kubernetes pods in the Glean cluster with a dedicated namespace. Currently, all pods operate within a single node.

Sandbox Implementation

The sandbox is implemented as a Flask server that provides APIs for:

File uploads
Python code execution
Local filesystem access for code operations

Together, these APIs let the generated code read and process the uploaded files. Each data analysis session uses a dedicated sandbox, so there is no data leakage between sessions.
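Leaving the HTTP layer aside, the core of such a sandbox can be sketched with the standard library alone: files are written into a per-session directory, and code runs in a separate interpreter whose working directory is that directory. The class and method names below are hypothetical. The sketch shows only the file and execution mechanics; on its own it provides none of the isolation described in this document, which comes from the pod-level controls.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


class SandboxWorkspace:
    """Hypothetical per-session workspace: one directory per session."""

    def __init__(self) -> None:
        self._tmp = tempfile.TemporaryDirectory()
        self.root = Path(self._tmp.name)

    def upload(self, name: str, data: bytes) -> Path:
        # Keep only the final path component so the name stays inside root.
        target = self.root / Path(name).name
        target.write_bytes(data)
        return target

    def execute(self, code: str, timeout_s: float = 30.0) -> subprocess.CompletedProcess:
        # Run in a child interpreter with the workspace as working directory,
        # so relative paths in the code resolve to the uploaded files.
        return subprocess.run(
            [sys.executable, "-c", code],
            cwd=self.root,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )

    def close(self) -> None:
        # Deleting the directory discards every file from the session.
        self._tmp.cleanup()
```

A caller would upload a spreadsheet, pass generated code to `execute`, read `stdout` and `stderr` from the result, and call `close` when the session ends.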

Security Measures

The sandbox environment implements several security restrictions:

Resource limits:
- CPU: 500m (0.5 CPU core)
- Memory: 500 MiB

Network restrictions:
- No network egress (no access to the internet or to other Glean services)
- Limited network ingress (only from QE pods)

Security controls:
- Non-root permissions
- gVisor runtime to prevent side-channel attacks, so that one sandbox cannot read data from another
- Isolation between sandboxes
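In Kubernetes terms, the resource limits and security controls above correspond to standard pod spec fields. The fragment below is a sketch built as a Python dictionary, not Glean's actual manifest: the function name, the container name, the image argument, and the RuntimeClass name `gvisor` are assumptions, since the RuntimeClass name depends on how the cluster is configured.

```python
def sandbox_pod_spec(image: str, runtime_class: str = "gvisor") -> dict:
    """Build an illustrative pod spec for one sandbox (hypothetical helper)."""
    return {
        # Selects the container runtime; "gvisor" is an assumed class name.
        "runtimeClassName": runtime_class,
        # Refuse to start containers that would run as root.
        "securityContext": {"runAsNonRoot": True},
        "containers": [
            {
                "name": "sandbox",  # assumed container name
                "image": image,
                # Limits from this document: 500 millicores and 500 MiB.
                "resources": {"limits": {"cpu": "500m", "memory": "500Mi"}},
            }
        ],
    }
```

The network restrictions are not part of the pod spec; in Kubernetes they are normally expressed as a separate NetworkPolicy object.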

Sandbox Orchestrator Functionality

The orchestrator is a Flask server that manages the lifecycle of the sandboxes. It exposes APIs for requesting sandboxes, initializes the pod pool to fit the node, and destroys stale sandboxes. It is responsible for the following operations:

Initialization

Assigns unique sandbox instances per chat session
Enforces a limit of one sandbox per user
Resets and re-provisions the sandbox when a new session starts

Scaling

Handles concurrent file analysis executions
Enforces usage limits:
- Per-user sandbox limits
- Total concurrent sandbox limits
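The assignment and limit rules from the Initialization and Scaling lists can be sketched as an in-memory allocator. All names here are hypothetical, and the real orchestrator manages Kubernetes pods rather than a dictionary; the sketch only shows how one sandbox per user, replacement on a new session, and a total cap can fit together.

```python
import uuid
from typing import Dict, Tuple


class SandboxLimitError(RuntimeError):
    """Raised when the total concurrent sandbox limit is reached."""


class SandboxAllocator:
    """Hypothetical allocator: at most one sandbox per user, capped in total."""

    def __init__(self, max_total: int) -> None:
        self.max_total = max_total
        # user_id -> (session_id, sandbox_id)
        self._by_user: Dict[str, Tuple[str, str]] = {}

    def acquire(self, user_id: str, session_id: str) -> str:
        current = self._by_user.get(user_id)
        if current is not None and current[0] == session_id:
            return current[1]  # the same session keeps its sandbox
        if current is None and len(self._by_user) >= self.max_total:
            raise SandboxLimitError("no sandbox capacity available")
        # A new session replaces any sandbox the user already holds, which
        # keeps the per-user count at one and starts from a clean instance.
        sandbox_id = uuid.uuid4().hex
        self._by_user[user_id] = (session_id, sandbox_id)
        return sandbox_id
```

Because a user's new session reuses that user's slot, a returning user can still start a session when the pool is full, while a user without a sandbox is rejected.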

Cleanup

Performs periodic cleanup of inactive sandboxes
Removes instances after a specified inactivity period (e.g., 10 minutes)
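Periodic cleanup can be sketched as a tracker that records the last use of each sandbox and returns the ones idle for longer than a timeout. The class name and methods are hypothetical; the 600-second default mirrors the 10-minute example above, and the injectable clock is an assumption added so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Dict, List


class IdleReaper:
    """Hypothetical tracker for finding sandboxes that have gone idle."""

    def __init__(
        self,
        idle_timeout_s: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_used: Dict[str, float] = {}

    def touch(self, sandbox_id: str) -> None:
        # Call on every request so that active sandboxes are kept.
        self._last_used[sandbox_id] = self._clock()

    def reap(self) -> List[str]:
        # Return and forget every sandbox idle for at least the timeout.
        now = self._clock()
        stale = [
            sandbox_id
            for sandbox_id, last in self._last_used.items()
            if now - last >= self.idle_timeout_s
        ]
        for sandbox_id in stale:
            del self._last_used[sandbox_id]
        return stale
```

A scheduler would call `reap` at a fixed interval and destroy the pods behind the returned identifiers.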

Resource Management

Enforces fixed memory and CPU limits per sandbox pod
Manages network policies:
- Blocks all network egress
- Allows ingress only from QE pods
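The two network rules map onto a Kubernetes NetworkPolicy. The sketch below builds one as a Python dictionary; the policy name, the function name, and all label arguments are hypothetical, and it assumes that QE pods can be identified by pod labels within a namespace identified by namespace labels. Listing `Egress` under `policyTypes` with no egress rules denies all outbound traffic, including DNS.

```python
def sandbox_network_policy(
    namespace: str,
    sandbox_labels: dict,
    qe_namespace_labels: dict,
    qe_pod_labels: dict,
) -> dict:
    """Build an illustrative NetworkPolicy for sandbox pods (hypothetical)."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "sandbox-isolation", "namespace": namespace},
        "spec": {
            # Applies to the sandbox pods in the sandbox namespace.
            "podSelector": {"matchLabels": sandbox_labels},
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [
                {
                    "from": [
                        {
                            # Both selectors in one entry must match, so
                            # only QE pods in the QE namespace are allowed.
                            "namespaceSelector": {"matchLabels": qe_namespace_labels},
                            "podSelector": {"matchLabels": qe_pod_labels},
                        }
                    ]
                }
            ],
            # No egress rules: all outbound traffic is blocked.
            "egress": [],
        },
    }
```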

Data Analysis Flow
