Data Flow

The Glean platform architecture consists of three primary components that work together to provide secure and effective enterprise data access:

Query Path

Handles user search requests and authentication

Data Ingestion Path

Manages data collection from enterprise sources

Data Processing Pipeline

Processes and indexes collected data

Query Path

Web Application Overview

Initial Access

Users access Glean through the web application at https://app.glean.com, hosted within Glean’s central cloud infrastructure. The application serves static assets including images, CSS, and JavaScript.

Session Check

The web client checks the user’s local storage for existing session state. If none exists, the user must authenticate, because anonymous search is not supported.

Authentication Process

Users begin by entering their email address (e.g., user@company.com).

Tenant Resolution

Each customer tenant requires a list of company domain names for authentication. These domains are mapped to a tenant-specific Query Endpoint (QE) of the form <tenant_id>-be.glean.com.

The authentication process follows these steps:

Domain Lookup

When a user submits their email, the web app performs a domain lookup to determine the appropriate QE domain.

QE Assignment

The QE domain resolves to a static IP uniquely assigned to your company’s Glean tenant, whether deployed in Glean SaaS or your own cloud environment.

SSO Integration

Unauthenticated users are redirected to your configured SSO provider for authentication.
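The domain lookup described above can be sketched as a small function. This is only an illustration under stated assumptions: the DOMAIN_TO_TENANT table, the tenant IDs in it, and the function name resolve_query_endpoint are hypothetical. Only the <tenant_id>-be.glean.com host pattern comes from the text.

```python
# Hypothetical domain-to-tenant table. The tenant IDs are invented for
# illustration; the real mapping is maintained by the platform.
DOMAIN_TO_TENANT = {
    "company.com": "acme",
    "company.co.uk": "acme",
    "example.org": "exampleorg",
}


def resolve_query_endpoint(email: str) -> str:
    """Return the QE host for an email address, or raise if the domain is unknown."""
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {email!r}")
    tenant_id = DOMAIN_TO_TENANT.get(domain.lower())
    if tenant_id is None:
        raise LookupError(f"no tenant registered for domain {domain!r}")
    # Host pattern taken from the text: <tenant_id>-be.glean.com
    return f"{tenant_id}-be.glean.com"


print(resolve_query_endpoint("user@company.com"))  # acme-be.glean.com
```

Several company domains can map to the same tenant, which is why the table is keyed by domain rather than by tenant.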

Authentication Flow Diagram

[Diagram: the complete process from initial access to query execution.]

Query Endpoint Communication

When users perform searches, requests are sent to:

https://<tenant_id>-be.glean.com/api/v1/search

Example Request Body

{
    "cursor": "[...snip...]",
    "maxSnippetSize": 324,
    "pageSize": 10,
    "people": [],
    "query": "expense policy",
    "requestOptions": {
        "debugOptions": {},
        "disableQueryAutocorrect": false,
        "facetBucketSize": 0,
        "facetFilters": [],
        "timezoneOffset": -660
    },
    "sc": "",
    "sessionInfo": {
        "lastSeen": "2023-12-13T05:03:49.808Z",
        "sessionTrackingToken": "[...snip...]",
        "lastQuery": "expense policy"
    },
    "sourceInfo": {
        "clientVersion": "fe-release-2023-12-05-86ae10d",
        "initiator": "MORE",
        "modality": "FULLPAGE"
    },
    "timeoutMillis": 10000,
    "timestamp": "2023-12-13T05:04:14.093Z",
    "trackingToken": "[...snip...]"
}
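As a rough sketch, a client could assemble a request like the one above with the Python standard library. The function name build_search_request and the choice of which fields to include are assumptions; the source does not say which fields are required. How the authenticated session is attached to the request is not described here, so it is left out. The request is built but not sent.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request


def build_search_request(tenant_id: str, query: str, page_size: int = 10) -> Request:
    """Build (but do not send) a POST request to the tenant's search endpoint."""
    # URL pattern taken from the text above.
    url = f"https://{tenant_id}-be.glean.com/api/v1/search"
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    body = {
        # Field names copied from the example request; the subset is an assumption.
        "query": query,
        "pageSize": page_size,
        "maxSnippetSize": 324,
        "timeoutMillis": 10000,
        "timestamp": now.replace("+00:00", "Z"),
    }
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "acme" is a hypothetical tenant ID.
request = build_search_request("acme", "expense policy")
print(request.get_method(), request.full_url)
```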

Example Response

{
    "trackingToken": "[...snip...]",
    "sessionInfo": {
        "sessionTrackingToken": "[...snip...]",
        "lastSeen": "2023-12-13T05:04:14.385838873Z",
        "lastQuery": "expense policy"
    },
    "results": [
        {
            "trackingToken": "[...snip...]",
            "document": {
                "id": "GDRIVE_11[...snip...]Kp-P",
                "datasource": "gdrive",
                "docType": "pdf",
                "parentDocument": {
                    "id": "GDRIVE_1t[...snip...]qqsy",
                    "datasource": "gdrive",
                    "docType": "Folder",
                    "title": "Company Policies",
                    "url": "https://drive.google.com/drive/folders/1t[...snip...]qqsy"
                },
                "title": "CompanyExpensePolicy-sept2023.pdf",
                "url": "https://drive.google.com/file/d/11[...snip...]Kp-P",
                "metadata": {
                    "datasource": "gdrive",
                    "datasourceInstance": "gdrive",
                    "objectType": "pdf",
                    "container": "Insurance Policies",
                    "containerId": "GDRIVE_1t[...snip...]qqsy",
                    "mimeType": "application/pdf",
                    "documentId": "GDRIVE_11[...snip...]Kp-P",
                    "createTime": "2023-06-05T20:00:25Z",
                    "updateTime": "2023-06-16T11:59:42Z",
                    "author": {
                        "name": "Sam Sample",
                        "obfuscatedId": "B79[...snip...]3D8"
                    },
                    "owner": {
                        "name": "Sam Sample",
                        "obfuscatedId": "B79[...snip...]3D8"
                    },
                    "visibility": "SPECIFIC_PEOPLE_AND_GROUPS",
                    "assignedTo": {
                        "name": "Sam Sample",
                        "obfuscatedId": "B79[...snip...]3D8"
                    },
                    "updatedBy": {
                        "name": "Sam Sample",
                        "obfuscatedId": "B79[...snip...]3D8"
                    },
                    "datasourceId": "11[...snip...]Kp-P",
                    "interactions": {},
                    "documentCategory": "COLLABORATIVE_CONTENT"
                }
            },
            "snippets": [
                {
                    "snippet": "",
                    "mimeType": "text/plain",
                    "text": "You can submit them to your manager using the current expense reporting method (current method here) within three months after the date of each expense. If your manager approves your expenses, you will receive your reimbursement within two pay periods on your regular paycheck."
                }
            ]
        }
    ]
}
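A client typically needs only a few fields from each result. The sketch below pulls the title, URL, datasource, and first snippet text out of a response shaped like the example above. The helper name summarize_results is hypothetical, and the sample dictionary is a trimmed copy of the example response rather than real output.

```python
def summarize_results(response: dict) -> list:
    """Extract title, URL, datasource, and first snippet text from each result."""
    summaries = []
    for result in response.get("results", []):
        document = result.get("document", {})
        snippets = result.get("snippets", [])
        summaries.append({
            "title": document.get("title"),
            "url": document.get("url"),
            "datasource": document.get("datasource"),
            # Results without snippets get an empty string.
            "snippet": snippets[0].get("text", "") if snippets else "",
        })
    return summaries


# Trimmed copy of the example response above.
sample_response = {
    "results": [
        {
            "document": {
                "datasource": "gdrive",
                "docType": "pdf",
                "title": "CompanyExpensePolicy-sept2023.pdf",
                "url": "https://drive.google.com/file/d/11[...snip...]Kp-P",
            },
            "snippets": [
                {"mimeType": "text/plain", "text": "You can submit them to your manager"}
            ],
        }
    ]
}

for summary in summarize_results(sample_response):
    print(summary["title"], summary["datasource"])
```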

API Documentation

Detailed field descriptions are available in the Developer Documentation.

Data Ingestion Flow

Glean’s data ingestion process is built around specialized connectors deployed within your tenant’s dedicated cloud project. These connectors serve multiple purposes:

Content Retrieval

Fetches content from connected enterprise sources

Activity Tracking

Monitors user interaction data

Permission Mapping

Maps and maintains access controls

Connection Methods

Data retrieval occurs via HTTPS, with two primary connection patterns depending on the data source location.

SaaS Applications

For services like Google Drive, connections occur over the public internet using HTTPS.

On-Premises Systems

For internal systems like on-prem Jira, secure private connections are established via VPN or Shared VPC.
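The two connection patterns can be modeled as a simple decision on where the data source lives. The types and names below (Location, DataSource, connection_pattern) are hypothetical and exist only to make the rule concrete; they are not part of any Glean configuration format.

```python
from dataclasses import dataclass
from enum import Enum


class Location(Enum):
    SAAS = "saas"
    ON_PREM = "on_prem"


@dataclass(frozen=True)
class DataSource:
    name: str
    location: Location


def connection_pattern(source: DataSource) -> str:
    """Pick the connection pattern described above for a data source."""
    if source.location is Location.SAAS:
        return "HTTPS over the public internet"
    # On-premises systems are reached over a private link.
    return "HTTPS over a private connection (VPN or Shared VPC)"


for source in (DataSource("Google Drive", Location.SAAS),
               DataSource("Jira (on-prem)", Location.ON_PREM)):
    print(f"{source.name}: {connection_pattern(source)}")
```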

Data Processing Pipelines

All data processing occurs within your tenant’s project using Google Dataflow pipelines. Your data never leaves your tenant’s environment.

The processing pipeline combines:

Content from connected sources
Permission mappings
User data
Activity metrics (creation, edits, views)

This combined data is then indexed to create a secure, searchable knowledge base within your tenant.
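To make the idea of a permission-aware index concrete, here is a toy in-memory sketch. It is not the Dataflow pipeline itself: the function names, the dictionary-based inputs, and the ranking by view count are all assumptions chosen for illustration, and user data is reduced to a plain user name.

```python
def build_index(documents: dict, permissions: dict, activity: dict) -> dict:
    """Combine content, permission mappings, and activity counts per document."""
    index = {}
    for doc_id, text in documents.items():
        index[doc_id] = {
            "tokens": set(text.lower().split()),
            # Documents with no permission entry are visible to nobody.
            "allowed": set(permissions.get(doc_id, ())),
            "views": activity.get(doc_id, 0),
        }
    return index


def search(index: dict, user: str, query: str) -> list:
    """Return IDs of documents the user may see that contain every query term."""
    terms = set(query.lower().split())
    hits = [
        (entry["views"], doc_id)
        for doc_id, entry in index.items()
        if user in entry["allowed"] and terms <= entry["tokens"]
    ]
    # Most-viewed documents first (an assumed ranking signal).
    return [doc_id for _, doc_id in sorted(hits, reverse=True)]


documents = {
    "d1": "Company expense policy",
    "d2": "Expense policy draft",
    "d3": "Holiday calendar",
}
permissions = {"d1": ["sam", "alex"], "d2": ["sam"], "d3": ["alex"]}
activity = {"d1": 12, "d2": 3}

index = build_index(documents, permissions, activity)
print(search(index, "sam", "expense policy"))   # ['d1', 'd2']
print(search(index, "alex", "expense policy"))  # ['d1']
```

Filtering on permissions at query time, as done here, means a document never appears in results for a user who is not in its allowed set, regardless of how well it matches the query.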
