The Glean S3 connector allows you to index text-based documents from your Amazon S3 buckets into your Glean instance. By integrating S3 with Glean, you enable users to search and discover content stored in S3, including PDFs, Office documents, spreadsheets, presentations, and other text files. The connector surfaces indexed S3 content in Glean Search as either searchable metadata or, optionally, via time-limited pre-signed URLs. The connector can be deployed on both GCP-based and AWS-based Glean instances.

Supported Features and Limitations

The S3 connector ingests and indexes a range of text-based content from S3 buckets into your Glean environment. Both GCP-hosted and AWS-hosted Glean deployments can use this connector; setup differs slightly by platform.

Supported Objects/Entities

You can index and search the following file types stored in S3:
  • PDF documents
  • Document files (e.g., Word)
  • Presentation files (e.g., PowerPoint)
  • Spreadsheet files (e.g., Excel)
  • Text-based files

Authentication Flow

For GCP Deployments:
  1. Glean creates a service account in your deployment that is used to access the buckets.
  2. Create a dedicated IAM role in your AWS account with access to the buckets you want indexed.
    1. Add the GCP service account as a trusted entity on the IAM role so that Glean can assume the role to access the buckets.
For AWS Deployments:
  1. Glean creates an IAM role in your deployment that is used to access the buckets.
  2. Create a dedicated IAM role in your AWS account with access to the buckets you want indexed.
    1. Add Glean’s IAM role as a trusted entity on this IAM role so that Glean can assume it to access the buckets.
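
The two trust relationships described above can be sketched as IAM trust policy documents. This is a minimal illustration of the general shape AWS uses for Google web identity federation and for cross-account AssumeRole, not the exact policy Glean generates; the policy and audience ID shown on the datasource setup page remain authoritative. The function names, the audience value, and the role ARN below are hypothetical placeholders.

```python
import json


def gcp_trust_policy(audience_id: str) -> dict:
    """Trust policy for a GCP-hosted deployment (sketch).

    Lets a Google-issued web identity whose audience matches
    `audience_id` assume the role via web identity federation.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": "accounts.google.com"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {"accounts.google.com:aud": audience_id}
                },
            }
        ],
    }


def aws_trust_policy(glean_role_arn: str) -> dict:
    """Trust policy for an AWS-hosted deployment (sketch).

    Lets the IAM role created by Glean assume this role.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": glean_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(json.dumps(gcp_trust_policy("EXAMPLE_AUDIENCE_ID"), indent=2))
    print(json.dumps(
        aws_trust_policy("arn:aws:iam::111122223333:role/example-glean-role"),
        indent=2,
    ))
```

In both cases the trust policy only controls who may assume the role; what the role can read in S3 is governed separately by the permissions policy attached to it.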

Limitations

  • Encrypted S3 objects (e.g., client-side encryption, PII encryption) are not supported for search indexing; the connector cannot retrieve or index content from buckets containing encrypted data.
  • There is a file size limit of 64 MB per document. Files larger than this are not crawled or indexed.
  • S3 object permissions are NOT enforced in Glean. All indexed S3 content is accessible to all Glean users regardless of original S3 access controls.
  • Only specific file types (as listed above) are indexed; other file types are ignored.
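
To estimate ahead of time which objects in a bucket are likely to be indexed, the size and file-type limits above can be sketched as a simple filter. This is an illustration only and not the connector's actual logic: the extension list is an assumption (the connector's exact set of supported extensions is not given here), and treating 64 MB as 64 × 1024 × 1024 bytes is also an assumption.

```python
from pathlib import PurePosixPath

# Assumption: "64 MB" is interpreted as 64 MiB.
MAX_BYTES = 64 * 1024 * 1024

# Assumption: illustrative extensions for the supported categories
# (PDF, document, presentation, spreadsheet, text).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".txt",
}


def is_indexable(key: str, size_bytes: int) -> bool:
    """Return True if an object key and size pass both limits."""
    extension = PurePosixPath(key).suffix.lower()
    return extension in SUPPORTED_EXTENSIONS and size_bytes <= MAX_BYTES


if __name__ == "__main__":
    # Hypothetical object listing: (key, size in bytes).
    objects = [
        ("reports/q1.PDF", 2_000_000),
        ("media/intro.mp4", 5_000_000),
        ("archive/huge.xlsx", MAX_BYTES + 1),
    ]
    for key, size in objects:
        print(key, is_indexable(key, size))
```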

Requirements

The following sections explain each prerequisite:

Technical Requirements

  • Glean instance deployed on either GCP or AWS.

Permissions & Security

Data and Metadata Ingested:
  • File content (for supported types)
  • File metadata (name, path, last modified, etc.)
Permission Propagation Logic:
  • S3 object permissions and access controls are NOT propagated. All content indexed by the connector is visible to all users in the Glean instance.
Security & Compliance Notes:
  • Authentication is via cloud service account (GCP) or cross-account IAM role (AWS), using AWS Web Identity Federation or AWS AssumeRole, respectively.
  • Required scopes and permissions are limited to read-only access (unless further restricted in your IAM policy).
Known Security Restrictions:
  • Multi-instance S3 connections may require separate IAM role and greenlist configurations.
  • If your S3 data contains sensitive or regulated content, be aware that any user in Glean can access all indexed documents.

Configuration and Setup Instructions

Prerequisites

  • Access to the Glean admin console with appropriate admin permissions.
  • Access to the AWS account with permissions to create IAM roles.
For GCP Deployments:
  • Create a dedicated IAM role in the AWS account that has access to the buckets:
    • In the AWS console, search for IAM.
    • From the sidebar menu, select “Roles”, then select “Create Role”.
    • Choose “Web Identity” as the trusted entity type, select Google as the identity provider, and enter the audience ID (shown on the datasource setup page).
    • Click Next. Search for the AmazonS3ReadOnlyAccess policy, select it, and click Next.
    • Enter a role name and description, then click Create Role.
For AWS Deployments:
  • Create a dedicated IAM role in the AWS account that has access to the buckets:
    • In the AWS console, search for IAM.
    • From the sidebar menu, select “Roles”, then select “Create Role”.
    • Choose “Custom Trust Policy” as the trusted entity type and replace the default policy with the trust policy shown on the datasource setup page.
    • Click Next.
    • Search for the AmazonS3ReadOnlyAccess policy, select it, and click Next.
    • Enter `GleanS3Crawler` as the role name, add a description, then click Create Role.
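
The steps above attach the AWS-managed AmazonS3ReadOnlyAccess policy, which grants read access to every bucket in the account. If you prefer to restrict the role to specific buckets, a narrower permissions policy might look like the sketch below. It reuses the `s3:Get*` and `s3:List*` actions from the managed policy and limits them to named buckets. Whether the connector works with this narrower policy is an assumption; verify it with the connection test in the Glean admin console. The function name and bucket names are hypothetical.

```python
import json
from typing import List


def scoped_read_only_policy(buckets: List[str]) -> dict:
    """Build a read-only S3 policy limited to the given buckets (sketch)."""
    resources = []
    for name in buckets:
        # Bucket ARN covers bucket-level actions such as listing.
        resources.append(f"arn:aws:s3:::{name}")
        # Object ARN pattern covers object-level actions such as reads.
        resources.append(f"arn:aws:s3:::{name}/*")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:Get*", "s3:List*"],
                "Resource": resources,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder bucket names for illustration only.
    print(json.dumps(scoped_read_only_policy(["bucket1", "bucket2"]), indent=2))
```

Keeping the bucket list in this policy aligned with the bucket greenlist in Glean avoids granting read access to buckets that are never crawled.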

Step-by-Step Setup

  1. Create the IAM role by following the steps mentioned in the Prerequisites section.
  2. In the Glean admin console, navigate to Data sources > S3 and click “Add Data Source.”
  3. Enter the required fields:
    • Display Name
    • IAM Role ARN: the ARN of the IAM role created in the Prerequisites section.
    • Bucket greenlist: a comma-separated list of bucket names, e.g., bucket1,bucket2,bucket3
  4. Save and test the connection. If the connection fails, resolve any permission or credential issues.
  5. Finish configuration. The initial full crawl will start according to the connector’s schedule.
Edge Case: To re-index or remove certain buckets, adjust the bucket greenlist accordingly and schedule another crawl.
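
A malformed IAM Role ARN is an easy cause of a failed connection test. The following sketch checks that a value has the standard IAM role ARN shape (`arn:<partition>:iam::<12-digit account ID>:role/<name>`) before you paste it into the setup form. It only checks the format, not whether the role exists or can be assumed. The function name is hypothetical and not part of any Glean tooling.

```python
import re

# Standard IAM role ARN shape: partition, 12-digit account ID,
# then an optional path followed by the role name.
ROLE_ARN_RE = re.compile(
    r"^arn:(aws|aws-cn|aws-us-gov):iam::(\d{12}):role/([\w+=,.@/-]+)$",
    re.ASCII,
)


def parse_role_arn(arn: str) -> dict:
    """Split an IAM role ARN into its parts, or raise ValueError."""
    match = ROLE_ARN_RE.match(arn.strip())
    if match is None:
        raise ValueError(f"not a valid IAM role ARN: {arn!r}")
    partition, account_id, path_and_name = match.groups()
    return {
        "partition": partition,
        "account_id": account_id,
        # The role name is the last segment after any path prefix.
        "role_name": path_and_name.rsplit("/", 1)[-1],
    }


if __name__ == "__main__":
    # Placeholder account ID for illustration only.
    print(parse_role_arn("arn:aws:iam::111122223333:role/GleanS3Crawler"))
```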

Crawl configuration options

  • Greenlist: Specify the S3 buckets to be indexed. Only these buckets will be crawled.
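
Because the greenlist is entered as a single comma-separated string, it can help to normalize and check it before saving. The sketch below splits the string, trims whitespace, drops duplicates, and rejects names that break the basic S3 bucket naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). It does not cover every S3 naming rule, and how the Glean admin console itself handles whitespace or duplicates is not stated here, so treat this as a pre-check only. The function name is hypothetical.

```python
import re
from typing import List

# Basic S3 bucket naming rules only (length and allowed characters).
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def parse_greenlist(raw: str) -> List[str]:
    """Parse a comma-separated bucket greenlist into unique, valid names."""
    names: List[str] = []
    for part in raw.split(","):
        name = part.strip()
        if not name:
            continue  # skip empty entries such as a trailing comma
        if BUCKET_NAME_RE.match(name) is None:
            raise ValueError(f"invalid bucket name: {name!r}")
        if name not in names:
            names.append(name)
    return names


if __name__ == "__main__":
    buckets = parse_greenlist("bucket1, bucket2,bucket3,")
    # Re-join in the compact form used in the setup example.
    print(",".join(buckets))
```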