The Glean S3 connector enables you to ingest and index text-based documents from your Amazon S3 buckets into your enterprise search environment. By integrating Amazon S3 with Glean, you make content such as PDFs, Office documents, spreadsheets, presentations, and other text files searchable and discoverable within your organization. The connector supports deployment on both GCP-based and AWS-based Glean instances and surfaces indexed S3 content in Glean Search either as searchable metadata or, optionally, through time-limited pre-signed URLs.

Supported Features and Limitations

The S3 Connector indexes a range of text-based documents from specified S3 buckets, providing broad search capabilities within Glean. While the connector works across both GCP and AWS platforms, the technical setup varies slightly for each environment.

Supported Objects/Entities

You can index and search the following file types from S3:
  • PDF documents
  • Office documents (Word)
  • Presentation files (PowerPoint)
  • Spreadsheet files (Excel)
  • Text-based files (TXT and similar)

Supported API Endpoints/Features

  • Full corpus indexing for supported document types.
  • Search access to indexed S3 documents using Glean’s platform.
  • Optional surfacing of documents as time-limited pre-signed URLs.

Limitations

  • Encrypted objects (client-side encryption, PII encryption) are not supported for search indexing.
  • Files larger than 64MB are not crawled or indexed.
  • Permissions native to S3 are not enforced within Glean. All indexed documents are available to all Glean users irrespective of their original S3 restrictions.
  • Only listed file types are indexed; unsupported types are ignored.

Crawling Strategy

Crawl Type: S3 Connector
  • Full Crawl: Scheduled, entire bucket(s) scan
  • Incremental Crawl: Modified/new documents since last crawl
  • People Data: N/A
  • Activity: Tracks additions/updates/deletes via full crawls
  • Update Rate: Default scheduling can be tuned
  • Webhook: N/A
  • Notes: Deletion reflected at next full crawl; no webhook support.
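The difference between the two crawl types can be illustrated with a small sketch. It is not the connector's implementation: `S3Object`, `incremental_candidates`, and `deleted_keys` are hypothetical names, and the comparison on a last-modified timestamp is an assumption about how "modified/new since last crawl" might be decided.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class S3Object:
    key: str
    last_modified: datetime


def incremental_candidates(listing, last_crawl):
    """Objects that are new or modified since the previous crawl."""
    return [obj for obj in listing if obj.last_modified > last_crawl]


def deleted_keys(indexed_keys, full_listing):
    """Keys in the index that no longer appear in a full bucket listing."""
    present = {obj.key for obj in full_listing}
    return sorted(set(indexed_keys) - present)


if __name__ == "__main__":
    last_crawl = datetime(2024, 1, 1, tzinfo=timezone.utc)
    listing = [
        S3Object("a.pdf", datetime(2023, 12, 31, tzinfo=timezone.utc)),
        S3Object("b.docx", datetime(2024, 1, 2, tzinfo=timezone.utc)),
    ]
    print(incremental_candidates(listing, last_crawl))
    print(deleted_keys({"a.pdf", "b.docx", "old.txt"}, listing))
```

The sketch shows why deletions only surface at the next full crawl: an incremental pass looks at what changed, while a removed object can only be noticed by comparing the index against a complete listing.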

Requirements

You must meet these prerequisites to successfully deploy and utilize the S3 Connector.

Technical Requirements

  • Active Glean instance deployed on either GCP or AWS.
  • Ability to access your organization’s S3 buckets on AWS.
  • For GCP: Creation of a service account that can be federated as a trusted entity in AWS IAM roles.
  • For AWS: Usage of a cross-account trust relationship between IAM roles.

Credential Requirements

  • AWS IAM setup:
    • For GCP deployments: Service account credentials from Glean, trusted in AWS.
    • For AWS deployments: IAM role in your AWS account, with Glean’s role as a trusted entity.
    • Access keys, tokens, and service account credentials must be managed securely and should grant read-only access at most, or narrower access where your policy requires it.

Permission Requirements

  • The IAM role used for crawling must have AmazonS3ReadOnlyAccess (or equivalent) permissions.
  • Role trust policy must allow Glean’s role or service account to assume it.
  • Multi-instance S3 configurations may require separate IAM roles per instance and greenlist configuration.
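As one possible reading of "or equivalent", the sketch below builds a read-only policy document scoped to the greenlisted buckets instead of attaching the managed AmazonS3ReadOnlyAccess policy. The function name `read_only_policy` and the choice of `s3:ListBucket` and `s3:GetObject` as the minimum actions are assumptions; this document does not state which actions the crawler needs, so the managed policy remains the documented option.

```python
import json


def read_only_policy(buckets):
    """Build a read-only IAM policy document limited to the given buckets."""
    bucket_arns = [f"arn:aws:s3:::{name}" for name in buckets]
    object_arns = [f"arn:aws:s3:::{name}/*" for name in buckets]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket resource itself.
                "Sid": "ListGreenlistedBuckets",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arns,
            },
            {
                # Reading applies to the objects inside each bucket.
                "Sid": "ReadGreenlistedObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": object_arns,
            },
        ],
    }


if __name__ == "__main__":
    # Bucket names are placeholders.
    print(json.dumps(read_only_policy(["bucket1", "bucket2"]), indent=2))
```

Scoping the policy to specific buckets keeps the role's access aligned with the bucket greenlist, which matters because S3 permissions are not enforced inside Glean.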

Preliminary Source/System Setup

  • Access to your AWS account with permissions to create IAM roles and policies.
  • For GCP:
    • Create an AWS IAM role with access to desired buckets.
    • Designate the Glean GCP service account as a trusted entity through web identity federation.
    • Input the correct audience ID as per the Glean data source setup page.
  • For AWS:
    • Create a dedicated IAM role for S3 bucket access.
    • Replace trust policy with the template provided on the Glean data source setup page.
    • Assign AmazonS3ReadOnlyAccess permissions.
  • Record all necessary role ARNs and bucket lists for later configuration steps.

Configuration and Setup Instructions

Follow these steps to configure and deploy the S3 connector for either GCP or AWS platforms, primarily through the Glean admin console.

Prerequisites

  • Glean admin console access (admin privileges)
  • AWS account access with permission to create IAM roles

Authentication and Credentials

  • In the Glean admin console, locate the S3 data source setup.
  • Enter:
    • Display Name for the data source.
    • IAM Role ARN
    • Comma-separated list of bucket names to index (bucket greenlist).

Step-by-Step Setup

Creating IAM role for GCP deployment

  1. Navigate to the AWS home page and search for IAM.
  2. From the sidebar menu, select Roles -> Create Role.
  3. Choose Web Identity as the trusted entity type.
  4. Designate Google as the identity provider, and input the audience ID. The audience ID is available on the data source setup page.
  5. Click Next.
  6. Search for AmazonS3ReadOnlyAccess policy and click Next.
  7. Enter the role name and description.
  8. Click Create Role.
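For reference, the trust policy that results from choosing Web Identity with Google as the provider typically has the shape sketched below. The function name `web_identity_trust_policy` and the placeholder audience value are hypothetical; use the audience ID shown on the Glean data source setup page, and treat the policy the AWS console generates as authoritative.

```python
import json


def web_identity_trust_policy(audience_id: str) -> dict:
    """Trust policy letting a Google-issued identity assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": "accounts.google.com"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                # Only tokens issued for this audience may assume the role.
                "Condition": {
                    "StringEquals": {"accounts.google.com:aud": audience_id}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder value; the real audience ID comes from the setup page.
    print(json.dumps(web_identity_trust_policy("EXAMPLE_AUDIENCE_ID"), indent=2))
```

Comparing the role's Trust relationships tab against this shape is a quick way to confirm that the audience ID was entered correctly.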

Creating IAM role for AWS deployment

Follow these steps to create the AWS IAM Role and obtain the Role ARN:
  1. Navigate to the AWS home page and search for IAM.
  2. From the sidebar menu, select Roles -> Create Role.
  3. Choose Custom trust policy as the trusted entity type.
  4. Replace the default trust policy with the template provided on the Glean data source setup page, which names Glean’s role as the trusted entity.
  5. Click Next.
  6. Search for AmazonS3ReadOnlyAccess policy and click Next.
  7. Enter the role name and description.
  8. Click Create Role.
  9. Navigate to the IAM Roles page and search for the role you created.
  10. Click on the role and copy the Role ARN.
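The Requirements section describes the AWS deployment as a cross-account trust relationship with Glean’s role as the trusted entity. A trust policy of that kind generally looks like the sketch below. The function name `cross_account_trust_policy` and the example ARN are hypothetical, and the template on the Glean data source setup page is authoritative; it may include conditions that this sketch omits.

```python
import json


def cross_account_trust_policy(trusted_role_arn: str) -> dict:
    """Trust policy letting a role in another AWS account assume this role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The principal is the external role allowed to assume this one.
                "Principal": {"AWS": trusted_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARN for illustration only.
    example_arn = "arn:aws:iam::111122223333:role/example-glean-role"
    print(json.dumps(cross_account_trust_policy(example_arn), indent=2))
```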

Connecting Glean to S3

After you have the Role ARN, follow these steps to connect Glean to S3:
  1. In the Glean admin console, navigate to Data Sources > S3 and click Add Data Source.
  2. Fill in:
    • Display Name
    • IAM Role ARN
    • Bucket Greenlist (for example bucket1,bucket2,bucket3)
  3. Save and test the connection. If the connection fails, resolve any permission or credential issues and test again.
  4. Complete the configuration. The initial full crawl runs according to the connector schedule.
  5. To re-index or update, edit the bucket greenlist and schedule another crawl.
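Before entering these values, it can help to check them locally. The sketch below parses a comma-separated bucket greenlist and checks that a string looks like an IAM role ARN. The helper names and regular expressions are assumptions: the patterns approximate the general S3 bucket naming rules and the standard `aws` partition only, and the Glean console may validate input differently.

```python
import re

# Assumption: standard "aws" partition and a 12-digit account ID.
ROLE_ARN_RE = re.compile(r"^arn:aws:iam::\d{12}:role/[A-Za-z0-9+=,.@_/-]+$")

# Approximation of S3 bucket naming rules: 3-63 characters, lowercase
# letters, digits, dots, and hyphens, starting and ending alphanumeric.
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def parse_greenlist(raw: str) -> list[str]:
    """Split a comma-separated greenlist and reject malformed bucket names."""
    names = [part.strip() for part in raw.split(",") if part.strip()]
    invalid = [name for name in names if not BUCKET_NAME_RE.match(name)]
    if invalid:
        raise ValueError(f"Invalid bucket names: {invalid}")
    return names


def is_role_arn(value: str) -> bool:
    """Return True if the value has the general shape of an IAM role ARN."""
    return ROLE_ARN_RE.match(value) is not None


if __name__ == "__main__":
    print(parse_greenlist("bucket1,bucket2,bucket3"))
    print(is_role_arn("arn:aws:iam::123456789012:role/example-s3-reader"))
```

A common mistake this catches is pasting a bucket ARN or a role name where the full Role ARN is expected.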

Crawl Configuration Options

  • Greenlist: Specify which buckets to index (only listed buckets are crawled).

Data and Metadata Ingested

  • Document content (for supported file types)
  • Metadata (name, path, last modified, etc.)

Permission Propagation Logic

  • S3-origin permissions do not propagate. All indexed documents are accessible to all users on Glean.

Security & Compliance Notes

  • Authentication through IAM cross-account role (AWS) or federated service account (GCP).
  • Required scope: read-only access to listed buckets.
  • Sensitive or regulated content is not restricted once it is indexed. Consider this before setup.

Known Security Restrictions

  • Multi-instance configurations require individual IAM roles and bucket greenlists.
  • There is no permissions granularity available within the connector.

Data Privacy Implications

  • Documents indexed from S3 are searchable by all organization members with access to Glean.