S3

The Glean Amazon S3 connector indexes text-based objects from your buckets so they are searchable in Glean. It supports common office formats, PDFs, and plain text. Glean can run on GCP or AWS; IAM setup differs slightly by platform. Search results draw on the indexed text and metadata, and objects can optionally be displayed through time-limited pre-signed URLs.

Supported features and limitations

The connector reads from the buckets you specify. Platform-specific authentication is covered under Requirements and Setup.

Supported objects

You can index the following file types from S3:

  • PDF
  • Microsoft Word
  • Microsoft PowerPoint
  • Microsoft Excel
  • Plain text and similar text-based formats

Supported behavior

  • Full-corpus indexing for supported document types.
  • Search over indexed S3 content in Glean.
  • Optional display of objects using time-limited pre-signed URLs.

Limitations

  • Client-side encrypted objects and similar encryption modes are not supported for indexing.
  • Objects larger than 64 MB are skipped.
  • S3 permissions are not enforced in Glean. Everyone who can use Glean in your organization can see anything that was indexed from S3.
  • Only the listed file types are indexed; other types are ignored.
  • OCR is disabled for S3 content by default. Text within scanned documents or image-based PDFs is not extracted or indexed during the crawl.

Note: OCR behavior varies by feature. Glean does not run OCR on S3 content during a crawl, but OCR is enabled by default for direct file uploads in Chat. As a result, the same scanned PDF may yield results in Chat but not in Search. To enable OCR for your S3 connector, contact Glean Support.
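
The size and file-type limits above can be approximated with a small pre-check before you add a bucket to the allowlist. This is a minimal sketch, not the connector's actual logic: the extension list is a hypothetical stand-in for the supported file types, and reading 64 MB as 64 * 1024 * 1024 bytes is an assumption. It does not detect client-side encryption or scanned (image-only) PDFs.

```python
from pathlib import PurePosixPath

# Assumption: "64 MB" is read as 64 * 1024 * 1024 bytes; the documentation
# does not say whether MB means 10^6 or 2^20 bytes.
MAX_SIZE_BYTES = 64 * 1024 * 1024

# Hypothetical extension list standing in for the supported file types
# (PDF, Word, PowerPoint, Excel, plain text). Not an official list.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".txt",
}


def is_likely_indexed(key: str, size_bytes: int) -> bool:
    """Return True if an object passes the size and file-type checks."""
    extension = PurePosixPath(key).suffix.lower()
    return extension in SUPPORTED_EXTENSIONS and size_bytes <= MAX_SIZE_BYTES


if __name__ == "__main__":
    print(is_likely_indexed("reports/q1-summary.pdf", 2_000_000))
    print(is_likely_indexed("images/logo.png", 40_000))
```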

Crawling strategy

  • Crawl type: S3 connector
  • Full crawl: Scheduled scan of configured bucket(s)
  • Incremental crawl: Picks up new or modified objects since last crawl
  • People data: N/A
  • Activity: Additions, updates, and deletes reflected via crawl cycle
  • Update rate: Tunable schedule
  • Webhook: N/A
  • Notes: Deletions appear after the next full crawl; no webhooks
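
To illustrate why deletions only show up after a full crawl, here is a rough sketch of the two crawl types. It is a conceptual model under stated assumptions, not Glean's implementation: the function names and the idea of comparing a bucket listing against a set of already indexed keys are hypothetical.

```python
from datetime import datetime, timezone


def changed_since(objects: dict[str, datetime], last_crawl: datetime) -> list[str]:
    """Incremental crawl: keys whose last-modified time is after the last crawl."""
    return sorted(key for key, modified in objects.items() if modified > last_crawl)


def deleted_keys(indexed_keys: set[str], listed_keys: set[str]) -> list[str]:
    """Full crawl: keys that are indexed but no longer appear in the bucket listing."""
    return sorted(indexed_keys - listed_keys)


if __name__ == "__main__":
    listing = {
        "docs/a.txt": datetime(2024, 1, 2, tzinfo=timezone.utc),
        "docs/b.pdf": datetime(2024, 1, 1, tzinfo=timezone.utc),
    }
    last_crawl = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
    print(changed_since(listing, last_crawl))
    print(deleted_keys({"docs/a.txt", "docs/old.docx"}, set(listing)))
```

An incremental pass only sees objects that still exist, so a removed object can be detected only by comparing against a complete listing.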

Requirements

You need the following before you turn on the connector.

Technical requirements

  • A running Glean instance on GCP or AWS.
  • Access to the S3 buckets you want to index.
  • GCP-hosted Glean: Ability to create or use a Google Cloud service account that AWS IAM can trust (web identity federation).
  • AWS-hosted Glean: Ability to configure a cross-account IAM trust to Glean’s AWS principal, per your setup page.

Credential requirements

  • IAM: A role Glean can assume to read S3, configured per your deployment type (federated trust for GCP-hosted Glean, or cross-account trust for AWS-hosted Glean).
  • Credentials and keys must stay confidential and should grant read-only access to the buckets you intend to index unless your security team approves otherwise.

Permission requirements

  • The crawl role should include AmazonS3ReadOnlyAccess or a tighter custom policy with the same effect.
  • The role trust policy must allow Glean’s service account or AWS role to assume it, exactly as shown on the Glean S3 connector setup page.
  • Multi-instance setups may need a separate IAM role and bucket allowlist per instance.
  • You need enough access in AWS to create IAM roles and policies.
  • GCP-hosted Glean: Create an IAM role that can read your buckets, trust the Glean GCP service account through web identity federation, and use the audience value from the Glean setup page.
  • AWS-hosted Glean: Create a dedicated IAM role for bucket access, apply the trust policy template from the Glean setup page, and attach AmazonS3ReadOnlyAccess (or equivalent read-only scope).
  • Keep the role ARN and bucket list ready for the Glean Admin console.
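
If your security team prefers a tighter custom policy over AmazonS3ReadOnlyAccess, one possible shape is a policy scoped to the allowlisted buckets. The sketch below builds such a policy document; the chosen actions (s3:ListBucket, s3:GetBucketLocation, s3:GetObject) are an assumption about what a read-only crawl needs and are narrower than the managed policy, so confirm the required actions against the Glean setup page before relying on it. The function name and bucket names are hypothetical.

```python
import json


def build_read_only_policy(bucket_names: list[str]) -> dict:
    """Build an IAM policy document limited to listing and reading the given buckets."""
    bucket_arns = [f"arn:aws:s3:::{name}" for name in bucket_names]
    object_arns = [f"{arn}/*" for arn in bucket_arns]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Sid": "ListAllowlistedBuckets",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arns,
            },
            {
                # Object-level reads apply to keys inside each bucket.
                "Sid": "ReadAllowlistedObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": object_arns,
            },
        ],
    }


if __name__ == "__main__":
    # Example bucket names only; substitute your own allowlist.
    print(json.dumps(build_read_only_policy(["bucket1", "bucket2"]), indent=2))
```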

Configuration and setup

Configuration is mostly in the Glean Admin console plus IAM in your AWS account.

Prerequisites

  • Glean Admin console access
  • AWS access to create IAM roles

Authentication and fields

In the Glean Admin console, open the S3 connector and provide:

  • Display name
  • IAM role ARN
  • Bucket allowlist: Comma-separated bucket names to crawl (only these buckets are indexed)

Create an IAM role (GCP-hosted Glean)

  1. In AWS, open IAM.
  2. Choose Roles > Create role.
  3. For trusted entity, choose Web identity.
  4. Choose Google as the identity provider and enter the Audience value from the Glean S3 connector setup page.
  5. Click Next.
  6. Attach the AmazonS3ReadOnlyAccess managed policy (or your approved read-only policy), then click Next.
  7. Enter a role name and description, then click Create role.
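
For reference, the console steps above typically produce a trust policy along the following lines. This is a sketch of the general shape AWS uses for Google web identity federation; the audience value is a placeholder, and the Glean setup page may require additional conditions, so treat that page as authoritative.

```python
import json


def build_web_identity_trust_policy(audience: str) -> dict:
    """Trust policy letting a Google-issued identity with this audience assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": "accounts.google.com"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # The audience must match the value shown on the Glean setup page.
                    "StringEquals": {"accounts.google.com:aud": audience}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder only; copy the real audience from the Glean S3 connector setup page.
    print(json.dumps(build_web_identity_trust_policy("AUDIENCE_FROM_GLEAN_SETUP_PAGE"), indent=2))
```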

Create an IAM role (AWS-hosted Glean)

  1. In AWS, open IAM.
  2. Choose Roles > Create role.
  3. Use the trusted entity type and trust policy from the Glean S3 connector setup page for AWS-hosted Glean (typically a cross-account trust to Glean’s AWS account, not the Google web identity flow).
  4. Attach AmazonS3ReadOnlyAccess (or your approved read-only policy).
  5. Name the role, create it, then open the role and copy the role ARN.
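
A cross-account trust policy generally names the principal that may call sts:AssumeRole. The sketch below shows that general shape only; the principal ARN is a made-up placeholder, and the real template on the Glean setup page may include further conditions that this sketch omits.

```python
import json


def build_cross_account_trust_policy(glean_principal_arn: str) -> dict:
    """Trust policy letting the given AWS principal assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": glean_principal_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARN for illustration; use the principal from the Glean setup page.
    example_principal = "arn:aws:iam::111122223333:role/example-glean-principal"
    print(json.dumps(build_cross_account_trust_policy(example_principal), indent=2))
```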

Connect Glean to S3

  1. In the Glean Admin console, go to Connectors > S3 > Add connector.
  2. Enter Display name, IAM role ARN, and the bucket allowlist (for example bucket1,bucket2,bucket3).
  3. Save and test the connection. Fix IAM or trust issues if the test fails.
  4. Finish setup. The first full crawl runs on the connector schedule.
  5. To change scope later, edit the bucket allowlist and run another crawl.
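
Before pasting values into the form, it can help to sanity-check them locally. The following is a minimal sketch with hypothetical helper names: it checks that the role ARN has the usual IAM role shape and that each allowlist entry follows the general S3 bucket naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens). Trimming spaces around commas is a convenience of this sketch; whether the Admin console tolerates them is not stated, so enter the list without spaces.

```python
import re

# General shape of an IAM role ARN, including partitions such as aws-cn.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")

# Simplified S3 bucket naming check: 3-63 characters, starting and ending
# with a lowercase letter or digit.
BUCKET_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def looks_like_role_arn(value: str) -> bool:
    """Return True if the value has the shape of an IAM role ARN."""
    return ROLE_ARN_PATTERN.match(value) is not None


def parse_bucket_allowlist(raw: str) -> list[str]:
    """Split a comma-separated allowlist and reject malformed bucket names."""
    names = [part.strip() for part in raw.split(",") if part.strip()]
    invalid = [name for name in names if not BUCKET_NAME_PATTERN.match(name)]
    if invalid:
        raise ValueError(f"Invalid bucket names: {invalid}")
    return names


if __name__ == "__main__":
    print(parse_bucket_allowlist("bucket1,bucket2,bucket3"))
    print(looks_like_role_arn("arn:aws:iam::123456789012:role/example-crawl-role"))
```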

Crawl configuration

  • Bucket allowlist: Only buckets you list are crawled.

Data and metadata ingested

  • Document content (supported types)
  • Metadata such as name, path, and last modified time

Permission behavior

S3 ACLs and bucket policies are not replayed inside Glean. Indexed documents are visible to everyone who can use Glean in your organization.

Security notes

  • Access uses a cross-account IAM role (AWS-hosted) or federated web identity (GCP-hosted), with read-only scope to listed buckets.
  • After content is indexed, Glean does not re-check S3 ACLs on every search, so plan what you put in the allowlist accordingly.

Limitations

  • Multi-instance setups need separate roles and bucket allowlists per instance.
  • The connector does not offer per-user S3 permission mirroring.

Privacy

Anyone in the organization who can use Glean can search indexed S3 documents.