Learn about the S3 connector, which enables indexing of S3 buckets into Glean’s enterprise search platform.
The Glean S3 connector allows you to index text-based documents from your Amazon S3 buckets into your Glean instance. By integrating S3 with Glean, you enable users to search and discover content stored in S3, including PDFs, Office documents, spreadsheets, presentations, and other text files. The connector surfaces indexed S3 content in Glean Search as either searchable metadata or, optionally, via time-limited pre-signed URLs. The connector can be deployed on both GCP-based and AWS-based Glean instances.
Both GCP-hosted and AWS-hosted Glean deployments can use this connector. Setup differs slightly by platform, so the steps for each are listed separately.
Encrypted S3 objects (e.g., client-side encrypted, PII encryption) are not supported for search indexing; the connector cannot retrieve or index content from buckets containing encrypted data.
Each document is limited to 64 MB. Larger files are not crawled or indexed.
S3 object permissions are NOT enforced in Glean. All indexed S3 content is accessible to all Glean users regardless of original S3 access controls.
Only specific file types (as listed above) are indexed; other file types are ignored.
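The size limit above can be checked before relying on the connector to index a bucket. The following is a minimal sketch, not part of Glean: it assumes the 64 MB limit means 64 × 1024 × 1024 bytes (the exact byte threshold is not stated here) and that object sizes are already known, for example from an S3 listing. The names `MAX_DOCUMENT_BYTES`, `exceeds_size_limit`, and `split_by_size` are illustrative.

```python
# Illustrative pre-check only; not part of the Glean connector.
# ASSUMPTION: the 64 MB limit is interpreted as 64 * 1024 * 1024 bytes.
MAX_DOCUMENT_BYTES = 64 * 1024 * 1024


def exceeds_size_limit(size_bytes: int) -> bool:
    """Return True if an object is larger than the assumed 64 MB limit."""
    return size_bytes > MAX_DOCUMENT_BYTES


def split_by_size(objects: dict[str, int]) -> tuple[list[str], list[str]]:
    """Split {key: size_in_bytes} into (within_limit, too_large) key lists."""
    within, too_large = [], []
    for key, size in objects.items():
        (too_large if exceeds_size_limit(size) else within).append(key)
    return within, too_large
```

Keys returned in the second list would be candidates to split or exclude, since files over the limit are skipped rather than partially indexed.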
Access to the Glean admin console with appropriate admin permissions.
Access to the AWS account with permissions to create IAM roles.
For GCP Deployments:
A dedicated IAM role with access to the buckets must be created in the AWS account.
Navigate to the AWS home page and search for IAM.
From the sidebar menu, select “Roles”, then select “Create Role”.
Choose “Web Identity” as the trusted entity type, select Google as the identity provider, and enter the audience ID shown on the datasource setup page.
Click Next. Search for the AmazonS3ReadOnlyAccess policy, select it, and then click Next.
Enter a role name and description, then click Create Role.
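For reference, choosing “Web Identity” with Google as the provider results in a role trust policy of roughly the following shape. This is a hedged sketch that builds such a document in Python. The function name and the placeholder audience value are illustrative assumptions, and the policy the console generates for your account may differ in detail.

```python
import json

# ASSUMPTION: placeholder value; the real audience ID comes from the
# datasource setup page in Glean.
EXAMPLE_AUDIENCE_ID = "REPLACE_WITH_AUDIENCE_ID"


def google_web_identity_trust_policy(audience_id: str) -> dict:
    """Build a trust policy that lets a Google identity with the given
    audience assume the role via web identity federation."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": "accounts.google.com"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {"accounts.google.com:aud": audience_id}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON for comparison with the console's version.
    print(json.dumps(google_web_identity_trust_policy(EXAMPLE_AUDIENCE_ID), indent=2))
```

If you script role creation instead of using the console, a document like this could be serialized with `json.dumps` and passed as the trust policy, with the AmazonS3ReadOnlyAccess managed policy attached afterwards. Treat that as one possible approach, not the documented setup path.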
For AWS Deployments:
A dedicated IAM role with access to the buckets must be created in the AWS account.
Navigate to the AWS home page and search for IAM.
From the sidebar menu, select “Roles”, then select “Create Role”.
Choose “Custom Trust Policy” as the trusted entity type, and replace the default trust policy with the policy shown on the datasource setup page.
Click Next.
Search for the AmazonS3ReadOnlyAccess policy, select it, and then click Next.
Enter `GleanS3Crawler` as the role name, add a description, and then click Create Role.
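Because the custom trust policy is pasted by hand, it can help to confirm that the text is well-formed JSON before creating the role. The sketch below is an illustration under assumptions, not a Glean tool: it validates the pasted text and assembles the role parameters from the steps above. `build_create_role_params` is a hypothetical helper, and the trust policy itself must still be copied from the datasource setup page.

```python
import json

# Role name from the steps above.
ROLE_NAME = "GleanS3Crawler"
# ARN of the AWS managed policy AmazonS3ReadOnlyAccess.
S3_READ_ONLY_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"


def build_create_role_params(trust_policy_text: str, description: str) -> dict:
    """Validate the pasted trust policy and return role-creation parameters.

    Raises ValueError if the text is not a JSON object with a "Statement" key.
    """
    try:
        policy = json.loads(trust_policy_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Trust policy is not valid JSON: {exc}") from exc
    if not isinstance(policy, dict) or "Statement" not in policy:
        raise ValueError('Trust policy must be a JSON object with a "Statement" key')
    return {
        "RoleName": ROLE_NAME,
        "AssumeRolePolicyDocument": json.dumps(policy),
        "Description": description,
    }
```

The returned keys are named to match the keyword arguments of the boto3 IAM `create_role` call, so the dictionary could be passed to it directly if you automate this step. `S3_READ_ONLY_POLICY_ARN` would then be attached to the role in a separate call.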