Crawler and Indexing Size Limits

Introduction

When Glean crawls an organization’s corpus through the configured datasources, the crawler retrieves the following information and passes it to the indexer:

Item content (e.g., title, body, comments, media)
Metadata (e.g., created by, created time, updated time, type, facets, folder)
Permissions (i.e., who is allowed to view the item)

Of these three types of data, only the first (item content) is subject to a file size check.

Item Content Indexing Limit

Before the crawler downloads an item’s content for the indexer, it first checks the item’s size. If the size is GREATER than 64 MB, the crawler WILL NOT download the item content; it downloads only the metadata and permissions for indexing. If the size is LESS than 64 MB, the item content is downloaded as well and fed to the indexer.
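The size check described above can be sketched roughly as follows. This is a minimal illustration, not Glean's actual implementation: the names `SIZE_LIMIT_BYTES`, `CrawlPlan`, and `plan_crawl` are hypothetical, the sketch assumes 1 MB means 1024 * 1024 bytes, and it assumes an item of exactly 64 MB is treated as within the limit (the text above does not specify either point).

```python
from dataclasses import dataclass

# Assumption: 1 MB = 1024 * 1024 bytes. The documentation does not say
# which definition of "MB" applies.
SIZE_LIMIT_BYTES = 64 * 1024 * 1024


@dataclass
class CrawlPlan:
    """Hypothetical record of which parts of an item get downloaded."""
    fetch_content: bool
    fetch_metadata: bool
    fetch_permissions: bool


def plan_crawl(item_size_bytes: int) -> CrawlPlan:
    """Decide what to download for one item, based on its reported size."""
    # Only item content is gated by size; metadata and permissions are
    # always downloaded for indexing.
    # Assumption: an item of exactly 64 MB counts as within the limit.
    fetch_content = item_size_bytes <= SIZE_LIMIT_BYTES
    return CrawlPlan(
        fetch_content=fetch_content,
        fetch_metadata=True,
        fetch_permissions=True,
    )
```

For example, under these assumptions `plan_crawl(10 * 1024 * 1024)` would include the item content, while `plan_crawl(100 * 1024 * 1024)` would return a plan with only metadata and permissions.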

Q&A

Is Content Over the Size Limit still Searchable via Glean?

Do these Limits Apply for All Datasources?

When the Content is Less Than the Size Limit, is All of the Content Indexed?

If the Crawler Defaults to Converting and Stripping Content, what other Options are Available?

Will Chat (AI Assistant) functionality as well as Summarize Capabilities in the Search Engine Result Page be affected for Non-Indexed Content?

These Limits Seem Small. What’s the Reason?

Is this a Hard Limit that can Never Change?

What types of content are affected?

General

Native Connectors

Push API Connectors
