Crawler and indexing size limits
This page covers indexing size constraints, not crawl freshness or timing. To monitor crawl activity in your deployment, see Managing connectors. For default crawl frequencies, see Crawling strategy.
Introduction
When Glean crawls an organization’s corpus through its configured connectors, the crawler retrieves the following information and passes it to the indexer:
- Item content (for example, title, body, comments, media)
- Metadata (for example, created by, created time, updated time, type, facets, folder)
- Permissions (for example, who is allowed to view the item)
Of these three types of data, only the first (item content) is subject to a file size limit.
Item content indexing limit
Before the crawler downloads an item’s content for the indexer, it first checks the item’s size. If the size is greater than 64 MB, the crawler does not download the item content; it downloads only the metadata and permissions for indexing. If the size is less than 64 MB, the crawler also downloads the item content and passes it to the indexer.
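The size check described above can be sketched as a small decision function. This is an illustrative sketch only, not Glean's implementation: the names `CrawlPlan`, `plan_crawl`, and `SIZE_LIMIT_BYTES` are hypothetical. It also makes two assumptions that this page does not state: that 1 MB means 1024 × 1024 bytes, and that an item of exactly 64 MB is treated as within the limit.

```python
from dataclasses import dataclass

# Assumption: 1 MB = 1024 * 1024 bytes. The page does not say whether
# "MB" is binary or decimal.
SIZE_LIMIT_BYTES = 64 * 1024 * 1024


@dataclass(frozen=True)
class CrawlPlan:
    """Hypothetical record of which parts of an item the crawler fetches."""
    fetch_metadata: bool
    fetch_permissions: bool
    fetch_content: bool


def plan_crawl(item_size_bytes: int) -> CrawlPlan:
    """Decide what to fetch for one item, based on its reported size."""
    # Metadata and permissions are fetched regardless of size.
    # Content is skipped when the item is larger than the limit.
    # Assumption: an item of exactly 64 MB counts as within the limit;
    # the page only covers the "greater than" and "less than" cases.
    within_limit = item_size_bytes <= SIZE_LIMIT_BYTES
    return CrawlPlan(
        fetch_metadata=True,
        fetch_permissions=True,
        fetch_content=within_limit,
    )


if __name__ == "__main__":
    small = plan_crawl(10 * 1024 * 1024)   # 10 MB item
    large = plan_crawl(200 * 1024 * 1024)  # 200 MB item
    print("10 MB item, content fetched:", small.fetch_content)
    print("200 MB item, content fetched:", large.fetch_content)
```

In this sketch, an oversized item still produces a plan with metadata and permissions enabled. This matches the behavior described above, where such items are indexed without their content.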