Introduction

When Glean crawls an organization’s corpus through the configured datasources, the crawler retrieves the following information and passes it to the indexer:

  • Item content (e.g., title, body, comments, media)
  • Metadata (e.g., created by, created time, updated time, type, facets, folder)
  • Permissions (i.e., who is allowed to view the item)

Of the three types of data indexed, the first one (item content) is subject to a file size limit.

Item Content Indexing Limit

Before the crawler downloads an item’s content for the indexer, it first checks the size of that content. If the size is GREATER than 64 MB, the crawler WILL NOT download the item content; it downloads only the metadata and permissions information for indexing. If the size is LESS than 64 MB, the item content is downloaded and fed to the indexer along with the metadata and permissions.
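
The size check described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not Glean's actual crawler code: the names CrawlPlan, plan_crawl, and CONTENT_SIZE_LIMIT_BYTES are hypothetical, the limit is assumed to mean 64 * 1024 * 1024 bytes, and an item of exactly 64 MB is assumed to be treated like an oversized item, since the text above does not say how that boundary case is handled.

```python
from dataclasses import dataclass

# Assumption: "64 MB" is read as 64 * 1024 * 1024 bytes. The article does
# not state whether the limit is binary or decimal megabytes.
CONTENT_SIZE_LIMIT_BYTES = 64 * 1024 * 1024


@dataclass(frozen=True)
class CrawlPlan:
    """Hypothetical record of which parts of an item the crawler fetches."""
    metadata: bool
    permissions: bool
    content: bool


def plan_crawl(size_bytes: int,
               limit_bytes: int = CONTENT_SIZE_LIMIT_BYTES) -> CrawlPlan:
    """Decide what to download for one item, given its content size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    # Metadata and permissions are always fetched; only content is size-gated.
    # Assumption: content is fetched only when strictly below the limit, so
    # an item of exactly limit_bytes is handled like an oversized item.
    fetch_content = size_bytes < limit_bytes
    return CrawlPlan(metadata=True, permissions=True, content=fetch_content)


if __name__ == "__main__":
    mib = 1024 * 1024
    for size_mb in (1, 63, 65, 500):
        plan = plan_crawl(size_mb * mib)
        print(f"{size_mb} MB -> content downloaded: {plan.content}")
```

In this sketch an oversized item still produces a plan with metadata and permissions set, which mirrors the behavior described above: the item remains in the index, but without its content.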

Q&A