Introduction
When Glean crawls an organization’s corpus through the configured datasources, the crawler pulls down the following information, which is then fed to the indexer:
- Item content (e.g., title, body, comments, media)
- Metadata (e.g., created by, created time, updated time, type, facets, folder)
- Permissions (i.e., who is allowed to view the item)
Item Content Indexing Limit
Before downloading an item’s content to feed to the indexer, the crawler first checks the item’s size. If the size is GREATER than 64 MB, the crawler WILL NOT download the item content; it downloads only the metadata and permissions information for indexing. If the size is LESS than 64 MB, the item content is also downloaded and fed to the indexer.
Q&A
Is Content Over the Size Limit still Searchable via Glean?
Do these Limits Apply for All Datasources?
When the Content is Less Than the Size Limit, is All of the Content Indexed?
- Download the content
- Convert the content from its RAW state (PDF, Word, Google Doc, etc.) to a text-based, indexable format. During this process, multimedia content (images, videos, etc.) is removed. This conversion drastically reduces document size while preserving the quality of the indexed content (a 95%+ reduction for PDFs).
- If the content (converted or not) is less than 16.875 MB, the content is stored and fed to the indexer. If the content is greater than 16.875 MB, then the first 16.875 MB of content will be stored and fed to the indexer. The rest of the content will be cut from indexing.
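The size check and the steps above can be sketched as follows. This is an illustrative model only, not Glean's implementation: the names `crawl_item`, `strip_to_text`, and `CrawlResult` are hypothetical, the conversion step is a placeholder, and the sketch assumes that MB means 2^20 bytes and that an item of exactly 64 MB is still downloaded, neither of which the text above specifies.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: 1 MB = 2**20 bytes; the article does not say which convention applies.
MB = 1024 * 1024
DOWNLOAD_LIMIT_BYTES = 64 * MB            # above this, content is not downloaded
INDEX_LIMIT_BYTES = int(16.875 * MB)      # at most this much content is indexed


@dataclass
class CrawlResult:
    metadata_indexed: bool
    permissions_indexed: bool
    indexed_content: Optional[bytes]      # None when the content was skipped


def strip_to_text(raw: bytes) -> bytes:
    """Hypothetical placeholder for the RAW-to-text conversion step.

    A real converter would parse the PDF/Word/Google Doc and drop
    multimedia; here the input is returned unchanged.
    """
    return raw


def crawl_item(reported_size: int, fetch: Callable[[], bytes]) -> CrawlResult:
    """Model the flow described above for a single item."""
    # Size is checked BEFORE any download happens.
    if reported_size > DOWNLOAD_LIMIT_BYTES:
        # Metadata and permissions only; fetch() is never called.
        return CrawlResult(True, True, None)

    text = strip_to_text(fetch())
    # Keep only the first INDEX_LIMIT_BYTES; the remainder is cut.
    # Note: a byte-level cut could split a multi-byte character.
    return CrawlResult(True, True, text[:INDEX_LIMIT_BYTES])
```

For example, `crawl_item(70 * MB, fetch)` returns a result whose `indexed_content` is `None` without ever calling `fetch`, while an 18 MB text item is downloaded and truncated to `INDEX_LIMIT_BYTES`.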
If the Crawler Defaults to Converting and Stripping Content, what other Options are Available?
Will Chat (AI Assistant) functionality as well as Summarize Capabilities in the Search Engine Result Page be affected for Non-Indexed Content?
These Limits Seem Small, what’s the Reason?
Is this a Hard Limit that can Never Change?
What types of content are affected?