Crawler and Indexing Size Limits
This article explains the size limits and processing rules for item content, metadata, and permissions that Glean’s crawler and indexer apply to all datasources.
Introduction
When Glean crawls an organization’s corpus through the configured datasources, the crawler retrieves the following information and feeds it to the indexer:
- Item content (e.g., title, body, comments, media)
- Metadata (e.g., created by, created time, updated time, type, facets, folder)
- Permissions (i.e., who is allowed to view the item)
Of these three types of data, only the first (item content) is subject to a file size limit.
Item Content Indexing Limit
Before downloading an item’s content for the indexer, the crawler first checks the item’s size. If the size is greater than 64 MB, the crawler does not download the content; it downloads only the metadata and permissions for indexing. If the size is less than 64 MB, the crawler also downloads the content and feeds it to the indexer.
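The size check described above can be sketched as follows. This is only an illustration, not Glean’s implementation: the function name `plan_crawl` and the returned dictionary are hypothetical, "MB" is assumed to mean 2**20 bytes, and because the article does not say what happens at exactly 64 MB, the sketch conservatively treats that case as over the limit.

```python
MB = 1024 * 1024                 # assumption: "MB" means 2**20 bytes
DOWNLOAD_LIMIT_BYTES = 64 * MB   # limit stated in this article


def plan_crawl(size_bytes: int) -> dict:
    """Decide which parts of an item the crawler would fetch (hypothetical helper)."""
    # Assumption: an item of exactly 64 MB is treated like an oversized item.
    download_content = size_bytes < DOWNLOAD_LIMIT_BYTES
    return {
        "content": download_content,  # skipped for oversized items
        "metadata": True,             # always fetched
        "permissions": True,          # always fetched
    }


if __name__ == "__main__":
    print(plan_crawl(10 * MB))   # content is downloaded
    print(plan_crawl(100 * MB))  # metadata and permissions only
```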
Q&A
Is Content Over the Size Limit still Searchable via Glean?
Yes. The item is still searchable via Glean based on its metadata, and it is still ranked and secured based on its permissions and the personalization features of the Glean Platform.
Do these Limits Apply for All Datasources?
Yes, the crawler applies this logic for all datasources.
When the Content is Less Than the Size Limit, is All of the Content Indexed?
Not necessarily. By default, the crawler does the following:
- Download the content
- Convert the content from its raw format (PDF, Word, Google Doc, etc.) to a text-based, indexable format. During this step, multimedia content (images, videos, etc.) is removed. This conversion drastically reduces document size while preserving the quality of the indexed content (a reduction of 95% or more for PDFs).
- If the content (converted or not) is less than 16.875 MB, store all of it and feed it to the indexer. If the content is greater than 16.875 MB, store and index only the first 16.875 MB; the rest is excluded from indexing.
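The truncation step in the list above can be sketched as follows. This is a minimal sketch under stated assumptions, not Glean’s code: `truncate_for_index` is a hypothetical name, "MB" is assumed to mean 2**20 bytes (which makes 16.875 MB equal to 17,694,720 bytes), and the limit is assumed to be measured on the UTF-8 encoding of the converted text.

```python
MB = 1024 * 1024                       # assumption: "MB" means 2**20 bytes
INDEX_LIMIT_BYTES = int(16.875 * MB)   # 17,694,720 bytes under that assumption


def truncate_for_index(text: str, limit: int = INDEX_LIMIT_BYTES) -> str:
    """Keep only the first `limit` bytes of converted text (hypothetical helper)."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return text  # small enough: everything is indexed
    # Cut at the byte limit; errors="ignore" drops a multi-byte
    # character that the cut happened to split in half.
    return data[:limit].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    print(INDEX_LIMIT_BYTES)
    print(len(truncate_for_index("a" * (20 * MB))))
```

Anything past the cut-off is simply absent from the index, so a search term that appears only near the end of a very large document would not match on content.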
If the Crawler Defaults to Converting and Stripping Content, what other Options are Available?
OCR (Optical Character Recognition) can be turned on for certain datasources. When OCR is enabled, the crawler does not strip the multimedia content; instead, it attempts to extract text from it. Once extraction is complete, the default crawling and indexing process continues as described above (content conversion, a total indexable text size of 16.875 MB, etc.).
Even with OCR enabled, the initial size limit of 64 MB still applies as described in the Item Content Indexing Limit section.
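Putting the pieces together, the default path and the OCR path might look like the sketch below. Everything here is illustrative: the `Item` class, its fields, and `indexable_text` are hypothetical names, real OCR is replaced by a pre-filled `image_text` field, "MB" is assumed to mean 2**20 bytes, and an item of exactly 64 MB is assumed to be treated as over the limit.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024                       # assumption: "MB" means 2**20 bytes
DOWNLOAD_LIMIT_BYTES = 64 * MB         # raw item size limit
INDEX_LIMIT_BYTES = int(16.875 * MB)   # indexable text limit


@dataclass
class Item:
    size_bytes: int       # size of the raw file in the datasource
    text: str             # text obtained by converting the raw file
    image_text: str = ""  # stand-in for text OCR would recover from images


def indexable_text(item: Item, ocr_enabled: bool = False) -> Optional[str]:
    """Return the text that would reach the indexer, or None if content is skipped."""
    # The 64 MB check applies first, regardless of the OCR setting.
    if item.size_bytes >= DOWNLOAD_LIMIT_BYTES:
        return None  # only metadata and permissions are indexed
    text = item.text
    # With OCR on, text recovered from multimedia is kept instead of discarded.
    if ocr_enabled and item.image_text:
        text = text + "\n" + item.image_text
    # The 16.875 MB cap applies to the combined text either way.
    data = text.encode("utf-8")[:INDEX_LIMIT_BYTES]
    return data.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    scan = Item(size_bytes=5 * MB, text="Cover page", image_text="Scanned invoice 42")
    print(indexable_text(scan))                    # OCR off: image text is dropped
    print(indexable_text(scan, ocr_enabled=True))  # OCR on: image text is included
```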
Will Chat (AI Assistant) functionality as well as Summarize Capabilities in the Search Engine Result Page be affected for Non-Indexed Content?
Yes. Because the content of these documents will not have been indexed, Chat cannot use it as part of its standard functionality. The “Summarize” button will also not appear for them in the Search Engine Result Page.
These Limits Seem Small, what’s the Reason?
Glean must provide a performant, secure, and scalable platform that lets organizations quickly find and understand their internal knowledge, and limits are necessary to achieve that. As a result, some larger content will not be indexed, but the benefit to the rest of the content far outweighs the potential value lost in these large documents.
Customers can break up large documents into a logical series of smaller ones and/or downsample the large multimedia content within them whenever the content is crucial.
For comparison, Google.com’s upper bound for content size is 16 MB, and the complete works of Shakespeare contain approximately 5 MB of text.
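To find candidates for splitting or downsampling, one option is to scan an exported or locally synced copy of a document folder for files at or above the 64 MB limit. The sketch below is a generic file-system scan, not a Glean tool: `oversized_files` is a hypothetical helper, and "MB" is assumed to mean 2**20 bytes.

```python
from pathlib import Path

MB = 1024 * 1024                # assumption: "MB" means 2**20 bytes
DOWNLOAD_LIMIT_BYTES = 64 * MB  # content limit stated in this article


def oversized_files(root, limit=DOWNLOAD_LIMIT_BYTES):
    """Return (path, size) pairs for files at or above `limit`, largest first."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            if size >= limit:
                hits.append((path, size))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Scan the current directory and report sizes in MB.
    for path, size in oversized_files("."):
        print(f"{size / MB:8.1f} MB  {path}")
```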
Is this a Hard Limit that can Never Change?
No. The Glean Product Team is willing to revisit these limits as technology and content change. Reach out to Glean Support or your Glean Account Team if these limits are constraining your business.
What types of content are affected?
Essentially all content is subject to these limits. In practice, however, some datasources (Teams, Slack, Salesforce, ServiceNow, etc.) are highly unlikely to produce enough text to ever reach them.
Most customers can expect to run into these limits only for unstructured Office files (PowerPoint, Word, etc.), Google Docs (Slides, Docs, etc.), and PDFs.