Crawling & Learning Process
In the previous section, you successfully linked all your company apps to Glean. Now, three essential background processes will take place, each crucial for optimizing Glean’s functionality:
- Crawling - Glean fetches data from your connected apps.
- Indexing - Glean creates a model of the data that was fetched and incorporates it into your organization’s search index.
- Learning - Glean processes the fetched data using Machine Learning (M/L) to create a search and ranking algorithm tailored to your organization’s data and users.
Timelines for Completion
The time required to complete all three processes varies with the size of your organization and the volume of content that Glean needs to process. Together, the crawling and indexing processes take approximately:
- 2-3 days to complete for a typical small organization, or small volume of content.
- 10-14 days for a typical large organization, or large volume of content.
The M/L process can take an additional 2-14 days, depending on:
- The GCP or AWS region that your Glean tenant was deployed to (and the tier of TPU/GPU hardware available in that region).
- The amount of content that needs to be processed as part of each M/L workflow.
These times are estimates only.
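As a rough illustration of how these ranges combine, the sketch below adds the crawling and indexing range to the M/L range. It assumes the M/L workflows start only after crawling and indexing finish, with no overlap. The names `CRAWL_AND_INDEX_DAYS`, `ML_DAYS`, and `estimate_total_days` are hypothetical and are not part of any Glean tooling.

```python
# Day ranges quoted in the text above, as (minimum, maximum).
CRAWL_AND_INDEX_DAYS = {
    "small": (2, 3),    # typical small organization or small content volume
    "large": (10, 14),  # typical large organization or large content volume
}
ML_DAYS = (2, 14)  # additional time for the M/L workflows


def estimate_total_days(org_size: str) -> tuple[int, int]:
    """Return the (min, max) total days, assuming M/L runs after indexing."""
    crawl_min, crawl_max = CRAWL_AND_INDEX_DAYS[org_size]
    ml_min, ml_max = ML_DAYS
    return crawl_min + ml_min, crawl_max + ml_max


for size in ("small", "large"):
    low, high = estimate_total_days(size)
    print(f"{size}: roughly {low}-{high} days in total")
```

Under these assumptions, a small organization should plan for roughly 4 to 17 days in total, and a large organization for roughly 12 to 28 days.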
Your Glean engineer will advise you once all crawling, indexing, and learning processes have been completed. The remainder of this document will cover these processes in more detail.
For now, you can proceed to the next step: Populate Content.
About Crawling & Indexing
When you start a crawl for a data source for the first time, the crawling and indexing processes begin. During this time, Glean will:
- Crawl the content (and associated permissions & activity metadata) for the selected data source.
- Create the Glean Knowledge Graph by indexing the crawled content, mapping it together, and creating a real-time model that can be referred to in response to a user’s query.
Crawling is the process in which Glean fetches data from within your organization’s sources of data for the purposes of creating the search index.
The Knowledge Graph is a real-time model of your organization’s indexed information. It is a map that links all content, people, permissions, language, and activity within your organization. It is designed to provide users with the most personalized and relevant results for their queries in a matter of milliseconds.
Indexing is the process in which Glean makes content ready for display in search results by creating (or updating) your organization’s Knowledge Graph: the mapping between all content, people, permissions, language, and activity in the company.
Checking the Crawling & Indexing Status
You can check the status of your in-progress crawls at any time by going to Admin Console > Data sources and reviewing the table of configured apps.
Here, you will see information about the progress of the crawl, including how many documents have been indexed, and any errors that may have occurred.
About Machine Learning (M/L)
Once the crawling and indexing processes have been completed, Glean will initiate several Machine Learning (M/L) workflows that will run on all indexed content.
The M/L process is critically important and is responsible for:
- Optimizing search query understanding and spellcheck.
- Understanding synonyms, acronyms, and semantics used in documents and between employees within your organization.
- Enhancing relevance rankings for search results and people suggestions.
- Enabling query suggestions, predictive text, and autocomplete.
- Training the unique language model for your organization, which is essential for the operation of Glean Chat and Glean Assistant.
Usage of Glean is not supported until the M/L process has completed successfully. Do not give users access to Glean until all M/L workflows have completed.
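If you keep your own rollout checklist, the rule above can be expressed as a simple gate. This is a minimal sketch and not a Glean API. The `Status` enum, the `STAGES` tuple, and `ready_for_users` are hypothetical names, and the statuses are assumed to be recorded by hand from what your Glean engineer reports.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"


# The three background processes, in the order they run.
STAGES = ("crawling", "indexing", "learning")


def ready_for_users(statuses: dict) -> bool:
    """Return True only when every stage is recorded as complete."""
    # A stage missing from the dict counts as not complete.
    return all(statuses.get(stage) is Status.COMPLETE for stage in STAGES)


# Hypothetical example: crawling and indexing are done, M/L is still running.
current = {
    "crawling": Status.COMPLETE,
    "indexing": Status.COMPLETE,
    "learning": Status.RUNNING,
}
print(ready_for_users(current))
```

Treating a missing stage as incomplete keeps the check conservative: access opens only after all three processes are explicitly marked complete.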
Checking the Machine Learning (M/L) Status
The M/L workflows are background processes, so it is not currently possible to check their status inside the Glean UI.
Your Glean engineer will notify you of the progress of these workflows and let you know when they complete successfully.