The initial crawl duration for any datasource varies based on two key factors:
1. Datasource Size: The total volume of content, including the number of documents/messages and their individual sizes, directly impacts crawl time.
2. API Rate Limits: The datasource’s API rate limit affects how quickly Glean can retrieve items. Lower rate limits result in longer crawl times.
For typical enterprise datasources, initial crawls generally take between 3 and 10 days. Larger datasources and those with low API rate limits tend toward the longer end of this range.
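As a rough illustration of how datasource size and API rate limits interact, the sketch below computes a lower bound on crawl time when the rate limit is the only bottleneck. It is not how Glean computes its estimates; the function name, its parameters, and the example numbers are assumptions made purely for illustration.

```python
SECONDS_PER_DAY = 86_400


def estimate_crawl_days(
    item_count: int,
    requests_per_second: float,
    items_per_request: int = 1,
) -> float:
    """Lower-bound crawl time in days, assuming the API rate limit is the
    only bottleneck (hypothetical model, not Glean's estimator)."""
    if item_count < 0:
        raise ValueError("item_count must be non-negative")
    if requests_per_second <= 0 or items_per_request <= 0:
        raise ValueError("rate and batch size must be positive")
    # Items retrievable per day at the sustained rate limit.
    items_per_day = requests_per_second * items_per_request * SECONDS_PER_DAY
    return item_count / items_per_day


if __name__ == "__main__":
    # Hypothetical datasource: 5 million items, 10 requests/second, 1 item each.
    print(f"{estimate_crawl_days(5_000_000, 10):.1f} days")  # prints "5.8 days"
```

Under these assumptions, halving the rate limit doubles the crawl time, which is why datasources with low rate limits land at the longer end of the range. Real crawls also spend time on permissions, metadata, and retries, so actual durations will be longer than this bound.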
If your datasource supports initial crawl estimates, you’ll have the option to enter an estimated document count during setup.
Based on this input and historical data, you’ll see a projected time range for when the initial crawl is expected to finish. This feature is available for select data sources and is designed to give you a data-driven estimate, so you can better anticipate when your content will be ready to use in Glean.
Initial crawl estimates are historical averages computed from past datasource crawls. Please note that actual crawl time can vary due to factors such as data volume, change frequency and structure.
What if my crawl is taking a long time?
Crawl duration is primarily determined by your datasource’s size. Larger datasets or applications with low API rate limits naturally require more time to process. If your crawl duration exceeds expectations, we recommend contacting Glean support for assistance.
How can I restrict what Glean crawls?
While some connectors (like GitHub) offer restriction configuration during setup through the UI, most datasources require Glean support assistance for implementing crawl restrictions. Available restriction methods include:
Time-based: Limit crawling to content created or accessed within a specific timeframe (e.g., last 6 months)
User-based: Restrict crawling to content from specified users
Group-based: Limit crawling to content from specific Active Directory (AD) groups
Site/Channel-based: Restrict crawling to specific sites or channels
Available restrictions depend on the datasource’s API capabilities. Most applications support both greenlisting (explicit inclusion) and redlisting (explicit exclusion).
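To make the interaction between these restriction types concrete, here is a minimal sketch of how time-based, user-based, and site-based rules could combine with greenlists and redlists. This is a conceptual model only, not Glean's configuration format or implementation: the Item and CrawlRules types, their field names, and the rule that a redlist entry overrides a greenlist entry are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Item:
    """Hypothetical crawlable item."""
    owner: str
    site: str
    last_modified: datetime


@dataclass(frozen=True)
class CrawlRules:
    """Hypothetical restriction set; empty sets mean 'no restriction'."""
    max_age: Optional[timedelta] = None          # time-based
    greenlisted_sites: frozenset = frozenset()   # explicit inclusion
    redlisted_sites: frozenset = frozenset()     # explicit exclusion
    redlisted_users: frozenset = frozenset()     # user-based exclusion


def should_crawl(item: Item, rules: CrawlRules, now: datetime) -> bool:
    # Time-based: skip content older than the allowed window.
    if rules.max_age is not None and now - item.last_modified > rules.max_age:
        return False
    # Redlists: explicit exclusions win (an assumption of this sketch).
    if item.site in rules.redlisted_sites or item.owner in rules.redlisted_users:
        return False
    # Greenlist: if one is set, only listed sites are crawled.
    if rules.greenlisted_sites and item.site not in rules.greenlisted_sites:
        return False
    return True
```

For example, with a 180-day window and a greenlist containing only a hypothetical "eng" site, a recent item on "eng" would be crawled, while a recent item on any other site, or an old item on "eng", would be skipped.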
What should I do if I see errors in my crawl status?
If you encounter errors in your crawl status, this may indicate connectivity issues with your datasource or problems with the data itself. We recommend:
Verifying your datasource configuration
Contacting Glean support if issues persist
Can I crawl multiple datasources simultaneously?
Yes, Glean fully supports concurrent crawling of multiple datasources.
How can I monitor crawl progress?
Crawl status can be monitored under the Content crawling heading in the apps table:
Job in progress: Indicates an active crawl
Synced: Indicates a completed crawl
Detailed progress indicators are not currently available.
If your data source supports initial crawl estimates, you can enter an estimated document count during setup to receive a projected time range for crawl completion. This feature is available for select data sources and uses historical averages to estimate timing, though actual crawl time may vary depending on data volume, change frequency, and content structure.
Why is my crawl stuck at 'Job in progress'?
The Job in progress status indicates an active crawl of your datasource. Since full crawls typically take several days to complete, this status will persist throughout the crawling process.
How do I delete a datasource?
To delete a datasource after initial setup, navigate to the Data Sources page on the admin console and click the datasource instance you wish to delete. On the “Overview” tab, open the “Extreme Measures” section and click the “Delete instance” button. Once you have confirmed the deletion, it may take up to 5 minutes for documents from this datasource instance to stop appearing in Glean search results. All associated data will be removed in the background.
Currently, some datasources do not support deletion. Datasources that you cannot create multiple instances of cannot be deleted. In addition, you will not be able to delete your active People Data source. If you wish to delete this datasource, first set a new People Data source on the People Data page.
How do I stop or restart a crawl?
Crawl management operations must be performed by Glean support. Please contact them for assistance with stopping or restarting crawls.
What is Crawl Rate?
Crawl Rate is the hourly rate of crawling tasks across document parts (for example, content, metadata, and permissions) during the initial crawl. It serves as a heartbeat to confirm active progress.
What is Change Rate?
Change Rate is the number of user‑driven changes (for example, edits, additions, and deletions) synced in the past 24 hours. It is reported once the initial crawl completes and indicates ongoing freshness.
When do the Crawl Rate and Change Rate columns appear?
You will see Crawl Rate only during the initial crawl. After the initial crawl is complete, Change Rate appears for that data source.
How often do the Crawl Rate and Change Rate metrics update?
Approximately every 5–10 minutes.
What does a “0” Crawl Rate mean?
This can mean the data source is being initialized or no crawling tasks were performed in the last hour, for example, due to health checks. If 0 persists longer than expected, investigate configuration and permissions.
What does a “0” Change Rate mean?
Either no user‑driven changes occurred in the last 24 hours or updates are not being synced. If you expect activity, investigate potential sync or permission issues.
Why don’t I see Crawl Rate and Change Rate metrics for some connectors?
Certain connector types are excluded, for example, federated‑fetch only, customer‑managed, and web connectors.
Does a high Crawl Rate mean everything is healthy?
Not necessarily. These metrics reflect activity, not a complete health assessment. Use them alongside status indicators and error surfacing.
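The guidance in the preceding answers can be summarized as a small decision sketch. It is only a way to read the metrics alongside the crawl status, not a Glean API: the Phase enum, the has_errors flag, and the returned messages are hypothetical names chosen for this example.

```python
from enum import Enum


class Phase(Enum):
    INITIAL_CRAWL = "initial_crawl"  # Crawl Rate column is shown
    SYNCED = "synced"                # Change Rate column is shown


def interpret_rate(phase: Phase, rate: float, has_errors: bool) -> str:
    """Suggest a next step from a rate metric plus the crawl status.

    `has_errors` stands in for whatever the status column reports; it is
    a hypothetical input, since the rate alone is not a health check.
    """
    if has_errors:
        # Errors take priority regardless of how high the rate is.
        return "verify configuration; contact support if errors persist"
    if rate > 0:
        return "activity observed; keep watching status indicators"
    if phase is Phase.INITIAL_CRAWL:
        # 0 Crawl Rate: initializing, or no tasks ran in the last hour.
        return "may be initializing; if 0 persists, check configuration and permissions"
    # 0 Change Rate: no user-driven changes, or updates are not syncing.
    return "no changes in 24 hours; if activity is expected, check sync and permissions"
```

Note that a high rate combined with reported errors still returns the error guidance, reflecting that these metrics show activity rather than overall health.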
For any questions not addressed here or for specific assistance with your crawls, don’t hesitate to contact Glean support.