Crawling FAQ
Frequently asked questions about Glean connectors and datasource crawling.
Common Questions About Crawling
How long does an initial crawl take?
The initial crawl duration for any datasource depends on two key factors:
Datasource Size
The total volume of content, including the number of documents/messages and their individual sizes, directly impacts crawl time.
API Rate Limits
The datasource’s API rate limit affects how quickly Glean can retrieve items. Lower rate limits result in longer crawl times.
For typical enterprise datasources, initial crawls generally take between 3 and 10 days. Larger datasources and those with low API rate limits trend toward the longer end of this range.
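To see how the two factors interact, here is a rough back-of-envelope sketch. It is not how Glean schedules crawls; the function name `estimate_crawl_days` and the assumption that each item costs a fixed number of API requests at a steady rate are purely illustrative.

```python
SECONDS_PER_DAY = 86_400


def estimate_crawl_days(
    item_count: int,
    requests_per_second: float,
    requests_per_item: float = 1.0,  # assumption: one API call per item
) -> float:
    """Lower-bound estimate of crawl time when the API rate limit is the bottleneck."""
    if item_count < 0:
        raise ValueError("item_count must be non-negative")
    if requests_per_second <= 0 or requests_per_item <= 0:
        raise ValueError("rates must be positive")
    total_requests = item_count * requests_per_item
    return total_requests / requests_per_second / SECONDS_PER_DAY


# 5 million items at a sustained 10 requests/second
print(round(estimate_crawl_days(5_000_000, 10), 1))  # 5.8
# Halving the rate limit doubles the estimate
print(round(estimate_crawl_days(5_000_000, 5), 1))   # 11.6
```

Real crawls also spend time on retries, pagination, and permission lookups, so treat a figure like this as a floor rather than a forecast.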
Why is my crawl taking so long?
Crawl duration is determined primarily by the size of your datasource. Larger datasets and applications with low API rate limits take longer to process. If your crawl is taking longer than expected, contact Glean support for assistance.
Can I restrict what Glean crawls?
Some connectors (such as GitHub) let you configure restrictions in the UI during setup, but most datasources require help from Glean support to implement crawl restrictions. Available restriction methods include:
Time-based
Limit crawling to content created or accessed within a specific timeframe (e.g., last 6 months)
User-based
Restrict crawling to content from specified users
Group-based
Limit crawling to content from specific Active Directory (AD) groups
Site/Channel-based
Restrict crawling to specific sites or channels
Available restrictions depend on the datasource’s API capabilities. Most applications support both greenlisting (explicit inclusion) and redlisting (explicit exclusion).
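The restriction methods above can be pictured as a filter applied to each item before it is crawled. The sketch below is purely illustrative: `Document`, `CrawlRestrictions`, and `should_crawl` are hypothetical names, not part of any Glean API, and the rule that a redlist entry overrides a greenlist entry is an assumption. Group-based restrictions are omitted for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Document:  # hypothetical item metadata
    owner: str
    site: str
    last_activity_at: datetime


@dataclass
class CrawlRestrictions:  # hypothetical restriction settings
    max_age: Optional[timedelta] = None                        # time-based
    greenlisted_users: set[str] = field(default_factory=set)   # user-based
    redlisted_users: set[str] = field(default_factory=set)
    greenlisted_sites: set[str] = field(default_factory=set)   # site/channel-based
    redlisted_sites: set[str] = field(default_factory=set)


def _allowed(value: str, greenlist: set[str], redlist: set[str]) -> bool:
    # Assumption: redlist wins; an empty greenlist means "no inclusion filter".
    if value in redlist:
        return False
    return not greenlist or value in greenlist


def should_crawl(doc: Document, rules: CrawlRestrictions, now: datetime) -> bool:
    if rules.max_age is not None and now - doc.last_activity_at > rules.max_age:
        return False
    return (
        _allowed(doc.owner, rules.greenlisted_users, rules.redlisted_users)
        and _allowed(doc.site, rules.greenlisted_sites, rules.redlisted_sites)
    )


UTC = timezone.utc
NOW = datetime(2024, 7, 1, tzinfo=UTC)
rules = CrawlRestrictions(
    max_age=timedelta(days=180),  # roughly the last 6 months
    redlisted_users={"bot@example.com"},
    greenlisted_sites={"engineering", "support"},
)
docs = [
    Document("ana@example.com", "engineering", datetime(2024, 6, 1, tzinfo=UTC)),
    Document("ana@example.com", "engineering", datetime(2023, 1, 1, tzinfo=UTC)),
    Document("bot@example.com", "support", datetime(2024, 6, 15, tzinfo=UTC)),
    Document("raj@example.com", "marketing", datetime(2024, 6, 20, tzinfo=UTC)),
]
results = [should_crawl(d, rules, NOW) for d in docs]
print(results)  # [True, False, False, False]
```

Only the first document passes: the second is too old, the third belongs to a redlisted user, and the fourth lives on a site that is not greenlisted.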
What should I do if my crawl status shows errors?
Errors in your crawl status may indicate connectivity issues with your datasource or problems with the data itself. We recommend:
- Verifying your datasource configuration
- Contacting Glean support if issues persist
Can I crawl multiple datasources at the same time?
Yes, Glean supports crawling multiple datasources concurrently.
How can I monitor crawl status?
You can monitor crawl status under the Content crawling heading in the apps table:
- Job in progress: Indicates an active crawl
- Synced: Indicates a completed crawl
Detailed progress indicators and ETAs are not currently available.
What does the Job in progress status mean?
The Job in progress status indicates an active crawl of your datasource. Because full crawls typically take several days to complete, this status persists throughout the crawling process.
How do I delete a datasource?
Datasource deletion must be handled by Glean support. Contact Glean support for assistance with removing a datasource.
How do I stop or restart a crawl?
Crawl management operations must be performed by Glean support. Contact Glean support for assistance with stopping or restarting crawls.
For questions not addressed here, or for specific assistance with your crawls, contact Glean support.