
Crawling strategy

Overview

The Glean crawling system retrieves and indexes content from source applications. It balances two goals: minimizing the time between a change in the source and its appearance in Glean, and keeping API call volumes within limits that prevent source application overload.

The refresh patterns on this page are defaults based on typical deployments. Actual crawl timing in your environment depends on corpus size, API rate limits, webhook coverage, connector configuration, and overall crawl health. For the most accurate view of your deployment's crawl activity, use the Admin console.

warning

The values in the tables below are default refresh patterns, not deployment-specific guarantees. To monitor your deployment's actual crawl health, go to Admin console → Platform → Data sources and review Status, Items synced, Crawl rate, and Change rate. See Managing connectors for guidance on interpreting these metrics.

Configuration flexibility

info

All crawling frequencies described on this page are default settings. These values can be customized to meet specific organizational needs. Contact Glean support to adjust crawl frequencies for your deployment.

Organizations can fine-tune their crawling configuration in several ways:

Core concepts

Full Content Crawl

A comprehensive process that indexes the entire corpus of a datasource. These crawls are scheduled at regular intervals to ensure complete dataset accuracy in the search index.

Incremental Content Crawl

An efficient update strategy that fetches only content modified or added since the previous crawl, optimizing resource usage by avoiding full repository scans.

People Data

Organizational information about individuals, encompassing names, titles, email addresses, departmental affiliations, and other relevant attributes.

Identity Crawl

A specialized process for retrieving and updating identity-related information across various datasources.

Activity Crawl

A continuous monitoring process that tracks and indexes specific changes within a datasource, including content additions, updates, deletions, and permission modifications.

Update Rate

How often the system performs incremental fetches to refresh data from a source, so that the latest information is available in the index.
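The difference between a full content crawl and an incremental content crawl can be illustrated with a minimal sketch. This is not Glean's implementation: the `Doc` type, the in-memory `source` list, and the `modified_at` cursor are all hypothetical names chosen for illustration, and a real connector would ask the source API for changes since the cursor rather than filtering a list.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Doc:
    """Hypothetical document record: an ID plus a last-modified timestamp."""
    doc_id: str
    modified_at: datetime


def full_crawl(source: list[Doc]) -> dict[str, Doc]:
    """Rebuild the index from every document in the source."""
    return {doc.doc_id: doc for doc in source}


def incremental_crawl(index: dict[str, Doc], source: list[Doc], since: datetime) -> datetime:
    """Index only documents modified after `since` and return the new cursor."""
    cursor = since
    for doc in source:
        if doc.modified_at > since:
            index[doc.doc_id] = doc  # add new docs, overwrite changed ones
            cursor = max(cursor, doc.modified_at)
    return cursor
```

In this sketch the value returned by `incremental_crawl` would be passed as `since` on the next run, so each incremental pass only touches what changed after the previous one. Note that it never removes anything from the index, which is why deletions need separate handling.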

Default refresh patterns by connector

note

These tables present the default refresh patterns for all supported datasources. Time periods are denoted as:

  • d: days
  • h: hours
  • m: minutes

These values can be customized to meet specific organizational needs. Actual timing in your deployment may differ from these defaults depending on corpus size, API rate limits, and other factors.
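If you copy values from these tables into your own monitoring scripts, the d/h/m notation is easy to convert into durations. The helper below is a small sketch, not part of any Glean SDK: `parse_interval` is a hypothetical name, `N/A` is mapped to `None`, and a value such as `<5m` is treated as its upper bound (an assumption). Composite entries such as `7d (server) / 30d (cloud)` are deliberately rejected rather than guessed at.

```python
import re
from datetime import timedelta
from typing import Optional

# Map the table's unit suffixes to timedelta keyword arguments.
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}
_PATTERN = re.compile(r"^<?(\d+)([dhm])$")


def parse_interval(value: str) -> Optional[timedelta]:
    """Convert a table value such as '28d', '3h', or '<5m' to a timedelta."""
    value = value.strip()
    if value == "N/A":
        return None  # the connector has no such crawl type
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognized interval: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})
```

For example, `parse_interval("28d")` yields a 28-day `timedelta`, and `parse_interval("N/A")` yields `None`.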

Communication & collaboration platforms

| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GChat | 28d | 5m | 30m | N/A | N/A | N/A | |
| Microsoft Teams | 30d | 1h | 1h | N/A | N/A | N/A | |
| Slack | 28d | 3h | 1h | N/A | <5m | webhook | Incremental crawl addresses cases where webhook wasn't delivered |
| Slack Enterprise | 28d | 3h | 1h | 10m | 5m | N/A | SEG uses activity crawl to find channels with recent activity, then queues incremental crawls |
| Yammer | 1h | N/A | 1h | N/A | N/A | N/A | |
| Zoom | 7d | 6h | 1h | N/A | 6h | N/A | Indexes transcripts of Zoom cloud recordings and play URLs. Identity and recordings refresh hourly via API polling (no content webhooks), so new transcripts typically appear in Glean within about six hours after Zoom generates them. |

Document management & storage

| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Box | 28d | 10m | 1h | 1m | 10m | API | Uses Events API to identify new/modified/deleted docs |
| GDrive | 28d | 3h | 1h | 1m | 10m | API | Uses Reports API for 10-minute updates. Activity reports every 10m, with 12h recrawl for missed events |
| OneDrive/SharePoint | 28d | 1h | 10m | 10m | 10m | API | Uses User Insights API for 10-minute updates with hourly incremental fetch |
| Notion | 6h | N/A | 1h | N/A | N/A | N/A | |
| Quip | 28d | 10m | 10m | 10m | N/A | API | |

Development & project management

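| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Asana | 7d | 10m | 1h | 10m | 10m | webhook | |
| Bitbucket | 28d | 1h | 1h | N/A | N/A | N/A | |
| GitHub | 28d | 1h | 1h | N/A | N/A | webhook | |
| GitLab | 28d | 1h | 1h | N/A | <5m | webhook | PR/issue/comment updates trigger immediate crawls; hourly incremental crawls |
| Jira | 28d | 6h | 3h | 30m | <5m | webhook | Webhooks for issue/comment modifications |
| Trello | 1d | N/A | 30m | N/A | N/A | webhook | Requires Jira or Confluence setup |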

HR & people management

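| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Azure | N/A | N/A | 1h | N/A | N/A | N/A | Hourly people data crawl with additional hour for indexing |
| BambooHR | N/A | N/A | 1h | N/A | N/A | N/A | Hourly people data crawl with additional hour for indexing |
| Greenhouse | 28d | 1h | 10m | N/A | N/A | N/A | |
| Lattice | 1d | N/A | 1h | N/A | N/A | N/A | Hourly people data crawl with additional hour for indexing |
| Lever | 1h | 1h | 10m | N/A | N/A | N/A | |
| People Data API | N/A | N/A | 1h | N/A | 1h | API | Real-time updates via push API |
| Pingboard | N/A | N/A | 1h | N/A | N/A | N/A | |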

Productivity & knowledge management

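| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Aha | 1d | 1h | 1h | N/A | N/A | N/A | |
| Airtable | 1d | 1h | 1h | N/A | N/A | N/A | |
| Coda | 7d | N/A | 1h | N/A | N/A | N/A | |
| Confluence | 7d (server) / 30d (cloud) | 1h | 4h (server) / 8h (cloud) | N/A | 5m | webhook | New DC versions use webhooks for 5m updates; hourly fetch for older versions |
| Guru | 1d | 10m | 1h | 10m | 1h | N/A | |
| Miro | 1d | 1h | 30m | 10m | 1m | N/A | |
| SmartSheet | 1h | 10m | 10m | N/A | N/A | N/A | |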

Business applications

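| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Salesforce | 28d | 10m | 1h | N/A | 10m | N/A | |
| ServiceNow | 3d/30m* | 1h | 1h | 30m | 3d | N/A | *3d for Knowledge Articles, 30m for Catalog Items |
| Zendesk | 28d | 1h | 1h | N/A | 1h | N/A | |
| Freshservice | 1h | 10m | 10m | N/A | N/A | N/A | |
| Monday.com | 1h | N/A | 10m | N/A | N/A | N/A | |
| Pager Duty | 1h | 10m | 10m | 10m | N/A | API | |
| Affinity | 1h | N/A | 1h | N/A | 1h | N/A | Full crawls only (no incremental) for People, Companies, Lists, Opportunities, and schemas. Changes can take up to about one hour to appear; cadence is configurable. |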

Web & content management

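| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Brightspot | 1h | N/A | 10m | N/A | N/A | N/A | |
| Google Sites | 1d | 4h | N/A | N/A | N/A | N/A | |
| Web pages | 28d | N/A | N/A | N/A | 28d | N/A | Default 28-day crawl; configurable by Glean support |
| WordPress | 12h | 10m | N/A | N/A | N/A | N/A | |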

Additional services

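| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Docebo | 1h | N/A | 1h | N/A | N/A | N/A | |
| Egnyte | 28d | 10m | 1h | 1d | N/A | N/A | |
| Fifteen Five | 1h | N/A | 1h | N/A | N/A | N/A | |
| Figma | 30m | N/A | 30m | 10m | N/A | N/A | |
| Gmail | Monthly (every ~28–30 days) | Continuous, activity-based | N/A | Continuous | 0m | N/A | Activity-based updates via the Gmail History API (historyId). We poll for changes every few minutes and batch updates to refresh modified threads efficiently. |
| Gong | 12h | 5m | 1h | N/A | 5m | N/A | Calls, transcripts, and call contents are picked up by 5m incremental crawls |
| Google Groups | 7d | 1d | N/A | N/A | N/A | N/A | |
| Highspot | 1d | N/A | 1h | N/A | N/A | N/A | |
| Looker | 1h | N/A | 1h | N/A | N/A | N/A | External connector |
| LumApps | 1h | N/A | 10m | 10m | 10m | N/A | Hourly document fetch with corpus size-dependent processing |
| Okta | 3h | N/A | 3h | 1h | N/A | N/A | Additional hour for people data indexing |
| Seismic | 28d | 1d | 10m | 1d | N/A | API | |
| Simpplr | 28d | 1h | 1h | N/A | N/A | N/A | |
| Stack Overflow | 3h | N/A | 1h | N/A | 3h | N/A | |
| Tableau | 1d | 45m | 45m | N/A | N/A | N/A | |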

API services

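| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Push API for Content | N/A | N/A | N/A | N/A | N/A | N/A | Customer controls update frequency via bulk upload APIs and real-time CRUD operations |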

Content deletion handling

info

Glean handles content deletion through two primary mechanisms, depending on the capabilities of the source application.

API/Webhook Deletion

For applications that provide deletion notifications through APIs or webhooks, content is removed from the index immediately upon notification.

Full Crawl Cleanup

For applications without deletion notifications, stale content is identified and removed during scheduled full crawls.
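The full crawl cleanup described above amounts to a set difference: anything in the index that the latest full crawl did not see is treated as deleted. The sketch below illustrates that idea only. The function name and the dict-based index are hypothetical, and it assumes the full crawl completed successfully (a partial crawl would wrongly mark unseen documents as stale).

```python
def apply_full_crawl_cleanup(index: dict[str, str], crawled: dict[str, str]) -> set[str]:
    """Remove index entries missing from a completed full crawl.

    `index` and `crawled` map document IDs to content. Returns the IDs
    that were removed as stale.
    """
    stale = set(index) - set(crawled)  # indexed but no longer in the source
    for doc_id in stale:
        del index[doc_id]
    index.update(crawled)  # refresh surviving docs and add new ones
    return stale
```

Because stale entries are only discovered when a full crawl runs, the worst-case delay before a deletion is reflected in this model is the connector's full crawl interval.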

warning

Deletion of derived information used in models and other auxiliary systems is governed by Glean's privacy policy.

Support for real-time updates

Webhook Integration

Many modern platforms support webhook-based updates, enabling near real-time content synchronization. For example:

  • GitHub/GitLab: Updates within 5 minutes of changes
  • Slack: Near-immediate updates via webhook
  • Confluence: 5-minute update window for newer versions

note

For datasources that don't support webhooks, Glean implements regular polling with optimized intervals to maintain data freshness while respecting API limits.
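One way to reason about the trade-off between freshness and API limits is to compute the shortest polling interval a rate limit allows, then never poll faster than that. The sketch below is a simplified model, not how Glean selects its intervals: the function names, the per-hour call budget, and the assumption that every poll costs a fixed number of API calls are all illustrative.

```python
def min_poll_interval_seconds(calls_per_poll: int, calls_per_hour_limit: int) -> int:
    """Shortest interval (in seconds) that keeps polling within the hourly budget."""
    if calls_per_poll <= 0 or calls_per_hour_limit <= 0:
        raise ValueError("calls_per_poll and calls_per_hour_limit must be positive")
    # Integer ceiling of 3600 * calls_per_poll / calls_per_hour_limit.
    return (3600 * calls_per_poll + calls_per_hour_limit - 1) // calls_per_hour_limit


def choose_poll_interval_seconds(
    target_seconds: int, calls_per_poll: int, calls_per_hour_limit: int
) -> int:
    """Use the desired interval unless the rate limit forces a longer one."""
    floor = min_poll_interval_seconds(calls_per_poll, calls_per_hour_limit)
    return max(target_seconds, floor)
```

With a budget of 1,000 calls per hour and 50 calls per poll, the floor in this model is 180 seconds, so a 10-minute target is kept as is while a 1-minute target is stretched to 3 minutes.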