Overview

The Glean crawling system retrieves content from connected source applications and keeps the search index up to date. The system is built around two primary optimization goals:

  1. Minimizing latency when incorporating updates from source applications
  2. Maintaining API call volumes within acceptable limits to prevent source application overload

Configuration Flexibility

All crawling frequencies described in this documentation represent default settings based on customer feedback. These values can be customized to meet specific organizational needs.

Organizations can fine-tune their crawling configuration in several ways:

Core Concepts

Full Content Crawl
Definition

A comprehensive process that indexes the entire corpus of a datasource. These crawls are scheduled at regular intervals to ensure complete dataset accuracy in the search index.

Incremental Content Crawl
Definition

An efficient update strategy that focuses on modified or newly added content since the previous crawl, optimizing resource usage by avoiding full repository scans.
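As a rough illustration of the idea, the sketch below selects only the documents modified after the previous crawl. It is not Glean's implementation: the `Document` class and `select_for_incremental_crawl` function are hypothetical names, and in practice the filtering is usually done by the source application's API (for example, a modified-since query) rather than in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a record returned by a datasource."""
    doc_id: str
    modified_at: datetime


def select_for_incremental_crawl(docs, last_crawl_at):
    """Return documents created or modified after the previous crawl."""
    return [d for d in docs if d.modified_at > last_crawl_at]


last_crawl = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
corpus = [
    Document("a", datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)),  # unchanged
    Document("b", datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)),  # modified
]
changed = select_for_incremental_crawl(corpus, last_crawl)
print([d.doc_id for d in changed])  # prints ['b']
```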

People Data
Definition

Organizational information about individuals, encompassing names, titles, email addresses, departmental affiliations, and other relevant attributes.

Identity Crawl
Definition

A specialized process for retrieving and updating identity-related information across various datasources.

Activity Crawl
Definition

A continuous monitoring process that tracks and indexes specific changes within a datasource, including content additions, updates, deletions, and permission modifications.
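The four kinds of change listed above can be pictured as events applied to an index. The following is a minimal sketch under stated assumptions: the event shape (`type`, `doc_id`, `content`, `allowed_users`), the `Activity` enum, and the in-memory dictionaries are all hypothetical and do not reflect any real datasource API or Glean's internal design.

```python
from enum import Enum


class Activity(Enum):
    """Hypothetical activity types, mirroring the changes named above."""
    ADDED = "added"
    UPDATED = "updated"
    DELETED = "deleted"
    PERMISSIONS_CHANGED = "permissions_changed"


def apply_activity(index: dict, acl: dict, event: dict) -> None:
    """Apply one activity event to an in-memory content index and ACL map."""
    kind = Activity(event["type"])
    doc_id = event["doc_id"]
    if kind in (Activity.ADDED, Activity.UPDATED):
        index[doc_id] = event["content"]
    elif kind is Activity.DELETED:
        # Remove both the content and its permissions.
        index.pop(doc_id, None)
        acl.pop(doc_id, None)
    elif kind is Activity.PERMISSIONS_CHANGED:
        acl[doc_id] = set(event["allowed_users"])


index, acl = {}, {}
events = [
    {"type": "added", "doc_id": "d1", "content": "draft"},
    {"type": "permissions_changed", "doc_id": "d1", "allowed_users": ["ana"]},
    {"type": "updated", "doc_id": "d1", "content": "final"},
    {"type": "added", "doc_id": "d2", "content": "notes"},
    {"type": "deleted", "doc_id": "d2"},
]
for event in events:
    apply_activity(index, acl, event)
print(index, acl)
```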

Data Refresh Rates By Category

The tables below list the default refresh rates for all supported datasources, grouped by category. Time periods are denoted as:

  • d: days
  • h: hours
  • m: minutes
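For readers who want to process these values programmatically, the sketch below converts an entry written in this notation into a `timedelta`. The function name `parse_refresh_rate` is an assumption made for illustration. It treats a leading `<` as an upper bound and ignores it, returns `None` for `N/A`, and deliberately rejects compound entries such as `3d/30m*`, which need case-by-case handling.

```python
import re
from datetime import timedelta
from typing import Optional

# Map the unit letters used in the tables to timedelta keyword names.
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}
_PATTERN = re.compile(r"^<?(\d+)([dhm])$")


def parse_refresh_rate(value: str) -> Optional[timedelta]:
    """Convert an entry such as '28d', '3h' or '<5m' to a timedelta.

    Returns None for 'N/A'. A leading '<' is ignored, so '<5m'
    parses as 5 minutes.
    """
    text = value.strip()
    if text.upper() == "N/A":
        return None
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized refresh rate: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


print(parse_refresh_rate("28d"))
print(parse_refresh_rate("<5m"))
print(parse_refresh_rate("N/A"))
```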

Communication & Collaboration Platforms

| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
|---|---|---|---|---|---|---|---|
| GChat | 28d | 5m | 30m | N/A | N/A | N/A | |
| Microsoft Teams | 30d | 1h | 1h | N/A | N/A | N/A | |
| Slack | 28d | 3h | 1h | N/A | <5m | webhook | Incremental crawl addresses cases where webhook wasn’t delivered |
| Slack Enterprise | 28d | N/A | 1h | 10m | 5m | N/A | SEG uses activity crawl to find channels with recent activity, then queues incremental crawls |
| Yammer | 1h | N/A | 1h | N/A | N/A | N/A | |

Document Management & Storage

| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
|---|---|---|---|---|---|---|---|
| Box | 28d | 10m | 1h | 1m | 10m | API | Uses Events API to identify new/modified/deleted docs |
| GDrive | 28d | 3h | 1h | 1m | 10m | API | Uses Reports API for 10-minute updates. Activity reports every 10m, with 12h recrawl for missed events |
| OneDrive/SharePoint | 28d | 1h | 10m | 10m | 10m | API | Uses User Insights API for 10-minute updates with hourly incremental fetch |
| Notion | 6h | N/A | 1h | N/A | N/A | N/A | |
| Quip | 28d | 10m | 10m | 10m | N/A | API | |

Development & Project Management

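| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
|---|---|---|---|---|---|---|---|
| Asana | 7d | 10m | 1h | 10m | 10m | webhook | |
| Bitbucket | 28d | 10m | 1h | N/A | N/A | N/A | |
| GitHub | 28d | 10m | 10m | N/A | N/A | webhook | |
| GitLab | 28d | 10m | 1h | N/A | <5m | webhook | PR/issue/comment updates trigger immediate crawls; 10m incremental crawls |
| Jira | 7d | 3h | 10m | 30m | <5m | webhook | Webhooks for issue/comment modifications |
| Trello | 1d | N/A | 30m | N/A | N/A | webhook | Requires Jira or Confluence setup |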

HR & People Management

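| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
|---|---|---|---|---|---|---|---|
| Azure | N/A | N/A | 1h | N/A | N/A | N/A | Hourly people data crawl with additional hour for indexing |
| BambooHR | N/A | N/A | 1h | N/A | N/A | N/A | Hourly people data crawl with additional hour for indexing |
| Greenhouse | 28d | 1h | 10m | N/A | N/A | N/A | |
| Lattice | 1d | N/A | 1h | N/A | N/A | N/A | Hourly people data crawl with additional hour for indexing |
| Lever | 1h | 1h | 10m | N/A | N/A | N/A | |
| People Data API | N/A | N/A | 1h | N/A | 1h | API | Real-time updates via push API |
| Pingboard | N/A | N/A | 1h | N/A | N/A | N/A | |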

Productivity & Knowledge Management

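| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
|---|---|---|---|---|---|---|---|
| Aha | 1d | 1h | 1h | N/A | N/A | N/A | |
| Airtable | 1d | 1h | 1h | N/A | N/A | N/A | |
| Coda | 7d | N/A | 1h | N/A | N/A | N/A | |
| Confluence | 7d | 1h | 10m | N/A | 5m | webhook | New DC versions use webhooks for 5m updates; hourly fetch for older versions |
| Guru | 1d | 10m | 1h | 10m | 1h | N/A | |
| Miro | 1d | 1h | 30m | 10m | 1m | N/A | |
| SmartSheet | 1h | 10m | 10m | N/A | N/A | N/A | |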

Business Applications

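| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
|---|---|---|---|---|---|---|---|
| Salesforce | 28d | 10m | 1h | N/A | 10m | N/A | |
| ServiceNow | 3d/30m* | 1h | 1h | 30m | 3d | N/A | *3d for Knowledge Articles, 30m for Catalog Items |
| Zendesk | 28d | 1h | 1h | N/A | 1h | N/A | |
| Freshservice | 1h | N/A | 10m | N/A | N/A | N/A | |
| Monday.com | 1h | N/A | 10m | N/A | N/A | N/A | |
| Pager Duty | 1h | 10m | 10m | 10m | N/A | API | |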

Web & Content Management

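| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
|---|---|---|---|---|---|---|---|
| Brightspot | 1h | N/A | 10m | N/A | N/A | N/A | |
| Google Sites | 1d | 4h | N/A | N/A | N/A | N/A | |
| Web pages | 1d | N/A | N/A | N/A | 1d | N/A | Default daily crawl; fully configurable |
| Wordpress | 12h | 10m | N/A | N/A | N/A | N/A | |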

Additional Services

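| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
|---|---|---|---|---|---|---|---|
| Docebo | 1h | N/A | 1h | N/A | N/A | N/A | |
| Egnyte | 1d | 10m | 1h | N/A | N/A | N/A | |
| Fifteen Five | 1h | N/A | 1h | N/A | N/A | N/A | |
| Figma | 30m | N/A | 30m | N/A | N/A | N/A | |
| Gmail | N/A | N/A | N/A | N/A | 0m | N/A | Real-time updates via federated API |
| Gong | 1h | N/A | 1h | N/A | N/A | N/A | |
| Google Groups | 7d | 1d | N/A | N/A | N/A | N/A | |
| Highspot | 4h | N/A | 1h | N/A | N/A | N/A | |
| Looker | 1h | N/A | 1h | N/A | N/A | N/A | External connector |
| LumApps | 1h | N/A | 10m | 10m | 1m | N/A | Hourly document fetch with corpus size-dependent processing |
| Okta | 3h | N/A | 3h | 1h | N/A | N/A | Additional hour for people data indexing |
| Seismic | 28d | 1d | 10m | 1d | N/A | API | |
| Simpplr | 28d | 1h | 1h | N/A | N/A | N/A | |
| Stack Overflow | 2h | N/A | 1h | N/A | 3h | N/A | |
| Tableau | 1d | 45m | 45m | N/A | N/A | N/A | |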

API Services

| Platform | Full Crawl | Incremental | People Data | Activity | Update Rate | Webhook | Notes |
|---|---|---|---|---|---|---|---|
| Push API for Content | N/A | N/A | N/A | N/A | N/A | N/A | Customer controls update frequency via bulk upload APIs and real-time CRUD operations |

Content Deletion Handling

Glean handles content deletion through two primary mechanisms, depending on the capabilities of the source application.

API/Webhook Deletion

For applications that provide deletion notifications through APIs or webhooks, content is removed from the index immediately upon notification.

Full Crawl Cleanup

For applications without deletion notifications, stale content is identified and removed during scheduled full crawls.
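The two mechanisms above can be contrasted in a short sketch. This is an illustration under simplifying assumptions, not Glean's implementation: the index is modeled as a plain dictionary keyed by document ID, and both function names are hypothetical.

```python
def handle_deletion_event(index: dict, doc_id: str) -> None:
    """API/webhook path: drop the document as soon as the source reports it."""
    index.pop(doc_id, None)


def full_crawl_cleanup(index: dict, seen_in_crawl: set) -> set:
    """Full-crawl path: remove indexed documents the crawl no longer found.

    Returns the set of stale document IDs that were removed.
    """
    stale = set(index) - seen_in_crawl
    for doc_id in stale:
        del index[doc_id]
    return stale


index = {"a": "alpha", "b": "beta", "c": "gamma"}

# The source notifies us that "a" was deleted.
handle_deletion_event(index, "a")

# A later full crawl only encounters "b", so "c" is treated as stale.
removed = full_crawl_cleanup(index, seen_in_crawl={"b"})
print(sorted(index), sorted(removed))  # prints ['b'] ['c']
```

The practical difference is latency: the first path removes content as soon as the notification arrives, while the second can only act once the next scheduled full crawl completes.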

Deletion of derived information used in models and other auxiliary systems is governed by Glean’s privacy policy.

Support for Real-time Updates

Webhook Integration

Many modern platforms support webhook-based updates, enabling near real-time content synchronization. For example:

  • GitHub/GitLab: Updates within 5 minutes of changes
  • Slack: Near-immediate updates via webhook
  • Confluence: 5-minute update window for newer versions

For datasources that don’t support webhooks, Glean implements regular polling with optimized intervals to maintain data freshness while respecting API limits.
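One way to reason about a polling interval that respects an API limit is to compute the shortest interval the call budget allows and never poll faster than that. The sketch below is a simplified model with hypothetical names and numbers (`min_polling_interval`, 200 calls per poll, 600 calls per hour). It is not how Glean derives its defaults.

```python
from datetime import timedelta


def min_polling_interval(calls_per_poll: int,
                         max_calls_per_hour: int,
                         desired: timedelta) -> timedelta:
    """Return the desired interval, lengthened if needed to fit the API budget."""
    if calls_per_poll <= 0 or max_calls_per_hour <= 0:
        raise ValueError("call counts must be positive")
    # Shortest interval at which polling stays within the hourly budget.
    floor = timedelta(hours=1) * calls_per_poll / max_calls_per_hour
    return max(desired, floor)


# Hypothetical numbers: each poll costs 200 calls, the source allows 600/hour.
print(min_polling_interval(200, 600, desired=timedelta(minutes=10)))
print(min_polling_interval(200, 600, desired=timedelta(hours=1)))
```

With these assumed figures, a desired 10-minute interval is stretched to 20 minutes because three polls per hour is all the budget allows, while a desired 1-hour interval is already within budget and is returned unchanged.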