Crawling Strategy

Overview

The Glean crawling system retrieves data from source applications and keeps the search index up to date. It is built around two primary optimization goals:

Minimizing latency when incorporating updates from source applications
Maintaining API call volumes within acceptable limits to prevent source application overload

Configuration Flexibility

All crawling frequencies described in this documentation represent default settings based on customer feedback. These values can be customized to meet specific organizational needs.

Organizations can fine-tune their crawling configuration in several ways:

API Call Rate Management

Time-Based Controls

Core Concepts

Full Content Crawl

Definition

A comprehensive process that indexes the entire corpus of a datasource. These crawls are scheduled at regular intervals to ensure complete dataset accuracy in the search index.

Incremental Content Crawl

Definition

An efficient update strategy that focuses on modified or newly added content since the previous crawl, optimizing resource usage by avoiding full repository scans.
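As a rough illustration of the idea, the sketch below selects only documents whose modification time is later than the previous crawl. The `Document` type, its fields, and `select_for_incremental_crawl` are hypothetical names chosen for this example; they are not part of Glean's connectors or any source application's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Document:
    # Hypothetical minimal document record used only for this sketch.
    doc_id: str
    modified_at: datetime


def select_for_incremental_crawl(documents, last_crawl_time):
    """Return only documents added or modified after the previous crawl."""
    return [doc for doc in documents if doc.modified_at > last_crawl_time]


last_crawl = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
corpus = [
    Document("a", datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)),  # unchanged since last crawl
    Document("b", datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)),  # modified after last crawl
]
changed = select_for_incremental_crawl(corpus, last_crawl)
```

In practice a source application would usually expose this filter server-side (for example through a "modified since" query), so the full corpus never has to be scanned; the list comprehension here only stands in for that filter.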

People Data

Definition

Organizational information about individuals, encompassing names, titles, email addresses, departmental affiliations, and other relevant attributes.

Identity Crawl

Definition

A specialized process for retrieving and updating identity-related information across various datasources.

Activity Crawl

Definition

A continuous monitoring process that tracks and indexes specific changes within a datasource, including content additions, updates, deletions, and permission modifications.

Data Refresh Rates By Category

These tables present the complete refresh rate information for all supported datasources. Time periods are denoted as:

d: days
h: hours
m: minutes
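For readers who want to process the tables programmatically, the following minimal sketch converts this notation into `datetime.timedelta` values. The function name `parse_interval` is an assumption made for this example. Cells that are not a plain number followed by a unit (such as `N/A`, `<5m`, or `3d/30m*`) are deliberately returned as `None` rather than guessed at.

```python
import re
from datetime import timedelta

# Maps the unit letters used in the tables to timedelta keyword arguments.
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}
_INTERVAL = re.compile(r"(\d+)([dhm])")


def parse_interval(value):
    """Parse values such as '28d', '3h', or '10m'; return None for anything else."""
    match = _INTERVAL.fullmatch(value.strip())
    if match is None:
        return None
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})
```

For example, `parse_interval("28d")` yields a 28-day `timedelta`, while `parse_interval("N/A")` yields `None`.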

Communication & Collaboration Platforms

Platform	Full Crawl	Incremental	People Data	Activity	Update Rate	Webhook	Notes
GChat	28d	5m	30m	N/A	N/A	N/A
Microsoft Teams	30d	1h	1h	N/A	N/A	N/A
Slack	28d	3h	1h	N/A	<5m	webhook	Incremental crawl addresses cases where webhook wasn’t delivered
Slack Enterprise	28d	N/A	1h	10m	5m	N/A	Slack Enterprise Grid (SEG) uses activity crawl to find channels with recent activity, then queues incremental crawls
Yammer	1h	N/A	1h	N/A	N/A	N/A

Document Management & Storage

Platform	Full Crawl	Incremental	People Data	Activity	Update Rate	Webhook	Notes
Box	28d	10m	1h	1m	10m	API	Uses Events API to identify new/modified/deleted docs
GDrive	28d	3h	1h	1m	10m	API	Uses Reports API for 10-minute updates. Activity reports every 10m, with 12h recrawl for missed events
OneDrive/SharePoint	28d	1h	10m	10m	10m	API	Uses User Insights API for 10-minute updates with hourly incremental fetch
Notion	6h	N/A	1h	N/A	N/A	N/A
Quip	28d	10m	10m	10m	N/A	API

Development & Project Management

Platform	Full Crawl	Incremental	People Data	Activity	Update Rate	Webhook	Notes
Asana	7d	10m	1h	10m	10m	webhook
Bitbucket	28d	10m	1h	N/A	N/A	N/A
GitHub	28d	10m	10m	N/A	N/A	webhook
GitLab	28d	10m	1h	N/A	<5m	webhook	PR/issue/comment updates trigger immediate crawls; 10m incremental crawls
Jira	7d	3h	10m	30m	<5m	webhook	Webhooks for issue/comment modifications
Trello	1d	N/A	30m	N/A	N/A	webhook	Requires Jira or Confluence setup

HR & People Management

Platform	Full Crawl	Incremental	People Data	Activity	Update Rate	Webhook	Notes
Azure	N/A	N/A	1h	N/A	N/A	N/A	Hourly people data crawl with additional hour for indexing
BambooHR	N/A	N/A	1h	N/A	N/A	N/A	Hourly people data crawl with additional hour for indexing
Greenhouse	28d	1h	10m	N/A	N/A	N/A
Lattice	1d	N/A	1h	N/A	N/A	N/A	Hourly people data crawl with additional hour for indexing
Lever	1h	1h	10m	N/A	N/A	N/A
People Data API	N/A	N/A	1h	N/A	1h	API	Real-time updates via push API
Pingboard	N/A	N/A	1h	N/A	N/A	N/A

Productivity & Knowledge Management

Platform	Full Crawl	Incremental	People Data	Activity	Update Rate	Webhook	Notes
Aha	1d	1h	1h	N/A	N/A	N/A
Airtable	1d	1h	1h	N/A	N/A	N/A
Coda	7d	N/A	1h	N/A	N/A	N/A
Confluence	7d	1h	10m	N/A	5m	webhook	New DC versions use webhooks for 5m updates; hourly fetch for older versions
Guru	1d	10m	1h	10m	1h	N/A
Miro	1d	1h	30m	10m	1m	N/A
Smartsheet	1h	10m	10m	N/A	N/A	N/A

Business Applications

Platform	Full Crawl	Incremental	People Data	Activity	Update Rate	Webhook	Notes
Salesforce	28d	10m	1h	N/A	10m	N/A
ServiceNow	3d/30m*	1h	1h	30m	3d	N/A	*3d for Knowledge Articles, 30m for Catalog Items
Zendesk	28d	1h	1h	N/A	1h	N/A
Freshservice	1h	N/A	10m	N/A	N/A	N/A
Monday.com	1h	N/A	10m	N/A	N/A	N/A
PagerDuty	1h	10m	10m	10m	N/A	API

Web & Content Management

Platform	Full Crawl	Incremental	People Data	Activity	Update Rate	Webhook	Notes
Brightspot	1h	N/A	10m	N/A	N/A	N/A
Google Sites	1d	4h	N/A	N/A	N/A	N/A
Web pages	1d	N/A	N/A	N/A	1d	N/A	Default daily crawl; fully configurable
WordPress	12h	10m	N/A	N/A	N/A	N/A

Additional Services

Platform	Full Crawl	Incremental	People Data	Activity	Update Rate	Webhook	Notes
Docebo	1h	N/A	1h	N/A	N/A	N/A
Egnyte	1d	10m	1h	N/A	N/A	N/A
Fifteen Five	1h	N/A	1h	N/A	N/A	N/A
Figma	30m	N/A	30m	N/A	N/A	N/A
Gmail	N/A	N/A	N/A	N/A	0m	N/A	Real-time updates via federated API
Gong	1h	N/A	1h	N/A	N/A	N/A
Google Groups	7d	1d	N/A	N/A	N/A	N/A
Highspot	4h	N/A	1h	N/A	N/A	N/A
Looker	1h	N/A	1h	N/A	N/A	N/A	External connector
LumApps	1h	N/A	10m	10m	1m	N/A	Hourly document fetch with corpus size-dependent processing
Okta	3h	N/A	3h	1h	N/A	N/A	Additional hour for people data indexing
Seismic	28d	1d	10m	1d	N/A	API
Simpplr	28d	1h	1h	N/A	N/A	N/A
Stack Overflow	2h	N/A	1h	N/A	3h	N/A
Tableau	1d	45m	45m	N/A	N/A	N/A

API Services

Platform	Full Crawl	Incremental	People Data	Activity	Update Rate	Webhook	Notes
Push API for Content	N/A	N/A	N/A	N/A	N/A	N/A	Customer controls update frequency via bulk upload APIs and real-time CRUD operations

Content Deletion Handling

Glean handles content deletion through two primary mechanisms, depending on the capabilities of the source application.

API/Webhook Deletion

For applications that provide deletion notifications through APIs or webhooks, content is removed from the index immediately upon notification.

Full Crawl Cleanup

For applications without deletion notifications, stale content is identified and removed during scheduled full crawls.
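One simple way to picture full crawl cleanup is as a set difference: anything that is in the index but was not seen during the latest full crawl is treated as stale. The sketch below assumes documents are identified by stable string IDs; `find_stale_ids` is a hypothetical helper for illustration and does not describe Glean's internal implementation.

```python
def find_stale_ids(indexed_ids, crawled_ids):
    """Return IDs present in the index but absent from the latest full crawl."""
    return set(indexed_ids) - set(crawled_ids)


# IDs currently in the search index (illustrative values).
indexed = {"doc-1", "doc-2", "doc-3"}
# IDs returned by the most recent full crawl; doc-2 was deleted at the source.
crawled = {"doc-1", "doc-3"}

stale = find_stale_ids(indexed, crawled)
```

Because this comparison only happens when a full crawl completes, a deleted document can remain searchable for up to one full crawl interval in datasources that provide no deletion notifications.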

Deletion of derived information used in models and other auxiliary systems is governed by Glean’s privacy policy.

Support for Real-time Updates

Webhook Integration

Many modern platforms support webhook-based updates, enabling near real-time content synchronization. For example:

GitHub/GitLab: Updates within 5 minutes of changes
Slack: Near immediate updates via webhook
Confluence: 5-minute update window for newer versions

For datasources that don’t support webhooks, Glean implements regular polling with optimized intervals to maintain data freshness while respecting API limits.
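The trade-off between freshness and API limits can be illustrated with a small calculation: given an hourly API call budget and the number of calls a single poll consumes, there is a shortest interval that stays within budget. The sketch below is an assumption-laden illustration only. The function name, its parameters, and the budget figures are all hypothetical and do not reflect how Glean actually chooses its intervals.

```python
from datetime import timedelta


def polling_interval(desired, calls_per_poll, max_calls_per_hour):
    """Return the desired interval, lengthened if needed to stay within the API budget."""
    if calls_per_poll <= 0 or max_calls_per_hour <= 0:
        raise ValueError("calls_per_poll and max_calls_per_hour must be positive")
    # Shortest interval at which polls of this size fit inside the hourly budget.
    budget_floor = timedelta(hours=1) * calls_per_poll / max_calls_per_hour
    return max(desired, budget_floor)


# A poll costing 50 calls fits a 600 calls/hour budget at a 10-minute interval.
light = polling_interval(timedelta(minutes=10), calls_per_poll=50, max_calls_per_hour=600)
# A poll costing 200 calls does not, so the interval is stretched.
heavy = polling_interval(timedelta(minutes=10), calls_per_poll=200, max_calls_per_hour=600)
```

Here `light` stays at the requested 10 minutes, while `heavy` is stretched to 20 minutes so that three polls per hour consume the full 600-call budget.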
