The Glean Web Connector enables you to crawl and index both internally and externally hosted websites as dedicated Glean data sources. When you connect a website, Glean discovers and indexes its pages either through a sitemap or by crawling from a set of seed URLs. This allows you to search and access web content alongside your other organizational data within the Glean platform.

Supported Features and Limitations

The Glean Web Connector functions as a configurable web crawler that fetches content from websites and makes it discoverable in Glean. It is ideal for bringing in content from sources that lack a native product connector.

Supported Objects/Entities

  • HTML web pages (via HTTP or HTTPS)
  • Content reachable through crawled links or listed in provided sitemaps

Supported API Endpoints/Features

  • HTTP(S) endpoints for public or authenticated web content
  • Supports standard authentication schemes (illustrated in the sketch after this list):
    • Basic or bearer authentication
    • Custom headers (e.g., for API tokens)
    • Cookies
    • NTLMv2 (for Windows-authenticated resources)
  • Optional Client-Side Rendering (CSR) using Chrome to extract JavaScript-generated content
  • Respects robots.txt for crawl compliance
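
As a rough illustration of these schemes (not Glean's internal crawler code), the Python sketch below shows how each credential type maps onto an HTTP request. The URL, account names, and tokens are placeholders, and the NTLMv2 case relies on the third-party requests-ntlm package.

```python
import requests
from requests.auth import HTTPBasicAuth
from requests_ntlm import HttpNtlmAuth  # third-party: pip install requests-ntlm

url = "https://intranet.example.com/docs/"  # placeholder URL

# Basic authentication (username/password)
resp = requests.get(url, auth=HTTPBasicAuth("svc-crawler", "secret"))

# Bearer token supplied as a custom header
resp = requests.get(url, headers={"Authorization": "Bearer <token>"})

# Session cookie for cookie-based authentication
resp = requests.get(url, cookies={"SESSIONID": "abc123"})

# NTLMv2 for Windows-authenticated resources (domain\username + password)
resp = requests.get(url, auth=HttpNtlmAuth("CORP\\svc-crawler", "secret"))

print(resp.status_code, resp.headers.get("Content-Type"))
```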

Limitations

  • Unstructured or highly dynamic web pages (e.g., sites with complex JavaScript) may require CSR mode, but there is limited support for advanced dynamic behaviors or web animations
  • Does not support crawling password-protected content without proper credentials
  • Some edge-case sitemaps with atypical formatting may not be processed
  • Dynamic web page indexing is not supported for:
    • AWS deployments
    • Non-public websites
    • On-prem websites accessed via VPN
  • The connector does not enforce source system permissions on indexed data; access controls should be managed at the source

Crawling Strategy

| Crawl type | Full Crawl | Incremental Crawl | People Data | Activity | Update Rate | Webhook | Notes |
|---|---|---|---|---|---|---|---|
| Seed Crawl | Yes | No | No | No | Configurable (default weekly) | No | Uses user-specified seed URLs and regex filters |
| Sitemap Crawl | Yes | Partial | No | No | Configurable (default weekly) | No | Uses sitemap.xml for discovery |
| Client-Side Rendering | Yes | Partial | No | No | Configurable (default weekly) | No | For JavaScript-heavy sites; requires enabling CSR |
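
For a concrete sense of the Seed Crawl row, the sketch below performs a breadth-first crawl from seed URLs and keeps only links matching an allow regex, loosely mirroring the connector's URL filters. The seed URL and pattern are hypothetical; Glean's managed crawler is of course more elaborate.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def seed_crawl(seeds, allow_pattern, max_pages=100):
    """Breadth-first crawl from seed URLs, keeping only links that
    match the allow regex (analogous to the connector's URL filters)."""
    allowed = re.compile(allow_pattern)
    queue, seen, pages = deque(seeds), set(seeds), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        resp = requests.get(url, timeout=10)
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue  # skip non-HTML resources
        pages[url] = resp.text
        extractor = LinkExtractor()
        extractor.feed(resp.text)
        for href in extractor.links:
            absolute = urljoin(url, href)
            if allowed.match(absolute) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Hypothetical seed URL and allow pattern
pages = seed_crawl(["https://www.example.com/docs/"],
                   r"https://www\.example\.com/docs/.*")
print(len(pages), "pages fetched")
```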

Results Display

When you search Glean, results from sites indexed with the Web Connector are shown with a separate data source label and (optionally) a site-specific branded icon. The content preview and metadata reflect the crawled page structure.

Requirements

The following requirements must be met before you begin configuring the Glean Web Connector.

Technical Requirements

  • The target website must be accessible from the Glean platform (either public or via appropriate network configuration)
  • Supported protocols: HTTP or HTTPS
  • For internal/private websites, one of the following network configurations is required:
    • VPN linking your private network to Glean’s environment
    • Shared VPC on Google Cloud
    • DNAT/static IP whitelisting on your firewall
  • If Client-Side Rendering is needed, ensure the site is compatible with headless Chrome for JavaScript execution (see the rendering check sketched after this list)
  • No special hardware or software is needed; all crawling is performed within the Glean cloud environment
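
One way to verify headless-Chrome compatibility, as a pre-flight check rather than part of the connector itself, is to compare the raw HTML with what Chrome renders. The sketch below assumes the Playwright package and a placeholder URL.

```python
import requests
from playwright.sync_api import sync_playwright

url = "https://app.example.com/dashboard"  # placeholder URL

# Raw HTML as a plain crawler would see it (no JavaScript execution)
raw_html = requests.get(url, timeout=10).text

# Rendered HTML after headless Chrome executes the page's JavaScript
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")
    rendered_html = page.content()
    browser.close()

# A large size difference suggests the site needs CSR to index its content
print(len(raw_html), len(rendered_html))
```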

Credential Requirements

To crawl protected content, you must provide appropriate credentials using one of the following supported methods:
  • Basic authentication (username/password)
  • Bearer tokens or custom headers
  • Cookies (for session-based authentication)
  • NTLMv2 (requires domain, username, and password)
Credentials should be securely stored and managed through Glean’s admin UI. How to provide these is detailed in the setup instructions below.

Permission Requirements

The Web Connector itself does not enforce or propagate fine-grained document permissions from the source website. Any web page accessible based on provided credentials will be indexed and made discoverable to users in Glean as configured. For internal-only websites, ensure only authorized users receive access to the Glean data source containing this content.

Preliminary Source/System Setup

  • For public sites: confirm stable access to all page URLs and sitemaps intended for indexing
  • For private sites: coordinate with your IT/network team to set up VPN/VPC routing or static IP allowlists
  • If authentication is required, validate accounts, tokens, or cookies before beginning configuration
  • Optionally, select or source a branded icon (favicon) for visual identification in search results

Permissions & Security

Data and Metadata Ingested:
The connector collects all HTML content and associated metadata (e.g., titles, URLs), limited to pages that are reachable from the configured seeds or sitemaps and permitted by the credential scope. No people or activity data is ingested from websites, and the connector does not fetch or propagate per-user permissions for crawled content.
Permission Propagation Logic:
All users with access to the Glean data source can search and view indexed pages, subject to the crawl configuration and any source-side authentication applied during crawl.
Security & Compliance Notes:
All data transfer is encrypted in transit. Credentials are stored securely in Glean. The connector respects robots.txt unless explicitly configured to ignore it (see the compliance-check sketch at the end of this section). Data is isolated within the customer’s tenant; private data is protected through the initial credentialing and network configuration steps.
Known Security Restrictions:
  • The connector does not support per-user source system permissions mapping.
  • On multi-instance or highly restricted networks, connectivity may require IT/admin coordination.
Data Privacy Implications:
All indexed website content becomes searchable internally to your organization, based on the access you configure for the data source in Glean. Review internal compliance policies before enabling crawls of sensitive internal web properties.
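
As a hedged illustration of the robots.txt compliance noted above, Python’s standard-library robotparser can check whether a given user agent may fetch a URL. The site and the user-agent string here are placeholders, not Glean’s actual crawler identity.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder site
rp.read()

# "ExampleCrawler" is a hypothetical user-agent string
for path in ["https://www.example.com/docs/", "https://www.example.com/private/"]:
    print(path, "->", rp.can_fetch("ExampleCrawler", path))
```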

Configuration and Setup Instructions

Configuration is performed in the Glean admin console and, if necessary, by preparing access on the web server you wish to crawl.

Prerequisites

  • Admin access to your Glean environment
  • Access to the target website and its sitemap (if applicable)
  • Credentials and/or network routing in place as detailed above

Authentication and credentials

You enter the required credentials during connector configuration (see the format-checking sketch after this list):
  • For Basic Auth: username and password fields
  • For bearer/custom headers: provide headers in “key,value” pairs
  • For cookies: enter as “domain=name=value”
  • For NTLMv2: specify domain, username, and password, and enable NTLM
  • Fields for all of the above appear during the connector creation flow in the Glean admin UI; the system will prompt you to validate access before enabling the initial crawl.
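
To sanity-check credential strings before pasting them into the admin UI, the sketch below parses the documented “key,value” and “domain=name=value” formats into the structures an HTTP client would consume. The sample values are placeholders, and the parsing rules are an assumption based on the formats described above.

```python
def parse_header(pair: str) -> dict:
    """Split a 'key,value' string into a single-entry header dict (assumed format)."""
    key, value = pair.split(",", 1)
    return {key.strip(): value.strip()}

def parse_cookie(entry: str) -> tuple:
    """Split a 'domain=name=value' string into (domain, name, value) (assumed format)."""
    domain, name, value = entry.split("=", 2)
    return domain, name, value

print(parse_header("Authorization,Bearer abc123"))
print(parse_cookie("intranet.example.com=SESSIONID=abc123"))
```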

Step-by-Step Setup

  1. In the Glean admin console, choose to add a new data source and select “Web Connector”.
  2. Enter a display name and optionally provide an icon URL (using the website’s favicon or a custom image).
  3. Choose a crawl method:
    • For sitemap crawl: enter the full sitemap XML URL (a sitemap validation sketch follows these steps).
    • For seed crawl: enter one or more seed page URLs and specify allowed URL patterns using regular expressions.
  4. Configure crawling filters, if desired (e.g., to limit to certain path prefixes or content selectors).
  5. (Optional) Enable Client-Side Rendering (CSR) if the site requires JavaScript evaluation for content.
  6. Enter authentication details per your requirements.
  7. Test connectivity and credential validation through the admin UI.
  8. Complete setup and trigger the initial crawl. Review results and adjust crawling scope or credential setup as needed.
For internal/private websites:
  • Coordinate with your network team to ensure the necessary VPN or static IP routing is active before triggering the crawl.
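
Before entering a sitemap URL in step 3, you can confirm it is well-formed and see exactly which pages it exposes. Below is a minimal sketch using the standard-library XML parser, with a placeholder sitemap URL.

```python
import xml.etree.ElementTree as ET

import requests

# Standard sitemap XML namespace
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url: str) -> list:
    """Fetch a sitemap.xml and return the page URLs it lists."""
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

for url in sitemap_urls("https://www.example.com/sitemap.xml"):
    print(url)
```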

Crawl configuration options

You can further adjust how the crawler operates through:
  • Greenlist/redlist: Include (greenlist) or exclude (redlist) specific URL patterns using regex; see the filtering sketch below.
  • Custom content selectors: Specify HTML selectors to extract (or filter) page content during processing.
  • Branding: Upload a custom icon, shown in Glean search results as the data source badge.
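
To preview how a greenlist/redlist pair scopes a crawl before committing the configuration, you can apply the same regex logic to candidate URLs. The patterns below are hypothetical, and the greenlist-allows/redlist-blocks semantics are an assumption based on the option names.

```python
import re

def in_scope(url: str, greenlist: list, redlist: list) -> bool:
    """A URL is crawled only if it matches some greenlist pattern
    and matches no redlist pattern (assumed semantics)."""
    if not any(re.match(p, url) for p in greenlist):
        return False
    return not any(re.match(p, url) for p in redlist)

greenlist = [r"https://www\.example\.com/docs/.*"]
redlist = [r".*\.pdf$", r".*/archive/.*"]

for url in ["https://www.example.com/docs/guide.html",
            "https://www.example.com/docs/archive/old.html",
            "https://www.example.com/blog/post.html"]:
    print(url, "->", in_scope(url, greenlist, redlist))
```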