The Glean Web Connector enables you to crawl and index both internally and externally hosted websites as dedicated Glean data sources. When you connect a website, Glean discovers and indexes its pages either through a sitemap or by crawling from a set of seed URLs. This allows you to search and access web content alongside your other organizational data within the Glean platform.

Supported Features and Limitations

The Glean Web Connector functions as a configurable web crawler that fetches content from websites and makes it discoverable in Glean. It is ideal for bringing in content from sources that lack a native product connector.

Supported Objects/Entities

  • HTML web pages (via HTTP or HTTPS)
  • Content reachable through crawled links or listed in provided sitemaps

Supported API Endpoints/Features

  • HTTP(S) endpoints for public or authenticated web content
  • Supports standard authentication schemes (illustrated in the sketch after this list):
    • Basic or bearer authentication
    • Custom headers (e.g., for API tokens)
    • Cookies
    • NTLMv2 (for Windows-authenticated resources)
  • Optional Client-Side Rendering (CSR) using Chrome to extract JavaScript-generated content
  • Respects robots.txt for crawl compliance
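
As a rough illustration of these schemes (not Glean's internal crawler code), the Python sketch below shows how each credential type maps onto an HTTP request. The URL, account names, and tokens are placeholders, and the NTLMv2 case relies on the third-party requests-ntlm package.

```python
import requests
from requests.auth import HTTPBasicAuth
from requests_ntlm import HttpNtlmAuth  # third-party: pip install requests-ntlm

url = "https://intranet.example.com/docs/"  # placeholder URL

# Basic authentication (username/password)
resp = requests.get(url, auth=HTTPBasicAuth("svc-crawler", "secret"))

# Bearer token supplied as a custom header
resp = requests.get(url, headers={"Authorization": "Bearer <token>"})

# Session cookie for cookie-based authentication
resp = requests.get(url, cookies={"SESSIONID": "abc123"})

# NTLMv2 for Windows-authenticated resources (domain\username + password)
resp = requests.get(url, auth=HttpNtlmAuth("CORP\\svc-crawler", "secret"))

print(resp.status_code, resp.headers.get("Content-Type"))
```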

Limitations

  • Unstructured or highly dynamic web pages (e.g., sites with complex JavaScript) may require CSR mode, but there is limited support for advanced dynamic behaviors or web animations
  • Does not support crawling password-protected content without proper credentials
  • Some edge-case sitemaps with atypical formatting may not be processed
  • Dynamic web page indexing is not supported for:
    • AWS deployments
    • Non-public websites
    • On-prem websites accessed via VPN
  • The connector does not enforce source system permissions on indexed data; access controls should be managed at the source

Crawling Strategy

| Crawl type | Full Crawl | Incremental Crawl | People Data | Activity | Update Rate | Webhook | Notes |
|---|---|---|---|---|---|---|---|
| Seed Crawl | Yes | No | No | No | Configurable (default weekly) | No | Uses user-specified seed URLs and regex filters |
| Sitemap Crawl | Yes | Partial | No | No | Configurable (default weekly) | No | Uses sitemap.xml for discovery |
| Client-Side Rendering | Yes | Partial | No | No | Configurable (default weekly) | No | For JavaScript-heavy sites; requires enabling CSR |
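
For a concrete sense of the Seed Crawl row, the sketch below performs a breadth-first crawl from seed URLs and keeps only links matching an allow regex, loosely mirroring the connector's URL filters. The seed URL and pattern are hypothetical; Glean's managed crawler is of course more elaborate.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def seed_crawl(seeds, allow_pattern, max_pages=100):
    """Breadth-first crawl from seed URLs, keeping only links that
    match the allow regex (analogous to the connector's URL filters)."""
    allowed = re.compile(allow_pattern)
    queue, seen, pages = deque(seeds), set(seeds), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        resp = requests.get(url, timeout=10)
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue  # skip non-HTML resources
        pages[url] = resp.text
        extractor = LinkExtractor()
        extractor.feed(resp.text)
        for href in extractor.links:
            absolute = urljoin(url, href)
            if allowed.match(absolute) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Hypothetical seed URL and allow pattern
pages = seed_crawl(["https://www.example.com/docs/"],
                   r"https://www\.example\.com/docs/.*")
print(len(pages), "pages fetched")
```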

Results Display

When you search Glean, results from sites indexed with the Web Connector are shown with a separate data source label and (optionally) a site-specific branded icon. The content preview and metadata reflect the crawled page structure.

Requirements

The following requirements must be met before you begin configuring the Glean Web Connector.

Technical Requirements

  • The target website must be accessible from the Glean platform (either public or via appropriate network configuration)
  • Supported protocols: HTTP or HTTPS
  • For internal/private websites, one of the following network configurations is required:
    • VPN linking your private network to Glean’s environment
    • Shared VPC on Google Cloud
    • DNAT/static IP whitelisting on your firewall
  • If Client-Side Rendering is needed, ensure the site is compatible with headless Chrome for JavaScript execution (see the rendering check sketched after this list)
  • No special hardware or software is needed; all crawling is performed within the Glean cloud environment
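
One way to verify headless-Chrome compatibility, as a pre-flight check rather than part of the connector itself, is to compare the raw HTML with what Chrome renders. The sketch below assumes the Playwright package and a placeholder URL.

```python
import requests
from playwright.sync_api import sync_playwright

url = "https://app.example.com/dashboard"  # placeholder URL

# Raw HTML as a plain crawler would see it (no JavaScript execution)
raw_html = requests.get(url, timeout=10).text

# Rendered HTML after headless Chrome executes the page's JavaScript
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")
    rendered_html = page.content()
    browser.close()

# A large size difference suggests the site needs CSR to index its content
print(len(raw_html), len(rendered_html))
```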

Credential Requirements

To crawl protected content, you must provide appropriate credentials using one of the following supported methods:
  • Basic authentication (username/password)
  • Bearer tokens or custom headers
  • Cookies (for session-based authentication)
  • NTLMv2 (requires domain, username, and password)
Credentials should be securely stored and managed through Glean’s admin UI. How to provide these is detailed in the setup instructions below.

Permission Requirements

The Web Connector itself does not enforce or propagate fine-grained document permissions from the source website. Any web page accessible based on provided credentials will be indexed and made discoverable to users in Glean as configured. For internal-only websites, ensure only authorized users receive access to the Glean data source containing this content.

Preliminary Source/System Setup

  • For public sites: confirm stable access to all page URLs and sitemaps intended for indexing
  • For private sites: coordinate with your IT/network team to set up VPN/VPC routing or static IP allowlists
  • If authentication is required, validate accounts, tokens, or cookies before beginning configuration
  • Optionally, select or source a branded icon (favicon) for visual identification in search results

Permissions & Security

Data and Metadata Ingested:
The connector collects all HTML content and associated metadata (e.g., titles, URLs), limited to pages that are reachable from the configured seeds or sitemaps and permitted by the credential scope. No people or activity data is ingested from websites, and the connector does not fetch or propagate per-user permissions for crawled content.
Permission Propagation Logic:
All users with access to the Glean data source can search and view indexed pages, subject to the crawl configuration and any source-side authentication applied during crawl.
Security & Compliance Notes:
All data transfer is encrypted in transit. Credentials are stored securely in Glean. The connector respects robots.txt unless explicitly configured to ignore it (see the compliance-check sketch at the end of this section). Data is isolated within the customer’s tenant; private data is protected through the initial credentialing and network configuration steps.
Known Security Restrictions:
  • The connector does not support per-user source system permissions mapping.
  • On multi-instance or highly restricted networks, connectivity may require IT/admin coordination.
Data Privacy Implications:
All indexed website content becomes searchable internally to your organization, based on the access you configure for the data source in Glean. Review internal compliance policies before enabling crawls of sensitive internal web properties.
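
As a hedged illustration of the robots.txt compliance noted above, Python’s standard-library robotparser can check whether a given user agent may fetch a URL. The site and the user-agent string here are placeholders, not Glean’s actual crawler identity.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder site
rp.read()

# "ExampleCrawler" is a hypothetical user-agent string
for path in ["https://www.example.com/docs/", "https://www.example.com/private/"]:
    print(path, "->", rp.can_fetch("ExampleCrawler", path))
```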

Configuration and Setup Instructions

Configuration is performed in the Glean admin console and, if necessary, by preparing access on the web server you wish to crawl.

Prerequisites

  • Admin access to your Glean environment
  • Access to the target website and its sitemap (if applicable)
  • Credentials and/or network routing in place as detailed above

Authentication and credentials

You enter the required credentials during connector configuration (see the format-checking sketch after this list):
  • For Basic Auth: username and password fields
  • For bearer/custom headers: provide headers in “key,value” pairs
  • For cookies: enter as “domain=name=value”
  • For NTLMv2: specify domain, username, and password, and enable NTLM
  • Fields for all of the above appear during the connector creation flow in the Glean admin UI; the system will prompt you to validate access before enabling the initial crawl.
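
To sanity-check credential strings before pasting them into the admin UI, the sketch below parses the documented “key,value” and “domain=name=value” formats into the structures an HTTP client would consume. The sample values are placeholders, and the parsing rules are an assumption based on the formats described above.

```python
def parse_header(pair: str) -> dict:
    """Split a 'key,value' string into a single-entry header dict (assumed format)."""
    key, value = pair.split(",", 1)
    return {key.strip(): value.strip()}

def parse_cookie(entry: str) -> tuple:
    """Split a 'domain=name=value' string into (domain, name, value) (assumed format)."""
    domain, name, value = entry.split("=", 2)
    return domain, name, value

print(parse_header("Authorization,Bearer abc123"))
print(parse_cookie("intranet.example.com=SESSIONID=abc123"))
```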

Step-by-Step Setup

  1. In the Glean admin console, choose to add a new data source and select “Web Connector”.
  2. Enter a display name and optionally provide an icon URL (using the website’s favicon or a custom image).
  3. Choose a crawl method:
    • For sitemap crawl: enter the full sitemap XML URL (a sitemap validation sketch follows these steps).
    • For seed crawl: enter one or more seed page URLs and specify allowed URL patterns using regular expressions.
  4. Configure crawling filters, if desired (e.g., to limit to certain path prefixes or content selectors).
  5. (Optional) Enable Client-Side Rendering (CSR) if the site requires JavaScript evaluation for content.
  6. Enter authentication details per your requirements.
  7. Test connectivity and credential validation through the admin UI.
  8. Complete setup and trigger the initial crawl. Review results and adjust crawling scope or credential setup as needed.
For internal/private websites:
  • Coordinate with your network team to ensure the necessary VPN or static IP routing is active before triggering the crawl.
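
Before entering a sitemap URL in step 3, you can confirm it is well-formed and see exactly which pages it exposes. Below is a minimal sketch using the standard-library XML parser, with a placeholder sitemap URL.

```python
import xml.etree.ElementTree as ET

import requests

# Standard sitemap XML namespace
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url: str) -> list:
    """Fetch a sitemap.xml and return the page URLs it lists."""
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

for url in sitemap_urls("https://www.example.com/sitemap.xml"):
    print(url)
```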

Crawl configuration options

You can further adjust how the crawler operates through:
  • Greenlist/redlist: Include (greenlist) or exclude (redlist) specific URL patterns using regex; see the filtering sketch below.
  • Custom content selectors: Specify HTML selectors to extract (or filter) page content during processing.
  • Branding: Upload a custom icon, shown in Glean search results as the data source badge.
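
To preview how a greenlist/redlist pair scopes a crawl before committing the configuration, you can apply the same regex logic to candidate URLs. The patterns below are hypothetical, and the greenlist-allows/redlist-blocks semantics are an assumption based on the option names.

```python
import re

def in_scope(url: str, greenlist: list, redlist: list) -> bool:
    """A URL is crawled only if it matches some greenlist pattern
    and matches no redlist pattern (assumed semantics)."""
    if not any(re.match(p, url) for p in greenlist):
        return False
    return not any(re.match(p, url) for p in redlist)

greenlist = [r"https://www\.example\.com/docs/.*"]
redlist = [r".*\.pdf$", r".*/archive/.*"]

for url in ["https://www.example.com/docs/guide.html",
            "https://www.example.com/docs/archive/old.html",
            "https://www.example.com/blog/post.html"]:
    print(url, "->", in_scope(url, greenlist, redlist))
```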