The Glean Website connector enables you to crawl and index both internal and external-hosted websites as dedicated Glean data sources. When you connect a website, Glean discovers and indexes its pages recursively either through a sitemap or by crawling from a set of seed URLs. This allows you to search and access web content alongside your other organizational data within the Glean platform.

Supported features

The Glean Website Connector functions as a configurable web crawler that fetches content from websites and makes it discoverable in Glean. It is ideal for bringing in content from sources that lack a native product connector. The data source supports:
  • HTTP(S) endpoints for public or authenticated web content.
  • Standard authentication schemes:
    • Basic or bearer authentication.
    • Custom headers, for example, for API tokens.
    • Cookies.
    • NTLMv2 (for Windows-authenticated resources).
  • Optional Client-Side Rendering (CSR) using Chrome to extract JavaScript-generated content.
  • Compliance with robots.txt directives during crawls.
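
As an illustration of how robots.txt compliance works, the sketch below checks candidate URLs against a site's robots.txt with Python's standard library. This is not Glean's implementation; the user-agent string and URLs are hypothetical.

```python
# Minimal robots.txt check using only the Python standard library.
# Not Glean's crawler; the user agent and URLs below are hypothetical.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

user_agent = "glean-crawler"  # hypothetical user-agent string
for url in ("https://example.com/docs/intro", "https://example.com/admin/login"):
    # A compliant crawler performs this check before fetching any page.
    verdict = "fetch" if rp.can_fetch(user_agent, url) else "skip"
    print(verdict, url)
```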

Supported objects

  • HTML web pages through HTTP or HTTPS.
  • Content reachable through crawled links or listed in provided sitemaps.

Limitations

  • Pages that render content using JavaScript in the browser require Client‑Side Rendering (CSR). CSR captures the rendered HTML but has limited support for advanced dynamic behaviors or animations.
  • Does not support crawling password-protected content without proper credentials.
  • Dynamic web page indexing (CSR) is not supported for:
    • AWS deployments.
    • Password-protected websites.
    • On-premises websites accessed through a VPN.

Crawling strategy

| Crawl type | Full Crawl | Incremental Crawl | People Data | Activity | Update Rate | Webhook | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Seed Crawl | Yes | No | No | No | Configurable (default weekly) | No | Uses user-specified seed URLs and regex filters |
| Sitemap Crawl | Yes | No | No | No | Configurable (default weekly) | No | Uses sitemap.xml for discovery |
| Client-Side Rendering | Yes | No | No | No | Configurable (default weekly) | No | For JavaScript-heavy sites; requires enabling CSR |

Requirements

The following requirements must be met before you begin configuring the Glean Website connector.

Technical requirements

  • The target website must be reachable from the Glean platform, either publicly or through an appropriate network configuration such as a VPN or static IP routing.
  • Supported protocols are HTTP and HTTPS.
  • If Client-Side Rendering is needed, ensure the site is compatible with headless Chrome, which is used for JavaScript execution.
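
If you are unsure whether a site will work with CSR, a quick pre-flight check is to load it in headless Chrome yourself. The sketch below assumes the selenium package and a local Chrome install; the URL is hypothetical, and this is a compatibility check, not Glean's rendering pipeline.

```python
# Pre-flight check: does the page produce its content under headless Chrome?
# Assumes the selenium package (pip install selenium) and a local Chrome.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run without a visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/spa")  # hypothetical JavaScript-heavy page
    html = driver.page_source  # DOM serialized after JavaScript execution
    # If the content you need is present in `html`, the page is a good
    # candidate for CSR; if not, CSR is unlikely to help.
    print("rendered", len(html), "bytes")
finally:
    driver.quit()
```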

Credential requirements

To crawl protected content, you must provide appropriate credentials using one of the following supported methods:
  • Basic authentication (username/password).
  • Bearer tokens or custom headers.
  • Cookies (for session-based authentication).
  • NTLMv2 (requires domain, username, and password).
Credentials must be securely stored and managed through Glean’s admin UI.
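
For reference, the sketch below shows what each of the credential types above looks like at the HTTP level, using the Python requests library (plus the requests-ntlm package for NTLMv2). These are illustrations only; the host, account names, and token values are hypothetical, and Glean's crawler does not literally use this code.

```python
# HTTP-level illustrations of the supported credential types. Hypothetical
# host and values; assumes requests and requests-ntlm are installed.
import requests
from requests_ntlm import HttpNtlmAuth

BASE = "https://internal.example.com"

# Basic authentication (username/password).
requests.get(f"{BASE}/page", auth=("svc-crawler", "s3cret"))

# Bearer token, or any custom header (e.g., an API token for a WAF rule).
requests.get(f"{BASE}/page", headers={"Authorization": "Bearer <token>"})

# Cookies for session-based authentication.
requests.get(f"{BASE}/page", cookies={"SESSIONID": "<session-value>"})

# NTLMv2 for Windows-authenticated resources: domain, username, password.
requests.get(f"{BASE}/page", auth=HttpNtlmAuth("DOMAIN\\svc-crawler", "s3cret"))
```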

Permission requirements

  • The Website data source does not enforce or propagate fine-grained document permissions from the source website.
  • Any web page accessible with the provided credentials is indexed and made discoverable to all users in Glean, as configured.
  • Access to the Website data source is granted to all users by default. To restrict an internal-only website, configure product access groups and assign the data source to those groups.

Setup instructions

Perform the following steps to configure the Website data source.

Prerequisites

  • Admin access to your Glean environment.
  • Access to the target website and its sitemap (if applicable).
  • Any required credentials and network routing (for example, VPN or static IP) for the target site.

Steps

Step 1. Add Website data source.

  1. Navigate to the Glean admin console.
  2. Go to the Data sources tab in the admin console.
  3. Click Add data source and select Website.
  4. Enter a display name and optionally provide an icon.

Step 2. Choose a crawl method.

  • Sitemap crawl: Enter the URL of the sitemap for the website you would like to crawl.
  • Seed crawl: Enter the seed URL(s) of the website you would like to crawl. Glean crawls all child pages linked from the seed URLs, and you can scope the crawl with regular-expression URL filters. Both methods are sketched below.
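
To make the two methods concrete, here is a minimal sketch of each discovery strategy in Python. It assumes the requests package; the URLs and the regex filter are hypothetical, and a real crawl would repeat the seed step recursively for every newly discovered page.

```python
# Illustrative URL discovery for the two crawl methods above; not Glean's
# crawler. Assumes the requests package; URLs and the regex are hypothetical.
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests

# Sitemap crawl: read page URLs out of sitemap.xml.
def sitemap_urls(sitemap_url: str) -> list[str]:
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(requests.get(sitemap_url).content)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

# Seed crawl: follow links from a seed page, scoped by a regex filter.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)

def seed_urls(seed: str, allow: re.Pattern) -> list[str]:
    collector = LinkCollector()
    collector.feed(requests.get(seed).text)
    links = {urljoin(seed, h) for h in collector.hrefs}
    # A real crawl would fetch each allowed link and repeat this step.
    return sorted(u for u in links if allow.match(u))

print(sitemap_urls("https://example.com/sitemap.xml"))
print(seed_urls("https://example.com/docs/",
                re.compile(r"https://example\.com/docs/.*")))
```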

(Optional) Step 3: Configure advanced settings.

The Advanced settings section is collapsed by default; click the Advanced settings dropdown to see the configuration options. Use these options only when setup fails validation, when troubleshooting crawl behavior, or when you must override defaults for a specific site.
  1. Choose from the following Authentication options. Depending on the authentication option you select, you will be prompted to provide the required details.
    • None.
    • Basic authentication: Provide the username and password.
    • OAuth: Provide your OAuth values after completing the OAuth flow with your identity provider. The connector uses this information to refresh or request tokens as configured. If you enable Dynamic rendering (CSR), sensitive headers are cleared because they do not take effect with CSR; Glean does not support cookies or other authentication mechanisms with CSR.
  2. URL settings: The URL settings section is shown only when a seed crawl is selected. These settings are saved only if you change them; otherwise, the connector uses the defaults.
  3. Choose from the following Crawling options to control crawl behavior.
    • Honor robots.txt
    • Dynamic rendered crawl
  4. Sitemap: The sitemap options apply only if you selected the sitemap crawl method.
  5. User agent: The user agent value is saved only if you change it; otherwise, the default is used.

Step 4. Save the data source configuration.

  • Click Save.

Step 5. Trigger initial crawl.

  • Trigger the initial crawl, then review the results and adjust the crawl scope or credential setup as needed.
  • For internal or private websites, coordinate with your network team to ensure the necessary VPN or static IP routing is active before triggering the crawl.

Additional information

  • Start with defaults. Only use Advanced settings to address validation errors or site-specific needs.
  • Prefer URL regex filters and allowlists to fine-tune scope rather than disabling robots.txt or overusing redlists.
  • If a site requires headers (e.g., firewall or WAF rules), add them as Sensitive headers under your selected authentication method. Do not enable CSR in that case.
  • Keep overrides minimal so product-wide default improvements continue to benefit your deployment.

Permissions & Security

Data and Metadata Ingested:
The connector collects all HTML content and associated metadata, such as titles and URLs, limited to pages reachable through the configured seeds or sitemaps and allowed by the credential scope. No people or activity data is ingested from websites, and the connector does not fetch or propagate per-user permissions for crawled content.
Permission Propagation Logic:
All users with access to the Glean data source can search and view indexed pages, subject to the crawl configuration and any source-side authentication applied during crawl.
Security & Compliance Notes:
All data transfer is encrypted in transit. Credentials are stored securely in Glean. The connector respects robots.txt unless explicitly configured to ignore it. Data is isolated within your tenant, and private data is protected through the initial credentialing and network configuration steps.
Known Security Restrictions:
  • The connector does not support per-user source system permissions mapping.
  • On multi-instance or highly restricted networks, connectivity may require IT/admin coordination.
Data Privacy Implications:
All indexed website content becomes searchable internally to your organization based on the data source access you configure in Glean. Review internal compliance policies before enabling crawls of sensitive internal web properties.

Troubleshooting steps

  • Validation failures when saving the configuration:
    • Expand Advanced settings and apply only the minimum required overrides (for example, URL normalization or the user agent).
  • Authentication works but pages are not indexed:
    • Verify that robots.txt is honored as intended, check the URL regex, and confirm you have not stripped necessary query parameters. A quick external check is sketched below.
  • JavaScript-heavy pages are not rendering:
    • Enable CSR only if the site requires no authentication and your deployment supports CSR (it is not supported on AWS). If the site requires authentication or headers, do not use CSR.
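
When authentication succeeds but pages are missing from the index, it can help to reproduce the crawler's view of a page from outside Glean. The sketch below, assuming the requests package and hypothetical URL, user-agent, and token values, checks the robots.txt verdict and fetches the page with the same credentials you configured.

```python
# Reproduce what a crawler would see for one URL: robots verdict plus the
# raw fetch. Assumes requests; URL, user agent, and token are hypothetical.
from urllib import robotparser

import requests

url = "https://internal.example.com/docs/page?id=42"
user_agent = "glean-crawler"  # hypothetical; match your configured value

rp = robotparser.RobotFileParser()
rp.set_url("https://internal.example.com/robots.txt")
rp.read()
print("robots.txt allows:", rp.can_fetch(user_agent, url))

resp = requests.get(
    url,
    headers={"User-Agent": user_agent, "Authorization": "Bearer <token>"},
)
# A 200 response containing the expected HTML suggests auth and routing are
# fine, and the gap is in URL regex scope or stripped query parameters.
print(resp.status_code, len(resp.text))
```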