Google Sites

This article documents the Glean Google Sites connector, which indexes Google Sites pages and makes them searchable. The connector uses the Google Vault and Google Drive APIs to make content discoverable through Glean’s unified search experience while enforcing user-specific access permissions. All indexed data is stored within your designated cloud environment, so indexed content does not leave your infrastructure.

Features & limitations

Supported objects

  • Google Sites pages: Content from New Google Sites only.
  • Metadata: Publishing dates, modification dates, author/owner information, and access control metadata.
  • Embedded content: Links to embedded Google Docs (URL references only).

Key features

  • Full-text Indexing: Indexes all content from New Google Sites pages.
  • Permission Enforcement: Inherits and enforces user-specific access permissions from Google Drive.

Limitations

  • Only New Google Sites are supported; Classic Sites are no longer indexed.
  • Because of API constraints, the connector references only ID-based “edit URLs” for Sites pages, not the human-readable published URLs. Editors who open a result land on the edit version of the site; non-editors are redirected to the published version.
  • Permissions are based on Drive file access. As a result, published-site viewers cannot always be mapped precisely, and in some cases, permissions may default to "everyone."
  • The connector's functionality is subject to Google Vault organizational quota and rate limits.
  • Incremental crawling of individual pages is not supported; changes are detected only at the site level, based on the site's last-modified time.
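
The site-level crawl limitation can be illustrated with a minimal sketch. It is not Glean's implementation: the function name, the site IDs, and the timestamps are hypothetical, and it assumes that a site whose last-modified time is newer than the previous crawl is re-indexed as a whole.

```python
from datetime import datetime, timezone


def sites_to_recrawl(site_modified_times, last_crawl_time):
    """Return the IDs of sites modified after the last crawl.

    Change detection is per site, so every page of a returned site
    would be re-indexed, even if only one page actually changed.
    """
    return sorted(
        site_id
        for site_id, modified in site_modified_times.items()
        if modified > last_crawl_time
    )


# Hypothetical data: one site untouched, one edited since the last crawl.
last_crawl = datetime(2024, 5, 1, tzinfo=timezone.utc)
sites = {
    "site-a": datetime(2024, 4, 20, tzinfo=timezone.utc),
    "site-b": datetime(2024, 5, 3, tzinfo=timezone.utc),
}
print(sites_to_recrawl(sites, last_crawl))  # ['site-b']
```

In this model a one-word edit to a single page still triggers a recrawl of the whole site, which is the practical cost of not having page-level incremental crawling.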

Requirements

Successful deployment of the Google Sites connector requires completing the following prerequisites in your environment.

  • Google Workspace: A Google Workspace edition that includes a Google Vault license (e.g., Business Plus, Enterprise, Education).
  • New Google Sites: All sites you wish to index must be in the New Google Sites format and stored in Google Drive.
  • GCP Project: Access to the GCP project where your Glean instance is hosted.
  • Service Account: A Google Cloud service account with “Google Apps Domain-wide Delegation” privileges.
  • API Access: The Google Vault API and Google Cloud Storage API must be enabled in your GCP project.
  • Google Drive Connector: The Google Drive connector must already be set up and connected to Glean.

How to set up the connector

The configuration for the Google Sites connector is performed across three main platforms: your Google Admin Console, Google Vault, and the Glean Admin Console.

Step 1: Check prerequisites

Before you begin, ensure you have already completed the Google Drive connector setup. The Google Sites connector relies on the credentials and permissions established during that process.

Step 2: Grant Vault roles to the admin user

The user account used for the Google Drive connector setup (the “Directory Admin Email”) must have the appropriate privileges to use Google Vault.

  1. As an admin, visit the Admin Roles page in the Google Admin Console.
  2. Create (or modify) a role and grant that role the following privileges:
    • Manage Matters
    • Manage Searches
    • Manage Exports
  3. Assign the role to the “Directory Admin Email” account.

Step 3: Enable APIs and scopes

This step ensures your service account has the necessary permissions to access Google Vault and Cloud Storage.

  1. In your GCP project, verify that the Google Vault API and Google Cloud Storage API are enabled.
  2. In the Google Admin Console, navigate to the Manage OAuth Client page.
  3. Select the Client ID that was used for the Google Drive connector setup and add the following scopes to it:
    • https://www.googleapis.com/auth/ediscovery (Allows the client to use Google Vault)
    • https://www.googleapis.com/auth/devstorage.read_only (Allows the client to access content from Vault exports)
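
Before saving, it can help to confirm that both scopes are present in the client's scope list. The following is a small sketch for that check; the helper name missing_scopes is hypothetical, and the example scope list (including the Drive scope) is only a placeholder for whatever your client already has.

```python
# The two scopes listed in this step.
REQUIRED_SCOPES = {
    "https://www.googleapis.com/auth/ediscovery",
    "https://www.googleapis.com/auth/devstorage.read_only",
}


def missing_scopes(granted_scopes):
    """Return the required scopes that are absent from granted_scopes."""
    granted = {scope.strip() for scope in granted_scopes if scope.strip()}
    return sorted(REQUIRED_SCOPES - granted)


# Placeholder scope list, pasted as a comma-separated string.
pasted = (
    "https://www.googleapis.com/auth/drive.readonly, "
    "https://www.googleapis.com/auth/ediscovery"
)
print(missing_scopes(pasted.split(",")))
# ['https://www.googleapis.com/auth/devstorage.read_only']
```

An empty list means both scopes from this step are present.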

Step 4: Create and share a Matter in Google Vault

This step is required for Glean to be able to export and access your Sites content from Google Vault.

  1. Navigate to Google Vault and go to the Matters page.
  2. Click Create Matter and give it a name like "Glean Matter."
  3. Share the Matter with the “Directory Admin Email” account. To do this:
    • Navigate to the newly created matter’s page.
    • Click on the Share button (near the pencil icon).
    • Under Invite people, include the user account email used for the Google Drive connector setup.
  4. Ensure that the users and shared drives whose Sites you wish to crawl are included in the search parameters for the Vault export.
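
For readers who want to see what the search parameters in item 4 look like at the API level, here is a hedged sketch that only builds request payloads and sends nothing. The field names (corpus, dataScope, searchMethod, accountInfo, sharedDriveInfo) follow my reading of the public Google Vault API reference and should be verified there. How Glean constructs its own exports is not described in this article, and the helper names, email address, and shared drive ID are hypothetical.

```python
import json

VAULT_BASE = "https://vault.googleapis.com/v1"


def create_matter_request(name, description=""):
    """Describe (but do not send) a request that creates a Vault matter."""
    return {
        "method": "POST",
        "url": f"{VAULT_BASE}/matters",
        "body": {"name": name, "description": description},
    }


def drive_query_for_accounts(emails):
    """Drive-corpus query scoped to specific user accounts."""
    return {
        "corpus": "DRIVE",
        "dataScope": "ALL_DATA",
        "searchMethod": "ACCOUNT",
        "accountInfo": {"emails": list(emails)},
    }


def drive_query_for_shared_drives(shared_drive_ids):
    """Drive-corpus query scoped to specific shared drives."""
    return {
        "corpus": "DRIVE",
        "dataScope": "ALL_DATA",
        "searchMethod": "SHARED_DRIVE",
        "sharedDriveInfo": {"sharedDriveIds": list(shared_drive_ids)},
    }


# Hypothetical values for illustration only.
request = create_matter_request("Glean Matter")
user_query = drive_query_for_accounts(["site-owner@example.com"])
drive_query = drive_query_for_shared_drives(["SHARED_DRIVE_ID_PLACEHOLDER"])
print(json.dumps(request, indent=2))
print(json.dumps(user_query, indent=2))
```

A query carries a single search method, so users and shared drives are expressed here as two separate queries. If a site owner or shared drive is missing from both, its Sites would not appear in the export.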

Step 5: Connect in Glean

This is the final step to configure the connector in the Glean Admin Console.

  1. In Glean, go to the Data Sources section and select the Google Sites connector.
  2. Enter the following information into the setup page:
    • Google Vault Matter Id: The ID of the Matter you created in Step 4.
    • Google Workspace Domain: Your company's Google Workspace domain (e.g., glean.com). This value should match the domain configured for your Google Drive connector.
  3. Click Save.

The connector will automatically begin indexing your Google Sites pages.
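
The two setup fields can be sanity-checked before you click Save. The following is a minimal sketch, not Glean's validation logic: validate_settings and its messages are hypothetical, the Matter ID is only checked for being non-empty because its format is not specified here, and the domain pattern is a simplified hostname check.

```python
import re

# Simplified hostname pattern: dot-separated labels ending in an alphabetic TLD.
DOMAIN_RE = re.compile(
    r"^(?=.{1,253}$)([a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$"
)


def validate_settings(matter_id, sites_domain, drive_domain):
    """Return a list of problems with the setup values (empty if none)."""
    errors = []
    if not matter_id.strip():
        errors.append("Google Vault Matter Id is empty")
    domain = sites_domain.strip().lower()
    if not DOMAIN_RE.match(domain):
        errors.append(f"'{sites_domain}' does not look like a domain")
    elif domain != drive_domain.strip().lower():
        errors.append("domain differs from the Google Drive connector domain")
    return errors


# Hypothetical values: the matter ID below is a placeholder, not a real ID.
print(validate_settings("MATTER_ID_PLACEHOLDER", "Glean.com ", "glean.com"))  # []
print(validate_settings("", "glean.com", "example.com"))
```

The second call reports both an empty Matter ID and a domain mismatch, the two mistakes this setup page is most likely to invite.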

Permissions & security

  • Permission propagation: For each Google Sites page, the connector inherits user and group permissions from the underlying Drive file. Glean enforces these permissions at query time, so users see only the sites they are authorized to access in Google Drive.
  • Data privacy: All data is exported and stored within your cloud environment. Glean does not maintain persistent copies on external servers.
  • Security model: The connector relies on service account-based access and on a Vault Matter that your organization controls and shares, which keeps access scoped in line with the principle of least privilege.
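
Query-time permission enforcement can be pictured with a short sketch. This is an illustration under stated assumptions, not Glean's implementation: the SitePage type, the visible_pages function, and the "everyone" marker are hypothetical, and each page is assumed to carry the set of users and groups copied from its Drive file.

```python
from dataclasses import dataclass, field

# Hypothetical marker for pages whose permissions fell back to "everyone".
EVERYONE = "everyone"


@dataclass
class SitePage:
    title: str
    allowed: set = field(default_factory=set)  # user emails and group names


def visible_pages(pages, user_email, user_groups):
    """Keep only pages the user may see, directly, via a group, or via 'everyone'."""
    principals = {user_email, *user_groups, EVERYONE}
    return [page for page in pages if page.allowed & principals]


# Hypothetical index contents.
pages = [
    SitePage("Eng handbook", {"eng-team"}),
    SitePage("HR policies", {"hr-team"}),
    SitePage("Company news", {EVERYONE}),
]
result = visible_pages(pages, "dana@example.com", ["eng-team"])
print([page.title for page in result])  # ['Eng handbook', 'Company news']
```

The filter runs at search time against the stored permissions, so a page whose Drive permissions exclude the user never appears in that user's results.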