In Pilot Mode, some crawling restrictions are automatically enforced, while others are optional but strongly recommended.

Restricting content to a smaller dataset lets your organization start using Glean quickly and evaluate the product without waiting for a crawl of all of your organization’s data to complete.

Time Restrictions

Time-based crawling restrictions limit the data that is crawled to a specific time range. Glean does not crawl any data outside of this range.

Time-based restrictions typically consider both the creation and modification dates of the data. That is, as long as a document has been created or modified within the time range, it will be crawled.

If a document that was excluded from crawling (because its creation and modification dates both fell outside of the time range) is later modified so that it falls within the time range, Glean will crawl it, typically within 24 hours.
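The time-range rule described above can be sketched as a simple date check. This is only an illustration, not Glean's implementation: the function name `in_time_range`, the `WINDOW` constant, and the use of 365 days to approximate a 12-month range are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical window length; 365 days approximates "the last 12 months".
WINDOW = timedelta(days=365)


def in_time_range(created: datetime, modified: datetime, now: datetime) -> bool:
    """Return True if either timestamp falls inside the crawl window."""
    cutoff = now - WINDOW
    return created >= cutoff or modified >= cutoff


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old_doc = {
    "created": datetime(2020, 1, 15, tzinfo=timezone.utc),
    "modified": datetime(2021, 3, 2, tzinfo=timezone.utc),
}

# Both dates are older than the cutoff, so the document is skipped.
skipped = not in_time_range(old_doc["created"], old_doc["modified"], now)

# An edit inside the window makes the same document eligible again.
old_doc["modified"] = datetime(2024, 5, 20, tzinfo=timezone.utc)
eligible = in_time_range(old_doc["created"], old_doc["modified"], now)
```

Checking the two dates with `or` mirrors the rule that a single recent modification is enough to bring an old document back into scope.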

Identity Restrictions

Identity restrictions limit the data that is crawled to a specific “pilot” identity/user group. Only data from users included in this group is crawled.

You should create a group in your Identity Provider (e.g. Entra ID, Google, Okta, etc.) that contains the users who will be actively using Glean during the pilot. We suggest keeping this group relatively small, ideally fewer than 100 users.
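As a rough illustration of an identity restriction, the sketch below filters documents by whether their owner belongs to the pilot group. The helper names, the email-based matching, and the sample addresses are hypothetical; a real deployment resolves group membership through the Identity Provider rather than a hard-coded list.

```python
from typing import Iterable, Set

# Suggested ceiling taken from the guidance above (fewer than 100 users).
PILOT_GROUP_SOFT_LIMIT = 100


def normalize(email: str) -> str:
    """Compare addresses case-insensitively and ignore stray whitespace."""
    return email.strip().lower()


def build_pilot_set(members: Iterable[str]) -> Set[str]:
    """Build a lookup set from the pilot group's member addresses."""
    return {normalize(m) for m in members}


def is_crawlable(owner: str, pilot: Set[str]) -> bool:
    """A document is in scope only if its owner is in the pilot group."""
    return normalize(owner) in pilot


def group_is_small_enough(pilot: Set[str]) -> bool:
    """Check the group against the suggested size ceiling."""
    return len(pilot) < PILOT_GROUP_SOFT_LIMIT


pilot = build_pilot_set(["Ana@example.com", "raj@example.com"])
in_scope = is_crawlable("ana@example.com ", pilot)
out_of_scope = is_crawlable("sam@example.com", pilot)
```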

Document Restrictions

Document restrictions limit the data that is crawled to a specific subset of documents. This could be individual documents/assets, or a group of documents/assets (e.g. a folder, user drive, site, space, etc.).

The types of document restrictions supported vary by data source. Document restrictions can typically be applied as either greenlists (explicit inclusion) or redlists (explicit exclusion), depending on what is supported by the vendor’s API.
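The difference between a greenlist and a redlist can be sketched as a path filter. Everything here is an assumption made for illustration: the `should_crawl` function, the path-prefix matching, and the choice to let a redlist win if both lists are supplied. Actual behavior depends on the data source and the vendor's API.

```python
from typing import Iterable, Optional


def _is_under(path: str, root: str) -> bool:
    """True if path equals root or sits below it (matches whole segments)."""
    root = root.rstrip("/")
    return path == root or path.startswith(root + "/")


def should_crawl(
    path: str,
    greenlist: Optional[Iterable[str]] = None,
    redlist: Optional[Iterable[str]] = None,
) -> bool:
    # Redlist: anything explicitly excluded is never crawled.
    if redlist is not None and any(_is_under(path, r) for r in redlist):
        return False
    # Greenlist: when present, only explicitly included locations are crawled.
    if greenlist is not None:
        return any(_is_under(path, g) for g in greenlist)
    # No lists configured: nothing restricts this path.
    return True


included = should_crawl("/sites/eng/docs/a.md", greenlist=["/sites/eng"])
lookalike = should_crawl("/sites/engineering-old/x.md", greenlist=["/sites/eng"])
excluded = should_crawl("/sites/hr/payroll.xlsx", redlist=["/sites/hr"])
```

Matching on whole path segments keeps a greenlisted `/sites/eng` from accidentally admitting `/sites/engineering-old`.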

If you are using any of the following apps, apply these restrictions to your pilot:

  • SharePoint - We require you to select specific sites to be crawled. This avoids crawling the irrelevant SharePoint sites commonly found in large organizations.
  • GitHub - We recommend you select specific repositories if you have more than 50 repositories.
  • Confluence - We recommend you select specific spaces if you have more than 2 million docs.
  • Box - We recommend you select specific folders or drives to crawl if you have more than 5 million files.

You will be able to specify these values during Glean setup on the Glean Connectors Setup pages. Any datasources not explicitly mentioned will be crawled without limitation.
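The per-app thresholds listed above can be expressed as a small lookup. This is a hypothetical helper for planning purposes only, not part of Glean setup: the `suggested_restrictions` function and the shape of its input (item counts keyed by app name) are assumptions.

```python
from typing import Dict, List

# App -> (what to select, item count above which selection is recommended).
# Counts are repositories for GitHub, docs for Confluence, files for Box.
THRESHOLDS = {
    "GitHub": ("specific repositories", 50),
    "Confluence": ("specific spaces", 2_000_000),
    "Box": ("specific folders or drives", 5_000_000),
}


def suggested_restrictions(item_counts: Dict[str, int]) -> List[str]:
    """List the document restrictions that apply for the apps in use."""
    suggestions = []
    # SharePoint site selection is required whenever SharePoint is used.
    if "SharePoint" in item_counts:
        suggestions.append("SharePoint: select specific sites (required)")
    for source, (scope, threshold) in THRESHOLDS.items():
        if item_counts.get(source, 0) > threshold:
            suggestions.append(f"{source}: select {scope} (recommended)")
    return suggestions


example = suggested_restrictions(
    {"SharePoint": 1200, "GitHub": 75, "Confluence": 400_000}
)
```

In this example, SharePoint and GitHub produce suggestions, while Confluence stays below its 2 million doc threshold and Box is not in use.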

Restrictions Applied in Pilot Mode

Datasource           Strategic Restriction                                Required or Recommended
Google Drive         documents created or modified in the last 12 months  Required
OneDrive             documents owned by users of the pilot access group   Required
Box                  documents created or modified in the last 12 months  Required
Slack                messages created or modified in the last 12 months   Required
SlackEnterpriseGrid  messages created or modified in the last 12 months   Required
Jira                 documents created or modified in the last 12 months  Required
SharePoint           customer-specified greenlist of specific sites       Required
GitHub               customer-specified greenlist of repositories         Recommended
Confluence           customer-specified greenlist of spaces               Recommended

Any datasources not explicitly mentioned will be crawled without limitation.
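For reference, the restrictions listed above can be held in a lookup that falls back to "no restriction" for any datasource that is not listed. This is only a sketch of the data: the `PILOT_RESTRICTIONS` mapping and the `restriction_for` function are hypothetical names, not a Glean configuration format.

```python
from typing import Dict, Optional, Tuple

# Datasource -> (restriction, "Required" or "Recommended"), as listed above.
PILOT_RESTRICTIONS: Dict[str, Tuple[str, str]] = {
    "Google Drive": ("created or modified in the last 12 months", "Required"),
    "OneDrive": ("owned by users of the pilot access group", "Required"),
    "Box": ("created or modified in the last 12 months", "Required"),
    "Slack": ("created or modified in the last 12 months", "Required"),
    "SlackEnterpriseGrid": ("created or modified in the last 12 months", "Required"),
    "Jira": ("created or modified in the last 12 months", "Required"),
    "SharePoint": ("customer-specified greenlist of sites", "Required"),
    "GitHub": ("customer-specified greenlist of repositories", "Recommended"),
    "Confluence": ("customer-specified greenlist of spaces", "Recommended"),
}


def restriction_for(datasource: str) -> Optional[Tuple[str, str]]:
    """Return the pilot restriction, or None if the source is unrestricted."""
    return PILOT_RESTRICTIONS.get(datasource)


jira_rule = restriction_for("Jira")
# "Zendesk" is just an example of a datasource that is not listed.
unlisted_rule = restriction_for("Zendesk")
```

A result of `None` corresponds to a datasource that is crawled without limitation.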