
About Data Sources and Connectors

You will see frequent references to the terms “data source” and “connector” in the Glean documentation.

Data Sources

Data Sources are the platforms, services, or cloud apps where your data resides. These could be:
  • Cloud Storage: Box, OneDrive
  • Email: Outlook, Gmail
  • Communication: Slack, Teams
  • Documentation: Confluence, Docusign
  • Ticketing & Support: Jira, Zendesk
  • Code & Engineering: GitHub, BitBucket
  • HR: Workday, Lattice
  • Sales & Marketing: Salesforce, Marketo
  • Project Management: Asana, Monday
…and more!

Connectors

Connectors are the tools/integrations that Glean uses to connect to your data sources and crawl data from them. Today, Glean has 100+ connectors already built to allow you to connect to different data sources in use at your company. Connectors typically pull data from your data sources securely over API, but may also receive data from your data sources via a webhook.

Select a Data Source to Connect

Navigate to Admin Console > Data sources and click the Add app button at the top-right.
Select the data source that you want to connect Glean to and follow the instructions that are presented on-screen.
Configure the connector as per the Glean instructions
Connector configuration is typically achieved via OAuth and/or by installing Glean from your app’s marketplace/store (e.g. Atlassian Marketplace, Box App Center). As part of the setup flow for each connector, your API credentials and permissions will be validated. For each piece of data within a data source, Glean will crawl three things:
  1. The contents of the asset itself (e.g. a spreadsheet, document, message, email, or event)
  2. Access permissions for the item (i.e. which users have access to it)
  3. Activities performed on the item (i.e. when the item was created/posted/modified/viewed, and by which users)
Glean asks only for the minimal permissions needed to perform the above; however, this varies between data sources based on the capabilities of the API provided by the cloud service.
You must apply any API access permissions in the setup documents exactly as referenced. Failure to set the correct API access permissions will cause your Glean crawl to fail, or data to be missing from Glean.
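The three facets crawled for each item can be sketched as a simple data model. This is an illustrative sketch only — the class and field names below are hypothetical, not Glean’s actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Permissions:
    # Which users and groups may see the item (facet 2).
    allowed_users: list[str] = field(default_factory=list)
    allowed_groups: list[str] = field(default_factory=list)

@dataclass
class Activity:
    # One action performed on the item (facet 3).
    action: str          # e.g. "created", "modified", "viewed"
    actor: str           # the user who performed the action
    timestamp: datetime

@dataclass
class CrawledItem:
    content: str                 # facet 1: the asset's contents
    permissions: Permissions     # facet 2: who has access
    activities: list[Activity]   # facet 3: what happened, when, and by whom
```

Keeping permissions alongside content is what lets a search index enforce that users only see results they already have access to in the source system.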

(Optional) Apply Crawling Restrictions

If you would like to restrict the content that Glean crawls, DO NOT start crawling after saving the connector configuration. Crawling restrictions can be applied from Admin Console > Data sources once the initial configuration for the data source has been saved. The restrictions that are supported vary between apps, but most data sources support at least two of the following:
  1. Time-based restrictions (e.g. only crawl content created or accessed in the last 6 months)
  2. User-based restrictions (e.g. only crawl content from the specified users)
  3. Group-based restrictions (e.g. only crawl content from the specified AD group)
  4. Site/channel-based restrictions (e.g. only crawl content from the specified site or channel)
  5. Folder-based restrictions (e.g. only crawl content from within the specified folders)
For most apps, both greenlisting (explicit inclusion) and redlisting (explicit exclusion) are supported.
Not all crawling restrictions are available in the UI: some can only be applied by Glean. Contact your Glean account team or Glean support for additional information.
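To make the combination of restrictions concrete, here is a hypothetical sketch of how time-based and user-based rules might be evaluated together. Glean’s actual restriction logic is internal; the field names and structure below are illustrative assumptions:

```python
from datetime import datetime, timedelta

def should_crawl(item: dict, restrictions: dict) -> bool:
    """Return True only if an item passes every configured restriction.

    `item` is assumed to carry "last_accessed" (datetime) and "owner" (str);
    `restrictions` may carry "max_age_days", "greenlisted_users", and
    "redlisted_users". All names here are hypothetical.
    """
    # Time-based restriction: skip content older than the cutoff.
    max_age = restrictions.get("max_age_days")
    if max_age is not None:
        cutoff = datetime.now() - timedelta(days=max_age)
        if item["last_accessed"] < cutoff:
            return False

    # Greenlist (explicit inclusion): if set, only listed users are crawled.
    greenlist = restrictions.get("greenlisted_users")
    if greenlist is not None and item["owner"] not in greenlist:
        return False

    # Redlist (explicit exclusion): listed users are never crawled.
    if item["owner"] in restrictions.get("redlisted_users", []):
        return False

    return True
```

Note that every restriction must pass: restrictions narrow the crawl, so adding one can only exclude more content, never include extra content.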

Start Crawling

Once you have connected your data source, you can initiate the crawl for it. Crawling is the process by which Glean sifts through the data in each of your connected apps and indexes it for search. To start the crawl, click the Start crawl button after setting up the connector configuration. You can also start the crawl later by selecting the app under Admin Console > Data sources and selecting Start crawl.

Checking the Crawl Status

You can check the status of your crawl at any time by going to Admin Console > Data sources and reviewing the table of configured apps. The data sources page organizes your connectors into sections based on their sync status:

Initial sync in progress

When a data source is undergoing its initial sync, it appears in this section and progresses through two distinct phases:
  • Crawling (step 1/2) - The data source appears here while Glean is actively fetching content, metadata, and permissions from the source. You’ll see the Items synced count increase as content is retrieved.
  • Indexing (step 2/2) - After crawling completes or as crawling nears completion, the data source moves to this phase while Glean processes the crawled content and incorporates it into the Knowledge Graph.
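The progression through these phases can be modeled as a simple ordered state machine. The labels below mirror the Admin Console sections described in this document; the code structure itself is an illustrative sketch, not Glean’s implementation:

```python
from enum import Enum

class SyncPhase(Enum):
    # Labels match the sections shown in Admin Console > Data sources.
    CRAWLING = "Crawling (step 1/2)"    # fetching content, metadata, permissions
    INDEXING = "Indexing (step 2/2)"    # incorporating content into the index
    COMPLETE = "All data sources"       # fully synced and searchable

def next_phase(phase: SyncPhase) -> SyncPhase:
    """Advance one phase; a completed sync stays complete."""
    order = [SyncPhase.CRAWLING, SyncPhase.INDEXING, SyncPhase.COMPLETE]
    i = order.index(phase)
    return order[min(i + 1, len(order) - 1)]
```

The key point the model captures is that the phases are strictly ordered: a data source never becomes searchable before both crawling and indexing have finished.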

All data sources

Once both crawling and indexing are complete, the data source moves from Initial sync in progress to All data sources, where it appears alongside other fully-synced sources. At this point, the data source is fully indexed and available for search. For each data source, you’ll see:
  • Items synced - The total number of items (documents, messages, files, etc.) that have been crawled and indexed.
  • Change rate (items/day) - The number of changes (edits, additions, deletions) synced in the past 24 hours, reflecting ongoing freshness.
Status and metrics refresh on an hourly cadence. For crawls of large data sources, or data sources with low rate limits, it is normal for the Items synced count to be low initially and then increase over several days. If the count remains low after a few days, check the permissions granted to the Glean connector and contact Glean support.
The initial crawl for any data source takes time; the total duration depends on two key factors:
  1. The size of the data source (e.g. the number of documents/messages, and the size of each).
  2. The rate limit of the data source’s API.
If an API has a low rate limit, this will affect how quickly Glean can crawl it for items. Likewise, data sources containing a large number of documents, files, or messages will take longer to crawl. For a typical enterprise data source, expect the initial crawl to take anywhere from 3 days, up to 10 days for large data sources with low API rate limits. For more details on how crawling and indexing work, and how to interpret progress metrics, see the Crawling & Learning Process page.
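A back-of-envelope calculation shows how these two factors interact, assuming the API rate limit is the bottleneck. The numbers plugged in below are purely illustrative, not measurements of any real data source:

```python
def crawl_days(total_items: int, requests_per_minute: int,
               items_per_request: int = 1) -> float:
    """Days of pure API time needed to fetch every item once,
    at the given rate limit and page size."""
    requests_needed = total_items / items_per_request
    minutes = requests_needed / requests_per_minute
    return minutes / (60 * 24)
```

For example, 5 million documents at 600 requests/minute with 10 items per page works out to roughly half a day of pure fetch time; real crawls also retrieve permissions and activity for each item and then index the results, which is why wall-clock totals stretch into days.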