About Data Sources and Connectors

You will see frequent references to the terms “data source” and “connector” in the Glean documentation.

Data Sources

Data Sources are the platforms, services, or cloud apps where your data resides. These could be:

Category: Example Apps
Cloud Storage: Box, OneDrive
Email: Outlook, Gmail
Communication: Slack, Teams
Documentation: Confluence, Docusign
Ticketing & Support: Jira, Zendesk
Code & Engineering: GitHub, Bitbucket
HR: Workday, Lattice
Sales & Marketing: Salesforce, Marketo
Project Management: Asana, Monday
…and more!

Connectors

Connectors are the integrations that Glean uses to connect to your data sources and crawl data from them. Glean currently offers 100+ prebuilt connectors for the data sources in use at your company.

Connectors typically pull data from your data sources securely over an API, but may also receive data from your data sources via a webhook.

Select a Data Source to Connect

Navigate to Admin Console > Data sources and click the Add app button at the top-right.

Select the data source that you want to connect Glean to and follow the instructions that are presented on-screen.

Connector configuration is typically done via OAuth, by installing Glean from your app's marketplace or store (e.g. Atlassian Marketplace, Box App Center), or both.

As part of the setup flow for each connector, your API credentials and permissions will be validated.

For each piece of data within a data source, Glean will crawl three things:

  1. The contents of the asset itself (e.g. spreadsheet, document, message, email, or event)
  2. Access permissions for the item (i.e. which users have access to it)
  3. Activities performed on the item (i.e. when the item was created, posted, modified, or viewed, and by which users)
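The three crawled elements can be pictured as a single record per item. The following is only an illustrative sketch: the class names, field names, and the `visible_to` helper are hypothetical and do not represent Glean's actual data model or API.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Activity:
    """One event on an item (hypothetical shape)."""
    kind: str       # e.g. "created", "modified", "viewed"
    user: str       # who performed the activity
    at: datetime    # when it happened


@dataclass
class CrawledItem:
    """Hypothetical record holding the three things crawled per item."""
    content: str                                               # 1. contents of the asset
    allowed_users: set[str] = field(default_factory=set)       # 2. access permissions
    activities: list[Activity] = field(default_factory=list)   # 3. activity history

    def visible_to(self, user: str) -> bool:
        """Assumed use of the permissions: check whether a user has access."""
        return user in self.allowed_users


item = CrawledItem(
    content="Q3 planning notes",
    allowed_users={"ana@example.com"},
    activities=[Activity("created", "ana@example.com", datetime(2024, 1, 15))],
)
print(item.visible_to("ana@example.com"))
print(item.visible_to("bob@example.com"))
```

Keeping permissions and activities next to the content in one record is a design choice made for this sketch; it simply shows that all three are collected for every item.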

Glean requests only the minimum permissions needed to crawl these three things. The exact permissions vary between data sources, depending on the capabilities of the API provided by each cloud service.

Apply the API access permissions exactly as specified in the setup documentation.

Failure to set the correct API access permissions will cause your Glean crawl to fail or data to be missing from Glean.

(Optional) Apply Crawling Restrictions

If you would like to restrict the content that Glean crawls, DO NOT start crawling immediately after saving the connector configuration. Apply the restrictions first.

Crawling restrictions can be applied from Admin Console > Data sources once the initial configuration for the data source has been saved.

The restrictions that are supported vary between apps, but most data sources support at least two of the following restrictions:

  1. Time-based restrictions (e.g. only crawl content created or accessed in the last 6 months)
  2. User-based restrictions (e.g. only crawl content from the specified users)
  3. Group-based restrictions (e.g. only crawl content from the specified AD group)
  4. Site/channel-based restrictions (e.g. only crawl content from the specified site or channel)
  5. Folder-based restrictions (e.g. only crawl content from within the specified folders)

Most apps support both greenlisting (explicit inclusion) and redlisting (explicit exclusion).
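To make the interaction between these restriction types concrete, here is a minimal sketch of how a time-based limit, a folder greenlist, and a folder redlist might combine into a single crawl decision. Everything in it is an assumption for illustration: the `Restrictions` class, its field names, the `should_crawl` function, and the rule that an exclusion takes precedence over an inclusion are hypothetical and not taken from Glean's implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Restrictions:
    """Hypothetical crawl restrictions; field names are illustrative only."""
    max_age: Optional[timedelta] = None                      # time-based restriction
    include_folders: set[str] = field(default_factory=set)   # greenlist
    exclude_folders: set[str] = field(default_factory=set)   # redlist


def should_crawl(folder: str, last_accessed: datetime,
                 rules: Restrictions, now: datetime) -> bool:
    """Return True if an item passes every configured restriction."""
    if rules.max_age is not None and now - last_accessed > rules.max_age:
        return False  # older than the time window
    if folder in rules.exclude_folders:
        return False  # explicitly excluded (assumed to win over inclusion)
    if rules.include_folders and folder not in rules.include_folders:
        return False  # a greenlist exists and this folder is not on it
    return True


now = datetime(2024, 6, 30)
rules = Restrictions(max_age=timedelta(days=180),
                     exclude_folders={"/hr/private"})

print(should_crawl("/eng/docs", datetime(2024, 5, 1), rules, now))
print(should_crawl("/hr/private", datetime(2024, 5, 1), rules, now))
```

An empty greenlist is treated here as "no inclusion restriction", so content is crawled unless another rule rejects it. The real behavior for a given app should be confirmed in its setup documentation.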

Not all crawling restrictions are available in the UI: some can only be applied by Glean. Contact your Glean account team or Glean support for additional information.

Start Crawling

Once you have connected your data source, you can initiate the crawl for it.

Crawling is the process in which Glean sifts through the data in each of your connected apps and indexes it for search.

To start the crawl, click the Start crawl button after setting up the connector configuration.

You can also start the crawl later by selecting the app under Admin Console > Data sources, and selecting Start crawl.

Crawling can be started once the connector configuration is saved

Checking the Crawl Status

You can check the status of your crawl at any time by going to Admin Console > Data sources and reviewing the table of configured apps.

Here, you will see information about the progress of the crawl, including how many documents have been indexed and any errors that may have occurred.

Admin Console > Data sources will display a list of all connected apps and their crawl status

For crawls of large data sources, or data sources with low rate limits, it is normal for the document count to be low initially and then increase rapidly over a few days.

If the document count remains low after a few days, please check the permissions granted to the Glean connector and contact Glean support.