Overview

  • Glean requires authentication to the customer’s Databricks instance in order to fetch relevant Databricks permissions and content.
  • Glean authenticates using the OAuth client credentials flow. The connector fetches a new access token every half hour, with refresh handled by an automated crawl task; a sketch of this exchange appears after this list.
  • Glean understands all user access permissions and strictly enforces permissions for users at the time of the query. This ensures that users cannot see results they do not have access to.
  • Note that all data is stored in a GCP project inside the customer’s cloud account, and no data leaves the customer’s environment.
  • The user setting up this data source must be a Databricks Account Admin or have credentials for an existing Service Principal granted the Account Admin role.
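
For illustration, the token exchange in the client credentials flow looks roughly like the sketch below. The endpoint path and "all-apis" scope follow Databricks' documented OAuth machine-to-machine flow; the account host, account ID, and credentials are placeholders, and Glean's internal implementation may differ in detail.

    import requests

    ACCOUNT_URL = "https://accounts.cloud.databricks.com"  # accounts console host for your cloud provider
    ACCOUNT_ID = "<your-account-id>"
    CLIENT_ID = "<service-principal-client-id>"
    CLIENT_SECRET = "<oauth-secret>"

    def fetch_access_token():
        """Exchange the service principal's client credentials for a short-lived access token."""
        resp = requests.post(
            f"{ACCOUNT_URL}/oidc/accounts/{ACCOUNT_ID}/v1/token",
            auth=(CLIENT_ID, CLIENT_SECRET),
            data={"grant_type": "client_credentials", "scope": "all-apis"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()  # includes access_token and expires_in (typically one hour)

    token = fetch_access_token()
    print("token expires in", token.get("expires_in"), "seconds")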

Integration Features

For Databricks, Glean indexes the following content and associated permissioning:
  • Dashboards
  • Account Users and Groups
  • Workspace Entitlements
Note: Glean only crawls the metadata of these documents (id, name, owner, description, createdAt/updatedAt); an illustrative example of this scope is shown below. We do not crawl the actual content of notebooks or other sensitive data. Databricks is available across all three major cloud providers (AWS, GCP, Azure), and the Glean-Databricks integration has been engineered to work with all of them. Only Azure requires special configuration due to partial API support.
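
As a purely hypothetical illustration of that metadata scope, an indexed dashboard record would carry only fields like these (the field names mirror the list above; the values and overall shape are illustrative, not Glean's actual schema):

    # Hypothetical shape of an indexed dashboard record: metadata only, no notebook or query content.
    dashboard_metadata = {
        "id": "4e2b8f1a-0c3d-4b7e-9a1f-000000000000",
        "name": "Quarterly Revenue Dashboard",
        "owner": "jane.doe@example.com",
        "description": "Revenue rollups by region",
        "createdAt": "2024-01-15T09:30:00Z",
        "updatedAt": "2024-03-02T17:05:00Z",
    }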

API Usage

Glean uses the Databricks REST API to ingest all data:
  • Glean first enumerates the users who are part of the Databricks instance along with their associated roles and permissions.
  • Glean crawls all dashboards, workspaces, users, groups, SQL warehouses, jobs, models, tables, and clusters in the customer’s Databricks instance using the admin credentials granted in the initial authentication flow.
  • Account Admin is needed for listing account-level users, groups, and workspaces. Workspace admin is needed to crawl dashboard data and permissions.
  • These access policies are then enforced so that permissions for Databricks content in Glean match the original permissions in Databricks.
Glean performs a full crawl of the content every 12 hours and fetches all Databricks account users, groups, and workspace entitlements every hour; the sketch below shows the account-level calls involved.
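
The account-level listing calls that this crawl relies on look roughly like the sketch below. The endpoints shown are the public Databricks Accounts API (SCIM users/groups and workspaces); the host, account ID, and bearer token are placeholders, and the details of Glean's crawler are simplified here.

    import requests

    ACCOUNT_URL = "https://accounts.cloud.databricks.com"  # accounts console host for your cloud provider
    ACCOUNT_ID = "<your-account-id>"
    HEADERS = {"Authorization": "Bearer <access-token>"}    # token from the client credentials flow above

    def list_account(resource):
        """List an account-level resource (SCIM users/groups or workspaces)."""
        resp = requests.get(
            f"{ACCOUNT_URL}/api/2.0/accounts/{ACCOUNT_ID}/{resource}",
            headers=HEADERS,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    users = list_account("scim/v2/Users")       # account users
    groups = list_account("scim/v2/Groups")     # account groups
    workspaces = list_account("workspaces")     # all workspaces in the account

    # Dashboards, permissions, and other workspace content are then crawled per workspace,
    # e.g. GET https://<workspace-host>/api/2.0/preview/sql/dashboards with a workspace-scoped token.
    print(users.get("totalResults"), "users;", len(workspaces), "workspaces")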

Setup

Prerequisites

The user setting up this data source must be a Databricks Account Admin or have credentials for an existing Service Principal granted the Account Admin role.

Installation Process

Instructions are provided onscreen in the deployment console, but a summary is also provided here:
  • Determine the URL of your Databricks instance:
    • Sign into Databricks as an Account Admin by visiting the link corresponding to your cloud provider
    • Wait until the page loads with the parameter account_id, then copy the entire URL and paste it into the Account URL field within Glean
  • Generate a new service principal:
    • While logged in as an Account Admin, navigate to the “User management” page in Databricks and click on the “Service principals” tab
    • Click “Add service principal” and enter “Glean” into the Name field before pressing “Add” to continue
    • Open the service principal you just created
    • Navigate to the “Roles” tab and make sure the “Account Admin” role is toggled on before continuing
  • Add an OAuth secret to your new service principal:
    • Go to the “Principal information” tab, and under “OAuth secrets”, click “Generate secret” and set the lifetime to 720 days
    • Click “Generate” then copy and paste the Secret field value into the OAuth Secret field within Glean
    • Copy the Client ID field value and paste it into the OAuth Client ID field within Glean
  • (For Azure Databricks Workspaces only) Enter your workspace URLs:
    • Navigate back to the “Workspaces” tab in Databricks
    • For each workspace listed on this page:
      • Click “Add additional workspace” within Glean
      • In Databricks, click on a workspace name to view the workspace configuration details
      • Copy the workspace name and paste it into the Workspace name field within Glean
      • Next, copy the URL value under the URL field and paste it into the Workspace URL field within Glean
Click Save in Glean. You’re all set!

Post Setup

Glean can be configured so that only a specified set of users has access to Databricks search results.

Troubleshooting

Common Issues

Token Expiration Errors

  • Ensure the crawl schedule is maintained and that token refresh runs at its half-hourly frequency.
  • Validate service principal credentials and role assignments if tokens fail to generate.

Missing Crawled Objects

  • Double-check that the service principal has the correct admin access and confirm admin assignments to all workspaces, especially if new workspaces were recently added.
  • Glean is an eventually consistent system, so recently added content may not appear until the next full crawl completes.

Crawl Duration Too High/Low

  • When crawling many dashboards/workspaces, the total crawl duration may approach the configured crawl period. Consider increasing crawl.dashboards.fullCrawlPeriodSecs, which is set to 12 hours.
  • Conversely, if the number of dashboards is small, the period can be reduced based on the estimate below.
  • NOTE: Crawling 100k dashboards takes around 6 hours, so crawl time can be estimated as (number_of_dashboards / 100,000) × 6 hours.
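
That rule of thumb works out to the linear estimate below (pure arithmetic, no Databricks calls):

    def estimated_crawl_hours(num_dashboards: int) -> float:
        """Estimate full-crawl duration from the rule of thumb: ~6 hours per 100,000 dashboards."""
        return num_dashboards / 100_000 * 6

    print(estimated_crawl_hours(25_000))   # ~1.5 hours
    print(estimated_crawl_hours(100_000))  # ~6.0 hours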

Azure Workspace Not Found

  • Confirm that workspace URLs are correct and all required workspaces are listed in the azure_workspace_urls config for Azure deployments.
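
As an illustration only (the configuration itself is managed through Glean's setup screens, and the hostnames below are made up), the Azure workspace list amounts to an array of workspace URLs, which you can sanity-check for reachability before troubleshooting further:

    import requests

    # Hypothetical Azure Databricks workspace URLs; replace with the workspaces in your account.
    azure_workspace_urls = [
        "https://adb-1111111111111111.1.azuredatabricks.net",
        "https://adb-2222222222222222.2.azuredatabricks.net",
    ]

    for url in azure_workspace_urls:
        status = requests.get(url, timeout=15, allow_redirects=True).status_code
        print(url, "->", status)  # any HTTP response confirms the URL resolves to a live workspace host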

Error codes from API

  • Some failed calls are expected short-term if crawling happens before admin assignment completes for a new workspace; these resolve on subsequent full crawls.

Workspace groups and dashboards crawl failing due to 403

  • Customers can restrict API access to their Databricks workspaces to specific IP addresses, which causes these failures. To resolve this, they must allowlist Glean's crawler IP addresses (GCP/AWS) on their Databricks workspaces.

FAQs

Why are account_admin and workspace_admin required?

Account Admin is needed for listing account-level users, groups, and workspaces. Workspace admin is needed to crawl dashboard data and permissions. No workaround is available at present until Databricks supports granular permissions.

How does authentication/token refresh work?

The connector uses OAuth client credentials to fetch a new access token every half an hour, with refresh handled by an automated crawl task.

Why is my new workspace/dashboard missing?

If the workspace was added after the admin assignment task last ran, it may be missing. It will be picked up in the next full crawl.

Do I need to configure anything special for Azure?

Yes, since Azure does not have a list-workspaces API, you must provide an array of workspace URLs.

Can I crawl Databricks on GCP/AWS/Azure?

Yes, Databricks on AWS, GCP, and Azure are all supported. Only Azure requires special configuration and may have partial API support. For any questions or issues with this setup, please reach out to support@glean.com.