Box

The Box connector for Glean allows Glean to fetch and index content from Box, ensuring that users can search and access documents for which they have authorized permissions.

  • Authentication: A Box Admin authorizes Glean via OAuth 2.0 during setup of the Glean crawler.

  • Data storage: All data is stored within the customer's cloud account, ensuring no data leaves the customer's environment.

API usage

  • Standard API: Glean uses Box's standard API to ingest all data.

Integration features

  • Content captured: Glean captures Box folders, files, Box Notes, and file comments.

  • Permissions enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they can access. When a user clicks on a search result, they are taken to the Box web application, which enforces the permission.

Supported versions

There are no specific version limitations of the Box connector.

Supported objects

The Box connector for Glean supports the following objects:

  1. Folders
  2. Files (including slides, word documents, etc.)
  3. Box notes
  4. File comments

Authentication mechanism

During configuration, Glean initiates an authorization flow with Box to obtain both a refresh token and an access token for the customer’s Box account. The access token, which is used to make API calls, has a limited validity period. Glean therefore runs a scheduled refresh process that uses the refresh token from the initial authorization to obtain new access tokens as needed.
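As a rough illustration of the refresh step described above, the sketch below builds the form body for a standard OAuth 2.0 refresh-token grant against Box's token endpoint and decides when a refresh is due. The function names, the 300-second safety margin, and the refresh policy are assumptions made for this example; they are not Glean's actual implementation.

```python
from urllib.parse import urlencode

# Box's OAuth 2.0 token endpoint.
TOKEN_URL = "https://api.box.com/oauth2/token"


def build_refresh_request(client_id: str, client_secret: str, refresh_token: str) -> bytes:
    """Return the form-encoded body for an OAuth 2.0 refresh_token grant."""
    return urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")


def needs_refresh(expires_at: float, now: float, margin_s: float = 300.0) -> bool:
    """True when the access token expires within margin_s seconds.

    The margin is a hypothetical policy chosen for this sketch.
    """
    return now >= expires_at - margin_s
```

The body would be sent as a POST to TOKEN_URL with the Content-Type application/x-www-form-urlencoded. A scheduled job could call needs_refresh with the stored expiry time and the current time, and refresh only when it returns True.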

Requirements

Technical requirements

To use the Box connector, you must have the following:

  • A Box enterprise account.
  • A Box Admin account that can authorize an OAuth 2.0 application for your enterprise.
  • A Glean deployment with access to the Admin console to add and configure data sources.

Glean connects to Box via Box’s REST APIs over HTTPS and stores indexed data in your organization’s Glean deployment.

Permission requirements

To configure the connector end to end, you must be both of the following:

  • A Box Admin for the relevant enterprise (to authorize the integration and grant required scopes).
  • A Glean admin with permissions to add and configure data sources.

Co-admins are not supported. Co-admins may lack the necessary privileges (e.g., managing users, running reports) to provide full content access, which can prevent Glean from crawling all expected information.

Configuration and setup

  • Glean uses Box’s file download API as part of the crawl. If your Box organization is configured to send email alerts for suspicious download activity, Glean recommends working with Box Support to suppress notifications for the client ID used by the Glean integration.
  • If you do not suppress these notifications, users across your Box organization may receive unnecessary download alerts generated by the crawler.

Box connector setup

  1. In the Glean Admin console, go to Data sources > Add data source.
  2. Search for Box and select the Box connector.
  3. Enter a Name and optional Icon for the Box data source. This label appears to users in search results.
  4. Click Authorize.

Authentication scope requirements

Scope | Purpose
Read all files/folders in Box | List all files from user drives
Read and write all files and folders stored in Box | Required to download file content into Glean (despite saying write).
Manage users | List users and associated group memberships
Manage groups | List all groups
Manage enterprise properties | Crawl recent enterprise logs activity to ingest newly created/modified data
Admin can make calls on behalf of Users | Use the As-User header, to distribute rate limits between different owners of files.
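To illustrate the last scope, a request made on behalf of a file owner carries Box's As-User header alongside the bearer token. The helper below is a minimal sketch; its name and the placeholder values are hypothetical.

```python
def as_user_headers(access_token: str, user_id: str) -> dict[str, str]:
    """Build request headers that ask Box to act as the given user.

    The admin's access token authenticates the call; the As-User header
    names the Box user ID the request should be attributed to.
    """
    return {
        "Authorization": f"Bearer {access_token}",
        "As-User": user_id,
    }
```

Because Box applies rate limits per user, attributing each request to the owner of the file being fetched spreads the load across owners instead of concentrating it on the admin account.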

Items crawled

Content

For Box, Glean indexes the following content and associated permissions:

  • Folders
  • Files (e.g. slides, word documents, etc.)
  • Box Notes
  • Comments on the files

Identity

  • Users: Information about users within Box
  • Groups: Details about groups within Box

Activity

  • Adds: New files or folders added to Box.
  • Updates: Modifications made to existing files or folders.
  • Permissions changes: Changes in file or folder sharing permissions.
  • Deletions: Files or folders that have been deleted.
  • View activity: Events indicating when a file or folder has been viewed via Glean.

The activity crawl operates with the following configurations:

  • Incremental activity crawls: These are performed every minute to capture recent changes.
  • Full activity crawls: These are conducted periodically to ensure all activity data is up-to-date.

Rate limits

Glean is restricted to a maximum of 16 QPS per individual user. Glean distributes all users across 10 distinct queues for an initial maximum of 160 QPS.
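The arithmetic behind these figures can be sketched as follows. The 16 QPS limit and the 10 queues come from the text above; the hash-based assignment of users to queues is an assumption made for illustration, since the actual assignment scheme is not described here.

```python
import zlib

NUM_QUEUES = 10     # from the text above
PER_USER_QPS = 16   # from the text above


def queue_for_user(user_id: str, num_queues: int = NUM_QUEUES) -> int:
    """Map a Box user ID to a queue index with a stable hash (assumed scheme)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_queues


def aggregate_qps_ceiling(num_queues: int = NUM_QUEUES,
                          per_queue_qps: int = PER_USER_QPS) -> int:
    """Upper bound on total request rate when every queue runs at the per-user cap."""
    return num_queues * per_queue_qps
```

With the defaults, aggregate_qps_ceiling() returns 160, matching the initial maximum stated above. A stable hash keeps each user in the same queue across runs, so one user's requests never exceed the per-user cap by appearing in two queues at once.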

Update frequency

Content updates from Box reach Glean at different intervals, depending on the type of update and the configuration settings. The key areas are:

  • Activity reports: Adds, updates, and permissions changes are crawled every minute. This means that any new files, modifications to existing files, or changes in sharing permissions are detected and processed quickly.
  • Identity crawls for User Group Memberships: Modifications to group memberships are detected by the identity crawl, which operates hourly. This mechanism ensures that updates concerning user groups and their corresponding permissions are promptly reflected.
  • Incremental crawls: These occur every 10 minutes to provide additional reliability beyond the minute-by-minute activity reports.
  • Full crawls: The frequency of full crawls is configurable, but they run less often than incremental crawls, generally every 28 days.

Changes in data must be crawled, processed, and indexed before the data is reflected in the UI. Actual time may vary depending on the number of changes and corpus size. For the most up-to-date crawler refresh information, please refer to Glean's Crawling Strategy.

How crawl works

The connector follows Glean's standard crawling strategy, using the Box API in the following ways to fetch and update data:

  • Identity crawl: Adds and updates People data, including users, groups, and other information.
  • Activity crawl: Captures adds, updates, and permissions changes to content.
  • Webhooks: The system uses the API to identify new, modified, and deleted documents.
  • Content crawls: Full crawls cover the entire defined scope of the application, whereas incremental crawls capture only the changes since the previous full or incremental crawl.

Known limitations in crawl

  • Box has a per-user limit for API requests, which Glean relies on for crawling. Glean runs into issues with this limit when a large number of documents are owned by a single service account. This can occur when customers do large migrations from on-premises to cloud. Box itself recommends using a single service account.

  • The user setting up the connector must be a Box Admin. Co-admins do not have the necessary access permissions, which means they cannot access other co-admins' items, leading to incomplete crawls.

Glean does not currently index:

  • Box web links
  • Custom metadata set on folders/files
  • Favorites collections

API endpoints

Glean crawls and indexes content using the Box API endpoints listed below. The Glean application is available through the Box App Center.

Authentication endpoints

Use case | Endpoint | Description | Documentation
Refresh access token | https://api.box.com/oauth2/token | Refresh an Access Token using its client ID, secret, and refresh token. | Refresh access token - Box API

Identity endpoints

Use case | Endpoint | Description | Documentation
List enterprise users | https://api.box.com/2.0/users | Determine which users (and associated content) need to be indexed. | List enterprise users - Box API
List groups for enterprise | https://api.box.com/2.0/groups | Fetch all groups within a tenant (for permissions). | List groups for enterprise - Box API
List members of group | https://api.box.com/2.0/groups/:group_id/memberships | Determine which users are members of which group (for permissions). | List members of group - Box API
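As an illustration of how an identity crawl might page through the users endpoint, the sketch below walks offset-based pages until every user has been seen. The `fetch_page` callable is a hypothetical stand-in for an authenticated HTTP GET that returns the parsed JSON response; the `offset`, `limit`, `entries`, and `total_count` names follow Box's offset-based pagination, and the page size of 100 is an arbitrary choice for the example.

```python
from typing import Callable, Iterator

USERS_URL = "https://api.box.com/2.0/users"


def iter_enterprise_users(
    fetch_page: Callable[[str, dict], dict],
    page_size: int = 100,
) -> Iterator[dict]:
    """Yield every user entry, following offset-based pagination.

    fetch_page(url, params) is a hypothetical helper that performs the
    authenticated request and returns the decoded JSON body.
    """
    offset = 0
    while True:
        page = fetch_page(USERS_URL, {"offset": offset, "limit": page_size})
        entries = page.get("entries", [])
        yield from entries
        offset += len(entries)
        # Stop on an empty page or once all reported users have been read.
        if not entries or offset >= page.get("total_count", 0):
            break
```

Injecting `fetch_page` keeps the pagination logic separate from authentication and retry handling, so it can be exercised without network access.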

Content endpoints

Use case | Endpoint | Description | Documentation
List items in folder | https://api.box.com/2.0/folders/:folder_id/items | List all items and content within a folder for indexing. | List items in folder - Box API
Get file information | https://api.box.com/2.0/files/:file_id | Retrieve metadata for each specific item for indexing. | Get file information - Box API
List file collaborations | https://api.box.com/2.0/files/:file_id/collaborations | Retrieve a list of all users with access to an item (for permissions). | List file collaborations - Box API
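A content crawl built on the folder items endpoint amounts to a tree walk. The sketch below is a minimal version under stated assumptions: `list_items` is a hypothetical helper that returns all entries of one folder (pagination and authentication are omitted), each entry has the `type` and `id` fields that Box returns, and "0" is Box's ID for a user's root folder.

```python
from typing import Callable, Iterable, Iterator


def walk_files(
    list_items: Callable[[str], Iterable[dict]],
    root_id: str = "0",
) -> Iterator[str]:
    """Yield the ID of every file reachable from root_id.

    Uses an explicit stack instead of recursion so that deeply nested
    folder trees cannot exhaust the call stack.
    """
    stack = [root_id]
    while stack:
        folder_id = stack.pop()
        for item in list_items(folder_id):
            if item["type"] == "folder":
                stack.append(item["id"])
            elif item["type"] == "file":
                yield item["id"]
            # Other item types, such as web links, are skipped.
```

Each yielded file ID could then be passed to the file information and collaborations endpoints to collect metadata and permissions.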

Activity endpoints

Use case | Endpoint | Description | Documentation
List user and enterprise events | https://api.box.com/2.0/events | Fetch activity data for each user for ranking signals (12 month limit). | List user and enterprise events - Box API
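The events endpoint is read as a stream: each response carries a position to resume from on the next call. The sketch below drains the stream until an empty page is returned. It assumes the enterprise event stream (`stream_type=admin_logs`), which is a guess based on the "Manage enterprise properties" scope; `fetch_page` is a hypothetical helper that performs the authenticated request and returns the parsed JSON.

```python
from typing import Callable

EVENTS_URL = "https://api.box.com/2.0/events"


def drain_events(
    fetch_page: Callable[[str, dict], dict],
    stream_position: str = "0",
) -> tuple[list[dict], str]:
    """Collect events from stream_position onward.

    Returns the events plus the position to store for the next
    incremental crawl.
    """
    events: list[dict] = []
    while True:
        page = fetch_page(EVENTS_URL, {
            "stream_type": "admin_logs",  # assumed stream type
            "stream_position": stream_position,
        })
        entries = page.get("entries", [])
        events.extend(entries)
        stream_position = str(page.get("next_stream_position", stream_position))
        if not entries:  # an empty page means the stream is caught up
            return events, stream_position
```

Persisting the returned position between runs is what would make each activity crawl incremental: the next run resumes where the previous one stopped instead of re-reading old events.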

Content configuration

Note: If Inclusion (Green-Listing) options are enabled, only content from the Inclusion category will be indexed. If Exclusion (Red-Listing) options are enabled, all content in the exclusion category will be removed. If both rules are applied to the same content, then the content will NOT be indexed, as exclusion rules take priority.

The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply exclusion rules sparingly for sensitive folders.

Exclusion (Red-Listing) options

Glean provides several options for excluding content from the data crawl, which excludes data from search and chat results.

  • Users: Exclude content belonging to specific users from being crawled. How to find the User ID
  • Folders: Exclude content belonging to specific folders from being crawled.

Inclusion (Green-Listing) options

Glean provides several options for including content in the data crawl, which limits search and chat results to that content.

  • Files: Include specific file content in the crawl.
  • Folders: Include content belonging to specific folders in the crawl.
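The precedence described in the note at the start of this section (exclusion wins over inclusion) can be sketched as a small decision function. The `CrawlRules` container and the function name are hypothetical, and treating a folder rule as applying to every file beneath that folder is an assumption made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlRules:
    """Hypothetical container for the inclusion and exclusion options."""
    include_files: set = field(default_factory=set)
    include_folders: set = field(default_factory=set)
    exclude_users: set = field(default_factory=set)
    exclude_folders: set = field(default_factory=set)


def should_index(file_id: str, folder_ids: list, owner_id: str, rules: CrawlRules) -> bool:
    """Decide whether a file is indexed.

    folder_ids lists the folders containing the file (assumed to include
    all ancestors). Exclusion is checked first, so it always takes priority.
    """
    folders = set(folder_ids)
    if owner_id in rules.exclude_users or folders & rules.exclude_folders:
        return False
    # If any inclusion rule is set, only matching content is indexed.
    if rules.include_files or rules.include_folders:
        return file_id in rules.include_files or bool(folders & rules.include_folders)
    # No inclusion rules: everything that is not excluded is indexed.
    return True
```

For example, with folder "100" included and folder "200" excluded, a file that sits under both is not indexed, while a file under "100" alone is.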