Box Connector Overview
Introduction
The Box connector for Glean allows Glean to fetch and index content from Box, ensuring that users can search and access documents for which they have authorized permissions.
- Authentication: The Box admin authenticates Glean via OAuth2 during the setup of the Glean crawler.
- Data Storage: All data is stored in the cloud project within the customer’s cloud account, so no data leaves the customer’s environment.
API Usage
- Standard API: Glean uses Box’s standard API to ingest all data.
Integration Features
- Content Captured: Glean captures Box folders, files, Box Notes, and comments on files.
- Permissions Enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they can access. When a user clicks on a search result, they are taken to the Box web application, which enforces the permission.
Versions Supported
There are no specific version limitations for the Box connector.
Objects Supported
The Box connector for Glean supports the following objects:
- Folders
- Files (including slides, word documents, etc.)
- Box Notes
- Comments on the files
Authentication Mechanism
During Glean configuration, an OAuth authorization flow is completed with Box, through which Glean is granted an access token and a refresh token. The access token is used to make API calls to the customer’s Box account and is valid only for a limited period. Glean periodically uses the refresh token obtained during the initial authorization to acquire a new access token.
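The refresh step can be illustrated with a minimal sketch. It only builds the request for the token endpoint listed under Authentication Endpoints and decides when a refresh is due; it does not send anything. The function names, the 60-second safety margin, and the placeholder credentials are assumptions made for illustration, not Glean's implementation.

```python
from urllib.parse import urlencode

TOKEN_URL = "https://api.box.com/oauth2/token"


def build_refresh_request(client_id: str, client_secret: str, refresh_token: str):
    """Return (url, body, headers) for an OAuth2 refresh-token grant.

    Hypothetical helper: the caller would POST `body` to `url`.
    """
    body = urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return TOKEN_URL, body, headers


def needs_refresh(issued_at: float, expires_in: float, now: float,
                  margin: float = 60.0) -> bool:
    """True once the access token is within `margin` seconds of expiring.

    The margin is an assumed value, chosen so a token does not expire mid-request.
    """
    return now >= issued_at + expires_in - margin


# Placeholder credentials, for illustration only.
url, body, headers = build_refresh_request("cid", "sec", "rt")
```

Separating "build the request" from "send the request" keeps the sketch testable without network access.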
Connector credentials requirements
- The user setting up this data source must be the Box Admin.
- Note: A co-admin account does not work. Co-admins cannot access other co-admins’ items, so Glean would not be able to crawl all of the expected information.
Connection instructions
Optional: Recommended Notification Suppression
- Glean uses the download file endpoint as part of the crawl logic. If the Box instance is set up to send email notifications for suspicious download behavior, Glean recommends contacting Box support to suppress notifications for the client ID used for the integration.
- Failure to suppress notifications may result in download notifications across the entire Box organization.
Setup in Glean
- Input the data source name in the Name text box and select an icon
- Complete any outstanding setup in Show setup instructions
- Click Authorize and follow the instructions
Authentication scope requirements
Scope | Purpose |
---|---|
Read all files/folders in Box | List all files from user drives |
Read and write all files and folders stored in Box | Required to download file content into Glean (despite the scope name including write). |
Manage users | List users and associated group memberships |
Manage groups | List all groups |
Manage enterprise properties | Crawl recent enterprise logs activity to ingest newly created/modified data |
Admin can make calls on behalf of Users | Use the As-User header to distribute rate limits between different owners of files. |
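As a rough illustration of the last scope in the table, the sketch below builds request headers that optionally carry the As-User header, so a call is attributed to a given file owner. The helper name and the example token and user ID are hypothetical.

```python
from typing import Dict, Optional


def build_headers(access_token: str, as_user_id: Optional[str] = None) -> Dict[str, str]:
    """Build Box API request headers.

    When `as_user_id` is given, the As-User header makes the call on behalf
    of that user, so it is attributed to that user rather than to the admin.
    """
    headers = {"Authorization": f"Bearer {access_token}"}
    if as_user_id is not None:
        headers["As-User"] = as_user_id
    return headers


# Example values are placeholders.
admin_headers = build_headers("TOKEN")
owner_headers = build_headers("TOKEN", as_user_id="12345")
```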
Items crawled
Content
For Box, Glean indexes the following content and associated permissions:
- Folders
- Files (e.g. slides, word documents, etc.)
- Box Notes
- Comments on the files
Identity
- Users: Information about users within Box
- Groups: Details about groups within Box
Activity
- Adds: New files or folders added to Box.
- Updates: Modifications made to existing files or folders.
- Permissions Changes: Changes in file or folder sharing permissions.
- Deletions: Files or folders that have been deleted.
- View Activity: Events indicating when a file or folder has been viewed via Glean.
The activity crawl operates with the following configurations:
- Incremental Activity Crawls: These are performed every minute to capture recent changes.
- Full Activity Crawls: These are conducted periodically to ensure all activity data is up-to-date.
Rate Limits
Glean is restricted to a maximum of 16 QPS per individual user. Glean distributes all users across 10 distinct queues for an initial maximum of 160 QPS.
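The arithmetic behind these figures, and one possible way to spread users over the queues, can be sketched as follows. The source does not say how users are assigned to queues; the CRC32-based assignment below is an assumption chosen only because it is stable across runs.

```python
import zlib

PER_USER_QPS = 16   # per-user limit stated above
NUM_QUEUES = 10     # number of queues stated above


def queue_for_user(user_id: str, num_queues: int = NUM_QUEUES) -> int:
    """Map a user ID to a queue index with a stable hash (assumed scheme)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_queues


def max_aggregate_qps(per_user_qps: int = PER_USER_QPS,
                      num_queues: int = NUM_QUEUES) -> int:
    """Upper bound when every queue is working on a different user at once."""
    return per_user_qps * num_queues


total = max_aggregate_qps()           # 16 * 10
queue = queue_for_user("user-12345")  # hypothetical user ID
```

This also shows why a single owner with many documents is a bottleneck: all of that owner's requests share one 16 QPS budget regardless of the number of queues.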
Update frequency
Content updates for the Box connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:
- Activity Reports: Adds, updates, and permissions changes are crawled every minute. This means that any new files, modifications to existing files, or changes in sharing permissions are detected and processed quickly.
- Identity Crawls for User Group Memberships: Modifications to group memberships are detected by the identity crawl, which operates hourly. This mechanism ensures that updates concerning user groups and their corresponding permissions are promptly reflected.
- Incremental Crawls: These occur every 10 minutes to provide additional reliability beyond the minute-by-minute activity reports.
- Full Crawls: The frequency of full crawls can be configured, but they run far less often than incremental crawls, generally every 28 days.
Changes in data must be crawled, processed, and indexed before the data is reflected in the UI. Actual time may vary depending on the number of changes and corpus size. For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy
How the crawl works
The crawler follows the traditional crawl strategy, using the API and the following mechanisms to fetch and update data:
- Identity Crawl: Adds and updates People data, including users, groups, and other information
- Activity Crawl: Adds, updates, and permissions changes to content
- Webhooks: The system uses the API to identify new, modified, and deleted documents
- Content Crawls: Full crawls cover the entire defined scope of the application, whereas incremental crawls capture only the changes since the previous full or incremental crawl.
Known Limitations in Crawl
- Box has a per-user limit for API requests that Glean uses for crawling. Glean runs into issues with this limit when customers have a large number of documents owned by a single service account. This can occur when customers do large migrations from on-premises to cloud. Box itself recommends using a single service account.
- The user setting up the connector must be a Box Admin. Co-admins do not have the necessary access permissions: they cannot access other co-admins’ items, which leads to incomplete crawls.
Glean does not currently index:
- Box Web Links
- Custom metadata set on folders/files
- Favorites Collections
API endpoints
Glean crawls and indexes content using the Box API endpoints listed below. The Glean application is available through the Box App Center.
Authentication Endpoints
Endpoint | Use Case | Documentation |
---|---|---|
Refresh access token https://api.box.com/oauth2/token | Refresh an Access Token using its client ID, secret, and refresh token. | Refresh access token - Box API |
Identity Endpoints
Endpoint | Use Case | Documentation |
---|---|---|
List enterprise users https://api.box.com/2.0/users | Determine which users (and associated content) need to be indexed. | List enterprise users - Box API |
List groups for enterprise https://api.box.com/2.0/groups | Fetch all groups within a tenant (for permissions). | List groups for enterprise - Box API |
List members of group https://api.box.com/2.0/groups/:group_id/memberships | Determine which users are members of which group (for permissions). | List members of group - Box API |
Content Endpoints
Endpoint | Use Case | Documentation |
---|---|---|
List items in folder https://api.box.com/2.0/folders/:folder_id/items | List all items and content within a folder for indexing. | List items in folder - Box API |
Get file information https://api.box.com/2.0/files/:file_id | Retrieve metadata for each specific item for indexing. | Get file information - Box API |
List file collaborations https://api.box.com/2.0/files/:file_id/collaborations | Retrieve a list of all users with access to an item (for permissions). | List file collaborations - Box API |
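To show how the list-items endpoint could drive a content crawl, here is a minimal sketch of a folder walk. The `fetch_items` callable is a hypothetical stand-in for a call to `/2.0/folders/:folder_id/items`; it is injected so the traversal can be exercised without network access. Skipping `web_link` items mirrors the limitation listed above that Box Web Links are not indexed. This is an illustration of the traversal idea, not Glean's actual crawler.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical fetcher: given a folder ID, return its items as dicts
# that carry at least "type" and "id".
FetchItems = Callable[[str], List[Dict[str, str]]]

INDEXED_TYPES = {"file", "folder"}  # "web_link" items are skipped


def walk_folder(root_id: str, fetch_items: FetchItems) -> Iterator[Dict[str, str]]:
    """Yield every file and folder reachable from `root_id`."""
    stack = [root_id]
    while stack:
        current = stack.pop()
        for item in fetch_items(current):
            if item["type"] not in INDEXED_TYPES:
                continue
            yield item
            if item["type"] == "folder":
                stack.append(item["id"])  # descend into subfolders later


# In-memory fake of a small Box tree, for illustration only.
fake_tree = {
    "0": [{"type": "folder", "id": "1"},
          {"type": "file", "id": "10"},
          {"type": "web_link", "id": "99"}],
    "1": [{"type": "file", "id": "11"}],
}
crawled_ids = [item["id"] for item in walk_folder("0", lambda f: fake_tree.get(f, []))]
```

An explicit stack is used instead of recursion so that deeply nested folder trees do not hit Python's recursion limit.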
Activity Endpoints
Endpoint | Use Case | Documentation |
---|---|---|
List user and enterprise events https://api.box.com/2.0/events | Fetch activity data for each user for ranking signals (12 month limit). | List user and enterprise events - Box API |
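An incremental activity crawl over the events endpoint can be sketched as a loop that follows the stream position until no new events remain. The `fetch_events` callable is a hypothetical stand-in for a request to `/2.0/events`. The sketch assumes each response carries an `entries` list and a `next_stream_position` value, and the stop conditions are illustrative choices rather than Glean's documented behavior.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical fetcher: given a stream position, return one page of events.
FetchEvents = Callable[[str], Dict]


def drain_events(start_position: str, fetch_events: FetchEvents) -> Tuple[List[Dict], str]:
    """Collect events from `start_position` onward.

    Returns the events plus the position to resume from on the next crawl.
    """
    position = start_position
    events: List[Dict] = []
    while True:
        page = fetch_events(position)
        entries = page.get("entries", [])
        if not entries:
            return events, position
        events.extend(entries)
        next_position = str(page["next_stream_position"])
        if next_position == position:  # guard against a stream that does not advance
            return events, position
        position = next_position


# In-memory fake of three pages, for illustration only.
fake_pages = {
    "0": {"entries": [{"event_id": "a"}], "next_stream_position": "5"},
    "5": {"entries": [{"event_id": "b"}], "next_stream_position": "9"},
    "9": {"entries": [], "next_stream_position": "9"},
}
collected, resume_at = drain_events("0", fake_pages.__getitem__)
```

Persisting `resume_at` between runs is what would make the next crawl incremental instead of starting over.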
Content Configuration
Note: If Inclusion (Green-Listing) options are enabled, only content from the Inclusion category will be indexed. If Exclusion (Red-Listing) options are enabled, all content in the exclusion category will be removed. If both rules are applied to the same content, then the content will NOT be indexed, as exclusion rules take priority.
The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply exclusion rules sparingly for sensitive folders.
Exclusion (Red-Listing) Options
Glean provides several options for excluding content from the data crawl, which excludes data from search and chat results.
- Users: Exclude content belonging to specific users from being crawled. How to find the User ID
- Folders: Exclude content belonging to specific folders from being crawled.
Inclusion (Green-Listing) Options
Glean provides several options for including content in the data crawl, which limits search and chat results to the included content.
- Files: Include specific file content
- Folders: Include only content belonging to specific folders in the crawl.
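The precedence described in the note at the top of this section (exclusion rules win over inclusion rules) can be expressed as a small decision function. The `CrawlRules` structure and `should_index` function are hypothetical names, and treating a folder rule as applying to everything nested under that folder is an assumption.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class CrawlRules:
    """Hypothetical container for the inclusion and exclusion options."""
    included_files: Set[str] = field(default_factory=set)
    included_folders: Set[str] = field(default_factory=set)
    excluded_users: Set[str] = field(default_factory=set)
    excluded_folders: Set[str] = field(default_factory=set)


def should_index(file_id: str, ancestor_folder_ids: Iterable[str],
                 owner_id: str, rules: CrawlRules) -> bool:
    """Decide whether a file is indexed; exclusion takes priority."""
    ancestors = set(ancestor_folder_ids)
    # 1. Any matching exclusion rule removes the content.
    if owner_id in rules.excluded_users or rules.excluded_folders & ancestors:
        return False
    # 2. With no inclusion rules configured, everything else is indexed.
    if not (rules.included_files or rules.included_folders):
        return True
    # 3. Otherwise only explicitly included content is indexed.
    return file_id in rules.included_files or bool(rules.included_folders & ancestors)


rules = CrawlRules(included_folders={"F1"}, excluded_folders={"F2"})
```

For example, a file under both `F1` and `F2` is not indexed, because the exclusion check runs before the inclusion check.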