The instructions below apply to on-prem GitHub Server instances that the Glean crawler can access. Glean supports deployments on both GCP and AWS. Your on-prem GitHub Server instance must be network-accessible to the Glean crawler running in your cloud. Contact Glean Support for help with any required network configuration.
The GitHub Server connector for Glean allows Glean to fetch and index content from GitHub Server, so that users can search and access the documents they are authorized to see.
Key features
- Glean requires authentication to the GitHub instance in order to fetch relevant information.
- Authentication is done by creating an application in GitHub.
- Glean understands all user access permissions and strictly enforces them at query time, so users never see results they do not have access to.
- All data is stored in the customer’s cloud account, and no data leaves the customer’s environment.
Supported objects
For GitHub Server, Glean captures the following content:
- PR descriptions
- PR conversations/comments
- Issue threads
- Commit messages for main branch
- Wikis
- Code: Code Search is supported in Glean Assistant for repositories connected via GitHub Server. Code Search is enabled by default once connected; the Admin console toggle has been removed. See Code Tool for more details.
- GitHub Pages: You must enter a comma-separated list of repository names to include pages in your search index. Glean currently supports only repositories that use the legacy gh-pages branch-based workflow, and within that branch it indexes only HTML and Markdown files.
Crawl strategy and indexed content
This table outlines the purpose, frequency, and corresponding API endpoints.
| Crawl scope | Purpose | Frequency | API endpoint |
|---|---|---|---|
| Repositories | Discover all repositories in the organization. | Every 4 hours | GET /orgs/{org}/repos |
| Git | Clone repositories and crawl files, commits, and READMEs. | Full: 28 days; Incremental: 10 minutes | |
| PRs | Crawl pull requests (PRs), comments, reviews, diffs, and changed files. | Full: 28 days; Incremental: 10 minutes | GET /repos/{owner}/{repo}/pulls |
| Issues | Crawl issues (non-PR) and comments. | Full: 28 days; Incremental: 10 minutes | GET /repos/{owner}/{repo}/issues |
| Pages | Crawl GitHub Pages content. Only indexes HTML and Markdown files from repositories using the legacy gh-pages branch workflow. | | |
| Identity | Discover users, groups (repositories), and repository collaborators. | Every 10 minutes | GET /orgs/{org}/members (for users); GET /repos/{owner}/{repo}/collaborators (for the repository collaborators crawl) |
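
As an illustration of the endpoints in the table, the following minimal sketch builds the full request URLs for a GitHub Server host. It assumes the standard GitHub Enterprise Server REST root of `/api/v3`; the function names and the example host, organization, and repository are hypothetical, and this is not Glean's crawler code.

```python
from urllib.parse import quote


def api_base(git_domain: str) -> str:
    """Return the REST API root for a GitHub Server host (assumes /api/v3)."""
    host = git_domain.strip().rstrip("/")
    if not host.startswith(("http://", "https://")):
        host = "https://" + host
    return host + "/api/v3"


def crawl_endpoints(git_domain: str, org: str, repo: str) -> dict:
    """Map each crawl scope from the table to its full URL (hypothetical helper)."""
    base = api_base(git_domain)
    org_q = quote(org, safe="")
    repo_q = quote(repo, safe="")
    return {
        "repositories": f"{base}/orgs/{org_q}/repos",
        "pull_requests": f"{base}/repos/{org_q}/{repo_q}/pulls",
        "issues": f"{base}/repos/{org_q}/{repo_q}/issues",
        "members": f"{base}/orgs/{org_q}/members",
        "collaborators": f"{base}/repos/{org_q}/{repo_q}/collaborators",
    }


if __name__ == "__main__":
    for scope, url in crawl_endpoints("github.example.com", "acme", "widgets").items():
        print(f"{scope}: GET {url}")
```

Checking these URLs from the crawler's network (for example with an authenticated request) is one way to confirm that the instance is reachable before setup.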
Webhook events
| Event type | Action | Behavior |
|---|---|---|
| pull_request | opened, edited, closed, reopened, etc. | Create dirty nodes for PR + comments/reviews/diff/files |
| pull_request_review | submitted, edited | Create dirty node for reviews |
| pull_request_review_comment | created, edited, deleted | Create dirty node for review comments or publish deletion doc |
| issue_comment | created, edited, deleted | Create dirty node for issue comments or publish deletion doc |
| issues | opened, edited, closed, reopened, etc. | Create dirty node for issue + comments |
| installation | created, deleted | Trigger admin re-auth |
| member | added, removed, edited | Update user identity (if enabled) |
| membership | added, removed | Update team memberships (if enabled) |
| organization | member_added, member_removed | Update user identity (if enabled) |
| team | created, deleted, edited | Update team metadata (if enabled) |
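
For context on how a webhook receiver typically handles these events, here is a minimal sketch, not Glean's implementation. It assumes GitHub's documented delivery format: the event name arrives in the `X-GitHub-Event` header and the HMAC-SHA256 signature of the raw body, keyed by the webhook secret, arrives in `X-Hub-Signature-256` as `sha256=<hex>`. The grouping of events into `content`, `auth`, and `identity` is a hypothetical summary of the table above.

```python
import hashlib
import hmac


def sign_payload(secret: str, body: bytes) -> str:
    """Compute the signature GitHub would send for this body and secret."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def is_valid_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    if not signature_header:
        return False
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))


# Hypothetical grouping of the event types listed in the table.
EVENT_GROUPS = {
    "pull_request": "content",
    "pull_request_review": "content",
    "pull_request_review_comment": "content",
    "issue_comment": "content",
    "issues": "content",
    "installation": "auth",
    "member": "identity",
    "membership": "identity",
    "organization": "identity",
    "team": "identity",
}


def route_event(event_name: str) -> str:
    """Return the handler group for an event, or 'ignored' if it is not listed."""
    return EVENT_GROUPS.get(event_name, "ignored")


if __name__ == "__main__":
    body = b'{"action": "opened"}'
    header = sign_payload("example-secret", body)
    print(is_valid_signature("example-secret", body, header), route_event("pull_request"))
```

Rejecting deliveries whose signature does not match is what makes the webhook secret configured during setup meaningful.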
Setup instructions
Step 1. Create a GitHub App
This app will be used by Glean to crawl your GitHub instance.
- Go to your GitHub Server.
- Click on your organization.
- Click Settings.
- Click GitHub Apps.
- Click New GitHub App.
- Fill the following fields:
  - Name: Glean
  - Homepage URL: https://app.glean.com
  - Identifying and authorizing users
    - User authorization callback URL: Copy the generated URL from the setup page
    - Request user authorization: unchecked
  - Post installation
    - Leave blank
  - Webhook
    - Webhook Active: checked
    - Webhook URL: Copy the generated URL from the setup page
    - Webhook secret: %1%
    - Copy the webhook secret into the corresponding field in Glean
    - Copy the webhook secret into the corresponding field in the GitHub App
  - Repository permissions
    - Set only the following to read-only:
      - Repository permissions
        - Administration
        - Contents
        - Commit statuses
        - Issues
        - Metadata
        - Pull requests
        - Pages
      - Organization permissions
        - Members
      - User permissions (or Account Permissions)
        - Email addresses
  - Subscribe to events
    - Check only the following:
      - Commit comment
      - Issues
      - Issue comment
      - Member
      - Organization
      - Pull request
      - Pull request review
      - Pull request review comment
      - Push
      - Repository
      - Team
      - Team add
  - Where can this App be installed: Any account
Step 2. Configure the GitHub App
Copy the following values into the corresponding fields in Glean:
- App ID
- Client ID
- Client Secret
Step 3. Install the GitHub App
Click Install App in the menu on the left, then click Install for your organization.
Step 4. Configure additional configs on Admin Console
Enter the following configs in Glean:
- Git Domain
- Organization Name
GitHub authentication system
The system uses two separate flows to manage access: App authentication (for organizational data) and User token refresh (for individual sessions).
GitHub app authentication (Installation token)
This flow manages the application’s core access token (AUTH_ACCESS_TOKEN), which allows Glean to read organizational data via the GitHub App installation.
| Process stage | Action / Purpose | Key artifacts | Expiry logic |
|---|---|---|---|
| Token request | Generates a JSON Web Token (JWT) to request a new access token from the GitHub API. | JWT, installation ID (Cached for 24h) | |
| Active token | The current token used for all API calls. | access token | 1 hour expiry. |
| Pre-fetch | Stores a pre-fetched token 30 minutes before the active token expires. | next access token | |
| Validation | If the active token is less than 5 minutes from expiry, the pending token is immediately accepted. | | |
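
The rotation described in the table can be sketched as a small cache. This is an illustration under stated assumptions, not Glean's implementation: the class and constant names are hypothetical, the `fetch` callback stands in for the JWT-based token request, and the "pending" token is assumed to be the pre-fetched one.

```python
from dataclasses import dataclass
from typing import Callable, Optional

TOKEN_LIFETIME_S = 60 * 60    # active token: 1 hour expiry
PREFETCH_WINDOW_S = 30 * 60   # pre-fetch 30 minutes before expiry
PROMOTE_WINDOW_S = 5 * 60     # accept the pending token under 5 minutes to expiry


@dataclass
class Token:
    value: str
    expires_at: float  # seconds on the same clock as `now`


class InstallationTokenCache:
    """Hypothetical cache that keeps an active token and a pre-fetched one."""

    def __init__(self, fetch: Callable[[float], Token]):
        self._fetch = fetch  # stand-in for the JWT-based token request
        self.active: Optional[Token] = None
        self.pending: Optional[Token] = None

    def get(self, now: float) -> str:
        # No usable active token: fall back to a valid pending one, else fetch.
        if self.active is None or now >= self.active.expires_at:
            if self.pending is not None and now < self.pending.expires_at:
                self.active = self.pending
            else:
                self.active = self._fetch(now)
            self.pending = None
        remaining = self.active.expires_at - now
        # Pre-fetch the next token once inside the 30-minute window.
        if remaining <= PREFETCH_WINDOW_S and self.pending is None:
            self.pending = self._fetch(now)
        # Swap to the pending token when the active one is nearly expired.
        if remaining < PROMOTE_WINDOW_S and self.pending is not None:
            self.active, self.pending = self.pending, None
        return self.active.value


if __name__ == "__main__":
    issued = []

    def fake_fetch(now: float) -> Token:
        issued.append(now)
        return Token(f"token-{len(issued)}", now + TOKEN_LIFETIME_S)

    cache = InstallationTokenCache(fake_fetch)
    for t in (0, 1800, 3301):
        print(t, cache.get(t))
```

Passing `now` explicitly keeps the sketch deterministic; a real caller would supply the current time.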
User token refresh
This flow manages the renewal of individual user sessions via the OAuth token refresh mechanism.
| Process stage | Action / Purpose | Condition / Endpoint |
|---|---|---|
| Discovery | The admin crawl queries the user token store for sessions needing renewal. | Targets tokens expiring within a two-hour buffer (for 8-hour tokens). |
| Renewal | The system posts the refresh token to the GitHub OAuth endpoint to retrieve a new pair. | gitDomain/login/oauth/access_token |
| Update | The User Token Store is updated with the new access and refresh token pair and their updated expiry time. | Performed for each user requiring a refresh. |
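
The discovery and renewal stages can be sketched as follows. This is a minimal illustration, not Glean's code: the token store is modeled as a plain dictionary, the function names are hypothetical, and the request parameters follow GitHub's documented refresh flow (`grant_type=refresh_token`). The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode

REFRESH_BUFFER_S = 2 * 60 * 60  # two-hour buffer for 8-hour tokens


def tokens_needing_refresh(store: dict, now: float) -> list:
    """Return users whose access token expires within the buffer (hypothetical store shape)."""
    return sorted(
        user for user, tok in store.items()
        if tok["expires_at"] - now <= REFRESH_BUFFER_S
    )


def build_refresh_request(git_domain: str, client_id: str,
                          client_secret: str, refresh_token: str):
    """Build the URL and form body for the OAuth refresh POST (not sent here)."""
    url = f"https://{git_domain}/login/oauth/access_token"
    body = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
    })
    return url, body


if __name__ == "__main__":
    now = 1_000.0
    store = {
        "alice": {"expires_at": now + 90 * 60, "refresh_token": "r-alice"},
        "bob": {"expires_at": now + 5 * 60 * 60, "refresh_token": "r-bob"},
    }
    for user in tokens_needing_refresh(store, now):
        url, body = build_refresh_request(
            "github.example.com", "example-client-id", "example-secret",
            store[user]["refresh_token"],
        )
        print(user, "POST", url)
```

After a successful response, the store entry for that user would be replaced with the new access and refresh token pair and their updated expiry time, as described in the Update row.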