The instructions below apply to on-prem GitHub Server instances that the Glean crawler can reach. Glean supports deployments on both GCP and AWS, and your on-prem GitHub Server instance must be network-accessible to the Glean crawler running in your cloud. Contact Glean Support for help with any required network configuration.
The GitHub Server connector allows Glean to fetch and index content from GitHub Server, so that users can search and access the documents they are permitted to view.

Key features

  • Glean requires authentication to the GitHub instance in order to fetch relevant information.
  • Authentication is done by creating an application in GitHub.
  • Glean honors all user access permissions and enforces them at query time, so users never see results they do not have access to.
  • All data is stored in the customer’s cloud account, and no data leaves the customer’s environment.

Supported objects

For GitHub Server, Glean captures the following content:
  • PR descriptions
  • PR conversations/comments
  • Issue threads
  • Commit messages for the main branch
  • Wikis
  • Code: Code Search is supported in Glean Assistant for repositories connected via GitHub Server. Code Search is enabled by default once connected; the Admin console toggle has been removed. See Code Tool for more details.
  • GitHub Pages: You must enter a comma-separated list of repository names to include pages in your search index. Only repositories that use the legacy gh-pages branch-based workflow are currently supported, and within that branch only HTML and Markdown files are indexed.
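As an illustration of the GitHub Pages rules above, the following minimal sketch parses a comma-separated repository list and decides whether a file on the gh-pages branch would qualify for indexing. The function names and the exact set of file extensions are assumptions made for this example; the connector's actual matching rules are not documented here.

```python
from pathlib import PurePosixPath
from typing import List

# Assumption: these suffixes stand in for "HTML and Markdown files".
INDEXABLE_SUFFIXES = {".html", ".htm", ".md", ".markdown"}


def parse_repo_list(raw: str) -> List[str]:
    """Split a comma-separated repository list, trimming whitespace and dropping blanks."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def is_indexable_page(path: str) -> bool:
    """Return True when the file suffix looks like HTML or Markdown (case-insensitive)."""
    return PurePosixPath(path).suffix.lower() in INDEXABLE_SUFFIXES
```

For example, `parse_repo_list("docs, handbook ,,api-site")` yields three repository names, and `is_indexable_page("assets/logo.png")` is false.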

Crawl strategy and indexed content

The table below outlines the purpose, frequency, and API endpoint for each crawl scope.
| Crawl scope | Purpose | Frequency | API endpoint |
|---|---|---|---|
| Repositories | Discover all repositories in the organization. | Every 4 hours | GET /orgs/{org}/repos |
| Git | Clone repositories and crawl files, commits, and READMEs. | Full: 28 days; Incremental: 10 minutes | |
| PRs | Crawl pull requests (PRs), comments, reviews, diffs, and changed files. | Full: 28 days; Incremental: 10 minutes | GET /repos/{owner}/{repo}/pulls |
| Issues | Crawl issues (non-PR) and comments. | Full: 28 days; Incremental: 10 minutes | GET /repos/{owner}/{repo}/issues |
| Pages | Crawl GitHub Pages content. Only indexes HTML and Markdown files from repositories using the legacy gh-pages branch workflow. | | |
| Identity | Discover users, groups (repositories), and repository collaborators. | Every 10 minutes | GET /orgs/{org}/members (users); GET /repos/{owner}/{repo}/collaborators (repository collaborators) |
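To make the crawl table more concrete, here is a minimal sketch that builds the listed endpoint URLs for a GitHub Server host and decides which crawl type is due. It assumes the standard GitHub Enterprise Server REST root of `https://HOSTNAME/api/v3`, treats the organization as the repository owner, and uses hypothetical function names; it does not describe how the Glean crawler is actually implemented.

```python
from datetime import datetime, timedelta
from typing import Dict

# Intervals taken from the crawl table.
FULL_CRAWL_INTERVAL = timedelta(days=28)
INCREMENTAL_INTERVAL = timedelta(minutes=10)


def api_base(git_domain: str) -> str:
    """REST API root for a GitHub Enterprise Server host (bare hostname assumed)."""
    return f"https://{git_domain}/api/v3"


def crawl_endpoints(git_domain: str, org: str, repo: str) -> Dict[str, str]:
    """Build the endpoint URLs from the table, with the org acting as repo owner."""
    base = api_base(git_domain)
    return {
        "repositories": f"{base}/orgs/{org}/repos",
        "pull_requests": f"{base}/repos/{org}/{repo}/pulls",
        "issues": f"{base}/repos/{org}/{repo}/issues",
        "members": f"{base}/orgs/{org}/members",
        "collaborators": f"{base}/repos/{org}/{repo}/collaborators",
    }


def crawl_due(last_full: datetime, last_incremental: datetime, now: datetime) -> str:
    """Return 'full', 'incremental', or 'none' based on elapsed time."""
    if now - last_full >= FULL_CRAWL_INTERVAL:
        return "full"
    if now - last_incremental >= INCREMENTAL_INTERVAL:
        return "incremental"
    return "none"
```

The host, organization, and repository arguments correspond to values you would supply yourself, such as the Git Domain and Organization Name entered during setup.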

Webhook events

| Event type | Action | Behavior |
|---|---|---|
| pull_request | opened, edited, closed, reopened, etc. | Create dirty nodes for PR + comments/reviews/diff/files |
| pull_request_review | submitted, edited | Create dirty node for reviews |
| pull_request_review_comment | created, edited, deleted | Create dirty node for review comments or publish deletion doc |
| issue_comment | created, edited, deleted | Create dirty node for issue comments or publish deletion doc |
| issues | opened, edited, closed, reopened, etc. | Create dirty node for issue + comments |
| installation | created, deleted | Trigger admin re-auth |
| member | added, removed, edited | Update user identity (if enabled) |
| membership | added, removed | Update team memberships (if enabled) |
| organization | member_added, member_removed | Update user identity (if enabled) |
| team | created, deleted, edited | Update team metadata (if enabled) |
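The event table can be read as a routing rule from event type and action to a behavior. The sketch below expresses that rule as a lookup. The handler labels are hypothetical names invented for this example, and the `identity_enabled` flag is an assumed stand-in for the "(if enabled)" condition. GitHub delivers the event type in the `X-GitHub-Event` header and the action in the payload's `action` field.

```python
# Event-to-behavior mapping transcribed from the table; labels are hypothetical.
EVENT_BEHAVIOR = {
    "pull_request": "mark_pr_dirty",
    "pull_request_review": "mark_reviews_dirty",
    "pull_request_review_comment": "mark_review_comments_dirty",
    "issue_comment": "mark_issue_comments_dirty",
    "issues": "mark_issue_dirty",
    "installation": "trigger_admin_reauth",
    "member": "update_user_identity",
    "membership": "update_team_memberships",
    "organization": "update_user_identity",
    "team": "update_team_metadata",
}

# Events whose "deleted" action publishes a deletion doc instead of a dirty node.
DELETION_EVENTS = {"pull_request_review_comment", "issue_comment"}

# Events that only apply when identity syncing is enabled.
IDENTITY_EVENTS = {"member", "membership", "organization", "team"}


def route_event(event_type: str, action: str, identity_enabled: bool = True) -> str:
    """Return the behavior label for a webhook delivery, or 'ignore'."""
    behavior = EVENT_BEHAVIOR.get(event_type)
    if behavior is None:
        return "ignore"  # event type not listed in the table
    if event_type in IDENTITY_EVENTS and not identity_enabled:
        return "ignore"
    if event_type in DELETION_EVENTS and action == "deleted":
        return "publish_deletion_doc"
    return behavior
```

Event types that do not appear in the table are ignored in this sketch, which may differ from the connector's real handling.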

Setup instructions

Step 1. Create a GitHub App

This app will be used by Glean to crawl your GitHub instance.
  1. Go to your GitHub Server.
  2. Click on your organization.
  3. Click Settings.
  4. Click GitHub Apps.
  5. Click New GitHub App.
  6. Fill in the following fields:
    1. Name: Glean
    2. Homepage URL: https://app.glean.com
    3. Identifying and authorizing users
      • User authorization callback URL: Copy the generated URL from the setup page
      • Request user authorization: unchecked
    4. Post installation
      • Leave blank
    5. Webhook
      • Webhook Active: checked
      • Webhook URL: Copy the generated URL from the setup page
      • Webhook secret: %1%
        • Copy the webhook secret into the corresponding field in Glean
        • Copy the webhook secret into the corresponding field in the GitHub App
    6. Permissions
      • Set only the following to read-only:
        • Repository permissions
          • Administration
          • Contents
          • Commit statuses
          • Issues
          • Metadata
          • Pull requests
          • Pages
        • Organization permissions
          • Members
        • User permissions (or Account Permissions)
          • Email addresses
    7. Subscribe to events
      • Check only the following:
        • Commit comment
        • Issues
        • Issue comment
        • Member
        • Organization
        • Pull request
        • Pull request review
        • Pull request review comment
        • Push
        • Repository
        • Team
        • Team add
    8. Where can this App be installed: Any account
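The webhook secret configured in the fields above is what lets a receiver confirm that a delivery really came from your GitHub instance. GitHub signs each payload with HMAC-SHA256 and sends the result in the `X-Hub-Signature-256` header, prefixed with `sha256=`. The following is a minimal sketch of that check with hypothetical function names. It illustrates the general GitHub mechanism and is not a description of Glean's internal code.

```python
import hashlib
import hmac
from typing import Optional


def sign_payload(secret: str, body: bytes) -> str:
    """Compute the signature GitHub would send for this raw request body."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def is_valid_signature(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """Compare the expected and received signatures in constant time."""
    if not signature_header:
        return False
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))
```

The comparison must run over the raw request bytes, before any JSON parsing, because re-serializing the payload can change the byte sequence and invalidate the signature.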

Step 2. Configure the GitHub App

Copy the following values into the corresponding fields in Glean:
  • App ID
  • Client ID
  • Client Secret
At the very bottom of the page, click “Generate a private key”. The key downloads to your local machine. Upload this file into the corresponding field in Glean.
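For context on why the App ID and private key are needed: a GitHub App authenticates by presenting a short-lived JWT signed with its private key. The sketch below builds only the claims and the unsigned header and payload segments using the standard library. The RS256 signature itself requires a cryptography library and is deliberately left out. The function names are hypothetical, and the 60-second backdating and 9-minute lifetime are conservative choices within GitHub's 10-minute limit, not values taken from Glean.

```python
import base64
import json
import time
from typing import Optional


def build_app_jwt_claims(app_id: str, now: Optional[int] = None) -> dict:
    """Build the iat/exp/iss claims a GitHub App JWT carries."""
    issued_at = int(time.time()) if now is None else now
    return {
        "iat": issued_at - 60,      # backdated to tolerate clock drift
        "exp": issued_at + 9 * 60,  # stays under GitHub's 10-minute cap
        "iss": app_id,
    }


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def unsigned_jwt(claims: dict) -> str:
    """Return 'header.payload'; an RS256 signature segment must still be appended."""
    header = {"alg": "RS256", "typ": "JWT"}
    segments = [
        json.dumps(header, separators=(",", ":")),
        json.dumps(claims, separators=(",", ":")),
    ]
    return ".".join(b64url(segment.encode("utf-8")) for segment in segments)
```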

Step 3. Install the GitHub App

Click on Install App from the menu on the left. Click Install for your organization.

Step 4. Configure additional settings in the Admin Console

Enter the following settings in Glean:
  1. Git Domain
  2. Organization Name

GitHub authentication system

The system uses two separate flows to manage access: App authentication (for organizational data) and User token refresh (for individual sessions).

GitHub app authentication (Installation token)

This flow manages the application’s core access token (AUTH_ACCESS_TOKEN), which allows Glean to read organizational data via the GitHub App installation.
| Process stage | Action / Purpose | Key artifacts | Expiry logic |
|---|---|---|---|
| Token request | Generates a JSON Web Token (JWT) to request a new access token from the GitHub API. | JWT, installation ID (cached for 24h) | |
| Active token | The current token used for all API calls. | access token | 1 hour expiry. |
| Pre-fetch | Stores a pre-fetched token 30 minutes before the active token expires. | next access token | |
| Validation | If the active token is less than 5 minutes from expiry, the pending token is immediately accepted. | | |
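The pre-fetch and validation stages amount to a small rotation rule: fetch the next token once the active one has 30 minutes or less remaining, and switch to it when the active one has under 5 minutes remaining. The following is a minimal sketch of that rule under those stated timings. The `TokenState` class and function names are hypothetical and do not reflect Glean's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

PREFETCH_WINDOW = timedelta(minutes=30)  # fetch the next token this far ahead
SWAP_WINDOW = timedelta(minutes=5)       # promote the pending token inside this window


@dataclass
class TokenState:
    active_token: str
    active_expiry: datetime
    next_token: Optional[str] = None
    next_expiry: Optional[datetime] = None


def should_prefetch(state: TokenState, now: datetime) -> bool:
    """True when no pending token exists and the active one is within the pre-fetch window."""
    return state.next_token is None and state.active_expiry - now <= PREFETCH_WINDOW


def maybe_promote(state: TokenState, now: datetime) -> TokenState:
    """Swap in the pending token once the active one is under 5 minutes from expiry."""
    if (
        state.next_token is not None
        and state.next_expiry is not None
        and state.active_expiry - now < SWAP_WINDOW
    ):
        return TokenState(state.next_token, state.next_expiry)
    return state
```

Keeping a pending token ready means that API calls made near the one-hour boundary never have to wait for a fresh token request.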

User token refresh

This flow manages the renewal of individual user sessions via the OAuth token refresh mechanism.
| Process stage | Action / Purpose | Condition / Endpoint |
|---|---|---|
| Discovery | The admin crawl queries the user token store for sessions needing renewal. | Targets tokens expiring within a two-hour buffer (for 8-hour tokens). |
| Renewal | The system posts the refresh token to the GitHub OAuth endpoint to retrieve a new pair. | gitDomain/login/oauth/access_token |
| Update | The User Token Store is updated with the new access and refresh token pair and their updated expiry time. | Performed for each user requiring a refresh. |
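The discovery and renewal stages can be sketched as two small steps: select the users whose tokens expire within the two-hour buffer, then build the refresh request for each. The example below prepares the request without sending it. The form fields follow GitHub's documented refresh-token grant, while the function names and the shape of the token store (a plain mapping of user to expiry time) are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple
from urllib.parse import urlencode

REFRESH_BUFFER = timedelta(hours=2)  # the two-hour buffer from the table


def tokens_needing_refresh(expiries: Dict[str, datetime], now: datetime) -> List[str]:
    """Return users whose access tokens expire within the refresh buffer."""
    return sorted(user for user, expiry in expiries.items() if expiry - now <= REFRESH_BUFFER)


def build_refresh_request(
    git_domain: str, client_id: str, client_secret: str, refresh_token: str
) -> Tuple[str, str]:
    """Return the (url, form-encoded body) for an OAuth refresh-token POST."""
    url = f"https://{git_domain}/login/oauth/access_token"
    body = urlencode(
        {
            "client_id": client_id,
            "client_secret": client_secret,
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
        }
    )
    return url, body
```

With 8-hour tokens, a two-hour buffer means each token becomes eligible for renewal roughly six hours after it is issued.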