- Code search as a shared layer for Glean for Engineering
- How Glean code search works under the hood
- Agentic looping with Glean Assistant and Glean Document Reader
- Security, performance, and eval results
- How Glean code search works with IDE tools through MCP
- How to get started
Code search is a shared layer for Glean for Engineering experiences
In Glean, code search isn’t a stand-alone product. It is a layer in the work AI stack that every product calls when it needs to understand your codebase. It builds on the Enterprise Graph to search across your repositories within seconds, returning relevant files, diffs, code snippets, and references to help answer code related questions. It works alongside document search so the assistant can connect code to design docs, runbooks, tickets, and conversations. Glean code search shows up in several ways:- Glean Assistant: When you ask a code question, our agentic system will invoke code search to pull in relevant code files and explain them, instead of relying on generic web training or isolated snippets.
- Glean Search: You can ask broad natural-language questions or issue precise keyword queries with filters for repository, file path, and extension.
- Actions in Agents: You can explicitly use code search as a tool inside agents to debug, explore implementations, or automate workflows that depend on understanding code.
- Tool in Glean MCP Server: Glean MCP Server exposes code search to third-party hosts like Claude Code or Cursor. That lets those tools converge on correct answers faster by combining their local view of your repo with Glean’s semantic and lexical search over your entire codebase.

How code search works under the hood
We built a dedicated code search tool that our agentic engine can call whenever it needs to understand or navigate code. When Glean Assistant or an agent receives a code-related query, it uses this tool to retrieve the right files and chunks. The system has three components:
- Crawling: docs, permissions, activity, and code
- Indices: semantic and lexical
- Agentic looping: how the assistant drives these tools
1. Crawling: docs, permissions, activity, and code
Glean’s architecture starts with a crawling layer that continuously builds an enterprise graph. For code, we maintain three complementary crawls: docs/code, permissions, and activity.
Docs crawl (code and code-adjacent content)
The docs crawl brings in the code and code-adjacent content that engineers and agents will read and reason over:
- It connects to a wide array of code datasources, including GitHub, GitHub Enterprise, GitLab, Bitbucket, and on-prem mirrors.
- It crawls every file across all your connected repositories, not just a single monolith or the repo you have checked out.
- Each connector runs an initial full crawl to index the existing corpus, then stays fresh via incremental updates, using webhooks or periodic diffs to pick up only changed files instead of recrawling entire repos on every commit.
- For GitHub and other major code datasources, we track a lifecycle freshness metric that captures how up-to-date the connector is (for example, how quickly new commits and branches show up in search). This makes staleness visible and lets us monitor and tune connector performance over time.
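For GitHub-style connectors, the incremental path can be pictured as a small webhook handler. Below is a minimal sketch, assuming GitHub's push-event payload and two hypothetical helpers (`reindex_file`, `delete_from_index`); real connectors also handle branches, renames, rate limits, and periodic backfill:

```python
# Minimal sketch of webhook-driven incremental indexing.
# `reindex_file` / `delete_from_index` are hypothetical stand-ins
# for the real connector internals.

def reindex_file(repo: str, path: str) -> None:
    print(f"reindex {repo}/{path}")   # fetch, tokenize, and re-embed one file

def delete_from_index(repo: str, path: str) -> None:
    print(f"delete {repo}/{path}")    # drop stale entries from both indices

def handle_push_event(event: dict) -> None:
    """Touch only the files a push changed, never the whole repo."""
    repo = event["repository"]["full_name"]
    for commit in event["commits"]:
        for path in commit["added"] + commit["modified"]:
            reindex_file(repo, path)
        for path in commit["removed"]:
            delete_from_index(repo, path)
```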
Permissions crawl
The permissions crawl ensures that ingested code carries the right access controls:
- It mirrors source-of-truth ACLs from all your code datasources (GitHub, GitLab, etc.), including users, groups, teams, and repository/project-level permissions.
- For each document or code file, we attach the same ACLs your source system enforces: which org/project/repo it belongs to, which teams, groups, or users can access it, and any additional policies such as confidential directories or archived repos.
- These ACLs are synchronized continuously and pushed all the way through to query time. When an engineer searches, the index can only ever return files they could already see in the source system.
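In simplified form, query-time enforcement is a set intersection between the searcher's principals and the ACLs mirrored onto each document. A toy model (illustrative types only; the real ACL model also covers nested groups and org-level policies):

```python
from dataclasses import dataclass

@dataclass
class CodeDoc:
    path: str
    allowed: set[str]  # principals mirrored from the source system

def visible(doc: CodeDoc, principals: set[str]) -> bool:
    # Return a document only if the searcher holds at least one
    # principal (user, team, group) granted by the source system.
    return bool(doc.allowed & principals)

docs = [
    CodeDoc("payments/ledger.go", {"team:payments", "group:eng-all"}),
    CodeDoc("secrets/rotation.py", {"team:security"}),
]
me = {"user:ada", "team:payments", "group:eng-all"}
print([d.path for d in docs if visible(d, me)])  # ['payments/ledger.go']
```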
Activity crawl
The activity crawl captures how people actually work with content and code:
- It records views and visits, so we can tell which docs and code files are commonly read in practice.
- It tracks edits and authorship, including which files are frequently modified together in PRs or changesets, giving us co-edit signals for related code and docs.
- It ingests references and links across systems, for example code files referenced in design docs, tickets, or runbooks, to build linkage between code, docs, and incidents.
These signals improve retrieval and ranking in several ways:
- Preferring examples that real teams actually use.
- Surfacing docs that engineers visit when working in a given code area.
- Following references between code and documentation when answering multi-hop questions.
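As one concrete example, co-edit affinity can be derived from changeset membership. A toy sketch (hypothetical input shape; in production these counts are aggregated across the activity graph):

```python
from collections import Counter
from itertools import combinations

def co_edit_counts(changesets: list[list[str]]) -> Counter:
    """Count how often pairs of files ship in the same PR or changeset;
    high counts suggest related code for ranking and retrieval."""
    pairs: Counter = Counter()
    for files in changesets:
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs

prs = [
    ["api/handler.go", "api/handler_test.go", "docs/api.md"],
    ["api/handler.go", "api/handler_test.go"],
]
print(co_edit_counts(prs).most_common(1))
# [(('api/handler.go', 'api/handler_test.go'), 2)]
```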
2. Indices: semantic and lexical
On top of this crawl, we build two complementary indices for code:
- A lexical index over tokens in code, comments, and paths.
- A semantic index that understands the meaning of code.
Lexical index (OpenSearch)
Our lexical code index is similar to a traditional inverted index:
- We treat each code file as a document.
- We break each document into terms (tokens), normalize and de-noise them, and build an inverted index mapping tokens to the most relevant documents.
- For code, we tune tokenization for camelCase, snake_case, and other common coding conventions, so RuntimeConfigError and runtime_config_error are both searchable in natural ways (sketched below).
- The index supports faceting, so agents can filter by file extension, repo, or path pattern.
This makes the lexical index the right tool when you know exactly what you’re looking for:
- Filenames, symbols, and test names.
- Error messages and stack traces.
- Tight filters like “just this repo” or “only .go files”.
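Here is the tokenization sketch referenced above: a minimal regex-based splitter (the production analyzer covers more conventions and languages):

```python
import re

# Split camelCase, PascalCase, snake_case, kebab-case, and digit runs
# into lowercase tokens for the inverted index (illustrative only).
def tokenize_identifier(identifier: str) -> list[str]:
    tokens: list[str] = []
    for part in identifier.replace("-", "_").split("_"):
        tokens.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
        )
    return [t.lower() for t in tokens if t]

# Both spellings normalize to the same token stream:
assert tokenize_identifier("RuntimeConfigError") == ["runtime", "config", "error"]
assert tokenize_identifier("runtime_config_error") == ["runtime", "config", "error"]
```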
Semantic index
The semantic index is a vector database that stores high-dimensional embeddings representing the meaning of code. Modern codebases are full of near-duplicates in intent that look different at the text level:
- Different teams wrap the same logic with different helpers.
- The same idea is implemented in different languages.
- Engineers rename variables or refactor functions while keeping behavior roughly the same.
To handle this, the semantic index is built in three steps:
- Chunking
  - We split each code file into smaller, meaningful chunks (e.g., imports in one chunk, related functions or classes in others).
  - We use AST-based chunking, so we split along semantic boundaries instead of arbitrary line counts. That avoids splitting functions in half and yields chunks that are easier to match to questions.
- Embedding
  - We embed each chunk into a numeric vector using a model trained on both code and natural language, so queries and code live in the same space.
  - Alongside the raw code text, we store metadata like file path, repo name, and language so the agent can reason with that context later.
- ANN indexing
  - We build an approximate nearest-neighbor (ANN) index over these vectors.
  - At query time, we embed the user’s question and look up the closest code chunks in this vector space.
  - A single ANN lookup returns the top-N most relevant code chunks to feed into downstream ranking and LLM reasoning.
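Putting the three steps together, here is a self-contained sketch for Python sources: the standard-library `ast` module stands in for AST-based chunking, a deterministic hash-based vector stands in for the embedding model, and FAISS's HNSW index provides the ANN lookup (all illustrative choices; Glean's production models and serving stack differ):

```python
import ast
import hashlib
import numpy as np
import faiss

DIM = 128  # embedding dimensionality (illustrative)

def chunk_python_file(source: str) -> list[str]:
    """AST-based chunking: split along top-level definitions
    instead of arbitrary line counts."""
    lines = source.splitlines()
    return [
        "\n".join(lines[node.lineno - 1 : node.end_lineno])
        for node in ast.parse(source).body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

def embed(text: str) -> np.ndarray:
    # Stand-in for a model trained on code and natural language:
    # a deterministic pseudo-random unit vector keyed by the text.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(DIM).astype("float32")
    return v / np.linalg.norm(v)

source = '''
class RuntimeConfigError(Exception):
    pass

def load_runtime_config(path):
    raise RuntimeConfigError(path)
'''

chunks = chunk_python_file(source)

# Build the approximate nearest-neighbor index over chunk embeddings.
index = faiss.IndexHNSWFlat(DIM, 32)
index.add(np.stack([embed(c) for c in chunks]))

# Query time: embed the question and fetch the closest chunks.
query = embed("where is the runtime config error raised?")
distances, ids = index.search(query.reshape(1, -1), 2)
print([chunks[i].splitlines()[0] for i in ids[0]])
```

(With the hash-based stand-in embeddings the ranking is meaningless; the sketch only demonstrates the mechanics of chunk, embed, index, and search.)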
Combining semantic and lexical at query time
Early experiments with semantic-only retrieval showed strong intent-level performance, but struggled when exact strings or filenames mattered. Developers want both:
- “Find the right concept”
- “Let me jump to the exact file or symbol I care about”
So at query time, we blend the two:
- Semantic scores say “this file is about what you’re asking.”
- Lexical scores say “these files match the exact names, paths, or error strings you typed.”
- Ranking adds popularity, recency, and affinity signals to prefer the code people actually use.
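One common way to fuse two ranked lists is reciprocal rank fusion (RRF). This is an illustrative baseline rather than Glean's production ranker, which also folds in the popularity, recency, and affinity signals above:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # A document scores 1 / (k + rank) in each list it appears in;
    # agreement between semantic and lexical rankings floats it up.
    scores: defaultdict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["auth/token.py", "auth/session.py", "web/login.py"]
lexical  = ["web/login.py", "auth/token.py", "tests/test_login.py"]
print(reciprocal_rank_fusion([semantic, lexical]))
# ['auth/token.py', 'web/login.py', 'auth/session.py', 'tests/test_login.py']
```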
3. Agentic looping in Glean Assistant with code search and Doc Reader
Indices alone aren’t enough. Real engineering questions often require multiple hops:
- “Where is this implemented, and what do I need to change to support multi-tenant configs?”
- “What changed recently that could explain this new runtime error?”
To answer them, the assistant runs an agentic loop (sketched below). It can:
- Run several code search queries to go broad across services
- Narrow to a promising repo and path
- Use Glean Document Reader (GDR) to load a specific file once it has found the right entry point
- Iterate as it learns more from the retrieved code
GDR complements chunk-level retrieval by letting the assistant:
- Load the full file once a promising result has been found.
- Preserve structure like imports, class and function boundaries, and comments that might be split across chunks.
- Let the model reason about how different parts of the file relate, for example, helper functions, shared constants, or multiple code paths in the same module.
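In pseudocode, the loop looks roughly like this, with hypothetical stubs (`code_search`, `read_document`) standing in for the real tool interfaces:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    path: str
    snippet: str

def code_search(query: str) -> list[Hit]:
    # Stub: in production this hits the lexical + semantic indices.
    return [Hit("configs/tenants.py", "def resolve_tenant_config(...): ...")]

def read_document(path: str) -> str:
    # Stub for Glean Document Reader: returns the whole file.
    return f"<full contents of {path}, structure intact>"

def answer_code_question(question: str, max_hops: int = 4) -> list[str]:
    """Go broad with search, narrow, then read whole files."""
    context: list[str] = []
    query = question
    for _ in range(max_hops):
        hits = code_search(query)              # broad chunk-level retrieval
        context += [h.snippet for h in hits]
        if hits:
            # Promising entry point found: load the full file so imports,
            # boundaries, and comments that chunking may split are visible.
            context.append(read_document(hits[0].path))
            break
        query += " repo:services"              # otherwise narrow and retry
    return context

print(answer_code_question("how do we support multi-tenant configs?"))
```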
Security-first design
For enterprises, source code is among their most valuable intellectual property, and it comes with significant security risk. Internally, incorrect access controls can allow employees to view code they should not be able to see. Externally, sending code to LLMs can expose it to unwanted storage or even model training on proprietary assets. Together, these issues create a wide surface area of security vulnerabilities whenever organizations work with their code. Glean is designed with security as a first principle:
- Your code stays within your Glean VPC and reflects the permissions of the source system.
- We mirror ACLs from your code hosts and enforce them at query time, so code search can only return files that a user is allowed to see.
- We have contractual agreements with LLM providers to ensure zero-day data retention and prevent models from training on enterprise data.
Performance and scale
Glean code search sits in front of millions of code files and still stays fast enough to live in the developer loop. In a typical large customer deployment, we:
- Index 1k+ repositories
- Make 100M+ code files searchable
- Maintain a semantic index over 10M+ code chunks
- Keep latency low (50 ms at P95)
In internal evaluations, the dedicated code search tool delivered:
- A 3x reduction in tool calls
- ~27% fewer output tokens
- A 1.43 win-loss ratio
How to get started
Glean Code Search currently supports:
- GitHub Cloud, Enterprise, and On-Prem
- GitLab
- Bitbucket