Setting up Glean to use Anthropic Claude models on Google Vertex AI

This article explains how to configure Glean to use Anthropic Claude models on Google Vertex AI, so that LLM usage is billed directly to your Google Vertex AI account via the customer key option.

warning

Do not use this document if you are leveraging the Glean Key option. For the Glean Key option, Glean manages the configuration and provisioning of LLM resources transparently.

Enable access to models in Vertex AI

Go to the Vertex AI Model Garden and make sure you have enabled access to the following foundation models from the GCP project that Glean is running in:

Model name | How Glean uses the model
Claude Sonnet 4.6 (claude-sonnet-4-6-20260217) | Agentic reasoning model used for the assistant and autonomous agents
Claude Sonnet 4.6 (claude-sonnet-4-6-20260217) | Agentic model used for other, more complex tasks in Glean
Claude Sonnet 4.6 (claude-sonnet-4-6-20260217) | Fast agentic model used for simpler tasks such as follow-up question generation

Request additional quota from Vertex AI

You will need to file a standard GCP quota request, which is expressed in Requests Per Minute (RPM) and Tokens Per Minute (TPM). Filter on base_model: for the model names listed in the table above, and on region: for the region that your GCP project is running in.

Please note that the quota is not a guarantee of capacity, but is intended by Google to ensure fair use of the shared capacity, and your requests may not be served during peak periods. To obtain guaranteed capacity, please speak with your Google account team about purchasing Provisioned Throughput.

[Screenshot: Google Cloud console Vertex AI quota settings, filtered by service, base_model (anthropic-claude-sonnet-4-5), and region (us-east5), showing quotas for online prediction input tokens, output tokens, and requests per minute.]

Capacity Requirements

Glean token consumption varies by customer depending on query complexity and document size. To estimate your weekly LLM costs, multiply your expected weekly query volume by the per-query cost based on current Claude API pricing.

To estimate throughput requirements (TPM), identify your deployment's query-per-minute (QPM) rate at the desired percentile (typically p90), then multiply by the average tokens per query.

The table below illustrates example TPM conversions assuming 0.004 QPM per daily active user (DAU), based on historical customer data.

Users | TPM
500 | 125,000
1,000 | 245,000
2,500 | 615,000
5,000 | 1,225,000
10,000 | 2,450,000
20,000 | 4,895,000
note

Glean strongly recommends estimating capacity from your deployment's actual QPM, as QPM per DAU varies widely across customers.
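As a rough sketch, the TPM arithmetic above can be expressed in a few lines of Python. The 0.004 QPM-per-DAU default comes from this article; the ~61,250 average-tokens-per-query figure is an assumption back-derived from the example table (1,000 users → 4 QPM → 245,000 TPM) and should be replaced with your deployment's measured values.

```python
def tokens_per_minute(dau: int,
                      qpm_per_dau: float = 0.004,
                      avg_tokens_per_query: int = 61_250) -> int:
    """Estimate required TPM from daily active users.

    qpm_per_dau and avg_tokens_per_query are illustrative defaults;
    substitute your deployment's measured p90 QPM and average tokens
    per query for real capacity planning.
    """
    qpm = dau * qpm_per_dau          # queries per minute at the assumed rate
    return round(qpm * avg_tokens_per_query)
```

For example, `tokens_per_minute(1000)` reproduces the 245,000 TPM row from the table above (other rows in the table are rounded and may differ slightly).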

Select the model in Glean Workspace

  1. Go to Admin Console → Platform → LLMs.
  2. Click on Add LLM.
  3. Select Vertex AI.
  4. Select Claude Sonnet 4.6 for the agentic model.
  5. Click Validate to confirm that Glean can access the model.
  6. Once validated, click Save.

[Screenshot: the Glean admin interface for selecting LLM models, with Vertex AI chosen as the hosting provider and Claude Sonnet 4.6 selected for the agentic, fast agentic, and agentic reasoning models.]

note
  • To use Claude Sonnet 4.6 with Glean, agentic engine features must be turned on. See details here. Until these features are enabled, Glean will continue to use the agentic and fast agentic models you previously configured; you do not need to change them at this time.
  • We will use Application Default Credentials to call the models, so no additional authentication is required.
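To illustrate what "no additional authentication" means in practice, the sketch below (not Glean's internal code) shows how a Claude publisher model is addressed on Vertex AI: with Application Default Credentials, the bearer token is obtained from the runtime environment rather than from a configured key. The endpoint pattern and anthropic_version value follow Google's published Vertex AI documentation for Anthropic models; the project and region used in the test are placeholders.

```python
def vertex_claude_endpoint(project: str, region: str, model: str) -> str:
    # rawPredict is the non-streaming publisher-model endpoint on Vertex AI;
    # streamRawPredict is its streaming counterpart.
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/"
        f"publishers/anthropic/models/{model}:rawPredict"
    )

def claude_request_body(prompt: str, max_tokens: int = 1024) -> dict:
    # anthropic_version is required in the request body when calling
    # Anthropic models on Vertex AI.
    return {
        "anthropic_version": "vertex-2023-10-16",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
```

A caller would POST the body to the endpoint with an `Authorization: Bearer` header whose token comes from Application Default Credentials (for example, via `google.auth.default()`).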

FAQ

Architecture Diagram

[Diagram: a user's question goes to the Glean Planner, which performs query planning, tool selection, and execution against the Glean Index, Governance Engine, and Knowledge Graph, then generates an answer via Google Vertex AI models and returns it to the user.]