- Exclude non-English pages so only one language is searchable
- Exclude legacy, beta, or internal sections of a site
- Exclude specific hosts, subdomains, or paths
How to add a sitemap redlist regex in Glean
- In the Glean admin dashboard, click the wrench (settings) icon.
- Go to Data sources.
- Find and select your web data source (for example, Example Web).
- Click Setup.
- Scroll down to Advanced settings.
- In Sitemap red list, paste your regex pattern.
- Click Save.
Example regex patterns for language URLs
The exact regex you use depends on how your site encodes languages in the URL. Below are two common patterns and sample redlists for excluding non-English pages.
Example 1: Locale code after /docs/ on www
Some sites include a full locale code (language + region) right after /docs/, for example:
https://www.example.com/docs/en-us/...
https://www.example.com/docs/es-es/...
https://www.example.com/docs/fr-fr/...
To exclude every locale except English (en-us), you could redlist:
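One way to check a candidate redlist for this layout before pasting it into Glean is to run it against the sample URLs locally. The sketch below is illustrative only: both patterns are hypothetical, the `getting-started` path is made up, and it assumes Glean's regex dialect behaves like Python's `re` for these constructs. Negative lookahead in particular is not supported by every regex engine, so a lookahead-free alternative that lists the locales explicitly is included.

```python
import re

# Hypothetical pattern: any "xx-yy" locale directly under /docs/, except en-us.
# Assumes the target regex engine supports negative lookahead.
LOCALE_REDLIST = r"^https://www\.example\.com/docs/(?!en-us/)[a-z]{2}-[a-z]{2}/.*"

# Hypothetical lookahead-free alternative: name each locale to exclude.
LOCALE_REDLIST_EXPLICIT = r"^https://www\.example\.com/docs/(es-es|fr-fr)/.*"


def is_redlisted(url: str, pattern: str) -> bool:
    """Return True if the URL matches the redlist pattern."""
    return re.search(pattern, url) is not None


# Sample URLs; the trailing path segment is invented for illustration.
samples = [
    "https://www.example.com/docs/en-us/getting-started",
    "https://www.example.com/docs/es-es/getting-started",
    "https://www.example.com/docs/fr-fr/getting-started",
]

for url in samples:
    status = "redlisted" if is_redlisted(url, LOCALE_REDLIST) else "kept"
    print(f"{status}: {url}")
```

The explicit form has to be updated whenever a new locale is added to the site, while the lookahead form covers new locales automatically.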
Example 2: Short language code directly under www
Other sites use a short language code right after the domain, for example:
https://www.example.com/es/...
https://www.example.com/fr/...
https://www.example.com/it/...
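A redlist for this layout might name the language codes to exclude explicitly. The following is a minimal sketch, not a confirmed Glean pattern: the regex and the `pricing` and `items/42` paths are hypothetical, and Python's `re` stands in for whatever regex engine Glean uses.

```python
import re

# Hypothetical pattern: redlist URLs whose first path segment is es, fr, or it.
LANGUAGE_REDLIST = re.compile(r"^https://www\.example\.com/(es|fr|it)/.*")

# Sample URLs; the paths after the language code are invented for illustration.
urls = [
    "https://www.example.com/es/pricing",
    "https://www.example.com/fr/pricing",
    "https://www.example.com/it/pricing",
    "https://www.example.com/pricing",
    "https://www.example.com/items/42",
]

redlisted = [u for u in urls if LANGUAGE_REDLIST.search(u)]
kept = [u for u in urls if not LANGUAGE_REDLIST.search(u)]

print("Redlisted:", redlisted)
print("Kept:", kept)
```

Listing the codes is deliberate here. A generic two-letter pattern such as `[a-z]{2}/` would also match any unrelated two-letter path segment, so it is safer to enumerate the languages your site actually publishes.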
In all cases:
- URLs that match the regex are redlisted and will be ignored by Glean once the data source is crawled again.
- URLs that don't match (for example, English paths) remain eligible for indexing.
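To preview this split before saving a pattern, you could run it against your sitemap locally. The sketch below assumes a standard sitemap.org XML file and uses Python's `re` as a stand-in for Glean's regex engine. The `preview_redlist` helper, the sample sitemap, and the pattern are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.org XML files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def preview_redlist(sitemap_xml: str, pattern: str) -> tuple[list[str], list[str]]:
    """Split the sitemap's URLs into (ignored, kept) for a redlist pattern."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    regex = re.compile(pattern)
    ignored = [u for u in urls if regex.search(u)]
    kept = [u for u in urls if not regex.search(u)]
    return ignored, kept


# Invented sitemap content for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/docs/en-us/intro</loc></url>
  <url><loc>https://www.example.com/docs/fr-fr/intro</loc></url>
</urlset>"""

ignored, kept = preview_redlist(SAMPLE_SITEMAP, r"/docs/fr-fr/")
print("Ignored after next crawl:", ignored)
print("Still eligible for indexing:", kept)
```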