Alerts reference

This document contains a complete reference of all alerts in Sourcegraph's monitoring, and next steps for when you find alerts that are firing. If your alert isn't mentioned here, or if the next steps don't help, contact us for assistance.

To learn more about Sourcegraph's alerting and how to set up alerts, see our alerting guide.

frontend: 99th_percentile_search_request_duration

99th percentile successful search request duration over 5m

Descriptions

warning frontend: 20s+ 99th percentile successful search request duration over 5m

Next steps

Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 20, in the site configuration and looking for frontend warning logs prefixed with slow search request.
Check that most repositories are indexed by visiting https://sourcegraph.example.com/admin/repositories?filter=needs-index (it should show few or no results).
Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it regularly hits max CPU utilization.
Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: for the zoekt-webserver container in docker-compose.yml if it regularly hits max CPU utilization.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_99th_percentile_search_request_duration"
]
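
The identifiers listed under observability.silenceAlerts in this reference appear to be the severity, the service, and the alert name joined by underscores. The following is a minimal Python sketch under that assumption; the helper names silence_id and silence_config are hypothetical and not part of Sourcegraph.

```python
import json


def silence_id(severity: str, service: str, alert_name: str) -> str:
    # Assumed pattern, inferred from the identifiers in this reference:
    # <severity>_<service>_<alert name>
    return f"{severity}_{service}_{alert_name}"


def silence_config(ids: list[str]) -> str:
    # Render the setting as a JSON object; in practice the key is merged
    # into the existing site configuration rather than replacing it.
    return json.dumps({"observability.silenceAlerts": ids}, indent=2)


alert_id = silence_id("warning", "frontend", "99th_percentile_search_request_duration")
print(silence_config([alert_id]))
```

Alerts with both a warning and a critical level would need one identifier per level.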

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((histogram_quantile(0.99, sum by (le) (rate(src_search_streaming_latency_seconds_bucket{source="browser"}[5m])))) >= 20)
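
To make the threshold in this query more concrete, here is a simplified Python sketch of estimating a quantile from cumulative histogram buckets by linear interpolation, which is the general idea behind histogram_quantile. It is an approximation for illustration, not Prometheus's implementation, and the bucket boundaries and counts are made-up values rather than real src_search_streaming_latency_seconds data.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with the last upper bound being +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the
                # highest finite upper bound instead.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in seconds: (upper bound, cumulative count).
buckets = [(5.0, 900.0), (10.0, 950.0), (20.0, 980.0), (30.0, 995.0), (math.inf, 1000.0)]
p99 = estimate_quantile(0.99, buckets)
print(p99 >= 20)  # True for these made-up buckets, so the condition would hold
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the buckets are spaced around the threshold.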

frontend: 90th_percentile_search_request_duration

90th percentile successful search request duration over 5m

Descriptions

warning frontend: 15s+ 90th percentile successful search request duration over 5m

Next steps

Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 15, in the site configuration and looking for frontend warning logs prefixed with slow search request.
Check that most repositories are indexed by visiting https://sourcegraph.example.com/admin/repositories?filter=needs-index (it should show few or no results).
Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it regularly hits max CPU utilization.
Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: for the zoekt-webserver container in docker-compose.yml if it regularly hits max CPU utilization.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_90th_percentile_search_request_duration"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((histogram_quantile(0.9, sum by (le) (rate(src_search_streaming_latency_seconds_bucket{source="browser"}[5m])))) >= 15)

frontend: timeout_search_responses

timeout search responses every 5m

Descriptions

warning frontend: 2%+ timeout search responses every 5m for 15m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_timeout_search_responses"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum(increase(src_search_streaming_response{source="browser",status=~"timeout|partial_timeout"}[5m])) / sum(increase(src_search_streaming_response{source="browser"}[5m])) * 100) >= 2)
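
The query above divides the responses with a timeout status by all responses and scales the result to a percentage. A rough Python sketch of that arithmetic follows; the status counts are invented for illustration, the function name is hypothetical, and the sketch ignores the requirement that the condition hold for 15m0s.

```python
def percent_with_status(increase_by_status, statuses):
    # increase_by_status: responses counted per status over the window.
    total = sum(increase_by_status.values())
    if total == 0:
        return None  # no traffic in the window, nothing to evaluate
    matched = sum(n for status, n in increase_by_status.items() if status in statuses)
    return matched * 100 / total


# Hypothetical 5-minute counts per response status.
counts = {"success": 950, "timeout": 15, "partial_timeout": 10, "error": 25}
timeout_pct = percent_with_status(counts, {"timeout", "partial_timeout"})
print(timeout_pct >= 2)  # True: 25 of 1000 responses timed out
```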

frontend: hard_error_search_responses

hard error search responses every 5m

Descriptions

warning frontend: 2%+ hard error search responses every 5m for 15m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_hard_error_search_responses"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum(increase(src_search_streaming_response{source="browser",status="error"}[5m])) / sum(increase(src_search_streaming_response{source="browser"}[5m])) * 100) >= 2)

frontend: search_no_results

searches with no results every 5m

Descriptions

warning frontend: 5%+ searches with no results every 5m for 15m0s

Next steps

A sudden increase in this metric could indicate a problem with search indexing, or a shift in search behavior that is causing fewer users to find the results they're looking for.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_search_no_results"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum(increase(src_search_streaming_response{source="browser",status="no_results"}[5m])) / sum(increase(src_search_streaming_response{source="browser"}[5m])) * 100) >= 5)

frontend: search_alert_user_suggestions

search alert user suggestions shown every 5m

Descriptions

warning frontend: 5%+ search alert user suggestions shown every 5m for 15m0s

Next steps

This indicates your users are making syntax errors or similar user errors.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_search_alert_user_suggestions"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum by (alert_type) (increase(src_search_streaming_response{alert_type!~"timed_out",source="browser",status="alert"}[5m])) / ignoring (alert_type) group_left () sum(increase(src_search_streaming_response{source="browser"}[5m])) * 100) >= 5)

frontend: page_load_latency

90th percentile page load latency over all routes over 10m

Descriptions

warning frontend: 2s+ 90th percentile page load latency over all routes over 10m

Next steps

Confirm that the Sourcegraph frontend has enough CPU/memory using the provisioning panels.
Investigate potential sources of latency by selecting Explore and modifying the sum by(le) section to include additional labels: for example, sum by(le, job) or sum by (le, instance).
Trace a request to see what the slowest part is: https://sourcegraph.com/docs/admin/observability/tracing
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_page_load_latency"
]

Managed by the Sourcegraph Services team.

Technical details

Generated query for warning alert: max((histogram_quantile(0.9, sum by (le) (rate(src_http_request_duration_seconds_bucket{route!="blob",route!="raw",route!~"graphql.*"}[10m])))) >= 2)

frontend: 99th_percentile_search_codeintel_request_duration

99th percentile code-intel successful search request duration over 5m

Descriptions

warning frontend: 20s+ 99th percentile code-intel successful search request duration over 5m

Next steps

Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 20, in the site configuration and looking for frontend warning logs prefixed with slow search request.
Check that most repositories are indexed by visiting https://sourcegraph.example.com/admin/repositories?filter=needs-index (it should show few or no results).
Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it regularly hits max CPU utilization.
Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: for the zoekt-webserver container in docker-compose.yml if it regularly hits max CPU utilization.
This alert may indicate that your instance is struggling to process symbols queries on a monorepo; learn more here.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_99th_percentile_search_codeintel_request_duration"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((histogram_quantile(0.99, sum by (le) (rate(src_graphql_field_seconds_bucket{error="false",field="results",request_name="CodeIntelSearch",source="browser",type="Search"}[5m])))) >= 20)

frontend: 90th_percentile_search_codeintel_request_duration

90th percentile code-intel successful search request duration over 5m

Descriptions

warning frontend: 15s+ 90th percentile code-intel successful search request duration over 5m

Next steps

Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 15, in the site configuration and looking for frontend warning logs prefixed with slow search request.
Check that most repositories are indexed by visiting https://sourcegraph.example.com/admin/repositories?filter=needs-index (it should show few or no results).
Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it regularly hits max CPU utilization.
Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: for the zoekt-webserver container in docker-compose.yml if it regularly hits max CPU utilization.
This alert may indicate that your instance is struggling to process symbols queries on a monorepo; learn more here.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_90th_percentile_search_codeintel_request_duration"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((histogram_quantile(0.9, sum by (le) (rate(src_graphql_field_seconds_bucket{error="false",field="results",request_name="CodeIntelSearch",source="browser",type="Search"}[5m])))) >= 15)

frontend: hard_timeout_search_codeintel_responses

hard timeout search code-intel responses every 5m

Descriptions

warning frontend: 2%+ hard timeout search code-intel responses every 5m for 15m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_hard_timeout_search_codeintel_responses"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max(((sum(increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser",status="timeout"}[5m])) + sum(increase(src_graphql_search_response{alert_type="timed_out",request_name="CodeIntelSearch",source="browser",status="alert"}[5m]))) / sum(increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser"}[5m])) * 100) >= 2)

frontend: hard_error_search_codeintel_responses

hard error search code-intel responses every 5m

Descriptions

warning frontend: 2%+ hard error search code-intel responses every 5m for 15m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_hard_error_search_codeintel_responses"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum by (status) (increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser",status=~"error"}[5m])) / ignoring (status) group_left () sum(increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser"}[5m])) * 100) >= 2)

frontend: partial_timeout_search_codeintel_responses

partial timeout search code-intel responses every 5m

Descriptions

warning frontend: 5%+ partial timeout search code-intel responses every 5m for 15m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_partial_timeout_search_codeintel_responses"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum by (status) (increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser",status="partial_timeout"}[5m])) / ignoring (status) group_left () sum(increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser",status="partial_timeout"}[5m])) * 100) >= 5)

frontend: search_codeintel_alert_user_suggestions

search code-intel alert user suggestions shown every 5m

Descriptions

warning frontend: 5%+ search code-intel alert user suggestions shown every 5m for 15m0s

Next steps

This indicates a bug in Sourcegraph, please open an issue.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_search_codeintel_alert_user_suggestions"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum by (alert_type) (increase(src_graphql_search_response{alert_type!~"timed_out",request_name="CodeIntelSearch",source="browser",status="alert"}[5m])) / ignoring (alert_type) group_left () sum(increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser"}[5m])) * 100) >= 5)

frontend: 99th_percentile_search_api_request_duration

99th percentile successful search API request duration over 5m

Descriptions

warning frontend: 50s+ 99th percentile successful search API request duration over 5m

Next steps

Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 20, in the site configuration and looking for frontend warning logs prefixed with slow search request.
Check that most repositories are indexed by visiting https://sourcegraph.example.com/admin/repositories?filter=needs-index (it should show few or no results).
Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it regularly hits max CPU utilization.
Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: for the zoekt-webserver container in docker-compose.yml if it regularly hits max CPU utilization.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_99th_percentile_search_api_request_duration"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((histogram_quantile(0.99, sum by (le) (rate(src_graphql_field_seconds_bucket{error="false",field="results",source="other",type="Search"}[5m])))) >= 50)

frontend: 90th_percentile_search_api_request_duration

90th percentile successful search API request duration over 5m

Descriptions

warning frontend: 40s+ 90th percentile successful search API request duration over 5m

Next steps

Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 15, in the site configuration and looking for frontend warning logs prefixed with slow search request.
Check that most repositories are indexed by visiting https://sourcegraph.example.com/admin/repositories?filter=needs-index (it should show few or no results).
Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it regularly hits max CPU utilization.
Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: for the zoekt-webserver container in docker-compose.yml if it regularly hits max CPU utilization.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_90th_percentile_search_api_request_duration"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((histogram_quantile(0.9, sum by (le) (rate(src_graphql_field_seconds_bucket{error="false",field="results",source="other",type="Search"}[5m])))) >= 40)

frontend: hard_error_search_api_responses

hard error search API responses every 5m

Descriptions

warning frontend: 2%+ hard error search API responses every 5m for 15m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_hard_error_search_api_responses"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum by (status) (increase(src_graphql_search_response{source="other",status=~"error"}[5m])) / ignoring (status) group_left () sum(increase(src_graphql_search_response{source="other"}[5m]))) >= 2)

frontend: partial_timeout_search_api_responses

partial timeout search API responses every 5m

Descriptions

warning frontend: 5%+ partial timeout search API responses every 5m for 15m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_partial_timeout_search_api_responses"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum(increase(src_graphql_search_response{source="other",status="partial_timeout"}[5m])) / sum(increase(src_graphql_search_response{source="other"}[5m]))) >= 5)

frontend: search_api_alert_user_suggestions

search API alert user suggestions shown every 5m

Descriptions

warning frontend: 5%+ search API alert user suggestions shown every 5m

Next steps

This indicates your users' search API requests have syntax errors or a similar user error. Check the responses the API sends back for an explanation.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_search_api_alert_user_suggestions"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum by (alert_type) (increase(src_graphql_search_response{alert_type!~"timed_out",source="other",status="alert"}[5m])) / ignoring (alert_type) group_left () sum(increase(src_graphql_search_response{source="other",status="alert"}[5m]))) >= 5)

frontend: frontend_site_configuration_duration_since_last_successful_update_by_instance

maximum duration since last successful site configuration update (all "frontend" instances)

Descriptions

critical frontend: 300s+ maximum duration since last successful site configuration update (all "frontend" instances)

Next steps

This indicates that one or more "frontend" instances have not successfully updated the site configuration in over 5 minutes. This could be due to networking issues between services or problems with the site configuration service itself.
Check for relevant errors in the "frontend" logs.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_frontend_frontend_site_configuration_duration_since_last_successful_update_by_instance"
]

Managed by the Sourcegraph Platform team.

Technical details

Generated query for critical alert: max((max(max_over_time(src_conf_client_time_since_last_successful_update_seconds{job=~"(sourcegraph-)?frontend"}[1m]))) >= 300)

frontend: internal_indexed_search_error_responses

internal indexed search error responses every 5m

Descriptions

warning frontend: 5%+ internal indexed search error responses every 5m for 15m0s

Next steps

Check the Zoekt Web Server dashboard for indications it might be unhealthy.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_internal_indexed_search_error_responses"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum by (code) (increase(src_zoekt_request_duration_seconds_count{code!~"2.."}[5m])) / ignoring (code) group_left () sum(increase(src_zoekt_request_duration_seconds_count[5m])) * 100) >= 5)
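
The query above keeps only response codes that do not match the regular expression 2.., computes each remaining code's share of all requests, and compares the largest share with the threshold. The Python sketch below mirrors that logic under two assumptions: that the code label holds HTTP-style status codes, and that label regular expressions match the whole value, as they do in PromQL. The counts and the function name are hypothetical.

```python
import re


def error_percent_by_code(increase_by_code):
    # Share of all requests contributed by each non-2xx code.
    total = sum(increase_by_code.values())
    if total == 0:
        return {}
    return {
        code: n * 100 / total
        for code, n in increase_by_code.items()
        # fullmatch mirrors the anchored matching of label regexes.
        if not re.fullmatch(r"2..", code)
    }


# Hypothetical 5-minute request counts per response code.
shares = error_percent_by_code({"200": 900, "500": 60, "404": 40})
worst = max(shares.values(), default=0.0)
print(worst >= 5)  # True: code 500 alone accounts for 6% of requests
```

Note that the comparison applies to each code separately, so several error codes that each stay under the threshold would not satisfy the condition even if their combined share exceeds it.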

frontend: internal_unindexed_search_error_responses

internal unindexed search error responses every 5m

Descriptions

warning frontend: 5%+ internal unindexed search error responses every 5m for 15m0s

Next steps

Check the Searcher dashboard for indications it might be unhealthy.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_internal_unindexed_search_error_responses"
]

Managed by the Sourcegraph Code Understanding team.

Technical details

Generated query for warning alert: max((sum by (code) (increase(searcher_service_request_total{code!~"2.."}[5m])) / ignoring (code) group_left () sum(increase(searcher_service_request_total[5m])) * 100) >= 5)

frontend: 99th_percentile_gitserver_duration

99th percentile successful gitserver query duration over 5m

Descriptions

warning frontend: 20s+ 99th percentile successful gitserver query duration over 5m

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_99th_percentile_gitserver_duration"
]

Managed by the Sourcegraph Services team.

Technical details

Generated query for warning alert: max((histogram_quantile(0.99, sum by (le, category) (rate(src_gitserver_request_duration_seconds_bucket{job=~"(sourcegraph-)?frontend"}[5m])))) >= 20)

frontend: gitserver_error_responses

gitserver error responses every 5m

Descriptions

warning frontend: 5%+ gitserver error responses every 5m for 15m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_gitserver_error_responses"
]

Managed by the Sourcegraph Services team.

Technical details

Generated query for warning alert: max((sum by (category) (increase(src_gitserver_request_duration_seconds_count{code!~"2..",job=~"(sourcegraph-)?frontend"}[5m])) / ignoring (code) group_left () sum by (category) (increase(src_gitserver_request_duration_seconds_count{job=~"(sourcegraph-)?frontend"}[5m])) * 100) >= 5)

frontend: observability_test_alert_warning

warning test alert metric

Descriptions

warning frontend: 1+ warning test alert metric

Next steps

This alert is triggered via the triggerObservabilityTestAlert GraphQL endpoint, and will automatically resolve itself.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_observability_test_alert_warning"
]

Managed by the Sourcegraph Platform team.

Technical details

Generated query for warning alert: max((max by (owner) (observability_test_metric_warning)) >= 1)

frontend: observability_test_alert_critical

critical test alert metric

Descriptions

critical frontend: 1+ critical test alert metric

Next steps

This alert is triggered via the triggerObservabilityTestAlert GraphQL endpoint, and will automatically resolve itself.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_frontend_observability_test_alert_critical"
]

Managed by the Sourcegraph Platform team.

Technical details

Generated query for critical alert: max((max by (owner) (observability_test_metric_critical)) >= 1)

frontend: cloudkms_cryptographic_requests

cryptographic requests to Cloud KMS every 1m

Descriptions

warning frontend: 15000+ cryptographic requests to Cloud KMS every 1m for 5m0s
critical frontend: 30000+ cryptographic requests to Cloud KMS every 1m for 5m0s

Next steps

Revert recent commits that cause extensive listing from "external_services" and/or "user_external_accounts" tables.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_cloudkms_cryptographic_requests",
  "critical_frontend_cloudkms_cryptographic_requests"
]

Managed by the Sourcegraph Services team.

Technical details

Generated query for warning alert: max((sum(increase(src_cloudkms_cryptographic_total[1m]))) >= 15000)

Generated query for critical alert: max((sum(increase(src_cloudkms_cryptographic_total[1m]))) >= 30000)

frontend: goroutine_error_rate

error rate for periodic goroutine executions

Descriptions

warning frontend: 0.01reqps+ error rate for periodic goroutine executions for 15m0s

Next steps

Check service logs for error details related to the failing periodic routine
Check if the routine depends on external services that may be unavailable
Look for recent changes to the routine's code or configuration
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_goroutine_error_rate"
]

Managed by the Sourcegraph Services team.

Technical details

Generated query for warning alert: max((sum by (name, job_name) (rate(src_periodic_goroutine_errors_total{job=~".*frontend.*"}[5m]))) >= 0.01)

frontend: goroutine_error_percentage

percentage of periodic goroutine executions that result in errors

Descriptions

warning frontend: 5%+ percentage of periodic goroutine executions that result in errors

Next steps

Check service logs for error details related to the failing periodic routine
Check if the routine depends on external services that may be unavailable
Consider temporarily disabling the routine if it's non-critical and causing cascading issues
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_goroutine_error_percentage"
]

Managed by the Sourcegraph Services team.

Technical details

Generated query for warning alert: max((sum by (name, job_name) (rate(src_periodic_goroutine_errors_total{job=~".*frontend.*"}[5m])) / sum by (name, job_name) (rate(src_periodic_goroutine_total{job=~".*frontend.*"}[5m]) > 0) * 100) >= 5)
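
In the query above, the > 0 filter on the denominator drops routines that did not run during the window, so the division only happens where there were executions. A minimal Python sketch of that per-routine calculation follows; the routine names and rates are invented, and the function name is hypothetical.

```python
def goroutine_error_percentages(error_rate, total_rate):
    # Both arguments map (name, job_name) to a per-second rate.
    percentages = {}
    for key, total in total_rate.items():
        # Skip routines with no executions (the "> 0" filter) and
        # routines that have no matching error series.
        if total > 0 and key in error_rate:
            percentages[key] = error_rate[key] * 100 / total
    return percentages


# Hypothetical per-second rates keyed by (name, job_name).
errors = {("cleanup", "frontend"): 0.25, ("sync", "frontend"): 0.0, ("idle", "frontend"): 0.0}
totals = {("cleanup", "frontend"): 2.0, ("sync", "frontend"): 4.0, ("idle", "frontend"): 0.0}
result = goroutine_error_percentages(errors, totals)
print(max(result.values()) >= 5)  # True: "cleanup" fails 12.5% of the time
```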

frontend: mean_blocked_seconds_per_conn_request

mean blocked seconds per conn request

Descriptions

warning frontend: 0.1s+ mean blocked seconds per conn request for 10m0s
critical frontend: 0.5s+ mean blocked seconds per conn request for 10m0s

Next steps

Increase SRC_PGSQL_MAX_OPEN together with giving more memory to the database if needed
Scale up Postgres memory/cpus - see our scaling guide
If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_mean_blocked_seconds_per_conn_request",
  "critical_frontend_mean_blocked_seconds_per_conn_request"
]

_{Managed by the Sourcegraph Services team.}
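
Judging from the identifiers listed throughout this reference, each silence entry appears to follow the pattern `<level>_<service>_<alert name>`. The Python sketch below builds entries under that assumption; `silence_entries` is a hypothetical helper, not part of Sourcegraph.

```python
import json
from typing import Iterable, List


def silence_entries(service: str, alert: str, levels: Iterable[str]) -> List[str]:
    """Build identifiers of the assumed form <level>_<service>_<alert>."""
    return [f"{level}_{service}_{alert}" for level in levels]


if __name__ == "__main__":
    entries = silence_entries(
        "frontend",
        "mean_blocked_seconds_per_conn_request",
        ["warning", "critical"],
    )
    # Print a fragment shaped like the site configuration snippet above.
    print(json.dumps({"observability.silenceAlerts": entries}, indent=2))
```

Check the generated identifiers against the ones printed in this reference before adding them to your site configuration.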

Technical details

Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="frontend"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="frontend"}[5m]))) >= 0.1)

Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="frontend"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="frontend"}[5m]))) >= 0.5)
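
Both queries above divide the seconds spent blocked by the number of connection requests that had to wait, over the same 5m window. The Python sketch below reproduces that division for one (app_name, db_name) pair. The function names are hypothetical, and the sketch ignores the 10m duration the condition must hold before the alert fires.

```python
from typing import List, Optional


def mean_blocked_seconds(blocked_seconds_increase: float, waits_increase: float) -> Optional[float]:
    """Average seconds blocked per waiting connection request in the window."""
    # Simplification: when no request waited, report no value at all.
    if waits_increase <= 0:
        return None
    return blocked_seconds_increase / waits_increase


def alert_levels(mean: Optional[float]) -> List[str]:
    """Apply the 0.1s warning and 0.5s critical thresholds."""
    levels: List[str] = []
    if mean is None:
        return levels
    if mean >= 0.1:
        levels.append("warning")
    if mean >= 0.5:
        levels.append("critical")
    return levels


if __name__ == "__main__":
    # 30 seconds of blocking spread over 120 waits is 0.25s per request.
    print(alert_levels(mean_blocked_seconds(30.0, 120.0)))
```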

frontend: cpu_usage_percentage

CPU usage

Descriptions

warning frontend: 95%+ CPU usage for 10m0s

Next steps

Consider increasing CPU limits or scaling out.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_cpu_usage_percentage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^(frontend\\|sourcegraph-frontend).*"}) >= 95)

frontend: memory_rss

memory (RSS)

Descriptions

warning frontend: 90%+ memory (RSS) for 10m0s

Next steps

Consider increasing memory limits or scaling out.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_memory_rss"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (name) (container_memory_rss{name=~"^(frontend\\|sourcegraph-frontend).*"} / container_spec_memory_limit_bytes{name=~"^(frontend\\|sourcegraph-frontend).*"}) * 100) >= 90)

frontend: container_cpu_usage

container cpu usage total (1m average) across all cores by instance

Descriptions

warning frontend: 99%+ container cpu usage total (1m average) across all cores by instance

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_container_cpu_usage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^(frontend\\|sourcegraph-frontend).*"}) >= 99)

frontend: container_memory_usage

container memory usage by instance

Descriptions

warning frontend: 99%+ container memory usage by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_container_memory_usage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^(frontend\\|sourcegraph-frontend).*"}) >= 99)

frontend: provisioning_container_cpu_usage_long_term

container cpu usage total (90th percentile over 1d) across all cores by instance

Descriptions

warning frontend: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the (frontend|sourcegraph-frontend) service.
Docker Compose: Consider increasing cpus: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_provisioning_container_cpu_usage_long_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^(frontend\\|sourcegraph-frontend).*"}[1d])) >= 80)
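
The long-term alert looks at the 90th percentile of a day of CPU samples, so short spikes do not count against it. The Python sketch below illustrates this, assuming the quantile is taken by linear interpolation between neighbouring sorted samples. The `quantile` helper and the sample values are hypothetical, and the sketch ignores the 336h duration the condition must hold.

```python
import math
from typing import Sequence


def quantile(q: float, samples: Sequence[float]) -> float:
    """Quantile of samples for 0 <= q <= 1, using linear interpolation."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def warning_condition(samples: Sequence[float]) -> bool:
    """Mirror `quantile_over_time(0.9, ...) >= 80` for one container."""
    return quantile(0.9, samples) >= 80


if __name__ == "__main__":
    # Hourly samples: two hours at 95% CPU, the rest of the day at 40%.
    spiky_day = [40.0] * 22 + [95.0] * 2
    # Hourly samples: four hours at 95% CPU, the rest of the day at 40%.
    busy_day = [40.0] * 20 + [95.0] * 4
    print(warning_condition(spiky_day), warning_condition(busy_day))
```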

frontend: provisioning_container_memory_usage_long_term

container memory usage (1d maximum) by instance

Descriptions

warning frontend: 80%+ container memory usage (1d maximum) by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing memory limits in the Deployment.yaml for the (frontend|sourcegraph-frontend) service.
Docker Compose: Consider increasing memory: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_provisioning_container_memory_usage_long_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^(frontend\\|sourcegraph-frontend).*"}[1d])) >= 80)

frontend: provisioning_container_cpu_usage_short_term

container cpu usage total (5m maximum) across all cores by instance

Descriptions

warning frontend: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_provisioning_container_cpu_usage_short_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^(frontend\\|sourcegraph-frontend).*"}[5m])) >= 90)

frontend: provisioning_container_memory_usage_short_term

container memory usage (5m maximum) by instance

Descriptions

warning frontend: 90%+ container memory usage (5m maximum) by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_provisioning_container_memory_usage_short_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^(frontend\\|sourcegraph-frontend).*"}[5m])) >= 90)

frontend: container_oomkill_events_total

container OOMKILL events total by instance

Descriptions

warning frontend: 1+ container OOMKILL events total by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_container_oomkill_events_total"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^(frontend\\|sourcegraph-frontend).*"})) >= 1)

frontend: go_goroutines

maximum active goroutines

Descriptions

warning frontend: 10000+ maximum active goroutines for 10m0s

Next steps

More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_go_goroutines"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*(frontend\\|sourcegraph-frontend)"})) >= 10000)

frontend: go_gc_duration_seconds

maximum go garbage collection duration

Descriptions

warning frontend: 2s+ maximum go garbage collection duration

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_go_gc_duration_seconds"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*(frontend\\|sourcegraph-frontend)"})) >= 2)

frontend: pods_available_percentage

percentage pods available

Descriptions

critical frontend: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod (frontend|sourcegraph-frontend) (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p (frontend|sourcegraph-frontend).
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_frontend_pods_available_percentage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*(frontend\\|sourcegraph-frontend)"}) / count by (app) (up{app=~".*(frontend\\|sourcegraph-frontend)"}) * 100) <= 50)
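
The query above divides the number of targets reporting up (value 1) by the number of targets known for the app. The Python sketch below repeats that calculation for a single app. The function names are hypothetical, and the sketch ignores the 15m duration the condition must hold.

```python
from typing import Sequence


def pods_available_percentage(up_values: Sequence[int]) -> float:
    """Percentage of targets whose `up` value is 1."""
    if not up_values:
        raise ValueError("at least one target is required")
    return sum(up_values) / len(up_values) * 100


def critical_condition(up_values: Sequence[int]) -> bool:
    """Mirror the `<= 50` comparison in the generated query."""
    return pods_available_percentage(up_values) <= 50


if __name__ == "__main__":
    # Two of four replicas are down: exactly 50%, which the query counts.
    print(pods_available_percentage([1, 0, 0, 1]), critical_condition([1, 0, 0, 1]))
```

Note that the query compares with `<=`, so exactly half of the pods being unavailable already satisfies the condition.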

frontend: email_delivery_failures

email delivery failure rate over 30 minutes

Descriptions

warning frontend: 0%+ email delivery failure rate over 30 minutes
critical frontend: 10%+ email delivery failure rate over 30 minutes

Next steps

Check your SMTP configuration in site configuration.
Check sourcegraph-frontend logs for more detailed error messages.
Check your SMTP provider for more detailed error messages.
Use sum(increase(src_email_send{success="false"}[30m])) to check the raw count of delivery failures.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_frontend_email_delivery_failures",
  "critical_frontend_email_delivery_failures"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((sum(increase(src_email_send{success="false"}[30m])) / sum(increase(src_email_send[30m])) * 100) > 0)

Generated query for critical alert: max((sum(increase(src_email_send{success="false"}[30m])) / sum(increase(src_email_send[30m])) * 100) >= 10)
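
The two queries above share one ratio and differ only in the comparison: the warning uses `> 0`, so any failure at all triggers it, while the critical alert needs 10% or more. A minimal Python sketch of that logic follows, with hypothetical function names and counts standing in for the 30m increases.

```python
from typing import List, Optional


def failure_percentage(failed: float, attempted: float) -> Optional[float]:
    """Failed sends as a percentage of all sends in the window."""
    # Simplification: with no sends in the window, report no value.
    if attempted <= 0:
        return None
    return failed / attempted * 100


def alert_levels(percentage: Optional[float]) -> List[str]:
    """Warning on any failure (> 0), critical at 10% or more."""
    levels: List[str] = []
    if percentage is None:
        return levels
    if percentage > 0:
        levels.append("warning")
    if percentage >= 10:
        levels.append("critical")
    return levels


if __name__ == "__main__":
    # One failed email out of 16 is 6.25%: warning only.
    print(alert_levels(failure_percentage(1, 16)))
```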

gitserver: disk_space_remaining

disk space remaining

Descriptions

warning gitserver: less than 15% disk space remaining
critical gitserver: less than 10% disk space remaining for 10m0s

Next steps

On a warning alert, you may want to provision more disk space. Disk pressure may result in decreased performance, users having to wait for repositories to clone, etc.
On a critical alert, you need to provision more disk space. Running out of disk space will result in decreased performance or a complete service outage.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_disk_space_remaining",
  "critical_gitserver_disk_space_remaining"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) < 15)

Generated query for critical alert: min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) < 10)
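
Unlike most alerts in this reference, these two fire when a value drops below a threshold. The Python sketch below shows the same percentage calculation and comparisons. The function names are hypothetical, and the sketch ignores the 10m duration attached to the critical alert.

```python
from typing import List


def remaining_percentage(available_bytes: float, total_bytes: float) -> float:
    """Free disk space as a percentage of total disk space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return available_bytes / total_bytes * 100


def alert_levels(percentage: float) -> List[str]:
    """Warning below 15% remaining, critical below 10% remaining."""
    levels: List[str] = []
    if percentage < 15:
        levels.append("warning")
    if percentage < 10:
        levels.append("critical")
    return levels


if __name__ == "__main__":
    gib = 1024 ** 3
    # 64 GiB free on a 512 GiB volume is 12.5%: warning only.
    print(alert_levels(remaining_percentage(64 * gib, 512 * gib)))
```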

gitserver: echo_command_duration_test

echo test command duration

Descriptions

warning gitserver: 0.02s+ echo test command duration for 30s

Next steps

Single container deployments (removed in 7.0.0): The single-container deployment mode has been sunset. Migrate to Docker Compose.
Kubernetes and Docker Compose: Check that you are running a similar number of git server replicas and that their CPU/memory limits are allocated according to what is shown in the Sourcegraph resource estimator.
If your persistent volume is slow, you may want to provision more IOPS, usually by increasing the volume size.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_echo_command_duration_test"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max(src_gitserver_echo_duration_seconds)) >= 0.02)

gitserver: repo_corrupted

number of times a repo corruption has been identified

Descriptions

critical gitserver: 0+ number of times a repo corruption has been identified

Next steps

Check the corruption logs for details. gitserver_repos.corruption_logs contains more information.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_gitserver_repo_corrupted"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for critical alert: max((sum(rate(src_gitserver_repo_corrupted[5m]))) > 0)

gitserver: repository_clone_queue_size

repository clone queue size

Descriptions

warning gitserver: 25+ repository clone queue size

Next steps

If you just added several repositories, the warning may be expected.
Check which repositories need cloning, by visiting e.g. https://sourcegraph.example.com/admin/repositories?filter=not-cloned
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_repository_clone_queue_size"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum(src_gitserver_clone_queue)) >= 25)

gitserver: cpu_usage_percentage

CPU usage

Descriptions

warning gitserver: 95%+ CPU usage for 10m0s

Next steps

Consider increasing CPU limits or scaling out.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_cpu_usage_percentage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^gitserver.*"}) >= 95)

gitserver: memory_rss

memory (RSS)

Descriptions

warning gitserver: 90%+ memory (RSS) for 10m0s

Next steps

Consider increasing memory limits or scaling out.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_memory_rss"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max by (name) (container_memory_rss{name=~"^gitserver.*"} / container_spec_memory_limit_bytes{name=~"^gitserver.*"}) * 100) >= 90)

gitserver: cpu_throttling_time

container CPU throttling time %

Descriptions

warning gitserver: 75%+ container CPU throttling time % for 2m0s

Next steps

Consider increasing the CPU limit for the container.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_cpu_throttling_time"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (container_label_io_kubernetes_pod_name) ((rate(container_cpu_cfs_throttled_periods_total{container_label_io_kubernetes_container_name="gitserver"}[5m]) / rate(container_cpu_cfs_periods_total{container_label_io_kubernetes_container_name="gitserver"}[5m])) * 100)) >= 75)
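
The throttling percentage above is the share of CPU scheduler periods in which the container was throttled. Since both counters are rated over the same 5m window, the ratio of their increases gives the same result. The Python sketch below covers a single container, with hypothetical function names, and ignores the 2m duration the condition must hold.

```python
from typing import Optional


def throttling_percentage(throttled_periods: float, total_periods: float) -> Optional[float]:
    """Share of scheduler periods in which the container was throttled."""
    # Simplification: with no periods recorded, report no value.
    if total_periods <= 0:
        return None
    return throttled_periods / total_periods * 100


def warning_condition(percentage: Optional[float]) -> bool:
    """Mirror the `>= 75` comparison in the generated query."""
    return percentage is not None and percentage >= 75


if __name__ == "__main__":
    # 2250 throttled periods out of 3000 elapsed periods is 75%.
    print(warning_condition(throttling_percentage(2250, 3000)))
```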

gitserver: git_command_retry_attempts_rate

rate of git command corruption retry attempts over 5m

Descriptions

warning gitserver: 0.1reqps+ rate of git command corruption retry attempts over 5m for 5m0s

Next steps

Investigate the underlying cause of corruption errors in git commands.
Check disk health and I/O performance.
Monitor for patterns in specific git operations that trigger retries.
Consider adjusting retry configuration if retries are too frequent.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_git_command_retry_attempts_rate"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum(rate(src_gitserver_retry_attempts_total[5m]))) >= 0.1)
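
The 0.1 threshold is a per-second rate averaged over 5 minutes, which corresponds to roughly 30 retry attempts in the window. The Python sketch below shows that arithmetic. It is a simplification with hypothetical function names: it divides a plain count by the window length and does not model how Prometheus extrapolates or handles counter resets.

```python
WINDOW_SECONDS = 5 * 60  # the [5m] range in the query
WARNING_THRESHOLD_PER_SECOND = 0.1  # from the alert description


def retries_per_second(retries_in_window: float, window_seconds: float = WINDOW_SECONDS) -> float:
    """Average retry attempts per second over the window."""
    return retries_in_window / window_seconds


def warning_condition(retries_in_window: float) -> bool:
    """Mirror the `>= 0.1` comparison in the generated query."""
    return retries_per_second(retries_in_window) >= WARNING_THRESHOLD_PER_SECOND


if __name__ == "__main__":
    # 60 retries in 5 minutes averages 0.2 per second.
    print(retries_per_second(60), warning_condition(60))
```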

gitserver: goroutine_error_rate

error rate for periodic goroutine executions

Descriptions

warning gitserver: 0.01reqps+ error rate for periodic goroutine executions for 15m0s

Next steps

Check service logs for error details related to the failing periodic routine
Check if the routine depends on external services that may be unavailable
Look for recent changes to the routine's code or configuration
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_goroutine_error_rate"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (name, job_name) (rate(src_periodic_goroutine_errors_total{job=~".*gitserver.*"}[5m]))) >= 0.01)

gitserver: goroutine_error_percentage

percentage of periodic goroutine executions that result in errors

Descriptions

warning gitserver: 5%+ percentage of periodic goroutine executions that result in errors

Next steps

Check service logs for error details related to the failing periodic routine
Check if the routine depends on external services that may be unavailable
Consider temporarily disabling the routine if it's non-critical and causing cascading issues
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_goroutine_error_percentage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (name, job_name) (rate(src_periodic_goroutine_errors_total{job=~".*gitserver.*"}[5m])) / sum by (name, job_name) (rate(src_periodic_goroutine_total{job=~".*gitserver.*"}[5m]) > 0) * 100) >= 5)

gitserver: gitserver_site_configuration_duration_since_last_successful_update_by_instance

maximum duration since last successful site configuration update (all "gitserver" instances)

Descriptions

critical gitserver: 300s+ maximum duration since last successful site configuration update (all "gitserver" instances)

Next steps

This indicates that one or more "gitserver" instances have not successfully updated the site configuration in over 5 minutes. This could be due to networking issues between services or problems with the site configuration service itself.
Check for relevant errors in the "gitserver" logs, as well as frontend's logs.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_gitserver_gitserver_site_configuration_duration_since_last_successful_update_by_instance"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for critical alert: max((max(max_over_time(src_conf_client_time_since_last_successful_update_seconds{job=~".*gitserver"}[1m]))) >= 300)

gitserver: mean_blocked_seconds_per_conn_request

mean blocked seconds per conn request

Descriptions

warning gitserver: 0.1s+ mean blocked seconds per conn request for 10m0s
critical gitserver: 0.5s+ mean blocked seconds per conn request for 10m0s

Next steps

Increase SRC_PGSQL_MAX_OPEN together with giving more memory to the database if needed
Scale up Postgres memory/cpus - see our scaling guide
If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_mean_blocked_seconds_per_conn_request",
  "critical_gitserver_mean_blocked_seconds_per_conn_request"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="gitserver"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="gitserver"}[5m]))) >= 0.1)

Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="gitserver"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="gitserver"}[5m]))) >= 0.5)

gitserver: container_cpu_usage

container cpu usage total (1m average) across all cores by instance

Descriptions

warning gitserver: 99%+ container cpu usage total (1m average) across all cores by instance

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the gitserver container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_container_cpu_usage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^gitserver.*"}) >= 99)

gitserver: container_memory_usage

container memory usage by instance

Descriptions

warning gitserver: 99%+ container memory usage by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the gitserver container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_container_memory_usage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^gitserver.*"}) >= 99)

gitserver: provisioning_container_cpu_usage_long_term

container cpu usage total (90th percentile over 1d) across all cores by instance

Descriptions

warning gitserver: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the gitserver service.
Docker Compose: Consider increasing cpus: of the gitserver container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_provisioning_container_cpu_usage_long_term"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^gitserver.*"}[1d])) >= 80)

gitserver: provisioning_container_cpu_usage_short_term

container cpu usage total (5m maximum) across all cores by instance

Descriptions

warning gitserver: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the gitserver container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_provisioning_container_cpu_usage_short_term"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^gitserver.*"}[5m])) >= 90)

gitserver: container_oomkill_events_total

container OOMKILL events total by instance

Descriptions

warning gitserver: 1+ container OOMKILL events total by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the gitserver container in docker-compose.yml.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_container_oomkill_events_total"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^gitserver.*"})) >= 1)

gitserver: go_goroutines

maximum active goroutines

Descriptions

warning gitserver: 10000+ maximum active goroutines for 10m0s

Next steps

More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_go_goroutines"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*gitserver"})) >= 10000)

gitserver: go_gc_duration_seconds

maximum go garbage collection duration

Descriptions

warning gitserver: 2s+ maximum go garbage collection duration

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_gitserver_go_gc_duration_seconds"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*gitserver"})) >= 2)

gitserver: pods_available_percentage

percentage pods available

Descriptions

critical gitserver: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod gitserver (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p gitserver.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_gitserver_pods_available_percentage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*gitserver"}) / count by (app) (up{app=~".*gitserver"}) * 100) <= 50)

postgres: connections

active connections

Descriptions

warning postgres: less than 5 active connections for 5m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_postgres_connections"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: min((sum by (job) (pg_stat_activity_count{datname!~"template.*|postgres|cloudsqladmin"}) or sum by (job) (pg_stat_activity_count{datname!~"template.*|cloudsqladmin",job="codeinsights-db"})) <= 5)

postgres: usage_connections_percentage

connection in use

Descriptions

warning postgres: 80%+ connection in use for 5m0s
critical postgres: 100%+ connection in use for 5m0s

Next steps

Consider increasing max_connections of the database instance.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_postgres_usage_connections_percentage",
  "critical_postgres_usage_connections_percentage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((sum by (job) (pg_stat_activity_count) / (sum by (job) (pg_settings_max_connections) - sum by (job) (pg_settings_superuser_reserved_connections)) * 100) >= 80)

Generated query for critical alert: max((sum by (job) (pg_stat_activity_count) / (sum by (job) (pg_settings_max_connections) - sum by (job) (pg_settings_superuser_reserved_connections)) * 100) >= 100)
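
To make the threshold arithmetic in the two generated queries concrete, here is a minimal Python sketch of the same ratio: active connections divided by the connections left over after the superuser reservation. The function names and sample numbers are hypothetical and chosen only for illustration.

```python
def connection_usage_percentage(active, max_connections, superuser_reserved):
    """Mirror the query: active / (max_connections - superuser_reserved) * 100."""
    usable = max_connections - superuser_reserved
    if usable <= 0:
        raise ValueError("no connections available to regular users")
    return active / usable * 100


def alert_level(usage_pct):
    """Apply the documented thresholds: 80%+ is a warning, 100%+ is critical."""
    if usage_pct >= 100:
        return "critical"
    if usage_pct >= 80:
        return "warning"
    return "ok"


# Hypothetical sample: 80 active connections, max_connections=100, 3 reserved.
usage = connection_usage_percentage(active=80, max_connections=100, superuser_reserved=3)
print(f"{usage:.1f}% -> {alert_level(usage)}")  # 82.5% -> warning
```

Subtracting the reserved connections means the warning can trigger slightly before the raw active/max_connections ratio reaches 80%.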

postgres: transaction_durations

maximum transaction durations

Descriptions

warning postgres: 0.3s+ maximum transaction durations for 5m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_postgres_transaction_durations"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((sum by (job) (pg_stat_activity_max_tx_duration{datname!~"template.*|postgres|cloudsqladmin",job!="codeintel-db"}) or sum by (job) (pg_stat_activity_max_tx_duration{datname!~"template.*|cloudsqladmin",job="codeinsights-db"})) >= 0.3)

postgres: postgres_up

database availability

Descriptions

critical postgres: less than 0 database availability for 5m0s

Next steps

Kubernetes:
- Determine if the pod was OOM killed using kubectl describe pod (pgsql|codeintel-db|codeinsights) (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p (pgsql|codeintel-db|codeinsights).
- Check if there is any OOMKILL event using the provisioning panels
- Check kernel logs using dmesg for OOMKILL events on worker nodes
Docker Compose:
- Determine if the container was OOM killed using docker inspect -f '{{json .State}}' (pgsql|codeintel-db|codeinsights) (look for "OOMKilled":true) and, if so, consider increasing the memory limit of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
- Check the logs before the container restarted to see if there are panic: messages or similar using docker logs (pgsql|codeintel-db|codeinsights) (note this will include logs from the previous and currently running container).
- Check if there is any OOMKILL event using the provisioning panels
- Check kernel logs using dmesg for OOMKILL events
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_postgres_postgres_up"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for critical alert: min((pg_up) <= 0)
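
For the Docker Compose check in the next steps above, the JSON printed by docker inspect -f '{{json .State}}' can also be inspected programmatically. The sketch below is a minimal, hypothetical helper that takes that JSON as a string; the sample payload is made up and trimmed to a few fields for illustration.

```python
import json


def was_oom_killed(state_json):
    """Return True if a container State payload reports an OOM kill."""
    state = json.loads(state_json)
    return bool(state.get("OOMKilled", False))


# Made-up output of: docker inspect -f '{{json .State}}' pgsql
sample = '{"Status":"exited","Running":false,"OOMKilled":true,"ExitCode":137}'
print(was_oom_killed(sample))  # True
```

In practice the string would come from running the docker inspect command against whichever of the pgsql, codeintel-db, or codeinsights containers is affected.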

postgres: invalid_indexes

invalid indexes (unusable by the query planner)

Descriptions

warning postgres: 1+ invalid indexes (unusable by the query planner)

Next steps

Drop and re-create the invalid index. Contact Sourcegraph to obtain the index definition.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_postgres_invalid_indexes"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: sum((max by (relname) (pg_invalid_index_count)) >= 1)

postgres: pg_exporter_err

errors scraping postgres exporter

Descriptions

warning postgres: 1+ errors scraping postgres exporter for 5m0s

Next steps

Ensure the Postgres exporter can access the Postgres database. Also, check the Postgres exporter logs for errors.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_postgres_pg_exporter_err"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((pg_exporter_last_scrape_error) >= 1)

postgres: migration_in_progress

active schema migration

Descriptions

warning postgres: 1+ active schema migration for 5m0s

Next steps

The database migration has been in progress for 5 or more minutes - please contact Sourcegraph if this persists.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_postgres_migration_in_progress"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((pg_sg_migration_status) >= 1)

postgres: provisioning_container_cpu_usage_long_term

container cpu usage total (90th percentile over 1d) across all cores by instance

Descriptions

warning postgres: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the (pgsql|codeintel-db|codeinsights) service.
Docker Compose: Consider increasing cpus: of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_postgres_provisioning_container_cpu_usage_long_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^(pgsql|codeintel-db|codeinsights).*"}[1d])) >= 80)
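
The long-term CPU alert compares a 90th percentile over one day against the threshold, so a brief spike on its own should not trigger it. The Python sketch below illustrates why. It assumes linear interpolation between the two nearest ranks, which is an assumption about how the quantile is computed rather than something stated in this reference, and the sample values are hypothetical.

```python
import math


def quantile(q, samples):
    """Quantile with linear interpolation between the two nearest ranks."""
    if not samples:
        raise ValueError("no samples")
    values = sorted(samples)
    rank = q * (len(values) - 1)
    lower = math.floor(rank)
    upper = min(len(values) - 1, lower + 1)
    weight = rank - lower
    return values[lower] * (1 - weight) + values[upper] * weight


# Hypothetical CPU usage percentages sampled over one day, with one short spike.
steady_with_spike = [30, 32, 31, 29, 33, 30, 31, 32, 30, 98]
p90 = quantile(0.9, steady_with_spike)
print(f"{p90:.1f}", p90 >= 80)  # 39.5 False
```

A single sample at 98% barely moves the 90th percentile, whereas sustained usage above 80% would push it over the threshold.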

postgres: provisioning_container_memory_usage_long_term

container memory usage (1d maximum) by instance

Descriptions

warning postgres: 80%+ container memory usage (1d maximum) by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing memory limits in the Deployment.yaml for the (pgsql|codeintel-db|codeinsights) service.
Docker Compose: Consider increasing memory: of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_postgres_provisioning_container_memory_usage_long_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^(pgsql|codeintel-db|codeinsights).*"}[1d])) >= 80)

postgres: provisioning_container_cpu_usage_short_term

container cpu usage total (5m maximum) across all cores by instance

Descriptions

warning postgres: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_postgres_provisioning_container_cpu_usage_short_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^(pgsql|codeintel-db|codeinsights).*"}[5m])) >= 90)

postgres: provisioning_container_memory_usage_short_term

container memory usage (5m maximum) by instance

Descriptions

warning postgres: 90%+ container memory usage (5m maximum) by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_postgres_provisioning_container_memory_usage_short_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^(pgsql|codeintel-db|codeinsights).*"}[5m])) >= 90)

postgres: container_oomkill_events_total

container OOMKILL events total by instance

Descriptions

warning postgres: 1+ container OOMKILL events total by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_postgres_container_oomkill_events_total"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^(pgsql|codeintel-db|codeinsights).*"})) >= 1)

postgres: pods_available_percentage

percentage pods available

Descriptions

critical postgres: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod (pgsql|codeintel-db|codeinsights) (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p (pgsql|codeintel-db|codeinsights).
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_postgres_pods_available_percentage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*(pgsql|codeintel-db|codeinsights)"}) / count by (app) (up{app=~".*(pgsql|codeintel-db|codeinsights)"}) * 100) <= 50)

precise-code-intel-worker: mean_blocked_seconds_per_conn_request

mean blocked seconds per conn request

Descriptions

warning precise-code-intel-worker: 0.1s+ mean blocked seconds per conn request for 10m0s
critical precise-code-intel-worker: 0.5s+ mean blocked seconds per conn request for 10m0s

Next steps

Increase SRC_PGSQL_MAX_OPEN and, if needed, give the database more memory.
Scale up Postgres memory/cpus - see our scaling guide
If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_precise-code-intel-worker_mean_blocked_seconds_per_conn_request",
  "critical_precise-code-intel-worker_mean_blocked_seconds_per_conn_request"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="precise-code-intel-worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="precise-code-intel-worker"}[5m]))) >= 0.1)

Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="precise-code-intel-worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="precise-code-intel-worker"}[5m]))) >= 0.5)
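
Both queries above divide the 5m increase in src_pgsql_conns_blocked_seconds by the 5m increase in src_pgsql_conns_waited_for. A minimal Python sketch of that ratio and of the two documented thresholds follows. The function names and sample values are hypothetical, and reading the denominator as "requests that had to wait" is an interpretation of the metric name rather than something this reference states.

```python
from typing import Optional


def mean_blocked_seconds(blocked_seconds_increase: float, waited_for_increase: float) -> Optional[float]:
    """Blocked seconds per connection request over the window, or None if nothing waited."""
    if waited_for_increase == 0:
        return None
    return blocked_seconds_increase / waited_for_increase


def alert_level(mean_blocked: Optional[float]) -> str:
    """Apply the documented thresholds: 0.1s+ is a warning, 0.5s+ is critical."""
    if mean_blocked is None:
        return "ok"
    if mean_blocked >= 0.5:
        return "critical"
    if mean_blocked >= 0.1:
        return "warning"
    return "ok"


# Hypothetical 5m window: 12 seconds blocked in total across 60 waiting requests.
value = mean_blocked_seconds(12.0, 60.0)
print(value, alert_level(value))  # 0.2 warning
```

The alert additionally requires the condition to hold for 10 minutes, so a single bad window like this one would not fire on its own.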

precise-code-intel-worker: pods_available_percentage

percentage pods available

Descriptions

critical precise-code-intel-worker: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod precise-code-intel-worker (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p precise-code-intel-worker.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_precise-code-intel-worker_pods_available_percentage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*precise-code-intel-worker"}) / count by (app) (up{app=~".*precise-code-intel-worker"}) * 100) <= 50)

syntactic-indexing: mean_blocked_seconds_per_conn_request

mean blocked seconds per conn request

Descriptions

warning syntactic-indexing: 0.1s+ mean blocked seconds per conn request for 10m0s
critical syntactic-indexing: 0.5s+ mean blocked seconds per conn request for 10m0s

Next steps

Increase SRC_PGSQL_MAX_OPEN and, if needed, give the database more memory.
Scale up Postgres memory/cpus - see our scaling guide
If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_syntactic-indexing_mean_blocked_seconds_per_conn_request",
  "critical_syntactic-indexing_mean_blocked_seconds_per_conn_request"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="syntactic-code-intel-worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="syntactic-code-intel-worker"}[5m]))) >= 0.1)

Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="syntactic-code-intel-worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="syntactic-code-intel-worker"}[5m]))) >= 0.5)

syntactic-indexing: pods_available_percentage

percentage pods available

Descriptions

critical syntactic-indexing: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod syntactic-code-intel-worker (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p syntactic-code-intel-worker.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_syntactic-indexing_pods_available_percentage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*syntactic-code-intel-worker"}) / count by (app) (up{app=~".*syntactic-code-intel-worker"}) * 100) <= 50)

redis: redis-store_up

redis-store availability

Descriptions

critical redis: less than 1 redis-store availability for 10s

Next steps

Ensure redis-store is running
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_redis_redis-store_up"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for critical alert: min((redis_up{app="redis-store"}) < 1)

redis: redis-cache_up

redis-cache availability

Descriptions

critical redis: less than 1 redis-cache availability for 10s

Next steps

Ensure redis-cache is running
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_redis_redis-cache_up"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for critical alert: min((redis_up{app="redis-cache"}) < 1)

redis: provisioning_container_cpu_usage_long_term

container cpu usage total (90th percentile over 1d) across all cores by instance

Descriptions

warning redis: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the redis-cache service.
Docker Compose: Consider increasing cpus: of the redis-cache container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_redis_provisioning_container_cpu_usage_long_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^redis-cache.*"}[1d])) >= 80)

redis: provisioning_container_memory_usage_long_term

container memory usage (1d maximum) by instance

Descriptions

warning redis: 80%+ container memory usage (1d maximum) by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing memory limits in the Deployment.yaml for the redis-cache service.
Docker Compose: Consider increasing memory: of the redis-cache container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_redis_provisioning_container_memory_usage_long_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^redis-cache.*"}[1d])) >= 80)

redis: provisioning_container_cpu_usage_short_term

container cpu usage total (5m maximum) across all cores by instance

Descriptions

warning redis: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the redis-cache container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_redis_provisioning_container_cpu_usage_short_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^redis-cache.*"}[5m])) >= 90)

redis: provisioning_container_memory_usage_short_term

container memory usage (5m maximum) by instance

Descriptions

warning redis: 90%+ container memory usage (5m maximum) by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the redis-cache container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_redis_provisioning_container_memory_usage_short_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^redis-cache.*"}[5m])) >= 90)

redis: container_oomkill_events_total

container OOMKILL events total by instance

Descriptions

warning redis: 1+ container OOMKILL events total by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the redis-cache container in docker-compose.yml.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_redis_container_oomkill_events_total"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^redis-cache.*"})) >= 1)

redis: provisioning_container_cpu_usage_long_term

container cpu usage total (90th percentile over 1d) across all cores by instance

Descriptions

warning redis: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the redis-store service.
Docker Compose: Consider increasing cpus: of the redis-store container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_redis_provisioning_container_cpu_usage_long_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^redis-store.*"}[1d])) >= 80)

redis: provisioning_container_memory_usage_long_term

container memory usage (1d maximum) by instance

Descriptions

warning redis: 80%+ container memory usage (1d maximum) by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing memory limits in the Deployment.yaml for the redis-store service.
Docker Compose: Consider increasing memory: of the redis-store container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_redis_provisioning_container_memory_usage_long_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^redis-store.*"}[1d])) >= 80)

redis: provisioning_container_cpu_usage_short_term

container cpu usage total (5m maximum) across all cores by instance

Descriptions

warning redis: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the redis-store container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_redis_provisioning_container_cpu_usage_short_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^redis-store.*"}[5m])) >= 90)

redis: provisioning_container_memory_usage_short_term

container memory usage (5m maximum) by instance

Descriptions

warning redis: 90%+ container memory usage (5m maximum) by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the redis-store container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_redis_provisioning_container_memory_usage_short_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^redis-store.*"}[5m])) >= 90)

redis: container_oomkill_events_total

container OOMKILL events total by instance

Descriptions

warning redis: 1+ container OOMKILL events total by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the redis-store container in docker-compose.yml.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_redis_container_oomkill_events_total"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^redis-store.*"})) >= 1)

redis: pods_available_percentage

percentage pods available

Descriptions

critical redis: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod redis-cache (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p redis-cache.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_redis_pods_available_percentage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*redis-cache"}) / count by (app) (up{app=~".*redis-cache"}) * 100) <= 50)

redis: pods_available_percentage

percentage pods available

Descriptions

critical redis: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod redis-store (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p redis-store.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_redis_pods_available_percentage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*redis-store"}) / count by (app) (up{app=~".*redis-store"}) * 100) <= 50)

worker: worker_job_codeintel-upload-janitor_count

number of worker instances running the codeintel-upload-janitor job

Descriptions

warning worker: less than 1 number of worker instances running the codeintel-upload-janitor job for 1m0s
critical worker: less than 1 number of worker instances running the codeintel-upload-janitor job for 5m0s

Next steps

Ensure your instance defines a worker container such that:
- WORKER_JOB_ALLOWLIST contains "codeintel-upload-janitor" (or "all"), and
- WORKER_JOB_BLOCKLIST does not contain "codeintel-upload-janitor"
Ensure that such a container is not failing to start or stay active
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_worker_job_codeintel-upload-janitor_count",
  "critical_worker_worker_job_codeintel-upload-janitor_count"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: (min((sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-upload-janitor"})) < 1)) or (absent(sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-upload-janitor"})) == 1)

Generated query for critical alert: (min((sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-upload-janitor"})) < 1)) or (absent(sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-upload-janitor"})) == 1)
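The allowlist and blocklist conditions in the next steps above can be sketched as a small check. This is a minimal illustration, not the worker's actual implementation: the function name is hypothetical, and it assumes both variables hold comma-separated job names.

```python
def job_enabled(job_name, allowlist, blocklist):
    """Return True if a worker with these settings would run job_name.

    Assumption: WORKER_JOB_ALLOWLIST and WORKER_JOB_BLOCKLIST are
    comma-separated lists of job names, and "all" in the allowlist
    enables every job that is not blocked.
    """
    allowed = {name.strip() for name in allowlist.split(",") if name.strip()}
    blocked = {name.strip() for name in blocklist.split(",") if name.strip()}
    if job_name in blocked:
        return False
    return "all" in allowed or job_name in allowed


# The blocklist wins even when the allowlist is "all".
print(job_enabled("codeintel-upload-janitor", "all", ""))
print(job_enabled("codeintel-upload-janitor", "all", "codeintel-upload-janitor"))
```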

worker: worker_job_codeintel-commitgraph-updater_count

number of worker instances running the codeintel-commitgraph-updater job

Descriptions

warning worker: less than 1 number of worker instances running the codeintel-commitgraph-updater job for 1m0s
critical worker: less than 1 number of worker instances running the codeintel-commitgraph-updater job for 5m0s

Next steps

Ensure your instance defines a worker container such that:
- WORKER_JOB_ALLOWLIST contains "codeintel-commitgraph-updater" (or "all"), and
- WORKER_JOB_BLOCKLIST does not contain "codeintel-commitgraph-updater"
Ensure that such a container is not failing to start or stay active
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_worker_job_codeintel-commitgraph-updater_count",
  "critical_worker_worker_job_codeintel-commitgraph-updater_count"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: (min((sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-commitgraph-updater"})) < 1)) or (absent(sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-commitgraph-updater"})) == 1)

Generated query for critical alert: (min((sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-commitgraph-updater"})) < 1)) or (absent(sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-commitgraph-updater"})) == 1)
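The silence identifiers listed in this reference appear to follow the pattern level, service, and alert name joined by underscores. The sketch below builds the site configuration fragment on that assumption; the helper name is hypothetical, and the identifiers should be checked against the ones printed for each alert before use.

```python
import json


def silence_ids(service, alert, levels):
    """Build silence identifiers, assuming the <level>_<service>_<alert> pattern."""
    return [f"{level}_{service}_{alert}" for level in levels]


fragment = {
    "observability.silenceAlerts": silence_ids(
        "worker",
        "worker_job_codeintel-commitgraph-updater_count",
        ["warning", "critical"],
    )
}

# Prints a JSON object whose single key can be merged into the site configuration.
print(json.dumps(fragment, indent=2))
```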

worker: worker_job_codeintel-autoindexing-scheduler_count

number of worker instances running the codeintel-autoindexing-scheduler job

Descriptions

warning worker: less than 1 number of worker instances running the codeintel-autoindexing-scheduler job for 1m0s
critical worker: less than 1 number of worker instances running the codeintel-autoindexing-scheduler job for 5m0s

Next steps

Ensure your instance defines a worker container such that:
- WORKER_JOB_ALLOWLIST contains "codeintel-autoindexing-scheduler" (or "all"), and
- WORKER_JOB_BLOCKLIST does not contain "codeintel-autoindexing-scheduler"
Ensure that such a container is not failing to start or stay active
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_worker_job_codeintel-autoindexing-scheduler_count",
  "critical_worker_worker_job_codeintel-autoindexing-scheduler_count"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: (min((sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-autoindexing-scheduler"})) < 1)) or (absent(sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-autoindexing-scheduler"})) == 1)

Generated query for critical alert: (min((sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-autoindexing-scheduler"})) < 1)) or (absent(sum(src_worker_jobs{job=~"^worker.*",job_name="codeintel-autoindexing-scheduler"})) == 1)

worker: src_repoupdater_max_sync_backoff

time since oldest sync

Descriptions

critical worker: 32400s+ time since oldest sync for 10m0s

Next steps

An alert here indicates that no code host connections have synced in at least 9h0m0s. This could point to a configuration issue with your code host connections or to networking issues affecting communication with your code hosts.
Check the code host status indicator (cloud icon in top right of Sourcegraph homepage) for errors.
Make sure external services do not have invalid tokens by navigating to them in the web UI and clicking save. If there are no errors, they are valid.
Check the worker logs for errors about syncing.
Confirm that outbound network connections are allowed where worker is deployed.
Check back in an hour to see if the issue has resolved itself.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_worker_src_repoupdater_max_sync_backoff"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for critical alert: max((max(src_repoupdater_max_sync_backoff)) >= 32400)

worker: src_repoupdater_syncer_sync_errors_total

site level external service sync error rate

Descriptions

warning worker: 0.5+ site level external service sync error rate for 10m0s
critical worker: 1+ site level external service sync error rate for 10m0s

Next steps

An alert here indicates errors syncing site level repo metadata with code hosts. This could point to a configuration issue with your code host connections or to networking issues affecting communication with your code hosts.
Check the code host status indicator (cloud icon in top right of Sourcegraph homepage) for errors.
Make sure external services do not have invalid tokens by navigating to them in the web UI and clicking save. If there are no errors, they are valid.
Check the worker logs for errors about syncing.
Confirm that outbound network connections are allowed where worker is deployed.
Check back in an hour to see if the issue has resolved itself.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_src_repoupdater_syncer_sync_errors_total",
  "critical_worker_src_repoupdater_syncer_sync_errors_total"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max by (family) (rate(src_repoupdater_syncer_sync_errors_total{owner!="user",reason!="internal_rate_limit",reason!="invalid_npm_path"}[5m]))) > 0.5)

Generated query for critical alert: max((max by (family) (rate(src_repoupdater_syncer_sync_errors_total{owner!="user",reason!="internal_rate_limit",reason!="invalid_npm_path"}[5m]))) > 1)

worker: syncer_sync_start

repo metadata sync was started

Descriptions

warning worker: less than 0 repo metadata sync was started for 9h0m0s

Next steps

Check worker logs for errors.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_syncer_sync_start"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: min((max by (family) (rate(src_repoupdater_syncer_start_sync{family="Syncer.SyncExternalService"}[9h]))) <= 0)

worker: syncer_sync_duration

95th repositories sync duration

Descriptions

warning worker: 30s+ 95th repositories sync duration for 5m0s

Next steps

Check that the network latency between Sourcegraph and the code host is reasonable (<50ms).
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_syncer_sync_duration"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((histogram_quantile(0.95, max by (le, family, success) (rate(src_repoupdater_syncer_sync_duration_seconds_bucket[1m])))) >= 30)
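The query above relies on `histogram_quantile`, which for classic Prometheus histograms estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for illustration only; the bucket boundaries and counts are made up, and it assumes `0 < q <= 1` and a lowest bucket that starts at 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with an (math.inf, total) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical sync durations in seconds: 95 of 100 syncs finish within ~22.5s.
fast = [(1, 50), (10, 90), (30, 98), (60, 100), (math.inf, 100)]
slow = [(1, 50), (10, 60), (30, 80), (60, 100), (math.inf, 100)]
print(histogram_quantile(0.95, fast) >= 30)  # below the 30s threshold
print(histogram_quantile(0.95, slow) >= 30)  # at or above the 30s threshold
```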

worker: source_duration

95th repositories source duration

Descriptions

warning worker: 30s+ 95th repositories source duration for 5m0s

Next steps

Check that the network latency between Sourcegraph and the code host is reasonable (<50ms).
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_source_duration"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((histogram_quantile(0.95, max by (le) (rate(src_repoupdater_source_duration_seconds_bucket[1m])))) >= 30)

worker: syncer_synced_repos

repositories synced

Descriptions

warning worker: less than 0 repositories synced for 9h0m0s

Next steps

Check network connectivity to code hosts
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_syncer_synced_repos"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max(rate(src_repoupdater_syncer_synced_repos_total[1m]))) <= 0)

worker: sourced_repos

repositories sourced

Descriptions

warning worker: less than 0 repositories sourced for 9h0m0s

Next steps

Check network connectivity to code hosts
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_sourced_repos"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: min((max(rate(src_repoupdater_source_repos_total[1m]))) <= 0)

worker: sched_auto_fetch

repositories scheduled due to hitting a deadline

Descriptions

warning worker: less than 0 repositories scheduled due to hitting a deadline for 9h0m0s

Next steps

Check worker logs.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_sched_auto_fetch"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: min((max(rate(src_repoupdater_sched_auto_fetch[1m]))) <= 0)

worker: sched_loops

scheduler loops

Descriptions

warning worker: less than 0 scheduler loops for 9h0m0s

Next steps

Check worker logs for errors. This alert is expected to fire if there are no user-added code hosts.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_sched_loops"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: min((max(rate(src_repoupdater_sched_loops[1m]))) <= 0)

worker: src_repoupdater_stale_repos

repos that haven't been fetched in more than 8 hours

Descriptions

warning worker: 1+ repos that haven't been fetched in more than 8 hours for 25m0s

Next steps

Check worker logs for errors. Check for rows in gitserver_repos where LastError is not an empty string.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_src_repoupdater_stale_repos"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max(src_repoupdater_stale_repos)) >= 1)

worker: sched_error

repositories schedule error rate

Descriptions

critical worker: 1+ repositories schedule error rate for 25m0s

Next steps

Check worker logs for errors
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_worker_sched_error"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for critical alert: max((max(rate(src_repoupdater_sched_error[1m]))) >= 1)

worker: src_repoupdater_cleanup_failed_repos

repos that have failed cleanup more than 5 times consecutively

Descriptions

critical worker: 1+ repos that have failed cleanup more than 5 times consecutively

Next steps

Check worker logs for cleanup errors. Check for rows in gitserver_repos where failed_cleanup_attempts > 5. Repositories that consistently fail to be optimized will eventually cause performance problems, so these failures need to be addressed.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_worker_src_repoupdater_cleanup_failed_repos"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for critical alert: max((max(src_repoupdater_cleanup_failed_repos)) >= 1)

worker: src_repoupdater_external_services_total

the total number of external services

Descriptions

critical worker: 20000+ the total number of external services for 1h0m0s

Next steps

Check for spikes in the number of external services, which could indicate abuse.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_worker_src_repoupdater_external_services_total"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for critical alert: max((max(src_repoupdater_external_services_total)) >= 20000)

worker: repoupdater_queued_sync_jobs_total

the total number of queued sync jobs

Descriptions

warning worker: 100+ the total number of queued sync jobs for 1h0m0s

Next steps

Check if jobs are failing to sync: SELECT * FROM external_service_sync_jobs WHERE state = 'errored';
Increase the number of workers using the repoConcurrentExternalServiceSyncers site config.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_repoupdater_queued_sync_jobs_total"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max(src_repoupdater_queued_sync_jobs_total)) >= 100)
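To illustrate the kind of check suggested in the next steps above, the sketch below runs the errored-jobs query against an in-memory SQLite table. This is only a stand-in: the table name and the `state` column come from this reference, while the other columns and the sample rows are hypothetical, and a real instance would be queried through its own database client.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, minimal schema for illustration only.
conn.execute(
    "CREATE TABLE external_service_sync_jobs ("
    "id INTEGER PRIMARY KEY, external_service_id INTEGER, state TEXT)"
)
conn.executemany(
    "INSERT INTO external_service_sync_jobs (external_service_id, state) VALUES (?, ?)",
    [(1, "queued"), (1, "errored"), (2, "errored"), (3, "completed")],
)


def count_by_state(connection):
    """Return a {state: count} mapping to see how jobs are distributed."""
    rows = connection.execute(
        "SELECT state, COUNT(*) FROM external_service_sync_jobs "
        "GROUP BY state ORDER BY state"
    )
    return dict(rows.fetchall())


# The state value is a string literal, so it needs single quotes in SQL.
errored = conn.execute(
    "SELECT id, external_service_id FROM external_service_sync_jobs "
    "WHERE state = 'errored' ORDER BY id"
).fetchall()

print(count_by_state(conn))
print(errored)
```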

worker: repoupdater_completed_sync_jobs_total

the total number of completed sync jobs

Descriptions

warning worker: 100000+ the total number of completed sync jobs for 1h0m0s

Next steps

Check worker logs. Jobs older than 1 day should have been removed.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_repoupdater_completed_sync_jobs_total"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max(src_repoupdater_completed_sync_jobs_total)) >= 100000)

worker: repoupdater_errored_sync_jobs_percentage

the percentage of external services that have failed their most recent sync

Descriptions

warning worker: 10%+ the percentage of external services that have failed their most recent sync for 1h0m0s

Next steps

Check worker logs. Check code host connectivity
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_repoupdater_errored_sync_jobs_percentage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max(src_repoupdater_errored_sync_jobs_percentage)) > 10)

worker: github_graphql_rate_limit_remaining

remaining calls to GitHub graphql API before hitting the rate limit

Descriptions

warning worker: less than 250 remaining calls to GitHub graphql API before hitting the rate limit

Next steps

Consider creating a new token for the indicated resource (the name label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_github_graphql_rate_limit_remaining"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: min((max by (name) (src_github_rate_limit_remaining_v2{resource="graphql"})) <= 250)

worker: github_rest_rate_limit_remaining

remaining calls to GitHub rest API before hitting the rate limit

Descriptions

warning worker: less than 250 remaining calls to GitHub rest API before hitting the rate limit

Next steps

Consider creating a new token for the indicated resource (the name label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_github_rest_rate_limit_remaining"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: min((max by (name) (src_github_rate_limit_remaining_v2{resource="rest"})) <= 250)

worker: github_search_rate_limit_remaining

remaining calls to GitHub search API before hitting the rate limit

Descriptions

warning worker: less than 5 remaining calls to GitHub search API before hitting the rate limit

Next steps

Consider creating a new token for the indicated resource (the name label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_github_search_rate_limit_remaining"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: min((max by (name) (src_github_rate_limit_remaining_v2{resource="search"})) <= 5)

worker: gitlab_rest_rate_limit_remaining

remaining calls to GitLab rest API before hitting the rate limit

Descriptions

critical worker: less than 30 remaining calls to GitLab rest API before hitting the rate limit

Next steps

Try restarting the pod to get a different public IP.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_worker_gitlab_rest_rate_limit_remaining"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for critical alert: min((max by (name) (src_gitlab_rate_limit_remaining{resource="rest"})) <= 30)

worker: perms_syncer_outdated_perms

number of entities with outdated permissions

Descriptions

warning worker: 100+ number of entities with outdated permissions for 5m0s

Next steps

Enabled permissions for the first time: Wait a few minutes and see if the number goes down.
Otherwise: Increase the API rate limit to GitHub, GitLab or Bitbucket Server.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_perms_syncer_outdated_perms"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max by (type) (src_repo_perms_syncer_outdated_perms)) >= 100)

worker: perms_syncer_sync_duration

95th permissions sync duration

Descriptions

warning worker: 30s+ 95th permissions sync duration for 5m0s

Next steps

Check that the network latency between Sourcegraph and the code host is reasonable (<50ms).
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_perms_syncer_sync_duration"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((histogram_quantile(0.95, max by (le, type) (rate(src_repo_perms_syncer_sync_duration_seconds_bucket[1m])))) >= 30)

worker: goroutine_error_rate

error rate for periodic goroutine executions

Descriptions

warning worker: 0.01reqps+ error rate for periodic goroutine executions for 15m0s

Next steps

Check service logs for error details related to the failing periodic routine
Check if the routine depends on external services that may be unavailable
Look for recent changes to the routine's code or configuration
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_goroutine_error_rate"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (name, job_name) (rate(src_periodic_goroutine_errors_total{job=~".*worker.*"}[5m]))) >= 0.01)

worker: goroutine_error_percentage

percentage of periodic goroutine executions that result in errors

Descriptions

warning worker: 5%+ percentage of periodic goroutine executions that result in errors

Next steps

Check service logs for error details related to the failing periodic routine
Check if the routine depends on external services that may be unavailable
Consider temporarily disabling the routine if it's non-critical and causing cascading issues
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_goroutine_error_percentage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (name, job_name) (rate(src_periodic_goroutine_errors_total{job=~".*worker.*"}[5m])) / sum by (name, job_name) (rate(src_periodic_goroutine_total{job=~".*worker.*"}[5m]) > 0) * 100) >= 5)
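The ratio in the query above can be worked through by hand: the error rate divided by the total execution rate, times 100, compared against 5. A minimal sketch with hypothetical rates follows; routines with no executions are skipped, mirroring the `> 0` filter in the query.

```python
def error_percentage(error_rate, total_rate):
    """Percentage of executions that errored, or None when nothing ran.

    Both arguments are per-second rates over the same window (hypothetical values).
    """
    if total_rate <= 0:
        return None  # mirrors the `> 0` filter: no executions, no data point
    return error_rate / total_rate * 100


def is_warning(error_rate, total_rate, threshold=5):
    """True when the error percentage is at or above the warning threshold."""
    pct = error_percentage(error_rate, total_rate)
    return pct is not None and pct >= threshold


# 0.02 errors/s out of 0.5 executions/s is about 4%, below the 5% threshold.
print(is_warning(0.02, 0.5))
# 0.05 errors/s out of 0.5 executions/s is about 10%, above the threshold.
print(is_warning(0.05, 0.5))
```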

worker: mean_blocked_seconds_per_conn_request

mean blocked seconds per conn request

Descriptions

warning worker: 0.1s+ mean blocked seconds per conn request for 10m0s
critical worker: 0.5s+ mean blocked seconds per conn request for 10m0s

Next steps

Increase SRC_PGSQL_MAX_OPEN together with giving more memory to the database if needed
Scale up Postgres memory/cpus - see our scaling guide
If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_mean_blocked_seconds_per_conn_request",
  "critical_worker_mean_blocked_seconds_per_conn_request"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="worker"}[5m]))) >= 0.1)

Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="worker"}[5m]))) >= 0.5)
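As a worked example of the two queries above: the mean is the increase in blocked seconds divided by the increase in connection requests that had to wait, both over the same 5m window, with 0.1s and 0.5s as the warning and critical thresholds. The function names and inputs below are hypothetical, a window with no waiting requests is treated as "no data" as a simplifying assumption, and the 10m0s duration is not modelled.

```python
def mean_blocked_seconds(blocked_seconds_increase, waited_for_increase):
    """Mean seconds a connection request spent blocked, or None with no waits."""
    if waited_for_increase == 0:
        return None  # assumption: no waiting requests means nothing to evaluate
    return blocked_seconds_increase / waited_for_increase


def severity(mean):
    """Map a mean blocked time to the alert level from this reference."""
    if mean is None:
        return "ok"
    if mean >= 0.5:
        return "critical"
    if mean >= 0.1:
        return "warning"
    return "ok"


# 30 blocked seconds across 100 waiting requests is 0.3s each: a warning.
print(severity(mean_blocked_seconds(30, 100)))
```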

worker: cpu_usage_percentage

CPU usage

Descriptions

warning worker: 95%+ CPU usage for 10m0s

Next steps

Consider increasing CPU limits or scaling out.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_cpu_usage_percentage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^worker.*"}) >= 95)

worker: memory_rss

memory (RSS)

Descriptions

warning worker: 90%+ memory (RSS) for 10m0s

Next steps

Consider increasing memory limits or scaling out.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_memory_rss"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (name) (container_memory_rss{name=~"^worker.*"} / container_spec_memory_limit_bytes{name=~"^worker.*"}) * 100) >= 90)

worker: container_cpu_usage

container cpu usage total (1m average) across all cores by instance

Descriptions

warning worker: 99%+ container cpu usage total (1m average) across all cores by instance

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the worker container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_container_cpu_usage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^worker.*"}) >= 99)

worker: container_memory_usage

container memory usage by instance

Descriptions

warning worker: 99%+ container memory usage by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the worker container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_container_memory_usage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^worker.*"}) >= 99)

worker: provisioning_container_cpu_usage_long_term

container cpu usage total (90th percentile over 1d) across all cores by instance

Descriptions

warning worker: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the worker service.
Docker Compose: Consider increasing cpus: of the worker container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_provisioning_container_cpu_usage_long_term"
]

_{Managed by the Sourcegraph Code Plane team.}

Technical details

Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^worker.*"}[1d])) >= 80)

worker: provisioning_container_memory_usage_long_term

container memory usage (1d maximum) by instance

Descriptions

warning worker: 80%+ container memory usage (1d maximum) by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing memory limits in the Deployment.yaml for the worker service.
Docker Compose: Consider increasing memory: of the worker container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_provisioning_container_memory_usage_long_term"
]

_{Managed by the Sourcegraph Code Plane team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^worker.*"}[1d])) >= 80)

worker: provisioning_container_cpu_usage_short_term

container cpu usage total (5m maximum) across all cores by instance

Descriptions

warning worker: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the worker container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_provisioning_container_cpu_usage_short_term"
]

_{Managed by the Sourcegraph Code Plane team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^worker.*"}[5m])) >= 90)

worker: provisioning_container_memory_usage_short_term

container memory usage (5m maximum) by instance

Descriptions

warning worker: 90%+ container memory usage (5m maximum) by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the worker container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_provisioning_container_memory_usage_short_term"
]

_{Managed by the Sourcegraph Code Plane team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^worker.*"}[5m])) >= 90)

worker: container_oomkill_events_total

container OOMKILL events total by instance

Descriptions

warning worker: 1+ container OOMKILL events total by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the worker container in docker-compose.yml.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_container_oomkill_events_total"
]

_{Managed by the Sourcegraph Code Plane team.}

Technical details

Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^worker.*"})) >= 1)

worker: go_goroutines

maximum active goroutines

Descriptions

warning worker: 10000+ maximum active goroutines for 10m0s

Next steps

More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_go_goroutines"
]

_{Managed by the Sourcegraph Code Plane team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*worker"})) >= 10000)

worker: go_gc_duration_seconds

maximum go garbage collection duration

Descriptions

warning worker: 2s+ maximum go garbage collection duration

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_worker_go_gc_duration_seconds"
]

_{Managed by the Sourcegraph Code Plane team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*worker"})) >= 2)

worker: pods_available_percentage

percentage pods available

Descriptions

critical worker: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod worker (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p worker.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_worker_pods_available_percentage"
]

_{Managed by the Sourcegraph Code Plane team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*worker"}) / count by (app) (up{app=~".*worker"}) * 100) <= 50)
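The query above divides the number of targets reporting `up == 1` by the number of targets scraped, per app. The sketch below shows that arithmetic with hypothetical function names and made-up `up` samples. Note that the query compares with `<= 50`, so exactly 50% availability also satisfies the condition.

```python
from typing import Optional, Sequence


def pods_available_percentage(up_values: Sequence[int]) -> Optional[float]:
    """Percentage of scraped targets whose `up` value is 1."""
    if not up_values:
        return None  # no targets scraped: the query would return no data
    return sum(up_values) / len(up_values) * 100


def is_critical(up_values: Sequence[int], threshold: float = 50.0) -> bool:
    """Mirror the `<= 50` comparison in the generated query."""
    percentage = pods_available_percentage(up_values)
    return percentage is not None and percentage <= threshold


if __name__ == "__main__":
    print(pods_available_percentage([1, 1, 0, 0]))
    print(is_critical([1, 1, 0, 0]))
```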

worker: worker_site_configuration_duration_since_last_successful_update_by_instance

maximum duration since last successful site configuration update (all "worker" instances)

Descriptions

critical worker: 300s+ maximum duration since last successful site configuration update (all "worker" instances)

Next steps

This indicates that one or more "worker" instances have not successfully updated the site configuration in over 5 minutes. This could be due to networking issues between services or problems with the site configuration service itself.
Check for relevant errors in the "worker" logs, as well as frontend's logs.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_worker_worker_site_configuration_duration_since_last_successful_update_by_instance"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for critical alert: max((max(max_over_time(src_conf_client_time_since_last_successful_update_seconds{job=~"^worker.*"}[1m]))) >= 300)

searcher: replica_traffic

requests per second per replica over 10m

Descriptions

warning searcher: 5+ requests per second per replica over 10m

Next steps

More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_replica_traffic"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((sum by (instance) (rate(searcher_service_request_total[10m]))) >= 5)

searcher: unindexed_search_request_errors

unindexed search request errors every 5m by code

Descriptions

warning searcher: 5%+ unindexed search request errors every 5m by code for 5m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_unindexed_search_request_errors"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((sum by (code) (increase(searcher_service_request_total{code!="200",code!="canceled"}[5m])) / ignoring (code) group_left () sum(increase(searcher_service_request_total[5m])) * 100) >= 5)
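The query above computes one error percentage per response code. The numerator excludes `200` and `canceled`, while the denominator counts every request. The sketch below reproduces that calculation on hypothetical request counts. The function names are illustrative only.

```python
from typing import Dict

# Codes that the generated query excludes from the numerator.
NON_ERROR_CODES = ("200", "canceled")


def error_percentages(increase_by_code: Dict[str, float]) -> Dict[str, float]:
    """Per-code error percentage relative to all requests in the window."""
    total = sum(increase_by_code.values())
    if total == 0:
        return {}
    return {
        code: count * 100 / total
        for code, count in increase_by_code.items()
        if code not in NON_ERROR_CODES
    }


def should_warn(increase_by_code: Dict[str, float], threshold: float = 5.0) -> bool:
    """True if any single error code reaches the threshold, as with max(...) >= 5."""
    return any(p >= threshold for p in error_percentages(increase_by_code).values())


if __name__ == "__main__":
    print(error_percentages({"200": 900, "canceled": 40, "500": 60}))
```

Because the percentage is computed per code, two codes at 3% each would not meet the 5% threshold in this sketch, even though together they account for 6% of requests.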

searcher: searcher_site_configuration_duration_since_last_successful_update_by_instance

maximum duration since last successful site configuration update (all "searcher" instances)

Descriptions

critical searcher: 300s+ maximum duration since last successful site configuration update (all "searcher" instances)

Next steps

This indicates that one or more "searcher" instances have not successfully updated the site configuration in over 5 minutes. This could be due to networking issues between services or problems with the site configuration service itself.
Check for relevant errors in the "searcher" logs, as well as frontend's logs.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_searcher_searcher_site_configuration_duration_since_last_successful_update_by_instance"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for critical alert: max((max(max_over_time(src_conf_client_time_since_last_successful_update_seconds{job=~".*searcher"}[1m]))) >= 300)

searcher: goroutine_error_rate

error rate for periodic goroutine executions

Descriptions

warning searcher: 0.01reqps+ error rate for periodic goroutine executions for 15m0s

Next steps

Check service logs for error details related to the failing periodic routine.
Check if the routine depends on external services that may be unavailable.
Look for recent changes to the routine's code or configuration.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_goroutine_error_rate"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (name, job_name) (rate(src_periodic_goroutine_errors_total{job=~".*searcher.*"}[5m]))) >= 0.01)

searcher: goroutine_error_percentage

percentage of periodic goroutine executions that result in errors

Descriptions

warning searcher: 5%+ percentage of periodic goroutine executions that result in errors

Next steps

Check service logs for error details related to the failing periodic routine.
Check if the routine depends on external services that may be unavailable.
Consider temporarily disabling the routine if it's non-critical and causing cascading issues.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_goroutine_error_percentage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (name, job_name) (rate(src_periodic_goroutine_errors_total{job=~".*searcher.*"}[5m])) / sum by (name, job_name) (rate(src_periodic_goroutine_total{job=~".*searcher.*"}[5m]) > 0) * 100) >= 5)
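The `> 0` filter in the denominator of the query above drops routines that did not execute in the window, so idle routines produce no result instead of a division by zero. The sketch below shows that guard. The function names and rates are hypothetical.

```python
from typing import Optional


def goroutine_error_percentage(error_rate: float, total_rate: float) -> Optional[float]:
    """Errors as a percentage of executions, or None if the routine did not run."""
    if total_rate <= 0:
        return None  # mirrors the `> 0` filter on the denominator
    return error_rate / total_rate * 100


def should_warn(error_rate: float, total_rate: float, threshold: float = 5.0) -> bool:
    """Mirror the `>= 5` comparison in the generated query."""
    percentage = goroutine_error_percentage(error_rate, total_rate)
    return percentage is not None and percentage >= threshold


if __name__ == "__main__":
    print(goroutine_error_percentage(1.0, 8.0))
```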

searcher: mean_blocked_seconds_per_conn_request

mean blocked seconds per conn request

Descriptions

warning searcher: 0.1s+ mean blocked seconds per conn request for 10m0s
critical searcher: 0.5s+ mean blocked seconds per conn request for 10m0s

Next steps

Increase SRC_PGSQL_MAX_OPEN and, if needed, give the database more memory.
Scale up Postgres memory/CPUs (see our scaling guide).
If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_mean_blocked_seconds_per_conn_request",
  "critical_searcher_mean_blocked_seconds_per_conn_request"
]

_{Managed by the Sourcegraph Code Understanding team.}
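The silence identifiers in this reference appear to follow the pattern `<level>_<service>_<alert name>`. The sketch below builds the `observability.silenceAlerts` fragment from those three parts. The pattern is inferred from the examples in this document rather than from a documented API, and the helper names are hypothetical.

```python
import json
from typing import Iterable, List


def silence_ids(service: str, alert: str, levels: Iterable[str]) -> List[str]:
    """Build one silence identifier per alert level (inferred naming pattern)."""
    return [f"{level}_{service}_{alert}" for level in levels]


def silence_config(ids: List[str]) -> str:
    """Render the site configuration fragment as JSON."""
    return json.dumps({"observability.silenceAlerts": ids}, indent=2)


if __name__ == "__main__":
    ids = silence_ids(
        "searcher", "mean_blocked_seconds_per_conn_request", ["warning", "critical"]
    )
    print(silence_config(ids))
```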

Technical details

Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="searcher"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="searcher"}[5m]))) >= 0.1)

Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="searcher"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="searcher"}[5m]))) >= 0.5)
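Both queries above compute the same ratio, seconds spent blocked divided by the number of connection requests that had to wait, and differ only in the threshold. The sketch below shows the ratio and the two thresholds on made-up numbers. Returning None when nothing waited is a simplification of how the query behaves, and the function names are hypothetical.

```python
from typing import Optional


def mean_blocked_seconds(blocked_seconds: float, waited_for: float) -> Optional[float]:
    """Average seconds blocked per connection request that had to wait."""
    if waited_for <= 0:
        return None  # simplification: treat "nothing waited" as no data
    return blocked_seconds / waited_for


def severity(mean: Optional[float]) -> Optional[str]:
    """Most severe level whose threshold is met (0.1s warning, 0.5s critical)."""
    if mean is None:
        return None
    if mean >= 0.5:
        return "critical"
    if mean >= 0.1:
        return "warning"
    return None


if __name__ == "__main__":
    print(severity(mean_blocked_seconds(30.0, 100.0)))
```

In the real alerts each condition must also hold for 10 minutes before firing, which this sketch does not model.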

searcher: cpu_usage_percentage

CPU usage

Descriptions

warning searcher: 95%+ CPU usage for 10m0s

Next steps

Consider increasing CPU limits or scaling out.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_cpu_usage_percentage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^searcher.*"}) >= 95)

searcher: memory_rss

memory (RSS)

Descriptions

warning searcher: 90%+ memory (RSS) for 10m0s

Next steps

Consider increasing memory limits or scaling out.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_memory_rss"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (name) (container_memory_rss{name=~"^searcher.*"} / container_spec_memory_limit_bytes{name=~"^searcher.*"}) * 100) >= 90)

searcher: container_cpu_usage

container cpu usage total (1m average) across all cores by instance

Descriptions

warning searcher: 99%+ container cpu usage total (1m average) across all cores by instance

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the searcher container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_container_cpu_usage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^searcher.*"}) >= 99)

searcher: container_memory_usage

container memory usage by instance

Descriptions

warning searcher: 99%+ container memory usage by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the searcher container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_container_memory_usage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^searcher.*"}) >= 99)

searcher: provisioning_container_cpu_usage_long_term

container cpu usage total (90th percentile over 1d) across all cores by instance

Descriptions

warning searcher: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the searcher service.
Docker Compose: Consider increasing cpus: of the searcher container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_provisioning_container_cpu_usage_long_term"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^searcher.*"}[1d])) >= 80)

searcher: provisioning_container_memory_usage_long_term

container memory usage (1d maximum) by instance

Descriptions

warning searcher: 80%+ container memory usage (1d maximum) by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing memory limits in the Deployment.yaml for the searcher service.
Docker Compose: Consider increasing memory: of the searcher container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_provisioning_container_memory_usage_long_term"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^searcher.*"}[1d])) >= 80)

searcher: provisioning_container_cpu_usage_short_term

container cpu usage total (5m maximum) across all cores by instance

Descriptions

warning searcher: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the searcher container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_provisioning_container_cpu_usage_short_term"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^searcher.*"}[5m])) >= 90)

searcher: provisioning_container_memory_usage_short_term

container memory usage (5m maximum) by instance

Descriptions

warning searcher: 90%+ container memory usage (5m maximum) by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the searcher container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_provisioning_container_memory_usage_short_term"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^searcher.*"}[5m])) >= 90)

searcher: container_oomkill_events_total

container OOMKILL events total by instance

Descriptions

warning searcher: 1+ container OOMKILL events total by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the searcher container in docker-compose.yml.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_container_oomkill_events_total"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^searcher.*"})) >= 1)

searcher: go_goroutines

maximum active goroutines

Descriptions

warning searcher: 10000+ maximum active goroutines for 10m0s

Next steps

More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_go_goroutines"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*searcher"})) >= 10000)

searcher: go_gc_duration_seconds

maximum go garbage collection duration

Descriptions

warning searcher: 2s+ maximum go garbage collection duration

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_searcher_go_gc_duration_seconds"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*searcher"})) >= 2)

searcher: pods_available_percentage

percentage pods available

Descriptions

critical searcher: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod searcher (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p searcher.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_searcher_pods_available_percentage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*searcher"}) / count by (app) (up{app=~".*searcher"}) * 100) <= 50)

syntect-server: cpu_usage_percentage

CPU usage

Descriptions

warning syntect-server: 95%+ CPU usage for 10m0s

Next steps

Consider increasing CPU limits or scaling out.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_syntect-server_cpu_usage_percentage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^syntect-server.*"}) >= 95)

syntect-server: memory_rss

memory (RSS)

Descriptions

warning syntect-server: 90%+ memory (RSS) for 10m0s

Next steps

Consider increasing memory limits or scaling out.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_syntect-server_memory_rss"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (name) (container_memory_rss{name=~"^syntect-server.*"} / container_spec_memory_limit_bytes{name=~"^syntect-server.*"}) * 100) >= 90)

syntect-server: container_cpu_usage

container cpu usage total (1m average) across all cores by instance

Descriptions

warning syntect-server: 99%+ container cpu usage total (1m average) across all cores by instance

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the syntect-server container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_syntect-server_container_cpu_usage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^syntect-server.*"}) >= 99)

syntect-server: container_memory_usage

container memory usage by instance

Descriptions

warning syntect-server: 99%+ container memory usage by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the syntect-server container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_syntect-server_container_memory_usage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^syntect-server.*"}) >= 99)

syntect-server: provisioning_container_cpu_usage_long_term

container cpu usage total (90th percentile over 1d) across all cores by instance

Descriptions

warning syntect-server: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the syntect-server service.
Docker Compose: Consider increasing cpus: of the syntect-server container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_syntect-server_provisioning_container_cpu_usage_long_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^syntect-server.*"}[1d])) >= 80)

syntect-server: provisioning_container_memory_usage_long_term

container memory usage (1d maximum) by instance

Descriptions

warning syntect-server: 80%+ container memory usage (1d maximum) by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing memory limits in the Deployment.yaml for the syntect-server service.
Docker Compose: Consider increasing memory: of the syntect-server container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_syntect-server_provisioning_container_memory_usage_long_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^syntect-server.*"}[1d])) >= 80)

syntect-server: provisioning_container_cpu_usage_short_term

container cpu usage total (5m maximum) across all cores by instance

Descriptions

warning syntect-server: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the syntect-server container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_syntect-server_provisioning_container_cpu_usage_short_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^syntect-server.*"}[5m])) >= 90)

syntect-server: provisioning_container_memory_usage_short_term

container memory usage (5m maximum) by instance

Descriptions

warning syntect-server: 90%+ container memory usage (5m maximum) by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the syntect-server container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_syntect-server_provisioning_container_memory_usage_short_term"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^syntect-server.*"}[5m])) >= 90)

syntect-server: container_oomkill_events_total

container OOMKILL events total by instance

Descriptions

warning syntect-server: 1+ container OOMKILL events total by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the syntect-server container in docker-compose.yml.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_syntect-server_container_oomkill_events_total"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^syntect-server.*"})) >= 1)

syntect-server: pods_available_percentage

percentage pods available

Descriptions

critical syntect-server: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod syntect-server (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p syntect-server.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_syntect-server_pods_available_percentage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*syntect-server"}) / count by (app) (up{app=~".*syntect-server"}) * 100) <= 50)
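
As a rough illustration of what the critical query computes, the sketch below reproduces its arithmetic in Python: the sum of the `up` samples divided by their count, times 100, compared against 50. The function names and the sample lists are hypothetical and only stand in for values Prometheus would return. Note that the query as written uses `<= 50`, so exactly 50% availability also matches, and that the sketch evaluates a single instant rather than the 15m window the alert waits for.

```python
from typing import Sequence


def pods_available_percentage(up_samples: Sequence[int]) -> float:
    """Mirror sum(up) / count(up) * 100 for one app's `up` samples (1 = up, 0 = down)."""
    if not up_samples:
        raise ValueError("no `up` samples: the query would return no data")
    return sum(up_samples) / len(up_samples) * 100


def critical_condition_met(up_samples: Sequence[int], threshold: float = 50.0) -> bool:
    """True when the percentage is at or below the threshold, as in `<= 50`."""
    return pods_available_percentage(up_samples) <= threshold


# Hypothetical samples: two pods with one down, then three pods with one down.
print(critical_condition_met([1, 0]))
print(critical_condition_met([1, 1, 0]))
```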

zoekt: average_resolve_revision_duration

average resolve revision duration over 5m

Descriptions

warning zoekt: 15s+ average resolve revision duration over 5m

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_average_resolve_revision_duration"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((sum(rate(resolve_revision_seconds_sum[5m])) / sum(rate(resolve_revision_seconds_count[5m]))) >= 15)

zoekt: get_index_options_error_increase

the number of repositories we failed to get indexing options over 5m

Descriptions

warning zoekt: 100+ the number of repositories we failed to get indexing options over 5m for 5m0s
critical zoekt: 100+ the number of repositories we failed to get indexing options over 5m for 35m0s

Next steps

View error rates on gitserver and frontend to identify the root cause.
Roll back the frontend/gitserver deployment if the errors are due to a bad code change.
View error logs for getIndexOptions via the net/trace debug interface. For example, click on an indexed-search-indexer- entry on https://sourcegraph.com/-/debug/, then click on Traces. Replace sourcegraph.com with your instance address.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_get_index_options_error_increase",
  "critical_zoekt_get_index_options_error_increase"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((sum(increase(get_index_options_error_total[5m]))) >= 100)

Generated query for critical alert: max((sum(increase(get_index_options_error_total[5m]))) >= 100)
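
The silence entries shown in this reference appear to follow a `<level>_<service>_<alert name>` pattern. A minimal sketch that builds the entries for this alert under that assumption follows; the helper `silence_entries` is hypothetical and not part of Sourcegraph.

```python
import json
from typing import List


def silence_entries(service: str, alert: str, levels: List[str]) -> List[str]:
    """Build entries of the form <level>_<service>_<alert>, one per alert level."""
    return [f"{level}_{service}_{alert}" for level in levels]


entries = silence_entries(
    "zoekt", "get_index_options_error_increase", ["warning", "critical"]
)

# Render the fragment to paste into the site configuration.
print(json.dumps({"observability.silenceAlerts": entries}, indent=2))
```

The printed object wraps the list in braces only so that it is valid JSON on its own; in the site configuration the key sits alongside the other settings.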

zoekt: cpu_usage_percentage

CPU usage

Descriptions

warning zoekt: 95%+ CPU usage for 10m0s

Next steps

Consider increasing CPU limits or scaling out.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_cpu_usage_percentage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^zoekt-indexserver.*"}) >= 95)

zoekt: memory_rss

memory (RSS)

Descriptions

warning zoekt: 90%+ memory (RSS) for 10m0s

Next steps

Consider increasing memory limits or scaling out.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_memory_rss"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (name) (container_memory_rss{name=~"^zoekt-indexserver.*"} / container_spec_memory_limit_bytes{name=~"^zoekt-indexserver.*"}) * 100) >= 90)

zoekt: cpu_usage_percentage

CPU usage

Descriptions

warning zoekt: 95%+ CPU usage for 10m0s

Next steps

Consider increasing CPU limits or scaling out.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_cpu_usage_percentage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^zoekt-webserver.*"}) >= 95)

zoekt: memory_rss

memory (RSS)

Descriptions

warning zoekt: 90%+ memory (RSS) for 10m0s

Next steps

Consider increasing memory limits or scaling out.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_memory_rss"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (name) (container_memory_rss{name=~"^zoekt-webserver.*"} / container_spec_memory_limit_bytes{name=~"^zoekt-webserver.*"}) * 100) >= 90)

zoekt: memory_map_areas_percentage_used

process memory map areas percentage used (per instance)

Descriptions

warning zoekt: 60%+ process memory map areas percentage used (per instance)
critical zoekt: 80%+ process memory map areas percentage used (per instance)

Next steps

If you are running out of memory map areas, you could resolve this by:
- Enabling shard merging for Zoekt: Set SRC_ENABLE_SHARD_MERGING="1" for zoekt-indexserver. Use this option if your corpus of repositories has a high percentage of small, rarely updated repositories. See documentation.
- Creating additional Zoekt replicas: This spreads the shards across more replicas, so each individual replica holds fewer shards. This, in turn, decreases the number of memory map areas that a single replica needs to create in order to load the shards into memory.
- Increasing the virtual memory subsystem's "max_map_count" parameter, which defines the upper limit of memory areas a process can use. The default value of max_map_count is usually 65536. We recommend setting this value to 2x the number of repos to be indexed per Zoekt instance. For example, to index 240k repositories with 3 Zoekt instances, set max_map_count to (240000 / 3) * 2 = 160000. The exact instructions for tuning this parameter can differ depending on your environment. See https://kernel.org/doc/Documentation/sysctl/vm.txt for more information.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_memory_map_areas_percentage_used",
  "critical_zoekt_memory_map_areas_percentage_used"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max(((proc_metrics_memory_map_current_count / proc_metrics_memory_map_max_limit) * 100) >= 60)

Generated query for critical alert: max(((proc_metrics_memory_map_current_count / proc_metrics_memory_map_max_limit) * 100) >= 80)
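
The max_map_count sizing rule from the next steps above, and the 60% / 80% thresholds in these queries, can be sketched as plain arithmetic. This is only an illustration: the function names are hypothetical, and rounding the per-instance repository count up is an assumption, since the guidance above does not say how to handle uneven splits.

```python
def recommended_max_map_count(total_repos: int, zoekt_instances: int) -> int:
    """Return 2x the number of repositories indexed per Zoekt instance."""
    if zoekt_instances <= 0:
        raise ValueError("zoekt_instances must be positive")
    # Assumption: round up when repositories do not divide evenly across instances.
    repos_per_instance = (total_repos + zoekt_instances - 1) // zoekt_instances
    return repos_per_instance * 2


def map_area_alert_level(current_count: int, max_limit: int) -> str:
    """Classify usage against the 60% warning and 80% critical thresholds."""
    used_percentage = current_count / max_limit * 100
    if used_percentage >= 80:
        return "critical"
    if used_percentage >= 60:
        return "warning"
    return "ok"


# The worked example from the text: 240k repositories across 3 Zoekt instances.
print(recommended_max_map_count(240_000, 3))
```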

zoekt: indexed_search_request_errors

indexed search request errors every 5m by code

Descriptions

warning zoekt: 5%+ indexed search request errors every 5m by code for 5m0s

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_indexed_search_request_errors"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

zoekt: go_goroutines

maximum active goroutines

Descriptions

warning zoekt: 10000+ maximum active goroutines for 10m0s

Next steps

More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_go_goroutines"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*indexed-search-indexer"})) >= 10000)

zoekt: go_gc_duration_seconds

maximum go garbage collection duration

Descriptions

warning zoekt: 2s+ maximum go garbage collection duration

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_go_gc_duration_seconds"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*indexed-search-indexer"})) >= 2)

zoekt: go_goroutines

maximum active goroutines

Descriptions

warning zoekt: 10000+ maximum active goroutines for 10m0s

Next steps

More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_go_goroutines"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*indexed-search"})) >= 10000)

zoekt: go_gc_duration_seconds

maximum go garbage collection duration

Descriptions

warning zoekt: 2s+ maximum go garbage collection duration

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_zoekt_go_gc_duration_seconds"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*indexed-search"})) >= 2)

zoekt: pods_available_percentage

percentage pods available

Descriptions

critical zoekt: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod indexed-search (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p indexed-search.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_zoekt_pods_available_percentage"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*indexed-search"}) / count by (app) (up{app=~".*indexed-search"}) * 100) <= 50)

prometheus: prometheus_rule_eval_duration

average prometheus rule group evaluation duration over 10m by rule group

Descriptions

warning prometheus: 30s+ average prometheus rule group evaluation duration over 10m by rule group

Next steps

Check the Container monitoring (not available on server) panels and try increasing resources for Prometheus if necessary.
If the rule group taking a long time to evaluate belongs to /sg_prometheus_addons, try reducing the complexity of any custom Prometheus rules provided.
If the rule group taking a long time to evaluate belongs to /sg_config_prometheus, please open an issue.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_prometheus_rule_eval_duration"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (rule_group) (avg_over_time(prometheus_rule_group_last_duration_seconds[10m]))) >= 30)

prometheus: prometheus_rule_eval_failures

failed prometheus rule evaluations over 5m by rule group

Descriptions

warning prometheus: 0+ failed prometheus rule evaluations over 5m by rule group

Next steps

Check Prometheus logs for messages related to rule group evaluation (generally with log field component="rule manager").
If the rule group failing to evaluate belongs to /sg_prometheus_addons, ensure any custom Prometheus configuration provided is valid.
If the rule group failing to evaluate belongs to /sg_config_prometheus, please open an issue.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_prometheus_rule_eval_failures"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (rule_group) (rate(prometheus_rule_evaluation_failures_total[5m]))) > 0)

prometheus: alertmanager_notification_latency

alertmanager notification latency over 1m by integration

Descriptions

warning prometheus: 1s+ alertmanager notification latency over 1m by integration

Next steps

Check the Container monitoring (not available on server) panels and try increasing resources for Prometheus if necessary.
Ensure that your observability.alerts configuration (in site configuration) is valid.
Check if the relevant alert integration service is experiencing downtime or issues.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_alertmanager_notification_latency"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (integration) (rate(alertmanager_notification_latency_seconds_sum[1m]))) >= 1)

prometheus: alertmanager_notification_failures

failed alertmanager notifications over 1m by integration

Descriptions

warning prometheus: 0+ failed alertmanager notifications over 1m by integration

Next steps

Ensure that your observability.alerts configuration (in site configuration) is valid.
Check if the relevant alert integration service is experiencing downtime or issues.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_alertmanager_notification_failures"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (integration) (rate(alertmanager_notifications_failed_total[1m]))) > 0)

prometheus: prometheus_config_status

prometheus configuration reload status

Descriptions

warning prometheus: less than 1 prometheus configuration reload status

Next steps

Check Prometheus logs for messages related to configuration loading.
Ensure any custom configuration you have provided Prometheus is valid.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_prometheus_config_status"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: min((prometheus_config_last_reload_successful) < 1)

prometheus: alertmanager_config_status

alertmanager configuration reload status

Descriptions

warning prometheus: less than 1 alertmanager configuration reload status

Next steps

Ensure that your observability.alerts configuration (in site configuration) is valid.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_alertmanager_config_status"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: min((alertmanager_config_last_reload_successful) < 1)

prometheus: prometheus_tsdb_op_failure

prometheus tsdb failures by operation over 1m by operation

Descriptions

warning prometheus: 0+ prometheus tsdb failures by operation over 1m by operation

Next steps

Check Prometheus logs for messages related to the failing operation.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_prometheus_tsdb_op_failure"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((increase(label_replace({__name__=~"prometheus_tsdb_(.*)_failed_total"}, "operation", "$1", "__name__", "(.+)s_failed_total")[5m:1m])) > 0)

prometheus: prometheus_target_sample_exceeded

prometheus scrapes that exceed the sample limit over 10m

Descriptions

warning prometheus: 0+ prometheus scrapes that exceed the sample limit over 10m

Next steps

Check Prometheus logs for messages related to target scrape failures.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_prometheus_target_sample_exceeded"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m])) > 0)

prometheus: prometheus_target_sample_duplicate

prometheus scrapes rejected due to duplicate timestamps over 10m

Descriptions

warning prometheus: 0+ prometheus scrapes rejected due to duplicate timestamps over 10m

Next steps

Check Prometheus logs for messages related to target scrape failures.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_prometheus_target_sample_duplicate"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[10m])) > 0)

prometheus: container_cpu_usage

container cpu usage total (1m average) across all cores by instance

Descriptions

warning prometheus: 99%+ container cpu usage total (1m average) across all cores by instance

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the prometheus container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_container_cpu_usage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^prometheus.*"}) >= 99)

prometheus: container_memory_usage

container memory usage by instance

Descriptions

warning prometheus: 99%+ container memory usage by instance

Next steps

Kubernetes: Consider increasing memory limit in relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of prometheus container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_container_memory_usage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^prometheus.*"}) >= 99)

prometheus: provisioning_container_cpu_usage_long_term

container cpu usage total (90th percentile over 1d) across all cores by instance

Descriptions

warning prometheus: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the prometheus service.
Docker Compose: Consider increasing cpus: of the prometheus container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_provisioning_container_cpu_usage_long_term"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^prometheus.*"}[1d])) >= 80)

prometheus: provisioning_container_memory_usage_long_term

container memory usage (1d maximum) by instance

Descriptions

warning prometheus: 80%+ container memory usage (1d maximum) by instance for 336h0m0s

Next steps

Kubernetes: Consider increasing memory limits in the Deployment.yaml for the prometheus service.
Docker Compose: Consider increasing memory: of the prometheus container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_provisioning_container_memory_usage_long_term"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^prometheus.*"}[1d])) >= 80)

prometheus: provisioning_container_cpu_usage_short_term

container cpu usage total (5m maximum) across all cores by instance

Descriptions

warning prometheus: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the prometheus container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_provisioning_container_cpu_usage_short_term"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^prometheus.*"}[5m])) >= 90)

prometheus: provisioning_container_memory_usage_short_term

container memory usage (5m maximum) by instance

Descriptions

warning prometheus: 90%+ container memory usage (5m maximum) by instance

Next steps

Kubernetes: Consider increasing memory limit in relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of prometheus container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_provisioning_container_memory_usage_short_term"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^prometheus.*"}[5m])) >= 90)

prometheus: container_oomkill_events_total

container OOMKILL events total by instance

Descriptions

warning prometheus: 1+ container OOMKILL events total by instance

Next steps

Kubernetes: Consider increasing memory limit in relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of prometheus container in docker-compose.yml.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_prometheus_container_oomkill_events_total"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^prometheus.*"})) >= 1)

prometheus: pods_available_percentage

percentage pods available

Descriptions

critical prometheus: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod prometheus (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p prometheus.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_prometheus_pods_available_percentage"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*prometheus"}) / count by (app) (up{app=~".*prometheus"}) * 100) <= 50)

executor: executor_handlers

executor active handlers

Descriptions

critical executor: 0 active executor handlers and > 0 queue size for 5m0s

Next steps

Check the state of any compute VMs; they may be taking longer than expected to boot.
Make sure the executors appear under Site Admin > Executors.
Check the Grafana dashboard section for APIClient. It should show frequent requests to Dequeue and Heartbeat, and those requests must not fail.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_executor_executor_handlers"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Custom query for critical alert: min(((sum(src_executor_processor_handlers{sg_job=~"^sourcegraph-executors.*"}) or vector(0)) == 0 and (sum by (queue) (src_executor_total{job=~"^sourcegraph-executors.*"})) > 0) <= 0)

executor: executor_processor_error_rate

executor operation error rate over 5m

Descriptions

warning executor: 100%+ executor operation error rate over 5m for 1h0m0s

Next steps

Determine the cause of failure from the auto-indexing job logs in the site-admin page.
This alert fires if all executor jobs have been failing for the past hour. The alert will continue for up to 5 hours until the error rate is no longer 100%, even if there are no running jobs in that time, as the problem is not known to be resolved until jobs start succeeding again.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_executor_executor_processor_error_rate"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Custom query for warning alert: max((last_over_time(sum(increase(src_executor_processor_errors_total{sg_job=~"^sourcegraph-executors.*"}[5m]))[5h:]) / (last_over_time(sum(increase(src_executor_processor_total{sg_job=~"^sourcegraph-executors.*"}[5m]))[5h:]) + last_over_time(sum(increase(src_executor_processor_errors_total{sg_job=~"^sourcegraph-executors.*"}[5m]))[5h:])) * 100) >= 100)
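
The ratio at the core of this query is errors / (total + errors) * 100, compared against 100. A minimal sketch of that arithmetic follows. The function names are hypothetical, the inputs stand in for the two increase() values, and the 5h last_over_time window and the 1h duration are not modelled.

```python
from typing import Optional


def executor_error_rate(errors: float, total: float) -> Optional[float]:
    """Return errors / (total + errors) * 100, or None when there is nothing to divide."""
    denominator = total + errors
    if denominator == 0:
        # PromQL would yield NaN here, which never satisfies `>= 100`.
        return None
    return errors / denominator * 100


def warning_condition_met(errors: float, total: float) -> bool:
    """True when the error rate reaches the 100% threshold used by the query."""
    rate = executor_error_rate(errors, total)
    return rate is not None and rate >= 100


# Hypothetical counts: every operation failed, then half of them failed.
print(warning_condition_met(errors=5, total=0))
print(warning_condition_met(errors=5, total=5))
```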

executor: go_goroutines

maximum active goroutines

Descriptions

warning executor: 10000+ maximum active goroutines for 10m0s

Next steps

More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_executor_go_goroutines"
]

_{Managed by the Sourcegraph Code Plane team.}

Technical details

Generated query for warning alert: max((max by (sg_instance) (go_goroutines{sg_job=~".*sourcegraph-executors"})) >= 10000)

executor: go_gc_duration_seconds

maximum go garbage collection duration

Descriptions

warning executor: 2s+ maximum go garbage collection duration

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_executor_go_gc_duration_seconds"
]

_{Managed by the Sourcegraph Code Plane team.}

Technical details

Generated query for warning alert: max((max by (sg_instance) (go_gc_duration_seconds{sg_job=~".*sourcegraph-executors"})) >= 2)

telemetry: telemetry_gateway_exporter_queue_growth

rate of growth of events export queue over 30m

Descriptions

warning telemetry: 1+ rate of growth of events export queue over 30m for 1h0m0s
critical telemetry: 1+ rate of growth of events export queue over 30m for 36h0m0s

Next steps

Check the "number of events exported per batch over 30m" dashboard panel to see if export throughput is at saturation.
Increase TELEMETRY_GATEWAY_EXPORTER_EXPORT_BATCH_SIZE to export more events per batch.
Reduce TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL to schedule more export jobs.
See worker logs in the worker.telemetrygateway-exporter log scope to check whether any export errors are occurring. If logs only indicate that exports failed, reach out to Sourcegraph with relevant log entries, as this may be an issue in Sourcegraph's Telemetry Gateway service.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_telemetry_telemetry_gateway_exporter_queue_growth",
  "critical_telemetry_telemetry_gateway_exporter_queue_growth"
]
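
The entries above appear to follow a `<level>_<service>_<alert name>` pattern. The sketch below builds such entries in Python; the pattern is inferred from the entries listed in this reference rather than from a documented API, and the helper name `silence_entries` is hypothetical.

```python
import json
from typing import Iterable, List


def silence_entries(service: str, alert: str, levels: Iterable[str]) -> List[str]:
    """Build silence identifiers, assuming the <level>_<service>_<alert> pattern."""
    return [f"{level}_{service}_{alert}" for level in levels]


# Reproduce the two entries shown above for the queue growth alert.
entries = silence_entries(
    "telemetry",
    "telemetry_gateway_exporter_queue_growth",
    ["warning", "critical"],
)

# Print a fragment that can be compared against the site configuration.
print(json.dumps({"observability.silenceAlerts": entries}, indent=2))
```

Compare the generated identifiers with the ones listed for the alert before adding them to the site configuration.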

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max(deriv(src_telemetrygatewayexporter_queue_size[30m]))) > 1)

Generated query for critical alert: max((max(deriv(src_telemetrygatewayexporter_queue_size[30m]))) > 1)
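
As a rough illustration of what the queries above measure: PromQL's deriv estimates the per-second slope of a gauge using simple linear regression over the samples in the window. The sketch below reproduces that idea in Python for made-up queue-size samples. The function name `per_second_slope` and the sample data are illustrative assumptions, not part of Sourcegraph.

```python
from typing import Sequence, Tuple


def per_second_slope(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of (timestamp_seconds, value) samples.

    Fits a straight line through the samples and returns its slope
    in units per second, which is the idea behind PromQL's deriv().
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a slope")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    return cov / var


# Hypothetical queue sizes sampled once a minute over 30 minutes,
# growing by 120 events per minute (2 events per second).
samples = [(60.0 * i, 1000.0 + 120.0 * i) for i in range(31)]
slope = per_second_slope(samples)
print(f"queue growth: {slope:.2f} events/s, alert condition met: {slope > 1}")
```

Both queries use the same threshold; per the descriptions above, the warning fires when the condition holds for 1h and the critical alert when it holds for 36h.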

telemetry: telemetrygatewayexporter_exporter_errors_total

events exporter operation errors every 30m

Descriptions

warning telemetry: 0+ events exporter operation errors every 30m

Next steps

Failures indicate that exporting of telemetry events from Sourcegraph is failing. This may affect the performance of the database as the backlog grows.
See worker logs in the worker.telemetrygateway-exporter log scope for more details. If logs only indicate that exports failed, reach out to Sourcegraph with relevant log entries, as this may be an issue in Sourcegraph's Telemetry Gateway service.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_telemetry_telemetrygatewayexporter_exporter_errors_total"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum(increase(src_telemetrygatewayexporter_exporter_errors_total{job=~"^worker.*"}[30m]))) > 0)

telemetry: telemetrygatewayexporter_queue_cleanup_errors_total

events export queue cleanup operation errors every 30m

Descriptions

warning telemetry: 0+ events export queue cleanup operation errors every 30m

Next steps

Failures indicate that pruning of already-exported telemetry events from the database is failing. This may affect the performance of the database as the export queue table grows.
See worker logs in the worker.telemetrygateway-exporter log scope for more details.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_telemetry_telemetrygatewayexporter_queue_cleanup_errors_total"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum(increase(src_telemetrygatewayexporter_queue_cleanup_errors_total{job=~"^worker.*"}[30m]))) > 0)

telemetry: telemetrygatewayexporter_queue_metrics_reporter_errors_total

events export backlog metrics reporting operation errors every 30m

Descriptions

warning telemetry: 0+ events export backlog metrics reporting operation errors every 30m

Next steps

Failures indicate that reporting of telemetry events metrics is failing. This may affect the reliability of telemetry events export metrics.
See worker logs in the worker.telemetrygateway-exporter log scope for more details.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_telemetry_telemetrygatewayexporter_queue_metrics_reporter_errors_total"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum(increase(src_telemetrygatewayexporter_queue_metrics_reporter_errors_total{job=~"^worker.*"}[30m]))) > 0)

telemetry: telemetry_v2_export_queue_write_failures

failed writes to events export queue over 5m

Descriptions

warning telemetry: 1%+ failed writes to events export queue over 5m
critical telemetry: 2.5%+ failed writes to events export queue over 5m for 5m0s

Next steps

Look for error logs related to inserting telemetry events.
Look for error attributes on telemetryevents.QueueForExport trace spans.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_telemetry_telemetry_v2_export_queue_write_failures",
  "critical_telemetry_telemetry_v2_export_queue_write_failures"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max(((sum(increase(src_telemetry_export_store_queued_events{failed="true"}[5m])) / sum(increase(src_telemetry_export_store_queued_events[5m]))) * 100) > 1)

Generated query for critical alert: max(((sum(increase(src_telemetry_export_store_queued_events{failed="true"}[5m])) / sum(increase(src_telemetry_export_store_queued_events[5m]))) * 100) > 2.5)
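
The two queries above compute the same percentage and differ only in threshold. A minimal Python sketch of that arithmetic follows, assuming you already have the failed and total write counts for the 5m window; the function names are hypothetical, and the sketch ignores the 5m "for" duration that the critical alert additionally requires.

```python
from typing import Optional

WARNING_THRESHOLD = 1.0   # percent, from the warning query above
CRITICAL_THRESHOLD = 2.5  # percent, from the critical query above


def failed_write_percentage(failed: float, total: float) -> Optional[float]:
    """Percentage of queue writes that failed, or None when there were no writes."""
    if total <= 0:
        return None
    return failed / total * 100


def severity(percentage: Optional[float]) -> str:
    """Return the highest alert level whose threshold the percentage exceeds."""
    if percentage is None:
        return "ok"
    if percentage > CRITICAL_THRESHOLD:
        return "critical"
    if percentage > WARNING_THRESHOLD:
        return "warning"
    return "ok"


# Hypothetical counts: 3 failed writes out of 200 in the window (1.5%).
print(severity(failed_write_percentage(3, 200)))
```

Both comparisons are strict, so a failure percentage of exactly 1% does not trigger the warning.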

telemetry: telemetry_v2_event_logs_write_failures

failed write V2 events to V1 'event_logs' over 5m

Descriptions

warning telemetry: 5%+ failed write V2 events to V1 'event_logs' over 5m

Next steps

Error details are only persisted in trace metadata as it is considered non-critical.
To diagnose, enable trace sampling across all requests and look for error attributes on telemetrystore.v1teewrite spans.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_telemetry_telemetry_v2_event_logs_write_failures"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max(((sum(increase(src_telemetry_teestore_v1_events{failed="true"}[5m])) / sum(increase(src_telemetry_teestore_v1_events[5m]))) * 100) > 5)

telemetry: telemetrygatewayexporter_usermetadata_exporter_errors_total

(off by default) user metadata exporter operation errors every 30m

Descriptions

warning telemetry: 0+ (off by default) user metadata exporter operation errors every 30m

Next steps

Failures indicate that exporting of telemetry events from Sourcegraph is failing. This may affect the performance of the database as the backlog grows.
See worker logs in the worker.telemetrygateway-exporter log scope for more details. If logs only indicate that exports failed, reach out to Sourcegraph with relevant log entries, as this may be an issue in Sourcegraph's Telemetry Gateway service.
This exporter is DISABLED by default.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_telemetry_telemetrygatewayexporter_usermetadata_exporter_errors_total"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum(increase(src_telemetrygatewayexporter_usermetadata_exporter_errors_total{job=~"^worker.*"}[30m]))) > 0)

otel-collector: otel_span_refused

spans refused per receiver

Descriptions

warning otel-collector: 1+ spans refused per receiver for 5m0s

Next steps

Check the logs of the collector and the configuration of the receiver.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_otel-collector_otel_span_refused"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((sum by (receiver) (rate(otelcol_receiver_refused_spans[1m]))) > 1)

otel-collector: otel_span_export_failures

span export failures by exporter

Descriptions

warning otel-collector: 1+ span export failures by exporter for 5m0s

Next steps

Check the configuration of the exporter and whether the service being exported to is up.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_otel-collector_otel_span_export_failures"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((sum by (exporter) (rate(otelcol_exporter_send_failed_spans[1m]))) > 1)

otel-collector: otelcol_exporter_enqueue_failed_spans

exporter enqueue failed spans

Descriptions

warning otel-collector: 0+ exporter enqueue failed spans for 5m0s

Next steps

Check the configuration of the exporter and whether the service being exported to is up. This may be caused by a queue full of unsettled elements, so you may need to decrease your sending rate or horizontally scale collectors.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_otel-collector_otelcol_exporter_enqueue_failed_spans"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans{job=~"^.*"}[1m]))) > 0)

otel-collector: otelcol_processor_dropped_spans

spans dropped per processor per minute

Descriptions

warning otel-collector: 0+ spans dropped per processor per minute for 5m0s

Next steps

Check the configuration of the processor.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_otel-collector_otelcol_processor_dropped_spans"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((sum by (processor) (rate(otelcol_processor_dropped_spans[1m]))) > 0)

otel-collector: container_cpu_usage

container cpu usage total (1m average) across all cores by instance

Descriptions

warning otel-collector: 99%+ container cpu usage total (1m average) across all cores by instance

Next steps

Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
Docker Compose: Consider increasing cpus: of the otel-collector container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_otel-collector_container_cpu_usage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^otel-collector.*"}) >= 99)

otel-collector: container_memory_usage

container memory usage by instance

Descriptions

warning otel-collector: 99%+ container memory usage by instance

Next steps

Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
Docker Compose: Consider increasing memory: of the otel-collector container in docker-compose.yml.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_otel-collector_container_memory_usage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^otel-collector.*"}) >= 99)

otel-collector: pods_available_percentage

percentage pods available

Descriptions

critical otel-collector: less than 50% percentage pods available for 15m0s

Next steps

Determine if the pod was OOM killed using kubectl describe pod otel-collector (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p otel-collector.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "critical_otel-collector_pods_available_percentage"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for critical alert: min((sum by (app) (up{app=~".*otel-collector"}) / count by (app) (up{app=~".*otel-collector"}) * 100) <= 50)
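
The query above divides the number of pods reporting up == 1 by the number of pods scraped. A small Python sketch of the same arithmetic, assuming a list of up values (1 for reachable, 0 for unreachable); the function name and example values are hypothetical.

```python
from typing import Sequence

CRITICAL_THRESHOLD = 50.0  # percent, from the critical query above


def pods_available_percentage(up_values: Sequence[int]) -> float:
    """Percentage of scraped pods whose `up` value is 1."""
    if not up_values:
        raise ValueError("no pods were scraped")
    return sum(up_values) / len(up_values) * 100


# Hypothetical deployment with two replicas, one of which is down.
percentage = pods_available_percentage([1, 0])
print(f"{percentage:.0f}% available, alert condition met: {percentage <= CRITICAL_THRESHOLD}")
```

Because the comparison is "less than or equal to", a two-replica deployment with one pod down meets the condition, and the alert fires if that state persists for 15m.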

deepsearch: deepsearch_questions_in_flight_growth

rate of growth of in-flight questions over 1h

Descriptions

warning deepsearch: 1+ rate of growth of in-flight questions over 1h for 30m0s
critical deepsearch: 1+ rate of growth of in-flight questions over 1h for 1h0m0s

Next steps

Check for questions that never finish processing.
Check for slow LLM responses or tool execution times.
Review deepsearch_question_processing_duration for processing time trends.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_deepsearch_deepsearch_questions_in_flight_growth",
  "critical_deepsearch_deepsearch_questions_in_flight_growth"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((max(deriv(src_deepsearch_questions_in_flight[1h]))) > 1)

Generated query for critical alert: max((max(deriv(src_deepsearch_questions_in_flight[1h]))) > 1)

deepsearch: deepsearch_question_processing_error_rate

question processing error rate over 5m

Descriptions

warning deepsearch: 10%+ question processing error rate over 5m for 10m0s
critical deepsearch: 20%+ question processing error rate over 5m for 10m0s

Next steps

Check frontend logs for Worker failed to process question errors.
Review LLM stream errors in the LLM streaming panel below.
Check for upstream LLM provider issues.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_deepsearch_deepsearch_question_processing_error_rate",
  "critical_deepsearch_deepsearch_question_processing_error_rate"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((sum(rate(src_deepsearch_question_processing_errors_total{operation="question"}[5m])) / (sum(rate(src_deepsearch_question_processing_total{operation="question"}[5m])) > 0) * 100) > 10)

Generated query for critical alert: max((sum(rate(src_deepsearch_question_processing_errors_total{operation="question"}[5m])) / (sum(rate(src_deepsearch_question_processing_total{operation="question"}[5m])) > 0) * 100) > 20)

deepsearch: deepsearch_llm_stream_fatal_errors

fatal LLM stream errors over 5m

Descriptions

warning deepsearch: 20+ fatal LLM stream errors over 5m

Next steps

Check frontend logs for fatal error in LLM stream.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_deepsearch_deepsearch_llm_stream_fatal_errors"
]

_{Managed by the Sourcegraph Code Understanding team.}

Technical details

Generated query for warning alert: max((sum(increase(src_deepsearch_question_processing_errors_total{operation="llm_stream_fatal"}[5m]))) > 20)

externalapi: externalapi_error_rate

error rate over 5m

Descriptions

warning externalapi: 10%+ error rate over 5m for 10m0s
critical externalapi: 25%+ error rate over 5m for 10m0s

Next steps

Check frontend logs for external API errors.
Review individual RPC method error rates below.
More help interpreting this metric is available in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_externalapi_externalapi_error_rate",
  "critical_externalapi_externalapi_error_rate"
]

_{Managed by the Sourcegraph Platform team.}

Technical details

Generated query for warning alert: max((sum(rate(rpc_server_duration_milliseconds_count{rpc_connect_error_code!=""}[5m])) / (sum(rate(rpc_server_duration_milliseconds_count[5m])) > 0) * 100) > 10)

Generated query for critical alert: max((sum(rate(rpc_server_duration_milliseconds_count{rpc_connect_error_code!=""}[5m])) / (sum(rate(rpc_server_duration_milliseconds_count[5m])) > 0) * 100) > 25)

background-jobs: error_percentage_by_method

percentage of operations resulting in error by method

Descriptions

warning background-jobs: 5%+ percentage of operations resulting in error by method
critical background-jobs: 50%+ percentage of operations resulting in error by method

Next steps

Review logs for the specific operation to identify patterns in errors. Check database connectivity and schema. If a particular method is consistently failing, investigate potential issues with that operation's SQL query or transaction handling.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_background-jobs_error_percentage_by_method",
  "critical_background-jobs_error_percentage_by_method"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max(((sum by (op) (rate(src_workerutil_dbworker_store_errors_total[5m])) / (sum by (op) (rate(src_workerutil_dbworker_store_errors_total[5m])) + sum by (op) (rate(src_workerutil_dbworker_store_total[5m])))) * 100) >= 5)

Generated query for critical alert: max(((sum by (op) (rate(src_workerutil_dbworker_store_errors_total[5m])) / (sum by (op) (rate(src_workerutil_dbworker_store_errors_total[5m])) + sum by (op) (rate(src_workerutil_dbworker_store_total[5m])))) * 100) >= 50)
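
Note that the queries above divide the error rate by the sum of the error rate and the operation rate, per op. The Python sketch below mirrors that formula as written; the operation names and rates are hypothetical, and the helper names are not part of Sourcegraph.

```python
from typing import Dict, Optional, Tuple

WARNING_THRESHOLD = 5.0    # percent, from the warning query above
CRITICAL_THRESHOLD = 50.0  # percent, from the critical query above


def error_percentage(errors: float, total: float) -> Optional[float]:
    """errors / (errors + total) * 100, or None when there was no activity."""
    denominator = errors + total
    if denominator <= 0:
        return None
    return errors / denominator * 100


def severity_by_op(rates: Dict[str, Tuple[float, float]]) -> Dict[str, str]:
    """Map each op to 'ok', 'warning' or 'critical' from its (errors, total) rates."""
    result = {}
    for op, (errors, total) in rates.items():
        pct = error_percentage(errors, total)
        if pct is None or pct < WARNING_THRESHOLD:
            result[op] = "ok"
        elif pct >= CRITICAL_THRESHOLD:
            result[op] = "critical"
        else:
            result[op] = "warning"
    return result


# Hypothetical per-op (error rate, operation rate) pairs over 5m.
rates = {"dequeue": (10.0, 90.0), "heartbeat": (60.0, 40.0), "requeue": (0.0, 25.0)}
print(severity_by_op(rates))
```

The outer max in the query means the alert fires as soon as any single op crosses the threshold.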

background-jobs: error_percentage_by_domain

percentage of operations resulting in error by domain

Descriptions

warning background-jobs: 5%+ percentage of operations resulting in error by domain
critical background-jobs: 50%+ percentage of operations resulting in error by domain

Next steps

Review logs for the specific domain to identify patterns in errors. Check database connectivity and schema.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_background-jobs_error_percentage_by_domain",
  "critical_background-jobs_error_percentage_by_domain"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max(((sum by (domain) (rate(src_workerutil_dbworker_store_errors_total[5m])) / (sum by (domain) (rate(src_workerutil_dbworker_store_errors_total[5m])) + sum by (domain) (rate(src_workerutil_dbworker_store_total[5m])))) * 100) >= 5)

Generated query for critical alert: max(((sum by (domain) (rate(src_workerutil_dbworker_store_errors_total[5m])) / (sum by (domain) (rate(src_workerutil_dbworker_store_errors_total[5m])) + sum by (domain) (rate(src_workerutil_dbworker_store_total[5m])))) * 100) >= 50)

background-jobs: resetter_duration

time spent running the resetter

Descriptions

warning background-jobs: 10s+ time spent running the resetter

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_background-jobs_resetter_duration"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((histogram_quantile(0.95, sum by (le, domain) (rate(src_dbworker_resetter_duration_seconds_bucket[5m])))) >= 10)
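
For intuition about the query above, here is a simplified Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the approach histogram_quantile takes for classic histograms. The bucket boundaries and counts are made up, and the sketch omits edge cases that Prometheus handles.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical resetter durations: 100 observations across five buckets (seconds).
buckets = [(1.0, 50.0), (5.0, 80.0), (10.0, 90.0), (30.0, 100.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, buckets)
print(f"p95 = {p95:.1f}s, alert condition met: {p95 >= 10}")
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the bucket boundaries are spaced around the threshold.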

background-jobs: resetter_failures

number of times the resetter failed to run

Descriptions

warning background-jobs: 1reqps+ number of times the resetter failed to run

Next steps

Check the application logs of the failing domain for errors. High failure rates indicate a bug in the code handling the job, or a pod frequently dying.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_background-jobs_resetter_failures"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (domain) (increase(src_dbworker_resetter_errors_total[5m]))) >= 1)

background-jobs: failed_records

number of stalled records marked as 'failed'

Descriptions

warning background-jobs: 50+ number of stalled records marked as 'failed'

Next steps

Check the application logs of the failing domain for errors. High failure rates indicate a bug in the code handling the job, or a pod frequently dying.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_background-jobs_failed_records"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum by (domain) (increase(src_dbworker_resetter_record_reset_failures_total[5m]))) >= 50)

background-jobs: stall_duration_p90

90th percentile of stall duration

Descriptions

warning background-jobs: 300s+ 90th percentile of stall duration

Next steps

Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_background-jobs_stall_duration_p90"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((histogram_quantile(0.9, sum by (le, domain) (rate(src_dbworker_resetter_stall_duration_seconds_bucket[5m])))) >= 300)

background-jobs: aggregate_queue_size

total number of jobs queued across all domains

Descriptions

warning background-jobs: 1e+06+ total number of jobs queued across all domains

Next steps

Check for stuck workers or investigate the specific domains with high queue depth. Check worker logs for errors and database for high load.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_background-jobs_aggregate_queue_size"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((sum(max by (domain) (src_workerutil_queue_depth))) >= 1000000)
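
The query above takes the largest queue depth reported for each domain and then sums those values across domains. A Python sketch of that aggregation follows; the domain names, instance names, and depths are hypothetical, and the reading that the per-domain max guards against counting one queue several times is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

ALERT_THRESHOLD = 1_000_000  # from the warning query above


def aggregate_queue_size(samples: Iterable[Tuple[str, str, float]]) -> float:
    """Sum, across domains, of the largest queue depth reported per domain.

    Each sample is (domain, instance, queue_depth).
    """
    max_by_domain: Dict[str, float] = defaultdict(float)
    for domain, _instance, depth in samples:
        max_by_domain[domain] = max(max_by_domain[domain], depth)
    return sum(max_by_domain.values())


# Hypothetical samples: two instances report domain-a, one reports domain-b.
samples = [
    ("domain-a", "worker-0", 600_000.0),
    ("domain-a", "worker-1", 590_000.0),
    ("domain-b", "worker-0", 450_000.0),
]
total = aggregate_queue_size(samples)
print(f"total queued: {total:.0f}, alert condition met: {total >= ALERT_THRESHOLD}")
```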

background-jobs: max_queue_duration

maximum time a job has been in queue across all domains

Descriptions

warning background-jobs: 86400s+ maximum time a job has been in queue across all domains

Next steps

Investigate which domain has jobs stuck in queue. If the queue is growing, consider scaling up worker instances.
Learn more about the related dashboard panel in the dashboards reference.
Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

JSON
"observability.silenceAlerts": [
  "warning_background-jobs_max_queue_duration"
]

_{Managed by the Sourcegraph Services team.}

Technical details

Generated query for warning alert: max((max(src_workerutil_queue_duration_seconds)) >= 86400)