docs(alerting): add settings for alert evaluation backoff retries (#111891)

* docs(alerting): add settings for alert evaluation backoff retries

* docs(alerting): add mention of backoff settings on the Error state docs

* fix vale prose
Pepe Cano 2025-10-03 11:08:14 +02:00 committed by GitHub
parent 3ce9137c19
commit 6248971d1e
GPG Key ID: B5690EEEBB952194
2 changed files with 37 additions and 4 deletions


@@ -106,7 +106,7 @@ A Grafana-managed alert instance can be in any of the following states, dependin
The **Error** state is triggered when the alert rule fails to evaluate its query or queries successfully.
This can occur due to evaluation timeouts (default: `30s`) or three repeated failures when querying the data source. The [`evaluation_timeout`](ref:evaluation_timeout) and [`max_attempts`](ref:max_attempts) options control these settings.
This can occur due to evaluation timeouts (default: `30s`) or repeated failures (default: `3`) when querying the data source. The [`evaluation_timeout`](ref:evaluation_timeout) and [`max_attempts`](ref:max_attempts) options control these settings.
When an alert instance enters the **Error** state, Grafana, by default, triggers a new [`DatasourceError` alert](#no-data-and-error-alerts). You can control this behavior based on the desired outcome of your alert rule in [Modify the `No Data` or `Error` state](#modify-the-no-data-or-error-state).
@@ -157,10 +157,10 @@ To minimize the number of **No Data** or **Error** state alerts received, try th
To minimize timeouts resulting in the **Error** state, reduce the time range to request less data every evaluation cycle.
1. Change the default [evaluation time out](ref:evaluation_timeout). The default is set at 30 seconds. To increase the default evaluation timeout, open a support ticket from the [Cloud Portal](https://grafana.com/docs/grafana-cloud/account-management/support/#grafana-cloud-support-options). Note that this should be a last resort, because it may affect the performance of all alert rules and cause missed evaluations if the timeout is too long.
1. To reduce multiple notifications from **Error** alerts, define a [notification policy](ref:notification-policies) to handle all related alerts with `alertname=DatasourceError`, and filter and group errors from the same data source using the `datasource_uid` label.
1. Change the [evaluation timeout](ref:evaluation_timeout) (default: `30s`) or the [retry mechanism (`max_attempts`)](ref:max_attempts) settings. This should be a last resort, as it can affect the performance of all alert rules and may cause missed evaluations if the timeout is too long. For Grafana Cloud, open a support ticket from the [Cloud Portal](https://grafana.com/docs/grafana-cloud/account-management/support/#grafana-cloud-support-options).
{{< admonition type="tip" >}}
For common examples and practical guidance on handling **Error**, **No Data**, and **stale** alert scenarios, refer to the [Handle connectivity errors](ref:guide-connectivity-errors) and [Handle missing data](ref:guide-missing-data) guides.
{{< /admonition >}}


@@ -1915,7 +1915,40 @@ The timeout string is a possibly signed sequence of decimal numbers, followed by
#### `max_attempts`
Sets a maximum number of times Grafana attempts to evaluate an alert rule before giving up on that evaluation. The default value is `3`.
The maximum number of times Grafana retries evaluating an alert rule before giving up on that evaluation. Default is `3`.
The retry mechanism:
- Adds jitter to retry delays to prevent thundering herd problems when multiple rules fail simultaneously.
- Stops when either `max_attempts` is reached or the rule's evaluation interval is exceeded.
You can customize retry behavior with `initial_retry_delay`, `max_retry_delay`, and `randomization_factor`.
#### `initial_retry_delay`
The initial delay before retrying a failed alert evaluation. Default is `1s`.
This value is the starting point for exponential backoff.
#### `max_retry_delay`
The maximum delay between retries during exponential backoff. Default is `4s`.
After the retry delay reaches `max_retry_delay`, all subsequent retries use this delay.
To avoid overlapping retries with scheduled evaluations, `max_retry_delay` must be less than the rule's evaluation interval.
#### `randomization_factor`
The randomization factor for exponential backoff retries. Default is `0.1`.
The value must be between `0` and `1`.
The actual retry delay is chosen randomly between:
```
[current_delay*(1-randomization_factor), current_delay*(1+randomization_factor)]
```
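The following Python sketch illustrates how the three backoff settings could combine, using the documented defaults (`1s`, `4s`, `0.1`). It is not Grafana's implementation: the function names are hypothetical, the growth `multiplier` of `2` is an assumption (the settings above only say the backoff is exponential), and applying jitter after capping the delay is also an assumption.

```python
import random


def retry_delay_bounds(current_delay, randomization_factor):
    """Return the (low, high) interval the jittered delay is drawn from."""
    if not 0 <= randomization_factor <= 1:
        raise ValueError("randomization_factor must be between 0 and 1")
    low = current_delay * (1 - randomization_factor)
    high = current_delay * (1 + randomization_factor)
    return low, high


def backoff_delays(retries, initial_retry_delay=1.0, max_retry_delay=4.0,
                   randomization_factor=0.1, multiplier=2.0, rng=random):
    """Return a list of jittered retry delays, in seconds.

    `multiplier` is an assumed growth factor; it is not a documented setting.
    """
    delays = []
    current = initial_retry_delay
    for _ in range(retries):
        low, high = retry_delay_bounds(current, randomization_factor)
        delays.append(rng.uniform(low, high))
        # Grow the base delay, but never beyond max_retry_delay.
        current = min(current * multiplier, max_retry_delay)
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(4), start=1):
        print(f"retry {attempt}: wait {delay:.2f}s")
```

Under these assumptions, the base delay grows 1s, 2s, 4s and then stays at 4s, and each actual wait falls within 10% of the base delay.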
#### `min_interval`