docs(alerting): add settings for alert evaluation backoff retries (#111891)

* docs(alerting): add settings for alert evaluation backoff retries
* docs(alerting): add mention of backoff settings on the Error state docs
* fix vale prose
2025-10-03 11:08:14 +02:00 · 6248971d1e
parent 3ce9137c19
commit 6248971d1e
2 changed files with 37 additions and 4 deletions
--- a/docs/sources/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states.md
+++ b/docs/sources/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states.md
@@ -106,7 +106,7 @@ A Grafana-managed alert instance can be in any of the following states, dependin

 The **Error** state is triggered when the alert rule fails to evaluate its query or queries successfully.

-This can occur due to evaluation timeouts (default: `30s`) or three repeated failures when querying the data source. The [`evaluation_timeout`](ref:evaluation_timeout) and [`max_attempts`](ref:max_attempts) options control these settings.
+This can occur due to evaluation timeouts (default: `30s`) or repeated failures (default: `3`) when querying the data source. The [`evaluation_timeout`](ref:evaluation_timeout) and [`max_attempts`](ref:max_attempts) options control these settings.

 When an alert instance enters the **Error** state, Grafana, by default, triggers a new [`DatasourceError` alert](#no-data-and-error-alerts). You can control this behavior based on the desired outcome of your alert rule in [Modify the `No Data` or `Error` state](#modify-the-no-data-or-error-state).

@@ -157,10 +157,10 @@ To minimize the number of **No Data** or **Error** state alerts received, try th

   To minimize timeouts resulting in the **Error** state, reduce the time range to request less data every evaluation cycle.

-1. Change the default [evaluation time out](ref:evaluation_timeout). The default is set at 30 seconds. To increase the default evaluation timeout, open a support ticket from the [Cloud Portal](https://grafana.com/docs/grafana-cloud/account-management/support/#grafana-cloud-support-options). Note that this should be a last resort, because it may affect the performance of all alert rules and cause missed evaluations if the timeout is too long.
-
 1. To reduce multiple notifications from **Error** alerts, define a [notification policy](ref:notification-policies) to handle all related alerts with `alertname=DatasourceError`, and filter and group errors from the same data source using the `datasource_uid` label.

+1. Change the [evaluation timeout](ref:evaluation_timeout) (default: `30s`) or the [retry mechanism (`max_attempts`)](ref:max_attempts) settings. This should be a last resort, as it can affect the performance of all alert rules and may cause missed evaluations if the timeout is too long. For Grafana Cloud, open a support ticket from the [Cloud Portal](https://grafana.com/docs/grafana-cloud/account-management/support/#grafana-cloud-support-options).
+
   {{< admonition type="tip" >}}
   For common examples and practical guidance on handling **Error**, **No Data**, and **stale** alert scenarios, refer to the [Handle connectivity errors](ref:guide-connectivity-errors) and [Handle missing data](ref:guide-missing-data) guides.
   {{< /admonition  >}}
--- a/docs/sources/setup-grafana/configure-grafana/_index.md
+++ b/docs/sources/setup-grafana/configure-grafana/_index.md
@@ -1915,7 +1915,40 @@ The timeout string is a possibly signed sequence of decimal numbers, followed by

 #### `max_attempts`

-Sets a maximum number of times Grafana attempts to evaluate an alert rule before giving up on that evaluation. The default value is `3`.
+The maximum number of times Grafana retries evaluating an alert rule before giving up on that evaluation. Default is `3`.
+
+The retry mechanism:
+
+- Adds jitter to retry delays to prevent thundering herd problems when multiple rules fail simultaneously.
+- Stops when either `max_attempts` is reached or the rule’s evaluation interval is exceeded.
+
+You can customize retry behavior with `initial_retry_delay`, `max_retry_delay`, and `randomization_factor`.
+
+#### `initial_retry_delay`
+
+The initial delay before retrying a failed alert evaluation. Default is `1s`.
+
+This value is the starting point for exponential backoff.
+
+#### `max_retry_delay`
+
+The maximum delay between retries during exponential backoff. Default is `4s`.
+
+After the retry delay reaches `max_retry_delay`, all subsequent retries use this delay.
+
+To avoid overlapping retries with scheduled evaluations, `max_retry_delay` must be less than the rule’s evaluation interval.
+
+#### `randomization_factor`
+
+The randomization factor for exponential backoff retries. Default is `0.1`.
+
+The value must be between `0` and `1`.
+
+The actual retry delay is chosen randomly between:
+
+```
+[current_delay*(1-randomization_factor), current_delay*(1+randomization_factor)]
+```

 #### `min_interval`
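
The backoff behavior described in the new `initial_retry_delay`, `max_retry_delay`, and `randomization_factor` entries can be sketched as follows. This is an illustration only, not part of the diff and not Grafana's implementation. The function name `retry_delays`, the `retries` parameter, and the doubling `multiplier` are assumptions: the docs say the backoff is exponential but do not state the growth factor, and the sketch does not model the stop condition based on the rule's evaluation interval.

```python
import random


def retry_delays(retries, initial_retry_delay=1.0, max_retry_delay=4.0,
                 randomization_factor=0.1, multiplier=2.0):
    """Return the jittered delay (in seconds) before each retry.

    `retries` and `multiplier` are hypothetical parameters for this sketch.
    Defaults mirror the documented values: 1s, 4s, and 0.1.
    """
    if not 0 <= randomization_factor <= 1:
        raise ValueError("randomization_factor must be between 0 and 1")

    delays = []
    current_delay = initial_retry_delay
    for _ in range(retries):
        # Documented jitter range:
        # [current_delay*(1-randomization_factor), current_delay*(1+randomization_factor)]
        low = current_delay * (1 - randomization_factor)
        high = current_delay * (1 + randomization_factor)
        delays.append(random.uniform(low, high))
        # Grow the base delay (assumed doubling), capped at max_retry_delay.
        current_delay = min(current_delay * multiplier, max_retry_delay)
    return delays


if __name__ == "__main__":
    # With jitter disabled, the capped exponential sequence is easy to see.
    print(retry_delays(4, randomization_factor=0.0))
    # With the default factor, each delay lands within 10% of its base value.
    print(retry_delays(4))
```

With jitter disabled, the base delays grow from `initial_retry_delay` until they reach `max_retry_delay` and then stay there, which matches the statement that all subsequent retries use the maximum delay.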