Commit Graph

77 Commits

Author SHA1 Message Date
Seunghun Shin 512c292e04
Alerting: Add jitter support for periodic alert state storage to reduce database load spikes (#111357)
What is this feature?

This PR implements a jitter mechanism for periodic alert state storage to distribute database load over time instead of processing all alert instances simultaneously. When enabled via the state_periodic_save_jitter_enabled configuration option, the system spreads batch write operations across 85% of the save interval window, preventing database load spikes in high-cardinality alerting environments.

Why do we need this feature?

In production environments with high alert cardinality, the current periodic batch storage can cause database performance issues by processing all alert instances simultaneously at fixed intervals. Even when using periodic batch storage to improve performance, concentrating all database operations at a single point in time can overwhelm database resources, especially in resource-constrained environments.

Rather than performing all INSERT operations at once during the periodic save, distributing these operations across the time window until the next save cycle can maintain more stable service operation within limited database resources. This approach prevents resource saturation by spreading the database load over the available time interval, allowing the system to operate more gracefully within existing resource constraints.

For example, with 200,000 alert instances using a 5-minute interval and 4,000 batch size, instead of executing 50 batch operations simultaneously, the jitter mechanism distributes these operations across approximately 4.25 minutes (85% of 5 minutes), with each batch executed roughly every 5.2 seconds.

This PR provides system-level protection against such load spikes by distributing operations across time, reducing peak resource usage while maintaining the benefits of periodic batch storage. The jitter mechanism is particularly valuable in resource-constrained environments where maintaining consistent database performance is more critical than precise timing of state updates.
2025-09-29 11:22:36 +02:00
Alexander Akhmetov 100528e274
Alerting: Support retry with backoff in alert rule evaluation (#99710) 2025-09-04 13:56:03 +02:00
Alexander Akhmetov 6db07b901c
Alerting: Enable HA clustering in remote primary mode (#108930) 2025-07-31 09:55:08 +02:00
Vadim Stepanov bccc980b90
Alerting: Notifiication history (#107644)
* Add unified_alerting.notification_history to ini files

* Parse notification history settings

* Move Loki client to a separate package

* Loki client: add params for metrics and traces

* add NotificationHistorian

* rm writeDuration

* remove RangeQuery stuff

* wip

* wip

* wip

* wip

* pass notification historian in tests

* unify loki settings

* unify loki settings

* add test

* update grafana/alerting

* make update-workspace

* add feature toggle

* fix configureNotificationHistorian

* Revert "add feature toggle"

This reverts commit de7af8f7

* add feature toggle

* more tests

* RuleUID

* fix metrics test

* met.Info.Set(0)
2025-07-17 14:26:26 +01:00
Alexander Akhmetov a76057cedb
Alerting: Use GRAFANA_ALERTS as the default state history metric name (#107050) 2025-06-20 18:14:36 +02:00
Alexander Akhmetov ad683f83ff
Alerting: Add state history backend to write ALERTS metric (#104361)
**What is this feature?**

This PR implements a new Prometheus historian backend that allows Grafana alerting to write alert state history as Prometheus-compatible `ALERTS` metrics to remote Prometheus-compatible data sources.

The metric includes a few additional labels:

* `grafana_alertstate`: Grafana's full alert state, more granular than Prometheus.
* `grafana_rule_uid`: Grafana's alert rule UID.

Grafana states are included in the `grafana_alertstate` label also mapped to Prometheus-compatible `alertstate` values:

| Grafana alert state | `alertstate`          | `grafana_alertstate`  |
|---------------------|-----------------------|-----------------------|
| `Alerting`          | `firing`              | `alerting`            |
| `Recovering`        | `firing`              | `recovering`          |
| `Pending`           | `pending`             | `pending`             |
| `Error`             | `firing`              | `error`               |
| `NoData`            | `firing`              | `nodata`              |
| `Normal`            | _(no metric emitted)_ | _(no metric emitted)_ |
2025-06-18 07:17:57 +02:00
Vadim Stepanov 5137995830
Alerting: Add support for Redis Sentinel for Alerting HA (#106322)
* Alerting: Add support for Redis Sentinel

* docs

* docs

* Use minisentinel in test

* Apply suggestions from code review

Co-authored-by: Johnny Kartheiser <140559259+JohnnyK-Grafana@users.noreply.github.com>
Co-authored-by: Fayzal Ghantiwala <114010985+fayzal-g@users.noreply.github.com>

* "address(es)" -> "address or addresses"

* make update-workspace

* make lint-go-diff

---------

Co-authored-by: Johnny Kartheiser <140559259+JohnnyK-Grafana@users.noreply.github.com>
Co-authored-by: Fayzal Ghantiwala <114010985+fayzal-g@users.noreply.github.com>
2025-06-05 15:02:40 +01:00
Alexander Akhmetov e256f2d5e2
Alerting: Enable recording rules by default (#105603) 2025-06-02 10:56:05 +02:00
Santiago 6c3d89f390
Remote Alertmanager: Add timeouts to the HTTP client (#105279)
* Remote Alertmanage: Add timeouts to the HTTP client

* code review suggestions
2025-05-13 13:25:56 +02:00
Santiago 5a589bb51a
Alerting: Enable the remote Alertmanager feature using only feature toggles (#101410)
* Alerting: Enable the remote Alertmanager feature using only feature toggles

* Trigger build
2025-04-30 12:18:47 +02:00
Alexander Akhmetov 756b45402c
Alerting: Add rule_query_offset setting for Prometheus rule conversion (#102500)
Adds a new configuration option to specify a time offset for rule evaluation, which gets applied and saved during the Prometheus -> Grafana conversion. For example:

[unified_alerting.prometheus_conversion]
rule_query_offset = 1m

Changing this option affects only the rules imported after the change. If query_offset is set at the group level, it takes precedence over this setting.

Default is set to 1m.
2025-03-20 15:31:21 +01:00
Yuri Tseretyan 943b73a682
Alerting: Add scheduled clean-up of deleted rules (#101963)
* add scheduled clean up of deleted rules


---------

Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2025-03-11 22:58:26 +02:00
Steve Simpson bbab62ce39
Alerting: Select remote write path dependent on metrics backend type. (#101891)
The remote write path differs based on whether the data source is actually
Prometheus, Mimir, Cortex, or an older version of Cortex. We do not want
users to have to specify the path, so this change determines the path as
best it can.

It may be in the future we have to make this configurable per-datasource
to cater for setups where it's impossible to determine the correct path.
2025-03-11 13:45:16 +01:00
Steve Simpson 14ebec527c
Alerting: Allow selection of recording rule write target on per-rule basis. (#101778)
* Alerting: Allow selection of recording rule write target on per-rule basis.

Introduces a new feature flag (`grafanaManagedRecordingRulesDatasources`),
disabled by default, to enable the ability to write recording rules data using
data source settings, and selecting the data source to use on a per-rule basis.

To cope with the scenario of users upgrading, a configuration file option
allows setting the default data source to use, if none is specified in the rule,
emulating the behaviour of recording rules without the flag enabled.

* Lint

* Update conf/sample.ini

Co-authored-by: Alexander Akhmetov <me@alx.cx>

---------

Co-authored-by: Alexander Akhmetov <me@alx.cx>
2025-03-07 14:30:40 +01:00
Alexander Akhmetov 1f8f9a45d7
Alerting: Add state_periodic_save_batch_size config option (#98019)
* Alerting: Add state_periodic_save_batch_size config option

---------

Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>
2024-12-16 15:30:38 +01:00
Steve Simpson c440bd2bda
Alerting: Change default for max_attempts to 3. (#97461)
Currently the default is 1, this means that by default users will see transient
query errors reflected as alert evaluation failures, when often an immediate
retry is sufficient to evaluate the rule successfully.

Enabling retries by default leads to a better experience out of the box.
2024-12-05 21:48:24 +01:00
Fayzal Ghantiwala 1fdc48faba
Alerting: Make context deadline on AlertNG service startup configurable (#96053)
* Make alerting context deadline configurable

* Remove debug logs

* Change default timeout

* Update tests
2024-11-07 18:23:55 +00:00
Brandon fbad76007d
Alerting: Limit and clean up old alert rules versions (#89754) 2024-10-05 00:31:21 +03:00
Alexander Weaver ac5ebe6e4d
Alerting: Add enablement flag for recording rules (#92032)
* Add enablement flag

* Disable if toggle not enabled
2024-08-19 12:01:00 -05:00
Santiago b79b38f02c
Alertmanager: Support limits for silences (#90826)
* Alertmanager: support limits for silences

* update grafana/alerting to latest main
2024-07-24 14:22:29 +02:00
Alexander Akhmetov 68691c9386
Alerting: Add setting for maximum allowed rule evaluation results (#89468)
* Alerting: Add setting for maximum allowed rule evaluation results

Added a new configuration setting `quota.alerting_rule_evaluation_results` to set the maximum number of alert rule evaluation results per rule. If the limit is exceeded, the evaluation will result in an error.
2024-06-27 09:45:15 +02:00
Yuri Tseretyan 4a5aab54a5
Alerting: Add max limit for Loki query size in state history API (#89646)
* add setting for query limit

* update BuildLogQuery to return error if limit is exceeded

* move tests for BuildLogQuery to separate suite
2024-06-25 09:20:38 -04:00
Matthew Jacobson 3228b64fe6
Alerting: Resend resolved notifications for ResolvedRetention duration (#88938)
* Simple replace of State.Resolved with State.ResolvedAt

* Retain ResolvedAt time between Normal->Normal transition

* Introduce ResolvedRetention to keep sending recently resolved alerts

* Make ResolvedRetention configurable with resolved_alert_retention

* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention

* Do not reset ResolvedAt during Normal->Pending transition

Initially this was done to be inline with Prom ruler. However, Prom ruler
doesn't keep track of Inactive->Pending/Alerting using the same alert instance,
so it's more understandable that they choose not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.

This should help alleviate some odd situations where temporarily entering
Pending will stop future resolved notifications that would have happened
because of ResolvedRetention.

* Pointers for ResolvedAt & LastSentAt

To avoid awkward time.Time{}.Unix() defaults on persist
2024-06-20 16:33:03 -04:00
William Wernert c62cc25513
Alerting: Configure recording rule writer from config.ini (#89056) 2024-06-12 16:04:46 -04:00
Jacob Valdemar eb76ea47a0
Alerting: Add ha_reconnect_timeout configuration option (#88823)
* Docs: Update "Configure high availability" guide with ha_reconnect_timeout configuration

---------

Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>
2024-06-11 13:25:48 -04:00
Santiago e15e40fbd3
Alerting: Skip setting up clustering in remote primary/only modes (#88968)
* Alerting: Skip setting up clustering in remote primary mode

* Update pkg/services/ngalert/notifier/multiorg_alertmanager.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

---------

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
2024-06-10 13:51:11 +02:00
Fayzal Ghantiwala 80f54778f3
Alerting: Add option to use Redis in cluster mode for Alerting HA (#88696)
* Add config option to use Redis in cluster mode

* Use UniversalOptions
2024-06-05 17:02:25 +01:00
Fayzal Ghantiwala 7a2fbad0c8
Alerting: Add options to configure TLS for HA using Redis (#87567)
* Add Alerting HA Redis Client TLS configs

* Add test to ping miniredis with mTLS

* Update .ini files and docs

* Add tests for unified alerting ha redis TLS settings

* Fix malformed go.sum

* Add modowner

* Fix lint error

* Update docs and use dstls config
2024-05-14 14:21:42 +01:00
Santiago 529f55cfe8
Alerting: Remove isDefault field from receivers (Alertmanager configuration) (#86605)
Alerting: Remove isDefault field from receivers in the Alertmanager configuration
2024-04-19 15:44:20 +02:00
Santiago c7573bb0f7
Alerting: Make retention period configurable for the notification log (#85605)
* Alerting: Make retention period configurable for the notification log

* update sample.ini

* fix outdated comment (on disk -> kvstore)

* skip checking cyclomatic complexity for ReadUnifiedAlertingSettings
2024-04-05 12:25:43 +02:00
William Wernert 97f37b2e6f
Alerting: Clamp Loki ASH range query to configured max_query_length (#83986)
* Clamp range in loki http client to configured max_query_length

Defaults to 721h to match Loki default
2024-03-15 18:59:45 +02:00
Gilles De Mey 8765c48389
Alerting: Remove legacy alerting (#83671)
Removes legacy alerting, so long and thanks for all the fish! 🐟

---------

Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
Co-authored-by: Sonia Aguilar <soniaAguilarPeiron@users.noreply.github.com>
Co-authored-by: Armand Grillet <armandgrillet@users.noreply.github.com>
Co-authored-by: William Wernert <rwwiv@users.noreply.github.com>
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2024-03-14 15:36:35 +01:00
Yuri Tseretyan 7147af6b8e
Alerting: Disable legacy alerting for ever (#83651)
* hard disable for legacy alerting
* remove alerting section from configuration file 
* update documentation to not refer to deleted section
* remove AlertingEnabled from usage in UA setting parsing
2024-03-07 16:01:11 -05:00
Alexander Weaver 99fa064576
Alerting: Emit warning when creating or updating unusually large groups (#82279)
* Add config for limit of rules per rule group

* Warn when editing big groups through normal API

* Warn on prov api writes for groups

* Wire up comp root, tests

* Also add warning to state manager warm

* Drop unnecessary conversion
2024-02-13 08:29:03 -06:00
Alexander Weaver 5bbe9c6e61
Alerting: Enable group-level rule evaluation jittering by default, remove feature toggle (#82212)
* remove jitter feature flag

* Add an out so users can manually disable jitter

* Pass in cfg

* Add TODO to remove knob in future
2024-02-09 15:53:58 -06:00
Jean-Philippe Quéméner aa25776f81
Alerting: Add a feature flag to periodically save states (#80987) 2024-01-23 17:03:30 +01:00
Marcus Efraimsson 6768c6c059
Chore: Remove public vars in setting package (#81018)
Removes the public variable setting.SecretKey plus some other ones. 
Introduces some new functions for creating setting.Cfg.
2024-01-23 12:36:22 +01:00
Santiago a77ba40ed4
Alerting: Use the forked Alertmanager for remote secondary mode (#79646)
* (WIP) Alerting: Use the forked Alertmanager for remote secondary mode

* fall back to using internal AM in case of error

* remove TODOs, clean up .ini file, add orgId as part of remote AM config struct

* log warnings and errors, fall back to remoteSecondary, fall back to internal AM only

* extract logic to decide remote Alertmanager mode to a separate function, switch on mode

* tests

* make linter happy

* remove func to decide remote Alertmanager mode

* refactor factory function and options

* add default case to switch statement

* remove ineffectual assignment
2023-12-21 15:26:31 +01:00
Santiago c46da8ea9b
Alerting: Update alerting package and imports from cluster and clusterpb (#79786)
* Alerting: Update alerting package

* update to latest commit

* alias for imports
2023-12-21 12:34:48 +01:00
gotjosh 0c9356a3c7
Unified Alerting: Set `max_attempts` to 1 by default (#79095)
* Unified Alerting: Set `max_attempts` to 1 by default

The retry logic for unified alerting has been broken as far as v9.4.x, rather than fixing it in one go and causing a headache to our users with rules putting extra load on their datasources - I think a better approach is to simply set 1 as a default and then let our users change it.

I see two cons with this approach:

- Configuration for legacy to unified alerting cannot be ported over automatically, users will have to manually set `max_attempts` to 3 when migrating.
- Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-05 17:42:34 +00:00
Matthew Jacobson 5a80962de9
Alerting: Add clean_upgrade config and deprecate force_migration (#78324)
* Alerting: Add clean_upgrade config and deprecate force_migration

Upgrading to UA and rolling back will no longer delete any data by default. 
Instead, each set of tables will remain unchanged when switching between 
legacy and UA. As such, the force_migration config has been deprecated 
and no extra configuration is required to roll back to legacy anymore.

If clean_upgrade is set to true when upgrading from legacy alerting to Unified
Alerting, grafana will first delete all existing Unified Alerting resources,
thus re-upgrading all organizations from scratch. If false or unset,
organizations that have previously upgraded will not lose their existing Unified
 Alerting data when switching between legacy and Unified Alerting.

 Similar to force_migration, it should be kept false when not needed as it may
 cause unintended data-loss if left enabled.

---------

Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>
2023-11-30 11:01:11 -05:00
Santiago 7a34cdb3a2
Alerting: Add configuration options to migrate to an external Alertmanager (#71318)
* add configuration options to .ini file and parse them

* updates on config options, add external AM config to the main config struct

* separate external AM configs from general alerting configs, naming

* comments about usage of tenantID in basic auth & not using config options yet
2023-09-05 11:24:35 -03:00
Alexander Weaver dfba94e052
Alerting: Limit redis pool size to 5 and make configurable (#74057)
* Limit redis pool size to 5 and expose it in config ini

* Coerce negative pool sizes to the default
2023-08-29 14:59:12 -05:00
Yuri Tseretyan c7598cc6fb
Alerting: Add ability to control scheduler tick interval via config (#71980)
* add ability to control scheduler interval via config
* add feature flag `configurableSchedulerTick`
2023-07-26 12:44:12 -04:00
George Robinson 7edbe72483
Alerting: Support concurrent queries for saving alert instances (#70525)
This commit adds support for concurrent queries when saving alert
instances to the database. This is an experimental feature in
response to some customers experiencing delays between rule evaluation
and sending alerts to Alertmanager, resulting in flapping. It is
disabled by default.
2023-06-23 11:36:07 +01:00
Jean-Philippe Quéméner 8bb62a8316
Alerting: Add option for memberlist label (#67982) 2023-05-09 10:32:23 +02:00
Jean-Philippe Quéméner bc11a484ed
Alerting: Add support for running HA using Redis (#65267)
Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
2023-04-19 17:05:26 +02:00
Alexander Weaver b2abb63286
Alerting: Introduce proper feature toggles for common state history backend combinations (#65497)
* define 3 feature toggles for rollout phases

* Pass feature toggles along

* Implement first feature toggle

* Try a different strategy with fall-throughs to specific configurations

* Apply toggle overrides once outside of backend composition

* Emit log messages when we coerce backends

* Run code generator for feature toggle files

* Improve wording in flag descs

* Re-run generator

* Use code-generated constants instead of plain strings

* Use converted enum values rather than strings for pre-parsing
2023-03-30 13:53:21 -05:00
Alexander Weaver a31672fa40
Alerting: Create new state history "fanout" backend that dispatches to multiple other backends at once (#64774)
* Rename RecordStatesAsync to Record

* Rename QueryStates to Query

* Implement fanout writes

* Implement primary queries

* Simplify error joining

* Add test for query path

* Add tests for writes and error propagation

* Allow fanout backend to be configured

* Touch up log messages and config validation

* Consistent documentation for all backend structs

* Parse and normalize backend names more consistently against an enum

* Touch-ups to documentation

* Improve clarity around multi-record blocking

* Keep primary and secondaries more distinct

* Rename fanout backend to multiple backend

* Simplify config keys for multi backend mode
2023-03-17 12:41:18 -05:00
Alexander Weaver e7ace4ed62
Alerting: Allow separate read and write path URLs for Loki state history (#62268)
Extract config parsing and add tests
2023-01-30 16:30:05 -06:00