grafana

Commit Graph

Author	SHA1	Message	Date
Seunghun Shin	512c292e04	Alerting: Add jitter support for periodic alert state storage to reduce database load spikes (#111357 ) What is this feature? This PR implements a jitter mechanism for periodic alert state storage to distribute database load over time instead of processing all alert instances simultaneously. When enabled via the state_periodic_save_jitter_enabled configuration option, the system spreads batch write operations across 85% of the save interval window, preventing database load spikes in high-cardinality alerting environments. Why do we need this feature? In production environments with high alert cardinality, the current periodic batch storage can cause database performance issues by processing all alert instances simultaneously at fixed intervals. Even when using periodic batch storage to improve performance, concentrating all database operations at a single point in time can overwhelm database resources, especially in resource-constrained environments. Rather than performing all INSERT operations at once during the periodic save, distributing these operations across the time window until the next save cycle can maintain more stable service operation within limited database resources. This approach prevents resource saturation by spreading the database load over the available time interval, allowing the system to operate more gracefully within existing resource constraints. For example, with 200,000 alert instances using a 5-minute interval and 4,000 batch size, instead of executing 50 batch operations simultaneously, the jitter mechanism distributes these operations across approximately 4.25 minutes (85% of 5 minutes), with each batch executed roughly every 5.2 seconds. This PR provides system-level protection against such load spikes by distributing operations across time, reducing peak resource usage while maintaining the benefits of periodic batch storage. The jitter mechanism is particularly valuable in resource-constrained environments where maintaining consistent database performance is more critical than precise timing of state updates.	2025-09-29 11:22:36 +02:00
Alexander Akhmetov	100528e274	Alerting: Support retry with backoff in alert rule evaluation (#99710 )	2025-09-04 13:56:03 +02:00
Alexander Akhmetov	6db07b901c	Alerting: Enable HA clustering in remote primary mode (#108930 )	2025-07-31 09:55:08 +02:00
Vadim Stepanov	bccc980b90	Alerting: Notifiication history (#107644 ) * Add unified_alerting.notification_history to ini files * Parse notification history settings * Move Loki client to a separate package * Loki client: add params for metrics and traces * add NotificationHistorian * rm writeDuration * remove RangeQuery stuff * wip * wip * wip * wip * pass notification historian in tests * unify loki settings * unify loki settings * add test * update grafana/alerting * make update-workspace * add feature toggle * fix configureNotificationHistorian * Revert "add feature toggle" This reverts commit `de7af8f7` * add feature toggle * more tests * RuleUID * fix metrics test * met.Info.Set(0)	2025-07-17 14:26:26 +01:00
Alexander Akhmetov	a76057cedb	Alerting: Use GRAFANA_ALERTS as the default state history metric name (#107050 )	2025-06-20 18:14:36 +02:00
Alexander Akhmetov	ad683f83ff	Alerting: Add state history backend to write ALERTS metric (#104361 ) What is this feature? This PR implements a new Prometheus historian backend that allows Grafana alerting to write alert state history as Prometheus-compatible `ALERTS` metrics to remote Prometheus-compatible data sources. The metric includes a few additional labels: * `grafana_alertstate`: Grafana's full alert state, more granular than Prometheus. * `grafana_rule_uid`: Grafana's alert rule UID. Grafana states are included in the `grafana_alertstate` label also mapped to Prometheus-compatible `alertstate` values: \| Grafana alert state \| `alertstate` \| `grafana_alertstate` \| \|---------------------\|-----------------------\|-----------------------\| \| `Alerting` \| `firing` \| `alerting` \| \| `Recovering` \| `firing` \| `recovering` \| \| `Pending` \| `pending` \| `pending` \| \| `Error` \| `firing` \| `error` \| \| `NoData` \| `firing` \| `nodata` \| \| `Normal` \| _(no metric emitted)_ \| _(no metric emitted)_ \|	2025-06-18 07:17:57 +02:00
Vadim Stepanov	5137995830	Alerting: Add support for Redis Sentinel for Alerting HA (#106322 ) * Alerting: Add support for Redis Sentinel * docs * docs * Use minisentinel in test * Apply suggestions from code review Co-authored-by: Johnny Kartheiser <140559259+JohnnyK-Grafana@users.noreply.github.com> Co-authored-by: Fayzal Ghantiwala <114010985+fayzal-g@users.noreply.github.com> * "address(es)" -> "address or addresses" * make update-workspace * make lint-go-diff --------- Co-authored-by: Johnny Kartheiser <140559259+JohnnyK-Grafana@users.noreply.github.com> Co-authored-by: Fayzal Ghantiwala <114010985+fayzal-g@users.noreply.github.com>	2025-06-05 15:02:40 +01:00
Alexander Akhmetov	e256f2d5e2	Alerting: Enable recording rules by default (#105603 )	2025-06-02 10:56:05 +02:00
Santiago	6c3d89f390	Remote Alertmanager: Add timeouts to the HTTP client (#105279 ) * Remote Alertmanage: Add timeouts to the HTTP client * code review suggestions	2025-05-13 13:25:56 +02:00
Santiago	5a589bb51a	Alerting: Enable the remote Alertmanager feature using only feature toggles (#101410 ) * Alerting: Enable the remote Alertmanager feature using only feature toggles * Trigger build	2025-04-30 12:18:47 +02:00
Alexander Akhmetov	756b45402c	Alerting: Add rule_query_offset setting for Prometheus rule conversion (#102500 ) Adds a new configuration option to specify a time offset for rule evaluation, which gets applied and saved during the Prometheus -> Grafana conversion. For example: [unified_alerting.prometheus_conversion] rule_query_offset = 1m Changing this option affects only the rules imported after the change. If query_offset is set at the group level, it takes precedence over this setting. Default is set to 1m.	2025-03-20 15:31:21 +01:00
Yuri Tseretyan	943b73a682	Alerting: Add scheduled clean-up of deleted rules (#101963 ) * add scheduled clean up of deleted rules --------- Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>	2025-03-11 22:58:26 +02:00
Steve Simpson	bbab62ce39	Alerting: Select remote write path dependent on metrics backend type. (#101891 ) The remote write path differs based on whether the data source is actually Prometheus, Mimir, Cortex, or an older version of Cortex. We do not want users to have to specify the path, so this change determines the path as best it can. It may be in the future we have to make this configurable per-datasource to cater for setups where it's impossible to determine the correct path.	2025-03-11 13:45:16 +01:00
Steve Simpson	14ebec527c	Alerting: Allow selection of recording rule write target on per-rule basis. (#101778 ) * Alerting: Allow selection of recording rule write target on per-rule basis. Introduces a new feature flag (`grafanaManagedRecordingRulesDatasources`), disabled by default, to enable the ability to write recording rules data using data source settings, and selecting the data source to use on a per-rule basis. To cope with the scenario of users upgrading, a configuration file option allows setting the default data source to use, if none is specified in the rule, emulating the behaviour of recording rules without the flag enabled. * Lint * Update conf/sample.ini Co-authored-by: Alexander Akhmetov <me@alx.cx> --------- Co-authored-by: Alexander Akhmetov <me@alx.cx>	2025-03-07 14:30:40 +01:00
Alexander Akhmetov	1f8f9a45d7	Alerting: Add state_periodic_save_batch_size config option (#98019 ) * Alerting: Add state_periodic_save_batch_size config option --------- Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>	2024-12-16 15:30:38 +01:00
Steve Simpson	c440bd2bda	Alerting: Change default for max_attempts to 3. (#97461 ) Currently the default is 1, this means that by default users will see transient query errors reflected as alert evaluation failures, when often an immediate retry is sufficient to evaluate the rule successfully. Enabling retries by default leads to a better experience out of the box.	2024-12-05 21:48:24 +01:00
Fayzal Ghantiwala	1fdc48faba	Alerting: Make context deadline on AlertNG service startup configurable (#96053 ) * Make alerting context deadline configurable * Remove debug logs * Change default timeout * Update tests	2024-11-07 18:23:55 +00:00
Brandon	fbad76007d	Alerting: Limit and clean up old alert rules versions (#89754 )	2024-10-05 00:31:21 +03:00
Alexander Weaver	ac5ebe6e4d	Alerting: Add enablement flag for recording rules (#92032 ) * Add enablement flag * Disable if toggle not enabled	2024-08-19 12:01:00 -05:00
Santiago	b79b38f02c	Alertmanager: Support limits for silences (#90826 ) * Alertmanager: support limits for silences * update grafana/alerting to latest main	2024-07-24 14:22:29 +02:00
Alexander Akhmetov	68691c9386	Alerting: Add setting for maximum allowed rule evaluation results (#89468 ) * Alerting: Add setting for maximum allowed rule evaluation results Added a new configuration setting `quota.alerting_rule_evaluation_results` to set the maximum number of alert rule evaluation results per rule. If the limit is exceeded, the evaluation will result in an error.	2024-06-27 09:45:15 +02:00
Yuri Tseretyan	4a5aab54a5	Alerting: Add max limit for Loki query size in state history API (#89646 ) * add setting for query limit * update BuildLogQuery to return error if limit is exceeded * move tests for BuildLogQuery to separate suite	2024-06-25 09:20:38 -04:00
Matthew Jacobson	3228b64fe6	Alerting: Resend resolved notifications for ResolvedRetention duration (#88938 ) * Simple replace of State.Resolved with State.ResolvedAt * Retain ResolvedAt time between Normal->Normal transition * Introduce ResolvedRetention to keep sending recently resolved alerts * Make ResolvedRetention configurable with resolved_alert_retention * Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention * Do not reset ResolvedAt during Normal->Pending transition Initially this was done to be inline with Prom ruler. However, Prom ruler doesn't keep track of Inactive->Pending/Alerting using the same alert instance, so it's more understandable that they choose not to retain ResolvedAt. In our case, since we use the same cached instance to represent the transition, it makes more sense to retain it. This should help alleviate some odd situations where temporarily entering Pending will stop future resolved notifications that would have happened because of ResolvedRetention. * Pointers for ResolvedAt & LastSentAt To avoid awkward time.Time{}.Unix() defaults on persist	2024-06-20 16:33:03 -04:00
William Wernert	c62cc25513	Alerting: Configure recording rule writer from config.ini (#89056 )	2024-06-12 16:04:46 -04:00
Jacob Valdemar	eb76ea47a0	Alerting: Add ha_reconnect_timeout configuration option (#88823 ) * Docs: Update "Configure high availability" guide with ha_reconnect_timeout configuration --------- Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>	2024-06-11 13:25:48 -04:00
Santiago	e15e40fbd3	Alerting: Skip setting up clustering in remote primary/only modes (#88968 ) * Alerting: Skip setting up clustering in remote primary mode * Update pkg/services/ngalert/notifier/multiorg_alertmanager.go Co-authored-by: Steve Simpson <steve.simpson@grafana.com> --------- Co-authored-by: Steve Simpson <steve.simpson@grafana.com>	2024-06-10 13:51:11 +02:00
Fayzal Ghantiwala	80f54778f3	Alerting: Add option to use Redis in cluster mode for Alerting HA (#88696 ) * Add config option to use Redis in cluster mode * Use UniversalOptions	2024-06-05 17:02:25 +01:00
Fayzal Ghantiwala	7a2fbad0c8	Alerting: Add options to configure TLS for HA using Redis (#87567 ) * Add Alerting HA Redis Client TLS configs * Add test to ping miniredis with mTLS * Update .ini files and docs * Add tests for unified alerting ha redis TLS settings * Fix malformed go.sum * Add modowner * Fix lint error * Update docs and use dstls config	2024-05-14 14:21:42 +01:00
Santiago	529f55cfe8	Alerting: Remove isDefault field from receivers (Alertmanager configuration) (#86605 ) Alerting: Remove isDefault field from receivers in the Alertmanager configuration	2024-04-19 15:44:20 +02:00
Santiago	c7573bb0f7	Alerting: Make retention period configurable for the notification log (#85605 ) * Alerting: Make retention period configurable for the notification log * update sample.ini * fix outdated comment (on disk -> kvstore) * skip checking cyclomatic complexity for ReadUnifiedAlertingSettings	2024-04-05 12:25:43 +02:00
William Wernert	97f37b2e6f	Alerting: Clamp Loki ASH range query to configured max_query_length (#83986 ) * Clamp range in loki http client to configured max_query_length Defaults to 721h to match Loki default	2024-03-15 18:59:45 +02:00
Gilles De Mey	8765c48389	Alerting: Remove legacy alerting (#83671 ) Removes legacy alerting, so long and thanks for all the fish! 🐟 --------- Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com> Co-authored-by: Sonia Aguilar <soniaAguilarPeiron@users.noreply.github.com> Co-authored-by: Armand Grillet <armandgrillet@users.noreply.github.com> Co-authored-by: William Wernert <rwwiv@users.noreply.github.com> Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>	2024-03-14 15:36:35 +01:00
Yuri Tseretyan	7147af6b8e	Alerting: Disable legacy alerting for ever (#83651 ) * hard disable for legacy alerting * remove alerting section from configuration file * update documentation to not refer to deleted section * remove AlertingEnabled from usage in UA setting parsing	2024-03-07 16:01:11 -05:00
Alexander Weaver	99fa064576	Alerting: Emit warning when creating or updating unusually large groups (#82279 ) * Add config for limit of rules per rule group * Warn when editing big groups through normal API * Warn on prov api writes for groups * Wire up comp root, tests * Also add warning to state manager warm * Drop unnecessary conversion	2024-02-13 08:29:03 -06:00
Alexander Weaver	5bbe9c6e61	Alerting: Enable group-level rule evaluation jittering by default, remove feature toggle (#82212 ) * remove jitter feature flag * Add an out so users can manually disable jitter * Pass in cfg * Add TODO to remove knob in future	2024-02-09 15:53:58 -06:00
Jean-Philippe Quéméner	aa25776f81	Alerting: Add a feature flag to periodically save states (#80987 )	2024-01-23 17:03:30 +01:00
Marcus Efraimsson	6768c6c059	Chore: Remove public vars in setting package (#81018 ) Removes the public variable setting.SecretKey plus some other ones. Introduces some new functions for creating setting.Cfg.	2024-01-23 12:36:22 +01:00
Santiago	a77ba40ed4	Alerting: Use the forked Alertmanager for remote secondary mode (#79646 ) * (WIP) Alerting: Use the forked Alertmanager for remote secondary mode * fall back to using internal AM in case of error * remove TODOs, clean up .ini file, add orgId as part of remote AM config struct * log warnings and errors, fall back to remoteSecondary, fall back to internal AM only * extract logic to decide remote Alertmanager mode to a separate function, switch on mode * tests * make linter happy * remove func to decide remote Alertmanager mode * refactor factory function and options * add default case to switch statement * remove ineffectual assignment	2023-12-21 15:26:31 +01:00
Santiago	c46da8ea9b	Alerting: Update alerting package and imports from cluster and clusterpb (#79786 ) * Alerting: Update alerting package * update to latest commit * alias for imports	2023-12-21 12:34:48 +01:00
gotjosh	0c9356a3c7	Unified Alerting: Set `max_attempts` to 1 by default (#79095 ) * Unified Alerting: Set `max_attempts` to 1 by default The retry logic for unified alerting has been broken as far as v9.4.x, rather than fixing it in one go and causing a headache to our users with rules putting extra load on their datasources - I think a better approach is to simply set 1 as a default and then let our users change it. I see two cons with this approach: - Configuration for legacy to unified alerting cannot be ported over automatically, users will have to manually set `max_attempts` to 3 when migrating. - Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration. Signed-off-by: gotjosh <josue.abreu@gmail.com> --------- Signed-off-by: gotjosh <josue.abreu@gmail.com>	2023-12-05 17:42:34 +00:00
Matthew Jacobson	5a80962de9	Alerting: Add clean_upgrade config and deprecate force_migration (#78324 ) * Alerting: Add clean_upgrade config and deprecate force_migration Upgrading to UA and rolling back will no longer delete any data by default. Instead, each set of tables will remain unchanged when switching between legacy and UA. As such, the force_migration config has been deprecated and no extra configuration is required to roll back to legacy anymore. If clean_upgrade is set to true when upgrading from legacy alerting to Unified Alerting, grafana will first delete all existing Unified Alerting resources, thus re-upgrading all organizations from scratch. If false or unset, organizations that have previously upgraded will not lose their existing Unified Alerting data when switching between legacy and Unified Alerting. Similar to force_migration, it should be kept false when not needed as it may cause unintended data-loss if left enabled. --------- Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>	2023-11-30 11:01:11 -05:00
Santiago	7a34cdb3a2	Alerting: Add configuration options to migrate to an external Alertmanager (#71318 ) * add configuration options to .ini file and parse them * updates on config options, add external AM config to the main config struct * separate external AM configs from general alerting configs, naming * comments about usage of tenantID in basic auth & not using config options yet	2023-09-05 11:24:35 -03:00
Alexander Weaver	dfba94e052	Alerting: Limit redis pool size to 5 and make configurable (#74057 ) * Limit redis pool size to 5 and expose it in config ini * Coerce negative pool sizes to the default	2023-08-29 14:59:12 -05:00
Yuri Tseretyan	c7598cc6fb	Alerting: Add ability to control scheduler tick interval via config (#71980 ) * add ability to control scheduler interval via config * add feature flag `configurableSchedulerTick`	2023-07-26 12:44:12 -04:00
George Robinson	7edbe72483	Alerting: Support concurrent queries for saving alert instances (#70525 ) This commit adds support for concurrent queries when saving alert instances to the database. This is an experimental feature in response to some customers experiencing delays between rule evaluation and sending alerts to Alertmanager, resulting in flapping. It is disabled by default.	2023-06-23 11:36:07 +01:00
Jean-Philippe Quéméner	8bb62a8316	Alerting: Add option for memberlist label (#67982 )	2023-05-09 10:32:23 +02:00
Jean-Philippe Quéméner	bc11a484ed	Alerting: Add support for running HA using Redis (#65267 ) Co-authored-by: Steve Simpson <steve.simpson@grafana.com>	2023-04-19 17:05:26 +02:00
Alexander Weaver	b2abb63286	Alerting: Introduce proper feature toggles for common state history backend combinations (#65497 ) * define 3 feature toggles for rollout phases * Pass feature toggles along * Implement first feature toggle * Try a different strategy with fall-throughs to specific configurations * Apply toggle overrides once outside of backend composition * Emit log messages when we coerce backends * Run code generator for feature toggle files * Improve wording in flag descs * Re-run generator * Use code-generated constants instead of plain strings * Use converted enum values rather than strings for pre-parsing	2023-03-30 13:53:21 -05:00
Alexander Weaver	a31672fa40	Alerting: Create new state history "fanout" backend that dispatches to multiple other backends at once (#64774 ) * Rename RecordStatesAsync to Record * Rename QueryStates to Query * Implement fanout writes * Implement primary queries * Simplify error joining * Add test for query path * Add tests for writes and error propagation * Allow fanout backend to be configured * Touch up log messages and config validation * Consistent documentation for all backend structs * Parse and normalize backend names more consistently against an enum * Touch-ups to documentation * Improve clarity around multi-record blocking * Keep primary and secondaries more distinct * Rename fanout backend to multiple backend * Simplify config keys for multi backend mode	2023-03-17 12:41:18 -05:00
Alexander Weaver	e7ace4ed62	Alerting: Allow separate read and write path URLs for Loki state history (#62268 ) Extract config parsing and add tests	2023-01-30 16:30:05 -06:00
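
The arithmetic in the jitter commit (512c292e04) can be checked with a short sketch. This is not Grafana's implementation: the function name `batch_delays` and the even-spacing rule are assumptions. The sketch assumes the first batch runs immediately and the last one lands at the end of the 85% window, which is one reading consistent with the "roughly every 5.2 seconds" figure in the commit message.

```python
import math

# Fraction of the save interval used for spreading writes (from the commit message).
JITTER_WINDOW_FRACTION = 0.85


def batch_delays(instances, batch_size, interval_seconds,
                 fraction=JITTER_WINDOW_FRACTION):
    """Return a start delay in seconds for each batch (hypothetical helper).

    Assumption: batches are spaced evenly, the first at 0 seconds and the
    last at the end of the jitter window.
    """
    batches = math.ceil(instances / batch_size)
    if batches <= 1:
        # Zero or one batch: nothing to spread out.
        return [0.0] * batches
    window = interval_seconds * fraction
    step = window / (batches - 1)
    return [i * step for i in range(batches)]


# Figures from the commit message: 200,000 instances, 4,000 per batch, 5-minute interval.
delays = batch_delays(200_000, 4_000, 300)
print(len(delays), round(delays[1], 1), round(delays[-1], 1))
```

With the commit's figures this gives 50 batches spread over about 255 seconds, with roughly 5.2 seconds between consecutive batches.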

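The state mapping described in the `ALERTS` historian commit (ad683f83ff) can be written as a small lookup. This is an illustrative sketch, not the backend's code: the names `STATE_LABELS` and `alerts_labels` are hypothetical, and only the label names and values are taken from the table in the commit message.

```python
# Grafana alert state -> (alertstate, grafana_alertstate), per the commit message.
STATE_LABELS = {
    "Alerting": ("firing", "alerting"),
    "Recovering": ("firing", "recovering"),
    "Pending": ("pending", "pending"),
    "Error": ("firing", "error"),
    "NoData": ("firing", "nodata"),
}


def alerts_labels(state, rule_uid):
    """Return the extra ALERTS labels for a state, or None when no metric is emitted.

    Hypothetical helper: raises KeyError for states not listed in the commit's table.
    """
    if state == "Normal":
        # The commit message states that Normal emits no metric.
        return None
    alertstate, grafana_alertstate = STATE_LABELS[state]
    return {
        "alertstate": alertstate,
        "grafana_alertstate": grafana_alertstate,
        "grafana_rule_uid": rule_uid,
    }


print(alerts_labels("NoData", "example-uid"))
print(alerts_labels("Normal", "example-uid"))
```

The rule UID `example-uid` is a placeholder. The lookup shows why `grafana_alertstate` is the more granular label: four distinct Grafana states collapse to `firing` in the Prometheus-compatible `alertstate` label.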

77 Commits