---
aliases:
- ./alerting-limitations/ # /docs/grafana/<GRAFANA_VERSION>/alerting/set-up/alerting-limitations/
- ../../alerting/performance-limitations/ # /docs/grafana/<GRAFANA_VERSION>/alerting/performance-limitations/
canonical: https://grafana.com/docs/grafana/latest/alerting/set-up/performance-limitations/
description: Learn about performance considerations and limitations
keywords:
- grafana
- alerting
- performance
- limitations
labels:
products:
- cloud
- enterprise
- oss
title: Performance considerations and limitations
weight: 800
---
# Performance considerations and limitations
Grafana Alerting supports multi-dimensional alerting, where one alert rule can generate many alerts. For example, you can configure an alert rule to fire an alert every time the CPU of an individual virtual machine maxes out. This topic discusses performance considerations that result from multi-dimensional alerting.

Evaluating alert rules consumes RAM and CPU to compute the output of an alerting query, and network resources to send alert notifications and write the results to the Grafana SQL database. The configuration of individual alert rules affects resource consumption and, therefore, the maximum number of rules a given configuration can support.

The following considerations affect alerting performance:
- Frequency of rule evaluation. The "Evaluate Every" property of an alert rule controls how often the rule is evaluated. Use the lowest acceptable evaluation frequency to support more concurrent rules.
- Cardinality of the rule's result set. For example, suppose you are monitoring API response errors for every API path, on every virtual machine in your fleet. This set has a cardinality of _n_ paths multiplied by _v_ VMs. You can reduce the cardinality of a result set, for example, by monitoring errors per VM instead of per path per VM.
- Complexity of the alerting query. Queries that data sources can process and respond to quickly consume fewer resources. Although this consideration is less important than the two above, once you have reduced those as much as possible, improving individual query performance can still make a difference.

Each evaluation of an alert rule generates a set of alert instances, one for each member of the result set. The state of all the instances is written to the `alert_instance` table in the Grafana SQL database. This volume of write operations can cause issues when using SQLite.

Grafana Alerting exposes a metric, `grafana_alerting_rule_evaluations_total`, that counts the number of alert rule evaluations. To get a feel for the influence of rule evaluations on your Grafana instance, observe the rate of evaluations and compare it with resource consumption. In a Prometheus-compatible database, you can use the query `rate(grafana_alerting_rule_evaluations_total[5m])` to compute the rate over 5-minute windows. Keep in mind that this isn't the full picture of rule evaluation. For example, the load is unevenly distributed if some rules evaluate every 10 seconds and others every 30 minutes.
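For reference, here is the query from the paragraph above as a copy-pasteable snippet, together with an aggregated variant. Both only assume that your Prometheus-compatible database scrapes the Grafana instance's metrics endpoint.

```promql
# Per-second rate of alert rule evaluations, averaged over 5-minute windows
rate(grafana_alerting_rule_evaluations_total[5m])

# Total evaluation rate across the whole Grafana instance
sum(rate(grafana_alerting_rule_evaluations_total[5m]))
```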
These factors all affect the load on the Grafana instance, but you should also be aware of the performance impact that evaluating these rules has on your data sources. Alerting queries are often the vast majority of queries handled by monitoring databases, so the same load factors that affect the Grafana instance affect them as well.
## Limited rule sources support
Grafana Alerting can retrieve alerting and recording rules **stored** in most available Prometheus, Loki, Mimir, and Alertmanager compatible data sources.
At this time, it does not support reading or writing alert rules from any data sources other than those previously mentioned.
## Prometheus version support

The latest two minor versions of both Prometheus and Alertmanager are supported. We cannot guarantee that older versions work.

For example, if the current Prometheus version is `2.31.1`, versions `2.29.0` and later are supported.
## The Grafana Alertmanager can only receive Grafana managed alerts
Grafana cannot be used to receive external alerts. You can only send alerts to the Grafana Alertmanager using Grafana managed alerts.

You have the option to send Grafana-managed alerts to an external Alertmanager. You can find this option in the Admin tab on the Alerting page.
For more information, refer to [this GitHub issue](https://github.com/grafana/grafana/issues/73447).
## High load on database caused by a high number of alert instances
If you have a high number of alert instances, the load on the database can become very high, because each state
transition of an alert instance is saved in the database after every evaluation.
### Compressed alert state
When the `alertingSaveStateCompressed` feature toggle is enabled, Grafana saves the alert rule state in a compressed form. Instead of performing an individual SQL update for each alert instance, Grafana performs a single SQL update per alert rule, updating all alert instances belonging to that rule.
This can significantly reduce database overhead for alert rules with many alert instances.
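As a minimal sketch, you can enable this feature toggle in the Grafana configuration file. The snippet below assumes you manage toggles through the standard `[feature_toggles]` section rather than environment variables or Helm values.

```ini
# Sketch: enable compressed alert state storage
# (assumes the [feature_toggles] section of grafana.ini or custom.ini).
[feature_toggles]
alertingSaveStateCompressed = true
```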
### Save state periodically
High load can also be prevented by writing to the database periodically, instead of after every evaluation.
To save state periodically, enable the `alertingSaveStatePeriodic` feature toggle.
By default, it saves the states to the database every 5 minutes and on each shutdown. The periodic interval
can also be configured using the `state_periodic_save_interval` configuration option. During this process, Grafana deletes all existing alert instances from the database and then writes the entire current set of instances back in batches in a single transaction.
Configure the size of each batch using the `state_periodic_save_batch_size` configuration option.
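Putting these options together, a minimal configuration sketch might look like the following; the interval matches the stated default and the batch size is purely illustrative.

```ini
[feature_toggles]
alertingSaveStatePeriodic = true

[unified_alerting]
# How often the full alert state is written to the database (default: 5m)
state_periodic_save_interval = 5m
# Number of alert instances written per batch during the periodic save
# (illustrative value, tune for your database)
state_periodic_save_batch_size = 1000
```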
#### Jitter for periodic saves
To further distribute database load, you can enable jitter for periodic state saves by setting `state_periodic_save_jitter_enabled = true`. When jitter is enabled, instead of saving all batches simultaneously, Grafana spreads the batch writes across a calculated time window of 85% of the save interval.
**How jitter works:**
- Calculates delays for each batch: `delay = (batchIndex * timeWindow) / (totalBatches - 1)`
- Time window uses 85% of save interval for safety margin
- Batches are evenly distributed across the time window
- All operations occur within a single database transaction
**Configuration example:**
```ini
[unified_alerting]
state_periodic_save_jitter_enabled = true
state_periodic_save_interval = 1m
state_periodic_save_batch_size = 100
```
**Performance impact:**
For 2000 alert instances with 1-minute interval and 100 batch size:
- Creates 20 batches (2000 ÷ 100)
- Spreads writes across 51 seconds (85% of 60s)
- Batch writes occur every ~2.68 seconds
This helps reduce database load spikes in environments with high alert cardinality by distributing writes over time rather than concentrating them at the beginning of each save cycle.
The time it takes to write to the database periodically can be monitored using the `state_full_sync_duration_seconds` metric
that is exposed by Grafana.
If Grafana crashes or is force killed, then the database can be out of date by up to the value of `state_periodic_save_interval`.
When Grafana restarts, the UI might show incorrect state for some alerts until the alerts are re-evaluated.
In some cases, alerts that were firing before the crash might fire again.
If this happens, Grafana might send duplicate notifications for firing alerts.
## Alert rule migrations for Grafana 11.6.0
When you upgrade to Grafana 11.6.0, a migration is performed on the `alert_rule_versions` table. If the upgrade to 11.6.0 fails during this migration, your `alert_rule_versions` table has too many rows. To fix this, truncate the `alert_rule_versions` table so that the migration can complete.
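As a sketch of the fix, you can clear the table directly in the Grafana database before retrying the upgrade. Back up the database first, and adjust the statement to your database engine; for example, SQLite does not support `TRUNCATE`.

```sql
-- MySQL / PostgreSQL
TRUNCATE TABLE alert_rule_versions;

-- SQLite (no TRUNCATE support)
DELETE FROM alert_rule_versions;
```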