Docs: Add separate fundamentals topic on notification policies (#69174)
Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>
@@ -45,6 +45,14 @@ Set where, when, and how firing alert instances get routed.

Each notification policy contains a set of label matchers that indicate which alert rules or instances it is responsible for. It also has a contact point assigned to it that consists of one or more contact point types, such as Slack or email. Contact points define how your contacts are notified when an alert instance fires.

For more information on notification policies, see [fundamentals of Notification Policies]({{< relref "../fundamentals/notification-policies/index.md" >}}).

**Message templates**

Use message templates to create reusable custom templates for your notifications and use them in contact points.

**Silences and mute timings**

Add silences to stop notifications from one or more alert instances or use mute timings to specify time intervals when you don’t want new notifications to be generated or sent out.

The difference between the two is that a silence lasts only for a specified window of time, whereas a mute timing recurs on a schedule, for example, during a maintenance period.

@@ -0,0 +1,132 @@

---
title: Notification Policies
description: Introduction to Notification Policies and how they work
weight: 409
keywords:
- grafana
- alerting
- notification policies
---

# Notification Policies

Notification policies provide you with a flexible way of routing alerts to different receivers. Using label matchers, you can modify alert notification delivery without having to update every individual alert rule.

Learn how notification policies work and how they are structured, so that you can get the most out of setting them up.

## Policy tree

Notification policies are _not_ a list; they are organized in a [tree structure](https://en.wikipedia.org/wiki/Tree_structure). This means that each policy can have child policies, and so on. The root of the notification policy tree is called the **Default notification policy**.

Each policy consists of a set of label matchers (zero or more) that specify which alert instances it is, or isn't, interested in handling.

For more information on label matching, see [how label matching works]({{< relref "../annotation-label/labels-and-label-matchers.md" >}}).

{{% admonition type="note" %}}
If you haven't configured any label matchers for your notification policy, your notification policy will match _all_ alert instances. This may prevent child policies from being evaluated unless you have enabled **Continue matching siblings** on the notification policy.
{{% /admonition %}}

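For illustration, here is a minimal sketch of what a policy's matchers could look like in Grafana's alerting file provisioning format using the `object_matchers` field; the label names and values are assumptions invented for this example:

```yaml
# Hypothetical matchers for a policy handling high-severity alerts
# from the operations team. Each matcher is a [label, operator, value]
# triple; supported operators are =, !=, =~ (regex), and !~ (negative regex).
object_matchers:
  - ['team', '=', 'operations']
  - ['severity', '=~', 'critical|high']
```
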
## Routing

To determine which notification policy handles an alert instance, evaluation starts with the existing set of notification policies, beginning at the default notification policy.

If no policies other than the default policy are configured, the default policy handles the alert instance.

If policies other than the default policy are defined, those notification policies are inspected in descending order.

If a notification policy has label matchers that match the labels of the alert instance, it descends into its child policies and, if there are any, continues to look for child policies whose label matchers further narrow down the set of labels, and so forth until no more child policies are found.

If no child policies are defined in a notification policy, or if none of the child policies have label matchers that match the alert instance's labels, that matching policy itself handles the alert instance.

As soon as a matching policy is found, the system does not continue to look for other matching policies. If you want to continue to look for other policies that may match, enable **Continue matching siblings** on that particular policy.

Lastly, if none of the notification policies are selected, the default notification policy is used.

### Routing example

Here is an example of a relatively simple notification policy tree and some alert instances.

{{< figure src="/media/docs/alerting/notification-routing.png" max-width="750px" caption="Notification policy routing" >}}

Here's a breakdown of how these policies are selected:

**Pod stuck in CrashLoop** does not have a `severity` label, so none of its child policies are matched. It does have a `team=operations` label, so the first policy is matched.

The `team=security` policy is not evaluated since we already found a match and **Continue matching siblings** was not configured for that policy.

**Disk Usage – 80%** has both a `team` and `severity` label, and matches a child policy of the operations team.

**Unauthorized log entry** has a `team` label but does not match the first policy (`team=operations`) since the values are not the same, so it continues searching and matches the `team=security` policy. It does not have any child policies, so the additional `severity=high` label is ignored.

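As a sketch only, the policy tree in this example could be expressed in Grafana's file provisioning format along the following lines. The contact point names (`operations-slack`, `operations-pager`, `security-email`) and the exact `severity` value on the child policy are assumptions invented for illustration, since those details come from the figure:

```yaml
apiVersion: 1
policies:
  - orgId: 1
    # Default notification policy (the tree root).
    receiver: grafana-default-email
    routes:
      # First policy: matches Pod stuck in CrashLoop via team=operations.
      - receiver: operations-slack
        object_matchers:
          - ['team', '=', 'operations']
        routes:
          # Child policy: matches Disk Usage – 80%, which also carries a severity label.
          - receiver: operations-pager
            object_matchers:
              - ['severity', '=', 'critical']
      # Sibling policy: matches Unauthorized log entry via team=security.
      - receiver: security-email
        object_matchers:
          - ['team', '=', 'security']
```
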
## Inheritance

Besides being a useful concept for routing alert instances, child policies also inherit properties from their parent policy. This also applies to any policies that are children of the default notification policy.

The following properties are inherited by child policies:

- Contact point
- Grouping options
- Timing options
- Mute timings

Each of these properties can be overridden by an individual policy should you wish to change the inherited behavior.

To inherit a contact point from the parent policy, leave it blank. To override the inherited grouping options, enable **Override grouping**. To override the inherited timing options, enable **Override general timings**.

### Inheritance example

The example below shows how the notification policy tree from our previous example allows the child policies of the `team=operations` policy to inherit its contact point.

In this way, we can avoid having to specify the same contact point multiple times for each child policy.

{{< figure src="/media/docs/alerting/notification-inheritance.png" max-width="750px" caption="Notification policy inheritance" >}}

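In provisioning terms, this is simply a matter of leaving `receiver` unset on the child route. A minimal sketch, reusing the hypothetical contact point names from the routing example above:

```yaml
routes:
  # Parent policy with an explicit contact point.
  - receiver: operations-slack
    object_matchers:
      - ['team', '=', 'operations']
    routes:
      # No receiver here: this child inherits operations-slack from its parent.
      - object_matchers:
          - ['severity', '=', 'warning']
      # This child overrides the inherited contact point instead.
      - receiver: operations-pager
        object_matchers:
          - ['severity', '=', 'critical']
```
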
## Additional configuration options

### Grouping

Grouping is a key concept in Grafana Alerting that categorizes alert instances of similar nature into a single funnel. This allows you to properly route alert notifications during larger outages, when many parts of a system fail at once and cause a high number of alerts to fire simultaneously.

Grouping options determine _which_ alert instances are bundled together.

When an alert instance is matched to a specific notification policy, it no longer has any association with its alert rule.

To group alert instances by the original alert rule, set the grouping using `alertname` and `grafana_folder` (since alert names are not unique across multiple folders).

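In the file provisioning format, for example, that grouping would look something like this:

```yaml
# Group notifications by the rule that fired them; grafana_folder is
# included because alert names are not unique across folders.
group_by:
  - grafana_folder
  - alertname
```
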
This is also the default setting for the built-in Grafana Alertmanager.

Should you wish to group alert instances by something other than the alert rule, change the grouping to any other combination of label keys.

#### Turn off grouping

Should you wish to receive every alert instance as a separate notification, group by a special label called `...`.

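In a provisioned policy, for instance, this is a one-line setting:

```yaml
# The special `...` label disables grouping: every alert instance
# produces its own notification.
group_by:
  - '...'
```
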
#### Everything in a single group

Should you wish to receive all alert instances in a single notification, create an empty list of labels to group by.

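As a sketch, that empty list looks like this in a provisioned policy:

```yaml
# No grouping labels at all: all alert instances for this policy
# end up in a single group and a single notification.
group_by: []
```
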
### Timing options

Timing options determine _when_ notifications for a group are sent to the corresponding contact point.

#### Group wait

The waiting time until the initial notification is sent for a **new group** created by an incoming alert.

**Default** 30 seconds

#### Group interval

The waiting time to send a batch of alert instances for **existing groups**.

{{% admonition type="note" %}}
This means that notifications will **not** be sent any sooner than 5 minutes (default) since the last batch of updates was delivered, regardless of whether the alert rule interval for those alert instances is lower.
{{% /admonition %}}

**Default** 5 minutes

#### Repeat interval

The waiting time to resend a notification after one has successfully been sent. This means notifications for **firing** alerts are re-delivered every 4 hours (default).

**Default** 4 hours

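Putting the three timers together, a provisioned policy that spells out the default values named above might look like this:

```yaml
group_wait: 30s # wait before the first notification for a new group
group_interval: 5m # wait between notification batches for an existing group
repeat_interval: 4h # wait before re-sending an already delivered notification
```
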
@@ -16,23 +16,17 @@ weight: 300

# Manage notification policies

Notification policies determine how alerts are routed to contact points.

Policies have a tree structure, where each policy can have one or more nested policies. Each policy, except for the default policy, can also match specific alert labels.

Each alert is evaluated by the default policy and subsequently by each nested policy.

If the **Continue matching subsequent sibling nodes** option is enabled for a nested policy, then evaluation continues even after one or more matches. A parent policy’s configuration settings and contact point information govern the behavior of an alert that does not match any of the nested policies. A default policy governs any alert that does not match a nested policy.

You can configure Grafana-managed notification policies as well as notification policies for an external Alertmanager data source.

For more information on notification policies, see [fundamentals of Notification Policies]({{< relref "../fundamentals/notification-policies/index.md" >}}).

## Edit default notification policy