Commit Graph

518 Commits

Author SHA1 Message Date
Michael Klishin f414c2d512
More missed license header updates #9969 2024-02-05 11:53:50 -05:00
Michael Klishin 01092ff31f
(c) year bumps 2024-01-01 22:02:20 -05:00
Lajos Gerecs 82e25af5d5
Grafana: make sure dashboards do not break when detailed metrics are used (#5945)
* Fix broken dashboards if detailed metrics are used

If detailed metrics are pulled into the same Prometheus instance, then
we get the following error in Grafana:

execution: many-to-many matching not allowed:
matching labels must be unique on one side

This is because both endpoints provide `rabbit_identity_info`,
which is not unique to either endpoint.

* add detailed metric scraper to prometheus config

---------

Co-authored-by: Michal Kuratczyk <michal.kuratczyk@broadcom.com>
2023-12-27 15:44:05 +01:00
Péter Gömöri fec09c0792 Escape prometheus core metric label values
For example special characters like double quotes are allowed in queue
names, in which case detailed metrics could produce unparsable text
format output.
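
A minimal sketch of such escaping for the Prometheus text exposition format (illustrative only, not the plugin's actual code): backslashes, double quotes and newlines must be escaped inside label values.

```erlang
escape_label_value(Bin) when is_binary(Bin) ->
    %% Walk the binary byte by byte and replace characters that are
    %% special inside a label value in the text exposition format.
    << <<(escape_char(C))/binary>> || <<C>> <= Bin >>.

escape_char($\\) -> <<"\\\\">>;
escape_char($")  -> <<"\\\"">>;
escape_char($\n) -> <<"\\n">>;
escape_char(C)   -> <<C>>.
```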
2023-12-03 01:14:44 +01:00
Michael Klishin 1b642353ca
Update (c) according to [1]
1. https://investors.broadcom.com/news-releases/news-release-details/broadcom-and-vmware-intend-close-transaction-november-22-2023
2023-11-21 23:18:22 -05:00
Johan Rhodin 0b2a94c1ec
Update RabbitMQ-Overview.json
Global counters for producers added in https://github.com/rabbitmq/rabbitmq-server/pull/3127 but never made it to this dashboard
2023-11-01 13:23:41 -05:00
Diana Parra Corbacho 5f0981c5a3
Allow using the Khepri database to store metadata instead of Mnesia
[Why]

Mnesia is a very powerful and convenient tool for Erlang applications:
it is a persistent disc-based database, it handles replication across
multiple Erlang nodes and it is available out-of-the-box from the
Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
metadata:

* virtual hosts' properties
* internal users
* queue, exchange and binding declarations (not queue data)
* runtime parameters and policies
* ...

Unfortunately Mnesia makes it difficult to handle network partitions and,
as a consequence, to resolve the merge conflicts between Erlang nodes once
the network partition is healed. RabbitMQ provides several partition
handling strategies but they are not bullet-proof. Users still hit
situations where it is a pain to repair a cluster following a network
partition.

[How]

@kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
already uses successfully to implement quorum queues and streams for
instance. Those queues do not suffer from network partitions.

We created Khepri [2], a new persistent and replicated database engine
based on Ra and we want to use it in place of Mnesia in RabbitMQ to
solve the problems with network partitions.

This patch integrates Khepri as an experimental feature. When enabled,
RabbitMQ will store all its metadata in Khepri instead of Mnesia.

This change comes with behavior changes. While Khepri remains disabled,
you should see no change in the behavior of RabbitMQ; if there is one, it
is a bug. After Khepri is enabled, there are significant changes in
behavior that you should be aware of.

Because it is based on the Raft consensus algorithm, when there is a
network partition, only the cluster members in the partition that holds
at least a majority of the nodes (`⌊number of nodes in the cluster ÷ 2⌋ + 1`)
can "make progress". In other words, only those nodes may write to the
Khepri database and read from it and expect a consistent result (see the
small sketch after the examples below).

For instance in a cluster of 5 RabbitMQ nodes:
* If there are two partitions, one with 3 nodes, one with 2 nodes, only
  the group of 3 nodes will be able to write to the database.
* If there are three partitions, two with 2 nodes, one with 1 node, none
  of the groups can write to the database.
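
A minimal sketch of the majority rule above, expressed as a hypothetical helper (not part of RabbitMQ):

    has_quorum(PartitionSize, ClusterSize) ->
        %% A partition can make progress only if it holds a strict
        %% majority of the cluster members.
        PartitionSize >= (ClusterSize div 2) + 1.

    %% Matching the examples above:
    %% has_quorum(3, 5) -> true   (the 3-node side of a 3/2 split)
    %% has_quorum(2, 5) -> false  (any side of a 2/2/1 split)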

Because the Khepri database will be used for all kinds of metadata, it
means that RabbitMQ nodes that can't write to the database will be
unable to perform some operations. A list of operations and what to
expect is documented in the associated pull request and the RabbitMQ
website.

This requirement from Raft also affects the startup of RabbitMQ nodes in
a cluster. Indeed, at least a quorum number of nodes must be started at
once to allow nodes to become ready.

To enable Khepri, you need to enable the `khepri_db` feature flag:

    rabbitmqctl enable_feature_flag khepri_db

When the `khepri_db` feature flag is enabled, the migration code
performs the following two tasks:
1. It synchronizes the Khepri cluster membership from the Mnesia
   cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
   the `khepri_mnesia_migration` application [3].
2. It copies data from relevant Mnesia tables to Khepri, doing some
   conversion if necessary on the way. Again, it uses
   `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
   it.

This can be performed on a running standalone RabbitMQ node or cluster.
Data will be migrated from Mnesia to Khepri without any service
interruption. Note that during the migration, the performance may
decrease and the memory footprint may go up.

Because this feature flag is considered experimental, it is not enabled
by default even on a brand new RabbitMQ deployment.

More about the implementation details below:

In the past months, all accesses to Mnesia were isolated in a collection
of `rabbit_db*` modules. This is where the integration of Khepri mostly
takes place: we use a function called `rabbit_khepri:handle_fallback/1`
which selects the database and performs the query or the transaction.
Here is an example from `rabbit_db_vhost`:

* Up until RabbitMQ 3.12.x:

        get(VHostName) when is_binary(VHostName) ->
            get_in_mnesia(VHostName).

* Starting with RabbitMQ 3.13.0:

        get(VHostName) when is_binary(VHostName) ->
            rabbit_khepri:handle_fallback(
              #{mnesia => fun() -> get_in_mnesia(VHostName) end,
                khepri => fun() -> get_in_khepri(VHostName) end}).

This `rabbit_khepri:handle_fallback/1` function relies on two things:
1. the state of the `khepri_db` feature flag: when it is enabled, the
   function always executes the Khepri-based variant;
2. otherwise, whether the Mnesia tables can still be read from and written
   to.

Before the feature flag is enabled, or during the migration, the
function will try to execute the Mnesia-based variant. If it succeeds,
then it returns the result. If it fails because one or more Mnesia
tables can't be used, it restarts from scratch: it means the feature
flag is being enabled and depending on the outcome, either the
Mnesia-based variant will succeed (the feature flag couldn't be enabled)
or the feature flag will be marked as enabled and it will call the
Khepri-based variant. The meat of this function really lives in the
`khepri_mnesia_migration` application [3] and
`rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
about the feature flag.
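
For illustration, a simplified sketch of how such a fallback wrapper can be structured; the helper `khepri_feature_flag_enabled/0` and the retry-on-`no_exists` clause are assumptions made for this example, not the actual implementation (which lives in `khepri_mnesia_migration`):

    handle_fallback(#{mnesia := MnesiaFun, khepri := KhepriFun} = Funs) ->
        case khepri_feature_flag_enabled() of
            true ->
                %% Feature flag enabled: always use the Khepri-based variant.
                KhepriFun();
            false ->
                %% Otherwise try Mnesia; if its tables are gone because the
                %% migration is in progress, re-evaluate and retry.
                try
                    MnesiaFun()
                catch
                    exit:{aborted, {no_exists, _}} ->
                        handle_fallback(Funs)
                end
        end.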

However, some calls to the database do not depend on the existence of
Mnesia tables, such as functions where we need to learn about the
members of a cluster. For those, we can't rely on exceptions from
Mnesia. Therefore, we just look at the state of the feature flag to
determine which database to use. There are two situations though:

* Sometimes, we need the feature flag state query to block because the
  function interested in it can't return a valid answer during the
  migration. Here is an example:

        case rabbit_khepri:is_enabled(RemoteNode) of
            true  -> can_join_using_khepri(RemoteNode);
            false -> can_join_using_mnesia(RemoteNode)
        end

* Sometimes, we need the feature flag state query to NOT block (for
  instance because it would cause a deadlock). Here is an example:

        case rabbit_khepri:get_feature_state() of
            enabled -> members_using_khepri();
            _       -> members_using_mnesia()
        end

Direct accesses to Mnesia still exist. They are limited to code that is
specific to Mnesia, such as classic queue mirroring or network partition
handling strategies.

Now, to discover the Mnesia tables to migrate and how to migrate them,
we use an Erlang module attribute called
`rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
tables and an associated converter module. Here is an example in the
`rabbitmq_recent_history_exchange` plugin:

    -rabbit_mnesia_tables_to_khepri_db(
       [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).

The converter module (`rabbit_db_rh_exchange_m2k_converter` in this
example) is in fact a "sub" converter module called by
`rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
converter module to learn more about these modules.
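
As an aside, such module attributes can be read back at runtime with plain
Erlang reflection. This is only an illustration of the mechanism, not the
plugin's actual discovery code:

    %% Collect {MnesiaTable, ConverterModule} pairs declared by the given
    %% modules. Erlang wraps each attribute value in a list, hence the
    %% extra generator level.
    tables_to_migrate(Modules) ->
        [TableSpec
         || Mod <- Modules,
            {rabbit_mnesia_tables_to_khepri_db, Values}
                <- Mod:module_info(attributes),
            Value <- Values,
            TableSpec <- Value].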

[1] https://github.com/rabbitmq/ra
[2] https://github.com/rabbitmq/khepri
[3] https://github.com/rabbitmq/khepri_mnesia_migration

See #7206.

Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
2023-09-29 16:00:11 +02:00
Péter Gömöri e009a0af72 Expose number of unreachable cluster peers via Prometheus
Unreachable peers are the subset of DB cluster nodes that are not connected
to the current node via Erlang distribution, for whatever reason.
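
Roughly speaking (illustrative Erlang, not the plugin's actual code), the value corresponds to:

```erlang
%% DB cluster nodes minus the nodes currently connected to this one via
%% Erlang distribution (including the local node itself).
unreachable_peer_count(ClusterNodes) ->
    Reachable = [node() | nodes()],
    length([N || N <- ClusterNodes, not lists:member(N, Reachable)]).
```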
2023-09-17 17:18:41 +02:00
Michal Kuratczyk c56f2e2678
Remove the query threshold
The graph looks empty or broken when values are sometimes
above and sometimes below the 5000 limit. I think it's better
to just show everything.
2023-09-07 11:33:35 +02:00
David Ansari 0f5fe8fadd Add Prometheus metric for messages dropped by the MQTT QoS 0 queue type
Why:
A RabbitMQ operator should be able to see whether RabbitMQ drops MQTT
QoS 0 messages due to overload protection. It's an indication that an
MQTT subscriber does not consume fast enough.

How:
Use Prometheus global counters.

There are 2 valid solutions:
1. Introduce a new metric called messages_dropped specifically for the
   rabbitmq_mqtt_qos0_queue type. This would work in a similar fashion to
   how streams extend the per-protocol global counters, but requires
   extending the per-protocol & queue-type global counters for the MQTT
   QoS 0 queue type. The emitted metrics would look as follows:
```
rabbitmq_global_messages_dropped_total{protocol="mqtt310",queue_type="rabbit_mqtt_qos0_queue"} 0
rabbitmq_global_messages_dropped_total{protocol="mqtt311",queue_type="rabbit_mqtt_qos0_queue"} 0
rabbitmq_global_messages_dropped_total{protocol="mqtt50",queue_type="rabbit_mqtt_qos0_queue"} 0
```
2. Reuse the existing metric rabbitmq_global_messages_dead_lettered_maxlen_total

This commit decides to go for the 2nd approach because:
a) there is no need to add a new metric. Even though dead lettering is not supported
for the MQTT QoS 0 queue type, this metric maps nicely to
what happens: the queue drops messages because its max length
(mqtt.mailbox_soft_limit) is exceeded with overflow behaviour
drop-head. Furthermore, the label `dead_letter_strategy="disabled"` indicates
that dead lettering is not taking place for this queue type.

b) this metric makes it possible to support dead lettering for the MQTT QoS 0 queue
type in the future.

The new dead lettering metrics look as follows:
```
rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0
rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_mqtt_qos0_queue",dead_letter_strategy="disabled"} 0
rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0

rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0
rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0
rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0

rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0
rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0
rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0

rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0
rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0

rabbitmq_global_messages_dead_lettered_confirmed_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0
```
2023-08-15 16:06:15 +02:00
Iliia Khaprov 19f122fea4 Prometheus core metrics collector: Do not render any samples that are NaN or undefined
close #8740
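
The idea, sketched in Erlang under the assumption that samples arrive as `{Name, Value}` pairs (the collector's real data structures differ):

```erlang
%% Keep only samples whose value is an actual number; drop undefined
%% (and anything else that would render as NaN in the text format).
renderable_samples(Samples) ->
    [Sample || {_Name, Value} = Sample <- Samples, is_number(Value)].
```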
2023-07-08 18:45:03 +02:00
Simon Unge 2a2af36b9c rename fix 2023-06-23 14:53:57 -07:00
Simon Unge 8b3ca4c972 See #8605. Add authentication support to Prometheus. 2023-06-23 13:54:45 -07:00
Michael Klishin 55442aa914 Replace @rabbitmq.com addresses with rabbitmq-core@groups.vmware.com
Don't ask why we have to do it. Because reasons!
2023-06-20 15:40:13 +04:00
Rin Kuryloski eb94a58bc9 Add a workflow to compare the bazel/erlang.mk output
To catch any drift between the builds
2023-05-15 13:54:14 +02:00
Michael Klishin 59fe5dc01b
Prometheus: handle scenarios when no listener is configured
Start a plain TCP one with all defaults.
2023-05-06 00:19:58 +04:00
Chunyi Lyu 4ddb0c2038 Support TLS-only listener for Prometheus
- tcp listener can be turned off by setting
'prometheus.tcp.listener = none'
- config schema follows web_mqtt and web_stomp
2023-05-05 15:44:53 +01:00
Rin Kuryloski a944439fba Replace globs in bazel with explicit lists of files
As this is preferred in rules_erlang 3.9.14
2023-04-25 17:29:12 +02:00
Rin Kuryloski 854d01d9a5 Restore the original -include_lib statements from before #6466
since this broke erlang_ls

requires rules_erlang 3.9.13
2023-04-20 12:40:45 +02:00
Rin Kuryloski 8de8f59d47 Use gazelle generated bazel files
Bazel build files are now maintained primarily with `bazel run
gazelle`. This will analyze and merge changes into the build files as
necessitated by certain code changes (e.g. the introduction of new
modules).

In some cases there are hints to gazelle in the build files, such as `#
gazelle:erlang...` or `# keep` comments. xref checks on plugins that
depend on the cli are a good example.
2023-04-17 18:13:18 +02:00
Rin Kuryloski 8a7eee6a86 Ignore warnings when building plt files for dependencies
As we don't generally care if a dependency has warnings, only the
target
2023-04-17 10:09:24 +02:00
Rin Kuryloski 609171ec70 Rename the tanzu cli scope to vmware
And update other references to commercial editions
2023-02-16 13:49:54 +01:00
Ilia Kurenkov 051419e46b
Fix descriptions of auth metrics. 2023-02-11 23:36:21 +01:00
Alexey Lebedeff c7da0da8b8 Cleanup dialyzer calls
- Use the same base .plt everywhere, so there is no need to list
standard apps everywhere
- Fix typespecs: some typos and the use of not-exported types
2023-02-06 17:05:30 +01:00
Rin Kuryloski 5ef8923462 Avoid the need to pass package name to rabbitmq_integration_suite 2023-01-18 15:25:27 +01:00
Rin Kuryloski a317b30807 Use improved assert_suites2 macro from rules_erlang 3.9.0 2023-01-18 15:07:06 +01:00
Michael Klishin ba7b44df8a
Merge pull request #6879 from rabbitmq/dialyzer-warnings-rabbitmq-prometheus
Fix all dialyzer warnings in rabbitmq_prometheus
2023-01-13 12:21:15 -06:00
Alexey Lebedeff cd92258346 Fix all dialyzer warnings in rabbitmq_prometheus 2023-01-13 15:52:26 +01:00
Michal Kuratczyk 510415f8b9
Update prometheus.erl to 4.10.0
Since 4.10.0 was released specifically to address an issue we
encountered in the RabbitMQ integration with prometheus.erl, a new test was
added to validate this functionality in the future.
2023-01-13 10:24:41 +01:00
Michael Klishin 8e8def801c
Wording 2023-01-03 14:19:58 -05:00
Ilia Kurenkov 98a5e34e90 DRY: Link docs for `/metrics/detailed` endpoint to website. 2023-01-03 13:45:45 +01:00
Michael Klishin ec4f1dba7d
(c) year bump: 2022 => 2023 2023-01-01 23:17:36 -05:00
Luke Bakken 7fe159edef
Yolo-replace format strings
Replaces `~s` and `~p` with their unicode-friendly counterparts.

```
git ls-files *.erl | xargs sed -i.ORIG -e s/~s>/~ts/g -e s/~p>/~tp/g
```
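
For illustration, the difference in behaviour: `~ts` treats its argument as Unicode-aware text, while `~s` assumes Latin-1 and can mangle or reject UTF-8 data.

```erlang
%% In an Erlang shell:
1> io:format("~ts~n", [<<"über"/utf8>>]).
über
ok
```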
2022-10-10 10:32:03 +04:00
Loïc Hoguin 73dd0acf01
rabbit_prometheus_http_SUITE: Update tests for new CQs
CQs without consumers will have only one message in memory.
2022-09-27 12:00:10 +02:00
Michael Klishin 96b6e6c368
Merge pull request #5463 from rabbitmq/global-metrics-values
Move message rate metrics from aggregated to global counters
2022-09-12 20:25:08 +04:00
Iliia Khaprov - VMware e15d12d767
Merge pull request #5449 from rabbitmq/grafana-9-support
Update RabbitMQ Dashboards to support latest Grafana versions
2022-08-24 21:25:56 +02:00
David Ansari c3cccf4963
Rename run_queues_length_total to run_queues_length
It's a gauge, not a counter.
@deadtrickster fixed the bug in d0feb0df58
See #4380
2022-08-24 18:04:14 +02:00
Péter Gömöri bf00ee4cfc Fix a typo in a comment in prometheus_rabbitmq_core_metrics_collector 2022-08-23 00:54:35 +02:00
Connor Rogers 6ee0a318e8
Move message rate metrics from channel/queue aggregation to global counters 2022-08-08 16:19:01 +01:00
Connor Rogers c88326ef23
Add README.md for creating/updating dashboards 2022-08-05 17:41:47 +01:00
Connor Rogers e35fd65ff3
Fix overview graphs in Grafana 9
'-1' is no longer accepted as of Grafana 9, and causes a console error when rendering
2022-08-05 17:09:34 +01:00
Connor Rogers 9ac6862e06
Fix dist link graph
Both directions of the link were showing as one entry instead of two.

This is because of https://github.com/flant/grafana-statusmap/issues/277
2022-08-05 16:51:21 +01:00
Connor Rogers 42f30ba7c3
Set time series to show all series in tooltip 2022-08-05 16:14:29 +01:00
Connor Rogers 40767cdae4
Take dashboard definitions straight from exported Grafana for simplicity 2022-08-05 16:09:30 +01:00
Connor Rogers 4d28eef0f8
Migrate from deprecated panels in Grafana 2022-08-05 15:46:27 +01:00
Connor Rogers 8e404ecd04
Update to supported Grafana and Prometheus versions 2022-08-05 12:52:26 +01:00
Jean-Sébastien Pédron 6e9ee4d0da
Remove test code which depended on the `quorum_queue` feature flag
These checks are now irrelevant as the feature flag is required.
2022-08-01 12:41:30 +02:00
Iliia Khaprov 360db38db0 Add process_start_time_seconds metrics. See #4539 2022-07-13 08:35:33 +02:00
Philip Kuryloski 15a79466b1 Use the new xref2 macro from rules_erlang
That adopts the modern erlang.mk xref behaviour
2022-06-09 23:18:28 +02:00
Philip Kuryloski 327f075d57 Make rabbitmq-server work with rules_erlang 3
Also rework elixir dependency handling, so we no longer rely on mix to
fetch the rabbitmq_cli deps

Also:

- Specify ra version with a commit rather than a branch
- Fixup compilation options for erlang 23
- Add missing ra reference in MODULE.bazel
- Add missing flag in oci.yaml
- Reduce bazel rbe jobs to try to save memory
- Use bazel built erlang for erlang git master tests
- Use the same cache for all the workflows but windows
- Avoid using `mix local.hex --force` in elixir rules
  - Fetching seems blocked in CI, and this should reduce hex api usage in
    all builds, which is always nice
- Remove xref and dialyze tags since rules_erlang 3 includes them in
  the defaults
2022-06-08 14:04:53 +02:00
Loïc Hoguin dc70cbf281
Update Erlang.mk and switch to new xref code 2022-05-31 13:51:12 +02:00
Michael Klishin 018b04a1ea
Wording 2022-04-26 23:08:00 +04:00
Péter Gömöri 35b21797ae Expose head_message_timestamp via Prometheus plugin as well
It is already exposed via rabbitmqctl and the API. It is also exposed by
old or unofficial prometheus plugins and other monitoring
integrations (DataDog).
2022-04-26 15:42:57 +02:00
Michael Klishin 7c47d0925a
Revert "Correct a double quote introduced in #4603"
This reverts commit 6a44e0e2ef.

That wiped a lot of files unintentionally
2022-04-20 16:05:56 +04:00
Michael Klishin 6a44e0e2ef
Correct a double quote introduced in #4603 2022-04-20 16:01:29 +04:00
Luke Bakken dba25f6462
Replace files with symlinks
This prevents duplicated and out-of-date instructions.
2022-04-15 06:04:29 -07:00
Loïc Hoguin 499e0b9197
Remove the CQv1 disabled stats from management/Prometheus 2022-04-05 12:37:54 +02:00
Michael Klishin c38a3d697d
Bump (c) year 2022-03-21 01:21:56 +04:00
David Ansari a3905da47c Add note about missed Prometheus counter updates
Currently, the quorum queue state machine updates counters via mod_call effects
which are not guaranteed to be executed.

They are updated via mod_call effects such that only the leader
increments the counter (and not the followers).

In certain failure scenarios when dead-lettering lots of messages
at the same time, these mod_call effects might not be executed.

Hence, one shouldn't rely on the counters for dead-lettered messages and
dead-lettered confirmed messages matching up 100%, even if all
dead-lettered messages were eventually confirmed.
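
For illustration only (module, function and command names below are made up), a Ra state machine command handler can return such an effect from its `apply/3` callback; `mod_call` effects run on the leader and are best-effort rather than guaranteed:

```erlang
apply(_Meta, {dead_letter, Count}, State) ->
    %% Side effect executed outside the Raft log: it may be skipped in
    %% certain failure scenarios, hence the caveat above.
    Effects = [{mod_call, my_counters, increment_dead_lettered, [Count]}],
    {State, ok, Effects}.
```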
2022-02-28 16:28:09 +01:00
David Ansari 8c286cc680 Add Prometheus metrics for dead-lettered messages
```
> curl -s localhost:15692/metrics | grep rabbitmq_global_messages_dead_lettered
# TYPE rabbitmq_global_messages_dead_lettered_delivery_limit_total counter
# HELP rabbitmq_global_messages_dead_lettered_delivery_limit_total Total number of messages dead-lettered due to delivery-limit exceeded
rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0
rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0
# TYPE rabbitmq_global_messages_dead_lettered_expired_total counter
# HELP rabbitmq_global_messages_dead_lettered_expired_total Total number of messages dead-lettered due to message TTL exceeded
rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0
rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0
rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0
# TYPE rabbitmq_global_messages_dead_lettered_rejected_total counter
# HELP rabbitmq_global_messages_dead_lettered_rejected_total Total number of messages dead-lettered due to basic.reject or basic.nack
rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0
rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0
rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0
# TYPE rabbitmq_global_messages_dead_lettered_confirmed_total counter
# HELP rabbitmq_global_messages_dead_lettered_confirmed_total Total number of messages dead-lettered and confirmed by target queues
rabbitmq_global_messages_dead_lettered_confirmed_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0
# TYPE rabbitmq_global_messages_dead_lettered_maxlen_total counter
# HELP rabbitmq_global_messages_dead_lettered_maxlen_total Total number of messages dead-lettered due to overflow drop-head or reject-publish-dlx
rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0
rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0
rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0
```

A few notes:
* dead_letter_strategy 'disabled' means that either the user did not
  configure a dead-letter exchange or the configured dead-letter exchange
  does not exist.
* Only time series that make sense get output.
  Example 1: Combination of 'at_least_once' and 'maxlen' will always be 0.
  Hence, we omit that time series.
  Example 2: 'confirmed' only makes sense with quorum queues and
  'at_least_once'.
  Example 3: 'delivery_limit' only makes sense with quorum queues.
* Users get to know *why* messages were dead-lettered.
* Before this commit, there was no possibility for users to alert
  based on messages being dropped from the head of the queue when
  overflow=drop-head.
* Users can now easily create alerts:
  Example 1: A message gets silently dropped (i.e.
  dead_letter_strategy='disabled') instead of actually being dead-lettered.
  Example 2: Detect dead-letter topology misconfigurations.
  Example 3: Messages expire
  Example 4: Messages overflow
  Example 5: Messages requeued too often
* Stream queues by definition do not dead-letter.
2022-02-28 16:28:02 +01:00
Philip Kuryloski 226e00fcd2 Tighten up dialyzer usage
now that rules_erlang no longer cascades up dialyzer warnings from deps
2022-02-24 11:18:41 +01:00
Philip Kuryloski d8201726ae Ignore dialyzer warnings for most apps 2022-02-21 09:19:56 +01:00
Philip Kuryloski efcd881658 Use rules_erlang v2
bazel-erlang has been renamed rules_erlang. v2 is a substantial
refactor that brings Windows support. While this alone isn't enough to
run all rabbitmq-server suites on windows, one can at least now start
the broker (bazel run broker) and run the tests that do not start a
background broker process
2022-01-18 13:43:46 +01:00
Alexey Lebedeff 7676ed9685 Use `rabbitmq_cluster_` prefix for cluster-wide metrics 2021-11-24 16:49:43 +01:00
Michael Klishin 38d64a54b1
Wording 2021-11-24 14:19:57 +03:00
Michael Klishin a1c0cd3785
Wording 2021-11-24 14:02:10 +03:00
Alexey Lebedeff 6e3012aaf9 Add optional metrics for vhost and exchange count
These can make sense in some scenarios, e.g. when vhost/exchanges are
created using self-service automation
2021-11-24 11:00:41 +01:00
Luke Bakken bd2858c208
Compile the regex 2021-11-22 08:30:17 -08:00
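
A generic illustration of the change above (the pattern and function names are made up, not the plugin's code): compile the exclusion regex once and reuse the compiled pattern for every queue name.

```erlang
init_pattern() ->
    {ok, MP} = re:compile(<<"^skip-me\\.">>),
    MP.

excluded(QueueName, MP) ->
    %% Reuse the precompiled pattern instead of recompiling on each call.
    re:run(QueueName, MP, [{capture, none}]) =:= match.
```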
dcorbacho a7c9b66653 Use own key to exclude queues 2021-11-16 16:53:17 +01:00
dcorbacho 242cb539b3 Exclude queues from aggregated metrics in prometheus collector
Uses same exclusion pattern as the management agent
2021-11-16 10:23:39 +01:00
Alexey Lebedeff 8598c51579 Pre-render prometheus labels
This makes per-object metrics twice as fast.

Depends on https://github.com/deadtrickster/prometheus.erl/pull/137
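
A sketch of the idea, assuming labels arrive as `{Name, Value}` pairs of binaries (value escaping omitted for brevity): the label string is rendered once per object and reused for every metric line, instead of being re-formatted on each scrape.

```erlang
render_labels(Labels) ->
    Rendered = lists:join($,,
                 [[Name, <<"=\"">>, Value, $"] || {Name, Value} <- Labels]),
    iolist_to_binary([${, Rendered, $}]).
```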
2021-11-09 13:04:39 +01:00
Alexey Lebedeff b9ebfb8980 Fix ssl port handling in prometheus plugin
All ssl options were stored in the same proplist, and the code was
then trying to determine whether an option actually belongs to ranch
ssl options or not.

Some keys landed in the wrong place, as happened in #2975 -
different ports were mentioned in listener config (default at
top-level, and non-default in `ssl_opts`). Then `ranch` and
`rabbitmq_web_dispatch` were treating this differently.

This change just moves all ranch ssl opts into proper place using
schema, removing any need for guessing in code.

The only downside is that advanced config compatibility is broken.
2021-10-20 14:55:33 +02:00
Michael Klishin 3826a0df25
Compile #3561 2021-10-13 01:27:16 +03:00
Michael Klishin 670f240537
Compile #3561 2021-10-12 20:17:51 +03:00
Johannes Würbach 84de860b4c
feat(prom): expose cluster id in identity 2021-10-12 15:43:46 +02:00
Alexey Lebedeff 989a299720 Emit identity info in prometheus /metrics/detailed endpoint
This is needed to make filtering metrics on a cluster name possible.
2021-09-28 19:35:02 +02:00
Alexey Lebedeff 5501d07b8b Use rabbitmq_ct_helpers to allocate prometheus port
This test always used the standard port 15692 before, which was causing
conflicts with e.g. a local `make run-broker`.
2021-09-22 15:23:35 +02:00
Alexey Lebedeff 4bb2262140 Allow selective querying for prometheus plugin 2021-09-20 14:59:17 +02:00
Michael Klishin 47b20e8f7c
Prometheus: alarm-related metric naming 2021-08-17 20:58:24 +03:00
Ilya Khaprov 9fed915192
Add alarms prometheus collector.
close #2653
2021-08-16 20:32:29 +02:00
Gerhard Lazu 62d82e1660
Break down metrics by node in all RabbitMQ-Stream pie charts
Otherwise we won't be able to see which nodes are running "hot"

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-08-11 13:39:30 +01:00
David Ansari 4b774db5c1 Use same threshold color for "Errors since boot" 2021-08-02 17:05:17 +02:00
David Ansari c99ee6961e Use same colorMode in all RabbitMQ-Stream panels
Co-authored-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-08-02 13:33:00 +02:00
David Ansari ea18c31288 Make RabbitMQ-Stream dashboard work via ConfigMap
Before this commit, importing the dashboard via ConfigMap as seen in
1eb1dc618e
didn't work because the DS_PROMETHEUS variable was undefined in Grafana.

Related to https://github.com/rabbitmq/rabbitmq-server/pull/3250

Co-authored-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-08-02 13:12:48 +02:00
Gerhard Lazu 65afbb931b
Ensure RabbitMQ-Stream dashboard works correctly after import
This breaks the docker-compose integration, but we need to move away
from it anyway; the whole dev flow needs revisiting after our focus on
K8s.

$__rate_interval does not work with irate, dropping it in favour of 60s,
same as all other dashboards.

This is a follow-up to https://github.com/rabbitmq/rabbitmq-server/pull/3250

Thanks @ansd for pointing out the post-import issues.

It was uploaded as https://grafana.com/api/dashboards/14798/revisions/3/download

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-07-30 13:53:02 +01:00
Gerhard Lazu 35a6369327
Restart stream-perf-test on-failure
This handles the scenario where rmq2 is not available, and
stream-perf-test exits with a non-zero exit code. Good spot @ansd!

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-07-30 11:25:36 +01:00
David Ansari 47d572908d Convert string to integer for ulimits.nofile
Before this commit:

> make overview metrics
services.rmq1.ulimits.nofile.hard must be a integer
make: *** [Makefile:68: overview] Error 15

According to the docs
https://docs.docker.com/compose/compose-file/compose-file-v3/#ulimits
this must be an integer.
2021-07-30 09:46:38 +02:00
Gerhard Lazu 6f5c4118ea
Publish RabbitMQ-Stream dashboard to grafana.com
Removed the Dockerfile and slimmed down the Makefile; all of this is now
handled by https://github.com/rabbitmq/rabbitmq-server/blob/master/.github/workflows/oci.yaml
cc @Zerpet @pjk25

More details here (including the steps used to publish to grafana.com):
https://github.com/rabbitmq/release-engineering/issues/11#issuecomment-887627938

I don't want to hold up this PR, will invest in automating the
steps described in the previous link another time. Time to 🚀

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-07-29 19:34:05 +01:00
Philip Kuryloski b26289cb47 Adjust rabbitmq_prometheus test suite timeouts in bazel 2021-07-22 11:00:14 +02:00
Gerhard Lazu 66ef8adfc8
Fix accept dependency in rabbitmq_prometheus
It's a runtime dependency, not a build dependency.

This is a fix and should be backported to v3.9.x, after rc.2 and just
before the final release. Would you disagree @dumbbell?

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-07-21 13:38:54 +01:00
Philip Kuryloski 8f9de08de7 Also assert no missing suites for all other deps 2021-07-12 18:05:55 +02:00
dcorbacho b636ad2565 Rename protocol error counters to _total 2021-06-30 12:46:41 +02:00
dcorbacho c9305d948a
Use number of publishing channels as global publishers in amqp091 2021-06-29 08:10:42 +01:00
Philip Kuryloski 8c7e7e0656 Revert "Default all `rabbitmq_integration_suite` to flaky in bazel"
This reverts commit 70cb8147b2.
2021-06-23 20:53:14 +02:00
Gerhard Lazu c7971252cd
Global counters per protocol + protocol AND queue_type
This way we can show how many messages were received via a certain
protocol (stream is the second real protocol besides the default amqp091
one), as well as by queue type, which is something that many asked for a
really long time.

The most important aspect is that we can also see them by protocol AND
queue_type, which becomes very important for Streams, which have
different rules from regular queues (for example, consuming
messages is non-destructive, and deep queue backlogs - think billions of
messages - are normal). Alerting and consumer scaling due to deep
backlogs will now work correctly, as we can distinguish between regular
queues & streams.

This has gone through a few cycles, with @mkuratczyk & @dcorbacho
covering most of the ground. @dcorbacho had most of this in
https://github.com/rabbitmq/rabbitmq-server/pull/3045, but the main
branch went through a few changes in the meantime. Rather than resolving
all the conflicts, and then making the necessary changes, we (@gerhard +
@kjnilsson) took all learnings and started re-applying a lot of the
existing code from #3045. We are confident in this approach and would
like to see it through. We continued working on this with @dumbbell, and
the most important changes are captured in
https://github.com/rabbitmq/seshat/pull/1.

We expose these global counters in rabbitmq_prometheus via a new
collector. We don't want to keep modifying the existing collector, which
grew really complex in parts, especially since we introduced
aggregation, but start with a new namespace, `rabbitmq_global_`, and
continue building on top of it. The idea is to build in parallel, and
slowly transition to the new metrics, because semantically the changes
are too big since streams, and we have been discussing protocol-specific
metrics with @kjnilsson, which makes me think that this approach is
least disruptive and... simple.

While at it, we removed redundant empty return value handling in the
channel; the function being called no longer returns it.

Also removed all DONE / TODO & other comments - we'll handle them when
the time comes, no need to leave TODO reminders.

Pairs @kjnilsson @dcorbacho @dumbbell
(this is multiple commits squashed into one)

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-06-22 14:14:21 +01:00
Philip Kuryloski 70cb8147b2 Default all `rabbitmq_integration_suite` to flaky in bazel
Most tests that can start rabbitmq nodes have some chance of
flaking. Rather than chase individual flakes for now, this commit
changes the default (though it can still be overridden, as is the case
for config_scheme_SUITE in many places, since I have yet to see that
particular suite flake).
2021-06-21 16:10:38 +02:00
Philip Kuryloski 30f9a95b9f Add dialyze for remaining tier-1 plugins 2021-06-01 10:19:10 +02:00
Philip Kuryloski a3dbdecb8c Mark //deps/rabbitmq_prometheus:rabbit_prometheus_http_SUITE flaky 2021-05-21 18:32:20 +02:00
Philip Kuryloski 98e71c45d8 Perform xref checks on many tier-1 plugins 2021-05-21 12:03:22 +02:00
Philip Kuryloski e6df6615e1 Further bazel file refactoring and deduplication 2021-05-11 16:15:33 +02:00