Key changes:
- endpoint variable to handle scraping multiple endpoints
- message size panels (new metric in 4.1)
- panels at the top of the Overview dashboard should be more up to date
(they show the latest value)
- values should be accurate if multiple endpoints are scraped
(previously, many would be doubled)
- Nodes table shows fewer volumns and shows node uptime
This avoids using Mix while compiling which simplifies
a number of things and let us do further build improvements
later on.
Elixir is only enabled from within rabbitmq_cli currently.
Eunit is disabled since there are only Elixir tests.
Dialyzer will force-enable Elixir in order to process
Elixir-compiled beam files.
This commit also includes a few changes that are
related:
* The Erlang distribution will now be started for parallel-ct
* Many unnecessary PROJECT_MOD lines have been removed
* `eunit_formatters` has been removed, it provides little value
* The new `maybe_flock` Erlang.mk function is used where possible
* Build test deps when testing rabbitmq_cli (Mix won't do it anymore)
* rabbitmq_ct_helpers now use the early plugins to have Dialyzer
properly set up
- Modified metric expression and legend format in State of distribution links
- Changed panel type from 'flant-statusmap-panel' to 'status-history' for Process state
- Updated metric expressions to include instance filtering with {instance=\"$node\"}
for the following metrics:
- erlang_vm_statistics_run_queues_length
- erlang_vm_statistics_dirty_io_run_queue_length
- erlang_vm_statistics_dirty_cpu_run_queue_length
- Added 'DS_PROMETHEUS' as a templated data source variable
`init_per_group/3`, which starts the broker, was already called earlier
in the function.
This fixes a bug where the node can't be stopped in `end_per_group/2`,
attecting the next group ability to start one.
* Add BEAM dashboard
Also update the other dashboards by opening in Grafana v11.2.2 and ensuring they work as expected.
* Update the Erlang-Distributions-Compare dashboard
* Update the RabbitMQ-Overview dashboard
* Update the RabbitMQ-Quorum-Queues-Raft dashboard
* Update the RabbitMQ-Stream dashboard
* Update distribution link status panel
---------
Co-authored-by: Michal Kuratczyk <mkuratczyk@vmware.com>
* Add global histogram metrics for received message sizes per-protocol
fixup: add new files to bazel
fixup: expose message_size_bytes as prometheus classic histogram type
`rabbit_msg_size_metrics` does not use `seshat` any more, but
`counters` directly.
fixup: add msg_size_metrics unit test
* Improve message size histogram
1.
Avoid unnecessary time series emitted for stream protocol
The stream protocol cannot observe message sizes.
This commit ensures that the following time series are omitted:
```
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="64"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="256"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="1024"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="4096"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="16384"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="65536"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="262144"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="1048576"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="4194304"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="16777216"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="67108864"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="268435456"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="+Inf"} 0
rabbitmq_global_message_size_bytes_count{protocol="stream"} 0
rabbitmq_global_message_size_bytes_sum{protocol="stream"} 0
```
This reduces the number of time series by 15.
2.
Further reduce the number of time series by reducing the number of
buckets. Instead of 13 bucktes, emit only 9 buckets. Buckets are not
free, each is an extra time series stored.
Prior to this commit:
```
curl -s -u guest:guest localhost:15692/metrics | ag message_size | wc -l
92
```
After this commit:
```
curl -s -u guest:guest localhost:15692/metrics | ag message_size | wc -l
57
```
3.
The emitted metric should be called
`rabbitmq_message_size_bytes_bucket` instead of `rabbitmq_global_message_size_bytes_bucket`.
The latter is poor naming. There is no need to use `global` in
the metric name given that this metric doesn't exist in the old flawed
aggregated metrics.
4.
This commit simplies module `rabbit_global_counters`.
5.
Avoid garbage collecting the 10-elements list of buckets per message
being received.
---------
Co-authored-by: Péter Gömöri <peter@84codes.com>
Adds a specific clause on the
`prometheus_rabbitmq_core_metrics_collector:labels` function when the
associated metric item is a Queue + Exchange combo (`{Queue, Exchange}`)
By default Ra will use the cluster name as the metrics key. Currently
atom values are ignored by the prometheus plugin's tag rendering
functions, so if you have a QQ and Khepri running and request the
`/metrics/per-object` or `/metrics/detailed` endpoints you'll see values
that don't have labels set for the `ra_metrics` metrics:
# TYPE rabbitmq_raft_term_total counter
# HELP rabbitmq_raft_term_total Current Raft term number
rabbitmq_raft_term_total{vhost="/",queue="qq"} 9
rabbitmq_raft_term_total 10
With this change we map the name of the Ra cluster to a "raft_cluster"
tag, so instead an example metric might be:
# TYPE rabbitmq_raft_term_total counter
# HELP rabbitmq_raft_term_total Current Raft term number
rabbitmq_raft_term_total{vhost="/",queue="qq"} 9
rabbitmq_raft_term_total{raft_cluster="rabbitmq_metadata"} 10
This affects metrics for Khepri and the stream coordinator.
Collecting them on a large system (tens of thousands of processes
or more) can be time consuming as we iterate over all processes.
By putting them on a separate endpoint, we make that opt-in
Add copies of some per-object metrics that are labeled per-channel
aggregated to reduce cardinality. These metrics are valuable and
easier to process if exposed on per-exchange and per-queue basis.
We don't need to duplicate so many patterns in so many
files since we have a monorepo (and want to keep it).
If I managed to miss something or remove something that
should stay, please put it back. Note that monorepo-wide
patterns should go in the top-level .gitignore file.
Other .gitignore files are for application or folder-
specific patterns.
Part of the removal of file_handle_cache.
The Prometheus endpoint was updated but the Grafana dashboard
was not.
The FD stats are using the system's state rather than
file_handle_cache so there's no need to remove them.