[Why]
The `stream_pub_sub_metrics` test failed at least once in CI because the
`rabbitmq_stream_consumer_max_offset_lag` metric was 4 instead of the
expected 3 on line 815.
I have not been able to reproduce the problem so far.
[How]
The test case now logs the initial value of that metric at the beginning
of the test function. Hopefully this will give us some clue for the day
it fails again.
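For illustration, a minimal sketch of that logging, assuming a
hypothetical `get_metric/2` helper that scrapes the Prometheus endpoint
and parses out a single sample:
```erlang
%% Log the metric's value before the test body runs, so a future
%% failure shows whether the lag was already non-zero at the start.
log_initial_offset_lag(Config) ->
    Lag = get_metric(Config, "rabbitmq_stream_consumer_max_offset_lag"),
    ct:pal("Initial rabbitmq_stream_consumer_max_offset_lag: ~p", [Lag]),
    Lag.
```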
[Why]
It looks like `erlang_vm_dist_node_queue_size_bytes` is not always
present, even though other Erlang-specific metrics are present.
[How]
The goal is to ensure Erlang metrics are present in the output, so use
another metric that is more likely to be there.
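Roughly, the assertion becomes something like this, using an
EUnit-style assert; `Body` is the scraped /metrics payload, and the
substitute metric name below is an assumption:
```erlang
%% Check that some reliably-emitted Erlang VM metric is in the output.
?assertNotEqual(nomatch,
                binary:match(Body, <<"erlang_vm_memory_bytes_total">>)).
```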
Switch from ra_metrics to ra_counters
* Expose many more metrics (they are also up to date)
* Bump Seshat, Ra, Osiris, Prometheus.erl
* Switch from proplists to maps (see the sketch below)
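For illustration, the shape of the proplists-to-maps switch; the key
name is an assumption, not an actual Ra counter key:
```erlang
%% Before: the lookup went through a proplist.
term_from_proplist(Metrics) ->
    proplists:get_value(term, Metrics, 0).

%% After: the same lookup against a map, with a default.
term_from_map(Metrics) ->
    maps:get(term, Metrics, 0).
```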
Key changes:
- endpoint variable to handle scraping multiple endpoints
- message size panels (new metric in 4.1)
- panels at the top of the Overview dashboard should be more up to date
(they show the latest value)
- values should be accurate if multiple endpoints are scraped
(previously, many would be doubled)
- The Nodes table shows fewer columns and now shows node uptime
This avoids using Mix while compiling, which simplifies a number of
things and lets us make further build improvements later on.
Elixir is only enabled from within rabbitmq_cli currently.
EUnit is disabled since there are only Elixir tests.
Dialyzer will force-enable Elixir in order to process
Elixir-compiled beam files.
This commit also includes a few changes that are
related:
* The Erlang distribution will now be started for parallel-ct
* Many unnecessary PROJECT_MOD lines have been removed
* `eunit_formatters` has been removed, as it provides little value
* The new `maybe_flock` Erlang.mk function is used where possible
* Build test deps when testing rabbitmq_cli (Mix won't do it anymore)
* rabbitmq_ct_helpers now uses the early plugins so that Dialyzer is
  properly set up
- Modified metric expression and legend format in State of distribution links
- Changed panel type from 'flant-statusmap-panel' to 'status-history' for Process state
- Updated metric expressions to include instance filtering with {instance="$node"}
for the following metrics:
- erlang_vm_statistics_run_queues_length
- erlang_vm_statistics_dirty_io_run_queue_length
- erlang_vm_statistics_dirty_cpu_run_queue_length
- Added 'DS_PROMETHEUS' as a templated data source variable
`init_per_group/3`, which starts the broker, was already called earlier
in the function.
This fixes a bug where the node can't be stopped in `end_per_group/2`,
affecting the next group's ability to start one.
* Add BEAM dashboard
Also update the other dashboards by opening in Grafana v11.2.2 and ensuring they work as expected.
* Update the Erlang-Distributions-Compare dashboard
* Update the RabbitMQ-Overview dashboard
* Update the RabbitMQ-Quorum-Queues-Raft dashboard
* Update the RabbitMQ-Stream dashboard
* Update distribution link status panel
---------
Co-authored-by: Michal Kuratczyk <mkuratczyk@vmware.com>
* Add global histogram metrics for received message sizes per-protocol
fixup: add new files to bazel
fixup: expose message_size_bytes as prometheus classic histogram type
`rabbit_msg_size_metrics` does not use `seshat` any more, but the
`counters` module directly (sketched below).
fixup: add msg_size_metrics unit test
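For illustration, a rough sketch of keeping a histogram with the
`counters` module directly; the index layout is an assumption:
```erlang
-define(BUCKET_COUNT, 9).

%% One slot per bucket, plus a trailing slot for the running sum.
new_histogram() ->
    counters:new(?BUCKET_COUNT + 1, [write_concurrency]).

%% Bump the matched bucket and accumulate the total byte count.
observe(Ref, BucketIx, Size) ->
    counters:add(Ref, BucketIx, 1),
    counters:add(Ref, ?BUCKET_COUNT + 1, Size).
```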
* Improve message size histogram
1. Avoid unnecessary time series emitted for the stream protocol.
The stream protocol cannot observe message sizes.
This commit ensures that the following time series are omitted:
```
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="64"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="256"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="1024"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="4096"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="16384"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="65536"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="262144"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="1048576"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="4194304"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="16777216"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="67108864"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="268435456"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="+Inf"} 0
rabbitmq_global_message_size_bytes_count{protocol="stream"} 0
rabbitmq_global_message_size_bytes_sum{protocol="stream"} 0
```
This reduces the number of time series by 15.
2. Further reduce the number of time series by reducing the number of
buckets: instead of 13 buckets, emit only 9. Buckets are not free; each
one is an extra time series stored.
Prior to this commit:
```
curl -s -u guest:guest localhost:15692/metrics | ag message_size | wc -l
92
```
After this commit:
```
curl -s -u guest:guest localhost:15692/metrics | ag message_size | wc -l
57
```
3. The emitted metric should be called `rabbitmq_message_size_bytes_bucket`
instead of `rabbitmq_global_message_size_bytes_bucket`.
The latter is poor naming. There is no need to use `global` in
the metric name given that this metric doesn't exist in the old flawed
aggregated metrics.
4. This commit simplifies module `rabbit_global_counters`.
5. Avoid garbage collecting a 10-element list of buckets for every
message received (see the sketch below).
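For illustration, one way to pick a bucket without building a list per
message; the bounds below are assumptions, not the shipped boundaries:
```erlang
-define(BOUNDS, {64, 256, 1024, 4096, 16384, 65536, 262144, 1048576}).

%% Walk a constant tuple of upper bounds; no per-message allocation.
bucket_index(Size) ->
    bucket_index(Size, 1).

bucket_index(Size, Ix) when Ix =< tuple_size(?BOUNDS) ->
    case Size =< element(Ix, ?BOUNDS) of
        true  -> Ix;
        false -> bucket_index(Size, Ix + 1)
    end;
bucket_index(_Size, Ix) ->
    Ix. %% overflow (+Inf) bucket
```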
---------
Co-authored-by: Péter Gömöri <peter@84codes.com>
Adds a specific clause to the
`prometheus_rabbitmq_core_metrics_collector:labels` function for when
the associated metric item is a Queue + Exchange combo (`{Queue, Exchange}`).
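A minimal sketch of the clause's shape, assuming `labels/1` already
renders queue and exchange resources individually (other clauses
elided):
```erlang
%% Emit labels for both members of the {Queue, Exchange} tuple.
labels({Queue, Exchange}) ->
    labels(Queue) ++ labels(Exchange);
```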
By default Ra will use the cluster name as the metrics key. Currently
atom values are ignored by the prometheus plugin's tag rendering
functions, so if you have a QQ and Khepri running and request the
`/metrics/per-object` or `/metrics/detailed` endpoints you'll see values
that don't have labels set for the `ra_metrics` metrics:
```
# TYPE rabbitmq_raft_term_total counter
# HELP rabbitmq_raft_term_total Current Raft term number
rabbitmq_raft_term_total{vhost="/",queue="qq"} 9
rabbitmq_raft_term_total 10
```
With this change we map the name of the Ra cluster to a "raft_cluster"
tag, so instead an example metric might be:
```
# TYPE rabbitmq_raft_term_total counter
# HELP rabbitmq_raft_term_total Current Raft term number
rabbitmq_raft_term_total{vhost="/",queue="qq"} 9
rabbitmq_raft_term_total{raft_cluster="rabbitmq_metadata"} 10
```
This affects metrics for Khepri and the stream coordinator.
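The mapping itself could look roughly like this; the clause shape and
placement are assumptions:
```erlang
%% Render atom Ra cluster names as a raft_cluster label instead of
%% dropping them.
labels(ClusterName) when is_atom(ClusterName) ->
    [{raft_cluster, atom_to_binary(ClusterName, utf8)}];
```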
Collecting them on a large system (tens of thousands of processes
or more) can be time-consuming, as we iterate over all processes.
By putting them on a separate endpoint, we make that opt-in.
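For a sense of the cost, the collection is effectively a full scan
along these lines (simplified; not the actual collector code):
```erlang
%% Sum memory across all processes; processes may exit mid-scan.
total_process_memory() ->
    lists:foldl(fun(Pid, Acc) ->
                        case erlang:process_info(Pid, memory) of
                            {memory, M} -> Acc + M;
                            undefined   -> Acc
                        end
                end, 0, erlang:processes()).
```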
Add copies of some per-object metrics that are labeled per-channel,
aggregated to reduce cardinality. These metrics are valuable and
easier to process if exposed on a per-exchange and per-queue basis.
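Conceptually, the aggregation collapses per-channel samples onto their
queue label, along these lines (the input shape is an assumption):
```erlang
%% Sum per-channel samples into one value per queue.
aggregate_by_queue(Samples) ->
    lists:foldl(fun({_ChannelPid, QueueName, Value}, Acc) ->
                        maps:update_with(QueueName,
                                         fun(V) -> V + Value end,
                                         Value, Acc)
                end, #{}, Samples).
```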
We don't need to duplicate so many patterns in so many
files since we have a monorepo (and want to keep it).
If I managed to miss something or remove something that
should stay, please put it back. Note that monorepo-wide
patterns should go in the top-level .gitignore file.
Other .gitignore files are for application- or folder-specific
patterns.
Part of the removal of file_handle_cache.
The Prometheus endpoint was updated but the Grafana dashboard
was not.
The FD stats now use the system's state rather than
file_handle_cache, so there's no need to remove them.