All SSL options were stored in the same proplist, and the code then
tried to determine whether an option actually belonged to Ranch SSL
options or not.
As a result, some keys landed in the wrong place, as happened in #2975:
different ports were mentioned in the listener config (the default one
at top level, and a non-default one in `ssl_opts`), and `ranch` and
`rabbitmq_web_dispatch` then treated this differently.
This change moves all Ranch SSL options into their proper place using
the schema, removing any need for guessing in code.
The only downside is that advanced config compatibility is broken.
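To make the failure mode concrete, here is a hypothetical `advanced.config` fragment of the shape that caused #2975 (the app and option layout are illustrative, not the exact config from that issue): a port at the listener's top level and a conflicting one nested in `ssl_opts`, with each consumer free to pick a different one.

```erlang
%% Hypothetical illustration only: two places can name a port, and
%% before this change the code had to guess which one ranch vs
%% rabbitmq_web_dispatch should honour.
[
 {rabbitmq_management, [
   {listener, [
     {port, 15671},                  %% top-level (default) port
     {ssl,  true},
     {ssl_opts, [
       {port,       15672},          %% conflicting nested port
       {cacertfile, "/path/to/ca_certificate.pem"},
       {certfile,   "/path/to/server_certificate.pem"},
       {keyfile,    "/path/to/server_key.pem"}
     ]}
   ]}
 ]}
].
```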
This way we can show how many messages were received via a certain
protocol (stream is the second real protocol besides the default amqp091
one), as well as by queue type, which is something that many have asked
for, for a really long time.
The most important aspect is that we can also see them by protocol AND
queue_type, which becomes very important for streams, which have
different rules from regular queues (for example, consuming messages is
non-destructive, and deep queue backlogs - think billions of messages -
are normal). Alerting and consumer scaling due to deep backlogs will now
work correctly, as we can distinguish between regular queues & streams.
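To make the shape concrete, a scrape could then distinguish series like these (the metric name, label combinations, and values below are illustrative, not the final names defined by the new collector):

```
rabbitmq_global_messages_received_total{protocol="amqp091",queue_type="classic"} 1200
rabbitmq_global_messages_received_total{protocol="stream",queue_type="stream"} 98000000
```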
This has gone through a few cycles, with @mkuratczyk & @dcorbacho
covering most of the ground. @dcorbacho had most of this in
https://github.com/rabbitmq/rabbitmq-server/pull/3045, but the main
branch went through a few changes in the meantime. Rather than resolving
all the conflicts and then making the necessary changes, we (@gerhard +
@kjnilsson) took all the learnings and re-applied a lot of the existing
code from #3045. We are confident in this approach and would
like to see it through. We continued working on this with @dumbbell, and
the most important changes are captured in
https://github.com/rabbitmq/seshat/pull/1.
We expose these global counters in rabbitmq_prometheus via a new
collector. We don't want to keep modifying the existing collector, which
grew really complex in parts, especially since we introduced
aggregation; instead we start with a new namespace, `rabbitmq_global_`,
and continue building on top of it. The idea is to build in parallel and
slowly transition to the new metrics, because the semantics have changed
too much since streams. We have also been discussing protocol-specific
metrics with @kjnilsson, which makes me think that this approach is the
least disruptive and... simple.
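As a rough sketch of what such a dedicated collector looks like in prometheus.erl terms (the module name, metric name, and values here are illustrative, not the actual implementation):

```erlang
-module(rabbitmq_global_collector_sketch).
-behaviour(prometheus_collector).

-export([deregister_cleanup/1, collect_mf/2]).

deregister_cleanup(_Registry) -> ok.

%% Emit one metric family per global counter. The real collector would
%% read the counters maintained by seshat; we hard-code a sample value.
collect_mf(_Registry, Callback) ->
    Samples = [{[{protocol, "amqp091"}, {queue_type, "classic"}], 1200}],
    Callback(prometheus_model_helpers:create_mf(
               rabbitmq_global_messages_received_total,
               "Total number of messages received (illustrative)",
               counter,
               Samples)),
    ok.
```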
While at it, we removed redundant empty return value handling in the
channel; the function being called no longer returns that value.
Also removed all DONE / TODO & other comments - we'll handle them when
the time comes, no need to leave TODO reminders.
Pairs: @kjnilsson, @dcorbacho, @dumbbell
(this is multiple commits squashed into one)
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Regardless of the value of `return_per_object_metrics`, this endpoint
always returns per-object metrics. This allows scraping both endpoints
at different intervals or scraping per-object metrics only during
debugging.
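For example (the per-object path and port below are assumptions for illustration; use whichever path this endpoint is mounted on):

```sh
curl -s localhost:15692/metrics            # honours return_per_object_metrics
curl -s localhost:15692/metrics/per-object # always per-object
```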
Co-authored-by: Mirah Gary <mgary@vmware.com>
We have to force prometheus.erl to a version that does not have this
feature; otherwise the test would succeed:
```sh
pwd
/Users/gerhard/github.com/rabbitmq/3.9.x/deps/rabbitmq_prometheus

rm -fr ../prometheus.erl
make tests
open logs/index.html
```
Pull request content:
Expose & visualise distribution buffer busy limit - zdbbl
> This will be closed after TGIR S01E04 gets recorded.
> The goal is to demonstrate how to do this, and then let an external contributor have a go.
Before this patch, the **Data buffered in the distribution links queue** graph was empty.
This is what that graph looks like after this gets applied:
[screenshot: the populated distribution buffer graph]
## References
- [RabbitMQ Runtime Tuning - Inter-node Communication Buffer Size](https://www.rabbitmq.com/runtime.html#distribution-buffer)
- [erl +zdbbl](https://erlang.org/doc/man/erl.html#+zdbbl)
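For the curious, the runtime exposes the configured limit directly; note the unit difference (a quick illustration, not part of the diff):

```erlang
%% +zdbbl is given in kilobytes on the command line (e.g. erl +zdbbl 128000),
%% but the runtime reports the limit in bytes:
Limit = erlang:system_info(dist_buf_busy_limit).
```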
Fixes #39
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Since metrics are now aggregated by default, it made more sense to
invert the meaning of the option that used to disable aggregation and
name it as a positive, explicit action: return_per_object_metrics.
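In rabbitmq.conf terms (assuming the key is exposed under the plugin's `prometheus.` schema prefix):

```
prometheus.return_per_object_metrics = true
```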
Naming pair: @michaelklishin
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Having talked to @michaelklishin, we've decided to enable metrics
aggregation by default, so that RabbitMQ nodes with many objects serve
the same number of metrics quickly, rather than taking many seconds and
transferring many MBs of data on every scrape.
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
We want to keep the same metric type regardless of whether we aggregate
or not. If we had used a histogram type, then, given the ~12 buckets
that we added, it would have meant 12 extra metrics per queue, resulting
in an explosion of metrics (for example, 10,000 queues would yield
~120,000 extra time series). So we keep the gauge type and aggregate
latencies across all members.
re https://github.com/rabbitmq/rabbitmq-prometheus/pull/28
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
This is a follow-up to https://github.com/rabbitmq/ra/pull/160
Had to introduce mf_convert/3 so that the METRICS_REQUIRING_CONVERSIONS
proplist does not clash with METRICS_RAW proplists that have the same
number of elements. This is begging to be refactored, but I know that
@dcorbacho is working on https://github.com/rabbitmq/rabbitmq-prometheus/issues/26
Also modified the RabbitMQ-Quorum-Queues-Raft dashboard.
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Some properties had queue_ appended, while others used messages_ instead
of message_. This meant that metrics such as rabbitmq_queue_consumers
were not reported correctly, as captured in https://github.com/rabbitmq/rabbitmq-prometheus/issues/9#issuecomment-558233464
The test needs fixing before this can be merged; it's currently failing with:
```
$ make ct-rabbit_prometheus_http t=with_metrics:metrics_test
== rabbit_prometheus_http_SUITE ==
* [with_metrics]
rabbit_prometheus_http_SUITE > with_metrics
{error,
 {shutdown,
  {gen_server,call,
   [<0.245.0>,
    {call,
     {'basic.cancel',<<"amq.ctag-uHUunE5EoozMKYG8Bf6s1Q">>,false},
     none,<0.252.0>},
    infinity]}}}
```
Closes #19
Some metrics were of type gauge when they should have been of type
counter. Thanks @brian-brazil for making the distinction clear. This is
now captured as a comment above the metric definitions.
Because all metrics are from RabbitMQ's perspective, cached for up to 5
seconds by default (configurable), we prepend `rabbitmq_` to all metrics
emitted by this collector. While some metrics are for Erlang (erlang_),
Mnesia (schema_db_) or the system (io_), they are all observed & cached
by RabbitMQ, hence the prefix.
This is the last PR which started in the context of prometheus/docs#1414
[#167846096]
As described in
https://prometheus.io/docs/instrumenting/writing_clientlibs/#process-metrics.
Until prometheus.erl has the prometheus_process_collector functionality
built in (this may not happen), we are exposing a subset of those
metrics via rabbitmq_core_metrics_collector, so we are going to stick to
the expected naming conventions.
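For reference, the conventional names from that document include the following (which subset we expose is defined by the collector itself):

```
process_open_fds
process_max_fds
process_resident_memory_bytes
process_virtual_memory_bytes
```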
This commit supersedes the thought process captured in
1e5f4de4cb
[#167846096]
While `process_open_fds` would have been the ideal name, the value is
cached within RabbitMQ and computed differently across platforms, so it
is important to keep it distinct from, say, what the kernel reports
just-in-time.
I am also capturing the Erlang context by adding `erlang_` to the
relevant metrics. The full context is: RabbitMQ observed this Erlang VM
process metric to be X. This is why some metrics are prefixed with
`rabbitmq_erlang_process_`.
Because there is a difference between what RabbitMQ limits are set to,
e.g. `rabbitmq_memory_used_limit_bytes`, vs. what RabbitMQ reports about
the Erlang process, e.g. `rabbitmq_erlang_process_memory_used_bytes`.
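Side by side, with made-up values (the metric names come straight from the sentence above):

```
rabbitmq_memory_used_limit_bytes 6442450944
rabbitmq_erlang_process_memory_used_bytes 97620640
```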
This is the best that we can do while staying honest about what is being
reported. cc @brian-brazil
[#167846096]
This started in the context of prometheus/docs#1414, specifically
https://github.com/prometheus/docs/pull/1414#issuecomment-520505757
Rather than attaching the same set of labels to every metric, we are
introducing 2 new metrics: rabbitmq_build_info & rabbitmq_identity_info.
I suspect that we may want to revert deadtrickster/prometheus.erl#91
when we agree that the proposed alternative is better.
We have yet to see the Grafana dashboard changes through. I am most
interested in what the updated queries will look like and, more
importantly, whether we will have the same panels as we do now. More
commits to follow shortly; wanted to get this out the door first.
In summary, this commit changes this:

```
# TYPE erlang_mnesia_held_locks gauge
# HELP erlang_mnesia_held_locks Number of held locks.
erlang_mnesia_held_locks{node="rabbit@920f1e3272af",cluster="rabbit@920f1e3272af",rabbitmq_version="3.8.0-alpha.806",erlang_version="22.0.7"} 0
# TYPE erlang_mnesia_lock_queue gauge
# HELP erlang_mnesia_lock_queue Number of transactions waiting for a lock.
erlang_mnesia_lock_queue{node="rabbit@920f1e3272af",cluster="rabbit@920f1e3272af",rabbitmq_version="3.8.0-alpha.806",erlang_version="22.0.7"} 0
...
```

To this:

```
# TYPE erlang_mnesia_held_locks gauge
# HELP erlang_mnesia_held_locks Number of held locks.
erlang_mnesia_held_locks 0
# TYPE erlang_mnesia_lock_queue gauge
# HELP erlang_mnesia_lock_queue Number of transactions waiting for a lock.
erlang_mnesia_lock_queue 0
...
# TYPE rabbitmq_build_info untyped
# HELP rabbitmq_build_info RabbitMQ & Erlang/OTP version info
rabbitmq_build_info{rabbitmq_version="3.8.0-alpha.809",prometheus_plugin_version="3.8.0-alpha.809-2019.08.15",prometheus_client_version="4.4.0",erlang_version="22.0.7"} 1
# TYPE rabbitmq_identity_info untyped
# HELP rabbitmq_identity_info Node & cluster identity info
rabbitmq_identity_info{node="rabbit@bc7aeb0c2564",cluster="rabbit@bc7aeb0c2564"} 1
...
```
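Dashboards can then re-attach the identity labels where needed with a standard info-metric join (a sketch only; the join key depends on how the targets are scraped):

```
erlang_mnesia_held_locks * on (instance) group_left (node, cluster) rabbitmq_identity_info
```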
[#167846096]
Includes Erlang node-to-colour pinning.
Adds a few make targets to help with repetitive docker-compose commands
& Grafana dashboard updates.
Split Overview & Distribution Docker deployments
re deadtrickster/prometheus.erl#92
[finishes #166004512]