56 Commits

Author SHA1 Message Date
Alexey Lebedeff 7676ed9685 Use `rabbitmq_cluster_` prefix for cluster-wide metrics 2021-11-24 16:49:43 +01:00
Alexey Lebedeff 6e3012aaf9 Add optional metrics for vhost and exchange count
These can make sense in some scenarios, e.g. when vhosts/exchanges are
created using self-service automation.
2021-11-24 11:00:41 +01:00
Alexey Lebedeff b9ebfb8980 Fix ssl port handling in prometheus plugin
All ssl options were stored in the same proplist, and the code was
then trying to determine whether an option actually belongs to ranch
ssl options or not.

Some keys landed in the wrong place, as happened in #2975:
different ports were mentioned in the listener config (the default at
top level, and a non-default one in `ssl_opts`). `ranch` and
`rabbitmq_web_dispatch` then treated this differently.

This change just moves all ranch ssl opts into their proper place using
the schema, removing any need for guessing in code.

The only downside is that advanced config compatibility is broken.
2021-10-20 14:55:33 +02:00
Michael Klishin 3826a0df25
Compile #3561 2021-10-13 01:27:16 +03:00
Johannes Würbach 84de860b4c
feat(prom): expose cluster id in identity 2021-10-12 15:43:46 +02:00
Alexey Lebedeff 989a299720 Emit identity info in prometheus /metrics/detailed endpoint
This is needed to make filtering metrics on a cluster name possible.
2021-09-28 19:35:02 +02:00
Alexey Lebedeff 5501d07b8b Use rabbitmq_ct_helpers to allocate prometheus port
This test always used the standard port 15692 before, which was causing
conflicts with e.g. a local `make run-broker`.
2021-09-22 15:23:35 +02:00
Alexey Lebedeff 4bb2262140 Allow selective querying for prometheus plugin 2021-09-20 14:59:17 +02:00
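
A minimal sketch of how a scrape URL for selective querying might be built. It assumes that the `/metrics/detailed` endpoint accepts repeated `family` and `vhost` query parameters; the host, the family name and the vhost below are illustrative assumptions, not values taken from this log.

```python
from urllib.parse import urlencode


def detailed_metrics_url(base, families, vhosts=()):
    """Build a /metrics/detailed URL that selects metric families and vhosts."""
    # One `family` (and one `vhost`) query parameter per requested value.
    params = [("family", f) for f in families] + [("vhost", v) for v in vhosts]
    query = urlencode(params)
    url = base.rstrip("/") + "/metrics/detailed"
    return url + ("?" + query if query else "")


url = detailed_metrics_url(
    "http://localhost:15692",   # hypothetical host, standard plugin port
    ["queue_coarse_metrics"],   # assumed family name
    ["my-vhost"],               # hypothetical vhost
)
print(url)
```

Requesting only the families that a dashboard needs keeps the response small, which appears to be the point of selective querying.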
dcorbacho c9305d948a
Use number of publishing channels as global publishers in amqp091 2021-06-29 08:10:42 +01:00
Gerhard Lazu c7971252cd
Global counters per protocol + protocol AND queue_type
This way we can show how many messages were received via a certain
protocol (stream is the second real protocol besides the default amqp091
one), as well as by queue type, which many have requested for a
really long time.

The most important aspect is that we can also see them by protocol AND
queue_type, which becomes very important for Streams, which have
different rules from regular queues (e.g. consuming
messages is non-destructive, and deep queue backlogs - think billions of
messages - are normal). Alerting and consumer scaling due to deep
backlogs will now work correctly, as we can distinguish between regular
queues & streams.

This has gone through a few cycles, with @mkuratczyk & @dcorbacho
covering most of the ground. @dcorbacho had most of this in
https://github.com/rabbitmq/rabbitmq-server/pull/3045, but the main
branch went through a few changes in the meantime. Rather than resolving
all the conflicts, and then making the necessary changes, we (@gerhard +
@kjnilsson) took all learnings and started re-applying a lot of the
existing code from #3045. We are confident in this approach and would
like to see it through. We continued working on this with @dumbbell, and
the most important changes are captured in
https://github.com/rabbitmq/seshat/pull/1.

We expose these global counters in rabbitmq_prometheus via a new
collector. We don't want to keep modifying the existing collector, which
grew really complex in parts, especially since we introduced
aggregation, but start with a new namespace, `rabbitmq_global_`, and
continue building on top of it. The idea is to build in parallel, and
slowly transition to the new metrics, because semantically the changes
are too big since streams, and we have been discussing protocol-specific
metrics with @kjnilsson, which makes me think that this approach is
least disruptive and... simple.

While at this, we removed redundant empty return value handling in the
channel. The function called no longer returns this.

Also removed all DONE / TODO & other comments - we'll handle them when
the time comes, no need to leave TODO reminders.

Pairs @kjnilsson @dcorbacho @dumbbell
(this is multiple commits squashed into one)

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-06-22 14:14:21 +01:00
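
The idea of counting by protocol AND queue_type can be sketched as a counter keyed by both labels. This is an illustrative Python model only, not the seshat or collector implementation; the class name, the metric names and the label values are hypothetical, and only the `rabbitmq_global_` prefix comes from the commit message.

```python
from collections import Counter


class GlobalCounters:
    """Counters keyed by (metric name, protocol, queue_type)."""

    def __init__(self):
        self._values = Counter()

    def inc(self, name, protocol, queue_type=None, by=1):
        # queue_type=None models a per-protocol counter with no queue involved.
        self._values[(name, protocol, queue_type)] += by

    def render(self, prefix="rabbitmq_global_"):
        """Render the samples as Prometheus text exposition lines."""
        lines = []
        ordered = sorted(
            self._values.items(),
            key=lambda kv: (kv[0][0], kv[0][1], kv[0][2] or ""),
        )
        for (name, protocol, queue_type), value in ordered:
            labels = f'protocol="{protocol}"'
            if queue_type is not None:
                labels += f',queue_type="{queue_type}"'
            lines.append(f"{prefix}{name}{{{labels}}} {value}")
        return lines


counters = GlobalCounters()
counters.inc("messages_received_total", "stream")                # hypothetical name
counters.inc("messages_delivered_total", "amqp091", "classic")   # hypothetical labels
for line in counters.render():
    print(line)
```

Because the queue type is a label rather than part of the metric name, an alert on backlog depth can exclude streams with a label matcher instead of needing a separate metric.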
Gerhard Lazu f3f3e8aae9
Always show aggregated auth_attempts, add detailed when per object enabled
The metrics have different names now, so we can't end up with duplicate TYPEs.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-01-22 16:38:44 +00:00
Gerhard Lazu 5a6e3f235b
Single auth_attempts declarations when per-object metrics enabled
Closes #2740

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-01-22 11:36:42 +00:00
Michael Klishin 52479099ec
Bump (c) year 2021-01-22 09:00:14 +03:00
Mirah Gary fe9881687c
Change per-object endpoint to `/metrics/per-object`.
This conforms with other http endpoints.
2020-11-26 10:35:26 +01:00
Michal Kuratczyk 8b8a66cf0b Add /metrics/per_object endpoint
Regardless of the value of `return_per_object_metrics`, this endpoint
always returns per-object metrics. This allows scraping both endpoints
at different intervals or scraping per-object metrics only during
debugging.

Co-authored-by: Mirah Gary <mgary@vmware.com>
2020-11-19 18:00:42 +01:00
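
A rough sketch of scraping both endpoints at different intervals. The intervals are assumptions chosen for illustration, and the path uses the `/metrics/per-object` spelling from the later rename. A real Prometheus server would express this as two scrape jobs rather than code.

```python
def endpoints_due(elapsed_seconds, intervals):
    """Return the endpoint paths whose scrape interval divides the elapsed time."""
    return [path for path, every in intervals.items() if elapsed_seconds % every == 0]


# Assumed intervals: aggregated metrics often, per-object metrics rarely.
INTERVALS = {"/metrics": 15, "/metrics/per-object": 60}

for tick in (15, 30, 60):
    print(tick, endpoints_due(tick, INTERVALS))
```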
Michael Klishin 898a46d7bc Switch to MPL2 2020-07-14 16:42:52 +03:00
Gerhard Lazu cab99c29f0 Add failing test for erlang_vm_dist_node_queue_size_bytes
Have to force prometheus.erl to a version that does not have this
feature, otherwise the test would succeed.

    pwd
    /Users/gerhard/github.com/rabbitmq/3.9.x/deps/rabbitmq_prometheus
    rm -fr ../prometheus.erl
    make tests
    open logs/index.html

Pull request content:

  Expose & visualise distribution buffer busy limit - zdbbl

  > This will be closed after TGIR S01E04 gets recorded.
  > The goal is to demonstrate how to do this, and then let an external contributor have a go.

  Before this patch, the **Data buffered in the distribution links queue** graph was empty.

  This is what that graph looks like after this gets applied:

  ![image](https://user-images.githubusercontent.com/3342/80223464-3bf28580-8640-11ea-8851-8f33f1c4fd4f.png)

  ## References

  - [RabbitMQ Runtime Tuning - Inter-node Communication Buffer Size](https://www.rabbitmq.com/runtime.html#distribution-buffer)
  - [erl +zdbbl](https://erlang.org/doc/man/erl.html#+zdbbl)

  Fixes #39

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2020-06-24 16:34:48 +01:00
Gerhard Lazu db2f70753e Add tests for product name & version
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2020-06-18 11:51:47 +01:00
Gerhard Lazu cba6aa06f4 Fix test that was made to fail on purpose
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2020-04-25 00:14:05 +01:00
Gerhard Lazu 9cc33c571d Print the response body by default
Makes it easier to spot why a match failed.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2020-04-25 00:06:32 +01:00
Jean-Sébastien Pédron 636c3f78dc Update copyright (year 2020) 2020-03-10 16:42:08 +01:00
Gerhard Lazu e7c997744d Improve config for returning metrics per object
Since metrics are now aggregated by default, it made more sense to use
the inverse meaning of disabling aggregation, and call it a positive and
explicit action: return_per_object_metrics.

Naming pair: @michaelklishin

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2020-02-11 13:08:00 +00:00
dcorbacho 253ef8e827 Fix 0/0 and enable all tests 2020-02-07 17:08:10 +01:00
Gerhard Lazu 09b29057af Aggregate metrics by default
Having talked to @michaelklishin we've decided to enable metrics
aggregation by default so that RabbitMQ nodes with many objects serve
the same amount of metrics quickly rather than taking many seconds and
transferring many MBs of data on every scrape.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2020-02-07 15:19:30 +00:00
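
To illustrate why aggregation keeps the response size constant, here is a minimal sketch that collapses per-object samples into one value per metric. Summation is an assumption made for illustration; the plugin may aggregate some metrics differently, and the sample data is hypothetical.

```python
from collections import defaultdict


def aggregate(samples):
    """Sum per-object samples into one value per metric name.

    `samples` is an iterable of (metric_name, labels_dict, value) tuples.
    """
    totals = defaultdict(float)
    for name, _labels, value in samples:
        # Per-object labels are dropped, so the output size no longer
        # depends on how many queues or channels exist.
        totals[name] += value
    return dict(totals)


per_object = [
    ("rabbitmq_queue_messages", {"queue": "q1"}, 10),
    ("rabbitmq_queue_messages", {"queue": "q2"}, 5),
    ("rabbitmq_queue_consumers", {"queue": "q1"}, 2),
]
print(aggregate(per_object))
```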
Gerhard Lazu 11d676f3e1 Replace histogram type with gauge for raft_entry_commit_latency_seconds
We want to keep the same metric type regardless of whether we aggregate
or not. If we had used a histogram type, considering the ~12 buckets that
we added, it would have meant 12 extra metrics per queue which would
have resulted in an explosion of metrics. Keeping the gauge type and
aggregating latencies across all members.

re https://github.com/rabbitmq/rabbitmq-prometheus/pull/28

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2020-02-06 17:37:37 +00:00
dcorbacho 06186065b4 Option to aggregate channel, queue and connection metrics
`prometheus.enable_metric_aggregation = true`

rabbitmq-prometheus#26
2020-01-10 16:35:50 +01:00
Gerhard Lazu 89efb964d9 Convert raft_entry_commit_latency to seconds & be explicit about unit
This is a follow-up to https://github.com/rabbitmq/ra/pull/160

Had to introduce mf_convert/3 so that METRICS_REQUIRING_CONVERSIONS
proplist does not clash with METRICS_RAW proplists that have the same
number of elements. This is begging to be refactored, but I know that
@dcorbacho is working on https://github.com/rabbitmq/rabbitmq-prometheus/issues/26

Also modified the RabbitMQ-Quorum-Queues-Raft dashboard

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2020-01-07 16:20:59 +00:00
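
The conversion step can be pictured as a lookup table that maps a raw metric to an exported name and a divisor. This is a Python sketch of the idea only, not the Erlang `mf_convert/3` code; the millisecond source unit and the table layout are assumptions.

```python
# Assumed conversion table: raw metric name -> (exported name, divisor).
# The millisecond source unit is an assumption made for illustration.
CONVERSIONS = {
    "raft_entry_commit_latency": ("raft_entry_commit_latency_seconds", 1000),
}


def convert(name, value):
    """Return (exported_name, value), converting units where a rule exists."""
    if name in CONVERSIONS:
        exported, divisor = CONVERSIONS[name]
        return exported, value / divisor
    # Metrics without a rule pass through unchanged.
    return name, value


print(convert("raft_entry_commit_latency", 250))
```

Putting the unit in the exported name makes the conversion visible to anyone reading a dashboard query.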
Gerhard Lazu b8893afcde Add auto-generated test rabbitmq_management.schema 2019-12-03 11:27:26 +00:00
Michael Klishin 3aed601336 A typo 2019-11-26 12:25:08 +03:00
Michael Klishin 6e17eeb3c5 Update this test to use a consumer in a separate process 2019-11-26 12:23:40 +03:00
Gerhard Lazu f550aa0706 Fix queue_metrics references
Some properties had queue_ appended, while others used messages_ instead
of message_. This meant that metrics such as rabbitmq_queue_consumers
were not reported correctly, as captured in https://github.com/rabbitmq/rabbitmq-prometheus/issues/9#issuecomment-558233464

The test needs fixing before this can be merged; it is currently failing with:

    $ make ct-rabbit_prometheus_http t=with_metrics:metrics_test
    == rabbit_prometheus_http_SUITE ==

      * [with_metrics]

    rabbit_prometheus_http_SUITE > with_metrics
        {error,
            {shutdown,
                {gen_server,call,
                    [<0.245.0>,
                     {call,
                         {'basic.cancel',<<"amq.ctag-uHUunE5EoozMKYG8Bf6s1Q">>,
                             false},
                         none,<0.252.0>},
                     infinity]}}}

Closes #19
2019-11-26 07:47:22 +00:00
Michael Klishin b70a8da7f0 Expose endpoint path configuration, references #8 2019-09-26 13:39:21 +03:00
Michael Klishin b03dfa2dd2 New style configuration schema for listeners
Closes #8.
2019-09-26 13:08:36 +03:00
Gerhard Lazu 2b73981ab1 Fix build & identity info metrics
Improve pattern matching used in tests so that we don't match partial
metric names.

[#167846096]
2019-09-04 13:21:50 +01:00
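
A sketch of the kind of anchored match that avoids partial metric names. The actual test suite is written in Erlang; this Python helper and its name are hypothetical and only illustrate the technique.

```python
import re


def has_metric(body, name):
    """True if `body` has a sample line for exactly `name`, not a longer name."""
    # After the name, only a label set `{` or whitespace before the value may follow.
    pattern = re.compile(r"^" + re.escape(name) + r"(\{|\s)", re.MULTILINE)
    return pattern.search(body) is not None


body = (
    'rabbitmq_build_info{rabbitmq_version="3.8.0"} 1\n'
    'rabbitmq_identity_info{node="rabbit@host"} 1\n'
)
print(has_metric(body, "rabbitmq_build_info"))
print(has_metric(body, "rabbitmq_build"))
```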
Gerhard Lazu 5781130b61 Use the correct metric types & capture perspective when naming
Some metrics were of type gauge while they should have been of type
counter. Thanks @brian-brazil for making the distinction clear. This is
now captured as a comment above the metric definitions.

Because all metrics are from RabbitMQ's perspective, cached for up to 5
seconds by default (configurable), we prepend `rabbitmq_` to all metrics
emitted by this collector. While some metrics are for Erlang (erlang_),
Mnesia (schema_db_) or the System (io_), they are all observed & cached
by RabbitMQ, hence the prefix.

This is the last PR which started in the context of prometheus/docs#1414

[#167846096]
2019-09-04 11:49:48 +01:00
Gerhard Lazu aafc4c026b Revert erlang_uptime_seconds to gauge, not counter
We care about its value rather than the rate of change.

[#167846096]
2019-09-03 19:56:59 +01:00
Gerhard Lazu fbc945f710 Convert all time metrics to seconds
This started in the context of prometheus/docs#1414, specifically
https://github.com/prometheus/docs/pull/1414#issuecomment-524250746

[#167846096]
2019-09-03 17:17:50 +01:00
Gerhard Lazu 98e488f1c4 Use standard naming for metrics expected from the client library
As described in
https://prometheus.io/docs/instrumenting/writing_clientlibs/#process-metrics.

Until prometheus.erl has the prometheus_process_collector functionality
built in (this may not happen), we are exposing a subset of those
metrics via rabbitmq_core_metrics_collector, so we are going to stick to
the expected naming conventions.

This commit supersedes the thought process captured in
1e5f4de4cb

[#167846096]
2019-09-03 15:31:55 +01:00
Gerhard Lazu 1e5f4de4cb Rename process-related metrics to stay closer to conventions
While `process_open_fds` would have been ideal, because the value is
cached within RabbitMQ, and computed differently across platforms, it is
important to keep the distinction from, say, what the kernel reports
just-in-time.

I am also capturing the Erlang context by adding `erlang_` to the
relevant metrics. The full context is: RabbitMQ observed this Erlang VM
process metric to be X, so this is why some metrics are prefixed with
`rabbitmq_erlang_process_`

This matters because there is a difference between what RabbitMQ limits are set to,
e.g. `rabbitmq_memory_used_limit_bytes`, vs. what RabbitMQ reports about
the Erlang process, e.g. `rabbitmq_erlang_process_memory_used_bytes`.

This is the best that we can do while staying honest about what is being
reported. cc @brian-brazil

[#167846096]
2019-09-03 12:30:48 +01:00
Gerhard Lazu 2e686f1131 Continue updating RabbitMQ-Overview dashboard to use the new info metric
[#167846096]
2019-08-27 17:11:41 +01:00
Gerhard Lazu e2be7193ff Use a higher config_port when testing
Otherwise it will clash with docker-compose-overview.yml ports
2019-08-15 16:40:19 +01:00
Gerhard Lazu 052d92c74b Replace global labels with build_info & identity_info metrics
This started in the context of prometheus/docs#1414, specifically
https://github.com/prometheus/docs/pull/1414#issuecomment-520505757

Rather than labelling all metrics with the same label, we are
introducing 2 new metrics: rabbitmq_build_info & rabbitmq_identity_info.

I suspect that we may want to revert deadtrickster/prometheus.erl#91
when we agree that the proposed alternative is better.

We are yet to see through changes in Grafana dashboards. I am most
interested in what the updated queries will look like and, more
importantly, if we will have the same panels as we do now. More commits
to follow shortly, wanted to get this out the door first.

In summary, this commit changes:

    # TYPE erlang_mnesia_held_locks gauge
    # HELP erlang_mnesia_held_locks Number of held locks.
    erlang_mnesia_held_locks{node="rabbit@920f1e3272af",cluster="rabbit@920f1e3272af",rabbitmq_version="3.8.0-alpha.806",erlang_version="22.0.7"} 0
    # TYPE erlang_mnesia_lock_queue gauge
    # HELP erlang_mnesia_lock_queue Number of transactions waiting for a lock.
    erlang_mnesia_lock_queue{node="rabbit@920f1e3272af",cluster="rabbit@920f1e3272af",rabbitmq_version="3.8.0-alpha.806",erlang_version="22.0.7"} 0
    ...

To this:

    # TYPE erlang_mnesia_held_locks gauge
    # HELP erlang_mnesia_held_locks Number of held locks.
    erlang_mnesia_held_locks 0
    # TYPE erlang_mnesia_lock_queue gauge
    # HELP erlang_mnesia_lock_queue Number of transactions waiting for a lock.
    erlang_mnesia_lock_queue 0
    ...
    # TYPE rabbitmq_build_info untyped
    # HELP rabbitmq_build_info RabbitMQ & Erlang/OTP version info
    rabbitmq_build_info{rabbitmq_version="3.8.0-alpha.809",prometheus_plugin_version="3.8.0-alpha.809-2019.08.15",prometheus_client_version="4.4.0",erlang_version="22.0.7"} 1
    # TYPE rabbitmq_identity_info untyped
    # HELP rabbitmq_identity_info Node & cluster identity info
    rabbitmq_identity_info{node="rabbit@bc7aeb0c2564",cluster="rabbit@bc7aeb0c2564"} 1
    ...

[#167846096]
2019-08-15 16:00:29 +01:00
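
With identity carried by a separate info metric, consumers attach the labels at query time. In PromQL this is typically a join such as `* on(instance) group_left(node, cluster) rabbitmq_identity_info`. The sketch below mimics that join in Python under the assumption that every sample carries the `instance` label added by the Prometheus server at scrape time; the function name and sample data are hypothetical.

```python
def join_identity(samples, identity_info):
    """Attach node/cluster labels from identity info samples to other samples.

    Both inputs are lists of (labels_dict, value); samples are matched on
    the `instance` label.
    """
    by_instance = {labels["instance"]: labels for labels, _ in identity_info}
    joined = []
    for labels, value in samples:
        info = by_instance.get(labels["instance"], {})
        extra = {k: info[k] for k in ("node", "cluster") if k in info}
        joined.append(({**labels, **extra}, value))
    return joined


identity = [({"instance": "a:15692", "node": "rabbit@a", "cluster": "c1"}, 1)]
held_locks = [({"instance": "a:15692"}, 0)]
print(join_identity(held_locks, identity))
```

The info metric always has the value 1, so multiplying by it in PromQL leaves the original sample value unchanged.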
Gerhard Lazu 4aa3871194 Use different names for *_process_reductions_total metrics
It is invalid to have multiple metrics with the same name, TYPE & HELP,
but differing labels.

[#167846096]
2019-08-14 16:17:48 +01:00
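
A small check for the problem described above: the same metric name declared more than once in a single exposition. This is a hypothetical helper that assumes the usual `# TYPE <name> <type>` line shape; the metric names in the example are made up.

```python
from collections import Counter


def duplicate_type_declarations(exposition):
    """Return metric names that have more than one `# TYPE` line."""
    names = []
    for line in exposition.splitlines():
        parts = line.split()
        # A TYPE line looks like: "# TYPE <metric_name> <type>"
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            names.append(parts[2])
    return sorted(name for name, count in Counter(names).items() if count > 1)


text = """# TYPE a_total counter
a_total{kind="x"} 1
# TYPE a_total counter
a_total{kind="y"} 2
# TYPE b_total counter
b_total 3
"""
print(duplicate_type_declarations(text))
```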
Gerhard Lazu 75ecd6af1d Fix test that fails when the metric is empty 2019-08-14 12:44:29 +01:00
Gerhard Lazu e218ea5ea2 Reorder elements in metric names & improve naming
bytes / packets must come before _total

Explain the element order difference in TOTALS vs the metrics above

[#167846096]
2019-08-13 19:15:12 +01:00
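
The ordering rule (unit before `_total`) can be expressed as a simple name check. The helper, the list of units and the example metric names are hypothetical and only illustrate the rule.

```python
UNITS = ("bytes", "packets", "seconds")


def unit_before_total(name):
    """Check that a unit, if present alongside `total`, sits right before it."""
    parts = name.split("_")
    if "total" not in parts:
        return True
    if parts[-1] != "total":
        return False  # `total` must be the final element
    units_present = [p for p in parts if p in UNITS]
    # Either no unit at all, or the unit is the element just before `total`.
    return not units_present or parts[-2] in UNITS


for candidate in (
    "rabbitmq_connection_incoming_bytes_total",   # hypothetical name
    "rabbitmq_connection_incoming_total_bytes",   # hypothetical name
):
    print(candidate, unit_before_total(candidate))
```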
Gerhard Lazu f1043134f4 Fix test failure message description 2019-08-08 16:58:01 +01:00
Diana Corbacho 82d858719d Label all metrics with Erlang and RMQ version
[#166413229]
2019-08-05 09:40:51 +01:00
Gerhard Lazu 805dd5e3b2 Enable quorum queue feature flag when running e2e metrics tests
Otherwise tests will fail in CI:
https://ci.rabbitmq.com/teams/main/pipelines/server-release:v3.8.x-mixed-versions/jobs/test-rabbitmq-prometheus/builds/20

Remove unused rabbit_ct_client_helpers imports
2019-06-26 11:03:28 +01:00
Gerhard Lazu 5e280c0281 Add first version of RabbitMQ Raft metrics
Depends on https://github.com/rabbitmq/ra/tree/metrics_tweaks &
https://github.com/rabbitmq/rabbitmq-server/tree/qq_metrics_tweak

[#166819045]
2019-06-20 20:11:31 +01:00
Gerhard Lazu 2645082738 Finish Erlang Distribution Grafana dashboard
Includes Erlang node to colour pinning

Adds a few make targets to help with docker-compose repetitive commands
& Grafana dashboard updates.

Split Overview & Distribution Docker deployments

re deadtrickster/prometheus.erl#92

[finishes #166004512]
2019-05-29 18:19:09 +01:00