This is a follow-up to https://github.com/rabbitmq/ra/pull/160
Had to introduce mf_convert/3 so that the METRICS_REQUIRING_CONVERSIONS
proplist does not clash with METRICS_RAW proplists that have the same
number of elements. This is begging to be refactored, but I know that
@dcorbacho is working on https://github.com/rabbitmq/rabbitmq-prometheus/issues/26
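The actual change is in Erlang; as a rough, hypothetical Python sketch of the idea (the metric name and the conversion are invented for illustration), keying conversions by metric name avoids relying on the shape of the proplist entries:

```python
# Hypothetical sketch: dispatch conversions by metric name rather than by
# the shape/arity of the entry, so a raw metric and a metric requiring
# conversion can have the same number of elements without clashing.

METRICS_REQUIRING_CONVERSIONS = {
    # invented example: stored in microseconds, emitted in seconds
    "io_read_time": lambda micros: micros / 1_000_000,
}

def mf_convert(name, value):
    """Apply a registered conversion, or pass the raw value through."""
    convert = METRICS_REQUIRING_CONVERSIONS.get(name)
    return convert(value) if convert is not None else value

print(mf_convert("io_read_time", 2_500_000))  # 2.5
print(mf_convert("queue_messages", 42))       # 42
```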
Also modified the RabbitMQ-Quorum-Queues-Raft dashboard
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Grafana will keep failing with the following error message otherwise:
failed to load dashboard from /dashboards/__inputs.json Dashboard title cannot be empty
It still puts a significant load on the host, but any lower and we won't
see any change in the Uncommitted log entries graph, and too little
variation in the Log entry commit latency.
Well, almost. flat-statusmap-panel v0.1.1 breaks on Grafana v6.5.0.
Since it's already been mentioned in
https://github.com/flant/grafana-statusmap/issues/76 for a different
reason, let's wait until this is addressed.
Most keys load fine, but if one doesn't, everything fails. The package
will still verify OK even if we have just a subset of keys installed, so
be more permissive.
Now:
autocomplete ac | Configure shell for autocompletion - eval "$(gmake autocomplete)"
clean-docker cd | Clean all Docker containers & volumes
cto cto | Interact with all containers via a top-like utility
dist-tls dt | Make Erlang-Distribution panels come alive - HIGH LOAD
docker-image di | Build & push Docker image to Docker Hub
docker-image-build dib | Build Docker image locally - make tests
docker-image-bump diu | Bump Docker image version across all docker-compose-* files
docker-image-push dip | Push local Docker image to Docker Hub
docker-image-run dir | Run container with local Docker image
dockerhub-login dl | Login to Docker Hub as pivotalrabbitmq
down d | Stop all containers
find-latest-otp flo | Find latest OTP version archive + sha1
metrics m | Run all metrics containers
overview o | Make RabbitMQ Overview panels come alive
preview-readme pre | Preview README & live reload on edit
qq | Make RabbitMQ-Quorum-Queues-Raft panels come alive - HIGH LOAD
Before:
-------------------------------------------------------------------------------------------------
autocomplete ac | Configure shell for autocompletion - eval "$(gmake autocomplete)"
-------------------------------------------------------------------------------------------------
clean-docker cd | Clean all Docker containers & volumes
-------------------------------------------------------------------------------------------------
cto cto | Interact with all containers via a top-like utility
-------------------------------------------------------------------------------------------------
dockerhub-login dl | Login to Docker Hub as pivotalrabbitmq
-------------------------------------------------------------------------------------------------
docker-image di | Build & push Docker image to Docker Hub
-------------------------------------------------------------------------------------------------
docker-image-build dib | Build Docker image locally - make tests
-------------------------------------------------------------------------------------------------
docker-image-bump diu | Bump Docker image version across all docker-compose-* files
-------------------------------------------------------------------------------------------------
docker-image-push dip | Push local Docker image to Docker Hub
-------------------------------------------------------------------------------------------------
docker-image-run dir | Run container with local Docker image
-------------------------------------------------------------------------------------------------
down d | Stop all containers
-------------------------------------------------------------------------------------------------
find-latest-otp flo | Find latest OTP version archive + sha1
-------------------------------------------------------------------------------------------------
metrics m | Run all metrics containers
-------------------------------------------------------------------------------------------------
overview o | Make RabbitMQ Overview panels come alive
-------------------------------------------------------------------------------------------------
dist-tls dt | Make Erlang-Distribution panels come alive - HIGH LOAD
-------------------------------------------------------------------------------------------------
qq | Make RabbitMQ-Quorum-Queues-Raft panels come alive - HIGH LOAD
-------------------------------------------------------------------------------------------------
Some properties had queue_ appended, while others used messages_ instead
of message_. This meant that metrics such as rabbitmq_queue_consumers
were not reported correctly, as captured in https://github.com/rabbitmq/rabbitmq-prometheus/issues/9#issuecomment-558233464
The test needs fixing before this can be merged; it's currently failing with:
$ make ct-rabbit_prometheus_http t=with_metrics:metrics_test
== rabbit_prometheus_http_SUITE ==
* [with_metrics]
rabbit_prometheus_http_SUITE > with_metrics
{error,
{shutdown,
{gen_server,call,
[<0.245.0>,
{call,
{'basic.cancel',<<"amq.ctag-uHUunE5EoozMKYG8Bf6s1Q">>,
false},
none,<0.252.0>},
infinity]}}}
Closes #19
It captures the Quorum-Queues Raft, so let's be specific, especially
since we know that there will be other Raft implementations in RabbitMQ,
not just Quorum Queues.
[#166926415]
It is essential to know which RabbitMQ & Erlang/OTP version the cluster
is running, as well as how many nodes there are in the cluster. We now
have a table which lists this information, right under all singlestat
panels.
The singlestat panels have been re-organized to make room for 2 new
ones: Nodes & Publishers. Classic & Quorum Queues would be great to
have, as would VHosts. The last singlestats that I would add are Alarms
& Partitions. This would bring the total number of singlestat panels to
14 (we currently have 10). While 14 feels overwhelming, it captures all
the important information that I believe is worth knowing about any
RabbitMQ cluster.
All message-related sections now display 2 graph panels instead of 3.
While 3 panels look good on 27" screens, they don't work as well on 15"
screens, which is what the majority will be using. Also the 3rd panel
would always be for anti-pattern graphs (e.g. unroutable messages,
polling operations, etc.) and would be mostly empty in the majority of
cases. Fitting fewer panels per row not only helps with focusing on and
understanding what is being displayed, but also makes it easier to
compare when viewing 2 panels side-by-side, on 27" screens. Nodes &
churn sections still have 3 panels, which works well when 1 panel is
more important than the others. The compromise that we need to make is
between giving enough horizontal space to equally important panels vs
making the dashboard page too long. RabbitMQ-Overview has always been a
comprehensive dashboard which captures a lot of information; it was
always tough balancing the important vs the complete.
[finishes #167836027]
9.313226 GiB is a lot harder to read than 9.31 GiB, and therefore less
useful. Observing other people use this made it obvious that limiting
the precision was the human-friendly thing to do.
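In Grafana this is a panel decimals/unit setting; as a minimal Python sketch of the same formatting idea (the helper name is mine):

```python
def human_bytes(value, precision=2):
    """Render a byte count with limited precision, e.g. '9.31 GiB'."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    for unit in units:
        # stop at the first unit that keeps the value below 1024
        if abs(value) < 1024 or unit == units[-1]:
            return f"{value:.{precision}f} {unit}"
        value /= 1024

print(human_bytes(9.313226 * 1024**3))  # '9.31 GiB'
print(human_bytes(512))                 # '512.00 B'
```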
* explains source of metrics via row names
* makes tables slightly wider to mitigate long names line wrapping
* do not limit entries in tables, since refresh resets table pagination
[finishes #168734621]
The yardstick for all Grafana dashboards should be 1920 x 1200, the
screen format most common in our team. If the dashboards look good on
our screens, they will look good on other screens too. Smaller
resolutions won't look too crammed, and bigger resolutions can be split
in half (e.g. 27" iMacs).
Some take-aways from optimising the layout of this dashboard:
* limit horizontal graph panels to 3
* limit horizontal panels to 2 if the information is dense (e.g. table + graph)
* use the same width for graph panels that need comparing, stack vertically
To get an import-friendly RabbitMQ Overview dashboard, run the following
command:
make RabbitMQ-Overview.json
On macOS, to send this output to clipboard:
make RabbitMQ-Overview.json | pbcopy
This is the preferred alternative to
9aa22e1895
See
dae49b5c08
for more context. cc @mkuratczyk
This commit introduces a few other somewhat related changes:
* BASH autocompletion for make targets - make ac
* descriptions for all custom targets - make h
* continuous feedback loops for ac & h targets - make CFac
I would really like to see some of the above features be part of
erlang.mk. What do you think @essen? Anything in particular that you
would like me to PR?
@dumbbell, my other Make partner-in-crime, may be interested in
discussing the above ;)
== LINKS
* https://medium.com/@lavieenroux20/how-to-win-friends-influence-people-and-autocomplete-makefile-targets-e6cd228d856d
* https://github.com/Bash-it/bash-it/blob/master/completion/available/makefile.completion.bash
While __inputs are required for the dashboards to work in environments
where Prometheus is not the default datasource, it breaks the local
development flow. In other words,
9aa22e1895
prevents `make metrics overview` from working as designed.
We are shortly going to add a simple way of converting the local
dashboards into a format that can be imported into Grafana and will work
when Prometheus is not the default datasource (e.g. when using
https://github.com/coreos/kube-prometheus).
Long-term, these dashboards will be available via grafana.com, which is
the preferred way of consuming them.
cc @mkuratczyk
Some metrics were of type gauge while they should have been of type
counter. Thanks @brian-brazil for making the distinction clear. This is
now captured as a comment above the metric definitions.
Because all metrics are from RabbitMQ's perspective, cached for up to 5
seconds by default (configurable), we prepend `rabbitmq_` to all metrics
emitted by this collector. While some metrics are for Erlang (erlang_),
Mnesia (schema_db_) or the System (io_), they are all observed & cached
by RabbitMQ, hence the prefix.
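A minimal sketch of the prefixing rule (the metric names below are illustrative, and the helper is mine, not the collector's API):

```python
# Every metric observed & cached by RabbitMQ gets the rabbitmq_ prefix,
# regardless of whether the underlying subsystem is Erlang, Mnesia or the OS.

def collector_name(name):
    return name if name.startswith("rabbitmq_") else "rabbitmq_" + name

for metric in ["erlang_processes_used", "io_read_ops_total", "rabbitmq_connections"]:
    print(collector_name(metric))
```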
This is the last PR which started in the context of prometheus/docs#1414
[#167846096]
As described in
https://prometheus.io/docs/instrumenting/writing_clientlibs/#process-metrics.
Until prometheus.erl has the prometheus_process_collector functionality
built in (this may not happen), we are exposing a subset of those
metrics via rabbitmq_core_metrics_collector, so we are going to stick to
the expected naming conventions.
This commit supersedes the thought process captured in
1e5f4de4cb
[#167846096]
While `process_open_fds` would have been ideal, the value is cached
within RabbitMQ and computed differently across platforms, so it is
important to keep the distinction from, say, what the kernel reports
just-in-time.
I am also capturing the Erlang context by adding `erlang_` to the
relevant metrics. The full context is: RabbitMQ observed this Erlang VM
process metric to be X, so this is why some metrics are prefixed with
`rabbitmq_erlang_process_`. This matters because there is a difference
between what RabbitMQ limits are set to, e.g.
`rabbitmq_memory_used_limit_bytes`, vs. what RabbitMQ reports about
the Erlang process, e.g. `rabbitmq_erlang_process_memory_used_bytes`.
This is the best that we can do while staying honest about what is being
reported. cc @brian-brazil
[#167846096]
This started in the context of prometheus/docs#1414, specifically
https://github.com/prometheus/docs/pull/1414#issuecomment-520505757
Rather than labelling all metrics with the same label, we are
introducing 2 new metrics: rabbitmq_build_info & rabbitmq_identity_info.
I suspect that we may want to revert deadtrickster/prometheus.erl#91
when we agree that the proposed alternative is better.
We have yet to follow through with changes to the Grafana dashboards. I
am most interested in what the updated queries will look like and, more
importantly, whether we will have the same panels as we do now. More
commits to follow shortly; wanted to get this out the door first.
In summary, this commit changes the output from this:
# TYPE erlang_mnesia_held_locks gauge
# HELP erlang_mnesia_held_locks Number of held locks.
erlang_mnesia_held_locks{node="rabbit@920f1e3272af",cluster="rabbit@920f1e3272af",rabbitmq_version="3.8.0-alpha.806",erlang_version="22.0.7"} 0
# TYPE erlang_mnesia_lock_queue gauge
# HELP erlang_mnesia_lock_queue Number of transactions waiting for a lock.
erlang_mnesia_lock_queue{node="rabbit@920f1e3272af",cluster="rabbit@920f1e3272af",rabbitmq_version="3.8.0-alpha.806",erlang_version="22.0.7"} 0
...
To this:
# TYPE erlang_mnesia_held_locks gauge
# HELP erlang_mnesia_held_locks Number of held locks.
erlang_mnesia_held_locks 0
# TYPE erlang_mnesia_lock_queue gauge
# HELP erlang_mnesia_lock_queue Number of transactions waiting for a lock.
erlang_mnesia_lock_queue 0
...
# TYPE rabbitmq_build_info untyped
# HELP rabbitmq_build_info RabbitMQ & Erlang/OTP version info
rabbitmq_build_info{rabbitmq_version="3.8.0-alpha.809",prometheus_plugin_version="3.8.0-alpha.809-2019.08.15",prometheus_client_version="4.4.0",erlang_version="22.0.7"} 1
# TYPE rabbitmq_identity_info untyped
# HELP rabbitmq_identity_info Node & cluster identity info
rabbitmq_identity_info{node="rabbit@bc7aeb0c2564",cluster="rabbit@bc7aeb0c2564"} 1
...
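The two `_info` metrics follow the common Prometheus "info metric" pattern: a constant sample value of 1, with the actual data carried in labels. A rough Python sketch of rendering one (the helper name is mine):

```python
def info_metric(name, help_text, labels):
    """Render an info-style metric: constant sample value 1, data in labels."""
    label_str = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return (f"# TYPE {name} untyped\n"
            f"# HELP {name} {help_text}\n"
            f"{name}{{{label_str}}} 1")

print(info_metric("rabbitmq_identity_info",
                  "Node & cluster identity info",
                  {"node": "rabbit@bc7aeb0c2564",
                   "cluster": "rabbit@bc7aeb0c2564"}))
```

In PromQL, such metrics are typically joined onto other series with `* on (...) group_left (...)` so that every metric doesn't need to carry the same labels.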
[#167846096]
We want to use a consistent range for all metrics that use rate() and a
safe value (4x the Prometheus scrape interval):
https://www.robustperception.io/what-range-should-i-use-with-rate
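The rule of thumb from the linked article, as a trivial sketch:

```python
# Make the rate() range at least 4x the scrape interval, so that one
# failed scrape still leaves at least two samples inside the window.

def safe_rate_range(scrape_interval_s, factor=4):
    return f"{scrape_interval_s * factor}s"

print(safe_rate_range(15))  # '60s' for the default 15s scrape interval
```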
This also prompted a change in RabbitMQ's default
collect_statistics_interval, so that we don't update metrics
unnecessarily. We are OK if the Management UI doesn't update on every 5s
auto-refresh.
Related: a929f22233
[#167846096]
Started as a Prometheus docs discussion in prometheus/docs#1414, mostly
based on https://prometheus.io/docs/instrumenting/writing_exporters/
Raft metrics are of type gauge, not counter. _If you care about the
absolute value rather than only how fast it's increasing, that's a
gauge_
All node_persister_metrics are now counters - some were gauges before.
They are now named using metric naming best practices:
https://prometheus.io/docs/practices/naming/
All metric names that should have units, do. Some use microseconds,
others milliseconds and others bytes or ops (operations). We don't do
any unit conversion in the collector but simply expose the units that
are used when the metric value is written to ETS.
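A small sketch of the naming rule (names and helper are illustrative, not the collector's actual code): keep the stored value unchanged and encode the unit in the metric name, with `_total` marking counters.

```python
# Encode the unit of the stored value in the metric name rather than
# converting the value; counters additionally get the _total suffix.

def with_unit(base_name, unit, is_counter):
    name = f"{base_name}_{unit}"
    return name + "_total" if is_counter else name

print(with_unit("io_sync_time", "microseconds", True))  # io_sync_time_microseconds_total
print(with_unit("queue_disk_size", "bytes", False))     # queue_disk_size_bytes
```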
While some metrics such as io_sync_time_microseconds_total would be
better expressed as Summaries, the refactoring required to achieve that
is not worth the effort. Will keep things simple & imperfect for now,
especially since we don't have a dashboard that helps visualise these
metrics.
The next step is to address global labels - will submit as a separate
PR.
[#167846096]
Now that there is a 3.8 alpha build that includes
rabbitmq/rabbitmq-server#2075, let's make use of it!
Without this, when a new cluster was started, some nodes ended up with
`rabbit@localhost` for the cluster label, instead of e.g. `rmq-gcp-38`.
The main suspect was a race condition, where the rabbitmq_prometheus app
starts before the cluster name is set via `rabbitmqctl
set_cluster_name`.
[finishes #167835770]
It's hard to understand what the different colours mean otherwise. Also,
yellow is preferable to purple when it comes to displaying runnable
processes - those stuck in the run queue.
cc @michaelklishin
It explains the correlation between inet packets & TCP packets, and why
the inet packet size varies when TLS is used for inter-node
communication.
[finishes #166419953]
It makes a big difference for stable throughput. See screenshots from
https://bugs.erlang.org/browse/ERL-959
We need to test this in a real network (I'm thinking GCP), outside of
Docker. The results will inform whether we should change the default,
which is 1436 bytes.
[#166419953]
Add cadvisor & node-exporter & Docker metrics.
Inspired by https://github.com/stefanprodan/dockprom
There are no Grafana dashboards for these metrics yet. The dockprom ones
don't show any panels in Grafana 6.
[#165818813]
Even though this slows down Grafana container startup, we need to ensure
that this plugin is present, otherwise the panels that track process
state won't work. This will be slow the first time the plugin is
downloaded, and slightly faster on subsequent runs.
[#166004512]
* pin nodes to specific colours
* add message-related single-stats
* reshuffle rows
* node metrics are most useful
* queue, channel & connection churn are least useful
Includes Erlang node to colour pinning
Adds a few make targets to help with docker-compose repetitive commands
& Grafana dashboard updates.
Split Overview & Distribution Docker deployments
re deadtrickster/prometheus.erl#92
[finishes #166004512]
We (+@essen) have answered a bunch of questions (see the story) and
improved the metrics + dashboard in the process. Added some improvements
to the RabbitMQ Overview metrics as well.
[#166004104]
This puts load on the distribution and makes the Erlang-Distribution
dashboard show an interesting behaviour in TCP sockets. @dcorbacho
thinks so too.
re deadtrickster/prometheus.erl#92
[#166004512]
Use 1m instead of $__interval for rates that track metrics with a slow
rate of change. Using $__interval will miss changes.
Stop rounding, it skews values.
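The first point can be illustrated with a toy simulation (all values invented): a counter that increments every 90s shows a rate of 0 whenever the lookback window is shorter than the increment period.

```python
# Why a short $__interval misses slow-moving counters: simulate a counter
# that increments once every 90s and compute a rate over a 30s vs a 120s
# lookback window (timestamps in seconds).

samples = {t: t // 90 for t in range(0, 361, 30)}  # counter value at time t

def rate(t, window):
    return (samples[t] - samples[t - window]) / window

print(rate(120, 30))   # 0.0 - no increment fell inside the short window
print(rate(120, 120))  # the longer window always catches the change
```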
All `basic.get` metrics are bad. The 0 threshold and the red colour for
all lines is hopefully enough to convey this.
re rabbitmq/rabbitmq-perf-test#203
[finishes #165852775]
Otherwise it's really hard to know what we are looking at when expanding
panels.
Also, pin to colours. Otherwise, rabbit@rabbitmq1 metrics in one panel
will appear yellow, and green in another panel. This is a one-off
which doesn't scale, should be automated in some way. Grafana doesn't
support pinning colors to labels 🤔
This includes the global_labels feature introduced in deadtrickster/prometheus.erl#91
To test, run `docker-compose up` in docker dir, then navigate to
localhost:15692/metrics & localhost:3000/dashboards (admin:admin) to see
the Grafana RabbitMQ Overview dashboard.
Add nodes, alarms & partitions to global counts. These are too important
not to show. Need to discuss how to expose these via metrics.
[#164374397]
Set memory high watermark to 256MiB to force trigger the memory alarm,
as well as ensure messages get paged to disk (forces disk reads).
Make all legends display as table so that values are easier to see when
toggling them.
This produces a bad rabbitmq-server build, perf-test crashes & so do
rabbit_channels. Will build a full rabbitmq-server-generic-unix locally,
this mix & matching is definitely trouble.
publisher-confirms_1 | Main thread caught exception: java.io.IOException
publisher-confirms_1 | 13:07:38.003 [main] ERROR com.rabbitmq.perf.PerfTest - Main thread caught exception
publisher-confirms_1 | java.io.IOException: null
publisher-confirms_1 | at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:129)
publisher-confirms_1 | at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:125)
publisher-confirms_1 | at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:147)
publisher-confirms_1 | at com.rabbitmq.client.impl.ChannelN.open(ChannelN.java:133)
publisher-confirms_1 | at com.rabbitmq.client.impl.ChannelManager.createChannel(ChannelManager.java:182)
publisher-confirms_1 | at com.rabbitmq.client.impl.AMQConnection.createChannel(AMQConnection.java:555)
publisher-confirms_1 | at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.createChannel(AutorecoveringConnection.java:165)
publisher-confirms_1 | at com.rabbitmq.perf.MulticastParams$TopologyHandlerSupport.configureQueues(MulticastParams.java:616)
publisher-confirms_1 | at com.rabbitmq.perf.MulticastParams$FixedQueuesTopologyHandler.configureQueuesForClient(MulticastParams.java:699)
publisher-confirms_1 | at com.rabbitmq.perf.MulticastParams.createConsumer(MulticastParams.java:405)
publisher-confirms_1 | at com.rabbitmq.perf.MulticastSet.createConsumers(MulticastSet.java:244)
publisher-confirms_1 | at com.rabbitmq.perf.MulticastSet.run(MulticastSet.java:126)
publisher-confirms_1 | at com.rabbitmq.perf.PerfTest.main(PerfTest.java:276)
publisher-confirms_1 | at com.rabbitmq.perf.PerfTest.main(PerfTest.java:374)
publisher-confirms_1 | Caused by: com.rabbitmq.client.ShutdownSignalException: connection error
publisher-confirms_1 | at com.rabbitmq.utility.ValueOrException.getValue(ValueOrException.java:66)
publisher-confirms_1 | at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36)
publisher-confirms_1 | at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:502)
publisher-confirms_1 | at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:293)
publisher-confirms_1 | at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:141)
publisher-confirms_1 | ... 11 common frames omitted
publisher-confirms_1 | Caused by: java.net.SocketException: Connection reset
publisher-confirms_1 | at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
publisher-confirms_1 | at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
publisher-confirms_1 | at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
publisher-confirms_1 | at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
publisher-confirms_1 | at java.base/java.io.DataInputStream.readUnsignedByte(DataInputStream.java:293)
publisher-confirms_1 | at com.rabbitmq.client.impl.Frame.readFrom(Frame.java:91)
publisher-confirms_1 | at com.rabbitmq.client.impl.SocketFrameHandler.readFrame(SocketFrameHandler.java:164)
publisher-confirms_1 | at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:598)
publisher-confirms_1 | at java.base/java.lang.Thread.run(Thread.java:834)
rabbitmq1_1 | 2019-04-25 12:40:53.778 [info] <0.1215.0> accepting AMQP connection <0.1215.0> (172.25.0.7:38752 -> 172.25.0.4:5672)
rabbitmq1_1 | 2019-04-25 12:40:53.840 [info] <0.1215.0> Connection <0.1215.0> (172.25.0.7:38752 -> 172.25.0.4:5672) has a client-provided name: perf-test-test
rabbitmq1_1 | 2019-04-25 12:40:53.849 [info] <0.1215.0> connection <0.1215.0> (172.25.0.7:38752 -> 172.25.0.4:5672 - perf-test-test): user 'guest' authenticated and granted access to vhost '/'
rabbitmq1_1 | 2019-04-25 12:40:53.855 [info] <0.1215.0> closing AMQP connection <0.1215.0> (172.25.0.7:38752 -> 172.25.0.4:5672 - perf-test-test, vhost: '/', user: 'guest')
rabbitmq1_1 | 2019-04-25 12:40:53.860 [info] <0.1224.0> accepting AMQP connection <0.1224.0> (172.25.0.7:38754 -> 172.25.0.4:5672)
rabbitmq1_1 | 2019-04-25 12:40:53.862 [info] <0.1224.0> Connection <0.1224.0> (172.25.0.7:38754 -> 172.25.0.4:5672) has a client-provided name: perf-test-configuration
rabbitmq1_1 | 2019-04-25 12:40:53.864 [info] <0.1224.0> connection <0.1224.0> (172.25.0.7:38754 -> 172.25.0.4:5672 - perf-test-configuration): user 'guest' authenticated and granted access to vhost '/'
rabbitmq1_1 | 2019-04-25 12:40:53.877 [info] <0.1231.0> accepting AMQP connection <0.1231.0> (172.25.0.7:38756 -> 172.25.0.4:5672)
rabbitmq1_1 | 2019-04-25 12:40:53.880 [info] <0.1231.0> Connection <0.1231.0> (172.25.0.7:38756 -> 172.25.0.4:5672) has a client-provided name: perf-test-consumer-0
rabbitmq1_1 | 2019-04-25 12:40:53.882 [info] <0.1231.0> connection <0.1231.0> (172.25.0.7:38756 -> 172.25.0.4:5672 - perf-test-consumer-0): user 'guest' authenticated and granted access to vhost '/'
rabbitmq1_1 | 2019-04-25 12:40:53.890 [error] <0.1239.0> CRASH REPORT Process <0.1239.0> with 0 neighbours exited with reason: no match of right hand value undefined in rabbit_channel:init_queue_cleanup_timer/1 line 2604 in gen_server2:init_it/6 line 597
rabbitmq1_1 | 2019-04-25 12:40:53.891 [error] <0.1231.0> CRASH REPORT Process <0.1231.0> with 0 neighbours crashed with reason: no match of right hand value {error,{'EXIT',{{badmatch,{error,{{{badmatch,undefined},[{rabbit_channel,init_queue_cleanup_timer,1,[{file,"src/rabbit_channel.erl"},{line,2604}]},{rabbit_channel,init,1,[{file,"src/rabbit_channel.erl"},{line,528}]},{gen_server2,init_it,6,[{file,"src/gen_server2.erl"},{line,554}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{child,undefined,channel,{rabbit_channel,start_link,[1,<0.1231.0>,<0.1237.0>,<0.1231.0>,<<"172.25.0.7:38756 -> 172.25.0.4:5672">>,rabbit_framing_amqp_0_9_1,...]},...}}}},...}}} in rabbit_reader:create_channel/2 line 923
rabbitmq1_1 | 2019-04-25 12:40:53.891 [error] <0.1229.0> Supervisor {<0.1229.0>,rabbit_connection_sup} had child reader started with rabbit_reader:start_link(<0.1230.0>, {acceptor,{0,0,0,0,0,0,0,0},5672}) at <0.1231.0> exit with reason no match of right hand value {error,{'EXIT',{{badmatch,{error,{{{badmatch,undefined},[{rabbit_channel,init_queue_cleanup_timer,1,[{file,"src/rabbit_channel.erl"},{line,2604}]},{rabbit_channel,init,1,[{file,"src/rabbit_channel.erl"},{line,528}]},{gen_server2,init_it,6,[{file,"src/gen_server2.erl"},{line,554}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{child,undefined,channel,{rabbit_channel,start_link,[1,<0.1231.0>,<0.1237.0>,<0.1231.0>,<<"172.25.0.7:38756 -> 172.25.0.4:5672">>,rabbit_framing_amqp_0_9_1,...]},...}}}},...}}} in rabbit_reader:create_channel/2 line 923 in context child_terminated
rabbitmq1_1 | 2019-04-25 12:40:53.891 [error] <0.1229.0> Supervisor {<0.1229.0>,rabbit_connection_sup} had child reader started with rabbit_reader:start_link(<0.1230.0>, {acceptor,{0,0,0,0,0,0,0,0},5672}) at <0.1231.0> exit with reason reached_max_restart_intensity in context shutdown
rabbitmq1_1 | 2019-04-25 12:40:54.376 [warning] <0.1224.0> closing AMQP connection <0.1224.0> (172.25.0.7:38754 -> 172.25.0.4:5672 - perf-test-configuration, vhost: '/', user: 'guest'):
rabbitmq1_1 | client unexpectedly closed TCP connection
Capture limits in thresholds. Even if they are static and somewhat
specific to this RabbitMQ deployment, it's better to have them when
demo-ing the end-to-end Prometheus/Grafana experience.
[#164374751]
This lights up `Published confirmed / s` Grafana panel.
To light up `Published unroutable / s`, unbind all queues from the
direct exchange.
[#164374751]
This has support for disabling metrics_collector, as captured in
rabbitmq/rabbitmq-management-agent#78 & rabbitmq/rabbitmq-management#691
Since we want management to be enabled, this doesn't help our use-case,
but this option is perfect for users that want metrics, but don't want
to pay the overhead of Management - especially metric aggregations.
[#164376052]
After running `docker-compose up`, open Grafana via
http://localhost:3000 and login with user admin & password admin. After
logging in, you will see a RabbitMQ Overview dashboard pre-loaded (/・0・)
Thanks @cirocosta! https://github.com/cirocosta/sample-grafana
cc @MarcialRosales
[finishes #164374321]
Captures all node metrics shown on the Overview page:
* File descriptors
* Socket descriptors
* Erlang processes
* Memory
* Disk
Not displaying any limits since they would make the variations
impossible to see. For example, when file descriptors go from 90 to 30,
if one of the metrics on the graph is 1048576 (Docker image default for
rabbitmq_node_sockets_total), it's impossible to see the metric change
from 90 to 30. The same problem is present in the current RabbitMQ Management
graphs on the node page, under Node statistics.
No thresholds have been set. Threshold values must be defined as
integers in Grafana 6; we can't reference metrics, e.g.
rabbitmq_node_sockets_total. Templating the dashboard would be one way,
but the problem with that is keeping it in sync with limits. It's a more
difficult problem than meets the eye, deferring it for now.
Created on Grafana v6.1
[finishes #164374321]
Bumping all prometheus-related deps to latest stable. Defining them in
rabbitmq-components.mk, so that they can be promoted to all deps in
umbrella.
rabbitmq_management_agent is required for alarm-related metrics to be
available.
Added node label to most `rabbitmq_` metrics. I need help adding them to
mfa_totals - metrics_node_label_test test currently fails. The new unit
tests ensure that label/0 behaves as expected in all cases - made
refactoring easy. Run unit tests via:
gmake eunit EUNIT_MODS=prometheus_rabbitmq_core_metrics_collector
Updating to latest erlang.mk makes running eunit tests much faster: 2s
vs 10s. To do this, comment `ERLANG_MK_*` in the Makefile and run `gmake
erlang-mk`.