Technically, duplicate names are supported by Common Test, but we have
seen them contribute to flakiness in our suite in practice
(cherry picked from commit 513446b6d1)
In the test suite. Note that a 1.0-SNAPSHOT snapshot was
pushed by mistake a while ago, so this one must be
excluded. It's unlikely to be erased, as the snapshots
for the first stable version should be 1.0.0-SNAPSHOT.
The protocol documentation uses decimal values for error and request key
codes.
Let's use hex values instead. This helps when looking at a request and
its response - 0x0006 and 0x8006 vs. 6 and 32774.
Also, when looking at output of protocol analysis tools like Wireshark,
a hexadecimal value will be printed, for example:
"Nov 1, 2021 23:05:19.395825508 GMT","60216,5552","00000009000600010000000701"
"Nov 1, 2021 23:05:19.396069528 GMT","5552,60216","0000000a80060001000000070001"
Above, we can visually identify the delete publisher request and its response
(0x0006 and 0x8006) and easily match them in the documentation of the
protocol.
Finally, the same argument applies to logging, as it is common to log
hex values, not decimal ones.
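To illustrate, here is a minimal sketch of the relationship that hex makes
obvious - the response key is the request key with the high bit set
(0x0006 vs 0x8006, i.e. 6 vs 32774 in decimal). The helper names are made
up, not taken from the codebase:

response_key(RequestKey) ->
    %% a response key is the request key with the high bit (16#8000) set
    RequestKey bor 16#8000.

format_key(Key) ->
    %% prints e.g. "0x8006" instead of "32774", matching the Wireshark output
    lists:flatten(io_lib:format("0x~4.16.0B", [Key])).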
1. The response to the publisher declaration request does not contain the
publisher id.
2. Add a mechanism entry to the details of the SASL handshake request.
3. The SASL handshake response contains a list of mechanisms, not just a
single mechanism.
Use case: allow plain connections over one IP (e.g. an internal IP), and TLS
connections over another IP (e.g. an internet-routable IP). Without this
patch a cluster can only support access over one IP or the other, not
both.
(cherry picked from commit b9e6aad035)
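As a sketch of what such a split could look like in an Erlang-terms
configuration (the addresses are placeholders, and the plain AMQP listeners
are only used to illustrate the pattern):

%% advanced.config sketch - placeholder addresses
[
 {rabbit, [
   {tcp_listeners, [{"10.0.0.10", 5672}]},    %% internal IP, plain TCP
   {ssl_listeners, [{"203.0.113.10", 5671}]}  %% internet-routable IP, TLS
 ]}
].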
We expect to have one stream for each routing key, but
as a binding can return several queues for a given key, we
leave that possibility open in the stream protocol.
Instead of injecting it into various places inside the code.
When the osiris log is closed, it decrements the global "readers"
counter, which is why it is much safer to do this in terminate.
Otherwise, metrics will not get cleaned up correctly when processes crash.
It's also tidier to do this in a single place, in terminate/3.
Pair: @kjnilsson
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
since it fails with Bazel.
As discussed with @pjk25, let's set this value via an application env variable,
making it configurable for the test, but not configurable for the user.
Add state timeouts.
If the client takes more than 10s for a single step in the authentication
protocol, make the server close the TCP connection.
Also close the TCP connection if the server times out in state
close_sent. That's the case when the client sends an invalid command
(after successful authentication), the server asks the client to
close the connection, but the client no longer responds.
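A minimal gen_statem sketch of the state-timeout idea; the state names,
messages and data layout below are made up for illustration and are not the
actual ones in rabbit_stream_reader:

-module(stream_reader_sketch).
-behaviour(gen_statem).

-export([callback_mode/0, init/1, authenticating/3, close_sent/3]).

callback_mode() -> state_functions.

init(Socket) ->
    %% arm a 10s state timeout for the first authentication step
    {ok, authenticating, #{socket => Socket}, [{state_timeout, 10000, close}]}.

%% if the client takes more than 10s for a step, close the TCP connection
authenticating(state_timeout, close, #{socket := Socket} = Data) ->
    ok = gen_tcp:close(Socket),
    {stop, normal, Data};
authenticating(info, {tcp, _Sock, _Frame}, Data) ->
    %% ... handle the authentication frame, then re-arm the timeout for the
    %% next step (or move on to the post-authentication state)
    {keep_state, Data, [{state_timeout, 10000, close}]}.

%% same treatment when the server has asked the client to close the
%% connection but the client no longer responds
close_sent(state_timeout, close, #{socket := Socket} = Data) ->
    ok = gen_tcp:close(Socket),
    {stop, normal, Data}.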
The default gen_server timeout is not enough to list busy connections.
Setting it to infinity allows the caller to decide the timeout,
as classic queues do. The `emit_info` family of functions sets its
own timeout for all the CLI commands.
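For illustration, a sketch with a hypothetical info call on a connection
process (the function name and message shape are placeholders):

%% gen_server:call/2 defaults to a 5-second timeout; a busy connection
%% process may not answer in time. Passing infinity lets the caller
%% (e.g. a CLI command with its own timeout) decide how long to wait.
info(ConnectionPid, InfoItems) ->
    gen_server:call(ConnectionPid, {info, InfoItems}, infinity).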
Calling ensure_stats_timer after init_stats_timer and reset_stats_timer
is enough.
The idea is to call stop_stats_timer before hibernation and
ensure_stats_timer on wakeup. However, since we never call
stop_stats_timer in rabbit_stream_reader, we don't need to call
ensure_stats_timer on every network activity.
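For context, a rough sketch of the stats-timer pattern in question, assuming
the usual `rabbit_event` helper arities (the record and function names here
are illustrative, not the exact rabbit_stream_reader code):

-record(state, {stats_timer}).

init_stats(State) ->
    rabbit_event:init_stats_timer(State, #state.stats_timer).

%% Arms the timer if it is not already armed. Calling it once after
%% init_stats_timer/reset_stats_timer is enough; it does not need to be
%% re-invoked on every piece of network activity.
ensure_stats(State) ->
    rabbit_event:ensure_stats_timer(State, #state.stats_timer, emit_stats).

%% Would only matter around hibernation, which is not used here.
stop_stats(State) ->
    rabbit_event:stop_stats_timer(State, #state.stats_timer).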
Before this commit, the test AlarmsTest.diskAlarmShouldNotPreventConsumption
of the Java client was failing.
When executing that test, the server failed with:
2021-06-25 16:11:02.886935+02:00 [error] <0.1301.0> exception exit: {unexpected_message,resume}
2021-06-25 16:11:02.886935+02:00 [error] <0.1301.0> in function rabbit_heartbeat:heartbeater/3 (src/rabbit_heartbeat.erl, line 138
because an attempt was made to resume the heartbeater without pausing it
first.
The above exception exit also happens on the master branch when executing this
test. However, the test falsely succeeds on master because the following FIXME was
never implemented:
8e569ad8bf/deps/rabbitmq_stream/src/rabbit_stream_reader.erl (L778)
This is pure refactoring - no functional change.
Benefits:
* code is more maintainable
* smaller functions (instead of the previous 350-line listen_loop_post_auth function)
* well defined state transitions (e.g. useful to enforce authentication protocol)
* we get some gen_statem helper functions for free (e.g. the debug utilities sketched below)
Useful doc: https://ninenines.eu/docs/en/ranch/2.0/guide/protocols/
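As an illustration of that last point, the standard OTP `sys` debug
utilities now apply to the reader process directly; the helper below is a
placeholder, not code from the branch:

debug_reader(ReaderPid) ->
    %% print every event and state transition of the reader to the shell
    ok = sys:trace(ReaderPid, true),
    %% inspect the current state name and state data
    sys:get_state(ReaderPid).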
This way we can show how many messages were received via a certain
protocol (stream is the second real protocol besides the default amqp091
one), as well as by queue type, which is something many people have
requested for a really long time.
The most important aspect is that we can also see them by protocol AND
queue_type, which becomes very important for Streams, which have
different rules from regular queues (for example, consuming
messages is non-destructive, and deep queue backlogs - think billions of
messages - are normal). Alerting and consumer scaling due to deep
backlogs will now work correctly, as we can distinguish between regular
queues & streams.
This has gone through a few cycles, with @mkuratczyk & @dcorbacho
covering most of the ground. @dcorbacho had most of this in
https://github.com/rabbitmq/rabbitmq-server/pull/3045, but the main
branch went through a few changes in the meantime. Rather than resolving
all the conflicts, and then making the necessary changes, we (@gerhard +
@kjnilsson) took all learnings and started re-applying a lot of the
existing code from #3045. We are confident in this approach and would
like to see it through. We continued working on this with @dumbbell, and
the most important changes are captured in
https://github.com/rabbitmq/seshat/pull/1.
We expose these global counters in rabbitmq_prometheus via a new
collector. We don't want to keep modifying the existing collector, which
grew really complex in parts, especially since we introduced
aggregation, but rather start with a new namespace, `rabbitmq_global_`, and
continue building on top of it. The idea is to build in parallel and
slowly transition to the new metrics, because the changes are semantically
too big since streams were introduced, and we have been discussing
protocol-specific metrics with @kjnilsson, which makes me think that this
approach is the least disruptive and... simple.
While at it, we removed redundant empty return value handling in the
channel; the function being called no longer returns this value.
Also removed all DONE / TODO & other comments - we'll handle them when
the time comes; no need to leave TODO reminders.
Pairs: @kjnilsson @dcorbacho @dumbbell
(this is multiple commits squashed into one)
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Most tests that can start RabbitMQ nodes have some chance of
flaking. Rather than chase individual flakes for now, this commit
changes the default (though it can still be overridden, as is the case
for config_scheme_SUITE in many places, since I have yet to see that
particular suite flake).
Before this commit, sending garbage data to the server's stream port
caused the RabbitMQ node to eat more and more memory.
In this commit, we fix it by expecting the client to go through the
proper authentication sequence. Otherwise, the server closes the socket.
Co-authored-by: Michal Kuratczyk <mkuratczyk@pivotal.io>
This is an important metric to keep track of and be aware of (maybe even
alert on) when consumers fall behind consuming stream messages. While
they should be able to catch up, if they fall behind too much and the
stream gets truncated, they may miss messages.
This is something that we want to expose via Prometheus metrics as well,
but we've started closer to the core, CLI & Management.
This should be merged as soon as it passes CI; we shouldn't wait on the
Prometheus changes - they can come later.
Pair: @kjnilsson
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
To make sure the PID is alive, as the mnesia record can become stale after a
failure.
Also make the local PID lookup in the stream coordinator do a consistent
query across the cluster if the PID is not alive.
Co-authored-by: Karl Nilsson <kjnilsson@users.noreply.github.com>
This allows including additional applications or third party
plugins when creating a release, running the broker locally,
or just building from the top-level Makefile.
To include Looking Glass in a release, for example:
$ make package-generic-unix ADDITIONAL_PLUGINS="looking_glass"
A Docker image can then be built using this release and will
contain Looking Glass:
$ make docker-image
Beware, macOS users! Applications such as Looking Glass include
NIFs. NIFs must be compiled in the right environment. If you
are building a Docker image, then make sure to build the NIF
on Linux! In the two steps above, this corresponds to the first step.
To run the broker with Looking Glass available:
$ make run-broker ADDITIONAL_PLUGINS="looking_glass"
This commit also moves Looking Glass dependency information
into rabbitmq-components.mk so it is available at all times.
Lager strips trailing newline characters, but the OTP logger with the default
formatter adds a newline at the end. To avoid unintentional multi-line log
messages, we have to revisit most logged messages.
Some log entries are intentionally multiline, others
are printed to stdout directly: newlines are required there
for sensible formatting.
The configuration remains the same for the end-user. The only exception
is the log root directory: it is now set through the `log_root`
application env. variable in `rabbit`. People using the Cuttlefish-based
configuration file are not affected by this exception.
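For reference, a sketch of setting that env variable in an Erlang-terms
config file (the path is a placeholder):

%% advanced.config sketch - only needed when bypassing the Cuttlefish-based
%% configuration file
[
 {rabbit, [
   {log_root, "/var/log/rabbitmq"}  %% placeholder path
 ]}
].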
The main change is how the logging facility is configured. It now
happens in `rabbit_prelaunch_logging`. The `rabbit_lager` module is
removed.
The supported outputs remain the same: the console, text files, the
`amq.rabbitmq.log` exchange and syslog.
The message text format changed slightly: the timestamp is more precise
(now to the microsecond) and the level is abbreviated so that it is always
4 characters long, to align all messages and improve readability. Here is
an example:
2021-03-03 10:22:30.377392+01:00 [dbug] <0.229.0> == Prelaunch DONE ==
2021-03-03 10:22:30.377860+01:00 [info] <0.229.0>
2021-03-03 10:22:30.377860+01:00 [info] <0.229.0> Starting RabbitMQ 3.8.10+115.g071f3fb on Erlang 23.2.5
2021-03-03 10:22:30.377860+01:00 [info] <0.229.0> Licensed under the MPL 2.0. Website: https://rabbitmq.com
The example above also shows that multiline messages are supported and
each line is prepended with the same prefix (the timestamp, the level
and the Erlang process PID).
JSON is also supported as a message format, and now for any output.
Indeed, it is possible to use it with e.g. syslog or the exchange. Here
is an example of a JSON-formatted message sent to syslog:
Mar 3 11:23:06 localhost rabbitmq-server[27908] <0.229.0> - {"time":"2021-03-03T11:23:06.998466+01:00","level":"notice","msg":"Logging: configured log handlers are now ACTIVE","meta":{"domain":"rabbitmq.prelaunch","file":"src/rabbit_prelaunch_logging.erl","gl":"<0.228.0>","line":311,"mfa":["rabbit_prelaunch_logging","configure_logger",1],"pid":"<0.229.0>"}}
For quick testing, the values accepted by the `$RABBITMQ_LOGS`
environment variable were extended:
* `-` still means stdout
* `-stderr` means stderr
* `syslog:` means syslog on localhost
* `exchange:` means logging to `amq.rabbitmq.log`
`$RABBITMQ_LOG` was also extended. It now accepts a `+json` modifier (in
addition to the existing `+color` one). With that modifier, messages are
formatted as JSON instead of plain text.
The `rabbitmqctl rotate_logs` command is deprecated. The reason is that
Logger does not expose a function to force log rotation. However, it
will detect when a file was rotated by an external tool.
From a developer point of view, the old `rabbit_log*` API remains
supported, though it is now deprecated. It is implemented as regular
modules: there is no `parse_transform` involved anymore.
In the code, it is recommended to use the new Logger macros. For
instance, `?LOG_INFO(Format, Args)`. If possible, messages should be
augmented with some metadata. For instance (note the map after the
message):
?LOG_NOTICE("Logging: switching to configured handler(s); following "
"messages may not be visible in this log output",
#{domain => ?RMQLOG_DOMAIN_PRELAUNCH}),
Domains in Erlang Logger parlance are the way to categorize messages.
Some predefined domains, matching previous categories, are currently
defined in `rabbit_common/include/logging.hrl` or headers in the
relevant plugins for plugin-specific categories.
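Putting this together, a module using the new macros with a predefined
domain could look roughly like this (the module name is a placeholder, and
a plugin would normally use its own plugin-specific domain):

-module(my_worker).  %% placeholder name

%% ?LOG_INFO/?LOG_NOTICE/... come from OTP's Logger:
-include_lib("kernel/include/logger.hrl").
%% ?RMQLOG_DOMAIN_* domains come from rabbit_common:
-include_lib("rabbit_common/include/logging.hrl").

-export([start/0]).

start() ->
    ?LOG_INFO("Starting ~ts", [?MODULE],
              #{domain => ?RMQLOG_DOMAIN_PRELAUNCH}),
    ok.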
At this point, very few messages have been converted from the old
`rabbit_log*` API to the new macros. It can be done gradually when
working on a particular module or on logging itself.
The Erlang built-in console/file handler, `logger_std_h`, has been forked
because it lacks date-based file rotation. The configuration of
date-based rotation is identical to Lager's. Once the dust has settled for
this feature, the goal is to submit it upstream for inclusion in Erlang.
The forked module is called `rabbit_logger_std_h` and is based on
`logger_std_h` from Erlang 23.0.
Rabbit channels are responsible for this check before calling declare;
skipping it in the manager meant that the queue was partly redeclared
and a new data directory was created. The old one was still on disk with
a different timestamp, but from the user's point of view the queue data
had been erased.
They are a bit more defensive. The subscription is also now more
reliable, returning a stream-not-available code if necessary.
Also setting the Aten poll interval to 1 second (bumped to 5 seconds in master
now).
This is done after checking rabbit_queue, and only if it reports that the
queue does not exist. The coordinator may be recovering the queue, so
thanks to this double check we know the queue exists but is not
available, instead of concluding it does not exist at all.
If the heartbeat frame is sent from a dedicated process, it interleaves
between 2 socket calls from the reader process. Frames are typically
sent in one call, so this is fine, but a chunk is delivered with 2
calls: one for the frame header and one sendfile call for the chunk body.
So the heartbeat frame can sneak in between these 2 calls, which makes
clients fail to parse frames.
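As a sketch of the problematic write sequence on the reader side (the
function name and arguments are illustrative, and the log handle is assumed
to be a raw file descriptor):

send_chunk(Socket, FrameHeader, LogFd, ChunkPos, ChunkSize) ->
    ok = gen_tcp:send(Socket, FrameHeader),
    %% <-- if a separate heartbeater process writes its frame to Socket here,
    %%     the client sees header, heartbeat frame, chunk body, and can no
    %%     longer parse the byte stream. Sending heartbeats from the reader
    %%     process itself removes this interleaving.
    file:sendfile(LogFd, Socket, ChunkPos, ChunkSize, []).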
The stream leader node fails while a publisher is publishing; the publisher
reconnects to the new leader. A consumer should read all the confirmed
messages afterwards.
From "stream deleted" to "stream not available". It covers now the
deletion of a stream and the unavailibility due to a failure. This is up
to the client to find out what to do (typically send a metadata request
about the stream and see if it's still there).