Without this change, the test could take a very long time to
clean up the queues and finish because of a race condition
between the queue deletion and the federation link being
restarted and declaring the queue again.
(The bidirectional test was renamed to message_flow to better
represent what it does.)
We expect to have one stream for each routing key, but
as a binding can return several queues for a given key, we
leave that possibility open in the stream protocol.
Instead of injecting it into various places inside the code.
When the osiris log is closed, it decrements the global "readers"
counter, which is why it is much safer to do this in terminate.
A value that is too low will prevent the index from shutting
down in time when there are many queues. This leads to the
process being killed, and on the next RabbitMQ restart a
(potentially very long) dirty recovery is needed.
The value of 10 minutes was chosen to mirror the shutdown
timeout of the message store. Since both the queues and the message
store need to have shut down gracefully in order to have a clean
restart, it makes sense to use the same value.
Related: c40c2628a9
* max_message_size had an off-by-one error and unfortunate naming
* classic mirrored queue batch size (a number of messages) was not validated.
The previous limit of over 2B messages did not make much sense. 1M is still
very high but a more reasonable upper bound (see the sketch below).
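Both limits correspond to ordinary rabbitmq.conf settings. A minimal
sketch with placeholder values; `mirroring_sync_batch_size` is assumed
here to be the key behind the batch size mentioned above:

``` ini
# Maximum allowed message size, in bytes (128 MiB in this example).
max_message_size = 134217728
# Assumed key for the classic mirrored queue sync batch size, counted
# in messages; values above 1M are now rejected.
mirroring_sync_batch_size = 4096
```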
Fixes #3390
When we fail to parse the name of the cipher suite from the PROXY
protocol, just report that no SSL is used, instead of trying to fill
it with data from the connection between the proxy and our server.
With this change, if the RabbitMQ node is running on Unix and accepts
input, the titlebar of an Xterm-compatible terminal emulator will show a
few details about the running node. Specifically, it will indicate the
name of the node and the version of RabbitMQ.
A user could already enable single-line logging (the `single_line`
option of `logger_formatter` or RabbitMQ internal formatters) from the
configuration file. For example:
log.console.formatter.single_line = on
With this patch, the option can be enabled from the `$RABBITMQ_LOG`
environment variable as well:
make run-broker RABBITMQ_LOG=+single_line
Rather than sleeping for 6 seconds, we repeatedly check whether the
replica has recovered, and either eventually succeed or fail if it
does not recover within 30 seconds, the default await_condition time
interval.
Pair: @kjnilsson
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Also increased the tick timeout to avoid checking too often for new
rabbit nodes to auto-add.
Also increased sleep times for nodedowns to retry less often.
Some logs used ~p to format a full stack trace. Given these warnings
are emitted during any nodedown, this unnecessarily pollutes the logs.
The output is now trimmed using ~W instead.
This ensures that only nodes that are ready to host stream members
are included in the election. This avoids continuous restart attempts
when the rabbit application is stopped.
Otherwise metrics will not get cleaned up correctly when processes crash.
It's also tidier to do this in a single place, in terminate/3
Pair: @kjnilsson
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Prior to this change, exclusive queues were subject to the queue
location process, just like other queues. Therefore, if
queue_master_locator was not client-local and x-queue-master-locator was
not set to client-local, an exclusive queue was likely to be located on
a different node than the connection it is exclusive to. This is
suboptimal and may lead to inconsistencies when the queue's node goes
down while the connection's node is still up.
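For context, a minimal illustration of the locator setting involved
(the value shown is just an example):

``` ini
# Node selection policy applied when declaring queues; with this
# change, exclusive queues always follow the declaring connection.
queue_master_locator = min-masters
```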
The table might not exist yet (or might already be gone) between the
time rabbit_nodes:all_running() runs and returns a specific node, and
the time mnesia:dirty_match_object() is called for that node's table.
This seems to happen frequently in CI.
The classic local filesystem source is still supported
using the same traditional configuration key, load_definitions.
Configuration schema follows peer discovery in spirit:
* definitions.import_backend configures the mechanism to use,
which can be a module provided by a plugin
* definitions.* keys can be defined by plugins and contain any
keys a specific mechanism needs
For example, the classic local filesystem source can now be
configured like this:
``` ini
definitions.import_backend = local_filesystem
definitions.local.path = /path/to/definitions.d/definition.json
```
An HTTPS source can be configured like this:
``` ini
definitions.import_backend = https
definitions.https.url = https://hostname/path/to/definitions.json
```
HTTPS may require additional configuration keys related to TLS/x.509
peer verification. Such extra keys will be added as the need for them
becomes evident.
References #3249
It is the equivalent of the content of the Erlang cookie file. Note this
variable IS the cookie value, NOT the path to a cookie file.
If it is set, it will take precedence over the content of the Erlang
cookie file.
Fixes docker-library/rabbitmq#508.
They are the equivalent of the `default_{user,pass,vhost}` configuration
settings. Each set environment variable will take precedence over its
configuration file counterpart.
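For reference, a minimal sketch of the configuration-file
counterparts, shown with their usual default values:

``` ini
# rabbitmq.conf counterparts; when set, the corresponding environment
# variables take precedence over these values.
default_user = guest
default_pass = guest
default_vhost = /
```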
Fixes docker-library/rabbitmq#508.
Those environment variables are unset by default. The default values are
set in the `rabbit` application environment and can be configured in the
configuration file. However, each environment variable, if set, takes
precedence over its configuration counterpart.
Otherwise, messages are dropped, in particular during RabbitMQ
startup, because of the amount of debug messages logged in that phase.
The burst limit is enabled again if the log level is set to `info` or
higher.
The problem is that we only know about the state of the `rabbit`
Erlang application: when it is started and stopped. But we can't know
the fate of the Erlang VM, except when `rabbit:stop_and_halt()` is
called. This function is not called if `init:stop()` or a SIGTERM is
used, for instance.
systemd is interested in the state of the system process (the Erlang
VM), not what's happening inside. But inside, we have multiple
situations where the Erlang application is stopped, but not the Erlang
VM. For instance:
* When clustering, the Erlang application is stopped before the
cluster is created or expanded. The application is restarted once
done. This is controlled either manually or using the peer
discovery plugins.
* The `pause_minority` or `pause_if_all_down` partition strategies
both stop the Erlang application for an indefinite period of time,
but RabbitMQ as a service is still up (even though it is managing
its own degraded mode and no connections are accepted).
In both cases, the service is still running from the system's service
manager's point of view.
As said above, we can never tell "the VM is being terminated" with
confidence. We can only know about the Erlang application itself.
Therefore, it is best to report the latter as a systemd state
description, and not to report the "STOPPING=1" state at all. systemd
will figure out by itself that the Erlang VM exited anyway.
Before this change, we were reporting the "STOPPING=1" state to systemd
every time the Erlang application was stopped. The problem was that
systemd expected the system process (the Erlang VM) to exit within a
configured period of time (90 seconds by default) or to report that it
is ready again ("READY=1"). This issue remained unnoticed when the
cluster was created or expanded because that probably happened within
that time frame. However, the issue showed up with the partition
handling strategies
because the partition might last longer than 90 seconds. When this
happened, the Erlang VM was killed (SIGKILL) and the service restarted.
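For context, a hedged sketch of the relevant systemd unit settings
(an illustrative excerpt, not the actual RabbitMQ unit file):

``` ini
[Service]
# systemd expects sd_notify(3) state updates from the service.
Type=notify
NotifyAccess=all
# Once "STOPPING=1" is reported, systemd waits this long (90 seconds
# by default) for the process to exit before sending SIGKILL.
TimeoutStopSec=90s
```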
References #3262.
Fixes #3289.
This is the same separator as the field mapping. I don't remember why I
picked a different character... Now, it feels awkward and non-intuitive
for users.
The code was passing a number (the timestamp) to
unicode:characters_to_binary/1, which expects an iolist to convert to
UTF-8.
We now check whether the value is a number before calling that
function. If it is a number (integer or float), we keep it as is
because JSON supports that type.
`EXIT` messages captured by ra pollute the log.
The link is only needed to ensure no orphan processes are left behind,
so they can be safely unlinked once the work is done.
Maybe the resizing cluster coordinator does not require linking; only
the phases are problematic when a coordinator is stopped.
They were trying to run `hostname` and `which`, which produced a bunch
of error messages in a hermetic build environment.
Also, the performance of those `shell` calls is not very important, as
they are called just a few times during the script runtime anyway
(there is a hack to make these lazy but evaluate only once, though it's
hardly worth it).