Otherwise metrics will not get cleaned up correctly when processes crash.
It's also tidier to do this in a single place, in terminate/3
Pair: @kjnilsson
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Prior to this change, exclusive queues have been subject to the queue
location process, just like other queues. Therefore, if
queue_master_locator was not client-local and x-queue-master-locator was
not set to client-local, an exclusive queue was likely to be located on
a different node than the connection it is exclusive to. This is
suboptimal and may lead to inconsistencies when the queue's node goes
down while the connection's node is still up.
The table might not exist yet (or is already gone) between the time
rabbit_nodes:all_running() runs and returns a specific node, and
mnesia:dirty_match_object() is called for that node's table.
This seems to happen frequently in CI.
The classic local filesystem source is still supported
using the same traditional configuration key, load_definitions.
Configuration schema follows peer discovery in spirit:
* definitions.import_backend configures the mechanism to use,
which can be a module provided by a plugin
* definitions.* keys can be defined by plugins and contain any
keys a specific mechanism needs
For example, the classic local filesystem source can now be
configured like this:
``` ini
definitions.import_backend = local_filesystem
definitions.local.path = /path/to/definitions.d/definition.json
```
``` ini
definitions.import_backend = https
definitions.https.url = https://hostname/path/to/definitions.json
```
HTTPS may require additional configuration keys related to TLS/x.509
peer verification. Such extra keys will be added as the need for them
becomes evident.
References #3249
It is the equivalent of the content of the Erlang cookie file. Note this
variable IS the cookie value, NOT the path to a cookie file.
If it is set, it will take precedence over the content of the Erlang
cookie file.
Fixesdocker-library/rabbitmq#508.
They are the equivalent of the `default_{user,pass,vhost}` configuration
settings. Each set environment variable will take precedence over its
configuration file counterpart.
Fixesdocker-library/rabbitmq#508.
Those environment variables are unset by default. The default values are
set in the `rabbit` application environment and can be configured in the
configuration file. However, the environment variables will take
precedence over them respectively if they are set.
Otherwise, messages are being dropped, in particular during RabbitMQ
startup because of the amount of debug messages logged in that phase.
Burst limit is enabled again if the log level is set to `info` or
higher.
The problem is we only know about the state of the `rabbit` Erlang
application — when it is started and stopped. But we can't know the fate
of the Erlang VM, except if `rabbit:stop_and_halt()` is called. This
function is not called if `init:stop()` or a SIGTERM are used for
instance.
systemd is interested in the state of the system process (the Erlang
VM), not what's happening inside. But inside, we have multiple
situations where the Erlang application is stopped, but not the Erlang
VM. For instance:
* When clustering, the Erlang application is stopped before the
cluster is created or expanded. The application is restarted once
done. This is controled either manually or using the peer
discovery plugins.
* The `pause_minority` or `pause_if_all_down` partition strategies
both stop the Erlang application for an indefinite period of time,
but RabbitMQ as a service is still up (even though it is managing
its own degraded mode and no connections are accepted).
In both cases, the service is still running from the system's service
manager's point of view.
As said above, we can never tell "the VM is being terminated" with
confidence. We can only know about the Erlang application itself.
Therefore, it is best to report the latter as a systemd state
description, but not reporting the "STOPPING=1" state at all. systemd
will figure out itself that the Erlang VM exited anyway.
Before this change, we were reporting the "STOPPING=1" state to systemd
every time the Elang application was stopped. The problem was that
systemd expected the system process (the Erlang VM) to exit within a
configured period of time (90 seconds by default) or report that's it's
ready again ("READY=1"). This issue remained unnoticed when the cluster
was created/expanded because it probably happened within that time
frame. However, it was reported with the partition healing strategies
because the partition might last longer than 90 seconds. When this
happened, the Erlang VM was killed (SIGKILL) and the service restarted.
References #3262.
Fixes#3289.
This is the same separator as the field mapping. I don't remember why I
picked a different character... Now, it feels awkward and non-intuitive
for users.
The code was passing a number (the timestamp) to
unicode:characters_to_binary/1 which expects an iolist to convert to
UTF-8.
We now verify if we have a number before calling that function. If this
is a number (integer or float), we keep it as is because JSON supports
that type.
`EXIT` messages captured by ra polute the log.
The link is only needed to ensure no orphan processes are left behind,
so they can be safely unlinked once the work is done.
Maybe resizing cluster coordinator does not require linking, only
phases are problematic when a coordinator is stopped
They were trying to run `hostname` and `which`, which produced a bunch
of error messages in a hermetic build environment.
And performance of those `shell` calls is not very important, as they
are caled just a few times during script runtime anyway (there is a
hack to make these lazy, but evaluating only once - but it's hardly
worth it).
Unlike pg2, pg in Erlang 24 is eventually consistent. So this
reintroduces some of the same kind of locking mirrored_supervisor
used to rely on implicitly via pg2.
Per discussion with @lhoguin.
Closes#3260.
References #3132, #3154.
Before this commit, importing the dashboard via ConfigMap as seen in
1eb1dc618e
didn't work because DS_PROMETHEUS variable was undefined in Grafana.
Related to https://github.com/rabbitmq/rabbitmq-server/pull/3250
Co-authored-by: Gerhard Lazu <gerhard@lazu.co.uk>
This breaks the docker-compose integration, but we need to move away
from it anyways, the whole dev flow needs revisiting after our focus on
K8s.
$__rate_interval does not work with irate, dropping it in favour of 60s,
same as all other dashboards.
This is a follow-up to https://github.com/rabbitmq/rabbitmq-server/pull/3250
Thanks @ansd for mentioning about the post-import issues.
It was uploaded as https://grafana.com/api/dashboards/14798/revisions/3/download
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
This handles the scenario where rmq2 is not available, and
stream-perf-test exits with a non-zero exit code. Good spot @ansd!
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
to prevent `amqp_connection:start/2` from logging a warning when we use default
values.
The default for `connection_timeout` is already 60000. When we don't explicitly
set it to a lower value, `amqp_connection:maybe_update_call_timeout/2` adjusts
it to 70000 and logs a warning message, which may appear unexpected, especially
for users upgrading to 3.8.10+ with no config changes.
This change addresses that problem by pre-adjusting it to 70000, making it safe
by default and ensuring our default values don't conflict with each other.
A recent release of buildbuddy eliminated a timeout extension that
they had applied in the past. Now that they honor timeouts exactly,
we have had to adjust the timeout for many tests.
This has the unfortunate side effect of causing a rebuild of all
applications every time. I need to figure out another place to build and
install the CLI during build time (instead of as part of the dist
target).
This reverts commit 4322cca66e.
It's a runtime dependency, not a build dependency.
This is a fix and should be backported to v3.9.x, after rc.2 and just
before the final release. Would you disagree @dumbbell?
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
since it fails with Bazel.
As discussed with @pjk25, let's set this value via application env,
make it configurable to the test, but not configurable to the user.
Unlike with gnu make, mixed version testing with bazel uses a package-generic-unix for the secondary umbrella rather than the source. This brings the benefit of being able to mixed version test releases built with older erlang versions (even though all nodes will run under the single version given to bazel)
This introduces new test labels, adding a `-mixed` suffix for every existing test. They can be skipped if necessary with `--test_tag_filters` (see the github actions workflow for an example)
As part of the change, it is now possible to run an old release of rabbit with rabbitmq_run rule, such as:
`bazel run @rabbitmq-server-generic-unix-3.8.17//:rabbitmq-run run-broker`
Add state timeouts.
If the client takes more than 10s for a single step in the authentication
protocol, make the server close the TCP connection.
Also close the TCP connection if the server times out in state
close_sent. That's the case when the client sends an invalid command
(after successful authentication), the server requests the client to
close the connection, but the client doesn't respond anymore.