Bazel build files are now maintained primarily with `bazel run
gazelle`. This will analyze and merge changes into the build files as
necessitated by certain code changes (e.g. the introduction of new
modules).
In some cases there are hints for gazelle in the build files, such as `#
gazelle:erlang...` or `# keep` comments. xref checks on plugins that
depend on the CLI are a good example.
This new module sits on top of `rabbit_mnesia` and provides an API with
all cluster-related functions.
`rabbit_mnesia` should be called directly only from Mnesia-specific
code, such as `rabbit_mnesia_rename` or classic mirrored queues.
Otherwise, `rabbit_db_cluster` must be used.
Several modules, in particular in `rabbitmq_cli`, continue to call
`rabbit_mnesia` as a fallback if the `rabbit_db_cluster` module is
unavailable. This is the case when the CLI interacts with an older
RabbitMQ version.
This will help with the introduction of a new database backend.
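As an illustration of that fallback, a minimal sketch assuming a `members/0` function on `rabbit_db_cluster` and a hypothetical `cluster_members/1` wrapper (the real CLI code is Elixir and may differ):
```
%% Prefer rabbit_db_cluster on the remote node and fall back to
%% rabbit_mnesia when the module/function is not available there
%% (e.g. an older RabbitMQ version). Function names are illustrative.
cluster_members(Node) ->
    try
        erpc:call(Node, rabbit_db_cluster, members, [])
    catch
        error:{exception, undef, _} ->
            erpc:call(Node, rabbit_mnesia, cluster_nodes, [all])
    end.
```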
So far, we had the following functions to list nodes in a RabbitMQ
cluster:
* `rabbit_mnesia:cluster_nodes/1` to get members of the Mnesia cluster;
the argument was used to select members (all members or only those
running Mnesia and participating in the cluster)
* `rabbit_nodes:all/0` to get all members of the Mnesia cluster
* `rabbit_nodes:all_running/0` to get all members who currently run
Mnesia
Basically:
* `rabbit_nodes:all/0` calls `rabbit_mnesia:cluster_nodes(all)`
* `rabbit_nodes:all_running/0` calls `rabbit_mnesia:cluster_nodes(running)`
We also have:
* `rabbit_node_monitor:alive_nodes/1` which filters the given list of
nodes to only select those currently running Mnesia
* `rabbit_node_monitor:alive_rabbit_nodes/1` which filters the given
list of nodes to only select those currently running RabbitMQ
Most of the code uses `rabbit_mnesia:cluster_nodes/1` or the
`rabbit_nodes:all*/0` functions. `rabbit_mnesia:cluster_nodes(running)`
or `rabbit_nodes:all_running/0` is often used as a close approximation
of "all cluster members running RabbitMQ". This list might be incorrect
in times where a node is joining the clustered or is being worked on
(i.e. Mnesia is running but not RabbitMQ).
With Khepri, there won't be the same possible approximation because we
will try to keep Khepri/Ra running even if RabbitMQ is stopped to
expand/shrink the cluster.
So in order to clarify what we want when we query a list of nodes, this
patch introduces the following functions:
* `rabbit_nodes:list_members/0` to get all cluster members, regardless
of their state
* `rabbit_nodes:list_reachable/0` to get all cluster members we can
reach using Erlang distribution, regardless of the state of RabbitMQ
* `rabbit_nodes:list_running/0` to get all cluster members who run
RabbitMQ, regardless of the maintenance state
* `rabbit_nodes:list_serving/0` to get all cluster members who run
RabbitMQ and are accepting clients
In addition to the list functions, there are the corresponding
`rabbit_nodes:is_*(Node)` checks and `rabbit_nodes:filter_*(Nodes)`
filtering functions.
The code is modified to use these new functions. One potentially
significant change is that the new list functions will perform RPC calls
to query the nodes' state, unlike `rabbit_mnesia:cluster_nodes(running)`.
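As an illustration of the intent behind each function (the surrounding helpers `notify/2` and `my_app:handle_event/1` are hypothetical):
```
%% Illustrative only: choose the list that matches the caller's intent.
notify_all_members(Event) ->
    %% every cluster member, whatever its state
    [notify(Node, Event) || Node <- rabbit_nodes:list_members()].

dispatch_to_serving_nodes(Event) ->
    %% only members that run RabbitMQ and accept clients
    [erpc:cast(Node, my_app, handle_event, [Event])
     || Node <- rabbit_nodes:list_serving()].
```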
In the MQTT test assertions, instead of checking whether the test runs
in mixed version mode where all non-required feature flags are disabled
by default, check whether the given feature flag is enabled.
Prior to this commit, once the rabbit_mqtt_qos0_queue feature flag
became required, these test cases would have failed.
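A minimal sketch of that kind of assertion, assuming a Common Test `Config` from the broker helpers; `assert_qos0_queue_behaviour/1` and `assert_classic_behaviour/1` are hypothetical stand-ins for the real assertions:
```
%% Branch the assertion on the feature flag state itself rather than on
%% mixed-version mode.
FlagEnabled = rabbit_ct_broker_helpers:rpc(
                Config, 0,
                rabbit_feature_flags, is_enabled, [rabbit_mqtt_qos0_queue]),
case FlagEnabled of
    true  -> assert_qos0_queue_behaviour(Config);
    false -> assert_classic_behaviour(Config)
end.
```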
New test suite deps/rabbitmq_mqtt/test/shared_SUITE contains tests that
are executed against both MQTT and Web MQTT.
This has two major advantages:
1. Eliminates test code duplication between rabbitmq_mqtt and
rabbitmq_web_mqtt, making the tests easier to maintain and to understand.
2. Increases test coverage of Web MQTT.
It's acceptable to add a **test** dependency from rabbitmq_mqtt to
rabbitmq_web_mqtt. Obviously, there should be no such dependency
for non-test code.
Prior to this commit, when connecting or disconnecting many thousands of
MQTT subscribers, RabbitMQ printed many times:
```
[warning] <0.241.0> Mnesia('rabbit@mqtt-rabbit-1-server-0.mqtt-rabbit-1-nodes.default'): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}
```
Each MQTT subscription causes queues and bindings to be written into Mnesia.
In order to allow for higher Mnesia load, the user can configure
```
[
  {mnesia, [
    {dump_log_write_threshold, 10000}
  ]}
].
```
in `advanced.config`, or set this value via
```
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-mnesia dump_log_write_threshold 10000"
```
The Mnesia default for dump_log_write_threshold is 1,000.
The Mnesia default for dump_log_time_threshold is 180,000 ms.
It is reasonable to increase the default for dump_log_write_threshold from
1,000 to 5,000 and in return decrease the default dump_log_time_threshold
from 3 minutes to 1.5 minutes.
This way, users can achieve higher MQTT scalability by default.
This setting cannot be changed at Mnesia runtime; it needs to be set
before Mnesia is started.
Since the rabbitmq_mqtt plugin can be enabled dynamically after Mnesia
has started, this setting must apply globally to RabbitMQ.
Users can continue to set their own defaults via advanced.config or
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS. They continue to be respected
as shown by the new test suite included in this commit.
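A minimal sketch of how such defaults could be applied, not necessarily the actual implementation: they must land in Mnesia's application environment before `mnesia` is started, and user-provided values must keep taking precedence.
```
%% Sketch only: apply the new defaults unless the user already configured a
%% value via advanced.config or RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS.
maybe_set_mnesia_default(Key, Default) ->
    case application:get_env(mnesia, Key) of
        undefined        -> ok = application:set_env(mnesia, Key, Default);
        {ok, _UserValue} -> ok
    end.

set_mnesia_defaults() ->
    %% must run before mnesia is started; cannot be changed at runtime
    ok = maybe_set_mnesia_default(dump_log_write_threshold, 5000),
    ok = maybe_set_mnesia_default(dump_log_time_threshold, 90000).
```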
The directory name for tools in runfiles is a bit unpredictable. Maybe
there is a better way, but at least this should cover all the cases
that I've observed.
One unexpected consequence is that this directory name can contain `~`,
which is not properly escaped in `ct:pal` calls (causing `badarg` errors
in `io:format/4`).
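For reference (not part of the change itself), the safe pattern is to pass such a path as a format argument rather than using it as the format string:
```
%% A path containing "~" must not be used as the format string itself:
ct:pal("tool directory: ~ts", [ToolDir]).   %% safe
%% ct:pal(ToolDir) would fail with badarg if ToolDir contains "~".
```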
The location and name of this directory remains the same for
compatibility reasons. Therefore, it still contains "mnesia" in its name.
However, semantically, we want this directory to be unrelated to Mnesia.
In the end, many subsystems write files and directories there, including
Mnesia, all Ra systems and in the future, Khepri.
Previously it was not possible to see code coverage for the majority of
test cases: integration tests that create RabbitMQ nodes.
It was only possible to see code coverage for unit tests.
This commit makes it possible to see code coverage for tests that create
RabbitMQ nodes.
The only thing you need to do is set the `COVER` variable; for example
```
make -C deps/rabbitmq_mqtt ct COVER=1
```
will show you coverage across all tests in the MQTT plugin.
Whenever a RabbitMQ node is started, `ct_cover:add_nodes/1` is called.
Contrary to the documentation, which states
> To have effect, this function is to be called from init_per_suite/1 (see common_test) before any tests are performed.
I found that it also works in init_per_group/1 or even within the test cases themselves.
Whenever a RabbitMQ node is stopped or killed, `ct_cover:remove_nodes/1`
is called to transfer results from the RabbitMQ node to the CT node.
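A simplified sketch of those two hook points (wrapper names are illustrative; errors such as cover not running are simply ignored here):
```
%% Register a freshly started RabbitMQ node with the cover server and, when
%% the node is stopped or killed, pull its coverage data back to the CT node.
cover_add_node(Nodename) ->
    _ = ct_cover:add_nodes([Nodename]),
    ok.

cover_remove_node(Nodename) ->
    _ = ct_cover:remove_nodes([Nodename]),
    ok.
```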
Since erlang.mk writes a file called `test/ct.cover.spec` that includes
the line:
```
{export,".../rabbitmq-server/deps/rabbitmq_mqtt/cover/ct.coverdata"}.
```
results across all test suites will be accumulated in that file.
The accumulated result can be seen through the link `Coverage log` on the test suite result pages.
Back in 2016, JSON encoding and much of the Erlang ecosystem used
proplists, which can lead to duplicate keys in JSON documents.
In 2022 some JSON libraries only decode JSON to maps, and maps have
unique keys, so these tests are not worth adjusting or reproducing with
maps.
Per discussion with the team.
When the fetch of the secondary umbrella was moved into bzlmod, this
changed its path at the time of test execution. Tests will now fail if
a secondary umbrella is specified for bazel but does not exist. The
path is also corrected.
'us' is used when Unicode is not available.
Prior to this commit:
```
$ kubectl logs r1-server-0 -c rabbitmq | ag time
2022-06-30 13:37:35.253927+00:00 [debug] <0.336.0> wal: recovered 00000003.wal time taken 0ms
2022-06-30 13:37:35.262592+00:00 [debug] <0.349.0> wal: recovered 00000003.wal time taken 0ms
2022-06-30 13:37:35.489016+00:00 [debug] <0.352.0> Feature flags: time to find supported feature flags: 76468 �s
2022-06-30 13:37:35.495193+00:00 [debug] <0.352.0> Feature flags: time to regen registry: 6032 �s
2022-06-30 13:37:35.500574+00:00 [debug] <0.361.0> Feature flags: time to find supported feature flags: 937 �s
2022-06-30 13:37:35.500603+00:00 [debug] <26705.398.0> Feature flags: time to find supported feature flags: 891 �s
2022-06-30 13:37:35.507998+00:00 [debug] <26705.398.0> Feature flags: time to regen registry: 7199 �s
2022-06-30 13:37:35.509092+00:00 [debug] <0.361.0> Feature flags: time to regen registry: 8396 �s
```
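An illustrative sketch of the fallback, not the actual patch: pick the unit label based on whether the emulator runs with Unicode support.
```
%% io:printable_range/0 returns `unicode` or `latin1` depending on how the
%% emulator was started; fall back to the ASCII "us" in the latter case.
micros_unit() ->
    case io:printable_range() of
        unicode -> "µs";
        latin1  -> "us"
    end.
```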
common_test installs its own logger handler, which is great.
Unfortunately, this logger handler drops all messages having a domain,
except when the domain is among the domains used by Erlang itself.
In RabbitMQ, we use logger domains to categorize messages. Therefore
those messages are dropped by common_test's logger handler.
This commit introduces another logger handler which sits on top of the
common_test one and makes sure messages with a domain are logged as
well.
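A minimal sketch of the idea, without assuming anything about the real handler module: attach an extra standard handler whose filter only lets through events that carry a domain.
```
%% Sketch: events with a domain are passed on to this handler; everything
%% else is left to the common_test handler, so nothing is logged twice.
ok = logger:add_handler(
       ct_domain_passthrough, logger_std_h,
       #{config  => #{type => standard_io},
         filters => [{has_domain,
                      {fun(#{meta := #{domain := _}} = Event, _Extra) -> Event;
                          (_Event, _Extra) -> stop
                       end, none}}]}).
```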
When declaring a quorum queue or a stream, select its replicas in the
following order:
1. local RabbitMQ node (to have data locality for declaring client)
2. running RabbitMQ nodes
3. RabbitMQ nodes with the fewest quorum queue or stream replicas (to keep the RabbitMQ cluster "balanced").
From now on, quorum queues and streams behave the same way with respect
to replica selection and leader locator strategies.
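Purely as an illustration of that ordering, not the actual selection code (`is_running/1` and the replica-count map are assumptions):
```
%% Sort candidates so the local node comes first, then running RabbitMQ
%% nodes, then nodes with the fewest existing replicas; take N of them.
select_replicas(N, Candidates, ReplicaCounts) ->
    Key = fun(Node) ->
                  {Node =/= node(),                     %% local node first
                   not is_running(Node),                %% running nodes next
                   maps:get(Node, ReplicaCounts, 0)}    %% fewest replicas last
          end,
    Sorted = lists:sort(fun(A, B) -> Key(A) =< Key(B) end, Candidates),
    lists:sublist(Sorted, N).
```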
so that these functions can be reused in other tests.
Inspired by Gomega's Eventually and Consistently functions.
See https://onsi.github.io/gomega/#making-asynchronous-assertions
"Eventually checks that an assertion eventually passes. Eventually blocks
when called and attempts an assertion periodically until it passes or a
timeout occurs. Both the timeout and polling interval are configurable
as optional arguments."
"Consistently checks that an assertion passes for a period of time. It
does this by polling its argument repeatedly during the period. It fails
if the matcher ever fails during that period."
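A minimal Erlang sketch of both helpers, under the assumption that the real test helpers may differ in shape:
```
%% `eventually` retries a boolean check until it passes or the timeout
%% expires; `consistently` requires the check to keep passing for the
%% whole period. Poll interval and timeout are plain arguments here.
eventually(_Fun, _PollMs, TimeoutMs) when TimeoutMs =< 0 ->
    error(eventually_timed_out);
eventually(Fun, PollMs, TimeoutMs) ->
    case Fun() of
        true  -> ok;
        false -> timer:sleep(PollMs),
                 eventually(Fun, PollMs, TimeoutMs - PollMs)
    end.

consistently(_Fun, _PollMs, DurationMs) when DurationMs =< 0 ->
    ok;
consistently(Fun, PollMs, DurationMs) ->
    true = Fun(),
    timer:sleep(PollMs),
    consistently(Fun, PollMs, DurationMs - PollMs).
```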
This is the build error prior to these changes:
```
* rabbit_common (/home/bakkenl/development/rabbitmq/rabbitmq-server/deps/rabbit_common)
could not find an app file at "_build/dev/lib/rabbit_common/ebin/rabbit_common.app". This may happen if the dependency was not yet compiled or the dependency indeed has no app file (then you can pass app: false as option)
** (Mix) Can't continue due to errors on dependencies
```
Telling `mix` to compile `rabbit_common` ensures that the following
links are created:
```
$ ll deps/rabbitmq_cli/_build/dev/lib/rabbit_common/
total 8
drwxr-xr-x 2 bakkenl bakkenl 4096 Jan 20 09:46 .
drwxr-xr-x 10 bakkenl bakkenl 4096 Jan 20 09:46 ..
lrwxrwxrwx 1 bakkenl bakkenl 33 Jan 20 09:46 ebin -> ../../../../../rabbit_common/ebin
lrwxrwxrwx 1 bakkenl bakkenl 36 Jan 20 09:46 include -> ../../../../../rabbit_common/include
```
bazel-erlang has been renamed rules_erlang. v2 is a substantial
refactor that brings Windows support. While this alone isn't enough to
run all rabbitmq-server suites on Windows, one can at least now start
the broker (`bazel run broker`) and run the tests that do not start a
background broker process.
Some tests would make rpc calls, ignoring the result. Since erpc raises
exceptions instead of returning error tuples, this broke some tests. This
commit restores the
non-throw behavior to add_vhost, delete_vhost, add_user,
set_user_tags, delete_user & clear_permissions.
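For reference, this is the behavioural difference being compensated for (`some_mod:some_fun/0` is a placeholder):
```
%% rpc:call/4 returns {badrpc, Reason} on failure, so callers can ignore the
%% result; erpc:call/4 raises an exception instead, which breaks such callers
%% unless the exception is caught.
ignored_with_rpc(Node) ->
    _ = rpc:call(Node, some_mod, some_fun, []),          %% {badrpc, _} on error
    ok.

ignored_with_erpc(Node) ->
    _ = (catch erpc:call(Node, some_mod, some_fun, [])), %% would raise otherwise
    ok.
```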
Modern Python and OpenSSL versions can reject certificates that use
SHA-1 as insufficiently secure. This is the case with Python 3 on
Debian Buster, for example.
Per discussion with @pjk25 @dumbbell
After these files were consolidated at the root of rabbitmq-server,
they were no longer uploaded to terraformed VMs as needed. We now
check for the files at their original location, falling back to the
central location for upload.
This helper is designed to perform exact matching between the generated
configuration and the expected value. This does not work at all if the
schema has default values for untested configuration variables.
The correct solution would be to rewrite this helper and all testsuites
using it to do pattern matching instead. But in the meantime, work
around this design issue by removing the `{rabbit, {log, _}}`
configuration key.
It looks like the message sent by erlang:start_timer/4 conflicts with
something else, perhaps inside common_test.
Hopefully, by using timer:send_after/2 and thus another message format,
the possible conflict will go away.
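For reference, the two APIs deliver different message formats, which is what makes switching a plausible fix:
```
%% erlang:start_timer/3,4 wraps the message in a {timeout, Ref, Msg} tuple,
%% while timer:send_after/2 delivers the message as-is.
Ref = erlang:start_timer(1000, self(), ping),   %% receive {timeout, Ref, ping}
{ok, _TRef} = timer:send_after(1000, ping),     %% receive ping
```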
The following code may exit with a badmatch if the vhost is gone in
between:
```
get_message_store_pid(Config, Node, VHost) ->
    {ok, VHostSup} = rpc(Config, Node,
                         rabbit_vhost_sup_sup, get_vhost_sup, [VHost]),
```
That's what we now catch in `force_vhost_failure/4`.
We didn't do it when querying the manifest. Unfortunately, recently
terraform(1) started to prompt for confirmation. The command was thus
stuck and the testcase was timing out.
The m4.large could build Erlang and the testsuite could run in 28
minutes. That's an improvement, but we are still close to the limit.
Rather than bump the limit, try with an m5.large. It's also a bit
cheaper, to my surprise.
The previous default of t2.micro was insufficient to compile Erlang from
sources in under 30 minutes. This caused the integration testsuite to
timeout.
Hopefully an m4.large instance type will be enough.
and add a VMware copyright notice.
We did not mean to make this code Incompatible with Secondary Licenses
as defined in [1].
1. https://www.mozilla.org/en-US/MPL/2.0/FAQ/
This is sometimes failing in GitHub Actions and we don't know why:
https://github.com/rabbitmq/rabbitmq-server/runs/877118328?check_suite_focus=true#step:6:6099
We confirmed in the logs that only 1 out of 3 nodes gets unblocked. The
CT suite hits the 15 minute time trap and fails. We don't know whether
this rpc call doesn't make it through to the first or second node, or if
it does and the rpc call simply doesn't return within the time window.
We can't address this if we don't know where the problem lies, so this
will give us more insight when it fails again.
Signed-off-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
awaitMatch(Guard, Expr, Timeout) re-evaluates Expr until it matches
Guard, for up to Timeout milliseconds from now. It returns the matched
value, in case that value is useful later in a test.
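A minimal sketch of such a macro, with the caveat that the real helper may be implemented differently:
```
%% Re-evaluate Expr until it matches Guard or Timeout milliseconds elapse;
%% return the matched value, or fail with the last seen value on timeout.
-define(awaitMatch(Guard, Expr, Timeout),
        (fun() ->
                 Deadline = erlang:monotonic_time(millisecond) + (Timeout),
                 (fun Retry() ->
                          case (Expr) of
                              Guard = Matched ->
                                  Matched;
                              Other ->
                                  case erlang:monotonic_time(millisecond) > Deadline of
                                      true  -> error({awaitMatch_timeout, Other});
                                      false -> timer:sleep(50), Retry()
                                  end
                          end
                  end)()
         end)()).
```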
Additionally, simplify an instance of `?assertEqual(true, ...)` to `?assert(...)`.
`make test-dist` was already executed for the project being tested,
therefore we can skip the build to save time when a RabbitMQ node is
started from there.
However, if the node is to be started from another place (i.e. `rabbit`
when plugins are disabled), we must not skip the build because the
project might have no .ez files created at this point.
This reverts part of
dc5a04a503
because tests started failing in GitHub Actions with:
```
2020-04-30 16:38:32.238 [error] <0.228.0> Supervisor inet_tcp_proxy_dist_conn_sup had child
{undefined,false,#Ref<0.301488671.524812289.157484>}
started with
{inet_tcp_proxy_dist,dist_proc_start_link,undefined} at <0.776.0>
exit with reason net_tick_timeout in context child_terminated
```
We suspect that this is due to CPU contention on GitHub Actions shared runners.
When 5 Erlang VM nodes with 2 schedulers each start at the same time on a
host with 2 CPUs and then try to cluster via rabbitmqctl (which starts
5 more Erlang VMs), the 5 second net_tick_time is not long enough.
Rather than increasing the net_tick_time, we are choosing to put less
pressure on the host by clustering nodes one-by-one rather than all at
once.
Pair @dumbbell
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
I don't remember why I used `stop-rabbit-on-node` then `stop-node`. But
this meant two CLI runs, which cost a lot of time.
Now, `stop-node` does not use the CLI anymore and getting rid of
`stop-rabbit-on-node` reduces the number of CLI runs to 0, improving the
time it takes to stop RabbitMQ significantly: it shaved about 1 second,
giving a stop time of about 3 seconds now on my laptop.
For cases where a condition can materialize eventually but
we do not know when exactly since we have to observe it from
the outside.
E.g. a cluster of nodes can be formed in a second or two, there's
a randomized delay on startup involved by design.
If the `start-background-broker` recipe fails, it is possible that the
node was started but the follow-up wait/status failed.
To avoid leaving an unused node around, we try to stop it in case of a
failure and ignore the result of that stop recipe.
This situation happened in CI where Elixir seems to crash (one of the
two CLI commands we run after starting the node):
```
Logger - error: {removed_failing_handler,'Elixir.Logger'}
Logger - error: {removed_failing_handler,'Elixir.Logger'}
Logger - error: {removed_failing_handler,'Elixir.Logger'}
escript: exception error: undefined function 'Elixir.Exception':blame/3
  in function 'Elixir.Kernel.CLI':format_error/3 (lib/kernel/cli.ex, line 82)
  in call from 'Elixir.Kernel.CLI':print_error/3 (lib/kernel/cli.ex, line 173)
  in call from 'Elixir.Kernel.CLI':exec_fun/2 (lib/kernel/cli.ex, line 150)
  in call from 'Elixir.Kernel.CLI':run/1 (lib/kernel/cli.ex, line 47)
  in call from escript:run/2 (escript.erl, line 758)
  in call from escript:start/1 (escript.erl, line 277)
  in call from init:start_em/1
.../rabbit_common/mk/rabbitmq-run.mk:323: recipe for target 'start-background-broker' failed
```
In this case, rabbit_ct_broker_helpers tried again to start the node and
it worked. But it affected an unrelated testcase later because it tried
to use a TCP port already used by that left-over node.
rabbitmq_ct_helpers ensures everything is built earlier, so no need to
try again. This saves a bit of time and hopefully fixes a few
situations where RabbitMQ is recompiled without test code.