[Why]
Once `khepri_db` is enabled by default, we need another way to disable
it in order to select Mnesia instead.
[How]
We use the new relative forced feature flags mechanism to indicate
whether we want to explicitly enable or disable `khepri_db`. This way,
we don't touch other stable feature flags and only mess with Khepri.
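For example, assuming the relative syntax with `+`/`-` prefixes, the
variable might be set like this:
```
# Toggle a single flag without forcing the state of any other flag.
RABBITMQ_FEATURE_FLAGS=+khepri_db   # explicitly enable khepri_db
RABBITMQ_FEATURE_FLAGS=-khepri_db   # explicitly disable it (use Mnesia)
```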
However, this mechanism is not supported by RabbitMQ 4.0.x and older.
They will ignore the setting. Therefore, to make this work in
mixed-version testing, we set the `$RABBITMQ_FEATURE_FLAGS` variable for
the secondary umbrella. This part will go away once we test against
RabbitMQ 4.1.x as the secondary umbrella.
At the end, we compare the effective metadata store to the expected one.
If they don't match, we skip the test.
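A minimal sketch of what that check could look like in a suite,
assuming `rabbit_khepri:is_enabled/0` as the probe (names are
illustrative, not the exact code from this commit):
```erlang
%% Hypothetical sketch: skip the suite when the effective metadata
%% store differs from the expected one.
init_per_suite(Config) ->
    Expected = proplists:get_value(metadata_store, Config),
    UsingKhepri = rabbit_ct_broker_helpers:rpc(
                    Config, 0, rabbit_khepri, is_enabled, []),
    case {Expected, UsingKhepri} of
        {khepri, true}  -> Config;
        {mnesia, false} -> Config;
        _               -> {skip, "Effective metadata store mismatch"}
    end.
```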
While here, change `rjms_topic_selector_SUITE` to only choose Khepri
without specifying any feature flags.
Parallel/sharding groups often fail to create certificates in CI.
Most likely this is related to the fact that they use the same
directory for certificates. This commit uses the shard/node name and a
unique ID for each SSL certificate.
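A sketch of the idea (function and variable names are hypothetical,
not the exact code from this commit):
```erlang
%% Derive a per-shard/per-node certificate directory with a unique
%% suffix so parallel groups never share the same path.
unique_cert_dir(PrivDir) ->
    UniqueId = erlang:unique_integer([positive]),
    Name = lists:flatten(io_lib:format("certs-~ts-~b", [node(), UniqueId])),
    CertDir = filename:join(PrivDir, Name),
    ok = filelib:ensure_dir(filename:join(CertDir, "dummy")),
    CertDir.
```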
The problem comes from `ct_master`, which doesn't tell us in its
return value whether the tests succeeded. To get that information, a
CT hook was created. But then we run into another problem: despite its
documentation claiming otherwise, `ct_master` does not handle
`ct_hooks` instructions in the test spec.
So for the time being we fork `ct_master` into a new
`ct_master_fork` module and insert our hook directly
in the code. Later on we will submit patches to OTP.
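For reference, a minimal `ct_hooks` module that records failures could
look like this (the module name and state shape are hypothetical, not
the actual hook used here):
```erlang
-module(results_cth).
-export([init/2, post_end_per_testcase/5]).

%% Keep a count of failed test cases in the hook state.
init(_Id, _Opts) ->
    {ok, #{failed => 0}}.

%% Bump the counter whenever a test case returns an error.
post_end_per_testcase(_Suite, _TC, _Config, {error, _} = Return, State) ->
    {Return, maps:update_with(failed, fun(N) -> N + 1 end, State)};
post_end_per_testcase(_Suite, _TC, _Config, Return, State) ->
    {Return, State}.
```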
Reverting to the default of 1 minute. The problem with 3 minutes is
that it is exceedingly long, and when there are problems the test time
increases dramatically.
Because `ct_master` is yet another Erlang node, used to run multiple
CT nodes, it effectively forms a cluster with them. As a result, tests
that change the net_ticktime could no longer work properly, because
net_ticktime must be the same value across the cluster.
To solve this, the same value had to be set for all tests, which is
why it was changed to 5s across the board. The lower net_ticktime was
used in most places to speed up tests that must deal with cluster
failures, so that value is good enough for those cases.
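For reference, net_ticktime is a kernel application parameter
expressed in seconds (default 60); a uniform value can be set, for
example, in advanced.config:
```erlang
%% Set the same net_ticktime on every node in the cluster.
[{kernel, [{net_ticktime, 5}]}].
```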
One test in amqp_client was using the net_ticktime to test
the behavior of the direct connection timeout with varying
net_ticktime configurations. The test now mocks the
`net_kernel:get_net_ticktime()` function to achieve the
same result.
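A sketch of that approach, assuming meck is used for the mocking:
```erlang
%% Inside the test case: pretend net_ticktime is 120s without
%% touching the real cluster setting. net_kernel is a sticky
%% module, hence the 'unstick' option.
ok = meck:new(net_kernel, [unstick, passthrough]),
ok = meck:expect(net_kernel, get_net_ticktime, fun() -> 120 end),
%% ... exercise the direct connection timeout logic here ...
ok = meck:unload(net_kernel).
```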
This has no real impact on performance[1] but should make it clear
which applications can run the broker and/or publish to Hex.pm. In
particular, applications from which we can't run the broker will now
give up early if we try to.
Note that while the broker can't normally run from the amqp_client
application's directory, it can run from tests, and some of the tests
start the broker.
[1] on my machine
This relaxes the assert_list/2 assertion so that a returned list
element no longer needs to have exactly the same size as the expected
one. Sometimes it makes perfect sense to assert on only a subset of
keys rather than every single key, and with this change that is now
possible. Individual tests may still assert on all keys by listing
them explicitly.
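The relaxed check boils down to subset matching: every expected
key/value pair must be present in the returned element, while extra
keys are ignored. A minimal sketch (the function name is illustrative):
```erlang
%% Returns true when every expected pair appears in the actual
%% element; the actual element may carry additional keys.
matches_subset(Expected, Actual) ->
    lists:all(fun({K, V}) ->
                  proplists:get_value(K, Actual) =:= V
              end, Expected).
```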
RabbitMQ should advertise the SASL mechanisms in the order in which
they are configured in `rabbitmq.conf`.
Starting RabbitMQ with the following `rabbitmq.conf`:
```
auth_mechanisms.1 = PLAIN
auth_mechanisms.2 = AMQPLAIN
auth_mechanisms.3 = ANONYMOUS
```
translates prior to this commit to:
```
1> application:get_env(rabbit, auth_mechanisms).
{ok,['ANONYMOUS','AMQPLAIN','PLAIN']}
```
and after this commit to:
```
1> application:get_env(rabbit, auth_mechanisms).
{ok,['PLAIN','AMQPLAIN','ANONYMOUS']}
```
In our 4.0 docs we write:
> The server mechanisms are ordered in decreasing level of preference.
which complies with https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-security-v1.0-os.html#type-sasl-mechanisms
This suite uses the mixed version secondary umbrella as a starting
version for a cluster and then has a helper to upgrade the cluster to
the current code. This is meant to ensure that we can upgrade from the
previous minor.
Test case rabbit_mqtt_qos0_queue_kill_node flaked because, after an
MQTT client subscribed on node 0, RabbitMQ returned success and
replicated the new binding to node 0 and node 1, but not yet to node
2. Another MQTT client then published on node 2 before the binding was
present there, so the message wasn't routed.
This commit attempts to eliminate this flake.
It adds a function to rabbit_ct_broker_helpers which waits until a given
node has caught up with the leader node.
We can reuse that function in future to eliminate more test flakes.
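The general shape of such a helper is a bounded poll; the following is
a sketch, not the exact function added here:
```erlang
%% Poll Pred every 100 ms until it returns true or Timeout ms elapse.
%% For this use case, Pred would compare a follower's replication
%% progress with the leader's.
wait_until(_Pred, Timeout) when Timeout =< 0 ->
    error(timeout);
wait_until(Pred, Timeout) ->
    case Pred() of
        true  -> ok;
        false -> timer:sleep(100),
                 wait_until(Pred, Timeout - 100)
    end.
```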
We don't need to duplicate so many patterns in so many
files since we have a monorepo (and want to keep it).
If I managed to miss something or remove something that
should stay, please put it back. Note that monorepo-wide
patterns should go in the top-level .gitignore file.
Other .gitignore files are for application- or folder-specific
patterns.
The DIST step used rsync for copying files; switching it to cp/rm
provides a noticeable speed boost.
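Conceptually, the copy becomes something like this (a hypothetical
sketch; the variable and target names are illustrative):
```make
# Replace the rsync-based copy with a plain remove-then-copy,
# which is faster for this use case.
$(DIST_TARGET): $(DIST_SOURCE)
	rm -rf $(DIST_TARGET)
	cp -R $(DIST_SOURCE) $(DIST_TARGET)
```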
Before this commit the situation was as follows. With
FAST_RUN_BROKER=1 we are pretty fast but don't benefit
from parallel make:
```
make -C deps/rabbitmq_management run-broker FAST_RUN_BROKER=1
2,04s user 1,57s system 90% cpu 4,016 total

make -C deps/rabbitmq_management run-broker FAST_RUN_BROKER=1 -j8
2,08s user 1,55s system 89% cpu 4,069 total
```
With FAST_RUN_BROKER=0 we are slow; on the other hand
we greatly benefit from parallel make:
```
make -C deps/rabbitmq_management run-broker FAST_RUN_BROKER=0
3,29s user 1,93s system 81% cpu 6,425 total

make -C deps/rabbitmq_management run-broker FAST_RUN_BROKER=0 -j8
3,36s user 1,90s system 142% cpu 3,695 total
```
The reason this method achieves such a result is that the
time-consuming DIST step can be run in parallel. In addition, this
method results in only the necessary plugins being available in the
path, so the node doesn't discover unrelated plugins during startup,
saving time.
By changing rsync to cp/rm, we get great results even
without parallel make:
```
make -C deps/rabbitmq_management run-broker FAST_RUN_BROKER=0
3,28s user 1,64s system 105% cpu 4,684 total

make -C deps/rabbitmq_management run-broker FAST_RUN_BROKER=0 -j8
3,27s user 1,65s system 135% cpu 3,640 total
```
We are within 1s of FAST_RUN_BROKER=1 by default, and
faster than FAST_RUN_BROKER=1 with parallel make. On
top of that, we greatly benefit when rebuilding as the
DIST files do not need to be rebuilt every time:
```
make -C deps/rabbitmq_management run-broker FAST_RUN_BROKER=0
2,94s user 1,40s system 107% cpu 4,035 total

make -C deps/rabbitmq_management run-broker FAST_RUN_BROKER=0 -j8
2,85s user 1,51s system 138% cpu 3,140 total
```
Therefore it only makes sense to remove FAST_RUN_BROKER and instead
use the old method, which is both more correct and more amenable to
optimisation.
Both the FULL and MAKEFLAGS env variables need to be unset, as FULL=1
is present in both. This is a bit of a band-aid; it's possible that
other variables get propagated that shouldn't be, but we'll fix them
as they are detected.
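One way to do that is to scrub both variables from the sub-make's
environment (a hypothetical sketch; the target and path are
illustrative):
```make
# FULL=1 leaks through both the environment and MAKEFLAGS, so unset
# both before recursing into the sub-make.
start-secondary:
	env -u FULL -u MAKEFLAGS $(MAKE) -C deps/some_plugin run-broker
```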
We want to be strict for tests: only the code that should be available
when testing a plugin should be in the path. With FAST_RUN_BROKER=1,
for the time being, this is not the case: all plugins' code is
available to be loaded.