A boolean status in the stream SAC coordinator is not enough to follow
the evolution of a consumer. For example, a formerly active consumer that
is stepping down can go down before another consumer in the group is
activated. The coordinator then expects an activation request that
will never arrive, leaving the group without any active consumer.
This commit introduces 3 statuses: active (formerly "true"), waiting
(formerly "false"), and deactivating. The coordinator now knows when
a deactivating consumer goes down and triggers a rebalancing to
avoid a stuck group.
This commit also introduces a status for the connectivity state
of a consumer. The possible values are: connected, disconnected, and
presumed_down. Consumers are connected by default; they become
disconnected if the coordinator receives a down event with a
noconnection reason, meaning the consumer's node has been
disconnected from the other nodes. Consumers become connected again when
their node rejoins the other nodes.
Disconnected consumers are still considered part of a group, as they are
expected to come back at some point. For example, there is no rebalancing
in a group if the active consumer gets disconnected.
The coordinator sets a timer when a disconnection occurs. When the timer
expires, the corresponding disconnected consumers move to the "presumed
down" state. At this point they are no longer considered part of their
respective group and are excluded from rebalancing decisions. They are
expected to be removed from the group by the appropriate down event of a
monitor.
So the consumer status is now a tuple, e.g. {connected, active}. Note
this is an implementation detail: only the stream SAC coordinator deals with
the status of stream SAC consumers.
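For illustration, a minimal Erlang sketch of the combined status under the
names used above (the actual type and module names in the coordinator may
differ):
```
%% Hedged sketch; real type names in the stream SAC coordinator may differ.
-type connectivity() :: connected | disconnected | presumed_down.
-type activity() :: active | waiting | deactivating.
-type consumer_status() :: {connectivity(), activity()}.
```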
2 new configuration entries are introduced (see the example after the list):
* rabbit.stream_sac_disconnected_timeout: this is the duration in ms of the
disconnected-to-forgotten timer.
* rabbit.stream_cmd_timeout: this is the timeout in ms to apply Ra commands
in the coordinator. It used to be a fixed value of 30 seconds, and the
default value is still the same. The setting has been introduced to
make integration tests faster.
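Assuming these map to the `rabbit` application environment, as their prefix
suggests, they could be set in `advanced.config` like this (the disconnected
timeout value below is purely illustrative):
```
[
 {rabbit, [
   %% duration in ms of the disconnected-to-forgotten timer
   {stream_sac_disconnected_timeout, 60000},
   %% timeout in ms for applying Ra commands; default is still 30 seconds
   {stream_cmd_timeout, 30000}
 ]}
].
```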
Fixes #14070
## What?
PR #13971 added a property test that applies the same quorum queue Raft
command on different quorum queue members on different Erlang nodes
ensuring that the state machine ends up in exactly the same state.
However, the different Erlang nodes all run the **same** Erlang/OTP version.
This commit adds another property test where the different Erlang nodes
run **different** Erlang/OTP versions.
## Why?
This test allows spotting any non-determinism that could occur when
running quorum queue members in a mixed-version cluster, where, in our
context, mixed version means different Erlang/OTP versions.
## How?
CI currently runs tests with Erlang 27.
This commit starts an Erlang 26 node in docker, specifically for the
`rabbit_fifo_prop_SUITE`.
The test case `two_nodes_different_otp_version`, running on Erlang 27,
transfers a few Erlang modules (e.g. module `rabbit_fifo`) to the Erlang 26 node.
The test case then runs the Ra commands on its own node in Erlang 27 and
on the Erlang 26 node in Docker.
By default, this test case is skipped locally.
However, to run this test case locally, simply start an Erlang node as
follows:
```
erl -sname rabbit_fifo_prop@localhost
```
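The transfer itself can be done with standard OTP calls; here is a hedged
sketch (the suite's actual helper may differ):
```
%% Send the locally loaded module to the remote node and load it there.
transfer_module(Node, Module) ->
    {Module, Binary, Filename} = code:get_object_code(Module),
    {module, Module} = erpc:call(Node, code, load_binary,
                                 [Module, Filename, Binary]).
```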
Helps clean up and color stdout for parallel targets.
TODO: there are obvious races between different nodes' outputs.
In the next iteration I hope to implement cursor tracking for each node.
This avoids using Mix while compiling, which simplifies
a number of things and lets us do further build improvements
later on.
Elixir is only enabled from within rabbitmq_cli currently.
Eunit is disabled since there are only Elixir tests.
Dialyzer will force-enable Elixir in order to process
Elixir-compiled beam files.
This commit also includes a few changes that are
related:
* The Erlang distribution will now be started for parallel-ct
* Many unnecessary PROJECT_MOD lines have been removed
* `eunit_formatters` has been removed, it provides little value
* The new `maybe_flock` Erlang.mk function is used where possible
* Build test deps when testing rabbitmq_cli (Mix won't do it anymore)
* rabbitmq_ct_helpers now uses the early plugins to have Dialyzer
properly set up
Skip clustering tests when multiple Khepri Ra machine versions
are being used at the same time.
[Why]
Depending on which node clusters with which, a node running an older
version of the Khepri Ra machine may not be able to apply Ra commands
and could be stuck.
There is no real solution and this is clearly an unsupported scenario. An
old node won't always be able to join a newer cluster.
[How]
In the testsuites, we skip clustering tests if we detect that multiple
Khepri Ra machine versions are being used.
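A minimal sketch of that skip logic, assuming a hypothetical
`khepri_machine_version/1` helper that returns the effective machine
version on a node:
```
maybe_skip_mixed_khepri_versions(Config, Nodes) ->
    %% khepri_machine_version/1 is hypothetical; it stands for whatever
    %% call the testsuites use to read the machine version on a node.
    Versions = [khepri_machine_version(Node) || Node <- Nodes],
    case lists:usort(Versions) of
        [_Single] -> Config;
        _Mixed    -> {skip, "Multiple Khepri Ra machine versions in use"}
    end.
```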
[Why]
The `force_reset` command simply removes local files on disk for the
local node.
In the case of Ra, this can't work because the rest of the cluster does
not know about the forced-reset node. Therefore the leader will continue
to send `append_entry` commands to the reset node.
If that forced-reset node restarts and receives these messages, it will
either join the cluster again (because it's on an older Raft term) or it
will hit an assertion and exit (because it's on the same Raft term).
[How]
Given we can't really support this scenario and it has little value, the
command will now return an error if someone attempts a `force_reset` with
a node running Khepri.
This also deprecates the command: once Mnesia support is removed, the
command will be removed at the same time. This is noted in the
rabbitmqctl.8 manpage.
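A hedged sketch of the new guard, assuming a predicate like
`rabbit_khepri:is_enabled/0` (the exact check the command performs may
differ):
```
force_reset() ->
    case rabbit_khepri:is_enabled() of
        true  -> {error, not_supported_with_khepri};
        false -> do_force_reset() %% the pre-existing Mnesia reset path
    end.
```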
[Why]
When running mixed-version tests, nodes 1/3/5/... are using the primary
umbrella, so usually the newest version. Nodes 2/4/6/... are using the
secondary umbrella, thus the old version.
When clustering, we used to use node 1 (running a new version) as the
seed node, meaning other nodes would join it.
This complicates things with feature flags because we have to make sure
that we start node 1 with new stable feature flags disabled to allow old
nodes to join.
This is also a problem with Khepri machine versions because the cluster
would start with the latest version, which old nodes might not have.
[How]
This patch changes the logic to use a node running the secondary
umbrella as the seed node instead. If there is no node running it, we
pick the first node as before.
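A minimal sketch of the new selection logic, assuming a hypothetical
`uses_secondary_umbrella/1` predicate:
```
pick_seed_node(Nodes) ->
    case [N || N <- Nodes, uses_secondary_umbrella(N)] of
        [Seed | _] -> Seed;     %% prefer a node running the old version
        []         -> hd(Nodes) %% fall back to the first node, as before
    end.
```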
V2: Revert part of "rabbitmq_ct_helpers: Fix how we set
`$RABBITMQ_FEATURE_FLAGS` in tests" (commit
57ed962ef6). These changes are no
longer needed with the new logic.
V3: The check that verifies that the correct metadata store is used has
a special case for nodes that use the secondary umbrella: if Khepri
is supposed to be used but it's not, the feature flag is enabled.
The reason is that the `v4.0.x` branch doesn't know about the `rel`
configuration of `forced_feature_flags_on_init`. The nodes will
have ignored this parameter and booted with the stable feature
flags only.
Many testsuites are adapted to the new clustering order. If they
manage which node joins which node, either the order is changed in
the testcases, or nodes are started with only required feature
flags. For testsuites that rely on peer discovery where the order is
unknown, nodes are started with only required feature flags.
Exits with reason "killed" only occur "naturally" in OTP
when a supervisor tries to shut a child down and the shutdown times out.
However, this exit reason is used quite frequently for failure simulation in tests.
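For reference, `exit/2` with reason `kill` is untrappable and is observed
by monitors as `killed`, which is how tests typically simulate it:
```
simulate_kill(Pid) ->
    Ref = erlang:monitor(process, Pid),
    exit(Pid, kill),
    receive
        {'DOWN', Ref, process, Pid, killed} -> ok
    end.
```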
[Why]
In order to make `khepri_db` the default in the future, the handling of
`$RABBITMQ_FEATURE_FLAGS` had to be adapted to be able to *disable*
Khepri instead.
Unfortunately I broke the behavior with stable feature flags that are
only available in the primary umbrella. In this case, they were
automatically enabled and thus, clustering with an old umbrella that did
not have these feature flags failed with `incompatible_feature_flags`.
[How]
The solution is to always use an absolute list of feature flags, not the
new relative list.
V2: Allow a testsuite to skip the configuration of the metadata store.
This is needed for the feature_flags_SUITE testsuite because it
tests the default behavior and the configuration of the metadata
store changes that behavior.
While here, fix a ct log message where variables were swapped
compared to the format string expectation.
V3: Enable `rabbitmq_4.0.0` feature flag in rabbit_mgmt_http_SUITE. This
testsuite apparently requires it and if it's not enabled, it fails.
Accidental "fat finger" virtual deletion accidents
would be easier to avoid if there was a protection mechanism
that would apply equally even to CLI tools and external
applications that do not use confirmations for deletion
operations.
This introduces the following changes:
* Virtual host metadata now supports a new key,
'protected_from_deletion', which, when set,
will be considered by key virtual host deletion function(s)
* DELETE /api/vhosts/{name} was adapted to respond to
such blocked deletion attempts with
a 412 Precondition Failed status
* 'rabbitmqctl list_vhosts' and 'rabbitmqctl delete_vhost'
were adapted accordingly
* DELETE /api/vhosts/{name}/deletion/protection
is a new endpoint that can be used to remove
the protective seal (the metadata key)
* POST /api/vhosts/{name}/deletion/protection
marks the virtual host as protected
In the case of the HTTP API, all operations on
virtual host metadata require administrative
privileges from the target user.
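A hedged sketch of the deletion guard, assuming the metadata key from this
change and a `vhost:get_metadata/1`-style accessor (actual names and key
types may differ):
```
is_protected_from_deletion(VHost) ->
    Metadata = vhost:get_metadata(VHost),
    maps:get(protected_from_deletion, Metadata, false) =:= true.
```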
Other considerations:
* When a virtual host does not exist, the behavior
remains the same: the original, protection-unaware
code path is used to preserve backwards compatibility
References #12772.
[Why]
Once `khepri_db` is enabled by default, we need another way to disable it
to select Mnesia instead.
[How]
We use the new relative forced feature flags mechanism to indicate if we
want to explicitly enable or disable `khepri_db`. This way, we don't
touch other stable feature flags and only mess with Khepri.
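For illustration, a hedged sketch contrasting the two shapes as application
environment terms; the `{rel, ToEnable, ToDisable}` form is assumed from
the `rel` term mentioned in these commits:
```
%% Absolute form: exactly this set of feature flags is forced on boot.
{forced_feature_flags_on_init, [quorum_queue, stream_queue]}.

%% Assumed relative form: leave other flags alone and only force
%% khepri_db on; listing it in the second list would force it off.
{forced_feature_flags_on_init, {rel, [khepri_db], []}}.
```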
However, this mechanism is not supported by RabbitMQ 4.0.x and older.
They will ignore the setting. Therefore, to make this work in
mixed-version testing, we set the `$RABBITMQ_FEATURE_FLAGS` variable for
the secondary umbrella. This part will go away once we test against
RabbitMQ 4.1.x as the secondary umbrella in the future.
Finally, we compare the effective metadata store to the expected one.
If they don't match, we skip the test.
While here, change `rjms_topic_selector_SUITE` to only choose Khepri
without specifying any feature flags.
Parallel/sharding groups often fail to create certificates in CI.
Most likely this is related to the fact that they use the same directory
for certificates. This commit uses the shard/node name and a unique id
for each SSL certificate.
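A minimal sketch of deriving such a unique path, assuming
`erlang:unique_integer/1` as the id source:
```
cert_dir(BaseDir) ->
    UniqueId = erlang:unique_integer([positive]),
    Name = io_lib:format("~s-~b", [node(), UniqueId]),
    filename:join(BaseDir, lists:flatten(Name)).
```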
The problem comes from `ct_master` which doesn't tell us
in the return value whether the tests succeeded. In order
to get that information a CT hook was created. But then
we run into another problem: despite its documentation
claiming otherwise, `ct_master` does not handle `ct_hooks`
instructions in the test spec.
So for the time being we fork `ct_master` into a new
`ct_master_fork` module and insert our hook directly
in the code. Later on we will submit patches to OTP.
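A minimal sketch of such a result-collecting hook, under a hypothetical
module name; the fork wires the equivalent directly into `ct_master_fork`:
```
-module(ct_results_hook).
-export([init/2, on_tc_fail/4, terminate/1]).

%% Count failed test cases in the hook state.
init(_Id, _Opts) ->
    {ok, #{failed => 0}}.

on_tc_fail(_Suite, _TestName, _Reason, State) ->
    maps:update_with(failed, fun(N) -> N + 1 end, State).

terminate(#{failed := Failed}) ->
    %% Hypothetical hand-off: expose the count to the calling process.
    persistent_term:put(ct_results_hook_failed, Failed),
    ok.
```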
Reverting back to the default of 1 minute. The problem with
3 minutes is that it is exceedingly long, and when there
are problems the overall test time increases dramatically.