New CLI command to trigger a rebalancing in a SAC group and activate a
consumer. This is a last-resort solution if all consumers in a group
accidentally end up in the {connected, waiting} state.
The command re-uses an existing function, which only picks the consumer
that should be active. This means it does not try to "fix" the state
(e.g. removing a disconnected consumer because its node is definitely
gone from the cluster).
Fixes #14055
Calls to the stream SAC coordinator can fail for various reasons
(e.g. a timeout because of a network partition). The stream reader does not
take into account what the SAC coordinator returns and moves on even
in case of errors. This can lead to inconsistent state for SAC groups.
This commit changes this behavior by handling unexpected errors from the
SAC coordinator and closing the connection. The client is expected to
reconnect. This is safer than risking inconsistent state.
Fixes #14040
The clean-up of a stream connection state when a stream member goes down can
remove subscriptions not affected by the member. The subscription state is
removed from the connection, but the subscription is not removed from
the SAC state (if the subscription is a SAC), because the subscription member
PID does not match the down member PID.
When the actual member of the subscription goes down, the subscription is no
longer part of the state, so the clean-up does not find the subscription
and does not remove it from the SAC state. This lets a ghost consumer in
the corresponding SAC group.
This commit makes sure only the affected subscriptions are removed from
the state when a stream member goes down.
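The idea can be illustrated with a minimal sketch. The data shape here is an assumption made for illustration (a map from subscription ID to the PID of the stream member it reads from), and the module and function names are hypothetical; the actual connection state is more involved.

```erlang
-module(sub_cleanup_sketch).
-export([partition_by_member/2]).

%% Hypothetical shape: Subscriptions maps a subscription ID to the PID of
%% the stream member it reads from. This is NOT the real connection state.
-spec partition_by_member(pid(), #{term() => pid()}) ->
    {Removed :: #{term() => pid()}, Kept :: #{term() => pid()}}.
partition_by_member(DownPid, Subscriptions) ->
    %% Only subscriptions whose member PID is the one that went down
    %% are selected for removal.
    Removed = maps:filter(fun(_Id, Pid) -> Pid =:= DownPid end,
                          Subscriptions),
    %% Every other subscription stays in the state, so a later down event
    %% for its own member can still find it and clean up the SAC state.
    Kept = maps:without(maps:keys(Removed), Subscriptions),
    {Removed, Kept}.
```

Returning the removed subscriptions separately lets the caller also unregister them from the SAC state, which is the step that was being missed.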
Fixes #13961
A boolean status in the stream SAC coordinator is not enough to follow
the evolution of a consumer. For example a former active consumer that
is stepping down can go down before another consumer in the group is
activated, letting the coordinator expect an activation request that
will never arrive, leaving the group without any active consumer.
This commit introduces 3 statuses: active (formerly "true"), waiting
(formerly "false"), and deactivating. The coordinator will now know when
a deactivating consumer goes down and will trigger a rebalancing to
avoid a stuck group.
This commit also introduces a status related to the connectivity state
of a consumer. The possible values are: connected, disconnected, and
presumed_down. Consumers are connected by default. They become
disconnected if the coordinator receives a down event with a
noconnection reason, meaning the node of the consumer has been
disconnected from the other nodes. Consumers can become connected again when
their node joins the other nodes again.
Disconnected consumers are still considered part of a group, as they are
expected to come back at some point. For example there is no rebalancing
in a group if the active consumer got disconnected.
The coordinator sets a timer when a disconnection occurs. When the timer
expires, corresponding disconnected consumers pass into the "presumed
down" state. At this point they are no longer considered part of their
respective group and are excluded from rebalancing decisions. They are expected
to get removed from the group by the appropriate down event of a
monitor.
So the consumer status is now a tuple, e.g. {connected, active}. Note
this is an implementation detail: only the stream SAC coordinator deals with
the status of stream SAC consumers.
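The transitions described above can be sketched as a small function over the status tuple. The event names, the module, and both function names are hypothetical and chosen for illustration; the sketch only models the connectivity transitions spelled out in this message, not the coordinator's actual code.

```erlang
-module(sac_status_sketch).
-export([on_event/2, counts_for_rebalancing/1]).

-type connectivity() :: connected | disconnected | presumed_down.
-type activity() :: active | waiting | deactivating.
-type status() :: {connectivity(), activity()}.

%% Event names are assumptions, not the coordinator's real commands.
-spec on_event(atom(), status()) -> status().
%% A down event with a noconnection reason: the consumer's node is
%% unreachable, but the consumer stays in its group.
on_event(node_disconnected, {connected, Act}) -> {disconnected, Act};
%% The node joined the other nodes again.
on_event(node_reconnected, {disconnected, Act}) -> {connected, Act};
%% The disconnection timer expired.
on_event(disconnected_timeout, {disconnected, Act}) -> {presumed_down, Act};
%% Anything else leaves the status unchanged in this sketch.
on_event(_Event, Status) -> Status.

%% Presumed-down consumers are excluded from rebalancing decisions;
%% disconnected ones are still considered part of their group.
-spec counts_for_rebalancing(status()) -> boolean().
counts_for_rebalancing({presumed_down, _}) -> false;
counts_for_rebalancing({_, _}) -> true.
```

Keeping connectivity and activity as two independent components means a disconnected active consumer stays `{disconnected, active}`, which matches the rule that a disconnection alone does not trigger a rebalancing.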
2 new configuration entries are introduced:
* rabbit.stream_sac_disconnected_timeout: this is the duration in ms of the
disconnected-to-forgotten timer.
* rabbit.stream_cmd_timeout: this is the timeout in ms to apply RA commands
in the coordinator. It used to be a fixed value of 30 seconds. The
default value is still the same. The setting has been introduced to
make integration tests faster.
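As an illustration, the two entries could be set in `advanced.config` along these lines. This assumes that `rabbit.<key>` denotes an application environment key of the `rabbit` application. The value for the disconnected timeout is purely illustrative, since its default is not stated here; 30000 is the stated 30-second default expressed in milliseconds.

```erlang
%% advanced.config (sketch)
[
 {rabbit, [
   %% Illustrative value only: the default is not given in this message.
   {stream_sac_disconnected_timeout, 60000},
   %% 30 seconds in ms, i.e. the previous fixed value and current default.
   {stream_cmd_timeout, 30000}
 ]}
].
```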
Fixes #14070
This is similar to https://github.com/rabbitmq/rabbitmq-server/pull/14056.
The performance benefit is probably negligible though since this is
called only after each batch of Ra commands.
Nevertheless, it's unnecessary to allocate a list with 3 elements and
therefore 6 words on the heap, so let's optimise it.
A stream may not have a leader temporarily for several reasons, e.g.
after it has been restarted. The stream manager may return undefined in
this case. Some client code may crash because it expects a PID or an
error, but not undefined.
This commit makes sure the leader PID is an actual Erlang PID and
returns {error, not_available} if it is not.
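A minimal sketch of such a check follows. The module name, the function name, and the accepted input shapes are assumptions for illustration; only the `is_pid/1` guard and the `{error, not_available}` result come from the description above.

```erlang
-module(leader_lookup_sketch).
-export([normalize_leader/1]).

%% Turns whatever a (hypothetical) leader lookup returned into either
%% {ok, Pid} with a real Erlang PID, or an error tuple.
-spec normalize_leader(term()) -> {ok, pid()} | {error, term()}.
normalize_leader(Pid) when is_pid(Pid) -> {ok, Pid};
normalize_leader({ok, Pid}) when is_pid(Pid) -> {ok, Pid};
%% Existing errors are passed through untouched.
normalize_leader({error, _} = Err) -> Err;
%% undefined, {ok, undefined}, or anything else: no leader right now.
normalize_leader(_Other) -> {error, not_available}.
```

With this shape, callers only ever have to match on a PID or an error, never on `undefined`.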
References #13962
[Why]
They make it more difficult to compile RabbitMQ on Windows. They were
probably useful at the time of the switch to a monorepository but I
don't see their need anymore.
The correct format is:
```
-export(Functions).
```
ELP detected this malformed syntax.
Interestingly, prior to this commit, the functions were still exported:
```
rabbitmq_amqp_address:module_info(exports).
[{exchange,1},
{exchange,2},
{queue,1},
{module_info,0},
{module_info,1}]
```
MQTT tests depend on a few plugins, which are just used in 1 or 2
suites each. These have caused issues in CI, triggering a bug in
rabbitmq_federation where the mirrored supervisor submits a transaction
while the cluster is being shut down. The transaction hangs and the
whole rabbitmq_mqtt job times out.
This bug has been addressed. However, it is best to start just the required
plugins in each suite.
[Why]
Links are started by the plugins but put under the `rabbit` supervision
tree. The federation plugins' supervision tree is empty unfortunately...
Links are stopped by a boot step executed by `rabbit`, as a consequence
of unregistering the plugins' parameters.
Unfortunately, links can be terminated if the channel, and implicitly
the connection stops. This happens when the `amqp_client` application
stops.
We end up with a race here:
* Because the federation plugins' supervision trees are empty and the
application stop functions merely stop the pg group (which doesn't
terminate the group members), nothing waits for the links to stop.
Therefore, `rabbit` can stop `amqp_client`, which is a dependency of
the federation plugins, and the links' underlying channels and
connections are stopped.
* `rabbit` unregisters the federation parameters, terminating the links.
The exchange links' `terminate/2` function needs the channel to delete
the remote queue. But the channel and the underlying connection might
be gone.
This simply logs a `badmatch` exception:
[error] <0.884.0> Federation link could not create a disposable (one-off) channel due to an error error: {badmatch,
[error] <0.884.0> {error,
[error] <0.884.0> {noproc,
[error] <0.884.0> {gen_server,
[error] <0.884.0> call,
[error] <0.884.0> [<0.911.0>,
[error] <0.884.0> {command,
[error] <0.884.0> {open_channel,
[error] <0.884.0> none,
[error] <0.884.0> {amqp_selective_consumer,
[error] <0.884.0> []}}},
[error] <0.884.0> 130000]}}}}
[How]
The solution is to make sure links are stopped as part of the stop of
the plugins.
`rabbit_federation_pg:stop_scope/1` is expanded to stop all members of
all groups in this scope, before terminating the pg scope itself. The
new code waits for the stopped processes to exit.
We have to handle the `EXIT` signal in the link processes and change
their restart strategy in their parent supervisor from permanent to
transient. This ensures they are restarted only if they crash. This also
skips an error log message about each stopped link.
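The "stop all members, then wait for them to exit" step can be sketched as follows. This is not the code of `rabbit_federation_pg:stop_scope/1`: the module and function names are hypothetical, and the use of local members, the `shutdown` exit reason, and the per-member timeout are assumptions. Only `pg:which_groups/1`, `pg:get_local_members/2`, monitors, and `exit/2` are standard OTP calls.

```erlang
-module(pg_stop_sketch).
-export([stop_members/2]).

%% Asks every local member of every group in Scope to stop, then waits
%% for each of them to exit. Timeout applies per member, in ms.
-spec stop_members(atom(), timeout()) -> ok | {error, {timeout, pid()}}.
stop_members(Scope, Timeout) ->
    Members = lists:usort(
                lists:append([pg:get_local_members(Scope, Group)
                              || Group <- pg:which_groups(Scope)])),
    Refs = [begin
                %% Monitor first so the 'DOWN' message cannot be missed.
                MRef = erlang:monitor(process, Pid),
                %% A process trapping exits receives this as an 'EXIT'
                %% message and can terminate cleanly.
                exit(Pid, shutdown),
                {MRef, Pid}
            end || Pid <- Members],
    wait_for_exits(Refs, Timeout).

wait_for_exits([], _Timeout) ->
    ok;
wait_for_exits([{MRef, Pid} | Rest] = Pending, Timeout) ->
    receive
        {'DOWN', MRef, process, Pid, _Reason} ->
            wait_for_exits(Rest, Timeout)
    after Timeout ->
        %% Give up: drop the remaining monitors and report the member
        %% that did not exit in time.
        [erlang:demonitor(R, [flush]) || {R, _} <- Pending],
        {error, {timeout, Pid}}
    end.
```

Terminating the pg scope itself would come after this call returns `ok`, so that nothing the links depend on is stopped while they are still running.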
## What?
PR #13971 added a property test that applies the same quorum queue Raft
command on different quorum queue members on different Erlang nodes
ensuring that the state machine ends up in exactly the same state.
The different Erlang nodes run the **same** Erlang/OTP version however.
This commit adds another property test where the different Erlang nodes
run **different** Erlang/OTP versions.
## Why?
This test allows spotting any non-determinism that could occur when
running quorum queue members in a mixed version cluster, where mixed
version means in our context different Erlang/OTP versions.
## How?
CI currently runs tests with Erlang 27.
This commit starts an Erlang 26 node in Docker, specifically for the
`rabbit_fifo_prop_SUITE`.
Test case `two_nodes_different_otp_version` running Erlang 27 then transfers
a few Erlang modules (e.g. module `rabbit_fifo`) to the Erlang 26 node.
The test case then runs the Ra commands on its own node in Erlang 27 and
on the Erlang 26 node in Docker.
By default, this test case is skipped locally.
However, to run this test case locally, simply start an Erlang node as
follows:
```
erl -sname rabbit_fifo_prop@localhost
```