The test creates network partitions and checks how the stream SAC
coordinator deals with them. It can be flaky on CI; the log statements
should help diagnose the flakiness.
On Windows the file may be in "DELETE PENDING" state following
its deletion (when the last message was acked). A subsequent
message leads us to write to that file again, but the write fails
with {error,eacces}. In that case we wait 10ms and retry
up to 3 times.
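A minimal Erlang sketch of that retry loop, using file:write_file/2 as a
stand-in for the actual write call in the commit:
```erlang
%% Sketch only: file:write_file/2 stands in for the real write call.
-module(win_write_retry).
-export([write_with_retry/2]).

write_with_retry(File, Data) ->
    write_with_retry(File, Data, 3).

write_with_retry(File, Data, Retries) ->
    case file:write_file(File, Data) of
        {error, eacces} when Retries > 0 ->
            %% on Windows the file may still be in "DELETE PENDING" state;
            %% wait 10ms and try again
            timer:sleep(10),
            write_with_retry(File, Data, Retries - 1);
        Result ->
            Result
    end.
```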
We've experienced lots of failures in CI:
```
GEN test/system_SUITE_data/apache-activemq-5.18.3-bin.tar.gz
make: *** [Makefile:65: test/system_SUITE_data/apache-activemq-5.18.3-bin.tar.gz] Error 28
make: Leaving directory '/home/runner/work/rabbitmq-server/rabbitmq-server/deps/amqp10_client'
Error: Process completed with exit code 2.
```
Bumping to the latest ActiveMQ Classic version may or may not help with
these failures.
Either way, we want to test against the latest ActiveMQ version. Version
5.18.3 reached end-of-life and is no longer maintained.
This commit handles edge cases in the stream SAC coordinator to make
sure it does not crash during execution. Most of these edge cases
consist of an inconsistent state, so they are very unlikely to happen.
This commit also makes sure there are no duplicates in the consumer list
of a group. Consumers are now identified only by their connection
PID and their subscription ID, because the timestamp they now carry in
their state no longer allows a field-by-field comparison.
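For illustration, consumer identity then reduces to the connection PID plus
subscription ID pair; a sketch with assumed record fields (they do not
mirror the actual coordinator state):
```erlang
-module(sac_identity_sketch).
-export([same_consumer/2]).

%% record fields assumed for illustration
-record(consumer, {pid, subscription_id, status, ts}).

%% Two consumers are the same iff their connection PID and subscription
%% ID match; the timestamp is deliberately ignored.
same_consumer(#consumer{pid = Pid, subscription_id = SubId},
              #consumer{pid = Pid, subscription_id = SubId}) ->
    true;
same_consumer(_, _) ->
    false.
```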
New CLI command to trigger a rebalancing in a SAC group and activate a
consumer. This is a last-resort solution if all consumers in a group
accidentally end up in {connected, waiting} state.
The command re-uses an existing function, which only picks the consumer
that should be active. This means it does not try to "fix" the state
(e.g. removing a disconnected consumer because its node is definitely
gone from the cluster).
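A hypothetical sketch of that "pick only" behavior (record and function
names assumed, not the actual coordinator code):
```erlang
-module(sac_pick_sketch).
-export([pick_active_consumer/1]).

-record(consumer, {pid, subscription_id, status}).

%% Pick the first connected consumer; entries that are not connected
%% are skipped but deliberately not removed from the group.
pick_active_consumer([#consumer{status = {connected, _}} = C | _]) ->
    {value, C};
pick_active_consumer([_ | Rest]) ->
    pick_active_consumer(Rest);
pick_active_consumer([]) ->
    undefined.
```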
Fixes #14055
Calls to the stream SAC coordinator can fail for various reasons
(e.g. a timeout because of a network partition). The stream reader does not
take into account what the SAC coordinator returns and moves on even
in case of errors. This can lead to inconsistent state for SAC groups.
This commit changes this behavior by handling unexpected errors from the
SAC coordinator and closing the connection. The client is expected to
reconnect. This is safer than risking inconsistent state.
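A sketch of the new behavior, with assumed function names and return
conventions (not the actual stream reader code):
```erlang
-module(sac_error_sketch).
-export([handle_sac_result/2]).

handle_sac_result(ok, State) ->
    {ok, State};
handle_sac_result({error, Reason}, State) ->
    logger:warning("SAC coordinator call failed: ~tp, closing connection",
                   [Reason]),
    %% the client is expected to reconnect; safer than carrying on
    %% with a potentially inconsistent SAC group state
    {stop, {sac_error, Reason}, State}.
```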
Fixes #14040
The clean-up of a stream connection state when a stream member goes down can
remove subscriptions that are not affected by that member: the subscription
state is removed from the connection, but the subscription is not removed from
the SAC state (if the subscription is a SAC), because the subscription member
PID does not match the down member PID.
When the actual member of the subscription goes down later, the subscription
is no longer part of the connection state, so the clean-up does not find it
and does not remove it from the SAC state. This leaves a ghost consumer in
the corresponding SAC group.
This commit makes sure only the affected subscriptions are removed from
the state when a stream member goes down.
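Conceptually, the fix filters by the down member's PID before cleaning up; a
sketch with assumed record fields:
```erlang
-module(sac_cleanup_sketch).
-export([affected_subscriptions/2]).

%% record fields assumed for illustration
-record(subscription, {id, member_pid}).

%% Only subscriptions whose stream member PID matches the down member
%% PID are affected; the others must stay in the connection state.
affected_subscriptions(DownPid, Subscriptions) ->
    [S || #subscription{member_pid = Pid} = S <- Subscriptions,
          Pid =:= DownPid].
```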
Fixes #13961
A boolean status in the stream SAC coordinator is not enough to follow
the evolution of a consumer. For example, a former active consumer that
is stepping down can go down before another consumer in the group is
activated, leaving the coordinator waiting for an activation request that
will never arrive and the group without any active consumer.
This commit introduces three statuses: active (formerly "true"), waiting
(formerly "false"), and deactivating. The coordinator will now know when
a deactivating consumer goes down and will trigger a rebalancing to
avoid a stuck group.
This commit also introduces a status related to the connectivity state
of a consumer. The possible values are connected, disconnected, and
presumed_down. Consumers are connected by default; they become
disconnected if the coordinator receives a down event with a
noconnection reason, meaning the consumer's node has been
disconnected from the other nodes. Consumers become connected again
when their node rejoins the other nodes.
Disconnected consumers are still considered part of a group, as they are
expected to come back at some point. For example, there is no rebalancing
in a group if the active consumer gets disconnected.
The coordinator sets a timer when a disconnection occurs. When the timer
expires, the corresponding disconnected consumers move to the "presumed
down" state. At this point they are no longer considered part of their
respective group and are excluded from rebalancing decisions. They are
expected to be removed from the group by the appropriate down event of a
monitor.
So the consumer status is now a tuple, e.g. {connected, active}. Note
this is an implementation detail: only the stream SAC coordinator deals with
the status of stream SAC consumers.
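The resulting status space can be written as an Erlang type sketch (module
and type names assumed; the actual coordinator code may differ):
```erlang
-module(sac_status_sketch).
-export_type([consumer_status/0]).

-type connectivity() :: connected | disconnected | presumed_down.
-type activity() :: active | waiting | deactivating.

%% e.g. {connected, active} for the consumer holding the activation,
%% {connected, waiting} for a standby consumer
-type consumer_status() :: {connectivity(), activity()}.
```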
Two new configuration entries are introduced:
* rabbit.stream_sac_disconnected_timeout: the duration in ms of the
disconnected-to-presumed-down timer.
* rabbit.stream_cmd_timeout: the timeout in ms to apply Ra commands
in the coordinator. It used to be a fixed value of 30 seconds; the
default value is still the same. The setting has been introduced to
make integration tests faster.
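For illustration, since both keys live under the rabbit application
environment, they could be set via advanced.config (the values below are
arbitrary examples, not the documented defaults):
```erlang
%% advanced.config sketch; values are arbitrary examples
[
 {rabbit, [
   %% disconnected-to-presumed-down timer, in milliseconds
   {stream_sac_disconnected_timeout, 60000},
   %% timeout to apply Ra commands in the coordinator, in milliseconds
   {stream_cmd_timeout, 30000}
 ]}
].
```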
Fixes #14070