This ensures that quorum_queues shuts down _before_
coordination, which Khepri runs inside.
Quorum queues depend on Khepri, so they need to be shut down first.
[Why]
The feature flag enable function is called during the initial migration
or when a node is later added to a cluster.
In this latter situation, the cluster is already formed and the Mnesia
tables were already migrated. Syncing the cluster in this specific
situation might kick out another node that is currently unreachable.
[How]
If the node running the enable function is already clustered, we skip
the cluster sync.
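A minimal sketch of that guard, assuming hypothetical helper names (`is_already_clustered/0`, `sync_cluster/0`) rather than the actual migration code:
```
-module(khepri_enable_sketch).
-export([enable/1]).

%% Sketch only: the helper names are hypothetical stand-ins for the real
%% migration code.
enable(_FeatureFlagArgs) ->
    case is_already_clustered() of
        true ->
            %% The cluster is already formed and the Mnesia tables were
            %% already migrated elsewhere; syncing again could kick out a
            %% currently unreachable peer, so skip it.
            ok;
        false ->
            sync_cluster()
    end.

is_already_clustered() ->
    %% Hypothetical check: any visible peer means this node joined an
    %% existing cluster.
    nodes() =/= [].

sync_cluster() ->
    %% Placeholder for the initial-migration cluster sync.
    ok.
```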
[Why]
All callers of `khepri_adv` and `khepri_tx_adv` need updates to handle
the now uniform return type of `khepri:node_props_map()` in Khepri
0.17.0.
[How]
We don't need any compatibility code to handle "either the old return
type or the new return type" from the khepri_adv API because the
translation is done entirely in the "client side" code in Khepri -
meaning that the return value from the Ra server is the same but it is
translated differently by the functions in `khepri_adv`.
However, we need to adapt transaction functions because they may be
executed on different versions of Khepri and the behaviour of
`khepri_tx_adv` can differ. To account for the possible change in return
value format, we use the new `khepri_tx:does_api_comply_with/1` to know
what to expect.
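A sketch of how a transaction function can branch on the API behaviour. The `uniform_write_ret` tag passed to `khepri_tx:does_api_comply_with/1` is an assumption here, and the use of `khepri_tx_adv:delete/1` is purely illustrative:
```
%% Sketch only, meant to run inside khepri:transaction/2. The
%% `uniform_write_ret` tag is an assumption; check the Khepri 0.17.0
%% documentation for the exact behaviour name.
delete_and_get_props(Path) ->
    case khepri_tx:does_api_comply_with(uniform_write_ret) of
        true ->
            %% Khepri 0.17.0+: khepri_tx_adv returns a
            %% khepri:node_props_map() keyed by path.
            {ok, NodePropsMap} = khepri_tx_adv:delete(Path),
            maps:get(Path, NodePropsMap, #{});
        false ->
            %% Older Khepri: the props of the affected node are returned
            %% directly.
            {ok, NodeProps} = khepri_tx_adv:delete(Path),
            NodeProps
    end.
```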
[Why]
In Khepri 0.17.0, `khepri_cluster:locally_known_members/1` and
`khepri_cluster:locally_known_nodes/1` were replaced with
`khepri_cluster:members/2` and `khepri_cluster:nodes/2` with `favor` set
to `low_latency` - this matches the interface for queries in Khepri.
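For reference, the equivalent calls look roughly like this; the shape of the options map (mirroring the query API) is assumed:
```
%% Khepri < 0.17.0:
%%   khepri_cluster:locally_known_members(StoreId),
%%   khepri_cluster:locally_known_nodes(StoreId).
%% Khepri 0.17.0 equivalents, favoring the locally known (low latency) view:
members(StoreId) ->
    khepri_cluster:members(StoreId, #{favor => low_latency}).

nodes(StoreId) ->
    khepri_cluster:nodes(StoreId, #{favor => low_latency}).
```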
Move leader repair earlier in the tick function to ensure a more
timely update of the metadata store record after a leader change.
Also use the RPC_TIMEOUT macro for metric/stats multicalls to improve
liveness when a node is connected but partitioned or frozen.
This should address crashes like this one (found in a user's logs):
```
exception error: no case clause matching
[[{connection_details,[]},
{name,<<"10.0.13.41:50497 -> 10.2.230.128:5671 (1)">>},
{node,rabbit@foobar},
{number,1},
{user,<<"...">>},
{user_who_performed_action,<<"...">>},
{vhost,<<"/">>}],
[{connection_details,[]},
{name,<<"10.0.13.41:50142 -> 10.2.230.128:5671 (1)">>},
{node,rabbit@foobar},
{number,1},
{user,<<"...">>},
{user_who_performed_action,<<"...">>},
{vhost,<<"/">>}]]
in function rabbit_federation_mgmt:format/3 (rabbit_federation_mgmt.erl, line 100)
in call from rabbit_federation_mgmt:'-status/3-lc$^0/1-0-'/4 (rabbit_federation_mgmt.erl, line 89)
in call from rabbit_federation_mgmt:'-status/4-lc$^0/1-0-'/3 (rabbit_federation_mgmt.erl, line 82)
in call from rabbit_federation_mgmt:'-status/4-lc$^0/1-0-'/3 (rabbit_federation_mgmt.erl, line 82)
in call from rabbit_federation_mgmt:status/4 (rabbit_federation_mgmt.erl, line 82)
in call from rabbit_federation_mgmt:to_json/2 (rabbit_federation_mgmt.erl, line 57)
in call from cowboy_rest:call/3 (src/cowboy_rest.erl, line 1590)
in call from cowboy_rest:set_resp_body/2 (src/cowboy_rest.erl, line 1473)
```
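The multicall change amounts to passing a bounded timeout instead of blocking indefinitely. A rough sketch, where the timeout value and the called function are placeholders rather than the actual stats call:
```
%% Sketch: bound the stats/metric multicall so that a node which is
%% connected but partitioned or frozen cannot stall the caller.
-define(RPC_TIMEOUT, 15000). %% placeholder value

collect_node_memory(Nodes) ->
    %% erpc:multicall/5 returns one result per node, in order, and gives up
    %% on unresponsive nodes after ?RPC_TIMEOUT milliseconds.
    erpc:multicall(Nodes, erlang, memory, [], ?RPC_TIMEOUT).
```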
## What?
This commit determines the queue topology without checking the queue type.
## Why?
This way, determining the leader and replicas works the same across all
queue types, without the need to introduce additional rabbit_queue_type
behaviour callbacks as suggested in other PRs.
## How?
pid is the leader; the nodes in queue_type_states are the members/replicas.
This commit results in an unknown stream leader during queue
declaration. However, the correct leader will eventually be returned when
calling GET on the stream.
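A sketch of the idea (not the actual rabbit code): derive the topology from the generic queue record, treating the node of the pid as the leader and the `nodes` entry of the queue type state as the replicas. The shape of the type state and the `{Name, Node}` pid form for Ra-based queues are assumptions here:
```
queue_topology(Q) ->
    %% Sketch only: assumes amqqueue getters and a `nodes` entry in the
    %% queue type state; not the actual rabbit implementation.
    Pid = amqqueue:get_pid(Q),
    Leader = case Pid of
                 P when is_pid(P) -> node(P);
                 {_RaName, Node} -> Node   %% Ra-based queues use {Name, Node}
             end,
    Replicas = case amqqueue:get_type_state(Q) of
                   #{nodes := Nodes} -> Nodes;
                   _ -> [Leader]
               end,
    {Leader, Replicas}.
```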
This allows restricting access to the /api/index.html and
/cli/index.html pages to authenticated users, should the
user really want to. This can be enabled via advanced.config.
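If needed, an advanced.config entry along these lines would enable it; the key name below is only an assumption for illustration and may differ from the documented setting:
```
%% advanced.config (illustration only; the key name is an assumption)
[
 {rabbitmq_management, [
   %% Require a logged-in user to view /api/index.html and /cli/index.html.
   {require_auth_for_api_reference, true}
 ]}
].
```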
A connection that terminated before it was fully established
would lead to a function_clause error, since the metadata needed to
call notify_connection_closed is not available. We can simply ignore
such connections and not notify about them.
Resolves https://github.com/rabbitmq/rabbitmq-server/discussions/13670
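A minimal sketch of the fix pattern, assuming a hypothetical state record where the metadata is only populated once the connection is fully established; all names are placeholders:
```
%% Sketch only: record fields and function names are placeholders for the
%% real reader/connection state.
-record(state, {established = false :: boolean(),
                metadata :: undefined | map()}).

maybe_notify_closed(#state{established = false}) ->
    %% The connection never finished its handshake, so the metadata needed
    %% by the notification is missing; ignore it instead of crashing with a
    %% function_clause.
    ok;
maybe_notify_closed(#state{metadata = Metadata}) ->
    notify_connection_closed(Metadata).

notify_connection_closed(_Metadata) ->
    %% Placeholder for the real event emission.
    ok.
```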
The same connection can contain several consumers belonging to a SAC
group (group key = vhost + stream + consumer name). After the consumers
of the down connection are removed from a group, the whole group must be
re-evaluated to select a new active consumer.
The previous behavior would not re-evaluate the group and could
select a consumer from the down connection, leaving the group with only
inactive consumers, as the selected active consumer would never receive
the activation message from the stream SAC coordinator.
This commit fixes this problem by removing the consumers of the down
connection from the affected groups and then performing the
appropriate operations for the groups to keep on consuming (e.g.
notifying an active consumer that it needs to step down).
References #13372
Ra improvements:
* Don't allow a non-voter to start elections.
* Register with ra directory before initialising ra server.
* Trigger tick_timeout immediately after entering leader state.
* Set a configurable segment max size.
This commit also includes a change that turns the quorum queue
become-leader callback into a noop and instead relies on
the more prompt tick_handler to handle the metadata store
update after a leader election.
This more prompt tick update means there should be a much shorter
gap between the queue metrics being deleted from the old leader
node and becoming available again on the new node, resulting
in smoother message count metrics.
Fix a test that relied on waiting on too simplistic a property
before asserting.
Prior to this commit, when a client consumed from an unavailable quorum
queue, the following crash occurred:
```
{badmatch,{error,noproc}}
[{rabbit_quorum_queue,consume,3,[{file,"rabbit_quorum_queue.erl"},{line,993}]}
```
This commit fixes this bug by returning any error to rabbit_queue_type
when registering a quorum queue consumer.
This commit also refactors the errors returned by
rabbit_queue_type:consume/3 to simplify them and ensure separation of
concerns.
For example, prior to this commit, the channel did error
formatting specifically for consuming from streams. It's better if
the channel is unaware of which queue type it consumes from and each
queue type implementation formats its own errors.
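A sketch of the resulting call-site shape; the exact error tuple returned by rabbit_queue_type:consume/3 is an assumption, the point being that an unavailable queue (e.g. an internal `{error, noproc}`) now surfaces as an error for the caller to relay instead of crashing with a badmatch:
```
%% Sketch only: return terms are assumptions, not the exact API.
handle_basic_consume(Q, Spec, QStates0) ->
    case rabbit_queue_type:consume(Q, Spec, QStates0) of
        {ok, QStates} ->
            {ok, QStates};
        {error, _Type, _Fmt, _Args} = Err ->
            %% Already formatted by the owning queue type implementation;
            %% the channel just relays it to the client.
            Err
    end.
```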
Bump the timeout for management operations and link attachments from 20s
to 30s. We've seen timeouts in CI.
We bump the poll interval of the `?awaitMatch` macro because CI
sometimes flaked by crashing in
0e803de6dd/deps/rabbitmq_amqp_client/src/rabbitmq_amqp_client.erl (L411),
which indicates that the client lib received a response from a previous
request.