This ensures that quorum_queues shuts down _before_
coordination where khepri run inside.
Quorum queues depend on khepri so need to be shut down first.
[Why]
The feature flag enable function is called during the initial migration
or when a node is later added to a cluster.
In this latter situation, the cluster is already formed and the Mnesia
tables were already migrated. Syncing the cluster in this specific
situation might kick another node that is currently unreachable.
[How]
If the node running the enable function is already clustered, we skip
the cluster sync.
[Why]
All callers of `khepri_adv` and `khepri_tx_adv` need updates to handle
the now uniform return type of `khepri:node_props_map()` in Khepri
0.17.0.
[How]
We don't need any compatibility code to handle "either the old return
type or the new return type" from the khepri_adv API because the
translation is done entirely in the "client side" code in Khepri -
meaning that the return value from the Ra server is the same but it is
translated differently by the functions in `khepri_adv`.
However, we need to adapt transaction functions because they may be
executed on different versions of Khepri and the behaviour of
`khepri_tx_adv` can be different. To take the possible change of return
value format, we use the new `khepri_tx:does_api_comply_with/1` to know
what to expect.
[Why]
In Khepri 0.17.0, `khepri_cluster:locally_known_members/1` and
`khepri_cluster:locally_known_node/1` were replaced with
`khepri_cluster:members/2` and `khepri_cluster:nodes/2` with `favor` set
to `low_latency` - this matches the interface for queries in Khepri.
Move leader repair earlier in tick function to ensure more
timely update of meta data store record after leader change.
Also use RPC_TIMEOUT macro for metric/stats multicalls to improve
liveness when a node is connected but partitioned / frozen.
## What?
This commit determines the queue topology without checking the queue type.
## Why?
This way, checking leader and replicas works the same across all queue
types without the need to introduce other rabbit_queue_type behaviour as
suggested in other PRs.
## How?
pid is the leader, nodes in queue_type_states are the members/replicas.
This commit results in an unknown stream leader during queue
declaration. However the correct leader will be returned eventually when
calling GET on the stream.
The same connection can contain several consumers belonging to a SAC
group (group key = vhost + stream + consumer name). The whole new group
must be re-evaluated to select a new active consumer after the consumers
of the down connection are removed from it.
The previous behavior would not re-evaluate the new group and could
select a consumer from the down connection, letting the group with only
inactive consumers, as the selected active consumer would never receive
the activation message from the stream SAC coordinator.
This commit fixes this problem by removing the consumers of the down
down connection from the affected groups and then performing the
appropriate operations for the groups to keep on consuming (e.g.
notifying an active consumer that it needs to step down).
References #13372
Ra improvements:
* Don't allow a non-voter to start elections
* Register with ra directory before initialising ra server.
* Trigger tick_timeout immediately after entering leader state.
* Set a configurable segment max size
This commit also includes a change to turn the quorum queue
become leader callback to become a noop and instead rely on
the more promptly tick_handler to handle the meta data store
update after a leader election.
This more prompt tick update means there should be a much shorter
gap between the queue metrics being deleted from the old leader
node to them being available again on the new node resulting
in smoother message count metrics.
Fix test that relied on waiting on too simplistic a property
before asserting.
Prior to this commit, when a client consumed from an unavailable quorum
queue, the following crash occurred:
```
{badmatch,{error,noproc}}
[{rabbit_quorum_queue,consume,3,[{file,\"rabbit_quorum_queue.erl\"},{line,993}]}
```
This commit fixes this bug by returning any error when registering a
quorum queue consumer to rabbit_queue_type.
This commit also refactors errors returned by
rabbit_queue_type:consume/3 to simplify and ensure seperation of
concerns.
For example prior to this commit, the channel did error
formatting specifically for consuming from streams. It's better if
the channel is unaware of what queue type it consumes from and have each
queue type implementation format their own errors.
Bump the timeout for management operations and link attachments from 20s
to 30s. We've seen timeouts in CI.
We bump the poll interval of the `?awaitMatch` macro because CI
sometimes flaked by crashing in
0e803de6dd/deps/rabbitmq_amqp_client/src/rabbitmq_amqp_client.erl (L411)
which indicates that the client lib received a response from a previous
request.
To take more frequent checkpoints for large message workload
Lower the min_checkpoint_interval substantially to allow quorum queues
better control over when checkpoints are taken.
Track bytes enqueued in the aux state and suggest a checkpoint after
every 64MB enqueued (this value is scaled according to backlog just
like the indexes condition).
This should help with more timely checkpointing when very large
messages is used.
Try evaluating byte size independently of time window
also increase max size
This commit fixes a bug in the Erlang AMQP 1.0 client.
Prior to this commit, to repro this bug:
1. Send more than 2^16 messages to a queue.
2. Grant more than a total of 2^16 link credit initially (on a single link
or across multiple links) on a single session without any
auto or manual link credit renewal.
The expectation is that thanks to sufficiently granted initial link-credit,
the client will receive all messages.
However, consumption stops after exactly 2^16-1 messages.
That's because the client lib was never sending a flow frame to the server.
So, after the client received all 2^16-1 messages (the initial
incoming-window set by the client), the server's remote-incoming-window
reached 0 causing the server to stop delivering messages.
The expectation is that the client lib automatically handles session
flow control without any manual involvement of the client app.
This commit implements this fix:
* We keep the server's remote-incoming window always large by default as
explained in https://www.rabbitmq.com/blog/2024/09/02/amqp-flow-control#incoming-window
* Hence, the client lib sets its incoming-window to 100,000 initially.
* The client lib tracks its incoming-window decrementing it by 1 for
every transfer it received. (This wasn't done prior to this commit.)
* Whenever this window shrinks below 50,000, the client sends a flow
frame without any link information widening its incoming-window back to 100,000.
* For test cases (maybe later for apps as well), there is a new function
`amqp10_client_session:flow/3`, which allows for a test case to do manual
session flow control. Its API is designed very similar to
`amqp10_client_session:flow_link/4` in that the test can optionally request
the lib to auto widen the session window whenever it falls below a certain threshold.
This avoids using Mix while compiling which simplifies
a number of things and let us do further build improvements
later on.
Elixir is only enabled from within rabbitmq_cli currently.
Eunit is disabled since there are only Elixir tests.
Dialyzer will force-enable Elixir in order to process
Elixir-compiled beam files.
This commit also includes a few changes that are
related:
* The Erlang distribution will now be started for parallel-ct
* Many unnecessary PROJECT_MOD lines have been removed
* `eunit_formatters` has been removed, it provides little value
* The new `maybe_flock` Erlang.mk function is used where possible
* Build test deps when testing rabbitmq_cli (Mix won't do it anymore)
* rabbitmq_ct_helpers now use the early plugins to have Dialyzer
properly set up
It also happens from time to time that HTTP clients use the wrong port
5672. Like for TLS clients connecting to 5672, RabbitMQ now prints a
more descriptive log message.
For example
```
curl http://localhost:5672
```
will log
```
[info] <0.946.0> accepting AMQP connection [::1]:57736 -> [::1]:5672
[error] <0.946.0> closing AMQP connection <0.946.0> ([::1]:57736 -> [::1]:5672, duration: '1ms'):
[error] <0.946.0> {detected_unexpected_http_header,<<"GET / HT">>}
```
We only check here for GET and not for all other HTTP methods, since
that's the most common case.
## What?
If a TLS client app is misconfigured trying to connect to AMQP port 5672
instead to the AMQPS port 5671, this commit makes RabbitMQ log a more
descriptive error message.
```
openssl s_client -connect localhost:5672 -tls1_3
openssl s_client -connect localhost:5672 -tls1_2
```
RabbitMQ logs prior to this commit:
```
[info] <0.1073.0> accepting AMQP connection [::1]:53535 -> [::1]:5672
[error] <0.1073.0> closing AMQP connection <0.1073.0> ([::1]:53535 -> [::1]:5672, duration: '0ms'):
[error] <0.1073.0> {bad_header,<<22,3,1,0,192,1,0,0>>}
[info] <0.1080.0> accepting AMQP connection [::1]:53577 -> [::1]:5672
[error] <0.1080.0> closing AMQP connection <0.1080.0> ([::1]:53577 -> [::1]:5672, duration: '1ms'):
[error] <0.1080.0> {bad_header,<<22,3,1,0,224,1,0,0>>}
```
RabbitMQ logs after this commit:
```
[info] <0.969.0> accepting AMQP connection [::1]:53632 -> [::1]:5672
[error] <0.969.0> closing AMQP connection <0.969.0> ([::1]:53632 -> [::1]:5672, duration: '0ms'):
[error] <0.969.0> {detected_unexpected_tls_header,<<22,3,1,0,192,1,0,0>>
[info] <0.975.0> accepting AMQP connection [::1]:53638 -> [::1]:5672
[error] <0.975.0> closing AMQP connection <0.975.0> ([::1]:53638 -> [::1]:5672, duration: '1ms'):
[error] <0.975.0> {detected_unexpected_tls_header,<<22,3,1,0,224,1,0,0>>}
```
## Why?
I've seen numerous occurrences in the past few years where misconfigured TLS apps
connected to the wrong port. Therefore, RabbitMQ trying to detect a TLS client
and providing a more descriptive log message seems appropriate to me.
## How?
The first few bytes of any TLS connection are:
Record Type (1 byte):
Always 0x16 (22 in decimal) for a Handshake message.
Version (2 bytes):
This represents the highest version of TLS that the client supports. Common values:
0x0301 → TLS 1.0 (or SSL 3.1)
0x0302 → TLS 1.1
0x0303 → TLS 1.2
0x0304 → TLS 1.3
Record Length (2 bytes):
Specifies the length of the following handshake message.
Handshake Type (1 byte, usually the 6th byte overall):
Always 0x01 for ClientHello.