Field #amqqueue.type_state can be undefined for queues declared on
RabbitMQ versions before 3.8.0. Ensure `amqqueue:get_type_state/1`
always returns a map, as its type spec promises, to make life easier
for calling code.
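A minimal sketch of the idea (the record definition here is illustrative, not the actual amqqueue record):
```
-record(amqqueue, {type_state :: undefined | map()}).

%% Normalise an undefined type_state (queues declared before 3.8.0)
%% to an empty map so the accessor always satisfies its -spec.
-spec get_type_state(#amqqueue{}) -> map().
get_type_state(#amqqueue{type_state = undefined}) -> #{};
get_type_state(#amqqueue{type_state = TS}) -> TS.
```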
(cherry picked from commit cc95d37718)
This commit fixes two crashes:
1. Serialising IEEE 754-2008 decimals as well as NaN and +/-Inf for
floats and doubles crashed.
2. Converting IEEE 754-2008 decimals as well as NaN and +/-Inf
for floats and doubles from amqp to amqpl crashed.
The second crash looks as follows:
```
exception exit: {function_clause,
[{mc_amqpl,to_091,
[<<"decimal-32">>,{as_is,116,<<124,0,0,0>>}],
[{file,"mc_amqpl.erl"},{line,747}]},
{mc_amqpl,'-convert_from/3-lc$^2/1-2-',1,
[{file,"mc_amqpl.erl"},{line,155}]},
{mc_amqpl,convert_from,3,
[{file,"mc_amqpl.erl"},{line,155}]},
{mc,convert,3,[{file,"mc.erl"},{line,358}]},
{rabbit_channel,outgoing_content,2,
[{file,"rabbit_channel.erl"},{line,2649}]},
{rabbit_channel,handle_basic_get,7,
[{file,"rabbit_channel.erl"},{line,2636}]},
{rabbit_channel,handle_cast,2,
[{file,"rabbit_channel.erl"},{line,617}]},
{gen_server2,handle_msg,2,
[{file,"gen_server2.erl"},{line,1056}]}]}
```
The second crash is fixed by omitting any `{as_is, _TypeCode, _Binary}`
values during AMQP 1.0 -> AMQP 0.9.1 conversion.
This will be documented in the conversion table.
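A hedged sketch of that omission (simplified; `to_091_header/2` below is a hypothetical stand-in for the real per-type translation in `mc_amqpl`):
```
%% Skip untranslatable {as_is, TypeCode, Binary} values instead of
%% letting the translation crash with a function_clause error.
headers_from_app_props(AppProps) ->
    lists:filtermap(
      fun({_Key, {as_is, _TypeCode, _Binary}}) ->
              false;                             %% omitted, per the fix
         ({Key, Val}) ->
              {true, to_091_header(Key, Val)}
      end, AppProps).

%% Hypothetical, minimal translation; the real mc_amqpl:to_091/2
%% covers many more types.
to_091_header(Key, {utf8, V}) -> {Key, longstr, V};
to_091_header(Key, {long, V}) -> {Key, long, V}.
```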
In addition to fixing these crashes, this commit adds tests that
RabbitMQ is able to store and forward IEEE 754-2008 decimals.
IEEE 754-2008 decimals can be parsed and serialised by RabbitMQ.
However, RabbitMQ doesn't support interpreting these values. For example,
they can't be used on the headers exchange or in AMQP filter
expressions.
(cherry picked from commit 5c318c8e38)
A higher-priority SAC consumer was never activated when a
quiescing consumer returned or requeued its last message.
NB: this required a new machine version: 7
(cherry picked from commit dd6fd0c8e2)
If a stream or quorum queue has opened a file to read a consumer message
and the queue is then deleted, the file handle reference is lost and the
handle is kept open until the end of the channel lifetime.
(cherry picked from commit c688169f08)
On Windows, the file may be in "DELETE PENDING" state following
its deletion (when the last message was acked). A subsequent
message leads us to write to that file again, but we can't
and get an {error,eacces}. In that case we wait 10 ms and retry
up to 3 times.
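A hedged sketch of that retry loop (illustrative; the actual file-writing code lives elsewhere in RabbitMQ):
```
%% Retry a write up to 3 times, sleeping 10 ms between attempts, when
%% Windows reports {error, eacces} for a file in DELETE PENDING state.
write_with_retry(Fd, Bytes) ->
    write_with_retry(Fd, Bytes, 3).

write_with_retry(_Fd, _Bytes, 0) ->
    {error, eacces};
write_with_retry(Fd, Bytes, Retries) ->
    case file:write(Fd, Bytes) of
        ok ->
            ok;
        {error, eacces} ->
            timer:sleep(10),
            write_with_retry(Fd, Bytes, Retries - 1);
        {error, _} = Error ->
            Error
    end.
```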
(cherry picked from commit ff8ecf1cf7)
This commit handles edge cases in the stream SAC coordinator to make
sure it does not crash during execution. Most of these edge cases
consist of an inconsistent state, so they are very unlikely to happen.
This commit also makes sure there are no duplicates in the consumer list
of a group. Consumers are now identified only by their connection
PID and their subscription ID, because the timestamp they now carry in
their state no longer allows a field-by-field comparison.
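A hedged sketch of that identity rule (the record and its fields are illustrative):
```
%% Two consumers are the same iff connection PID and subscription ID
%% match; other fields such as the timestamp are ignored.
-record(consumer, {pid, subscription_id, ts}).

same_consumer(#consumer{pid = Pid, subscription_id = SubId},
              #consumer{pid = Pid, subscription_id = SubId}) ->
    true;
same_consumer(_, _) ->
    false.
```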
(cherry picked from commit b4f7d46842)
New CLI command to trigger a rebalancing in a SAC group and activate a
consumer. This is a last-resort solution if all consumers in a group
accidentally end up in the {connected, waiting} state.
The command re-uses an existing function, which only picks the consumer
that should be active. This means it does not try to "fix" the state
(e.g. by removing a disconnected consumer whose node is definitely
gone from the cluster).
Fixes #14055
(cherry picked from commit 41acc117bd)
Calls to the stream SAC coordinator can fail for various reasons
(e.g. a timeout because of a network partition). The stream reader did not
take into account what the SAC coordinator returned and moved on even
in case of errors. This could lead to inconsistent state for SAC groups.
This commit changes this behavior by handling unexpected errors from the
SAC coordinator and closing the connection. The client is expected to
reconnect. This is safer than risking inconsistent state.
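A hedged sketch of the new behavior (the coordinator call is abstracted as a fun because the real API differs):
```
%% Any unexpected error from the SAC coordinator now stops the stream
%% reader's connection instead of being ignored; the client reconnects.
handle_sac_result(CallFun, Request, State) ->
    case CallFun(Request) of
        {ok, Reply} ->
            {ok, Reply, State};
        {error, Reason} ->
            {stop, {sac_coordinator_error, Reason}, State}
    end.
```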
Fixes #14040
(cherry picked from commit 58f4e83c22)
A boolean status in the stream SAC coordinator is not enough to follow
the evolution of a consumer. For example, a former active consumer that
is stepping down can go down before another consumer in the group is
activated, letting the coordinator expect an activation request that
will never arrive and leaving the group without any active consumer.
This commit introduces three statuses: active (formerly "true"), waiting
(formerly "false"), and deactivating. The coordinator will now know when
a deactivating consumer goes down and will trigger a rebalancing to
avoid a stuck group.
This commit also introduces a status for the connectivity state
of a consumer. The possible values are: connected, disconnected, and
presumed_down. Consumers are connected by default; they become
disconnected if the coordinator receives a down event with a
noconnection reason, meaning the node of the consumer has been
disconnected from the other nodes. Consumers become connected again when
their node rejoins the cluster.
Disconnected consumers are still considered part of a group, as they are
expected to come back at some point. For example, there is no rebalancing
in a group if the active consumer got disconnected.
The coordinator sets a timer when a disconnection occurs. When the timer
expires, the corresponding disconnected consumers move to the "presumed
down" state. At this point they are no longer considered part of their
respective group and are excluded from rebalancing decisions. They are
expected to be removed from the group by the appropriate monitor down
event.
So the consumer status is now a tuple, e.g. {connected, active}. Note
this is an implementation detail: only the stream SAC coordinator deals with
the status of stream SAC consumers.
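A hedged sketch of that status representation and the timer transition (type and function names are illustrative, and the default timeout below is made up; the real coordinator is a Ra state machine and differs in detail):
```
-type connectivity() :: connected | disconnected | presumed_down.
-type activity()     :: active | waiting | deactivating.
-type status()       :: {connectivity(), activity()}.

%% On a 'noconnection' down event the consumer becomes disconnected
%% and a timer is started.
on_connection_down(ConnPid, noconnection, Statuses) ->
    TimeoutMs = application:get_env(rabbit,
                                    stream_sac_disconnected_timeout,
                                    60_000),  %% illustrative default
    _ = erlang:send_after(TimeoutMs, self(),
                          {disconnected_timeout, ConnPid}),
    maps:update_with(ConnPid,
                     fun({_, Activity}) -> {disconnected, Activity} end,
                     Statuses).

%% When the timer fires, a still-disconnected consumer is presumed
%% down and excluded from rebalancing decisions.
on_disconnected_timeout(ConnPid, Statuses) ->
    case Statuses of
        #{ConnPid := {disconnected, Activity}} ->
            Statuses#{ConnPid := {presumed_down, Activity}};
        _ ->
            Statuses
    end.
```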
Two new configuration entries are introduced:
* rabbit.stream_sac_disconnected_timeout: the duration in ms of the
disconnected-to-forgotten timer.
* rabbit.stream_cmd_timeout: the timeout in ms to apply Ra commands
in the coordinator. It used to be a fixed value of 30 seconds. The
default value is still the same. The setting has been introduced to
make integration tests faster.
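For example, in an `advanced.config` file (values shown are illustrative):
```
[
 {rabbit, [
           %% disconnected-to-presumed-down duration, in milliseconds
           {stream_sac_disconnected_timeout, 60000},
           %% timeout in ms for applying Ra commands in the coordinator
           %% (the default remains 30 seconds)
           {stream_cmd_timeout, 30000}
          ]}
].
```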
Fixes #14070
(cherry picked from commit d1aab61566)
This is similar to https://github.com/rabbitmq/rabbitmq-server/pull/14056.
The performance benefit is probably negligible, though, since this is
called only after each batch of Ra commands.
Nevertheless, it's unnecessary to allocate a list with 3 elements and
therefore 6 words on the heap, so let's optimise it.
(cherry picked from commit 2ca47665be)
## What?
PR #13971 added a property test that applies the same quorum queue Raft
command on different quorum queue members on different Erlang nodes,
ensuring that the state machine ends up in exactly the same state.
The different Erlang nodes run the **same** Erlang/OTP version however.
This commit adds another property test where the different Erlang nodes
run **different** Erlang/OTP versions.
## Why?
This test allows spotting any non-determinism that could occur when
running quorum queue members in a mixed version cluster, where mixed
version means in our context different Erlang/OTP versions.
## How?
CI currently runs tests with Erlang 27.
This commit starts an Erlang 26 node in Docker, specifically for the
`rabbit_fifo_prop_SUITE`.
Test case `two_nodes_different_otp_version`, running on Erlang 27, then
transfers a few Erlang modules (e.g. module `rabbit_fifo`) to the Erlang
26 node.
The test case then runs the Ra commands on its own Erlang 27 node and
on the Erlang 26 node in Docker.
By default, this test case is skipped locally.
However, to run this test case locally, simply start an Erlang node as
follows:
```
erl -sname rabbit_fifo_prop@localhost
```
(cherry picked from commit eccf9fee1e)
For test case leader_locator_balanced, the actual leaders elected were
nodes 1, 3, 1, because they know about machine version 6 while node 2
only knows about machine version 5.
(cherry picked from commit 21b6088f00)
This commit adds a property test that applies the same Ra commands in
the same order on two different Erlang nodes. The state in which both
nodes end up should be exactly the same.
Ideally, the two nodes should run different OTP versions, because this
way we could test for any non-determinism across OTP versions.
However, for now, having a test with both nodes on the same OTP
version is good enough, because running this test with rabbit_fifo
machine version 5 fails while machine version 6 succeeds.
This reveals another interesting finding: the default "undefined" map
order can differ even between Erlang nodes running the **same** OTP
version.
(cherry picked from commit 2f78318ee3)
Prior to this commit, map iteration order was undefined in quorum queues
and could therefore differ between versions of Erlang/OTP.
Example:
OTP 26.2.5.3
```
Erlang/OTP 26 [erts-14.2.5.3] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]
Eshell V14.2.5.3 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
4,25,8,1,23,10,7,9,11,12,28,24,13,3,18,29,26,22,19,2,33,21,32,20,17,30,14,5,6,27,16,31,15,ok
```
OTP 27.3.3
```
Erlang/OTP 27 [erts-15.2.6] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]
Eshell V15.2.6 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
18,4,12,19,29,13,2,7,31,8,10,23,9,15,32,1,25,28,20,6,11,17,24,14,33,3,16,30,21,5,27,26,22,ok
```
This can lead to non-determinism on different members. For example, different
members could potentially return messages in a different order.
This commit introduces a new machine version fixing this bug.
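A minimal sketch of the deterministic-iteration idea behind the fix (not the actual `rabbit_fifo` change): sort the keys before iterating instead of relying on the undefined `maps:foreach/2` order.
```
%% Iterate a map in a well-defined (sorted-key) order so every member
%% observes the same sequence regardless of OTP version.
deterministic_foreach(Fun, Map) ->
    lists:foreach(fun(K) -> Fun(K, maps:get(K, Map)) end,
                  lists:sort(maps:keys(Map))).
```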
(cherry picked from commit 2db48432d9)
[Why]
The retry logic I added in 4621fe7730
was completely wrong. If Khepri reached its own timeout of 30 seconds (as
of this writing), the mirrored supervisor would retry 50 times because
it did not check the time spent. This means it would retry for 25
minutes. Nice.
That retry would be terminated forcefully by the parent supervisor after
5 minutes if it was part of a shutdown.
[How]
This time, the code simply passes the error (timeout or something else)
down to the following `case`, which will shut the mirrored supervisor down.
This fixes very long RabbitMQ node termination (at least 5 minutes,
sometimes more) in testsuites. An example to reproduce:
```
gmake -C deps/rabbitmq_mqtt \
    RABBITMQ_METADATA_STORE=khepri \
    ct-v5 t=cluster_size_3:session_takeover_v3_v5
```
In this one, the third node of the cluster will take 5+ minutes to stop.
(cherry picked from commit 376dd2ca60)
Test case `tcp_back_pressure_rabbitmq_internal_flow_quorum_queue` succeeds
consistently locally on macOS and fails consistently in CI since 30 May
2025.
CI also shows a test failure instance of
`tcp_back_pressure_rabbitmq_internal_flow_classic_queue`, albeit much rarer.
This test case succeeds in CI when using ubuntu-22.04 but fails with
ubuntu-24.04.
Even before 30 May 2025, ubuntu-24.04 was used. However, the GitHub runner
version was updated from 20250511.1.0 to 20250527.1.0,
which presumably started causing this test to fail.
This hypothesis cannot be validated because the GitHub Actions
definitions YAML file doesn't provide a means to configure this version.
File `images/ubuntu/Ubuntu2404-Readme.md` in https://github.com/actions/runner-images/compare/ubuntu24/20250511.1...ubuntu24/20250527.1 shows the diff.
The most notable changes are probably the kernel version change from
6.11.0-1013-azure to 6.11.0-1015-azure and some changes to file
`images/ubuntu/scripts/build/configure-environment.sh`.
There seem to be no RabbitMQ-related changes causing this test to fail,
because the test also fails with an older RabbitMQ version on the new
runner version 20250527.1.0.
Neither `meck` nor `inet:setopts(Socket, [{active, once}])` causes the
test failure, because the test also fails with the former
`erlang:suspend_process/1` and `erlang:resume_process/1` approach.
The test fails due to the following timeout in the writer proc on the
server:
```
** Last message in was {'$gen_cast',
{send_command,<0.760.0>,0,
{'v1_0.transfer',
{uint,3},
{uint,2211},
{binary,<<0,0,8,162>>},
{uint,0},
true,undefined,undefined,undefined,
undefined,undefined,undefined},
<<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx">>}}
** When Server state == #{pending => 3510,socket => #Port<0.49>,
reader => <0.755.0>,
monitored_sessions => [<0.760.0>],
pending_size => 3510}
** Reason for termination ==
** {{writer,send_failed,timeout},
[{rabbit_amqp_writer,flush,1,
[{file,"src/rabbit_amqp_writer.erl"},{line,250}]},
{rabbit_amqp_writer,handle_cast,2,
[{file,"src/rabbit_amqp_writer.erl"},{line,106}]},
{gen_server,try_handle_cast,3,[{file,"gen_server.erl"},{line,2371}]},
{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,2433}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,329}]}]}
```
For unknown reasons, even after the CT test case resumes consumption,
the server still times out writing to the socket.
The most important test expectation is kept in place: the server won't
send all the messages if the client can't receive them fast enough.
(cherry picked from commit 0c391a52d3)
We do it at the latest possible moment so as not to break
encrypted value support in 'rabbitmq.conf' files.
See #13998 for context.
Closes #13958.
(cherry picked from commit 30c32ee502)