For test case leader_locator_balanced, the leaders actually elected were
nodes 1, 3, and 1 because nodes 1 and 3 know about machine version 6
while node 2 only knows about machine version 5.
This commit adds a property test that applies the same Ra commands in
the same order on two different Erlang nodes. The state in which both nodes end
up should be exactly the same.
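A minimal sketch of the property's shape, assuming PropEr; the generator
`ra_command/0` and helper `apply_commands/1` are hypothetical
illustrations, not the actual test code:
```
-include_lib("proper/include/proper.hrl").

%% Apply the same commands, in the same order, on two Erlang nodes and
%% require the resulting machine states to be exactly equal.
prop_same_state_on_both_nodes(NodeA, NodeB) ->
    ?FORALL(Commands, list(ra_command()),   % ra_command/0 is hypothetical
            begin
                StateA = erpc:call(NodeA, ?MODULE, apply_commands, [Commands]),
                StateB = erpc:call(NodeB, ?MODULE, apply_commands, [Commands]),
                StateA =:= StateB
            end).
```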
Ideally, the two nodes should run different OTP versions because that
way we could test for any non-determinism across OTP versions.
However, for now, a test in which both nodes run the same OTP version is
good enough because running this test with rabbit_fifo machine version 5
fails while machine version 6 succeeds.
This reveals another interesting finding: the default "undefined" map
iteration order can differ even between Erlang nodes running the
**same** OTP version.
Prior to this commit, map iteration order was undefined in quorum queues
and could therefore differ across versions of Erlang/OTP.
Example:
OTP 26.2.5.3
```
Erlang/OTP 26 [erts-14.2.5.3] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]
Eshell V14.2.5.3 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
4,25,8,1,23,10,7,9,11,12,28,24,13,3,18,29,26,22,19,2,33,21,32,20,17,30,14,5,6,27,16,31,15,ok
```
OTP 27.3.3
```
Erlang/OTP 27 [erts-15.2.6] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]
Eshell V15.2.6 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
18,4,12,19,29,13,2,7,31,8,10,23,9,15,32,1,25,28,20,6,11,17,24,14,33,3,16,30,21,5,27,26,22,ok
```
This can lead to non-determinism on different members. For example, different
members could potentially return messages in a different order.
This commit introduces a new machine version fixing this bug.
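For illustration (this is not necessarily the fix used by the new
machine version), one way to make the iteration order deterministic is
the ordered map iterator available since OTP 26:
```
%% Iterating in key order yields the same sequence on every node and
%% every OTP version, unlike the default "undefined" order above.
Map = maps:from_keys(lists:seq(1, 33), ok),
maps:foreach(fun(K, _) -> io:format("~b,", [K]) end,
             maps:iterator(Map, ordered)).
%% prints 1,2,3,...,33, on any node
```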
[Why]
The retry logic I added in 4621fe7730
was completely wrong. If Khepri reached its own timeout of 30 seconds (as
of this writing), the mirrored supervisor would retry 50 times because
it would not check the time spent. This means it would retry for 25
minutes. Nice.
That retry would be terminated forcefully by the parent supervisor after
5 minutes if it was part of a shutdown.
[How]
This time, the code simply passes the error (timeout or something else)
down to the following `case`, which will shut the mirrored supervisor down.
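A minimal sketch of the new control flow, with hypothetical names (not
the actual mirrored supervisor code):
```
%% The Khepri result, including a timeout, is handed straight to the
%% case expression; an error stops the mirrored supervisor instead of
%% being retried for up to 25 minutes.
handle_metadata_store_result(Result, State) ->
    case Result of
        ok ->
            {ok, State};
        {error, _} = Error ->
            {stop, Error}
    end.
```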
This fixes very long RabbitMQ node termination (at least 5 minutes,
sometimes more) in testsuites. An example to reproduce:
```
gmake -C deps/rabbitmq_mqtt \
    RABBITMQ_METADATA_STORE=khepri \
    ct-v5 t=cluster_size_3:session_takeover_v3_v5
```
In this one, the third node of the cluster will take 5+ minutes to stop.
From #14014, we learned that the indentation was causing workflows to
not trigger. However, this did not seem to matter when the workflow file
itself was changed. In any case, YAML is sensitive to indentation,
therefore this change is 'correct'. The single quotes are also removed
from paths ending in '*', because they are not required according to the
YAML and GitHub documentation.
The path triggers now match the Selenium workflow that runs on commits
to main and release branches.
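For illustration, a hedged sketch of such a path filter (the path itself
is hypothetical); a glob ending in '*' is a plain YAML scalar, so it
needs no quoting:
```
on:
  push:
    paths:
      # unquoted glob; quotes are only needed when a scalar *starts*
      # with a YAML-special character such as '*'
      - deps/rabbitmq_web_dispatch/**
```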
Queues are automatically declared for MQTT consumers, but they can be deleted
externally. The consumer should be disconnected in such a case, because it has
no way of knowing this happened: from its perspective there are simply no
messages to consume.
In RabbitMQ 3.11 the consumer was disconnected in this situation.
This behaviour changed with native MQTT, which doesn't use AMQP internally.
Test case `tcp_back_pressure_rabbitmq_internal_flow_quorum_queue` succeeds
consistently locally on macOS and fails consistently in CI since 30 May
2025.
CI also shows a test failure instance of `tcp_back_pressure_rabbitmq_internal_flow_classic_queue`, albeit much rarer.
This test case succeeds in CI when using ubuntu-22.04 but fails with ubuntu-24.04.
Even before 30 May 2025, ubuntu-24.04 was used. However, the GitHub
runner version was updated from 20250511.1.0 to 20250527.1.0, which
presumably started to cause this test to fail.
This hypothesis cannot be validated because the GitHub Actions workflow
YAML file doesn't provide a means to configure this version.
File `images/ubuntu/Ubuntu2404-Readme.md` in https://github.com/actions/runner-images/compare/ubuntu24/20250511.1...ubuntu24/20250527.1 shows the diff.
The most notable changes are probably the kernel version change from 6.11.0-1013-azure to 6.11.0-1015-azure and some changes to file `images/ubuntu/scripts/build/configure-environment.sh`.
There seem to be no RabbitMQ related changes causing this test to fail
because this test also fails with an older RabbitMQ version with the new runner
Version: 20250527.1.0.
Neither `meck` nor `inet:setopts(Socket, [{active, once}])` causes the
test failure, because the test also fails with the former approach of
`erlang:suspend_process/1` and `erlang:resume_process/1`.
The test fails due to the following timeout in the writer proc on the
server:
```
** Last message in was {'$gen_cast',
{send_command,<0.760.0>,0,
{'v1_0.transfer',
{uint,3},
{uint,2211},
{binary,<<0,0,8,162>>},
{uint,0},
true,undefined,undefined,undefined,
undefined,undefined,undefined},
<<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx">>}}
** When Server state == #{pending => 3510,socket => #Port<0.49>,
reader => <0.755.0>,
monitored_sessions => [<0.760.0>],
pending_size => 3510}
** Reason for termination ==
** {{writer,send_failed,timeout},
[{rabbit_amqp_writer,flush,1,
[{file,"src/rabbit_amqp_writer.erl"},{line,250}]},
{rabbit_amqp_writer,handle_cast,2,
[{file,"src/rabbit_amqp_writer.erl"},{line,106}]},
{gen_server,try_handle_cast,3,[{file,"gen_server.erl"},{line,2371}]},
{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,2433}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,329}]}]}
```
For unknown reasons, even after the CT test case resumes consumption,
the server still times out writing to the socket.
The most important test expectation is kept in place: the server must
not send all the messages if the client can't receive them fast enough.
Events can be received after the boot step runs but before the application
is started. Creating the ETS table in the supervisor solves this, as the
supervisor is started just before the event handler is installed.
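A minimal sketch of the idea, with hypothetical module and table names:
```
-module(event_collector_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Create the table here so it already exists when the event handler
    %% is installed right after this supervisor starts; the supervisor
    %% owns the table, so it also survives crashes of children writing
    %% to it.
    _ = ets:new(event_collector_table, [named_table, public, set]),
    {ok, {#{strategy => one_for_one, intensity => 1, period => 5}, []}}.
```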
It was expensive to delete files because we had to clean up
the index, and to get the messages in a file we have to scan it.
Instead of cleaning up the index on file delete, this
commit deletes from the index as soon as possible.
There are two scenarios: messages that are removed
from the current write file, and messages that are
removed from other files. In the latter case, we
can just delete the index entry on remove. For messages
in the current write file, we want to keep the entry
in case fanout is used, because we don't want to write
the fanout message multiple times if we can avoid it.
So we keep track of removes in the current write file
and do a cleanup of these entries on file roll over.
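A toy model of the bookkeeping described above, with hypothetical names
(not the actual rabbit_msg_store code):
```
-module(index_cleanup_sketch).
-export([remove/3, roll_over/1]).

%% The index maps MsgId => File. A remove targeting an older file deletes
%% the index entry immediately; a remove targeting the current write file
%% is only recorded, so fanout can still reuse the written message.
remove(MsgId, CurrentFile, {Index, Pending}) ->
    case maps:get(MsgId, Index, undefined) of
        undefined   -> {Index, Pending};
        CurrentFile -> {Index, [MsgId | Pending]};
        _OlderFile  -> {maps:remove(MsgId, Index), Pending}
    end.

%% On file roll-over, clean up the entries recorded for the now-closed
%% write file.
roll_over({Index, Pending}) ->
    {lists:foldl(fun maps:remove/2, Index, Pending), []}.
```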
Compared to the previous implementation, we will no
longer increase the ref_count of messages that are
not in the current write file, meaning we may do more
writes in fanout scenarios. But at the same time, the
file delete operation is much cheaper.
Additionally, we prioritise delete calls in rabbit_msg_store_gc.
Without that change, if the compaction was lagging behind,
we could have file deletion requests queued behind many compaction
requests, leading to many unnecessary compactions of files
that could already be deleted.
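A toy model of that prioritisation (hypothetical helpers, not the actual
rabbit_msg_store_gc code): a selective receive drains pending delete
requests before any compaction request is taken:
```
loop(State) ->
    receive
        {delete, File} ->
            loop(do_delete(File, State))
    after 0 ->
        %% no delete pending: take whatever request arrives next
        receive
            {delete, File}  -> loop(do_delete(File, State));
            {compact, File} -> loop(do_compact(File, State))
        end
    end.

%% hypothetical stand-ins for the real delete and compaction work
do_delete(File, State)  -> io:format("deleting ~p~n", [File]), State.
do_compact(File, State) -> io:format("compacting ~p~n", [File]), State.
```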
Co-authored-by: Michal Kuratczyk <michal.kuratczyk@broadcom.com>