It's a middle ground between building on every commit, and not building
at all. We currently have a workflow to build the OCI on every PR.
However, building on every PR does not cover some use cases. For
example, providing an image to a user to preview some changes coming in
the next minor or patch. Another use case: compare main with a PR in
Kubernetes.
It's better to have a separate workflow, even at the expense of
duplication, because the "on schedule" trigger only runs on the default
branch of the repository. This "limitation" makes it complicated to
extend the current "build OCI on PRs" to also build nightly for main and
release branches.
For test case leader_locator_balanced the actual leaders elected were
nodes 1, 3, 1 because they know about machine version 6 while node 2
only knows about machine version 5.
This commit adds a property test that applies the same Ra commands in
the same order on two different Erlang nodes. The state in which both nodes end
up should be exactly the same.
Ideally, the two nodes should run different OTP versions because this
way we could test for any non-determinism across OTP versions.
However, for now, having a test with both nodes having the same OTP
verison is good enough because running this test with rabbit_fifo
machine version 5 fails while machine version 6 succeeds.
This reveales another interesting: The default "undefined" map order can
even be different using different Erlang nodes with the **same** OTP
version.
Prior to this commit map iteration order was undefined in quorum queues
and could therefore be different on different versions of Erlang/OTP.
Example:
OTP 26.2.5.3
```
Erlang/OTP 26 [erts-14.2.5.3] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]
Eshell V14.2.5.3 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
4,25,8,1,23,10,7,9,11,12,28,24,13,3,18,29,26,22,19,2,33,21,32,20,17,30,14,5,6,27,16,31,15,ok
```
OTP 27.3.3
```
Erlang/OTP 27 [erts-15.2.6] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]
Eshell V15.2.6 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
18,4,12,19,29,13,2,7,31,8,10,23,9,15,32,1,25,28,20,6,11,17,24,14,33,3,16,30,21,5,27,26,22,ok
```
This can lead to non-determinism on different members. For example, different
members could potentially return messages in a different order.
This commit introduces a new machine version fixing this bug.
[Why]
The retry logic I added in 4621fe7730
was completely wrong. If Khepri reached its own timeout of 30 seconds (as
of this writing), the mirrored supervisor would retry 50 times because
it would not check the time spent. This means it would retry for 25
minutes. Nice.
That retry would be terminated forcefully by the parent supervisor after
5 minutes if it was part of a shutdown.
[How]
This time, the code simply pass the error (timeout or something else)
down to the following `case`. It will shut the mirrored supervisor down.
This fixes very long RabbitMQ node termination (at least 5 minutes,
sometimes more) in testsuites. An example to reproduce:
gmake -C deps/rabbitmq_mqtt \
RABBITMQ_METADATA_STORE=khepri \
ct-v5 t=cluster_size_3:session_takeover_v3_v5
In this one, the third node of the cluster will take 5+ minutes to stop.
From #14014, we learned that the indentation was causing workflows to
not trigger. However, this did not seem to affect when the workflow file
itself was changed. In any case, YAML is sensible to indentation,
therefore this change is 'correct'. Removing single quotes from paths
with '*' at the end, because it is not required according to YAML and
GitHub documentation.
The path triggers now match the Selenium workflow that runs on commits
to main and release branches.
Queues are automatically declared for MQTT consumers, but they can be externally
deleted. The consumer should be disconnected in such case, because it has no way
of knowing this happened - from its perspective there are simply no messages to consume.
In RabbitMQ 3.11 the consumer was disconnected in such situation.
This behaviour changed with native MQTT, which doesn't use AMQP internally.
Test case `tcp_back_pressure_rabbitmq_internal_flow_quorum_queue` succeeds
consistently locally on macOS and fails consistently in CI since 30 May
2025.
CI also shows a test failure instance of `tcp_back_pressure_rabbitmq_internal_flow_classic_queue`, albeit much rearer.
This test case succeeds in CI when using ubuntu-22.04 but fails with ubuntu-24.04.
Even before 30 May 2025, ubuntu-24.04 was used. However the GitHub runner
version was updated from Version: 20250511.1.0 to Version: 20250527.1.0
which presumably started to cause this test to fail.
This hypothesis cannot be validated because the GitHub actions
definitions YAML file doesn't provide a means to configure this version.
File `images/ubuntu/Ubuntu2404-Readme.md` in https://github.com/actions/runner-images/compare/ubuntu24/20250511.1...ubuntu24/20250527.1 shows the diff.
The most notable changes are probably the kernel version change from Kernel Version: 6.11.0-1013-azure to Kernel Version: 6.11.0-1015-azure and some changes to file `images/ubuntu/scripts/build/configure-environment.sh`
There seem to be no RabbitMQ related changes causing this test to fail
because this test also fails with an older RabbitMQ version with the new runner
Version: 20250527.1.0.
Neither `meck` nor `inet:setopts(Socket, [{active, once}])` cause the
test failure because the test also fails with the former
`erlang:suspend_process/1` and `erlang:resume_process/1`.
The test fails due to the following timeout in the writer proc on the
server:
```
** Last message in was {'$gen_cast',
{send_command,<0.760.0>,0,
{'v1_0.transfer',
{uint,3},
{uint,2211},
{binary,<<0,0,8,162>>},
{uint,0},
true,undefined,undefined,undefined,
undefined,undefined,undefined},
<<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx">>}}
** When Server state == #{pending => 3510,socket => #Port<0.49>,
reader => <0.755.0>,
monitored_sessions => [<0.760.0>],
pending_size => 3510}
** Reason for termination ==
** {{writer,send_failed,timeout},
[{rabbit_amqp_writer,flush,1,
[{file,"src/rabbit_amqp_writer.erl"},{line,250}]},
{rabbit_amqp_writer,handle_cast,2,
[{file,"src/rabbit_amqp_writer.erl"},{line,106}]},
{gen_server,try_handle_cast,3,[{file,"gen_server.erl"},{line,2371}]},
{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,2433}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,329}]}]}
```
For unknown reasons, even after the CT test case resumes consumption,
the server still times out writing to the socket.
The most important test expectation that is kept in place is that the
server won't send all the messages if the client can't receive fast
enough.
Events can be received after the boot step but before the application
is started. Creating the ETS in the supervisor solves this, as it is
started just before the event handler is installed.