The correct format is:
```
-export(Functions).
```
ELP detected this malformed syntax.
Interestingly, prior to this commit, the functions were still exported:
```
rabbitmq_amqp_address:module_info(exports).
[{exchange,1},
{exchange,2},
{queue,1},
{module_info,0},
{module_info,1}]
```
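Based on the exports listed above, the corrected attribute presumably takes the usual list form; this is an assumed reconstruction, not copied from the actual commit:
```
%% Assumed reconstruction of the fixed export attribute.
-export([exchange/1,
         exchange/2,
         queue/1]).
```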
MQTT tests depend on a few plugins, each used in only one or two
suites. These have caused issues in CI, triggering a bug in
rabbitmq_federation where the mirrored supervisor submits a transaction
while the cluster is being shut down. The transaction hangs and the
whole rabbitmq_mqtt job times out.
This bug has been addressed; however, it is still best to start only
the required plugins in each SUITE.
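A minimal sketch of what that could look like in a suite's setup, assuming a helper such as `rabbit_ct_broker_helpers:enable_plugin/3` is available in rabbitmq-ct-helpers (helper name, node index and plugin are illustrative):
```
%% Sketch: enable only the plugin this suite actually needs, after the
%% usual broker setup steps (elided) have started the node(s).
init_per_suite(Config) ->
    ok = rabbit_ct_broker_helpers:enable_plugin(Config, 0, rabbitmq_federation),
    Config.
```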
[Why]
Links are started by the plugins but put under the `rabbit` supervision
tree. Unfortunately, the federation plugins' own supervision trees are
empty.
Links are stopped by a boot step executed by `rabbit`, as a consequence
of unregistering the plugins' parameters.
Unfortunately, links can be terminated if the channel, and implicitly
the connection, stops. This happens when the `amqp_client` application
stops.
We end up with a race here:
* Because the federation plugins' supervision trees are empty and the
application stop functions merely stop the pg group (which doesn't
terminate the group members), nothing waits for the links to stop.
Therefore, `rabbit` can stop `amqp_client`, which is a dependency of
the federation plugins, and the links' underlying channels and
connections are stopped.
* `rabbit` unregisters the federation parameters, terminating the links.
The exchange links' `terminate/2` function needs the channel to delete
the remote queue, but the channel and the underlying connection might
already be gone.
This simply logs a `badmatch` exception:
```
[error] <0.884.0> Federation link could not create a disposable (one-off) channel due to an error error: {badmatch,
[error] <0.884.0> {error,
[error] <0.884.0> {noproc,
[error] <0.884.0> {gen_server,
[error] <0.884.0> call,
[error] <0.884.0> [<0.911.0>,
[error] <0.884.0> {command,
[error] <0.884.0> {open_channel,
[error] <0.884.0> none,
[error] <0.884.0> {amqp_selective_consumer,
[error] <0.884.0> []}}},
[error] <0.884.0> 130000]}}}}
```
[How]
The solution is to make sure links are stopped as part of stopping the
plugins.
`rabbit_federation_pg:stop_scope/1` is expanded to stop all members of
all groups in this scope before terminating the pg scope itself. The
new code waits for the stopped processes to exit.
We have to handle the `EXIT` signal in the link processes and change
their restart strategy in their parent supervisor from permanent to
transient. This ensures they are restarted only if they crash, and it
also avoids an error log message for each stopped link.
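A minimal sketch of the stop-and-wait approach described above, using the standard `pg` API; names other than the `pg` calls are illustrative rather than the actual rabbitmq_federation code:
```
%% Stop every member of every group in the scope and wait for each one
%% to exit before the pg scope itself is terminated.
stop_scope_members(Scope) ->
    Members = lists:usort(
                lists:append(
                  [pg:get_members(Scope, Group)
                   || Group <- pg:which_groups(Scope)])),
    MRefs = [begin
                 MRef = erlang:monitor(process, Pid),
                 %% The link processes trap exits and stop cleanly on
                 %% this signal (see the `EXIT' handling mentioned above).
                 exit(Pid, normal),
                 MRef
             end || Pid <- Members],
    [receive {'DOWN', MRef, process, _Pid, _Reason} -> ok end
     || MRef <- MRefs],
    ok.
```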
## What?
PR #13971 added a property test that applies the same quorum queue Raft
command on different quorum queue members on different Erlang nodes,
ensuring that the state machine ends up in exactly the same state.
However, the different Erlang nodes run the **same** Erlang/OTP version.
This commit adds another property test where the different Erlang nodes
run **different** Erlang/OTP versions.
## Why?
This test allows spotting any non-determinism that could occur when
running quorum queue members in a mixed-version cluster, where mixed
version in this context means different Erlang/OTP versions.
## How?
CI currently runs tests with Erlang 27.
This commit starts an Erlang 26 node in Docker, specifically for the
`rabbit_fifo_prop_SUITE`.
Test case `two_nodes_different_otp_version`, running on Erlang 27, then
transfers a few Erlang modules (e.g. `rabbit_fifo`) to the Erlang 26 node.
The test case then runs the Ra commands on its own node in Erlang 27 and
on the Erlang 26 node in Docker.
By default, this test case is skipped locally.
However, to run this test case locally, simply start an Erlang node as
follows:
```
erl -sname rabbit_fifo_prop@localhost
```
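The module transfer mentioned above can be done with standard OTP calls; a hedged sketch, where function and variable names are illustrative rather than necessarily the suite's actual code:
```
%% Load a locally compiled module (e.g. rabbit_fifo) onto the remote
%% Erlang 26 node so both nodes run identical state-machine code.
transfer_module(Node, Module) ->
    {Module, Binary, Filename} = code:get_object_code(Module),
    {module, Module} = rpc:call(Node, code, load_binary,
                                [Module, Filename, Binary]),
    ok.
```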
For test case `leader_locator_balanced`, the leaders actually elected were
nodes 1, 3, and 1 because those nodes know about machine version 6 while
node 2 only knows about machine version 5.
This commit adds a property test that applies the same Ra commands in
the same order on two different Erlang nodes. The state in which both
nodes end up should be exactly the same.
Ideally, the two nodes should run different OTP versions, because this
way we could test for non-determinism across OTP versions.
However, for now, a test where both nodes run the same OTP version is
good enough, because running it with rabbit_fifo machine version 5
fails while machine version 6 succeeds.
This reveals another interesting finding: the default, undefined map
iteration order can differ even between Erlang nodes running the
**same** OTP version.
Prior to this commit, map iteration order was undefined in quorum queues
and could therefore differ across versions of Erlang/OTP.
Example:
OTP 26.2.5.3
```
Erlang/OTP 26 [erts-14.2.5.3] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]
Eshell V14.2.5.3 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
4,25,8,1,23,10,7,9,11,12,28,24,13,3,18,29,26,22,19,2,33,21,32,20,17,30,14,5,6,27,16,31,15,ok
```
OTP 27.3.3
```
Erlang/OTP 27 [erts-15.2.6] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]
Eshell V15.2.6 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
18,4,12,19,29,13,2,7,31,8,10,23,9,15,32,1,25,28,20,6,11,17,24,14,33,3,16,30,21,5,27,26,22,ok
```
This can lead to non-determinism across members: for example, different
members could return messages in a different order.
This commit introduces a new machine version that fixes this bug.
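A minimal sketch of deterministic iteration, assuming the fix sorts the keys before iterating (the actual machine-version change may differ):
```
%% Iterate a map in a deterministic (sorted-key) order, regardless of
%% the Erlang/OTP version or the map's internal representation.
ordered_foreach(Fun, Map) ->
    lists:foreach(fun(K) -> Fun(K, maps:get(K, Map)) end,
                  lists:sort(maps:keys(Map))).
```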
[Why]
The retry logic I added in 4621fe7730 was completely wrong. If Khepri
reached its own timeout of 30 seconds (as of this writing), the mirrored
supervisor would retry 50 times because it did not check the time spent.
This means it could retry for 25 minutes.
That retry would be terminated forcefully by the parent supervisor after
5 minutes if it was part of a shutdown.
[How]
This time, the code simply passes the error (timeout or otherwise) down
to the following `case`, which shuts the mirrored supervisor down.
This fixes very long RabbitMQ node termination (at least 5 minutes,
sometimes more) in test suites. An example to reproduce:
```
gmake -C deps/rabbitmq_mqtt \
    RABBITMQ_METADATA_STORE=khepri \
    ct-v5 t=cluster_size_3:session_takeover_v3_v5
```
In this one, the third node of the cluster will take 5+ minutes to stop.
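A hedged sketch of the control flow described in [How]; function and variable names are illustrative, not the real mirrored supervisor code:
```
%% Propagate a Khepri timeout (or any other error) instead of retrying,
%% letting the following `case' shut the mirrored supervisor down.
update_childspec(Group, ChildSpec) ->
    case store_childspec(Group, ChildSpec) of   %% may return {error, timeout}
        ok ->
            ok;
        {error, _Reason} = Error ->
            %% Previously: a 50-iteration retry loop that ignored the
            %% time already spent. Now the error is simply passed down.
            Error
    end.
```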
Queues are automatically declared for MQTT consumers, but they can be
deleted externally. The consumer should be disconnected in such a case,
because it has no way of knowing this happened - from its perspective
there are simply no messages to consume.
In RabbitMQ 3.11 the consumer was disconnected in this situation. This
behaviour changed with Native MQTT, which doesn't use AMQP internally.
Test case `tcp_back_pressure_rabbitmq_internal_flow_quorum_queue` succeeds
consistently locally on macOS but has failed consistently in CI since
30 May 2025.
CI also shows a test failure instance of
`tcp_back_pressure_rabbitmq_internal_flow_classic_queue`, albeit much more
rarely.
This test case succeeds in CI when using ubuntu-22.04 but fails with
ubuntu-24.04.
Even before 30 May 2025, ubuntu-24.04 was used. However, the GitHub runner
image was updated from version 20250511.1.0 to version 20250527.1.0, which
presumably started causing this test to fail.
This hypothesis cannot be validated because the GitHub Actions workflow
YAML file doesn't provide a means to configure this version.
File `images/ubuntu/Ubuntu2404-Readme.md` in
https://github.com/actions/runner-images/compare/ubuntu24/20250511.1...ubuntu24/20250527.1
shows the diff. The most notable changes are probably the kernel version
change from 6.11.0-1013-azure to 6.11.0-1015-azure and some changes to file
`images/ubuntu/scripts/build/configure-environment.sh`.
There seem to be no RabbitMQ-related changes causing this test to fail,
because the test also fails with an older RabbitMQ version on the new
runner version 20250527.1.0.
Neither `meck` nor `inet:setopts(Socket, [{active, once}])` causes the
test failure, because the test also fails with the former
`erlang:suspend_process/1` and `erlang:resume_process/1` approach.
The test fails due to the following timeout in the writer process on the
server:
```
** Last message in was {'$gen_cast',
{send_command,<0.760.0>,0,
{'v1_0.transfer',
{uint,3},
{uint,2211},
{binary,<<0,0,8,162>>},
{uint,0},
true,undefined,undefined,undefined,
undefined,undefined,undefined},
<<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx">>}}
** When Server state == #{pending => 3510,socket => #Port<0.49>,
reader => <0.755.0>,
monitored_sessions => [<0.760.0>],
pending_size => 3510}
** Reason for termination ==
** {{writer,send_failed,timeout},
[{rabbit_amqp_writer,flush,1,
[{file,"src/rabbit_amqp_writer.erl"},{line,250}]},
{rabbit_amqp_writer,handle_cast,2,
[{file,"src/rabbit_amqp_writer.erl"},{line,106}]},
{gen_server,try_handle_cast,3,[{file,"gen_server.erl"},{line,2371}]},
{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,2433}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,329}]}]}
```
For unknown reasons, even after the CT test case resumes consumption,
the server still times out writing to the socket.
The most important test expectation kept in place is that the server
won't send all the messages if the client can't receive them fast
enough.
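For context, back-pressure in tests like this is typically created by having the client stop reading from the socket so the TCP window fills and the server's writer blocks; a minimal sketch, with names and timing purely illustrative:
```
%% Pause consumption so the kernel buffers and the TCP window fill up,
%% forcing the server side to block on send, then resume reading.
simulate_back_pressure(Socket, PauseMillis) ->
    ok = inet:setopts(Socket, [{active, false}]),  %% stop pulling data
    timer:sleep(PauseMillis),
    ok = inet:setopts(Socket, [{active, once}]).   %% resume, one packet
```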
Events can be received after the boot step runs but before the
application is started. Creating the ETS table in the supervisor solves
this, as the supervisor is started just before the event handler is
installed.
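A hedged sketch of this pattern; module and table names are illustrative:
```
%% Create the ETS table in the supervisor's init/1 so it already exists
%% by the time the event handler is installed and events start arriving.
-module(my_events_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Owned by the long-lived supervisor process, so the table survives
    %% restarts of the children that read from and write to it.
    _ = ets:new(my_events, [named_table, public, set]),
    {ok, {#{strategy => one_for_one}, []}}.
```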