[Why]
Up to RabbitMQ 3.13.x, there was a case where if:
1. you enabled a plugin
2. you enabled its feature flags
3. you disabled the plugin
4. you restarted a node (or upgraded it)
... the node could crash on startup because it had a feature flag marked
as enabled that it didn't know about:
```
error:{badmatch,#{feature_flags => ...
rabbit_ff_controller:-check_one_way_compatibility/2-fun-0-/3, line 514
lists:all_1/2, line 1520
rabbit_ff_controller:are_compatible/2, line 496
rabbit_ff_controller:check_node_compatibility_task1/4, line 437
rabbit_db_cluster:check_compatibility/1, line 376
```
This was "fixed" by the new way of keeping the registry in memory
(#10988) because it introduces a slight change of behavior. Indeed, the
old way walked through the `FeatureFlags` map and looked up the state in
the `FeatureStates` map to create the `is_enabled/1` function. The new
way just looks up the state in `FeatureStates`.
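A minimal sketch of the difference (simplified shapes, not the actual registry code; the `make_is_enabled_*` names are hypothetical):
```
%% Old way: derive is_enabled/1 by walking the known `FeatureFlags`
%% map and looking each state up in `FeatureStates`.
make_is_enabled_old(FeatureFlags, FeatureStates) ->
    Enabled = [Name || Name <- maps:keys(FeatureFlags),
                       maps:get(Name, FeatureStates, false) =:= true],
    fun(FeatureName) -> lists:member(FeatureName, Enabled) end.

%% New way: look the state up directly in `FeatureStates`, so a state
%% entry for a flag this node doesn't know about is harmless.
make_is_enabled_new(FeatureStates) ->
    fun(FeatureName) -> maps:get(FeatureName, FeatureStates, false) =:= true end.
```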
[How]
The new testcase succeeds on 4.0.x and `main`, but would fail on 3.13.x
with the aforementioned crash.
[Why]
The feature flag controller that is responsible for enabling a feature
flag may be on a node that doesn't know this feature flag. This is
supported, but there is a bug when it queries the callback definition for
that feature flag: it uses its own registry, which does not have anything
about this feature flag.
This leads to a crash because the `run_callback/5` function tries to use
the `undefined` atom returned by the registry as a map:
```
crasher:
initial call: rabbit_ff_controller:init/1
pid: <0.374.0>
registered_name: rabbit_ff_controller
exception error: bad map: undefined
in function rabbit_ff_controller:run_callback/5
in call from rabbit_ff_controller:do_enable/3 (rabbit_ff_controller.erl, line 1244)
in call from rabbit_ff_controller:update_feature_state_and_enable/2 (rabbit_ff_controller.erl, line 1180)
in call from rabbit_ff_controller:enable_with_registry_locked/2 (rabbit_ff_controller.erl, line 1050)
in call from rabbit_ff_controller:enable_many_locked/2 (rabbit_ff_controller.erl, line 991)
in call from rabbit_ff_controller:enable_many/2 (rabbit_ff_controller.erl, line 979)
in call from rabbit_ff_controller:updating_feature_flag_states/3 (rabbit_ff_controller.erl, line 307)
in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 3735)
```
[How]
The callback definition is now queried from the first node in the list
given as argument. For the common use case where all nodes know about a
feature flag, the first node is the local one, so there should be no
latency caused by the RPC.
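A minimal sketch of the idea (`local_callback_definition/1` is a hypothetical helper, not the actual controller API):
```
%% Query the callback definition from the first node in the list,
%% which knows the feature flag even if the local node doesn't. In
%% the common case the first node is the local one, so no RPC is
%% performed.
callback_definition(FeatureName, [FirstNode | _]) when FirstNode =:= node() ->
    local_callback_definition(FeatureName);
callback_definition(FeatureName, [FirstNode | _]) ->
    erpc:call(FirstNode, ?MODULE, local_callback_definition, [FeatureName]).
```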
See #12963.
[Why]
Once `khepri_db` is enabled by default, we need another way to disable it
to select Mnesia instead.
[How]
We use the new relative forced feature flags mechanism to indicate if we
want to explicitly enable or disable `khepri_db`. This way, we don't
touch other stable feature flags and only mess with Khepri.
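For illustration, the relative syntax looks like this (a sketch; only `khepri_db` is forced, all other feature flags keep their default states):
```
RABBITMQ_FEATURE_FLAGS="+khepri_db"   # explicitly enable Khepri
RABBITMQ_FEATURE_FLAGS="-khepri_db"   # explicitly disable it and use Mnesia
```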
However, this mechanism is not supported by RabbitMQ 4.0.x and older.
They will ignore the setting. Therefore, to make this work in
mixed-version testing, we set the `$RABBITMQ_FEATURE_FLAGS` variable for
the secondary umbrella. This part will go away once we test against
RabbitMQ 4.1.x as the secondary umbrella in the future.
At the end, we compare the effective metadata store to the expected one.
If they don't match, we skip the test.
While here, change `rjms_topic_selector_SUITE` to only choose Khepri
without specifying any feature flags.
Fixes #12933
The assumption that `x-last-death-*` annotations must have been set
whenever the `deaths` annotation is set was wrong.
Reproduction steps, Option 1:
1. In v3.13.7, dead letter a message from Q1 to Q2 (both can be classic queues).
2. Re-publish the message including its x-death header from Q2 back to Q1.
(RabbitMQ 3.13.7 will interpret this x-death header and set the deaths annotation.)
3. Upgrade to v4.0.4
4. Dead lettering the message from Q1 to Q2 will cause the following crash:
```
crasher:
initial call: rabbit_amqqueue_process:init/1
pid: <0.577.0>
registered_name: []
exception exit: {{badkey,<<"x-last-death-exchange">>},
[{mc,record_death,4,[{file,"mc.erl"},{line,410}]},
{rabbit_dead_letter,publish,5,
[{file,"rabbit_dead_letter.erl"},{line,38}]},
{rabbit_amqqueue_process,'-dead_letter_msgs/4-fun-0-',
7,
[{file,"rabbit_amqqueue_process.erl"},{line,1060}]},
{rabbit_variable_queue,'-ackfold/4-fun-0-',3,
[{file,"rabbit_variable_queue.erl"},{line,655}]},
{lists,foldl,3,[{file,"lists.erl"},{line,2146}]},
{rabbit_variable_queue,ackfold,4,
[{file,"rabbit_variable_queue.erl"},{line,652}]},
{rabbit_priority_queue,ackfold,4,
[{file,"rabbit_priority_queue.erl"},{line,309}]},
{rabbit_amqqueue_process,
'-dead_letter_rejected_msgs/3-fun-0-',5,
[{file,"rabbit_amqqueue_process.erl"},
{line,1038}]}]}
```
Reproduction steps, Option 2:
1. Run a 4.0.4 / 3.13.7 mixed version cluster where both queues Q1 and Q2
are hosted on the 4.0.4 node.
2. Send a message to Q1 which dead letters to Q2.
3. Re-publish a message with the x-death AMQP 0.9.1 header from Q2 to
Q1. However, this time make sure to publish to the 3.13.7 node which
forwards this message to Q1 on the 4.0.4 node.
4. Subsequently dead lettering this message from Q1 to Q2 (happening on
the 4.0.4 node) will also cause the crash.
The modified test case in this commit was able to repro this crash via
Option 2 in the mixed version cluster tests on the `v4.0.x` branch.
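The fix therefore reads these annotations defensively. A minimal sketch of the pattern (simplified shapes, not the actual `mc` code):
```
%% A message carrying a `deaths` annotation (e.g. re-imported via an
%% x-death header) may lack the x-last-death-* annotations, so read
%% them with defaults instead of crashing with {badkey, ...}.
last_death(Anns, DefaultExchange, DefaultQueue) ->
    {maps:get(<<"x-last-death-exchange">>, Anns, DefaultExchange),
     maps:get(<<"x-last-death-queue">>, Anns, DefaultQueue)}.
```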
Prior to this commit, when the sending client overshot RabbitMQ's incoming-window
(which is allowed in the event of a cluster-wide memory or disk alarm)
and RabbitMQ sent a FLOW frame to the client, RabbitMQ sent a negative
incoming-window field in the FLOW frame, causing the following crash in
the writer proc:
```
crasher:
initial call: rabbit_amqp_writer:init/1
pid: <0.19353.0>
registered_name: []
exception error: bad argument
in function iolist_size/1
called as iolist_size([<<112,0,0,23,120>>,
[82,-15],
<<"pÿÿÿü">>,<<"pÿÿÿÿ">>,67,
<<112,0,0,23,120>>,
"Rª",64,64,64,64])
*** argument 1: not an iodata term
in call from amqp10_binary_generator:generate1/1 (amqp10_binary_generator.erl, line 141)
in call from amqp10_binary_generator:generate1/1 (amqp10_binary_generator.erl, line 88)
in call from amqp10_binary_generator:generate/1 (amqp10_binary_generator.erl, line 79)
in call from rabbit_amqp_writer:assemble_frame/3 (rabbit_amqp_writer.erl, line 206)
in call from rabbit_amqp_writer:internal_send_command_async/3 (rabbit_amqp_writer.erl, line 189)
in call from rabbit_amqp_writer:handle_cast/2 (rabbit_amqp_writer.erl, line 110)
in call from gen_server:try_handle_cast/3 (gen_server.erl, line 1121)
```
This commit fixes this crash by maintaining a floor of zero for
incoming-window in the FLOW frame.
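In essence (a sketch, not the exact session code):
```
%% Clamp the advertised incoming-window at zero so the FLOW frame can
%% never carry a negative value when the client overshoots the window
%% during a memory or disk alarm.
incoming_window_for_flow(IncomingWindow) when is_integer(IncomingWindow) ->
    max(0, IncomingWindow).
```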
Fixes #12816
The credit_flow between publishing AMQP 0.9.1 channel (or MQTT
connection) and (non-mirrored) classic queue processes was
unintentionally removed in 4.0 together with anything else related to
CQ mirroring.
By default we restore the 3.x behaviour for non-mirrored classic
queues. It is possible to disable flow control (the earlier 4.0.x
behaviour) with the new env `classic_queue_flow_control`. In 3.x this
was possible with the config `mirroring_flow_control`.
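For example (a sketch, assuming `classic_queue_flow_control` is a boolean application environment variable of the `rabbit` application):
```
%% advanced.config sketch: set to false to opt back into the earlier
%% 4.0.x behaviour of no credit flow for classic queues.
[{rabbit, [{classic_queue_flow_control, false}]}].
```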
(cherry picked from commit d65bd7d07a)
Introduce a single place in the AMQP 1.0 Erlang client that infers the AMQP 1.0 type.
Erlang integers are inferred to be AMQP type `long` to avoid overflow surprises.
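An illustrative sketch of such an inference rule (`infer_type/1` is a hypothetical name; the concrete clause set in the client may differ):
```
%% Map Erlang terms to tagged AMQP 1.0 types. Integers always become
%% `long` so values that outgrow smaller integer types cannot
%% silently overflow.
infer_type(V) when is_boolean(V) -> {boolean, V};
infer_type(V) when is_integer(V) -> {long, V};
infer_type(V) when is_float(V)   -> {double, V};
infer_type(V) when is_binary(V)  -> {utf8, V}.
```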
We don't expect random bytes to be there in the current
version of the message store as we overwrite empty spaces
with zeroes when moving messages around.
We also don't expect messages to be falsely flagged while
the broker is running because it checks for message
validity in the index. Therefore, make sure message bodies
in the tests don't contain byte 255.
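For instance, a helper along these lines (a sketch; the actual suites may build bodies differently):
```
%% Generate a random body whose bytes are all in 0..254, avoiding
%% byte 255 in the tests.
body_without_255(Size) ->
    << <<(rand:uniform(255) - 1)>> || _ <- lists:seq(1, Size) >>.
```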
## What?
Prior to this commit, the `rabbitmq_event_exchange` plugin always internally
published AMQP 0.9.1 messages to the `amq.rabbitmq.event` topic exchange.
This commit allows users to configure the plugin to publish AMQP 1.0
messages instead.
## Why?
Prior to this commit, when an AMQP 1.0 client consumed events,
event properties that are lists were omitted. For example, property
`client_properties` of event `connection.created` or property
`arguments` of event `queue.created` were omitted because of the following sequence:
1. The event exchange plugin listens for all kinds of internal events.
2. The event exchange plugin re-publishes all events as AMQP 0.9.1 messages to the event exchange.
3. Later, when an AMQP 1.0 client consumes this message, the broker must translate the message from AMQP 0.9.1 to AMQP 1.0.
4. This translation follows the rules outlined in https://www.rabbitmq.com/docs/conversions#amqpl-amqp
5. Specifically, the row before the last one in that table describes the rule we're hitting here. It says that if an AMQP 0.9.1
header is not `x-` prefixed and its value is an array or table, then this header is not converted.
That's because AMQP 1.0 application-properties must be simple types, as mandated in https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-messaging-v1.0-os.html#type-application-properties
## How?
The user can configure the plugin as follows to have the plugin
internally publish AMQP 1.0 messages:
```
event_exchange.protocol = amqp_1_0
```
To support complex types such as lists, the plugin sets all event
properties as AMQP 1.0 message-annotations. The plugin prefixes all message
annotation keys with `x-opt-` to comply with the AMQP 1.0 spec.
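A consumer using the `amqp10_client` Erlang library could then read a complex property from the annotations, sketched here (the exact key depends on the event):
```
%% List-valued event properties survive as AMQP 1.0 message
%% annotations under x-opt- prefixed keys.
Anns = amqp10_msg:message_annotations(Msg),
ClientProps = maps:get(<<"x-opt-client_properties">>, Anns),
```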
## Alternative Design
An alternative design would have been to format all event properties
e.g. as JSON within the message body. However, this breaks routing on
specific event property values via a headers exchange.
## Documentation
https://github.com/rabbitmq/rabbitmq-website/pull/2129
This test flaked in CI with the following error:
```
=== === Reason: no match of right hand side value {error,half_attached}
in function amqp_utils:detach_link_sync/1 (amqp_utils.erl, line 100)
in call from amqp_filtex_SUITE:properties_section/1 (amqp_filtex_SUITE.erl, line 187)
in call from test_server:ts_tc/3 (test_server.erl, line 1793)
in call from test_server:run_test_case_eval1/6 (test_server.erl, line 1302)
in call from test_server:run_test_case_eval/9 (test_server.erl, line 1234)
```
Increase the wait for credit to be applied, as described in commit
aeedad7b51, since this test case still flakes rarely with:
```
=== === Reason: {assertEqual,[{module,amqp_client_SUITE},
{line,3030},
{expression,"amqp10_msg : body ( Msg1 )"},
{expected,[<<"1">>]},
{value,[<<"2">>]}]}
in function amqp_client_SUITE:detach_requeues_two_connections/2 (amqp_client_SUITE.erl, line 3030)
in call from test_server:ts_tc/3 (test_server.erl, line 1793)
in call from test_server:run_test_case_eval1/6 (test_server.erl, line 1302)
in call from test_server:run_test_case_eval/9 (test_server.erl, line 1234)
```
It is possible for a slow running follower with local consumers
to crash after a snapshot installation as it tries to read an entry
from its log that is no longer there (as it has been consumed and
completed by another node but still refers to prior consumers on the
current node).
This commit makes the log effect callback function more defensive:
it checks whether the number of commands returned by the log effect
differs from what was requested. If it does, we consider this a
stale read request and return no further effects.
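In outline (a sketch with hypothetical names, not the actual quorum queue code):
```
%% Treat a short read from the log as a stale request after a
%% snapshot installation and return no further effects.
log_effect_callback(RequestedCount, Commands, State) ->
    case length(Commands) =:= RequestedCount of
        true  -> handle_commands(Commands, State); %% hypothetical helper
        false -> {State, []}
    end.
```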
Conflicts:
deps/rabbit/test/quorum_queue_SUITE.erl
[Why]
Before this patch, required feature flags were basically checked during
boot: they must have been enabled when they were mere stable feature
flags. If they were not, the node refused to boot.
This was easy for the developer because making a feature flag required
allowed to remove the entire compatibility code. Very satisfying.
Unfortunately, this was a pain point for end users, especially those who
did not pay attention to RabbitMQ and the release notes and were just
asking their package manager to update everything. They could end up
with a node that refused to boot. The only solution was to downgrade,
enable the disabled stable feature flags, then upgrade again.
[How]
This patch introduces two levels of requirement to required feature
flags:
* `hard`: this corresponds to the existing behavior where a node will
refuse to boot if a hard required feature flag is not enabled before
the upgrade.
* `soft`: such a required feature flag will be automatically enabled
during the upgrade to a version where it is marked as required.
The level of requirement is set in the feature flag definition:
```
-rabbit_feature_flag(
   {my_feature_flag,
    #{stability     => required,
      require_level => hard
     }}).
```
The default requirement level is `soft`. All existing required feature
flags now have a requirement level of `hard`.
The handling of soft required feature flags is done when the cluster
feature flags states are verified and synchronized. If a required
feature flag is not enabled yet, it is enabled at that time.
This means that as developers, we will have to keep compatibility code
forever for every soft required feature flag, like the feature flag
definition itself.
This test flakes in CI as described in
https://github.com/rabbitmq/rabbitmq-server/issues/12413#issuecomment-2419293869
The test case fails with
```
Node: rabbit_shard2@localhost
Case: amqp_system_SUITE:access_failure
Reason: {error,{{badmatch,{error,134,
"Unhandled exception. System.Exception: expected exception not received
at Program.Test.accessFailure(String uri) in /home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbit/test/amqp_system_SUITE_data/fsharp-tests/Program.fs:line 477
at Program.main(String[] argv) in /home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbit/test/amqp_system_SUITE_data/fsharp-tests/Program.fs:line 509\n"}},
[{amqp_system_SUITE,run_dotnet_test,2,
[{file,"amqp_system_SUITE.erl"},
{line,257}]},
```
However, RabbitMQ closes the session as expected due to the missing read
permissions to the queue as shown in the RabbitMQ logs:
```
[debug] <0.1321.0> Asked to create a new user 'access_failure', password length in bytes: 24
[info] <0.1321.0> Created user 'access_failure'
[debug] <0.1324.0> Asked to set permissions for user 'access_failure' in virtual host '/' to '.*', '^banana.*', '^banana.*'
[info] <0.1324.0> Successfully set permissions for user 'access_failure' in virtual host '/' to '.*', '^banana.*', '^banana.*'
[info] <0.1333.0> accepting AMQP connection 127.0.0.1:36248 -> 127.0.0.1:25000
[debug] <0.1333.0> User 'access_failure' authenticated successfully by backend rabbit_auth_backend_internal
[info] <0.1333.0> Connection from AMQP 1.0 container 'AMQPNetLite-101d7d51': user 'access_failure' authenticated using SASL mechanism PLAIN and granted access to vhost '/'
[debug] <0.1333.0> AMQP 1.0 connection.open frame: hostname = 127.0.0.1, extracted vhost = /, idle-time-out = undefined
[debug] <0.1333.0> AMQP 1.0 created session process <0.1338.0> for channel number 0
[warning] <0.1338.0> Closing session for connection <0.1333.0>: {'v1_0.error',
[warning] <0.1338.0> {symbol,
[warning] <0.1338.0> <<"amqp:unauthorized-access">>},
[warning] <0.1338.0> {utf8,
[warning] <0.1338.0> <<"read access to queue 'test' in vhost '/' refused for user 'access_failure'">>},
[warning] <0.1338.0> undefined}
[debug] <0.1333.0> AMQP 1.0 closed session process <0.1338.0> with channel number 0
[warning] <0.1333.0> closing AMQP connection <0.1333.0> (127.0.0.1:36248 -> 127.0.0.1:25000, duration: '269ms'):
[warning] <0.1333.0> client unexpectedly closed TCP connection
```
```
let receiver = ReceiverLink(ac.Session, "test-receiver", src)
```
uses a null constructor for the onAttached callback.
ReceiverLink doesn't seem to block.
Given that the exact same authorization error is already tested in test
case attach_source_queue of amqp_auth_SUITE, it's safe to delete this F#
test.
Prior to this commit, tests
* leader_transfer_quorum_queue_credit_single
* leader_transfer_quorum_queue_credit_batches
flaked in CI during 4.1 (main) and 4.0 mixed version testing.
The following error occurred on node 0:
```
[error] <0.1950.0> Timed out waiting for credit reply from quorum queue 'leader_transfer_quorum_queue_credit_batches' in vhost '/'. Hint: Enable feature flag rabbitmq_4.0.0
[warning] <0.1950.0> Closing session for connection <0.1945.0>: {'v1_0.error',
[warning] <0.1950.0> {symbol,<<"amqp:internal-error">>},
[warning] <0.1950.0> {utf8,
[warning] <0.1950.0> <<"Timed out waiting for credit reply from quorum queue 'leader_transfer_quorum_queue_credit_batches' in vhost '/'. Hint: Enable feature flag rabbitmq_4.0.0">>},
[warning] <0.1950.0> undefined}
```
Therefore we enable this feature flag for both tests.
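For example, at the start of each affected test case (assuming the `rabbit_ct_broker_helpers` test helper used elsewhere in these suites):
```
%% Make sure the flag is enabled before running the test body.
ok = rabbit_ct_broker_helpers:enable_feature_flag(Config, 'rabbitmq_4.0.0'),
```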
This commit also simplifies some test setups that were necessary for
4.0/3.13 mixed version testing, but aren't necessary anymore for 4.1/4.0
mixed version testing.
Support x-cc message annotation
Support an `x-cc` message annotation in AMQP 1.0
similar to the [CC](https://www.rabbitmq.com/docs/sender-selected) header in AMQP 0.9.1.
The value of the `x-cc` message annotation must be a list of strings.
A message annotation is used since application properties allow only simple types.
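A hedged publishing sketch using the `amqp10_client` Erlang library (the tagged-value encoding of the list is an assumption):
```
%% Attach an x-cc message annotation carrying two extra routing keys.
Msg0 = amqp10_msg:new(<<"dtag-1">>, <<"payload">>, false),
Msg = amqp10_msg:set_message_annotations(
        #{<<"x-cc">> => {list, [{utf8, <<"key-1">>}, {utf8, <<"key-2">>}]}},
        Msg0),
ok = amqp10_client:send_msg(Sender, Msg),
```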
In order to troubleshoot the flake described in
https://github.com/rabbitmq/rabbitmq-server/issues/12413#issuecomment-2419293869
```
Node: rabbit_shard2@localhost
Case: amqp_system_SUITE:access_failure
Reason: {error,{{badmatch,{error,134,
"Unhandled exception. System.Exception: expected exception not received\n
at Program.Test.accessFailure(String uri) in /home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbit/test/amqp_system_SUITE_data/fsharp-tests/Program.fs:line 477\n
at Program.main(String[] argv) in /home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbit/test/amqp_system_SUITE_data/fsharp-tests/Program.fs:line 509\n"}},
[{amqp_system_SUITE,run_dotnet_test,2,
[{file,"amqp_system_SUITE.erl"},
{line,257}]},
```