Commit Graph

3094 Commits

Author SHA1 Message Date
Michal Kuratczyk 68de3fdb77
Fix channel crash when publishing to a new stream (#12969)
The following scenario led to a channel crash:
1. Publish to a non-existing stream: `perf-test -y 0 -p -e amq.default -t direct -k stream`
2. Declare the stream: `rabbitmqadmin declare queue name=stream queue_type=stream`

There is no pid yet, so we got a function_clause with `none`
```
{function_clause,
   [{osiris_writer,write,
        [none,<0.877.0>,<<"<0.877.0>_-65ZKFz18ll5lau0phi7CsQ">>,1,
         [[0,"Sp",[192,6,5,"B@@AC"]],
          [0,"Sr",
           [193,38,4,
            [[[163,10,<<"x-exchange">>],[161,0,<<>>]],
             [[163,13,<<"x-routing-key">>],[161,6,<<"stream">>]]]]],
          [0,"Su",[160,12,[<<0,19,252,1,0,0,98,171,20,16,108,167>>]]]]],
        [{file,"src/osiris_writer.erl"},{line,158}]},
    {rabbit_stream_queue,deliver0,4,
        [{file,"rabbit_stream_queue.erl"},{line,540}]},
    {rabbit_stream_queue,'-deliver/3-fun-0-',4,
        [{file,"rabbit_stream_queue.erl"},{line,526}]},
    {lists,foldl,3,[{file,"lists.erl"},{line,2146}]},
    {rabbit_queue_type,'-deliver0/4-fun-5-',5,
        [{file,"rabbit_queue_type.erl"},{line,707}]},
    {maps,fold_1,4,[{file,"maps.erl"},{line,860}]},
    {rabbit_queue_type,deliver0,4,
        [{file,"rabbit_queue_type.erl"},{line,704}]},
    {rabbit_queue_type,deliver,4,
        [{file,"rabbit_queue_type.erl"},{line,662}]}]}
```

Co-authored-by: Karl Nilsson <kjnilsson@gmail.com>
2024-12-20 08:56:25 +01:00
Jean-Sébastien Pédron ea2c8db2d1
rabbit_feature_flags: Add testcase after issue #12963
[Why]
Up-to RabbitMQ 3.13.x, there was a case where if:
1. you enabled a plugin
2. you enabled its feature flags
3. you disabled the plugin
4. you restarted a node (or upgraded it)

... the node could crash on startup because it had a feature flag marked
as enabled that it didn't know about:

    error:{badmatch,#{feature_flags => ...

        rabbit_ff_controller:-check_one_way_compatibility/2-fun-0-/3, line 514
        lists:all_1/2, line 1520
        rabbit_ff_controller:are_compatible/2, line 496
        rabbit_ff_controller:check_node_compatibility_task1/4, line 437
        rabbit_db_cluster:check_compatibility/1, line 376

This was "fixed" by the new way of keeping the registry in memory
(#10988) because it introduces a slight change of behavior. Indeed, the
old way walked through the `FeatureFlags` map and looked up the state in
the `FeatureStates` map to create the `is_enabled/1` function. The new
way just looks up the state in `FeatureStates`.

[How]
The new testcase succeeds on 4.0.x and `main`, but would fail on 3.13.x
with the aforementionne crash.
2024-12-19 16:33:43 +01:00
Jean-Sébastien Pédron 3325def8eb
rabbit_feature_flags: Take callback definition from correct node
[Why]
The feature flag controller that is responsible for enabling a feature
flag may be on a node that doesn't know this feature flag. This is
supported by there is a bug when it queries the callback definition for
that feature flag: it uses its own registry which does not have anything
about this feature flag.

This leads to a crash because the `run_callback/5` funtion tries to use
the `undefined` atom returned by the registry as a map:

    crasher:
      initial call: rabbit_ff_controller:init/1
      pid: <0.374.0>
      registered_name: rabbit_ff_controller
      exception error: bad map: undefined
        in function  rabbit_ff_controller:run_callback/5
        in call from rabbit_ff_controller:do_enable/3 (rabbit_ff_controller.erl, line 1244)
        in call from rabbit_ff_controller:update_feature_state_and_enable/2 (rabbit_ff_controller.erl, line 1180)
        in call from rabbit_ff_controller:enable_with_registry_locked/2 (rabbit_ff_controller.erl, line 1050)
        in call from rabbit_ff_controller:enable_many_locked/2 (rabbit_ff_controller.erl, line 991)
        in call from rabbit_ff_controller:enable_many/2 (rabbit_ff_controller.erl, line 979)
        in call from rabbit_ff_controller:updating_feature_flag_states/3 (rabbit_ff_controller.erl, line 307)
        in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 3735)

[How]
The callback definition is now queried from the first node in the list
given as argument. For the common use case where all nodes know about a
feature flag, the first node is the local one, so there should be no
latency caused by the RPC.

See #12963.
2024-12-19 13:45:27 +01:00
Jean-Sébastien Pédron dbec429fba
rabbit_feature_flags: Fix function name in the controller
[Why]
`state_after_virtual_state()` meant nothing.

`state_after_virtual_reset()` was the name I had in mind.
2024-12-19 11:54:25 +01:00
Jean-Sébastien Pédron debe2a118c
rabbitmq_ct_helpers: Change how Mnesia/Khepri is selected
[Why]
Once `khepr_db` is enabled by default, we need another way to disable it
to select Mnesia instead.

[How]
We use the new relative forced feature flags mechanism to indicate if we
want to explicitly enable or disable `khepri_db`. This way, we don't
touch other stable feature flags and only mess with Khepri.

However, this mechanism is not supported by RabbitMQ 4.0.x and older.
They will ignore the setting. Therefore, to make this work in
mixed-version testing, we set the `$RABBITMQ_FEATURE_FLAGS` variable for
the secondary umbrella. This part will go away once we test against
RabbitMQ 4.1.x as the secondary umbrella in the future.

At the end, we compare the effective metadata store to the expected one.
If they don't match, we skip the test.

While here, change `rjms_topic_selector_SUITE` to only choose Khepri
without specifying any feature flags.
2024-12-17 09:56:54 +01:00
Michael Klishin 0db3d7b014
Merge pull request #12950 from rabbitmq/qq-handle-tick
Quorum queues: ignore handle_tick with an old overview format
2024-12-16 11:31:55 -05:00
Michael Klishin 62ce1c954a
Merge pull request #12948 from rabbitmq/fix-flakes
Test fixes for a few more CI flakes
2024-12-16 11:24:10 -05:00
Diana Parra Corbacho a97ec92785 Quorum queues: ignore handle_tick with an old overview format
If handle_tick is called before the machine has finished the upgrade
process, it could receive an old overview format (stats tuple vs map).
Let's ignore it and the next handle tick should be fine.

Unlikely to happen in production, detected on CI with a very low tick timeout
2024-12-16 15:39:39 +01:00
Diana Parra Corbacho fe7a141331 Test: Increase receive timeout in all rabbit test suites 2024-12-16 11:58:05 +01:00
GitHub 0d750769f9 bazel run gazelle 2024-12-14 04:02:32 +00:00
David Ansari b6027ece28 Fix dead lettering crash
Fixes #12933

The assumption that `x-last-death-*` annotations must have been set
whenever the `deaths` annotation is set was wrong.

Reproducation steps, Option 1:
1. In v3.13.7, dead letter a message from Q1 to Q2 (both can be classic queues).
2. Re-publish the message including its x-death header from Q2 back to Q1.
(RabbitMQ 3.13.7 will interpret this x-death header and set the deaths annotation.)
3. Upgrade to v4.0.4
4. Dead letter the message from Q1 to Q2 will cause the following crash:
```
crasher:
  initial call: rabbit_amqqueue_process:init/1
  pid: <0.577.0>
  registered_name: []
  exception exit: {{badkey,<<"x-last-death-exchange">>},
                   [{mc,record_death,4,[{file,"mc.erl"},{line,410}]},
                    {rabbit_dead_letter,publish,5,
                        [{file,"rabbit_dead_letter.erl"},{line,38}]},
                    {rabbit_amqqueue_process,'-dead_letter_msgs/4-fun-0-',
                        7,
                        [{file,"rabbit_amqqueue_process.erl"},{line,1060}]},
                    {rabbit_variable_queue,'-ackfold/4-fun-0-',3,
                        [{file,"rabbit_variable_queue.erl"},{line,655}]},
                    {lists,foldl,3,[{file,"lists.erl"},{line,2146}]},
                    {rabbit_variable_queue,ackfold,4,
                        [{file,"rabbit_variable_queue.erl"},{line,652}]},
                    {rabbit_priority_queue,ackfold,4,
                        [{file,"rabbit_priority_queue.erl"},{line,309}]},
                    {rabbit_amqqueue_process,
                        '-dead_letter_rejected_msgs/3-fun-0-',5,
                        [{file,"rabbit_amqqueue_process.erl"},
                         {line,1038}]}]}
```

Reproduction steps, Option 2:
1. Run a 4.0.4 / 3.13.7 mixed version cluster where both queues Q1 and Q2
   are hosted on the 4.0.4 node.
2. Send a message to Q1 which dead letters to Q2.
3. Re-publish a message with the x-death AMQP 0.9.1 header from Q2 to
   Q1. However, this time make sure to publish to the 3.13.7 node which
   forwards this message to Q1 on the 4.0.4 node.
4. Subsequently dead lettering this message from Q1 to Q2 (happening on
   the 4.0.4 node) will also cause the crash.

The modified test case in this commit was able to repro this crash via
Option 2 in the mixed version cluster tests on the `v4.0.x` branch.
2024-12-13 19:25:43 +01:00
Matteo Cafasso 8d7535e0b1
amqqueue_process: adopt new `is_duplicate` backing queue callback
As the de-duplication plugin is the only adopter of the `is_duplicate`
callback, we now use a simpler signature.

When a message is deemed duplicated, we discard it and re-route it to
dead letter exchange.

Signed-off-by: Matteo Cafasso <noxdafox@gmail.com>
(cherry picked from commit f93baa35cb)
2024-12-11 19:43:45 -05:00
Matteo Cafasso 6a6e760107
backing_queue: simplify `is_duplicate` callback signature
`is_duplicate` callback signature was changed in order to support both
the mirroring queues as well as the de-duplication ones.

As the mirroring queues are now deprecated and removed, we can fall
back to a simpler boolean as return value.

Signed-off-by: Matteo Cafasso <noxdafox@gmail.com>
(cherry picked from commit c927446e17)
2024-12-11 19:43:38 -05:00
David Ansari 9d8ae14e27 Use correct AMQP filter expression string modifier prefix
Section 4.1.1 of AMQP Filter Expressions Working Draft 09
defines `&` (ampersand) instead of `$` (dollar) as the string modifier prefix.
2024-12-11 16:48:56 +01:00
Michael Klishin b84483ab5c
Merge pull request #12907 from rabbitmq/rabbitmq-server-12906
By @gomoripeti: Restore credit_flow between AMQP 0.9.1 channel/MQTT connection -> CQ processes
2024-12-10 10:03:47 -05:00
David Ansari 0d34ef6047 Set a floor of zero for incoming-window
Prior to this commit, when the sending client overshot RabbitMQ's incoming-window
(which is allowed in the event of a cluster wide memory or disk alarm),
and RabbitMQ sent a FLOW frame to the client, RabbitMQ sent a negative
incoming-window field in the FLOW frame causing the following crash in
the writer proc:
```
crasher:
  initial call: rabbit_amqp_writer:init/1
  pid: <0.19353.0>
  registered_name: []
  exception error: bad argument
    in function  iolist_size/1
       called as iolist_size([<<112,0,0,23,120>>,
                              [82,-15],
                              <<"pÿÿÿü">>,<<"pÿÿÿÿ">>,67,
                              <<112,0,0,23,120>>,
                              "Rª",64,64,64,64])
       *** argument 1: not an iodata term
    in call from amqp10_binary_generator:generate1/1 (amqp10_binary_generator.erl, line 141)
    in call from amqp10_binary_generator:generate1/1 (amqp10_binary_generator.erl, line 88)
    in call from amqp10_binary_generator:generate/1 (amqp10_binary_generator.erl, line 79)
    in call from rabbit_amqp_writer:assemble_frame/3 (rabbit_amqp_writer.erl, line 206)
    in call from rabbit_amqp_writer:internal_send_command_async/3 (rabbit_amqp_writer.erl, line 189)
    in call from rabbit_amqp_writer:handle_cast/2 (rabbit_amqp_writer.erl, line 110)
    in call from gen_server:try_handle_cast/3 (gen_server.erl, line 1121)
```

This commit fixes this crash by maintaning a floor of zero for
incoming-window in the FLOW frame.

Fixes #12816
2024-12-10 09:39:21 +01:00
Péter Gömöri 2c1f1a1387
Restore credit_flow between channel/MQTT connection -> CQ processes
The credit_flow between publishing AMQP 0.9.1 channel (or MQTT
connection) and (non-mirrored) classic queue processes was
unintentionally removed in 4.0 together with anything else related to
CQ mirroring.

By default we restore the 3.x behaviour for non-mirored classic
queues. It is possible to disable flow-control (the earlier 4.0.x
behaviour) with the new env `classic_queue_flow_control`. In 3.x this
was possible with the config `mirroring_flow_control`.

(cherry picked from commit d65bd7d07a)
2024-12-09 22:33:47 -05:00
Jean-Sébastien Pédron 56f90a51a9
rabbit_db: Return error from `force_boot_command_test/0` with Khepri 2024-12-02 13:33:08 +01:00
Jean-Sébastien Pédron df9882417c
rabbit_khepri: Report no partitions from `cli_cluster_status/0` 2024-11-29 16:53:55 +01:00
Jean-Sébastien Pédron 4621fe7730
mirrored_supervisor: Catch timeout from Khepri in `hanlde_info/2`
[Why]
The code assumed that the transaction would always succeed. It was kind
of the case with Mnesia because it would throw an exception if it
failed.

Khepri returns an error instead. The code has to handle it. In
particular, we see timeouts in CI and before this patch, they caused a
crash because the list comprehension was asked to work on a tuple.

[How]
We now retry a few times for 10 seconds.
2024-11-29 12:03:59 +01:00
Jean-Sébastien Pédron 913bd9fa42
rabbit_db: Fix `rabbit_db_msup:update_all/2` spec
[Why]
It can return an error.
2024-11-29 12:03:35 +01:00
Jean-Sébastien Pédron ae9fbb7bd5
Pin Horus to 0.3.1 temporarily
[Why]
We pin a version of Horus even if we don't use it directly (it is a
dependency of Khepri). But currently, we can't update Khepri while still
needing the fix in Horus 0.3.1.

Horus 0.3.1 works around a crash in `cover` that mostly affects CI for
now.

This pinning will have to go away with the next update of Khepri.
2024-11-29 09:50:08 +01:00
Jean-Sébastien Pédron 99d8e90df3
rabbit_quorum_queue: Wait for member add in `add_member/4`
[Why]
The `ra:member_add/3` call returns before the change is committed. This
is ok for that addition but any follow-up changes to the cluster might
be rejected with the `cluster_change_not_permitted` error.

[How]
Instead of changing other places to wait or retry their cluster
membership change, this patch waits for the current add to be applied
before proceeding and returning.

This fixes some transient failures in CI where such follow-up changes
are rejected and not retried, leaving the cluster in an unexpected state
for the testcase.

An example is with
`quorum_queue_SUITE:force_shrink_member_to_current_member/1`
2024-11-28 11:27:40 +01:00
Michal Kuratczyk 46259b5a48
Fix invalid warning about transient queues being used
This fixes the issue where RabbitMQ would warn about
transient queues being used in a cluster with no transient queues.

Fixes https://github.com/rabbitmq/rabbitmq-server/issues/12802
2024-11-27 22:04:01 +01:00
Michal Kuratczyk 1552f89dd7
Skip are_transient_nonexcl_used check on virin node
This check fails on a virin node, because the metadata store
is not yet ready to handle the query. However, a virin
node by definition can't have any queues, so let's just return
false without asking.
2024-11-27 22:03:53 +01:00
Michael Klishin 1cae417dbf
Merge pull request #12821 from rabbitmq/rabbitmq-server-12776
Definition export: inject default queue type into virtual host metadata
2024-11-27 14:53:25 -05:00
Michael Klishin 8a5ea76fe4
Inject DQT into 'ctl export_definitions' 2024-11-27 12:29:48 -05:00
Diana Parra Corbacho d004d69200 Tests: feature_flags_v2_SUITE ignore peer:stop/1 return value 2024-11-27 15:45:58 +01:00
Michael Klishin 090d11818f
HTTP API tests for injected default queue type 2024-11-26 18:00:37 -05:00
Michael Klishin 51e6004840
Inject DQT into GET /api/definitions and /api/vhosts
References #12776
2024-11-26 02:04:30 -05:00
Jean-Sébastien Pédron f6314d06b3
rabbit_peer_discovery: Retry RPC calls
[Why]
In CI, we observe some timeouts in the Erlang distribution connections
between the temporary hidden node and the nodes it queries. This affects
peer discovery obviously.

[How]
We introduce some query retries to reduce the risk of an incomplete
query.

While here, we move the sorting of queried nodes from the
`query_node_props2/3` last clause (executed in the temporary hidden
node) to the function setting the temporary hidden node and asking for
these queries. This way the debug messages from that sorting are logged
by RabbitMQ out of the box.
2024-11-25 16:16:16 +01:00
Jean-Sébastien Pédron 4d4985f254
rabbit_peer_discovery: Fix non-tail-recursive `query_node_props2()`
[Why]
This impacts what is reported by the catch because it caught exceptions
emitted by code supposedly called later. An example is the assert
in `query_node_props2/3` last clause.
2024-11-25 16:16:15 +01:00
Jean-Sébastien Pédron 62f22a7655
rabbit_peer_discovery: Remove the use of group leader proxy
[Why]
This was the first solution put in place to prevent that the temporary
hidden node connects to the node that started it to write any printed
messages. Because of this, the nodes that the temporary hidden node
queried found out about the parent node and they opened an Erlang
distribution connection to it. This polluted the known nodes list.

However later, the temporary hidden node was started with the
`standard_io` connection option. This prevented the temporary hidden
node from knowing about the node that started it, solving the problem in
a cleaner way.

[How]
This commit garbage-collects that piece of code that is now useless. It
makes the query code way simpler to understand.
2024-11-25 16:16:12 +01:00
D Corbacho 1fa4fe2735
Merge pull request #12775 from rabbitmq/fix-flakes
Fixes for test flakes
2024-11-25 16:12:29 +01:00
Jean-Sébastien Pédron fe2061b13b
quorum_queue_member_reconciliation_SUITE: Improve `reset_nodes/2`
[How]
The function now accepts that the node to reset is already out of the
cluster. This avoids a mismatch exception for a situation that is ok.
2024-11-25 12:55:26 +01:00
Jean-Sébastien Pédron 03f9d36988
rabbit_vhosts: Don't reconcile vhosts if `rabbit` is stopped
[Why]
That timer was started during boot and continued regardless if `rabbit`
was running or stopped.

This caused the reconsiliation to crash if the `rabbit` app was stopped
before the it ended because it tried to access the database even though
it was stopped or even reset.

[How]
We just check if `rabbit` is running before running one reconciliation
and scheduling a new one.
2024-11-25 12:39:13 +01:00
Diana Parra Corbacho 73924ba08e Tests: amqp_client_SUITE delete all queues on end per testcase 2024-11-25 09:06:33 +01:00
Diana Parra Corbacho a35f56fdc2 Tests: amqp_filtex_SUITE wait for link attachment and longer timeouts 2024-11-25 09:06:32 +01:00
GitHub 4e8d0f3ac2 bazel run gazelle 2024-11-23 04:02:38 +00:00
Michael Davis c3c7675bda
rabbit_khepri: Add macros for path patterns 2024-11-22 11:21:11 -05:00
Michael Davis e8fb9b6889
rabbit: Move include/{khepri.hrl => rabbit_khepri.hrl}
This fixes erlang_ls's header resolution. Previously it would confuse
the include_lib of the `khepri.hrl` from Khepri with this header in
the rabbit app.

This header is also specific to how rabbit uses Khepri so I think the
new name fits better.
2024-11-22 11:21:11 -05:00
Michael Klishin ea58fb1b48
crashing_queues_SUITE: squash a compiler warning 2024-11-17 17:23:00 -05:00
Michael Klishin 9f026f7a4b
Merge pull request #12727 from rabbitmq/rabbitmq-server-12709
By @Ayanda-D: Ensure only alive leaders and followers when fetching QQ replica states
2024-11-15 13:53:41 -05:00
Ayanda Dube 53cc8f8f2b
Update unit_quorum_queue_SUITE to use temporary alive & registered
test queue processes (since we now check/return only alive members
when fetching replica states)

(cherry picked from commit ebc0387b81)
2024-11-15 12:49:55 -05:00
David Ansari 6e8b566323 Deduplicate AMQP type inference
Introduce a single place in the AMQP 1.0 Erlang client that infers the AMQP 1.0 type.

Erlang integers are inferred to be AMQP type `long` to avoid overflow surprises.
2024-11-15 17:40:36 +01:00
Jean-Sébastien Pédron 2938338182
rabbit_khepri: Do not hard-code `coordination`, use the constant instead 2024-11-15 16:41:16 +01:00
Jean-Sébastien Pédron 05717ccccf
rabbit_khepri: Remove serial file during reset 2024-11-15 16:40:50 +01:00
Jean-Sébastien Pédron e41d766b29
rabbit_khepri: Ensure RabbitMQ is stopped before resetting with Khepri 2024-11-15 16:40:45 +01:00
Jean-Sébastien Pédron 7e2e7b79f2
rabbit_feature_flags: Support relative setting in `forced_feature_flags_on_init`
[Why]
We already support that from the environment variable, it is easy to add
to the configuration setting.
2024-11-15 14:50:35 +01:00
Michael Klishin 3e509c9f30
Merge pull request #12714 from rabbitmq/amqp-event-exchange
Support publishing AMQP 1.0 to Event Exchange
2024-11-14 18:09:19 -05:00