`rabbitmq_management` is missing one suite definition and `rabbit_mqtt`
is missing two. `assert_suites` causes a build failure because of the
missing suites. This change comments out `assert_suites` for these apps
instead of adding the missing suite definitions because Bazel is no
longer used to test these apps.
## What?
Implement the AMQP over WebSocket Binding Committee Specification 01 in
the AMQP 1.0 Erlang client:
https://docs.oasis-open.org/amqp-bindmap/amqp-wsb/v1.0/cs01/amqp-wsb-v1.0-cs01.html
## Why?
1. This allows writing integration tests for the server implementation
of AMQP over WebSocket.
2. Erlang and Elixir clients can use AMQP over WebSocket in environments
where firewalls prohibit access to the AMQP port.
## How?
Use Gun as the WebSocket client.
The new module `amqp10_client_socket` handles socket operations (open, close, send) for:
* TCP sockets
* SSL sockets
* WebSockets
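A hedged sketch of how that dispatch could look (the tuple shapes and function name are illustrative, not the module's actual API):
```
%% Illustrative only: one send/2 that dispatches on the socket flavour.
send({tcp, Sock}, Data) ->
    gen_tcp:send(Sock, Data);
send({ssl, Sock}, Data) ->
    ssl:send(Sock, Data);
send({ws, ConnPid, StreamRef}, Data) ->
    %% Gun 2.x WebSocket send: the payload travels as one binary frame.
    gun:ws_send(ConnPid, StreamRef, {binary, Data}).
```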
Prior to this commit, the amqp10_client_connection process closed only the
write end of the socket after it sent the AMQP close performative.
This commit removes the premature socket closure because:
1. There is no equivalent feature provided in Gun since sending a
WebSocket close frame causes Gun to cleanly close the connection for
both writing and reading.
2. It's unnecessary and can result in unexpected and confusing behaviour on the server.
3. It's better practice to keep the TCP connection fully open until
the AMQP closing handshake completes.
4. When amqp10_client_frame_reader terminates, it will cleanly close
the socket for both writing and reading.
from rabbit_fifo version 0.
The same was also implemented for the stream coordinator.
QQ: avoid deadlock in queue federation.
When processing the queue federation startup event, the process
may call back into the ra process, causing a deadlock. In this
case we spawn a temporary process to avoid this, as sketched below.
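A minimal sketch of the workaround, with a generic callback standing in for the federation startup handling:
```
%% The ra server must not run code that can gen_statem-call back into
%% itself; running the callback in a throwaway process breaks the cycle.
handle_startup_event(Callback, Event) when is_function(Callback, 1) ->
    _Pid = spawn(fun() -> Callback(Event) end),
    ok.
```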
This offloads the work of reading messages from on-disk segments
to the interacting process, rather than doing this blocking,
performance-affecting work in the ra server process.
QQ: ensure opened segments are closed after some time of inactivity
Processes that have received messages that had to be read from disk
may keep a segment open indefinitely. This introduces a timer which,
after some time of inactivity, closes all opened segments to ensure
file descriptors are not kept open indefinitely.
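A minimal sketch of the mechanism, assuming a gen_server-style process whose state caches open segment file handles (record, message and timeout are illustrative):
```
-record(state, {segments = #{} :: #{file:filename() => file:fd()},
                idle_tref :: undefined | reference()}).

-define(SEGMENT_IDLE_MS, 30000).

%% Re-arm the inactivity timer whenever a segment is read.
bump_idle_timer(#state{idle_tref = Old} = State) ->
    _ = case Old of
            undefined -> ok;
            _         -> erlang:cancel_timer(Old)
        end,
    Ref = erlang:send_after(?SEGMENT_IDLE_MS, self(), close_idle_segments),
    State#state{idle_tref = Ref}.

%% When the timer fires with no intervening reads, close every handle.
handle_info(close_idle_segments, #state{segments = Segs} = State) ->
    maps:foreach(fun(_File, Fd) -> _ = file:close(Fd) end, Segs),
    {noreply, State#state{segments = #{}, idle_tref = undefined}}.
```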
[Why]
When running mixed-version tests, nodes 1/3/5/... are using the primary
umbrella, so usually the newest version. Nodes 2/4/6/... are using the
secondary umbrella, thus the old version.
When clustering, we used to use node 1 (running a new version) as the
seed node, meaning other nodes would join it.
This complicates things with feature flags because we have to make sure
that we start node 1 with new stable feature flags disabled to allow old
nodes to join.
This is also a problem with Khepri machine versions because the cluster
would start with the latest version, which old nodes might not have.
[How]
This patch changes the logic to use a node running the secondary
umbrella as the seed node instead. If there is no node running it, we
pick the first node as before.
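A minimal sketch of the selection rule (the node/config shapes are illustrative):
```
%% Prefer a node running the secondary (old) umbrella as the seed;
%% otherwise fall back to the first node, as before.
pick_seed_node(Nodes) ->
    case [N || {N, secondary} <- Nodes] of
        [OldNode | _] -> OldNode;
        []            -> {Seed, _Umbrella} = hd(Nodes),
                         Seed
    end.
```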
V2: Revert part of "rabbitmq_ct_helpers: Fix how we set
`$RABBITMQ_FEATURE_FLAGS` in tests" (commit
57ed962ef6). These changes are no
longer needed with the new logic.
V3: The check that verifies that the correct metadata store is used has
a special case for nodes that use the secondary umbrella: if Khepri
is supposed to be used but it's not, the feature flag is enabled.
The reason is that the `v4.0.x` branch doesn't know about the `rel`
configuration of `forced_feature_flags_on_init`. The nodes will
have ignored this parameter and booted with the stable feature
flags only.
Many testsuites are adapted to the new clustering order. If they
manage which node joins which node, either the order is changed in
the testcases, or nodes are started with only required feature
flags. For testsuites that rely on peer discovery where the order is
unknown, nodes are started with only required feature flags.
[How]
1. Use feature flags correctly: the code shouldn't test if a feature
flag is enabled, assuming something else enabled it. It should enable
it and react to an error.
2. Use `close_connection_sync/1` instead of the asynchronous
`amqp10_client:close_connection/1` to make sure they are really
closed. The wait in `end_per_testcase/2` was not enough apparently.
3. For the two testcases that flake the most for me, enclose the code in
a try/after and make sure to close the connection at the end,
regardless of the result (see the sketch below). This should be done
for all testcases, because the test group uses a single set of RabbitMQ
nodes for all testcases, so testcases are supposed to clean up after
themselves.
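A sketch of that pattern; `connection_config/1`, `run_testcase_body/1` and `close_connection_sync/1` stand in for the suite's own helpers:
```
some_testcase(Config) ->
    OpnConf = connection_config(Config),
    {ok, Connection} = amqp10_client:open_connection(OpnConf),
    try
        run_testcase_body(Connection)
    after
        %% always runs, even if the assertions above fail
        ok = close_connection_sync(Connection)
    end.
```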
This commit is no change in functionality and mostly deletes dead code.
1. Code targeting Erlang 22 and below is deleted since the minimum
required Erlang version is higher nowadays.
"In OTP 23 distribution flag DFLAG_BIG_CREATION became mandatory. All
pids are now encoded using NEW_PID_EXT, even external pids received
as PID_EXT from older nodes."
https://www.erlang.org/doc/apps/erts/erl_ext_dist.html#new_pid_ext
2. All v1 encoding and decoding of the Pid is deleted since lower-version
RabbitMQ nodes support the v2 encoding nowadays.
Exits with reason "killed" only occur "naturally" in OTP
when a supervisor tries to shut a child down and it times out.
However, it is used for failure simulation in tests quite frequently.
When a leader changes, all enqueuer and consumer processes are notified
from the `state_enter(leader, ...)` callback. However, a new leader may not
yet have applied all commands that the old leader had. If any of those
commands is a checkout or a register_enqueuer command, these processes
will not be notified of the new leader and thus may never resend their
pending commands.
The new leader will, however, send an applied notification when it does
apply these entries, and these notifications are always sent from the
leader process, so they can also be used to trigger resends of pending
commands. This commit implements that, as sketched below.
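A hedged sketch of the client-side handling (record and helpers are illustrative; the real logic lives in the queue client):
```
-record(cstate, {leader :: term() | undefined,
                 pending = #{} :: #{non_neg_integer() => term()}}).

%% Applied notifications are always sent by the leader that applied the
%% entries, so their sender doubles as a leader hint: if it differs from
%% the stored leader, adopt it and resend pending commands.
handle_applied(FromLeader, #cstate{leader = FromLeader} = State) ->
    State;
handle_applied(FromLeader, #cstate{pending = Pending} = State) ->
    lists:foreach(fun(Cmd) -> _ = ra:pipeline_command(FromLeader, Cmd) end,
                  maps:values(Pending)),
    State#cstate{leader = FromLeader}.
```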
## What?
This commit fixes #13040.
Prior to this commit, exchange federation crashed if the MQTT topic exchange
(`amq.topic` by default) got federated and MQTT 5.0 clients subscribed on the
downstream. That's because the federation plugin sends bindings from downstream
to upstream via AMQP 0.9.1. However, binding arguments containing the Erlang
record `mqtt_subscription_opts` (henceforth binding args v1) cannot be encoded in AMQP 0.9.1.
## Why?
Federating the MQTT topic exchange could be useful for warm standby use cases.
## How?
This commit makes binding arguments a valid AMQP 0.9.1 table (henceforth
binding args v2).
Binding args v2 can only be used if all nodes support it. Hence binding
args v2 comes with feature flag `rabbitmq_4.1.0`. Note that the AMQP
over WebSocket
[PR](https://github.com/rabbitmq/rabbitmq-server/pull/13071) already
introduces this same feature flag. Although the feature flag subsystem
allows plugins to define their own feature flags, and the MQTT plugin
defined its own feature flags in the past, reusing feature flag
`rabbitmq_4.1.0` is simpler.
This commit also avoids database migrations for both Mnesia and Khepri
if feature flag `rabbitmq_4.1.0` gets enabled. Instead, it's simpler to
migrate binding args v1 to binding args v2 at MQTT connection establishment
time if the feature flag is enabled. (If the feature flag is disabled at
connection establishment time, but gets enabled during the connection
lifetime, the connection keeps using binding args v1.)
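A hedged sketch of the v1 → v2 shape change (record fields and table keys are illustrative):
```
-record(mqtt_subscription_opts, {qos, no_local, retain_as_published}).

%% v1 stored the record itself in the binding args, which AMQP 0.9.1
%% cannot encode; v2 flattens it into a plain AMQP 0.9.1 table.
opts_to_binding_args_v2(#mqtt_subscription_opts{qos = Qos,
                                                no_local = NoLocal,
                                                retain_as_published = Rap}) ->
    [{<<"x-mqtt-qos">>, unsignedbyte, Qos},
     {<<"x-mqtt-no-local">>, bool, NoLocal},
     {<<"x-mqtt-retain-as-published">>, bool, Rap}].
```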
This commit adds two new suites:
1. `federation_SUITE` which tests that federating the MQTT topic
exchange works, and
2. `feature_flag_SUITE` which tests the binding args migration from v1 to v2.
Visualise busy links from publisher to RabbitMQ. If the link credit
reaches 0, we set a yellow background colour in the cell.
Note that these credit values can change many times per second while the
management UI refreshes only every few seconds. However, it may still
give a user an idea of what links are currently busy.
We use yellow since that's consistent with the `flow` state in AMQP
0.9.1, which is also set to yellow.
We do not want to highlight **outgoing** links with credit 0 as
that might be a paused consumer, and therefore not a busy link.
We also use a yellow background colour if incoming-window is 0 (in case of
a cluster-wide memory or disk alarm) or if remote-incoming-window is 0,
as consumers should try to keep their incoming-window open and instead
use link credit if they want to pause consumption.
Additionally, we set a grey background colour for the `/management`
address to highlight these slightly, since they are "special" link
pairs.
msg_store_io_batch_size is no longer used
msg_store_credit_disc_bound appears to be used in the code, but I don't
see any impact of that value on performance. It should be properly
investigated and either removed completely or fixed, because there's
hardly any point in warning about the configured values
(plus, this setting is hopefully almost never used anyway).
According to the `rabbit_backing_queue` behaviour it must always
return `ok`, but it used to return a list of results, one for each
priority. That caused the crash below further up the call chain.
```
> rabbit_classic_queue:delete_crashed(Q)
** exception error: no case clause matching [ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok]
in function rabbit_classic_queue:delete_crashed/2 (rabbit_classic_queue.erl, line 516)
```
Other backing_queue implementations (`rabbit_variable_queue`) just
exit with a badmatch upon error.
This (very minor) issue has been present since 3.13.0, when
`rabbit_classic_queue:delete_crashed_in_backing_queue/1` was
introduced with Khepri in commit 5f0981c5. Before that, the result of
`BQ:delete_crashed/1` was simply ignored.
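A hedged sketch of the fix (`delete_crashed_one/1` is a stand-in for the per-priority delete):
```
%% Run the delete for each priority sub-queue, assert that every result
%% is `ok`, and return the single `ok` that rabbit_backing_queue demands.
delete_crashed(SubQueues) ->
    Results = [delete_crashed_one(Q) || Q <- SubQueues],
    true = lists:all(fun(R) -> R =:= ok end, Results),
    ok.
```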
Include monitored session pids in format_status/1 of rabbit_amqp_writer.
They could be useful when debugging.
The maximum number of sessions per connection is limited, hence the
output won't be too large.
[Why]
In order to make `khepri_db` the default in the future, the handling of
`$RABBITMQ_FEATURE_FLAGS` had to be adapted to be able to *disable*
Khepri instead.
Unfortunately I broke the behavior with stable feature flags that are
only available in the primary umbrella. In this case, they were
automatically enabled and thus, clustering with an old umbrella that did
not have these feature flags failed with `incompatible_feature_flags`.
[How]
The solution is to always use an absolute list of feature flags, not the
new relative list.
V2: Allow a testsuite to skip the configuration of the metadata store.
This is needed for the feature_flags_SUITE testsuite because it
tests the default behavior and the configuration of the metadata
store changes that behavior.
While here, fix a ct log message where variables were swapped
compared to the format string expectation.
V3: Enable `rabbitmq_4.0.0` feature flag in rabbit_mgmt_http_SUITE. This
testsuite apparently requires it and if it's not enabled, it fails.
The connection cannot return some information while initializing, so we
just return no information.
The CLI info call was supported only in the open gen_statem callback, so
such a call during connection init would make the connection crash. This can
happen when several stream connections get closed and the user calls
list_stream_consumers or list_stream_connections while the connections
are recovering.
This commit adds a clause for CLI info calls in all the gen_statem
callbacks and returns actual information only when appropriate, as
sketched below.
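A sketch in `handle_event` style (the real module may use state functions; `infos/2` stands in for the existing info-gathering code):
```
%% Reply to CLI info calls in every state; only an open connection has
%% real data to report.
handle_event({call, From}, {info, Items}, open, Data) ->
    {keep_state_and_data, [{reply, From, infos(Items, Data)}]};
handle_event({call, From}, {info, _Items}, _OtherState, _Data) ->
    %% still initializing, closing or recovering: report nothing
    {keep_state_and_data, [{reply, From, []}]}.
```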
Without this change, consumers using protocols other than the stream
protocol would display as inactive in the Management UI/API and CLI
commands, even though they were receiving messages.
This follows the decision that was made for
'rabbitmq-diagnostics node_health_check', which
is a no-op as of 4.0.0 following a few years of
deprecation.
The justification is very similar:
1. There is no such thing as "One True Health Check".
A single health check is too coarse-grained to
explain what specifically is not right about
cluster state
2. Individual fine-grained health checks have been
available for a few years now, see
https://www.rabbitmq.com/docs/monitoring#health-checks
3. This particular check tests something that
effectively never fails, based on my 14+
years of RabbitMQ contributions and user support
of all shapes and forms
4. This check uses a deprecated feature: non-exclusive
non-durable/transient classic queues
If something about this health check is worth
preserving, we can always add a new one
under GET /api/health/checks/*
Closes #13047.
Accidental "fat finger" virtual deletion accidents
would be easier to avoid if there was a protection mechanism
that would apply equally even to CLI tools and external
applications that do not use confirmations for deletion
operations.
This introduces the following changes:
* Virtual host metadata now supports a new key,
'protected_from_deletion', which, when set,
will be considered by key virtual host deletion function(s)
(see the sketch below)
* DELETE /api/vhosts/{name} was adapted to handle
such blocked deletion attempts to respond with
a 412 Precondition Failed status
* 'rabbitmqctl list_vhosts' and 'rabbitmqctl delete_vhost'
were adapted accordingly
* DELETE /api/vhosts/{name}/deletion/protection
is a new endpoint that can be used to remove
the protective seal (the metadata key)
* POST /api/vhosts/{name}/deletion/protection
marks the virtual host as protected
In the case of the HTTP API, all operations on
virtual host metadata require administrative
privileges from the target user.
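A hedged sketch of the guard applied by the deletion function(s); the key name is shown as a binary, and the actual representation may differ:
```
%% Deletion proceeds only when the virtual host metadata does not carry
%% the protection marker.
is_protected_from_deletion(Metadata) when is_map(Metadata) ->
    maps:get(<<"protected_from_deletion">>, Metadata, false) =:= true.
```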
Other considerations:
* When a virtual host does not exist, the behavior
remains the same: the original, protection-unaware
code path is used to preserve backwards compatibility
References #12772.
The following scenario led to a channel crash:
1. Publish to a non-existing stream: `perf-test -y 0 -p -e amq.default -t direct -k stream`
2. Declare the stream: `rabbitmqadmin declare queue name=stream queue_type=stream`
There is no pid yet, so we got a function_clause with `none`
```
{function_clause,
[{osiris_writer,write,
[none,<0.877.0>,<<"<0.877.0>_-65ZKFz18ll5lau0phi7CsQ">>,1,
[[0,"Sp",[192,6,5,"B@@AC"]],
[0,"Sr",
[193,38,4,
[[[163,10,<<"x-exchange">>],[161,0,<<>>]],
[[163,13,<<"x-routing-key">>],[161,6,<<"stream">>]]]]],
[0,"Su",[160,12,[<<0,19,252,1,0,0,98,171,20,16,108,167>>]]]]],
[{file,"src/osiris_writer.erl"},{line,158}]},
{rabbit_stream_queue,deliver0,4,
[{file,"rabbit_stream_queue.erl"},{line,540}]},
{rabbit_stream_queue,'-deliver/3-fun-0-',4,
[{file,"rabbit_stream_queue.erl"},{line,526}]},
{lists,foldl,3,[{file,"lists.erl"},{line,2146}]},
{rabbit_queue_type,'-deliver0/4-fun-5-',5,
[{file,"rabbit_queue_type.erl"},{line,707}]},
{maps,fold_1,4,[{file,"maps.erl"},{line,860}]},
{rabbit_queue_type,deliver0,4,
[{file,"rabbit_queue_type.erl"},{line,704}]},
{rabbit_queue_type,deliver,4,
[{file,"rabbit_queue_type.erl"},{line,662}]}]}
```
Co-authored-by: Karl Nilsson <kjnilsson@gmail.com>
build(deps): bump org.springframework.boot:spring-boot-starter-parent from 3.4.0 to 3.4.1 in /deps/rabbitmq_auth_backend_http/examples/rabbitmq_auth_backend_spring_boot_kotlin
build(deps): bump org.springframework.boot:spring-boot-starter-parent from 3.4.0 to 3.4.1 in /deps/rabbitmq_auth_backend_http/examples/rabbitmq_auth_backend_spring_boot
[Why]
It was possible that testcases were executed before the etcd daemon was
ready, leading to test failures.
[How]
There was already a sanity check to verify that the etcd daemon was
working correctly, but it was itself a testcase.
This patch moves this code to the etcd start code to wait for it to be
ready.
This replaces the previous workaround of waiting for 2 seconds.
While here, log anything printed to stdout/stderr by etcd after it
exited.
Fixes #12981.
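A minimal sketch of such a readiness wait (the actual code lives in the testsuite's etcd start helper):
```
%% Poll the etcd client port until it accepts TCP connections, instead
%% of sleeping for a fixed delay.
wait_for_etcd(_Host, _Port, 0) ->
    {error, etcd_not_ready};
wait_for_etcd(Host, Port, Retries) ->
    case gen_tcp:connect(Host, Port, [binary], 1000) of
        {ok, Sock} ->
            ok = gen_tcp:close(Sock);
        {error, _Reason} ->
            timer:sleep(250),
            wait_for_etcd(Host, Port, Retries - 1)
    end.
```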
Running
```
make -C deps/rabbitmq_peer_discovery_etcd ct-system
```
on some macOS systems causes test failures because the client cannot
connect to etcd:
```
test failed to connect [localhost:2379] by <Gun Down> {down,
{shutdown,
econnrefused}}
```
The etcd log file didn't show any error message.
However, it did show that the etcd listener got started
only after the test case tried to connect.
This commit fixes the test failure.
A better solution would be to use the HTTP API or the etcdctl CLI to
poll the listener status. However, simply waiting for 2 seconds is good
enough for this test suite.
[Why]
Up-to RabbitMQ 3.13.x, there was a case where if:
1. you enabled a plugin
2. you enabled its feature flags
3. you disabled the plugin
4. you restarted a node (or upgraded it)
... the node could crash on startup because it had a feature flag marked
as enabled that it didn't know about:
```
error:{badmatch,#{feature_flags => ...
rabbit_ff_controller:-check_one_way_compatibility/2-fun-0-/3, line 514
lists:all_1/2, line 1520
rabbit_ff_controller:are_compatible/2, line 496
rabbit_ff_controller:check_node_compatibility_task1/4, line 437
rabbit_db_cluster:check_compatibility/1, line 376
```
This was "fixed" by the new way of keeping the registry in memory
(#10988) because it introduces a slight change of behavior. Indeed, the
old way walked through the `FeatureFlags` map and looked up the state in
the `FeatureStates` map to create the `is_enabled/1` function. The new
way just looks up the state in `FeatureStates`.
[How]
The new testcase succeeds on 4.0.x and `main`, but would fail on 3.13.x
with the aforementioned crash.
## Why?
To introduce AMQP over WebSocket, we will add gun to the Erlang AMQP
1.0 client. We want to add the latest version of gun for this new
feature. Since rabbitmq_peer_discovery_etcd depends on the outdated
eetcd 0.3.6 which in turn depends on the outdated gun 1.3.3, this commit
first upgrades eetcd and gun.
## How?
See https://github.com/zhongwencool/eetcd?tab=readme-ov-file#migration-from-eetcd-03x-to-04x
## Breaking Changes
This commit causes the following breaking change:
`rabbitmq.conf` settings
* `cluster_formation.etcd.ssl_options.fail_if_no_peer_cert`
* `cluster_formation.etcd.ssl_options.dh`
* `cluster_formation.etcd.ssl_options.dhfile`
are unsupported because they are not valid `ssl:tls_client_option()`.
See https://github.com/erlang/otp/issues/7497#issuecomment-1636012198
[Why]
The feature flag controller that is responsible for enabling a feature
flag may be on a node that doesn't know this feature flag. This is
supported, but there is a bug when it queries the callback definition for
that feature flag: it uses its own registry, which does not have anything
about this feature flag.
This leads to a crash because the `run_callback/5` function tries to use
the `undefined` atom returned by the registry as a map:
```
crasher:
initial call: rabbit_ff_controller:init/1
pid: <0.374.0>
registered_name: rabbit_ff_controller
exception error: bad map: undefined
in function rabbit_ff_controller:run_callback/5
in call from rabbit_ff_controller:do_enable/3 (rabbit_ff_controller.erl, line 1244)
in call from rabbit_ff_controller:update_feature_state_and_enable/2 (rabbit_ff_controller.erl, line 1180)
in call from rabbit_ff_controller:enable_with_registry_locked/2 (rabbit_ff_controller.erl, line 1050)
in call from rabbit_ff_controller:enable_many_locked/2 (rabbit_ff_controller.erl, line 991)
in call from rabbit_ff_controller:enable_many/2 (rabbit_ff_controller.erl, line 979)
in call from rabbit_ff_controller:updating_feature_flag_states/3 (rabbit_ff_controller.erl, line 307)
in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 3735)
```
[How]
The callback definition is now queried from the first node in the list
given as argument. For the common use case where all nodes know about a
feature flag, the first node is the local one, so there should be no
latency caused by the RPC.
See #12963.
[Why]
Once `khepri_db` is enabled by default, we need another way to disable it
to select Mnesia instead.
[How]
We use the new relative forced feature flags mechanism to indicate if we
want to explicitly enable or disable `khepri_db`. This way, we don't
touch other stable feature flags and only mess with Khepri.
However, this mechanism is not supported by RabbitMQ 4.0.x and older.
They will ignore the setting. Therefore, to make this work in
mixed-version testing, we set the `$RABBITMQ_FEATURE_FLAGS` variable for
the secondary umbrella. This part will go away once we test against
RabbitMQ 4.1.x as the secondary umbrella in the future.
At the end, we compare the effective metadata store to the expected one.
If they don't match, we skip the test.
While here, change `rjms_topic_selector_SUITE` to only choose Khepri
without specifying any feature flags.
Transient (i.e. `durable=false`) exchanges and queues are deprecated.
Khepri will store all entities durably.
(Even exclusive queues will be stored durably. Exclusive queues are
still deleted when the declaring connection is closed.)
Similar to how the RabbitMQ AMQP 1.0 Java client already disallows the
creation of transient exchanges and queues, this commit will prohibit
the declaration of transient exchanges and queues in the RabbitMQ
AMQP 1.0 Erlang client starting with RabbitMQ 4.1.
If handle_tick is called before the machine has finished the upgrade
process, it could receive an old overview format (stats tuple vs map).
Let's ignore it; the next handle_tick should be fine.
Unlikely to happen in production; detected on CI with a very low tick timeout.
Fixes #12933
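A hedged sketch of the guard (`do_handle_tick/3` stands in for the existing tick handling):
```
%% Overviews from a machine that hasn't completed its upgrade are still
%% tuples; skip those and let a later tick retry with the map format.
handle_tick(Ts, Overview, State) when is_map(Overview) ->
    do_handle_tick(Ts, Overview, State);
handle_tick(_Ts, _TupleOverview, State) ->
    State.
```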
The assumption that `x-last-death-*` annotations must have been set
whenever the `deaths` annotation is set was wrong.
Reproduction steps, Option 1:
1. In v3.13.7, dead letter a message from Q1 to Q2 (both can be classic queues).
2. Re-publish the message including its x-death header from Q2 back to Q1.
(RabbitMQ 3.13.7 will interpret this x-death header and set the deaths annotation.)
3. Upgrade to v4.0.4
4. Dead lettering the message from Q1 to Q2 will cause the following crash:
```
crasher:
initial call: rabbit_amqqueue_process:init/1
pid: <0.577.0>
registered_name: []
exception exit: {{badkey,<<"x-last-death-exchange">>},
[{mc,record_death,4,[{file,"mc.erl"},{line,410}]},
{rabbit_dead_letter,publish,5,
[{file,"rabbit_dead_letter.erl"},{line,38}]},
{rabbit_amqqueue_process,'-dead_letter_msgs/4-fun-0-',
7,
[{file,"rabbit_amqqueue_process.erl"},{line,1060}]},
{rabbit_variable_queue,'-ackfold/4-fun-0-',3,
[{file,"rabbit_variable_queue.erl"},{line,655}]},
{lists,foldl,3,[{file,"lists.erl"},{line,2146}]},
{rabbit_variable_queue,ackfold,4,
[{file,"rabbit_variable_queue.erl"},{line,652}]},
{rabbit_priority_queue,ackfold,4,
[{file,"rabbit_priority_queue.erl"},{line,309}]},
{rabbit_amqqueue_process,
'-dead_letter_rejected_msgs/3-fun-0-',5,
[{file,"rabbit_amqqueue_process.erl"},
{line,1038}]}]}
```
Reproduction steps, Option 2:
1. Run a 4.0.4 / 3.13.7 mixed version cluster where both queues Q1 and Q2
are hosted on the 4.0.4 node.
2. Send a message to Q1 which dead letters to Q2.
3. Re-publish a message with the x-death AMQP 0.9.1 header from Q2 to
Q1. However, this time make sure to publish to the 3.13.7 node which
forwards this message to Q1 on the 4.0.4 node.
4. Subsequently dead lettering this message from Q1 to Q2 (happening on
the 4.0.4 node) will also cause the crash.
The modified test case in this commit was able to repro this crash via
Option 2 in the mixed version cluster tests on the `v4.0.x` branch.
As the de-duplication plugin is the only adopter of the `is_duplicate`
callback, we now use a simpler signature.
When a message is deemed a duplicate, we discard it and re-route it to
the dead-letter exchange.
Signed-off-by: Matteo Cafasso <noxdafox@gmail.com>
(cherry picked from commit f93baa35cb)
The `is_duplicate` callback signature was changed in order to support both
the mirroring queues as well as the de-duplication ones.
As the mirroring queues are now deprecated and removed, we can fall
back to a simpler boolean return value.
Signed-off-by: Matteo Cafasso <noxdafox@gmail.com>
(cherry picked from commit c927446e17)
Prior to this commit, when the sending client overshot RabbitMQ's incoming-window
(which is allowed in the event of a cluster-wide memory or disk alarm),
and RabbitMQ sent a FLOW frame to the client, RabbitMQ sent a negative
incoming-window field in the FLOW frame causing the following crash in
the writer proc:
```
crasher:
initial call: rabbit_amqp_writer:init/1
pid: <0.19353.0>
registered_name: []
exception error: bad argument
in function iolist_size/1
called as iolist_size([<<112,0,0,23,120>>,
[82,-15],
<<"pÿÿÿü">>,<<"pÿÿÿÿ">>,67,
<<112,0,0,23,120>>,
"Rª",64,64,64,64])
*** argument 1: not an iodata term
in call from amqp10_binary_generator:generate1/1 (amqp10_binary_generator.erl, line 141)
in call from amqp10_binary_generator:generate1/1 (amqp10_binary_generator.erl, line 88)
in call from amqp10_binary_generator:generate/1 (amqp10_binary_generator.erl, line 79)
in call from rabbit_amqp_writer:assemble_frame/3 (rabbit_amqp_writer.erl, line 206)
in call from rabbit_amqp_writer:internal_send_command_async/3 (rabbit_amqp_writer.erl, line 189)
in call from rabbit_amqp_writer:handle_cast/2 (rabbit_amqp_writer.erl, line 110)
in call from gen_server:try_handle_cast/3 (gen_server.erl, line 1121)
```
This commit fixes this crash by maintaining a floor of zero for
incoming-window in the FLOW frame.
Fixes #12816
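A minimal sketch of the clamp (names are illustrative):
```
%% Never advertise a negative incoming-window, even when the client
%% overshot the window during a memory or disk alarm.
advertised_incoming_window(ConfiguredWindow, InFlightTransfers) ->
    max(0, ConfiguredWindow - InFlightTransfers).
```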
The credit_flow between a publishing AMQP 0.9.1 channel (or MQTT
connection) and (non-mirrored) classic queue processes was
unintentionally removed in 4.0 together with anything else related to
CQ mirroring.
By default we restore the 3.x behaviour for non-mirrored classic
queues. It is possible to disable flow control (the earlier 4.0.x
behaviour) with the new env `classic_queue_flow_control`. In 3.x this
was possible with the config `mirroring_flow_control`.
(cherry picked from commit d65bd7d07a)
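A sketch of opting back into the 4.0.x behaviour via `advanced.config`, assuming the setting lives in the `rabbit` application environment:
```
[
 {rabbit, [
   %% disable the restored publisher -> classic queue credit flow
   {classic_queue_flow_control, false}
 ]}
].
```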
This check is expected to succeed and the status is expected to be
printed to stdout rather than stderr. This change silences the status
output. The status text was printed mistakenly previously because we
captured stderr rather than stdout.
This previously emitted a warning because Elixir will rebind `this_node`
by default, so the `this_node` binding in the line above was unused.
(As opposed to Erlang which would treat this as a match - rejecting
the binding if `this_node` was not equal to the value being matched.)
The node needed to be adjusted as well - `node()` returned the ExUnit
runner's node while the command returned the remote node, which is
stored in the context under `opts.node`.
`rabbit_binary_generator:map_exception/3` will crash when there are
unicode characters in the `explanation` field of the `Reason#amqp_error`
parameter. The explanation string (list) is assumed to be ASCII, with
each character/member in the range of a byte. Any unicode characters
in the string will trigger a `badarg` crash of `list_to_binary/1` in
`rabbit_binary_generator:amqp_exception_explanation/2`.
An AMQP 0.9.1 shovel crash due to this was reported in
https://github.com/rabbitmq/rabbitmq-server/discussions/12874
When a queue used as a shovel source/destination does not exist, and its
name contains non-ASCII characters, the explanation of the amqp_error
will be like `no queue non_ascii_name_😍 in vhost /`. It will
subsequently crash and even affect the management console.
To fix this, `unicode:characters_to_binary/1` is used instead of
`list_to_binary/1`, and unicode-safe truncation of long explanations
with `io_lib:format/3`'s chars_limit replaces direct byte truncation.
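A minimal sketch of the safer conversion (the character limit shown is illustrative):
```
%% Build a UTF-8 binary and truncate by characters via chars_limit,
%% instead of list_to_binary/1 plus byte-based truncation.
format_explanation(Explanation) ->
    Limited = io_lib:format("~ts", [Explanation], [{chars_limit, 256}]),
    unicode:characters_to_binary(Limited).
```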
The `:io.format/2` call was originally passed a single-quote string
(i.e. a charlist in Elixir terminology) which emits a warning in more
recent Elixir versions:
warning: single-quoted strings represent charlists. Use ~c"" if you indeed want a charlist or use "" instead
└─ nofile:1:12
This warning would pop up a few times when using `make dialyze` within
a deps directory. To resolve it we can switch the quoting so that the
eval string is wrapped in single quotes (equivalent for shell since this
line doesn't use variables) and the format argument is wrapped in double
quotes. This uses a binary in Elixir instead, but that's OK because
`io:format/3`'s format parameter (type `io:format()`) may be an atom,
string, or binary.
This trick was copied from Makefile:49 which uses the same quoting.
[Why]
The test configuration was querying a network interface IP address based
on its name. However, the name, "eth0", is very specific to Linux. This
broke the test on other systems.
[How]
We still have to set an explicit `bind_addr` because Consul refuses to
start if the host has multiple private IPv4 addresses, as it is the case
in CI.
Therefore, we hard-code 127.0.0.1 as the IPv4 address to use, because it
is almost guaranteed to exist anywhere.
[Why]
Two reasons:
1. We need to set the correct feature flags on the test node we have to
start.
2. We can skip Mnesia- or Khepri-specific tests if they are marked.
[Why]
The `run-background-broker` target does not wait for the node to be ready,
leading to some transient errors in the testsuite.
[How]
The `start-background-broker` target does wait.
While here, export the value of `$(MAKE)`. Otherwise, nested uses of
make(1) may use the wrong make command.
[Why]
The code assumed that the transaction would always succeed. It was kind
of the case with Mnesia because it would throw an exception if it
failed.
Khepri returns an error instead. The code has to handle it. In
particular, we see timeouts in CI and before this patch, they caused a
crash because the list comprehension was asked to work on a tuple.
[How]
We now retry a few times for 10 seconds.
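A minimal sketch of such a retry loop; the interval and attempt count are illustrative (40 × 250 ms ≈ 10 seconds):
```
retry_tx(Fun) ->
    retry_tx(Fun, 40).

retry_tx(_Fun, 0) ->
    {error, timeout};
retry_tx(Fun, Retries) ->
    case Fun() of
        {ok, _} = Ok     -> Ok;
        {error, timeout} -> timer:sleep(250),
                            retry_tx(Fun, Retries - 1);
        {error, _} = Err -> Err
    end.
```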
[Why]
We pin a version of Horus even if we don't use it directly (it is a
dependency of Khepri). But currently, we can't update Khepri while still
needing the fix in Horus 0.3.1.
Horus 0.3.1 works around a crash in `cover` that mostly affects CI for
now.
This pinning will have to go away with the next update of Khepri.
[Why]
The `ra:member_add/3` call returns before the change is committed. This
is ok for that addition but any follow-up changes to the cluster might
be rejected with the `cluster_change_not_permitted` error.
[How]
Instead of changing other places to wait or retry their cluster
membership change, this patch waits for the current add to be applied
before proceeding and returning.
This fixes some transient failures in CI where such follow-up changes
are rejected and not retried, leaving the cluster in an unexpected state
for the testcase.
An example is
`quorum_queue_SUITE:force_shrink_member_to_current_member/1`.
This check fails on a virgin node, because the metadata store
is not yet ready to handle the query. However, a virgin
node by definition can't have any queues, so let's just return
false without asking.
This changes the line `openssl x509 -in path/to/cert.pem -nameopt RFC2253 -subject -noout` to put the `-in` parameter at the end of the line, so that it's easier to ^W the path and replace it with my own.
Tested that this works with OpenSSL 3.1.6 4 Jun 2024 (Library: OpenSSL 3.1.6 4 Jun 2024) and OpenSSL 3.3.0 9 Apr 2024 (Library: OpenSSL 3.3.0 9 Apr 2024) on an Ubuntu 22.04.4 container and macOS 14.7.1.
See discussion #12807 for details.
rabbit_peer_discovery:normalize/1 can be
changed to only return lists of nodes, but then
there are a number of core code paths that
treat a single node as a special "preselected"
value.
So let's keep that part and convert both
sets of nodes to lists before computing the
difference.
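A minimal sketch of that conversion (function names are illustrative):
```
%% A single node stays supported as a special "preselected" value, but
%% both sides become lists before computing the difference.
to_node_list(Node) when is_atom(Node) -> [Node];
to_node_list(Nodes) when is_list(Nodes) -> Nodes.

nodes_to_add(Discovered, Known) ->
    to_node_list(Discovered) -- to_node_list(Known).
```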
[Why]
In CI, we observe some timeouts in the Erlang distribution connections
between the temporary hidden node and the nodes it queries. This affects
peer discovery obviously.
[How]
We introduce some query retries to reduce the risk of an incomplete
query.
While here, we move the sorting of queried nodes from the
`query_node_props2/3` last clause (executed in the temporary hidden
node) to the function setting the temporary hidden node and asking for
these queries. This way the debug messages from that sorting are logged
by RabbitMQ out of the box.
[Why]
This impacts what is reported by the catch, because it also caught
exceptions emitted by code called later. An example is the assert
in the last clause of `query_node_props2/3`.
[Why]
This was the first solution put in place to prevent that the temporary
hidden node connects to the node that started it to write any printed
messages. Because of this, the nodes that the temporary hidden node
queried found out about the parent node and they opened an Erlang
distribution connection to it. This polluted the known nodes list.
However later, the temporary hidden node was started with the
`standard_io` connection option. This prevented the temporary hidden
node from knowing about the node that started it, solving the problem in
a cleaner way.
[How]
This commit garbage-collects that piece of code that is now useless. It
makes the query code way simpler to understand.
Parallel/sharding groups often fail to create certificates in CI.
Most likely this is related to the fact that they use the same directory
for certificates. This commit uses the shard/node name and a unique id
for each SSL certificate.
[Why]
That timer was started during boot and continued regardless of whether
`rabbit` was running or stopped.
This caused the reconciliation to crash if the `rabbit` app was stopped
before it ended, because it tried to access the database even though
it was stopped or even reset.
[How]
We just check if `rabbit` is running before running one reconciliation
and scheduling a new one.
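A minimal sketch of the guard (`reconcile/0` and `schedule_reconcile/0` stand in for the existing worker and timer re-arm):
```
%% Only touch the database and re-arm the timer while `rabbit` runs.
maybe_reconcile() ->
    case rabbit:is_running() of
        true ->
            ok = reconcile(),
            schedule_reconcile();
        false ->
            ok
    end.
```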
Empty proplists will be serialized to JSON as arrays,
which they arguably are, but HTTP API clients expect an
object regardless of collection size.
References #12552, #12699.
This undocumented key used to use a simple date-based
formula and used to help the support and core
teams.
Nodes no longer have the context to return
a correct response, so all we can do is drop this
key.
This fixes erlang_ls's header resolution. Previously it would confuse
the include_lib of the `khepri.hrl` from Khepri with this header in
the rabbit app.
This header is also specific to how rabbit uses Khepri so I think the
new name fits better.
rabbit:product_version/0 should not return
an 'undefined'.
However, a fallback to the base version is
a technique we already use in 'rabbitmq-diagnostics status',
so adopt the same trick.
The application is not always recompiled which causes tests to fail
because they cannot call `serial_number:usort/1`.
(cherry picked from commit 05a3733722)
Introduce a single place in the AMQP 1.0 Erlang client that infers the AMQP 1.0 type.
Erlang integers are inferred to be AMQP type `long` to avoid overflow surprises.
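A hedged sketch of the inference rules (the real mapping covers more types; the tagged tuples follow the client's AMQP 1.0 type notation):
```
%% Integers always map to the 64-bit AMQP `long` so that values near a
%% narrower type's boundary cannot overflow.
infer_amqp_type(V) when is_boolean(V) -> {boolean, V};
infer_amqp_type(V) when is_integer(V) -> {long, V};
infer_amqp_type(V) when is_float(V)   -> {double, V};
infer_amqp_type(V) when is_binary(V)  -> {binary, V}.
```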
We don't expect random bytes to be there in the current
version of the message store as we overwrite empty spaces
with zeroes when moving messages around.
We also don't expect messages to be false-flagged when
the broker is running, because it checks for message
validity in the index. Therefore, make sure message bodies
in the tests don't contain byte 255.
## What?
Prior to this commit, the `rabbitmq_event_exchange` plugin always
internally published AMQP 0.9.1 messages to the `amq.rabbitmq.event` topic exchange.
This commit allows users to configure the plugin to publish AMQP 1.0
messages instead.
## Why?
Prior to this commit, when an AMQP 1.0 client consumed events,
event properties that are lists were omitted. For example property
`client_properties` of event `connection.created` or property
`arguments` of event `queue.created` were omitted because of the following sequence:
1. The event exchange plugin listens for all kinds of internal events.
2. The event exchange plugin re-publishes all events as AMQP 0.9.1 messages to the event exchange.
3. Later, when an AMQP 1.0 client consumes this message, the broker must translate the message from AMQP 0.9.1 to AMQP 1.0.
4. This translation follows the rules outlined in https://www.rabbitmq.com/docs/conversions#amqpl-amqp
5. Specifically, in this table the row before the last one describes the rule we're hitting here. It says that if the AMQP 0.9.1
header value is not an `x-` prefixed header and its value is an array or table, then this header is not converted.
That's because AMQP 1.0 application-properties must be simple types as mandated in https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-messaging-v1.0-os.html#type-application-properties
## How?
The user can configure the plugin as follows to have the plugin
internally publish AMQP 1.0 messages:
```
event_exchange.protocol = amqp_1_0
```
To support complex types such as lists, the plugin sets all event
properties as AMQP 1.0 message-annotations. The plugin prefixes all message
annotation keys with `x-opt-` to comply with the AMQP 1.0 spec.
## Alternative Design
An alternative design would have been to format all event properties
e.g. as JSON within the message body. However, this breaks routing on
specific event property values via a headers exchange.
## Documentation
https://github.com/rabbitmq/rabbitmq-website/pull/2129
- Modified metric expression and legend format in State of distribution links
- Changed panel type from 'flant-statusmap-panel' to 'status-history' for Process state
- Updated metric expressions to include instance filtering with {instance="$node"}
for the following metrics:
- erlang_vm_statistics_run_queues_length
- erlang_vm_statistics_dirty_io_run_queue_length
- erlang_vm_statistics_dirty_cpu_run_queue_length
- Added 'DS_PROMETHEUS' as a templated data source variable
* MQTT: avoid an exception
when an AMQP 0-9-1 publisher publishes a message
that has expiration set.
Stack trace was contributed in #12707 by @rdsilio.
* mc_mqtt_SUITE test for #12707 #12710
* MQTT protocol_interop_SUITE: new test for #12710 #12707
* Simplify tests
---------
Co-authored-by: David Ansari <david.ansari@gmx.de>
In a mixed cluster environment,
'rabbitmq-diagnostics status' can hit a node
that does not return any node tags.
Be more defensive and handle such cases
by simply displaying "(none)" for such
values.
[Why]
Without this callback, the deprecated features subsystem can't report if
the feature is used or not.
This reduces the usefulness of the HTTP API endpoint or the CLI command
that help verify if a cluster is using deprecated features.
[How]
The callback counts transient non-exclusive queues and returns `true` if
there are one or more of them.
References #12619.
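A hedged sketch of the callback shape (`count_transient_nonexcl_queues/0` is a stand-in for the actual counting code):
```
%% The deprecated features subsystem calls this to learn whether the
%% feature is still used; any matching queue means "yes".
are_transient_nonexcl_queues_used(_Args) ->
    count_transient_nonexcl_queues() > 0.
```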
[Why]
The previous implementation bypassed the deprecated features subsystem.
It only cared about classic mirrored queues and called some
queue-related code directly to determine if this specific feature was
used.
[How]
The command code is simplified by calling the deprecated subsystem to
list used deprecated features instead.
References #12619.
It does not need to use the "worst case scenario"
default HTTP request body size limit that
is primarily necessary because definition imports
can be large (MiBs in size, for example).
Since exchange and queue names and routing keys
have limits of 255 bytes, and optional arguments
can practically be expected to be short, we
can lower the limit to < 10 KiB.