As the de-duplication plugin is the only adopter of the `is_duplicate`
callback, we now use a simpler signature.
When a message is deemed a duplicate, we discard it and re-route it to
the dead letter exchange.
Signed-off-by: Matteo Cafasso <noxdafox@gmail.com>
(cherry picked from commit f93baa35cb)
`is_duplicate` callback signature was changed in order to support both
the mirroring queues as well as the de-duplication ones.
As the mirroring queues are now deprecated and removed, we can fall
back to a simpler boolean as return value.
Signed-off-by: Matteo Cafasso <noxdafox@gmail.com>
(cherry picked from commit c927446e17)
Prior to this commit, when the sending client overshot RabbitMQ's incoming-window
(which is allowed in the event of a cluster wide memory or disk alarm),
and RabbitMQ sent a FLOW frame to the client, RabbitMQ sent a negative
incoming-window field in the FLOW frame causing the following crash in
the writer proc:
```
crasher:
initial call: rabbit_amqp_writer:init/1
pid: <0.19353.0>
registered_name: []
exception error: bad argument
in function iolist_size/1
called as iolist_size([<<112,0,0,23,120>>,
[82,-15],
<<"pÿÿÿü">>,<<"pÿÿÿÿ">>,67,
<<112,0,0,23,120>>,
"Rª",64,64,64,64])
*** argument 1: not an iodata term
in call from amqp10_binary_generator:generate1/1 (amqp10_binary_generator.erl, line 141)
in call from amqp10_binary_generator:generate1/1 (amqp10_binary_generator.erl, line 88)
in call from amqp10_binary_generator:generate/1 (amqp10_binary_generator.erl, line 79)
in call from rabbit_amqp_writer:assemble_frame/3 (rabbit_amqp_writer.erl, line 206)
in call from rabbit_amqp_writer:internal_send_command_async/3 (rabbit_amqp_writer.erl, line 189)
in call from rabbit_amqp_writer:handle_cast/2 (rabbit_amqp_writer.erl, line 110)
in call from gen_server:try_handle_cast/3 (gen_server.erl, line 1121)
```
This commit fixes this crash by maintaining a floor of zero for
incoming-window in the FLOW frame.
Fixes #12816
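A minimal, hedged sketch of the clamping idea (not the actual rabbit_amqp_session code; module and variable names are illustrative):
```
%% Illustrative sketch only; names are hypothetical, not RabbitMQ internals.
-module(flow_window_sketch).
-export([incoming_window/2]).

%% DesiredWindow is the window the session wants to grant, InFlight is the
%% number of transfers already received. If the client overshot the window
%% (allowed during a memory or disk alarm), the difference would be negative,
%% so it is floored at zero before being encoded into the FLOW frame.
incoming_window(DesiredWindow, InFlight)
  when is_integer(DesiredWindow), is_integer(InFlight) ->
    max(0, DesiredWindow - InFlight).
```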
The credit_flow between publishing AMQP 0.9.1 channel (or MQTT
connection) and (non-mirrored) classic queue processes was
unintentionally removed in 4.0 together with anything else related to
CQ mirroring.
By default we restore the 3.x behaviour for non-mirrored classic
queues. It is possible to disable flow-control (the earlier 4.0.x
behaviour) with the new env `classic_queue_flow_control`. In 3.x this
was possible with the config `mirroring_flow_control`.
(cherry picked from commit d65bd7d07a)
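A hedged sketch of how the setting above could be applied via `advanced.config` (the placement under the `rabbit` application env is an assumption; only the setting name comes from the commit):
```
%% advanced.config sketch: disable classic queue flow control,
%% i.e. keep the earlier 4.0.x behaviour.
[
  {rabbit, [
    {classic_queue_flow_control, false}
  ]}
].
```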
This check is expected to succeed and the status is expected to be
printed to stdout rather than stderr. This change silences the status
output. The status text was printed mistakenly previously because we
captured stderr rather than stdout.
This previously emitted a warning because Elixir will rebind `this_node`
by default, so the `this_node` binding in the line above was unused.
(As opposed to Erlang which would treat this as a match - rejecting
the binding if `this_node` was not equal to the value being matched.)
The node needed to be adjusted as well - `node()` returned the ExUnit
runner's node while the command returned the remote node, which is
stored in the context under `opts.node`.
`rabbit_binary_generator:map_exception/3` will crash when there are
unicode characters in the `explanation` field of the `Reason#amqp_error`
parameter. The explanation string (a list) is assumed to be ASCII, with
each character/member in the range of a byte. Any unicode characters
in the string will trigger a `badarg` crash of `list_to_binary/1` in
`rabbit_binary_generator:amqp_exception_explanation/2`.
An AMQP 0.9.1 shovel crash caused by this was reported in
https://github.com/rabbitmq/rabbitmq-server/discussions/12874
When a queue used as a shovel source/destination does not exist and its
name contains non-ASCII characters, the explanation of the amqp_error
will be something like `no queue non_ascii_name_😍 in vhost /`. It will
subsequently crash and even affect the management console.
To fix this, `unicode:characters_to_binary/1` is used instead of
`list_to_binary/1`, and a unicode-safe truncation of long explanations
with `io_lib:format/3`'s chars_limit option replaces direct byte truncation.
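A minimal sketch of the unicode-safe formatting described above (module and function names are illustrative, not the actual `rabbit_binary_generator` code):
```
-module(explanation_sketch).
-export([format_explanation/2]).

%% Truncate by character count rather than by bytes, then build a UTF-8
%% binary; this never splits a multi-byte character.
format_explanation(Explanation, MaxChars) ->
    Truncated = io_lib:format("~ts", [Explanation], [{chars_limit, MaxChars}]),
    unicode:characters_to_binary(Truncated).
```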
The `:io.format/2` call was originally passed a single-quoted string
(i.e. a charlist in Elixir terminology) which emits a warning in more
recent Elixir versions:
warning: single-quoted strings represent charlists. Use ~c"" if you indeed want a charlist or use "" instead
└─ nofile:1:12
This warning would pop up a few times when using `make dialyze` within
a deps directory. To resolve it we can switch the quoting so that the
eval string is wrapped in single quotes (equivalent for shell since this
line doesn't use variables) and the format argument is wrapped in double
quotes. This uses a binary in Elixir instead, but that's ok because
`io:format/3`'s format parameter (of type `io:format()`) may be an atom,
a string, or a binary.
This trick was copied from Makefile:49 which uses the same quoting.
[Why]
The test configuration was querying a network interface IP address based
on its name. However, the name, "eth0", is very specific to Linux. This
broke the test on other systems.
[How]
We still have to set an explicit `bind_addr` because Consul refuses to
start if the host has multiple private IPv4 addresses, as it is the case
in CI.
Therefore, we hard-code 127.0.0.1 as the IPv4 address to use because it is
very likely to exist almost anywhere.
[Why]
Two reasons:
1. We need to set the correct feature flags on the test node we have to
start.
2. We can skip Mnesia- or Khepri-specific tests if they are marked.
[Why]
The `run-background-broker` target does not wait for the node to be ready,
leading to some transient errors in the testsuite.
[How]
The `start-background-broker` target does wait.
While here, export the value of `$(MAKE)`. Otherwise, nested uses of
make(1) may use the wrong make command.
[Why]
The code assumed that the transaction would always succeed. It was kind
of the case with Mnesia because it would throw an exception if it
failed.
Khepri returns an error instead. The code has to handle it. In
particular, we see timeouts in CI and before this patch, they caused a
crash because the list comprehension was asked to work on a tuple.
[How]
We now retry a few times for 10 seconds.
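A hedged sketch of the retry loop (the `{error, timeout}` shape and the timing are assumptions; the real code wraps the Khepri transaction):
```
-module(retry_sketch).
-export([retry_for_10s/1]).

%% Retry a fun that may time out, for roughly 10 seconds (20 x 500 ms),
%% returning whatever the fun returns on the last attempt.
retry_for_10s(Fun) ->
    retry(Fun, 20).

retry(Fun, 0) ->
    Fun();
retry(Fun, Retries) ->
    case Fun() of
        {error, timeout} ->
            timer:sleep(500),
            retry(Fun, Retries - 1);
        Result ->
            Result
    end.
```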
[Why]
We pin a version of Horus even if we don't use it directly (it is a
dependency of Khepri). But currently, we can't update Khepri while still
needing the fix in Horus 0.3.1.
Horus 0.3.1 works around a crash in `cover` that mostly affects CI for
now.
This pinning will have to go away with the next update of Khepri.
[Why]
The `ra:member_add/3` call returns before the change is committed. This
is ok for that addition but any follow-up changes to the cluster might
be rejected with the `cluster_change_not_permitted` error.
[How]
Instead of changing other places to wait or retry their cluster
membership change, this patch waits for the current add to be applied
before proceeding and returning.
This fixes some transient failures in CI where such follow-up changes
are rejected and not retried, leaving the cluster in an unexpected state
for the testcase.
An example is with
`quorum_queue_SUITE:force_shrink_member_to_current_member/1`
This check fails on a virgin node, because the metadata store
is not yet ready to handle the query. However, a virgin
node by definition can't have any queues, so let's just return
false without asking.
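A hedged sketch of the short-circuit (all names are hypothetical stand-ins for the real metadata store calls):
```
-module(virgin_check_sketch).
-export([has_queues/0]).

%% Stand-ins for the real metadata store calls; both names are made up.
is_virgin_node() -> false.
do_query() -> false.

%% A virgin node cannot have any queues, so answer false without
%% asking the metadata store.
has_queues() ->
    case is_virgin_node() of
        true  -> false;
        false -> do_query()
    end.
```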
This changes the line `openssl x509 -in path/to/cert.pem -nameopt RFC2253 -subject -noout` to put the `-in` parameter at the end of the line, so that it's easier to ^W the path and replace it with my own.
Tested that this works with OpenSSL 3.1.6 4 Jun 2024 (Library: OpenSSL 3.1.6 4 Jun 2024) and OpenSSL 3.3.0 9 Apr 2024 (Library: OpenSSL 3.3.0 9 Apr 2024) on an Ubuntu 22.04.4 container and MacOS 14.7.1
See discussion #12807 for details.
rabbit_peer_discovery:normalize/1 can be
changed to only return lists of nodes, but
there are a number of core code paths that
treat a single node as a special "preselected"
value.
So let's keep that part and convert both
sets of nodes to lists before computing the
difference.
[Why]
In CI, we observe some timeouts in the Erlang distribution connections
between the temporary hidden node and the nodes it queries. This affects
peer discovery obviously.
[How]
We introduce some query retries to reduce the risk of an incomplete
query.
While here, we move the sorting of queried nodes from the
`query_node_props2/3` last clause (executed in the temporary hidden
node) to the function setting the temporary hidden node and asking for
these queries. This way the debug messages from that sorting are logged
by RabbitMQ out of the box.
[Why]
This impacts what is reported by the catch because it caught exceptions
emitted by code supposedly called later. An example is the assert
in `query_node_props2/3` last clause.
[Why]
This was the first solution put in place to prevent the temporary
hidden node from connecting back to the node that started it in order to
write any printed messages. Because of this, the nodes that the temporary
hidden node queried found out about the parent node and opened an Erlang
distribution connection to it. This polluted the known nodes list.
However later, the temporary hidden node was started with the
`standard_io` connection option. This prevented the temporary hidden
node from knowing about the node that started it, solving the problem in
a cleaner way.
[How]
This commit garbage-collects that piece of code that is now useless. It
makes the query code way simpler to understand.
Parallel/sharding groups often fail to create certificates in CI.
Most likely it is related to the fact that they use the same directory
for certificates. This commit uses the shard/node name and a unique id
for each SSL certificate.
[Why]
That timer was started during boot and continued regardless of whether
`rabbit` was running or stopped.
This caused the reconciliation to crash if the `rabbit` app was stopped
before it ended, because it tried to access the database even though
it was stopped or even reset.
[How]
We just check if `rabbit` is running before running one reconciliation
and scheduling a new one.
Empty proplists will be serialized to JSON as arrays,
which they arguably are, but HTTP API clients
expect an object regardless of collection size.
References #12552, #12699
This undocumented key used to use a simple date-based
formula and used to help the support and core
teams.
Nodes no longer have the context to return
a correct response, so all we can do is drop this
key.
This fixes erlang_ls's header resolution. Previously it would confuse
the include_lib of the `khepri.hrl` from Khepri with this header in
the rabbit app.
This header is also specific to how rabbit uses Khepri so I think the
new name fits better.
rabbit:product_version/0 should not return
an 'undefined'.
However, a fallback to the base version is
a technique we already use in 'rabbitmq-diagnostics status',
so adopt the same trick.
The application is not always recompiled which causes tests to fail
because they cannot call `serial_number:usort/1`.
(cherry picked from commit 05a3733722)
Introduce a single place in the AMQP 1.0 Erlang client that infers the AMQP 1.0 type.
Erlang integers are inferred to be AMQP type `long` to avoid overflow surprises.
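A hedged sketch of the inference rule (illustrative only; the real client covers more types and may use a different tagged-value representation):
```
-module(amqp_type_inference_sketch).
-export([infer/1]).

%% Erlang terms map to tagged AMQP 1.0 values. Integers always map to the
%% 64-bit `long' type so values larger than 32 bits never overflow a
%% smaller integer type.
infer(V) when is_boolean(V) -> {boolean, V};
infer(V) when is_integer(V) -> {long, V};
infer(V) when is_float(V)   -> {double, V};
infer(V) when is_binary(V)  -> {utf8, V}.
```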
We don't expect random bytes to be there in the current
version of the message store as we overwrite empty spaces
with zeroes when moving messages around.
We also don't expect messages to be falsely flagged when
the broker is running because it checks for message
validity in the index. Therefore, make sure message bodies
in the tests don't contain the byte 255.
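A hedged sketch of how such a test payload can be generated (the helper name is made up):
```
-module(msg_store_test_payload_sketch).
-export([payload/1]).

%% Build a random binary of Size bytes whose values are 0..254, so the
%% byte 255 (the message store's OK marker) never appears in the body.
payload(Size) ->
    << <<(rand:uniform(255) - 1)>> || _ <- lists:seq(1, Size) >>.
```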
## What?
Prior to this commit, the `rabbitmq_event_exchange` always internally
published AMQP 0.9.1 messages to the `amq.rabbitmq.event` topic exchange.
This commit allows users to configure the plugin to publish AMQP 1.0
messages instead.
## Why?
Prior to this commit, when an AMQP 1.0 client consumed events,
event properties that are lists were omitted. For example property
`client_properties` of event `connection.created` or property
`arguments` of event `queue.created` were omitted because of the following sequence:
1. The event exchange plugin listens for all kinds of internal events.
2. The event exchange plugin re-publishes all events as AMQP 0.9.1 messages to the event exchange.
3. Later, when an AMQP 1.0 client consumes this message, the broker must translate the message from AMQP 0.9.1 to AMQP 1.0.
4. This translation follows the rules outlined in https://www.rabbitmq.com/docs/conversions#amqpl-amqp
5. Specifically, in this table the row before the last one describes the rule we're hitting here. It says that if the AMQP 0.9.1
header is not an `x-` prefixed header and its value is an array or table, then this header is not converted.
That's because AMQP 1.0 application-properties must be simple types as mandated in https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-messaging-v1.0-os.html#type-application-properties
## How?
The user can configure the plugin as follows to have the plugin
internally publish AMQP 1.0 messages:
```
event_exchange.protocol = amqp_1_0
```
To support complex types such as lists, the plugin sets all event
properties as AMQP 1.0 message-annotations. The plugin prefixes all message
annotation keys with `x-opt-` to comply with the AMQP 1.0 spec.
## Alternative Design
An alternative design would have been to format all event properties
e.g. as JSON within the message body. However, this breaks routing on
specific event property values via a headers exchange.
## Documentation
https://github.com/rabbitmq/rabbitmq-website/pull/2129
- Modified metric expression and legend format in State of distribution links
- Changed panel type from 'flant-statusmap-panel' to 'status-history' for Process state
- Updated metric expressions to include instance filtering with {instance=\"$node\"}
for the following metrics:
- erlang_vm_statistics_run_queues_length
- erlang_vm_statistics_dirty_io_run_queue_length
- erlang_vm_statistics_dirty_cpu_run_queue_length
- Added 'DS_PROMETHEUS' as a templated data source variable
* MQTT: avoid an exception
when an AMQP 0-9-1 publisher publishes a message
that has expiration set.
Stack trace was contributed in #12707 by @rdsilio.
* mc_mqtt_SUITE test for #12707, #12710
* MQTT protocol_interop_SUITE: new test for #12710, #12707
* Simplify tests
---------
Co-authored-by: David Ansari <david.ansari@gmx.de>
In a mixed cluster environment,
'rabbitmq-diagnostics status' can hit a node
that does not return any node tags.
Be more defensive and handle such cases
by simply displaying "(none)" for such
values.
[Why]
Without this callback, the deprecated features subsystem can't report if
the feature is used or not.
This reduces the usefulness of the HTTP API endpoint or the CLI command
that help verify if a cluster is using deprecated features.
[How]
The callback counts transient non-exclusive queues and returns `true` if
there are one or more of them.
References #12619.
[Why]
The previous implementation bypassed the deprecated features subsystem.
It only cared about classic mirrored queues and called some
queue-related code directly to determine if this specific feature was
used.
[How]
The command code is simplified by calling the deprecated subsystem to
list used deprecated features instead.
References #12619.
It does not need to use the "worst case scenario"
default HTTP request body size limit that
is primarily necessary because definition imports
can be large (MiBs in size, for example).
Since exchange names, queue names and routing keys
have a limit of 255 bytes, and optional arguments
can practically be expected to be short, we
can lower the limit to under 10 KiB.
This test flaked in CI with the following error:
```
=== === Reason: no match of right hand side value {error,half_attached}
in function amqp_utils:detach_link_sync/1 (amqp_utils.erl, line 100)
in call from amqp_filtex_SUITE:properties_section/1 (amqp_filtex_SUITE.erl, line 187)
in call from test_server:ts_tc/3 (test_server.erl, line 1793)
in call from test_server:run_test_case_eval1/6 (test_server.erl, line 1302)
in call from test_server:run_test_case_eval/9 (test_server.erl, line 1234)
```
Increase the wait for credit to be applied, as described in commit
aeedad7b51, since this test case still flakes rarely with:
```
=== === Reason: {assertEqual,[{module,amqp_client_SUITE},
{line,3030},
{expression,"amqp10_msg : body ( Msg1 )"},
{expected,[<<"1">>]},
{value,[<<"2">>]}]}
in function amqp_client_SUITE:detach_requeues_two_connections/2 (amqp_client_SUITE.erl, line 3030)
in call from test_server:ts_tc/3 (test_server.erl, line 1793)
in call from test_server:run_test_case_eval1/6 (test_server.erl, line 1302)
in call from test_server:run_test_case_eval/9 (test_server.erl, line 1234)
```
Prior to this commit, test
```
make -C deps/rabbitmq_mqtt ct-mqtt_shared t=[mqtt,cluster_size_1,v4]:non_clean_sess_reconnect_qos0_and_qos1
```
flaked in CI with error:
```
{mqtt_shared_SUITE,non_clean_sess_reconnect_qos0_and_qos1,972}
{badmatch,{publish_not_received,<<"msg-0">>}}
```
The problem was the following race condition:
* The MQTT v4 client sends an async DISCONNECT
* The global MQTT consumer metric got decremented. However, the classic
queue still has the MQTT connection proc registered as consumer.
* The test case sends a message
* The classic queue checks out the message to the old connection instead
of checking out the message to the new connection.
The solution in this commit is to check the consumer count of the
classic queue before proceeding to send the message after disconnection.
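A hedged sketch of the wait used to avoid the race (a generic polling helper; the real test obtains the count from the classic queue process):
```
-module(wait_helper_sketch).
-export([wait_for_consumer_count/3]).

%% Poll GetCountFun until it returns Expected, or give up after Retries
%% attempts. Used before publishing so the message cannot be checked out
%% to the already-disconnected MQTT consumer.
wait_for_consumer_count(_GetCountFun, _Expected, 0) ->
    error(timeout_waiting_for_consumer_count);
wait_for_consumer_count(GetCountFun, Expected, Retries) ->
    case GetCountFun() of
        Expected ->
            ok;
        _Other ->
            timer:sleep(100),
            wait_for_consumer_count(GetCountFun, Expected, Retries - 1)
    end.
```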
The stream manager does not need to be a gen_server (no cast, no state)
and the gen_server can create contention for large stream deployments
(some functions make cluster-wide calls that can take some time).
This commit fixes two different bugs/crashes.
To repro, prior to this commit:
1. Create an AMQP 1.0 connection on node-1.
2. Open the Management UI on node-2 and open the connection page of this
single AMQP 1.0 connection.
The first crash was the following:
```
[error] <0.1297.0> crasher:
[error] <0.1297.0> initial call: cowboy_stream_h:request_process/3
[error] <0.1297.0> pid: <0.1297.0>
[error] <0.1297.0> registered_name: []
[error] <0.1297.0> exception error: no case clause matching
[error] <0.1297.0> {badrpc,
[error] <0.1297.0> {'EXIT',
[error] <0.1297.0> {undef,
[error] <0.1297.0> [{rabbit_connection_tracking,lookup,
[error] <0.1297.0> [<<"[::1]:51729 -> [::1]:5672">>,
[error] <0.1297.0> ['rabbit-1@ABCDDDEEAA']],
[error] <0.1297.0> []}]}}}
[error] <0.1297.0> in function rabbit_connection_tracking:lookup/2 (rabbit_connection_tracking.erl, line 235)
[error] <0.1297.0> in call from rabbit_mgmt_wm_connection_sessions:conn/1 (rabbit_mgmt_wm_connection_sessions.erl, line 72)
[error] <0.1297.0> in call from rabbit_mgmt_wm_connection_sessions:is_authorized/2 (rabbit_mgmt_wm_connection_sessions.erl, line 63)
[error] <0.1297.0> in call from cowboy_rest:call/3 (src/cowboy_rest.erl, line 1590)
[error] <0.1297.0> in call from cowboy_rest:is_authorized/2 (src/cowboy_rest.erl, line 368)
[error] <0.1297.0> in call from cowboy_rest:upgrade/4 (src/cowboy_rest.erl, line 284)
[error] <0.1297.0> in call from cowboy_stream_h:execute/3 (src/cowboy_stream_h.erl, line 306)
[error] <0.1297.0> in call from cowboy_stream_h:request_process/3 (src/cowboy_stream_h.erl, line 295)
```
The second crash was the following:
```
[error] <0.1132.0> crasher:
[error] <0.1132.0> initial call: cowboy_stream_h:request_process/3
[error] <0.1132.0> pid: <0.1132.0>
[error] <0.1132.0> registered_name: []
[error] <0.1132.0> exception error: no case clause matching
[error] <0.1132.0> {tracked_connection,
[error] <0.1132.0> {'rabbit-1@ABCDDDEEAA',
[error] <0.1132.0> <<"[::1]:65505 -> [::1]:5672">>},
[error] <0.1132.0> 'rabbit-1@ABCDDDEEAA',<<"/">>,
[error] <0.1132.0> <<"[::1]:65505 -> [::1]:5672">>,<13661.1110.0>,
[error] <0.1132.0> {1,0},
[error] <0.1132.0> network,
[error] <0.1132.0> {0,0,0,0,0,0,0,1},
[error] <0.1132.0> 65505,<<"guest">>,1730908606089}
[error] <0.1132.0> in function rabbit_connection_tracking:lookup/2 (rabbit_connection_tracking.erl, line 235)
[error] <0.1132.0> in call from rabbit_mgmt_wm_connection_sessions:conn/1 (rabbit_mgmt_wm_connection_sessions.erl, line 72)
[error] <0.1132.0> in call from rabbit_mgmt_wm_connection_sessions:is_authorized/2 (rabbit_mgmt_wm_connection_sessions.erl, line 63)
[error] <0.1132.0> in call from cowboy_rest:call/3 (src/cowboy_rest.erl, line 1590)
[error] <0.1132.0> in call from cowboy_rest:is_authorized/2 (src/cowboy_rest.erl, line 368)
[error] <0.1132.0> in call from cowboy_rest:upgrade/4 (src/cowboy_rest.erl, line 284)
[error] <0.1132.0> in call from cowboy_stream_h:execute/3 (src/cowboy_stream_h.erl, line 306)
[error] <0.1132.0> in call from cowboy_stream_h:request_process/3 (src/cowboy_stream_h.erl, line 295)
```
## What?
On the connection page in the Management UI, display detailed session and
link information including:
* Link names
* Link target and source addresses
* Link flow control state
* Session flow control state
* Number of unconfirmed and unacknowledged messages
## How?
A new HTTP API endpoint is added:
```
/connections/:connection_name/sessions
```
The HTTP handler first queries the Erlang connection process to find out about
all session Pids. The handler then queries each Erlang session process
of this connection.
(The table auto-refreshes by default every 5 seconds. The handler querying a single
connection with 60 idle sessions, each with 250 links, takes ~100 ms.)
For better user experience in the Management UI, this commit also makes the
session process store and expose link names as well as source/target addresses.
[Why]
The "Feature flags" admin section had several issues:
* It was not designed for experimental feature flags. What was done for
RabbitMQ 4.0.0 was still unclear as to what a user should expect for
experimental feature flags.
* The UI uses synchronous requests from the browser main thread. It
means that for a feature flag that has a long running migration
callback, the browser tab could freeze for a very long time.
[How]
The feature flags table is reworked and now displays:
* a series of icons to highlight the following:
* a feature flag that has a migration function and thus that can
take time to be enabled
* a feature flag that is experimental
* whether this experimental feature flag is supported or not
* a toggle to quickly show if a feature flag is enabled or not and let
the user enable it at the same time.
For stable feature flags, when a user clicks on the toggle, the toggle
goes into an intermediate state while waiting for the response from the
broker. If the response is successful, the toggle is green. Otherwise it
goes back to red and the error is displayed in a popup as before.
For experimental feature flags, when a user clicks on the toggle, a popup
is displayed to let the user know of the possible constraints and
consequences, with one or two required checkboxes to tick so the user
confirms they understand the message. The feature flag is enabled only
after the user validates the popup. The displayed message and the
checkboxes depend on if the experimental feature flag is supported or
not (it is a new attribute of experimental feature flags).
The request to enable feature flags now uses the modern `fetch()` API.
Therefore it uses Javascript promises and does not block the main
thread: the UI remains responsive while a migration callback runs.
Finally, an "Enable all stable feature flags" button has been added to
the warning that tells the user some stable feature flags are still
disabled.
V2: Pause auto-refresh while a feature flag is being handled. This fixes
some display inconsistencies.
[Why]
The previous implementation was using the blocking `is_enabled/1` API.
This meant that if a feature flag was being enabled and the enable
callback took time, the CLI's `list_feature_flag` command or any use of
the management UI would block until the feature flag was enabled.
[How]
`get_state/1` now uses the non-blocking API. However it returns a new
possible value: `state_changing`.
[Why]
During the development of Khepri, it was difficult to communicate that
it was unsupported in RabbitMQ 3.13.x but was then supported in 4.0.x
even though it was still experimental.
[How]
The feature flag definition now exposes that support level in a new
attribute called `experiment_level`. It can be `unsupported` or
`supported`.
We can use this new attribute in the CLI or the web UI to convey the
level of support to the end user.
In the future, we could imagine that an experimental feature flag
becomes abandoned, where upgrading from a node that has it enabled to a
version that marks the feature flag as abandoned is not possible.
It is possible for a slow running follower with local consumers
to crash after a snapshot installation as it tries to read an entry
from its log that is no longer there (as it has been consumed and
completed by another node but still refers to prior consumers on the
current node).
This commit makes the log effect callback function more defensive
to check that the number of commands returned by the log effect
isn't different from what was requested. If it is different, we
consider this a stale read request and return no further effects.
Closes #9259.
## What?
Allow an AMQP 1.0 client to renew an OAuth 2.0 token before it expires.
## Why?
This allows clients to keep the AMQP connection open instead of having
to create a new connection whenever the token expires.
## How?
As explained in https://github.com/rabbitmq/rabbitmq-server/issues/9259#issuecomment-2437602040
the client can `PUT` a new token on HTTP API v2 path `/auth/tokens`.
RabbitMQ will then:
1. Store the new token on the given connection.
2. Recheck access to the connection's vhost.
3. Clear all permission caches in the AMQP sessions.
4. Recheck write permissions to exchanges for links publishing to
RabbitMQ, and recheck read permissions from queues for links
consuming from RabbitMQ. The latter complies with the user
expectation in #11364.
For example, if the first restarted node doesn't start,
don't try to restart the other nodes. This mimics what
orchestrators such as Kubernetes or BOSH would do
(although they perform this check differently)
[Why]
Before this patch, required feature flags were basically checked during
boot: they must have been enabled when they were mere stable feature
flags. If they were not, the node refused to boot.
This was easy for the developer because making a feature flag required
allowed them to remove the entire compatibility code. Very satisfying.
Unfortunately, this was a pain point to end users, especially those who
did not pay attention to RabbitMQ and the release notes and were just
asking their package manager to update everything. They could end up
with a node that refuses to boot. The only solution was to downgrade,
enable the disabled stable feature flags, and upgrade again.
[How]
This patch introduces two levels of requirement to required feature
flags:
* `hard`: this corresponds to the existing behavior where a node will
refuse to boot if a hard required feature flag is not enabled before
the upgrade.
* `soft`: such a required feature flag will be automatically enabled
during the upgrade to a version where it is marked as required.
The level of requirement is set in the feature flag definition:
```
-rabbit_feature_flag(
   {my_feature_flag,
    #{stability => required,
      require_level => hard
     }}).
```
The default requirement level is `soft`. All existing required feature
flags now have a requirement level of `hard`.
The handling of soft required feature flags is done when the cluster
feature flags states are verified and synchronized. If a required
feature flag is not enabled yet, it is enabled at that time.
This means that as developers, we will have to keep compatibility code
forever for every soft required feature flag, like the feature flag
definition itself.
`init_per_group/3`, which starts the broker, was already called earlier
in the function.
This fixes a bug where the node can't be stopped in `end_per_group/2`,
affecting the next group's ability to start one.
This test flakes in CI as described in
https://github.com/rabbitmq/rabbitmq-server/issues/12413#issuecomment-2419293869
The test case fails with
```
Node: rabbit_shard2@localhost
Case: amqp_system_SUITE:access_failure
Reason: {error,{{badmatch,{error,134,
"Unhandled exception. System.Exception: expected exception not received
at Program.Test.accessFailure(String uri) in /home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbit/test/amqp_system_SUITE_data/fsharp-tests/Program.fs:line 477
at Program.main(String[] argv) in /home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbit/test/amqp_system_SUITE_data/fsharp-tests/Program.fs:line 509\n"}},
[{amqp_system_SUITE,run_dotnet_test,2,
[{file,"amqp_system_SUITE.erl"},
{line,257}]},
```
However, RabbitMQ closes the session as expected due to the missing read
permissions to the queue as shown in the RabbitMQ logs:
```
[debug] <0.1321.0> Asked to create a new user 'access_failure', password length in bytes: 24
[info] <0.1321.0> Created user 'access_failure'
[debug] <0.1324.0> Asked to set permissions for user 'access_failure' in virtual host '/' to '.*', '^banana.*', '^banana.*'
[info] <0.1324.0> Successfully set permissions for user 'access_failure' in virtual host '/' to '.*', '^banana.*', '^banana.*'
[info] <0.1333.0> accepting AMQP connection 127.0.0.1:36248 -> 127.0.0.1:25000
[debug] <0.1333.0> User 'access_failure' authenticated successfully by backend rabbit_auth_backend_internal
[info] <0.1333.0> Connection from AMQP 1.0 container 'AMQPNetLite-101d7d51': user 'access_failure' authenticated using SASL mechanism PLAIN and granted access to vhost '/'
[debug] <0.1333.0> AMQP 1.0 connection.open frame: hostname = 127.0.0.1, extracted vhost = /, idle-time-out = undefined
[debug] <0.1333.0> AMQP 1.0 created session process <0.1338.0> for channel number 0
[warning] <0.1338.0> Closing session for connection <0.1333.0>: {'v1_0.error',
[warning] <0.1338.0> {symbol,
[warning] <0.1338.0> <<"amqp:unauthorized-access">>},
[warning] <0.1338.0> {utf8,
[warning] <0.1338.0> <<"read access to queue 'test' in vhost '/' refused for user 'access_failure'">>},
[warning] <0.1338.0> undefined}
[debug] <0.1333.0> AMQP 1.0 closed session process <0.1338.0> with channel number 0
[warning] <0.1333.0> closing AMQP connection <0.1333.0> (127.0.0.1:36248 -> 127.0.0.1:25000, duration: '269ms'):
[warning] <0.1333.0> client unexpectedly closed TCP connection
```
```
let receiver = ReceiverLink(ac.Session, "test-receiver", src)
```
uses a null constructor for the onAttached callback.
ReceiverLink doesn't seem to block.
Given that the exact same authorization error is already tested in test
case attach_source_queue of amqp_auth_SUITE, it's safe to delete this F#
test.
Prior to this commit tests
* leader_transfer_quorum_queue_credit_single
* leader_transfer_quorum_queue_credit_batches
flaked in CI during 4.1 (main) and 4.0 mixed version testing.
The following error occurred on node 0:
```
[error] <0.1950.0> Timed out waiting for credit reply from quorum queue 'leader_transfer_quorum_queue_credit_batches' in vhost '/'. Hint: Enable feature flag rabbitmq_4.0.0
[warning] <0.1950.0> Closing session for connection <0.1945.0>: {'v1_0.error',
[warning] <0.1950.0> {symbol,<<"amqp:internal-error">>},
[warning] <0.1950.0> {utf8,
[warning] <0.1950.0> <<"Timed out waiting for credit reply from quorum queue 'leader_transfer_quorum_queue_credit_batches' in vhost '/'. Hint: Enable feature flag rabbitmq_4.0.0">>},
[warning] <0.1950.0> undefined}
```
Therefore we enable this feature flag for both tests.
This commit also simplifies some test setups that were necessary for
4.0/3.13 mixed version testing but aren't necessary anymore for 4.1/4.0
mixed version testing.
Support x-cc message annotation
Support an `x-cc` message annotation in AMQP 1.0
similar to the [CC](https://www.rabbitmq.com/docs/sender-selected) header in AMQP 0.9.1.
The value of the `x-cc` message annotation must be a list of strings.
A message annotation is used since application properties allow only simple types.
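A hedged sketch of what setting the annotation could look like with the Erlang AMQP 1.0 client (the exact tagged-value representation accepted by `amqp10_msg:set_message_annotations/2` is an assumption; `Sender` is an already-attached sender link):
```
%% Publish a message that is additionally routed with the routing keys
%% "key-2" and "key-3" via the x-cc message annotation.
Msg0 = amqp10_msg:new(<<"tag-1">>, <<"hello">>, false),
Msg  = amqp10_msg:set_message_annotations(
         #{<<"x-cc">> => {list, [{utf8, <<"key-2">>},
                                 {utf8, <<"key-3">>}]}},
         Msg0),
ok = amqp10_client:send_msg(Sender, Msg).
```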
In order to troubleshoot the flake described in
https://github.com/rabbitmq/rabbitmq-server/issues/12413#issuecomment-2419293869
```
Node: rabbit_shard2@localhost
Case: amqp_system_SUITE:access_failure
Reason: {error,{{badmatch,{error,134,
"Unhandled exception. System.Exception: expected exception not received\n
at Program.Test.accessFailure(String uri) in /home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbit/test/amqp_system_SUITE_data/fsharp-tests/Program.fs:line 477\n
at Program.main(String[] argv) in /home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbit/test/amqp_system_SUITE_data/fsharp-tests/Program.fs:line 509\n"}},
[{amqp_system_SUITE,run_dotnet_test,2,
[{file,"amqp_system_SUITE.erl"},
{line,257}]},
```
As described in https://github.com/rabbitmq/rabbitmq-server/issues/12413#issuecomment-2385379386
test case queue_topology flaked in CI with the following error:
```
rabbitmq_amqp_client > management_SUITE > cluster_size_3 > queue_topology
#1. {error,{test_case_failed,{824,
<<"rmq-ct-cluster_size_3-1-21000@localhost">>}}}
```
This flake could not be reproduced locally (neither with Mnesia nor with Khepri).
Expose the same metrics for AMQP 1.0 connections as for AMQP 0.9.1 connections.
Display the following AMQP 1.0 metrics on the Management UI:
* Network bytes per second from/to client on connections page
* Number of sessions/channels on connections page
* Network bytes per second from/to client graph on connection page
* Reductions graph on connection page
* Garbage collection info on connection page
Expose the following AMQP 1.0 per-object Prometheus metrics:
* rabbitmq_connection_incoming_bytes_total
* rabbitmq_connection_outgoing_bytes_total
* rabbitmq_connection_process_reductions_total
* rabbitmq_connection_incoming_packets_total
* rabbitmq_connection_outgoing_packets_total
* rabbitmq_connection_pending_packets
* rabbitmq_connection_channels
The rabbit_amqp_writer proc:
* notifies the rabbit_amqp_reader proc if it sent frames
* hibernates eventually if it doesn't send any frames
The rabbit_amqp_reader proc:
* does not emit stats (update ETS tables) if no frames are received
or sent to save resources when there are many idle connections.
build(deps): bump org.springframework.boot:spring-boot-starter-parent from 3.3.4 to 3.3.5 in /deps/rabbitmq_auth_backend_http/examples/rabbitmq_auth_backend_spring_boot_kotlin
Removes the usage of a ShouldLog parameter on several functions
and limits the logging of the warning about the delivery_limit
not being set to the moment of queue declaration.
Instead of checking the values for current configuration, represented in
`rabbit_quorum_queue:handle_tick` by the `Overview` variable, against
the effective policy, just regenerate the configuration and compare with
the current configuration.
This commit attempts to eliminate the test flake described in
https://github.com/rabbitmq/rabbitmq-server/issues/12413#issuecomment-2385449940
```
rabbitmq_mqtt > parallel-ct-set-1 > mqtt_shared_SUITE > cluster_size_3 > v4 rabbit_mqtt_qos0_queue_kill_node
=== Ended at 2024-10-01 09:59:52
=== Location: [{mqtt_shared_SUITE,rabbit_mqtt_qos0_queue_kill_node,1165},
{test_server,ts_tc,1793},
{test_server,run_test_case_eval1,1302},
{test_server,run_test_case_eval,1234}]
=== === Reason: no match of right hand side value {publish_not_received,
<<"m1">>}
in function mqtt_shared_SUITE:rabbit_mqtt_qos0_queue_kill_node/1 (mqtt_shared_SUITE.erl, line 1165)
in call from test_server:ts_tc/3 (test_server.erl, line 1793)
in call from test_server:run_test_case_eval1/6 (test_server.erl, line 1302)
in call from test_server:run_test_case_eval/9 (test_server.erl, line 1234)
```
This flake could not be reproduced locally.
This commit also assumes that this flake occurred under Khepri but not
under Mnesia.
The hypothesis is the following:
* Node 0 is down
* MQTT client creates binding on node 1
* Khepri commits since the binding is replicated and persisted on node 1
and node 2. However the binding isn't reflected yet in node 2's
routing projection table.
* Publishing a message to node 2 routes to nowhere.
Provides a specific function to fix client SSL options, i.e. apply all
fixes that are applied for TLS listeners and clients on previous
versions, but also set the `cacerts` option to the CA certificates obtained by
`public_key:cacerts_get/0`, only when no `cacertfile` or `cacerts` are
provided.
build(deps-dev): bump org.junit.jupiter:junit-jupiter-params from 5.11.2 to 5.11.3 in /deps/rabbitmq_auth_backend_http/examples/rabbitmq_auth_backend_spring_boot
application-properties keys are restricted to be strings.
Prior to this commit, a function_clause error occurred if the client
requested an invalid filter:
```
│ *Error{Condition: amqp:internal-error, Description: Session error: function_clause
│ [{rabbit_amqp_filtex,'-validate0/2-fun-0-',
│ [{{symbol,<<"subject">>},{utf8,<<"var">>}}],
│ [{file,"rabbit_amqp_filtex.erl"},{line,119}]},
│ {lists,map,2,[{file,"lists.erl"},{line,2077}]},
│ {rabbit_amqp_filtex,validate0,2,[{file,"rabbit_amqp_filtex.erl"},{line,119}]},
│ {rabbit_amqp_filtex,validate,1,[{file,"rabbit_amqp_filtex.erl"},{line,28}]},
│ {rabbit_amqp_session,parse_filters,2,
│ [{file,"rabbit_amqp_session.erl"},{line,3068}]},
│ {rabbit_amqp_session,parse_filter,1,
│ [{file,"rabbit_amqp_session.erl"},{line,3014}]},
│ {rabbit_amqp_session,'-handle_attach/2-fun-0-',21,
│ [{file,"rabbit_amqp_session.erl"},{line,1371}]},
│
{rabbit_misc,with_exit_handler,2,[{file,"rabbit_misc.erl"},{line,465}]}],
Info: map[]}
```
After this commit, such an invalid filter simply won't take effect, and no crash occurs.
Supersedes #12520
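A hedged sketch of the validation idea (module and function names are made up): invalid entries are dropped instead of crashing, so an unsupported filter simply does not take effect.
```
-module(filter_validation_sketch).
-export([validate/1]).

%% Keep only application-properties filter entries whose key is a string
%% and whose value is a simple type; drop anything else silently instead
%% of letting a function_clause error end the session.
validate(KVList) ->
    lists:filtermap(
      fun({{utf8, _} = Key, Value}) ->
              case is_simple_type(Value) of
                  true  -> {true, {Key, Value}};
                  false -> false
              end;
         (_Other) ->
              false
      end, KVList).

is_simple_type({utf8, _})    -> true;
is_simple_type({symbol, _})  -> true;
is_simple_type({long, _})    -> true;
is_simple_type({boolean, _}) -> true;
is_simple_type(_)            -> false.
```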
This commit notifies the client app with the AMQP performative if
connection config `notify_with_performative` is set to `true`.
This allows the client app to learn about all fields including
properties and capabilities returned by the AMQP server.
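A hedged sketch of how a client might opt in (host, port, container id and SASL settings are placeholders; only the flag name comes from the commit above):
```
%% Open a connection whose events include the received AMQP performative,
%% so the owning process can inspect fields such as properties and
%% capabilities returned by the server.
OpnConf = #{address => "localhost",
            port => 5672,
            container_id => <<"my-container">>,
            sasl => anon,
            notify_with_performative => true},
{ok, Connection} = amqp10_client:open_connection(OpnConf).
```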
* Add BEAM dashboard
Also update the other dashboards by opening in Grafana v11.2.2 and ensuring they work as expected.
* Update the Erlang-Distributions-Compare dashboard
* Update the RabbitMQ-Overview dashboard
* Update the RabbitMQ-Quorum-Queues-Raft dashboard
* Update the RabbitMQ-Stream dashboard
* Update distribution link status panel
---------
Co-authored-by: Michal Kuratczyk <mkuratczyk@vmware.com>
Instead of every time we run Make for these applications.
This means that during development we are free to modify
these values or create new test suites without having to
worry about the check. If we forget to then add the test
suites in PARALLEL_CT the workflow will tell us.
Prior to this commit if dotnet or mvnw failed to fetch test
dependencies, for example because dotnet isn't installed, the test setup
crashed in an unexpected way:
```
amqp_system_SUITE > dotnet
{'EXIT',
{badarg,
[{lists,keysearch,
[rmq_nodes,1,
{skip,
"Failed to fetch .NET Core test project dependencies"}],
[{error_info,#{module => erl_stdlib_errors}}]},
{test_server,lookup_config,2,
[{file,"test_server.erl"},{line,1779}]},
{rabbit_ct_broker_helpers,get_node_configs,2,
[{file,"rabbit_ct_broker_helpers.erl"},{line,1411}]},
{rabbit_ct_broker_helpers,enable_feature_flag,2,
[{file,"rabbit_ct_broker_helpers.erl"},{line,1999}]},
{amqp_system_SUITE,init_per_group,2,
[{file,"amqp_system_SUITE.erl"},{line,77}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1794}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1391}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1235}]}]}}
```
This commit improves the error message instead of failing with `badarg`.
This commit also decides to fail the test setup instead of skipping the
suite because we always want CI to execute this test and be notified
instead of silently skipping if the test can't be run.
This commit fixes the CI error on `main` branch where
amqp_system_SUITE failed with the following error:
```
Process terminated. Couldn't find a valid ICU package installed on the system. Set the configuration flag System.Globalization.Invariant to true if you want to run with no globalization support.
at System.Environment.FailFast(System.String)
at System.Globalization.GlobalizationMode.GetGlobalizationInvariantMode()
at System.Globalization.GlobalizationMode..cctor()
at System.Globalization.CultureData.CreateCultureWithInvariantData()
at System.Globalization.CultureData.get_Invariant()
at System.Globalization.CultureInfo..cctor()
at System.String.ToLowerInvariant()
at Microsoft.DotNet.PlatformAbstractions.RuntimeEnvironment.GetArch()
at Microsoft.DotNet.PlatformAbstractions.RuntimeEnvironment..cctor()
at Microsoft.DotNet.PlatformAbstractions.RuntimeEnvironment.GetRuntimeIdentifier()
at Microsoft.DotNet.Cli.MulticoreJitProfilePathCalculator.CalculateProfileRootPath()
at Microsoft.DotNet.Cli.MulticoreJitActivator.StartCliProfileOptimization()
at Microsoft.DotNet.Cli.MulticoreJitActivator.TryActivateMulticoreJit()
at Microsoft.DotNet.Cli.Program.Main(System.String[])
Exit code: 134 (pid <0.1533.0>)
```
As described in https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-messaging-v1.0-os.html#type-annotations
> The annotations type is a map where the keys are restricted to be of type symbol or of type ulong.
> All ulong keys, and all symbolic keys except those beginning with "x-" are reserved.
Prior to this commit, if an AMQP client used a reserved annotation key,
the entire AMQP connection terminated with a function_clause error
message that might be difficult to understand for client libs:
```
<<"Session error: function_clause\n[{amqp10_framing,'-decode_annotations/1-fun-0-',\n [{{symbol,<<\"aa\">>},{utf8,<<\"bbb\">>}}],\n [{file,\"amqp10_framing.erl\"},{line,158}]},\n {lists,map,2,[{file,\"lists.erl\"},{line,1559}]},\n {amqp10_framing,decode,1,[{file,\"amqp10_framing.erl\"},{line,127}]},\n {lists,map_1,2,[{file,\"lists.erl\"},{line,1564}]},\n {lists,map,2,[{file,\"lists.erl\"},{line,1559}]},\n {mc_amqp,init,1,[{file,\"mc_amqp.erl\"},{line,102}]},\n {mc,init,4,[{file,\"mc.erl\"},{line,150}]},\n {rabbit_amqp_session,incoming_link_transfer,4,\n [{file,\"rabbit_amqp_session.erl\"},{line,2341}]}]">>
```
This commit ends only the session and provides a clearer error message.
Currently this function always falls back to the compatibility code
and never gets the benefit of using ra:key_metrics/1 due to incorrect
use of the map update operator ":=" instead of the insert operator
"=>".
Test the use case described in https://github.com/rabbitmq/rabbitmq-website/pull/2095:
> Rather than relying solely on RabbitMQ's built-in dead lettering tracking via x-opt-deaths,
consumers can customise dead lettering event tracking.
Support tracking the requeue history as described in
https://github.com/rabbitmq/rabbitmq-website/pull/2095
This commit:
1. adds a test case tracing the requeue history via AMQP 1.0
using the modified outcome and
2. fixes bugs in the broker which crashed if a modified message
annotation value is an AMQP 1.0 list, map, or array.
Complex modified annotation values (list, map, array) are stored as tagged values from now on.
This means AMQP 0.9.1 consumers will not receive modified annotations of
type list, map, or array (which is okay).
OTP 27 reset all assumptions on how the vm reacts to processes that
buffer and process a lot of large binaries.
Substantially increasing the vheap sizes for such process restores
most of the same performance by allowing processes to hold more binary
data before major garbage collections are triggered.
This introduces a new module to capture process flag configurations.
The new vheap sizes are only applied when running on OTP 27 or
above.
It was still possible, although rare, to have message store
files lose message data, when the following conditions were
met:
* the message data contains byte values 255
(255 is used as an OK marker after a message)
* the message is located after a 0-filled hole in the file
* the length of the data is at least 4096 bytes and
if we misread it (as detailed below) we encounter
a 255 byte where we expect the OK marker.
The trick that previously made the code misread the length can
be explained as follows:
A message is stored in the following format:
<<Len:64, MsgIdAndMsg:Len/unit:8, 255>>
With MsgId always being 16 bytes in length. So Len is always
at least 16, if the message data Msg is empty. But technically
it never is.
Now if we have a zero filled hole just before this message,
we may end up with this:
<<0, Len:64, MsgIdAndMsg:Len/unit:8, 255>>
When we are scanning we are testing bytes to see if there is
a message there or not. We look for a Len that gives us byte
255 after MsgIdAndMsg.
Len of value 4096 looks like this in binary:
<<0:48, 16, 0>>
Problem is if we have leading zeroes, Len may look like this:
<<0, 0:48, 16, 0>>
If we take the first 64 bits we get a potential length of 16.
We look at the byte after the next 16 bytes. If it is 255, we
think this is a message and skip by this amount of bytes, and
mistakenly miss the real message.
Solving this by changing the file format would be simple enough,
but we don't have the luxury to afford that. A different solution
was found, which is to combine file scanning with checking that
the message exists in the message store index (populated from
queues at startup, and kept up to date over the life time of
the store). Then we know for sure that the message above
doesn't exist, because the MsgId won't be found in the index.
If it is, then the file number and offset will not match,
and the check will fail.
There remains a small chance that we get it wrong during dirty
recovery. Only a better file format would improve that.
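A hedged sketch of the cross-check idea (the data layout and function names are illustrative, not the real rabbit_msg_store API): a candidate found by scanning is only kept if the index agrees on its file and offset.
```
-module(msg_store_scan_sketch).
-export([valid_candidate/4]).

%% MsgId, File and Offset come from the scanner; Index maps message ids to
%% their known location. A candidate the index does not know about, or that
%% the index places elsewhere, is rejected as a false positive.
valid_candidate(MsgId, File, Offset, Index) ->
    case maps:find(MsgId, Index) of
        {ok, #{file := File, offset := Offset}} -> true;
        _ -> false
    end.
```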
Add JavaScript unit tests: given the amount of JavaScript code, it is
difficult to get good coverage with just end-to-end tests.
The tests are not running yet because I need to learn
how to use Babel to convert ES modules into Node.js modules;
otherwise it is not possible, because all the source modules
use ES modules whereas the tests run from Node.js, which requires
CommonJS.
Split rabbit_oauth2_config into
- rabbit_oauth2_resource_server
- rabbit_oauth2_oauth_provider
and their respective test modules.
Signing keys are an OAuth provider
concern, hence they stay with the
oauth_provider module.
* Support AMQP filter expressions
## What?
This PR implements the following property filter expressions for AMQP clients
consuming from streams as defined in
[AMQP Filter Expressions Version 1.0 Working Draft 09](https://groups.oasis-open.org/higherlogic/ws/public/document?document_id=66227):
* properties filters [section 4.2.4]
* application-properties filters [section 4.2.5]
String prefix and suffix matching is also supported.
This PR also fixes a bug where RabbitMQ would accept wrong filters.
Specifically, prior to this PR the values of the filter-set's map were
allowed to be symbols. However, "every value MUST be either null or of a
described type which provides the archetype filter."
## Why?
This feature gives RabbitMQ the ability to have multiple concurrent clients
each consuming only a subset of messages while maintaining message order.
This feature also reduces network traffic between RabbitMQ and clients by
only dispatching those messages that the clients are actually interested in.
Note that AMQP filter expressions are more fine grained than the [bloom filter based
stream filtering](https://www.rabbitmq.com/blog/2023/10/16/stream-filtering) because
* they do not suffer false positives
* the unit of filtering is per-message instead of per-chunk
* matching can be performed on **multiple** values in the properties and
application-properties sections
* prefix and suffix matching on the actual values is supported.
Both AMQP filter expressions and Bloom filters can be used together.
## How?
If a filter isn't valid, RabbitMQ ignores the filter. RabbitMQ only
replies with filters it actually supports and validated successfully to
comply with:
"The receiving endpoint sets its desired filter, the sending endpoint
[RabbitMQ] sets the filter actually in place (including any filters defaulted at
the node)."
* Delete streams test case
The test suite constructed a wrong filter-set.
Specifically the value of the filter-set didn't use a described type as
mandated by the spec.
Using https://azure.github.io/amqpnetlite/api/Amqp.Types.DescribedValue.html
throws errors that the descriptor can't be encoded. Given that this
code path is already tested via the amqp_filtex_SUITE, this F# test
is therefore deleted.
* Re-introduce the AMQP filter-set bug
Since clients might rely on the wrong filter-set value type, we keep
the buggy behaviour behind a deprecated feature flag and will
gradually remove support for it.
* Revert "Delete streams test case"
This reverts commit c95cfeaef7.
[Why]
Before this patch, the $RABBITMQ_FEATURE_FLAGS environment variable took
an exhaustive list of feature flags to enable. This list overrode the
default of enabling all stable feature flags.
It made it inconvenient when a user wanted to enable an experimental
feature flag like `khepri_db` while still leaving the default behavior.
[How]
$RABBITMQ_FEATURE_FLAGS now accepts the following syntax:
RABBITMQ_FEATURE_FLAGS=+feature1,-feature2
This will start RabbitMQ with all stable feature flags, plus `feature1`,
but without `feature2`.
For users setting `forced_feature_flags_on_init` in the config, the
corresponding syntax is:
{forced_feature_flags_on_init, {rel, [feature1], [feature2]}}
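For instance, to keep all stable feature flags enabled by default
while also enabling the experimental `khepri_db` flag mentioned
above, a hedged advanced.config sketch:
```
[
 {rabbit,
  [{forced_feature_flags_on_init, {rel, [khepri_db], []}}]}
].
```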
The problem comes from `ct_master` which doesn't tell us
in the return value whether the tests succeeded. In order
to get that information a CT hook was created. But then
we run into another problem: despite its documentation
claiming otherwise, `ct_master` does not handle `ct_hooks`
instructions in the test spec.
So for the time being we fork `ct_master` into a new
`ct_master_fork` module and insert our hook directly
in the code. Later on we will submit patches to OTP.
For example, during startup after RabbitMQ was upgraded but an
enabled community plugin wasn't, so that the plugin's broker version
requirement was no longer met, RabbitMQ still started the plugin
after logging an error.
build(deps-dev): bump org.junit.jupiter:junit-jupiter-params from 5.11.1 to 5.11.2 in /deps/rabbitmq_auth_backend_http/examples/rabbitmq_auth_backend_spring_boot
[Why]
They just add noise to the UI and there is nothing the user can do about
them at that point.
Given their number will only increase, let's hide them to let the user
focus on the feature flags they can act on.
[Why]
The inventory map is huge and difficult to read when it is logged as
is.
[How]
Logging a matrix is much more compact and to the point.
Before:
Feature flags: inventory of node `rabbit-1@giotto`:
#{feature_flags =>
#{rabbit_exchange_type_local_random =>
#{name => rabbit_exchange_type_local_random,
desc => "Local random exchange",stability => stable,
provided_by => rabbit},
message_containers_deaths_v2 =>
#{name => message_containers_deaths_v2,
desc => "Bug fix for dead letter cycle detection",
...
After:
Feature flags: inventory queried from node `rabbit-2@giotto`:
,-- rabbit-2@giotto
|
amqp_address_v1:
classic_mirrored_queue_version: x
classic_queue_mirroring: x
classic_queue_type_delivery_support: x
...
[Why]
Showing that required feature flags are enabled over and over is not
useful and only adds noise to the logs.
[How]
Required feature flags and removed deprecated features are no longer
listed explicitly. We just log their respective counts, to still make
clear that they exist.
Before:
list of feature flags found:
[x] classic_mirrored_queue_version
[x] classic_queue_type_delivery_support
[x] direct_exchange_routing_v2
[x] feature_flags_v2
[x] implicit_default_bindings
[ ] khepri_db
[x] message_containers_deaths_v2
[x] quorum_queue_non_voters
[~] rabbit_exchange_type_local_random
[x] rabbitmq_4.0.0
...
list of deprecated features found:
[ ] amqp_address_v1
[x] classic_queue_mirroring
[ ] global_qos
[ ] queue_master_locator
[ ] ram_node_type
[ ] transient_nonexcl_queues
After:
list of feature flags found:
[ ] khepri_db
[x] message_containers_deaths_v2
[x] quorum_queue_non_voters
[~] rabbit_exchange_type_local_random
[x] rabbitmq_4.0.0
list of deprecated features found:
[ ] amqp_address_v1
[ ] global_qos
[ ] queue_master_locator
[ ] ram_node_type
[ ] transient_nonexcl_queues
required feature flags not listed above: 18
removed deprecated features not listed above: 1
Fixes a pattern matching bug for discards that come in after a consumer
has been cancelled. Because the rabbit_fifo_client does not keep
the integer consumer key after cancellation, late acks, returns, and
discards use the full {CTag, Pid} consumer id version.
As this is a state machine change the machine version has been
increased to 5.
The same bug is also present for the `modify` command; however, as
AMQP does not allow late settlements, the situation cannot occur
there, so we don't have to make that fix conditional on the machine
version.
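A hedged sketch of handling both consumer key forms (illustrative
names, not the actual rabbit_fifo internals):
```
%% Active consumers are keyed by an integer key; after cancellation
%% the client only knows the full {CTag, Pid} consumer id, so both
%% forms must be handled.
find_consumer(Key, Consumers) when is_integer(Key) ->
    maps:find(Key, Consumers);
find_consumer({_CTag, _Pid} = ConsumerId, Consumers) ->
    case maps:filter(fun(_K, #{id := Id}) -> Id =:= ConsumerId end,
                     Consumers) of
        Matches when map_size(Matches) =:= 0 -> error;
        Matches -> {ok, hd(maps:values(Matches))}
    end.
```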
[Why]
Before this change, the controller was looping on all feature flags to
enable, then for each:
1. it checked if it was supported
2. it acquired the registry lock
3. it enabled the feature flag
4. it released the registry lock
It was done this way to avoid acquiring the lock if the feature flag
was unsupported in the first place.
However, this put more load on the lock mechanism.
[How]
This commit changes the order. The controller acquires the registry lock
once, then loops on feature flags to enable. The support check is now
under the registry lock.
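A hedged sketch of the new control flow (helper names are
illustrative, not the actual rabbit_ff_controller API):
```
enable_all(FeatureNames) ->
    %% Acquire the registry lock once for the whole batch...
    ok = registry_lock(),
    try
        lists:foreach(
          fun(Name) ->
                  %% ...and do the support check under the lock.
                  case is_supported(Name) of
                      true  -> ok = do_enable(Name);
                      false -> ok          %% skip unsupported flags
                  end
          end, FeatureNames)
    after
        registry_unlock()
    end.
```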
The `dict:dict()` typing of `rabbit_binding` appears to be a historical
artifact. `dict` has been superseded by `maps`. Switching to a map
makes deletions easier to inspect manually and faster. And should
deletions grow so large that the choice of representation matters,
manipulating them is unlikely to be expensive compared to whatever
operations produced them, so performance is probably irrelevant.
This commit refactors the bottom section of the `rabbit_binding`
module to switch to a map, make the `deletions()` type opaque
(eliminating a TODO dating back to Erlang/OTP 17.1), and turn the
deletion value into a record. We eliminate some historical artifacts
and "cruft" (a hedged sketch of the new shape follows this list):
* Deletions no longer needlessly take multiple forms; specifically,
  the shape `{X, deleted | not_deleted, Bindings, none}` is no longer
  handled. `process_deletions/2` was responsible for creating this
  shape. Instead we now use a record to clearly define the fields.
* Clauses to catch `{error, not_found}` are unnecessary after minor
refactors of the callers. Removing them makes the type specs cleaner.
* `rabbit_binding:process_deletions/1` has no need to update or change
the deletions. This function uses `maps:foreach/2` instead and returns
`ok` instead of mapped deletions.
* Remove `undefined` from the typespec of deletions. This value is no
longer possible with a refactor to `maybe_auto_delete_exchange_in_*`
functions for Mnesia and Khepri. The value was nonsensical since you
cannot delete bindings for an exchange that does not exist.
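A hedged sketch of the resulting shape (field and type names are
illustrative, not the actual definitions in rabbit_binding):
```
%% One record per affected exchange, stored in a map keyed by
%% exchange name instead of the old dict()/4-tuple combination.
-record(deletion, {exchange,              %% the exchange resource
                   deleted :: boolean(),  %% exchange itself deleted?
                   bindings = [] :: list()}).
-opaque deletions() :: #{term() => #deletion{}}.
```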
... that considers the local node as if it was reset.
[Why]
When a node joins a cluster, we check its compatibility with the
cluster, reset the node, copy the feature flags states from the remote
cluster and add that node to the cluster.
However, the compatibility check is performed with the current feature
flags states, even though they are about to be reset. Therefore, a node
with an enabled feature flag that is unsupported by the cluster will
refuse to join. That is incorrect because, after the reset and the
states copy, it could have joined the cluster just fine.
[How]
We introduce a new variant of `check_node_compatibility/2` that takes an
argument to indicate if the local node should be considered as a virgin
node (i.e. like after a reset).
This way, the joining node will always be able to join, regardless of
its initial feature flags states, as long as it doesn't require a
feature flag that is unsupported by the cluster.
This also removes the need to use `$RABBITMQ_FEATURE_FLAGS` environment
variable to force a new node to leave stable feature flags disabled to
allow it to join a cluster running an older version.
References #9677.
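A hedged sketch of the new variant (all helper names hypothetical;
only the idea is shown):
```
check_node_compatibility(RemoteNode, LocalNodeAsVirgin) ->
    %% Optionally pretend the local node has no feature flag state
    %% yet, since it is about to be reset anyway.
    LocalStates = case LocalNodeAsVirgin of
                      true  -> virgin_feature_flags_states();
                      false -> current_feature_flags_states()
                  end,
    are_compatible(LocalStates, feature_flags_states_on(RemoteNode)).
```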
[Why]
`CheckNodesConsistency` is set to false when
`check_cluster_consistency()` is called as part of a node joining a
cluster. And the generic compatibility check was already executed by
`rabbit_db_cluster`.
There is no need to run it again. This is even counter-productive with
the improvement to `rabbit_feature_flags:check_node_compatibility/2`
that follows.
... with older RabbitMQ versions which don't know about Khepri.
[Why]
When an older node wants to join a cluster, it calls `node_info/0` and
`cluster_status_from_mnesia/0` directly using RPC calls. If it does
that against a node already using Khepri, it will get an error telling
it that
Mnesia is not running. The error is reported to the end user, making it
difficult to understand the problem: both nodes are simply incompatible.
It's better to leave the final decision to the Feature flags subsystem,
but for that, `rabbit_mnesia` on the newer Khepri-based node still needs
to return something the older version can accept.
[How]
`cluster_status_from_mnesia/0` and `node_info/0` are modified to
check whether Khepri is enabled and, if it is, return a value based
on Khepri's status as if it came from Mnesia.
This lets the older remote node continue all its checks and
eventually refuse to join, because the Feature flags subsystem will
indicate the nodes are incompatible.
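A hedged sketch of the shim (the helper functions here are
hypothetical; only rabbit_khepri:is_enabled/0 is assumed to exist):
```
cluster_status_from_mnesia() ->
    case rabbit_khepri:is_enabled() of
        true ->
            %% Answer as if Mnesia was running, using Khepri's view,
            %% so the older node finishes its checks and is rejected
            %% by the feature flags subsystem, not by a Mnesia error.
            {ok, cluster_status_based_on_khepri()};
        false ->
            cluster_status_from_running_mnesia()
    end.
```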
Reverting back to the default of 1 minute. The problem with
3 minutes is that it is exceedingly long, and when there are
problems the overall test time grows dramatically.