See discussion #12807 for details.
rabbit_peer_discovery:normalize/1 could be
changed to only return lists of nodes, but
a number of core code paths treat a single
node as a special "preselected" value.
So let's keep that part and convert both
sets of nodes to lists before computing the
difference.
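For illustration only, a minimal sketch of the conversion (module and
function names below are hypothetical, not the actual peer discovery code):
```
%% Sketch: normalize/1 may return either a single "preselected" node or a
%% list of nodes, so coerce both sides to lists before taking the difference.
-module(peer_discovery_diff_sketch).
-export([nodes_difference/2]).

to_list(Node) when is_atom(Node)   -> [Node];
to_list(Nodes) when is_list(Nodes) -> Nodes.

nodes_difference(Discovered0, Known0) ->
    Discovered = to_list(Discovered0),
    Known      = to_list(Known0),
    Discovered -- Known.
```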
[Why]
In CI, we observe some timeouts in the Erlang distribution connections
between the temporary hidden node and the nodes it queries. This affects
peer discovery obviously.
[How]
We introduce some query retries to reduce the risk of an incomplete
query.
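As a hedged sketch of what such a retry wrapper can look like (function
name, return shapes and retry delay are assumptions, not the actual peer
discovery code):
```
%% Retry a query a bounded number of times before giving up.
query_with_retries(_QueryFun, 0) ->
    {error, retries_exhausted};
query_with_retries(QueryFun, Retries) when Retries > 0 ->
    case QueryFun() of
        {ok, _} = Ok ->
            Ok;
        {error, _Reason} ->
            timer:sleep(1000),
            query_with_retries(QueryFun, Retries - 1)
    end.
```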
While here, we move the sorting of queried nodes from the
`query_node_props2/3` last clause (executed in the temporary hidden
node) to the function that sets up the temporary hidden node and asks
for these queries. This way the debug messages from that sorting are
logged by RabbitMQ out of the box.
[Why]
This affects what the catch reports because it caught exceptions
emitted by code that is supposed to run later. An example is the
assertion in the last clause of `query_node_props2/3`.
[Why]
This was the first solution put in place to prevent the temporary
hidden node from connecting to the node that started it in order to
write any printed messages. Because of that connection, the nodes that
the temporary hidden node queried found out about the parent node and
opened an Erlang distribution connection to it. This polluted the known
nodes list.
However later, the temporary hidden node was started with the
`standard_io` connection option. This prevented the temporary hidden
node from knowing about the node that started it, solving the problem in
a cleaner way.
[How]
This commit garbage-collects that piece of code that is now useless. It
makes the query code way simpler to understand.
Parallel/sharding groups often fail to create certificates in CI.
Most likely this is related to the fact that they use the same
directory for certificates. This commit uses the shard/node name and a
unique id for each SSL certificate.
[Why]
That timer was started during boot and continued regardless of whether
`rabbit` was running or stopped.
This caused the reconciliation to crash if the `rabbit` app was stopped
before it ended, because it tried to access the database even though it
was stopped or even reset.
[How]
We just check if `rabbit` is running before running one reconciliation
and scheduling a new one.
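A minimal sketch of that guard, assuming `rabbit:is_running/0` reports
whether the `rabbit` application is running locally (the worker function
and the interval below are placeholders):
```
maybe_reconcile() ->
    case rabbit:is_running() of
        true ->
            do_reconcile(),
            schedule_next();
        false ->
            %% rabbit stopped (or reset): skip this run, do not reschedule.
            ok
    end.

do_reconcile() -> ok.                             %% placeholder work
schedule_next() ->
    erlang:send_after(60000, self(), reconcile).  %% interval is an assumption
```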
Empty proplists will be serialized to JSON as arrays,
which they arguably are, but HTTP API clients
expect an object regardless of collection size.
References #12552, #12699
This undocumented key used to be computed with a simple date-based
formula and existed to help the support and core teams.
Nodes no longer have the context to return
a correct response, so all we can do is drop this
key.
This fixes erlang_ls's header resolution. Previously it would confuse
the include_lib of the `khepri.hrl` from Khepri with this header in
the rabbit app.
This header is also specific to how rabbit uses Khepri, so I think the
new name fits better.
rabbit:product_version/0 should not return
'undefined'.
However, a fallback to the base version is
a technique we already use in 'rabbitmq-diagnostics status',
so adopt the same trick.
The application is not always recompiled which causes tests to fail
because they cannot call `serial_number:usort/1`.
(cherry picked from commit 05a3733722)
Introduce a single place in the AMQP 1.0 Erlang client that infers the AMQP 1.0 type.
Erlang integers are inferred to be AMQP type `long` to avoid overflow surprises.
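An illustrative sketch of such an inference function (the exact tagged
tuples used by the client may differ; the point is that integers map to
`long`):
```
infer_amqp_type(V) when is_integer(V) -> {long, V};
infer_amqp_type(V) when is_boolean(V) -> {boolean, V};
infer_amqp_type(V) when is_float(V)   -> {double, V};
infer_amqp_type(V) when is_binary(V)  -> {binary, V};
infer_amqp_type(null)                 -> null.
```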
We don't expect random bytes to be there in the current
version of the message store as we overwrite empty spaces
with zeroes when moving messages around.
We also don't expect messages to be falsely flagged when
the broker is running because it checks for message
validity in the index. Therefore make sure message bodies
in the tests don't contain byte 255.
## What?
Prior to this commit, the `rabbitmq_event_exchange` plugin always
internally published AMQP 0.9.1 messages to the `amq.rabbitmq.event`
topic exchange.
This commit allows users to configure the plugin to publish AMQP 1.0
messages instead.
## Why?
Prior to this commit, when an AMQP 1.0 client consumed events,
event properties that are lists were omitted. For example property
`client_properties` of event `connection.created` or property
`arguments` of event `queue.created` were omitted because of the following sequence:
1. The event exchange plugin listens for all kinds of internal events.
2. The event exchange plugin re-publishes all events as AMQP 0.9.1 messages to the event exchange.
3. Later, when an AMQP 1.0 client consumes this message, the broker must translate the message from AMQP 0.9.1 to AMQP 1.0.
4. This translation follows the rules outlined in https://www.rabbitmq.com/docs/conversions#amqpl-amqp
5. Specifically, in this table the row before the last one describes the rule we're hitting here. It says that if the AMQP 0.9.1
header is not an `x-` prefixed header and its value is an array or table, then the header is not converted.
That's because AMQP 1.0 application-properties must be simple types as mandated in https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-messaging-v1.0-os.html#type-application-properties
## How?
The user can configure the plugin as follows to have the plugin
internally publish AMQP 1.0 messages:
```
event_exchange.protocol = amqp_1_0
```
To support complex types such as lists, the plugin sets all event
properties as AMQP 1.0 message-annotations. The plugin prefixes all message
annotation keys with `x-opt-` to comply with the AMQP 1.0 spec.
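A sketch of that key mapping (the property names in the example are
illustrative, not an exhaustive list of event fields):
```
%% Prefix every event property key with x-opt- so it becomes a valid
%% AMQP 1.0 message-annotation key, while keeping complex values intact.
to_message_annotations(EventProps) ->
    [{<<"x-opt-", Key/binary>>, Value} || {Key, Value} <- EventProps].

%% to_message_annotations([{<<"vhost">>, <<"/">>}, {<<"durable">>, true}])
%% => [{<<"x-opt-vhost">>, <<"/">>}, {<<"x-opt-durable">>, true}]
```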
## Alternative Design
An alternative design would have been to format all event properties
e.g. as JSON within the message body. However, this breaks routing on
specific event property values via a headers exchange.
## Documentation
https://github.com/rabbitmq/rabbitmq-website/pull/2129
- Modified metric expression and legend format in State of distribution links
- Changed panel type from 'flant-statusmap-panel' to 'status-history' for Process state
- Updated metric expressions to include instance filtering with {instance="$node"}
for the following metrics:
- erlang_vm_statistics_run_queues_length
- erlang_vm_statistics_dirty_io_run_queue_length
- erlang_vm_statistics_dirty_cpu_run_queue_length
- Added 'DS_PROMETHEUS' as a templated data source variable
* MQTT: avoid an exception
when an AMQP 0-9-1 publisher publishes a message
that has expiration set.
Stack trace was contributed in #12707 by @rdsilio.
* mc_mqtt_SUITE test for #12707, #12710
* MQTT protocol_interop_SUITE: new test for #12710, #12707
* Simplify tests
---------
Co-authored-by: David Ansari <david.ansari@gmx.de>
In a mixed cluster environment,
'rabbitmq-diagnostics status' can hit a node
that does not return any node tags.
Be more defensive and handle such cases
by simply displaying "(none)" for such
values.
[Why]
Without this callback, the deprecated features subsystem can't report if
the feature is used or not.
This reduces the usefulness of the HTTP API endpoint or the CLI command
that help verify if a cluster is using deprecated features.
[How]
The callback counts transient non-exclusive queues and returns `true`
if there are one or more of them.
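Roughly, the callback does something like the following (the queue
accessors are placeholders, not the actual `rabbit_amqqueue` API):
```
are_transient_nonexcl_queues_used(_Args) ->
    Count = length([Q || Q <- list_queues(),
                         not is_durable(Q),
                         not is_exclusive(Q)]),
    Count > 0.

%% Placeholders standing in for the real queue record accessors.
list_queues()    -> [].
is_durable(_Q)   -> true.
is_exclusive(_Q) -> false.
```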
References #12619.
[Why]
The previous implementation bypassed the deprecated features subsystem.
It only cared about classic mirrored queues and called some
queue-related code directly to determine if this specific feature was
used.
[How]
The command code is simplified by calling the deprecated features
subsystem to list the deprecated features in use instead.
References #12619.
It does not need to use the "worst case scenario"
default HTTP request body size limit that
is primarily necessary because definition imports
can be large (MiBs in size, for example).
Since exchange names, queue names and routing keys
are limited to 255 bytes and optional arguments
can practically be expected to be short, we
can lower the limit to < 10 KiB.
This test flaked in CI with the following error:
```
=== === Reason: no match of right hand side value {error,half_attached}
in function amqp_utils:detach_link_sync/1 (amqp_utils.erl, line 100)
in call from amqp_filtex_SUITE:properties_section/1 (amqp_filtex_SUITE.erl, line 187)
in call from test_server:ts_tc/3 (test_server.erl, line 1793)
in call from test_server:run_test_case_eval1/6 (test_server.erl, line 1302)
in call from test_server:run_test_case_eval/9 (test_server.erl, line 1234)
```
Increase the wait for credit being applied as described in commit
aeedad7b51 since this test case still rarely flakes with:
```
=== === Reason: {assertEqual,[{module,amqp_client_SUITE},
{line,3030},
{expression,"amqp10_msg : body ( Msg1 )"},
{expected,[<<"1">>]},
{value,[<<"2">>]}]}
in function amqp_client_SUITE:detach_requeues_two_connections/2 (amqp_client_SUITE.erl, line 3030)
in call from test_server:ts_tc/3 (test_server.erl, line 1793)
in call from test_server:run_test_case_eval1/6 (test_server.erl, line 1302)
in call from test_server:run_test_case_eval/9 (test_server.erl, line 1234)
```
Prior to this commit, test
```
make -C deps/rabbitmq_mqtt ct-mqtt_shared t=[mqtt,cluster_size_1,v4]:non_clean_sess_reconnect_qos0_and_qos1
```
flaked in CI with error:
```
{mqtt_shared_SUITE,non_clean_sess_reconnect_qos0_and_qos1,972}
{badmatch,{publish_not_received,<<"msg-0">>}}
```
The problem was the following race condition:
* The MQTT v4 client sends an async DISCONNECT.
* The global MQTT consumer metric is decremented. However, the classic
queue still has the MQTT connection process registered as a consumer.
* The test case sends a message.
* The classic queue checks out the message to the old connection instead
of checking it out to the new connection.
The solution in this commit is to check the consumer count of the
classic queue before proceeding to send the message after disconnection.
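The wait amounts to a small polling loop along these lines (helper names
are illustrative; the real test uses the suite's RPC helpers to read the
queue's consumer count):
```
wait_for_consumer_count(_GetCount, _Expected, 0) ->
    {error, timeout};
wait_for_consumer_count(GetCount, Expected, Retries) ->
    case GetCount() of
        Expected -> ok;
        _Other ->
            timer:sleep(100),
            wait_for_consumer_count(GetCount, Expected, Retries - 1)
    end.
```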
The stream manager does not need to be a gen_server (no cast, no state)
and the gen_server can create contention for large stream deployments
(some functions make cluster-wide calls that can take some time).
This commit fixes two different bugs/crashes.
To repro, prior to this commit:
1. Create an AMQP 1.0 connection on node-1.
2. Open the Management UI on node-2 and open the connection page of this
single AMQP 1.0 connection.
The first crash was the following:
```
[error] <0.1297.0> crasher:
[error] <0.1297.0> initial call: cowboy_stream_h:request_process/3
[error] <0.1297.0> pid: <0.1297.0>
[error] <0.1297.0> registered_name: []
[error] <0.1297.0> exception error: no case clause matching
[error] <0.1297.0> {badrpc,
[error] <0.1297.0> {'EXIT',
[error] <0.1297.0> {undef,
[error] <0.1297.0> [{rabbit_connection_tracking,lookup,
[error] <0.1297.0> [<<"[::1]:51729 -> [::1]:5672">>,
[error] <0.1297.0> ['rabbit-1@ABCDDDEEAA']],
[error] <0.1297.0> []}]}}}
[error] <0.1297.0> in function rabbit_connection_tracking:lookup/2 (rabbit_connection_tracking.erl, line 235)
[error] <0.1297.0> in call from rabbit_mgmt_wm_connection_sessions:conn/1 (rabbit_mgmt_wm_connection_sessions.erl, line 72)
[error] <0.1297.0> in call from rabbit_mgmt_wm_connection_sessions:is_authorized/2 (rabbit_mgmt_wm_connection_sessions.erl, line 63)
[error] <0.1297.0> in call from cowboy_rest:call/3 (src/cowboy_rest.erl, line 1590)
[error] <0.1297.0> in call from cowboy_rest:is_authorized/2 (src/cowboy_rest.erl, line 368)
[error] <0.1297.0> in call from cowboy_rest:upgrade/4 (src/cowboy_rest.erl, line 284)
[error] <0.1297.0> in call from cowboy_stream_h:execute/3 (src/cowboy_stream_h.erl, line 306)
[error] <0.1297.0> in call from cowboy_stream_h:request_process/3 (src/cowboy_stream_h.erl, line 295)
```
The second crash was the following:
```
[error] <0.1132.0> crasher:
[error] <0.1132.0> initial call: cowboy_stream_h:request_process/3
[error] <0.1132.0> pid: <0.1132.0>
[error] <0.1132.0> registered_name: []
[error] <0.1132.0> exception error: no case clause matching
[error] <0.1132.0> {tracked_connection,
[error] <0.1132.0> {'rabbit-1@ABCDDDEEAA',
[error] <0.1132.0> <<"[::1]:65505 -> [::1]:5672">>},
[error] <0.1132.0> 'rabbit-1@ABCDDDEEAA',<<"/">>,
[error] <0.1132.0> <<"[::1]:65505 -> [::1]:5672">>,<13661.1110.0>,
[error] <0.1132.0> {1,0},
[error] <0.1132.0> network,
[error] <0.1132.0> {0,0,0,0,0,0,0,1},
[error] <0.1132.0> 65505,<<"guest">>,1730908606089}
[error] <0.1132.0> in function rabbit_connection_tracking:lookup/2 (rabbit_connection_tracking.erl, line 235)
[error] <0.1132.0> in call from rabbit_mgmt_wm_connection_sessions:conn/1 (rabbit_mgmt_wm_connection_sessions.erl, line 72)
[error] <0.1132.0> in call from rabbit_mgmt_wm_connection_sessions:is_authorized/2 (rabbit_mgmt_wm_connection_sessions.erl, line 63)
[error] <0.1132.0> in call from cowboy_rest:call/3 (src/cowboy_rest.erl, line 1590)
[error] <0.1132.0> in call from cowboy_rest:is_authorized/2 (src/cowboy_rest.erl, line 368)
[error] <0.1132.0> in call from cowboy_rest:upgrade/4 (src/cowboy_rest.erl, line 284)
[error] <0.1132.0> in call from cowboy_stream_h:execute/3 (src/cowboy_stream_h.erl, line 306)
[error] <0.1132.0> in call from cowboy_stream_h:request_process/3 (src/cowboy_stream_h.erl, line 295)
```
## What?
On the connection page in the Management UI, display detailed session and
link information including:
* Link names
* Link target and source addresses
* Link flow control state
* Session flow control state
* Number of unconfirmed and unacknowledged messages
## How?
A new HTTP API endpoint is added:
```
/connections/:connection_name/sessions
```
The HTTP handler first queries the Erlang connection process to find out about
all session Pids. The handler then queries each Erlang session process
of this connection.
(The table auto-refreshes by default every 5 seconds. The handler querying a single
connection with 60 idle sessions, each with 250 links, takes ~100 ms.)
For better user experience in the Management UI, this commit also makes the
session process store and expose link names as well as source/target addresses.
[Why]
The "Feature flags" admin section had several issues:
* It was not designed for experimental feature flags. What was done for
RabbitMQ 4.0.0 still left it unclear what a user should expect from
experimental feature flags.
* The UI used synchronous requests from the browser main thread. This
meant that for a feature flag with a long-running migration callback,
the browser tab could freeze for a very long time.
[How]
The feature flags table is reworked and now displays:
* a series of icons to highlight the following:
  * a feature flag that has a migration function and can thus take
    time to be enabled
  * a feature flag that is experimental
  * whether this experimental feature flag is supported or not
* a toggle to quickly show whether a feature flag is enabled and let
the user enable it at the same time.
For stable feature flags, when a user clicks on the toggle, the toggle
goes into an intermediate state while waiting for the response from the
broker. If the response is successful, the toggle turns green. Otherwise
it goes back to red and the error is displayed in a popup as before.
For experimental feature flags, when a user clicks on the toggle, a
popup is displayed to let the user know of the possible constraints and
consequences, with one or two required checkboxes to tick so the user
confirms they understand the message. The feature flag is enabled only
after the user validates the popup. The displayed message and the
checkboxes depend on whether the experimental feature flag is supported
or not (this is a new attribute of experimental feature flags).
The request to enable feature flags now uses the modern `fetch()` API.
Therefore it uses JavaScript promises and does not block the main
thread: the UI remains responsive while a migration callback runs.
Finally, an "Enable all stable feature flags" button has been added to
the warning that tells the user some stable feature flags are still
disabled.
V2: Pause auto-refresh while a feature flag is being handled. This fixes
some display inconsistencies.
[Why]
The previous implementation was using the blocking `is_enabled/1` API.
This meant that if a feature flag was being enabled and the enable
callback took time, the CLI's `list_feature_flag` command or any use of
the management UI would block until the feature flag was enabled.
[How]
`get_state/1` now uses the non-blocking API. However, it may now return
an additional value: `state_changing`.
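Callers therefore need to handle the extra value, e.g. something like
this (the formatting is illustrative, not the CLI's actual output):
```
format_state(enabled)        -> "Enabled";
format_state(disabled)       -> "Disabled";
format_state(state_changing) -> "(changing)".
```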
[Why]
During the development of Khepri, it was difficult to communicate that
it was unsupported in RabbitMQ 3.13.x but was then supported in 4.0.x
even though it was still experimental.
[How]
The feature flag definition now exposes that support level in a new
attribute called `experiment_level`. It can be `unsupported` or
`supported`.
We can use this new attribute in the CLI or the web UI to convey the
level of support to the end user.
In the future, we could imagine that an experimental feature flag
becomes abandoned, where upgrading from a node that has it enabled to a
version that marks the feature flag as abandoned is not possible.
It is possible for a slow-running follower with local consumers
to crash after a snapshot installation as it tries to read an entry
from its log that is no longer there (as it has been consumed and
completed by another node but still refers to prior consumers on the
current node).
This commit makes the log effect callback function more defensive:
it checks that the number of commands returned by the log effect
isn't different from what was requested. If it is different, we
consider this a stale read request and return no further effects.
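In pseudo-Erlang, the defensive check looks roughly like this (names are
illustrative, not the actual quorum queue code):
```
%% `Requested` is how many log entries the read asked for; `Cmds` is what
%% the log effect actually returned.
handle_log_read(Cmds, Requested) when length(Cmds) =/= Requested ->
    %% Stale read, e.g. after a snapshot installation: no further effects.
    [];
handle_log_read(Cmds, _Requested) ->
    process_commands(Cmds).

process_commands(Cmds) -> Cmds.   %% placeholder for the real processing
```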
Conflicts:
deps/rabbit/test/quorum_queue_SUITE.erl
Closes #9259.
## What?
Allow an AMQP 1.0 client to renew an OAuth 2.0 token before it expires.
## Why?
This allows clients to keep the AMQP connection open instead of having
to create a new connection whenever the token expires.
## How?
As explained in https://github.com/rabbitmq/rabbitmq-server/issues/9259#issuecomment-2437602040
the client can `PUT` a new token on HTTP API v2 path `/auth/tokens`.
RabbitMQ will then:
1. Store the new token on the given connection.
2. Recheck access to the connection's vhost.
3. Clear all permission caches in the AMQP sessions.
4. Recheck write permissions to exchanges for links publishing to
RabbitMQ, and recheck read permissions from queues for links
consuming from RabbitMQ. The latter complies with the user
expectation in #11364.
For example, if the first restarted node doesn't start,
don't try to restart the other nodes. This mimics what
orchestrators such as Kubernetes or BOSH would do
(although they perform this check differently)
[Why]
Before this patch, required feature flags were basically checked during
boot: they must have been enabled when they were mere stable feature
flags. If they were not, the node refused to boot.
This was easy for the developer because making a feature flag required
allowed the entire compatibility code to be removed. Very satisfying.
Unfortunately, this was a pain point for end users, especially those who
did not pay attention to RabbitMQ and the release notes and were just
asking their package manager to update everything. They could end up
with a node that refused to boot. The only solution was to downgrade,
enable the disabled stable feature flags, then upgrade again.
[How]
This patch introduces two levels of requirement to required feature
flags:
* `hard`: this corresponds to the existing behavior where a node will
refuse to boot if a hard required feature flag is not enabled before
the upgrade.
* `soft`: such a required feature flag will be automatically enabled
during the upgrade to a version where it is marked as required.
The level of requirement is set in the feature flag definition:
```
-rabbit_feature_flag(
   {my_feature_flag,
    #{stability     => required,
      require_level => hard
     }}).
```
The default requirement level is `soft`. All existing required feature
flags now have a requirement level of `hard`.
Soft required feature flags are handled when the cluster feature flags
states are verified and synchronized. If a soft required feature flag is
not enabled yet, it is enabled at that time.
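Conceptually, the sync step does something along these lines (a sketch;
only `rabbit_feature_flags:enable/1` is a real API, the rest is
illustrative):
```
maybe_enable_soft_required(FlagName,
                           #{stability := required,
                             require_level := soft},
                           false = _IsEnabled) ->
    rabbit_feature_flags:enable(FlagName);
maybe_enable_soft_required(_FlagName, _Definition, _IsEnabled) ->
    ok.
```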
This means that as developers, we will have to keep compatibility code
forever for every soft required feature flag, like the feature flag
definition itself.
`init_per_group/3`, which starts the broker, was already called earlier
in the function.
This fixes a bug where the node can't be stopped in `end_per_group/2`,
affecting the next group's ability to start one.
This test flakes in CI as described in
https://github.com/rabbitmq/rabbitmq-server/issues/12413#issuecomment-2419293869
The test case fails with
```
Node: rabbit_shard2@localhost
Case: amqp_system_SUITE:access_failure
Reason: {error,{{badmatch,{error,134,
"Unhandled exception. System.Exception: expected exception not received
at Program.Test.accessFailure(String uri) in /home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbit/test/amqp_system_SUITE_data/fsharp-tests/Program.fs:line 477
at Program.main(String[] argv) in /home/runner/work/rabbitmq-server/rabbitmq-server/deps/rabbit/test/amqp_system_SUITE_data/fsharp-tests/Program.fs:line 509\n"}},
[{amqp_system_SUITE,run_dotnet_test,2,
[{file,"amqp_system_SUITE.erl"},
{line,257}]},
```
However, RabbitMQ closes the session as expected due to the missing read
permissions to the queue as shown in the RabbitMQ logs:
```
[debug] <0.1321.0> Asked to create a new user 'access_failure', password length in bytes: 24
[info] <0.1321.0> Created user 'access_failure'
[debug] <0.1324.0> Asked to set permissions for user 'access_failure' in virtual host '/' to '.*', '^banana.*', '^banana.*'
[info] <0.1324.0> Successfully set permissions for user 'access_failure' in virtual host '/' to '.*', '^banana.*', '^banana.*'
[info] <0.1333.0> accepting AMQP connection 127.0.0.1:36248 -> 127.0.0.1:25000
[debug] <0.1333.0> User 'access_failure' authenticated successfully by backend rabbit_auth_backend_internal
[info] <0.1333.0> Connection from AMQP 1.0 container 'AMQPNetLite-101d7d51': user 'access_failure' authenticated using SASL mechanism PLAIN and granted access to vhost '/'
[debug] <0.1333.0> AMQP 1.0 connection.open frame: hostname = 127.0.0.1, extracted vhost = /, idle-time-out = undefined
[debug] <0.1333.0> AMQP 1.0 created session process <0.1338.0> for channel number 0
[warning] <0.1338.0> Closing session for connection <0.1333.0>: {'v1_0.error',
[warning] <0.1338.0> {symbol,
[warning] <0.1338.0> <<"amqp:unauthorized-access">>},
[warning] <0.1338.0> {utf8,
[warning] <0.1338.0> <<"read access to queue 'test' in vhost '/' refused for user 'access_failure'">>},
[warning] <0.1338.0> undefined}
[debug] <0.1333.0> AMQP 1.0 closed session process <0.1338.0> with channel number 0
[warning] <0.1333.0> closing AMQP connection <0.1333.0> (127.0.0.1:36248 -> 127.0.0.1:25000, duration: '269ms'):
[warning] <0.1333.0> client unexpectedly closed TCP connection
```
```
let receiver = ReceiverLink(ac.Session, "test-receiver", src)
```
uses a null constructor for the onAttached callback.
ReceiverLink doesn't seem to block.
Given that the exact same authorization error is already tested in test
case attach_source_queue of amqp_auth_SUITE, it's safe to delete this F#
test.
Prior to this commit tests
* leader_transfer_quorum_queue_credit_single
* leader_transfer_quorum_queue_credit_batches
flaked in CI during 4.1 (main) and 4.0 mixed version testing.
The following error occurred on node 0:
```
[error] <0.1950.0> Timed out waiting for credit reply from quorum queue 'leader_transfer_quorum_queue_credit_batches' in vhost '/'. Hint: Enable feature flag rabbitmq_4.0.0
[warning] <0.1950.0> Closing session for connection <0.1945.0>: {'v1_0.error',
[warning] <0.1950.0> {symbol,<<"amqp:internal-error">>},
[warning] <0.1950.0> {utf8,
[warning] <0.1950.0> <<"Timed out waiting for credit reply from quorum queue 'leader_transfer_quorum_queue_credit_batches' in vhost '/'. Hint: Enable feature flag rabbitmq_4.0.0">>},
[warning] <0.1950.0> undefined}
```
Therefore we enable this feature flag for both tests.
This commit also simplifies some test setups that were necessary for
4.0/3.13 mixed version testing, but are no longer necessary for 4.1/4.0
mixed version testing.