This commit attempts to eliminate the test flake described in
https://github.com/rabbitmq/rabbitmq-server/issues/12413#issuecomment-2385449940
```
rabbitmq_mqtt > parallel-ct-set-1 > mqtt_shared_SUITE > cluster_size_3 > v4 rabbit_mqtt_qos0_queue_kill_node
=== Ended at 2024-10-01 09:59:52
=== Location: [{mqtt_shared_SUITE,rabbit_mqtt_qos0_queue_kill_node,1165},
{test_server,ts_tc,1793},
{test_server,run_test_case_eval1,1302},
{test_server,run_test_case_eval,1234}]
=== === Reason: no match of right hand side value {publish_not_received,
<<"m1">>}
in function mqtt_shared_SUITE:rabbit_mqtt_qos0_queue_kill_node/1 (mqtt_shared_SUITE.erl, line 1165)
in call from test_server:ts_tc/3 (test_server.erl, line 1793)
in call from test_server:run_test_case_eval1/6 (test_server.erl, line 1302)
in call from test_server:run_test_case_eval/9 (test_server.erl, line 1234)
```
This flake could not be reproduced locally.
This commit also assumes that this flake occurred under Khepri but not
under Mnesia.
The hypothesis is the following:
* Node 0 is down
* MQTT client creates binding on node 1
* Khepri commits since the binding is replicated and persisted on node 1
and node 2. However, the binding isn't reflected yet in node 2's
routing projection table.
* Publishing a message to node 2 routes to nowhere.
Provides a specific function to fix client SSL options, i.e. apply all
fixes that were applied for TLS listeners and clients in previous
versions, but also set the `cacerts` option to the CA certificates obtained by
`public_key:cacerts_get`, only when no `cacertfile` or `cacerts` is
provided.
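A minimal sketch of such a helper, assuming proplist-style SSL options (the module and function names are hypothetical; `public_key:cacerts_get/0` is the OTP call mentioned above):
```
-module(ssl_opts_sketch).
-export([maybe_set_cacerts/1]).

%% Add the system CA certificates only when the caller supplied neither
%% `cacertfile' nor `cacerts' in the SSL options.
maybe_set_cacerts(SslOpts) when is_list(SslOpts) ->
    case proplists:is_defined(cacertfile, SslOpts) orelse
         proplists:is_defined(cacerts, SslOpts) of
        true  -> SslOpts;
        false -> [{cacerts, public_key:cacerts_get()} | SslOpts]
    end.
```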
build(deps-dev): bump org.junit.jupiter:junit-jupiter-params from 5.11.2 to 5.11.3 in /deps/rabbitmq_auth_backend_http/examples/rabbitmq_auth_backend_spring_boot
application-properties keys are restricted to be strings.
Prior to this commit, a function_clause error occurred if the client
requested an invalid filter:
```
│ *Error{Condition: amqp:internal-error, Description: Session error: function_clause
│ [{rabbit_amqp_filtex,'-validate0/2-fun-0-',
│ [{{symbol,<<"subject">>},{utf8,<<"var">>}}],
│ [{file,"rabbit_amqp_filtex.erl"},{line,119}]},
│ {lists,map,2,[{file,"lists.erl"},{line,2077}]},
│ {rabbit_amqp_filtex,validate0,2,[{file,"rabbit_amqp_filtex.erl"},{line,119}]},
│ {rabbit_amqp_filtex,validate,1,[{file,"rabbit_amqp_filtex.erl"},{line,28}]},
│ {rabbit_amqp_session,parse_filters,2,
│ [{file,"rabbit_amqp_session.erl"},{line,3068}]},
│ {rabbit_amqp_session,parse_filter,1,
│ [{file,"rabbit_amqp_session.erl"},{line,3014}]},
│ {rabbit_amqp_session,'-handle_attach/2-fun-0-',21,
│ [{file,"rabbit_amqp_session.erl"},{line,1371}]},
│ {rabbit_misc,with_exit_handler,2,[{file,"rabbit_misc.erl"},{line,465}]}],
Info: map[]}
```
After this commit, the invalid filter simply doesn't take effect and no crash occurs.
Supersedes #12520
This commit notifies the client app with the AMQP performative if
connection config `notify_with_performative` is set to `true`.
This allows the client app to learn about all fields including
properties and capabilities returned by the AMQP server.
* Add BEAM dashboard
Also update the other dashboards by opening in Grafana v11.2.2 and ensuring they work as expected.
* Update the Erlang-Distributions-Compare dashboard
* Update the RabbitMQ-Overview dashboard
* Update the RabbitMQ-Quorum-Queues-Raft dashboard
* Update the RabbitMQ-Stream dashboard
* Update distribution link status panel
---------
Co-authored-by: Michal Kuratczyk <mkuratczyk@vmware.com>
Instead of every time we run Make for these applications.
This means that during development we are free to modify
these values or create new test suites without having to
worry about the check. If we then forget to add the test
suites to PARALLEL_CT, the workflow will tell us.
Prior to this commit, if dotnet or mvnw failed to fetch test
dependencies, for example because dotnet isn't installed, the test setup
crashed in an unexpected way:
```
amqp_system_SUITE > dotnet
{'EXIT',
{badarg,
[{lists,keysearch,
[rmq_nodes,1,
{skip,
"Failed to fetch .NET Core test project dependencies"}],
[{error_info,#{module => erl_stdlib_errors}}]},
{test_server,lookup_config,2,
[{file,"test_server.erl"},{line,1779}]},
{rabbit_ct_broker_helpers,get_node_configs,2,
[{file,"rabbit_ct_broker_helpers.erl"},{line,1411}]},
{rabbit_ct_broker_helpers,enable_feature_flag,2,
[{file,"rabbit_ct_broker_helpers.erl"},{line,1999}]},
{amqp_system_SUITE,init_per_group,2,
[{file,"amqp_system_SUITE.erl"},{line,77}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1794}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1391}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1235}]}]}}
```
This commit improves the error message instead of failing with `badarg`.
This commit also fails the test setup instead of skipping the
suite, because we always want CI to execute this test and notify us
if it can't be run, rather than skipping it silently.
This commit fixes the CI error on `main` branch where
amqp_system_SUITE failed with the following error:
```
Process terminated. Couldn't find a valid ICU package installed on the system. Set the configuration flag System.Globalization.Invariant to true if you want to run with no globalization support.
at System.Environment.FailFast(System.String)
at System.Globalization.GlobalizationMode.GetGlobalizationInvariantMode()
at System.Globalization.GlobalizationMode..cctor()
at System.Globalization.CultureData.CreateCultureWithInvariantData()
at System.Globalization.CultureData.get_Invariant()
at System.Globalization.CultureInfo..cctor()
at System.String.ToLowerInvariant()
at Microsoft.DotNet.PlatformAbstractions.RuntimeEnvironment.GetArch()
at Microsoft.DotNet.PlatformAbstractions.RuntimeEnvironment..cctor()
at Microsoft.DotNet.PlatformAbstractions.RuntimeEnvironment.GetRuntimeIdentifier()
at Microsoft.DotNet.Cli.MulticoreJitProfilePathCalculator.CalculateProfileRootPath()
at Microsoft.DotNet.Cli.MulticoreJitActivator.StartCliProfileOptimization()
at Microsoft.DotNet.Cli.MulticoreJitActivator.TryActivateMulticoreJit()
at Microsoft.DotNet.Cli.Program.Main(System.String[])
Exit code: 134 (pid <0.1533.0>)
```
As described in https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-messaging-v1.0-os.html#type-annotations
> The annotations type is a map where the keys are restricted to be of type symbol or of type ulong.
> All ulong keys, and all symbolic keys except those beginning with "x-" are reserved.
Prior to this commit, if an AMQP client used a reserved annotation key,
the entire AMQP connection terminated with a function_clause error
message that might be difficult to understand for client libs:
```
<<"Session error: function_clause\n[{amqp10_framing,'-decode_annotations/1-fun-0-',\n [{{symbol,<<\"aa\">>},{utf8,<<\"bbb\">>}}],\n [{file,\"amqp10_framing.erl\"},{line,158}]},\n {lists,map,2,[{file,\"lists.erl\"},{line,1559}]},\n {amqp10_framing,decode,1,[{file,\"amqp10_framing.erl\"},{line,127}]},\n {lists,map_1,2,[{file,\"lists.erl\"},{line,1564}]},\n {lists,map,2,[{file,\"lists.erl\"},{line,1559}]},\n {mc_amqp,init,1,[{file,\"mc_amqp.erl\"},{line,102}]},\n {mc,init,4,[{file,\"mc.erl\"},{line,150}]},\n {rabbit_amqp_session,incoming_link_transfer,4,\n [{file,\"rabbit_amqp_session.erl\"},{line,2341}]}]">>
```
This commit ends only the session and provides a clearer error message.
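A hedged sketch of the kind of check involved (the module and function names are hypothetical; per the spec quoted above, only symbolic keys beginning with "x-" are not reserved):
```
-module(annotation_key_sketch).
-export([validate_annotation_key/1]).

%% Accept only symbolic keys beginning with "x-"; reject reserved keys with a
%% descriptive error instead of crashing with function_clause.
validate_annotation_key({symbol, <<"x-", _/binary>>} = Key) ->
    {ok, Key};
validate_annotation_key(Key) ->
    {error, {reserved_annotation_key, Key}}.
```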
Currently this function always falls back to the compatibility code
and never gets the benefit of using ra:key_metrics/1 due to incorrect
use of the map update operator ":=" instead of the insert operator
"=>".
Test the use case described in https://github.com/rabbitmq/rabbitmq-website/pull/2095:
> Rather than relying solely on RabbitMQ's built-in dead lettering tracking via x-opt-deaths,
> consumers can customise dead lettering event tracking.
Support tracking the requeue history as described in
https://github.com/rabbitmq/rabbitmq-website/pull/2095
This commit:
1. adds a test case tracing the requeue history via AMQP 1.0
using the modified outcome and
2. fixes bugs in the broker which crashed if a modified message
annotation value is an AMQP 1.0 list, map, or array.
Complex modified annotation values (list, map, array) are stored as tagged values from now on.
This means AMQP 0.9.1 consumers will not receive modified annotations of
type list, map, or array (which is okay).
OTP 27 reset all assumptions on how the VM reacts to processes that
buffer and process a lot of large binaries.
Substantially increasing the vheap sizes for such process restores
most of the same performance by allowing processes to hold more binary
data before major garbage collections are triggered.
This introduces a new module to capture process flag configurations.
The new vheap sizes are only applied when running on OTP 27 or
above.
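A hedged sketch of the idea (the module name and size value are illustrative, not the configuration introduced here):
```
-module(vheap_sketch).
-export([maybe_raise_bin_vheap/0]).

%% Raise the binary virtual heap size of the calling process, but only on
%% OTP 27 or newer; older releases keep the default.
maybe_raise_bin_vheap() ->
    case list_to_integer(erlang:system_info(otp_release)) >= 27 of
        true  ->
            _OldSize = erlang:process_flag(min_bin_vheap_size, 1024 * 1024),
            ok;
        false ->
            ok
    end.
```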
It was still possible, although rare, to have message store
files lose message data, when the following conditions were
met:
* the message data contains byte values 255
(255 is used as an OK marker after a message)
* the message is located after a 0-filled hole in the file
* the length of the data is at least 4096 bytes and
if we misread it (as detailed below) we encounter
a 255 byte where we expect the OK marker
The trick that made the code previously misread the length can
be explained as follows:
A message is stored in the following format:
<<Len:64, MsgIdAndMsg:Len/unit:8, 255>>
MsgId is always 16 bytes in length, so Len is always
at least 16, even if the message data Msg is empty (which technically
never happens).
Now if we have a zero filled hole just before this message,
we may end up with this:
<<0, Len:64, MsgIdAndMsg:Len/unit:8, 255>>
When we are scanning we are testing bytes to see if there is
a message there or not. We look for a Len that gives us byte
255 after MsgIdAndMsg.
Len of value 4096 looks like this in binary:
<<0:48, 16, 0>>
The problem is that, with the leading zero from the hole, the scanner may see this instead:
<<0, 0:48, 16, 0>>
If we take the first 64 bits we get a potential length of 16.
We look at the byte after the next 16 bytes. If it is 255, we
think this is a message and skip by this amount of bytes, and
mistakenly miss the real message.
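For illustration, the misread reproduced in an Erlang shell (independent of the actual message store code):
```
1> Len = 4096.
4096
2> <<Len:64>>.                                    %% the real length prefix
<<0,0,0,0,0,0,16,0>>
3> <<Misread:64, _Rest/binary>> = <<0, Len:64>>.  %% a hole byte precedes it
<<0,0,0,0,0,0,0,16,0>>
4> Misread.                                       %% the scanner sees 16, not 4096
16
```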
Solving this by changing the file format would be simple enough,
but we can't afford that. A different solution
was found, which is to combine file scanning with checking that
the message exists in the message store index (populated from
queues at startup, and kept up to date over the life time of
the store). Then we know for sure that the message above
doesn't exist, because the MsgId won't be found in the index.
If it is, then the file number and offset will not match,
and the check will fail.
There remains a small chance that we get it wrong during dirty
recovery. Only a better file format would improve that.
Add JavaScript unit tests: given the amount of
JavaScript code, it is difficult to get good coverage
with just end-to-end tests.
The tests are not running yet because I need to learn
how to use Babel to convert ES modules into Node.js modules;
otherwise it is not possible, because all the source modules
use ES modules whereas the tests run from Node.js, which requires
CommonJS.
split rabbit_oauth2_config into
- rabbit_oauth2_resource_server
- rabbit_oauth2_oauth_provider
and their respective test modules
Signing keys are an OAuth provider
concern, hence they stay with the
oauth_provider module.
* Support AMQP filter expressions
## What?
This PR implements the following property filter expressions for AMQP clients
consuming from streams as defined in
[AMQP Filter Expressions Version 1.0 Working Draft 09](https://groups.oasis-open.org/higherlogic/ws/public/document?document_id=66227):
* properties filters [section 4.2.4]
* application-properties filters [section 4.2.5]
String prefix and suffix matching is also supported.
This PR also fixes a bug where RabbitMQ would accept wrong filters.
Specifically, prior to this PR the values of the filter-set's map were
allowed to be symbols. However, "every value MUST be either null or of a
described type which provides the archetype filter."
## Why?
This feature gives RabbitMQ the ability to have multiple concurrent clients,
each consuming only a subset of messages while maintaining message order.
This feature also reduces network traffic between RabbitMQ and clients by
only dispatching those messages that the clients are actually interested in.
Note that AMQP filter expressions are more fine grained than the [bloom filter based
stream filtering](https://www.rabbitmq.com/blog/2023/10/16/stream-filtering) because
* they do not suffer false positives
* the unit of filtering is per-message instead of per-chunk
* matching can be performed on **multiple** values in the properties and
application-properties sections
* prefix and suffix matching on the actual values is supported.
Both AMQP filter expressions and Bloom filters can be used together.
## How?
If a filter isn't valid, RabbitMQ ignores the filter. RabbitMQ only
replies with filters it actually supports and validated successfully to
comply with:
"The receiving endpoint sets its desired filter, the sending endpoint
[RabbitMQ] sets the filter actually in place (including any filters defaulted at
the node)."
* Delete streams test case
The test suite constructed a wrong filter-set.
Specifically, the value of the filter-set didn't use a described type as
mandated by the spec.
Using https://azure.github.io/amqpnetlite/api/Amqp.Types.DescribedValue.html
throws errors that the descriptor can't be encoded. Given that this code
path is already tested via the amqp_filtex_SUITE, this F# test is
therefore deleted.
* Re-introduce the AMQP filter-set bug
Since clients might rely on the wrong filter-set value type, we keep
supporting the bug behind a deprecated feature flag and will gradually
remove support for it.
* Revert "Delete streams test case"
This reverts commit c95cfeaef7.
[Why]
Before this patch, the $RABBITMQ_FEATURE_FLAGS environment variable took
an exhaustive list of feature flags to enable. This list overrode the
default of enabling all stable feature flags.
It made it inconvenient when a user wanted to enable an experimental
feature flag like `khepri_db` while still leaving the default behavior.
[How]
$RABBITMQ_FEATURE_FLAGS now accepts the following syntax:
RABBITMQ_FEATURE_FLAGS=+feature1,-feature2
This will start RabbitMQ with all stable feature flags, plus `feature1`,
but without `feature2`.
For users setting `forced_feature_flags_on_init` in the config, the
corresponding syntax is:
{forced_feature_flags_on_init, {rel, [feature1], [feature2]}}
The problem comes from `ct_master` which doesn't tell us
in the return value whether the tests succeeded. In order
to get that information a CT hook was created. But then
we run into another problem: despite its documentation
claiming otherwise, `ct_master` does not handle `ct_hooks`
instructions in the test spec.
So for the time being we fork `ct_master` into a new
`ct_master_fork` module and insert our hook directly
in the code. Later on we will submit patches to OTP.
For example, during startup after RabbitMQ was upgraded but an
enabled community plugin wasn't, and the plugin's broker version
requirement is no longer met, RabbitMQ still started the plugin
after logging an error.
build(deps-dev): bump org.junit.jupiter:junit-jupiter-params from 5.11.1 to 5.11.2 in /deps/rabbitmq_auth_backend_http/examples/rabbitmq_auth_backend_spring_boot
[Why]
They just add noise to the UI and there is nothing the user can do about
them at that point.
Given their number will only increase, let's hide them to let the user
focus on the feature flags they can act on.
[Why]
The inventory map is huge and difficult to read when it is logged as
is.
[How]
Logging a matrix is much more compact and to the point.
Before:
Feature flags: inventory of node `rabbit-1@giotto`:
#{feature_flags =>
#{rabbit_exchange_type_local_random =>
#{name => rabbit_exchange_type_local_random,
desc => "Local random exchange",stability => stable,
provided_by => rabbit},
message_containers_deaths_v2 =>
#{name => message_containers_deaths_v2,
desc => "Bug fix for dead letter cycle detection",
...
After:
Feature flags: inventory queried from node `rabbit-2@giotto`:
,-- rabbit-2@giotto
|
amqp_address_v1:
classic_mirrored_queue_version: x
classic_queue_mirroring: x
classic_queue_type_delivery_support: x
...
[Why]
Showing that required feature flags are enabled over and over is not
useful and only adds noise to the logs.
[How]
Required feature flags and removed deprecated features are no longer
listed explicitly. We just log their respective counts to make it clear
that they exist.
Before:
list of feature flags found:
[x] classic_mirrored_queue_version
[x] classic_queue_type_delivery_support
[x] direct_exchange_routing_v2
[x] feature_flags_v2
[x] implicit_default_bindings
[ ] khepri_db
[x] message_containers_deaths_v2
[x] quorum_queue_non_voters
[~] rabbit_exchange_type_local_random
[x] rabbitmq_4.0.0
...
list of deprecated features found:
[ ] amqp_address_v1
[x] classic_queue_mirroring
[ ] global_qos
[ ] queue_master_locator
[ ] ram_node_type
[ ] transient_nonexcl_queues
After:
list of feature flags found:
[ ] khepri_db
[x] message_containers_deaths_v2
[x] quorum_queue_non_voters
[~] rabbit_exchange_type_local_random
[x] rabbitmq_4.0.0
list of deprecated features found:
[ ] amqp_address_v1
[ ] global_qos
[ ] queue_master_locator
[ ] ram_node_type
[ ] transient_nonexcl_queues
required feature flags not listed above: 18
removed deprecated features not listed above: 1
Fixes a pattern matching bug for discards that come in after a consumer
has been cancelled. Because the rabbit_fifo_client does not keep
the integer consumer key after cancellation, late acks, returns, and
discards use the full {CTag, Pid} consumer id version.
As this is a state machine change the machine version has been
increased to 5.
The same bug is also present for the `modify` command; however, as
AMQP does not allow late settlements, it cannot happen there, so we don't
have to make this fix conditional on the machine version.
[Why]
Before this change, the controller was looping on all feature flags to
enable, then for each:
1. it checked if it was supported
2. it acquired the registry lock
3. it enabled the feature flag
4. it released the registry lock
It was done this way to avoid acquiring the lock if the feature flag was
unsupported in the first place.
However, this put more load on the lock mechanism.
[How]
This commit changes the order. The controller acquires the registry lock
once, then loops on feature flags to enable. The support check is now
under the registry lock.
The `dict:dict()` typing of `rabbit_binding` appears to be a historical
artifact. `dict` has been superseded by `maps`. Switching to a map
makes deletions easier to inspect manually and faster. Though if
deletions grow so large that the map representation is important,
manipulation of the deletions is unlikely to be expensive compared to
any other operations that produced them, so performance is probably
irrelevant.
This commit refactors the bottom section of the `rabbit_binding` module
to switch to a map, switch the `deletions()` type to an opaque
(eliminating a TODO created when using Erlang/OTP 17.1), and switch the
deletion value to a record. We eliminate some historical artifacts and "cruft":
* Deletions taking multiple forms needlessly, specifically the shape
`{X, deleted | not_deleted, Bindings, none}` no longer being
handled. `process_deletions/2` was responsible for creating this
shape. Instead we now use a record to clearly define the fields.
* Clauses to catch `{error, not_found}` are unnecessary after minor
refactors of the callers. Removing them makes the type specs cleaner.
* `rabbit_binding:process_deletions/1` has no need to update or change
the deletions. This function now uses `maps:foreach/2` and returns
`ok` instead of the mapped deletions (see the sketch after this list).
* Remove `undefined` from the typespec of deletions. This value is no
longer possible with a refactor to `maybe_auto_delete_exchange_in_*`
functions for Mnesia and Khepri. The value was nonsensical since you
cannot delete bindings for an exchange that does not exist.
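A hedged sketch of the new shape (the module name and record fields are illustrative, not the exact ones introduced by this commit):
```
-module(deletions_sketch).
-export([process_deletions/1]).

%% Deletions as a map keyed by exchange name, with a record as the value.
-record(deletion, {exchange, deleted = false, bindings = []}).

%% Walk the map for side effects only and return ok, instead of returning
%% mapped deletions.
process_deletions(Deletions) when is_map(Deletions) ->
    maps:foreach(
      fun(XName, #deletion{deleted = Deleted, bindings = Bindings}) ->
              io:format("~p: deleted=~p, ~b bindings~n",
                        [XName, Deleted, length(Bindings)])
      end, Deletions),
    ok.
```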
... that considers the local node as if it was reset.
[Why]
When a node joins a cluster, we check its compatibility with the
cluster, reset the node, copy the feature flags states from the remote
cluster and add that node to the cluster.
However, the compatibility check is performed with the current feature
flags states, even though they are about to be reset. Therefore, a node
with an enabled feature flag that is unsupported by the cluster will
refuse to join. It's incorrect because after the reset and the states
copy, it could have joined the cluster just fine.
[How]
We introduce a new variant of `check_node_compatibility/2` that takes an
argument to indicate if the local node should be considered as a virgin
node (i.e. like after a reset).
This way, the joining node will always be able to join, regardless of
its initial feature flags states, as long as it doesn't require a
feature flag that is unsupported by the cluster.
This also removes the need to use `$RABBITMQ_FEATURE_FLAGS` environment
variable to force a new node to leave stable feature flags disabled to
allow it to join a cluster running an older version.
References #9677.
[Why]
`CheckNodesConsistency` is set to false when the
`check_cluster_consistency()` is called as part of a node joining a
cluster. And the generic compatibility check was already executed by
`rabbit_db_cluster`.
There is no need to run it again. This is even counter-productive with
the improvement to `rabbit_feature_flags:check_node_compatibility/2`
that follows.
... with older RabbitMQ versions which don't know about Khepri.
[Why]
When an older node wants to join a cluster, it calls `node_info/0` and
`cluster_status_from_mnesia/0` directly using RPC calls. If it does that
against a node already using Khepri, t will get an error telling it that
Mnesia is not running. The error is reported to the end user, making it
difficult to understand the problem: both nodes are simply incompatible.
It's better to leave the final decision to the Feature flags subsystem,
but for that, `rabbit_mnesia` on the newer Khepri-based node still needs
to return something the older version can accept.
[How]
`cluster_status_from_mnesia/0` and `node_info/0` are modified to verify
if Khepri is enabled and if it is, return a value based on Khepri's
status as if it was from Mnesia.
This lets the remote older node continue all its checks and
eventually refuse to join because the Feature flags subsystem will
indicate they are incompatible.
Reverting back to the default 1 minute. The problem with
3 minutes is that this is exceedingly long and when there
are problems the test time increases exponentially.
All CT logs will now be under <toplevel>/logs. An improved
test workflow would be to always keep the logs/all_runs.html
page open in the browser and refresh it whenever tests are
run in any of the rabbit applications.
And disable those checks in CI for now. We are getting errors
because of the eetcd dependency and we can't upgrade at this
time (see comment in the commit).
The search results record change was done in OTP-25, which is
no longer supported. So we can use the modern search results
record and drop the compatibility clauses.
For more context:
* 8d8847e069
* https://github.com/erlang/otp/pull/5538
The shared test suite was renamed only for clarity, but the
Web-MQTT test suites were renamed out of necessity: since
we are now adding the MQTT test directory to the code path
we need test suites to have different names to avoid
conflicts. We can't (easily) add the path only for this test suite
either, since CT hooks don't call functions in a predictable
enough manner; it would always be hacky.
This is a proof of concept that mostly works but is missing
some tests, such as rabbitmq_mqtt or rabbitmq_cli. It also
doesn't apply to mixed version testing yet.
Otherwise some plugins can't build if we try to run tests
directly after checkout. This is because the plugins
depend on osiris as well as rabbit, but there is no
dep_osiris defined in the plugin itself.
Fixes https://github.com/rabbitmq/rabbitmq-server/issues/12398
To repro this crash:
1. Start RabbitMQ v3.13.7 with feature flag message_containers disabled:
```
make run-broker TEST_TMPDIR="$HOME/scratch/rabbit/test" RABBITMQ_FEATURE_FLAGS=quorum_queue,implicit_default_bindings,virtual_host_metadata,maintenance_mode_status,user_limits,feature_flags_v2,stream_queue,classic_queue_type_delivery_support,classic_mirrored_queue_version,stream_single_active_consumer,direct_exchange_routing_v2,listener_records_in_ets,tracking_records_in_ets
```
In the Management UI:
2. Create a quorum queue with x-delivery-limit=10
3. Publish a message to this queue.
4. Requeue this message two times.
5. ./sbin/rabbitmqctl enable_feature_flag all
6. Stop the node
7. git checkout v4.0.2
8. make run-broker TEST_TMPDIR="$HOME/scratch/rabbit/test"
9. Again in the Management UI, Get Message with Automatic Ack leads to the crash below:
```
[error] <0.1185.0> ** Reason for termination ==
[error] <0.1185.0> ** {function_clause,
[error] <0.1185.0> [{mc_compat,set_annotation,
[error] <0.1185.0> [delivery_count,2,
[error] <0.1185.0> {basic_message,
[error] <0.1185.0> {resource,<<"/">>,exchange,<<>>},
[error] <0.1185.0> [<<"qq1">>],
[error] <0.1185.0> {content,60,
[error] <0.1185.0> {'P_basic',undefined,undefined,
[error] <0.1185.0> [{<<"x-delivery-count">>,long,2}],
[error] <0.1185.0> 2,undefined,undefined,undefined,undefined,undefined,
[error] <0.1185.0> undefined,undefined,undefined,undefined,undefined},
[error] <0.1185.0> none,none,
[error] <0.1185.0> [<<"m1">>]},
[error] <0.1185.0> <<230,146,94,58,177,125,64,163,30,18,177,132,53,207,69,103>>,
[error] <0.1185.0> true}],
[error] <0.1185.0> [{file,"mc_compat.erl"},{line,61}]},
[error] <0.1185.0> {rabbit_fifo_client,add_delivery_count_header,2,
[error] <0.1185.0> [{file,"rabbit_fifo_client.erl"},{line,228}]},
[error] <0.1185.0> {rabbit_fifo_client,dequeue,4,
[error] <0.1185.0> [{file,"rabbit_fifo_client.erl"},{line,211}]},
[error] <0.1185.0> {rabbit_queue_type,dequeue,5,
[error] <0.1185.0> [{file,"rabbit_queue_type.erl"},{line,755}]},
[error] <0.1185.0> {rabbit_misc,with_exit_handler,2,
[error] <0.1185.0> [{file,"rabbit_misc.erl"},{line,526}]},
[error] <0.1185.0> {rabbit_channel,handle_method,3,
[error] <0.1185.0> [{file,"rabbit_channel.erl"},{line,1257}]},
[error] <0.1185.0> {rabbit_channel,handle_cast,2,
[error] <0.1185.0> [{file,"rabbit_channel.erl"},{line,629}]},
[error] <0.1185.0> {gen_server2,handle_msg,2,[{file,"gen_server2.erl"},{line,1056}]}]}
```
The mc annotation `delivery_count` is new and is specifically
used in the header section of AMQP 1.0 messages:
https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-messaging-v1.0-os.html#type-header
Hence, we can ignore this annotation for the old `#basic_message{}`.
Messages, messages_ready and messages_unacknowledged are duplicated
during management stats collection, resulting in internal errors
when sorting queues in the management UI.
These should not be part of rabbit_core_metrics:queue_stats/2.
The shovel violated the AMQP 1.0 spec by sending transfers with settled=true
under sender settle mode unsettled (in case of shovel ack-mode being
on-publish).
For messages published to RabbitMQ, RabbitMQ honors the transfer `settled`
field, no matter what value the sender settle mode was set to in the attach
frame.
Therefore, prior to this commit, a client could send a transfer with
`settled=true` even though sender settle mode was set to `unsettled` in the
attach frame.
This commit enforces that the publisher sets only transfer `settled` values
that are valid per the spec (see the sketch after the list below).
If sender settle mode is:
* `unsettled`, the transfer `settled` flag must be `false`.
* `settled`, the transfer `settled` flag must be `true`.
* `mixed`, the transfer `settled` flag can be `true` or `false`.
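A hedged sketch of the check (the module and function names are hypothetical):
```
-module(settle_mode_sketch).
-export([check_transfer_settled/2]).

%% Validate the transfer's `settled' flag against the sender settle mode
%% negotiated in the attach frame, following the three rules above.
check_transfer_settled(unsettled, true)  -> {error, transfer_must_be_unsettled};
check_transfer_settled(settled,   false) -> {error, transfer_must_be_settled};
check_transfer_settled(_SndSettleMode, _Settled) -> ok.
```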
* Add global histogram metrics for received message sizes per-protocol
fixup: add new files to bazel
fixup: expose message_size_bytes as prometheus classic histogram type
`rabbit_msg_size_metrics` does not use `seshat` any more, but
`counters` directly.
fixup: add msg_size_metrics unit test
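A rough sketch of the kind of `counters`-backed histogram mentioned above (function names and bucket handling are illustrative, not the actual `rabbit_msg_size_metrics` code):
```
-module(msg_size_histogram_sketch).
-export([new/1, observe/2]).

%% One counter slot per bucket upper bound, plus one overflow (+Inf) slot.
new(UpperBounds) ->
    {counters:new(length(UpperBounds) + 1, [write_concurrency]), UpperBounds}.

%% Increment the first bucket whose upper bound is >= the observed size.
observe({CountersRef, UpperBounds}, MessageSize) ->
    counters:add(CountersRef, bucket_index(UpperBounds, MessageSize, 1), 1).

bucket_index([Bound | _], Size, Ix) when Size =< Bound -> Ix;
bucket_index([_ | Bounds], Size, Ix) -> bucket_index(Bounds, Size, Ix + 1);
bucket_index([], _Size, Ix) -> Ix.    %% overflow / +Inf bucket
```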
* Improve message size histogram
1.
Avoid unnecessary time series emitted for stream protocol
The stream protocol cannot observe message sizes.
This commit ensures that the following time series are omitted:
```
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="64"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="256"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="1024"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="4096"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="16384"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="65536"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="262144"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="1048576"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="4194304"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="16777216"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="67108864"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="268435456"} 0
rabbitmq_global_message_size_bytes_bucket{protocol="stream",le="+Inf"} 0
rabbitmq_global_message_size_bytes_count{protocol="stream"} 0
rabbitmq_global_message_size_bytes_sum{protocol="stream"} 0
```
This reduces the number of time series by 15.
2.
Further reduce the number of time series by reducing the number of
buckets. Instead of 13 buckets, emit only 9. Buckets are not
free; each is an extra time series stored.
Prior to this commit:
```
curl -s -u guest:guest localhost:15692/metrics | ag message_size | wc -l
92
```
After this commit:
```
curl -s -u guest:guest localhost:15692/metrics | ag message_size | wc -l
57
```
3.
The emitted metric should be called
`rabbitmq_message_size_bytes_bucket` instead of `rabbitmq_global_message_size_bytes_bucket`.
The latter is poor naming. There is no need to use `global` in
the metric name given that this metric doesn't exist in the old flawed
aggregated metrics.
4.
This commit simplifies module `rabbit_global_counters`.
5.
Avoid garbage collecting the 10-element list of buckets per message
received.
---------
Co-authored-by: Péter Gömöri <peter@84codes.com>
Adds a specific clause to the
`prometheus_rabbitmq_core_metrics_collector:labels` function for when the
associated metric item is a Queue + Exchange combo (`{Queue, Exchange}`).
{release_cursor, Idx} effects promote checkpoints with an index
lower than or _equal_ to the release cursor index. rabbit_fifo was emitting
the smallest active raft index instead, which could cause the log to truncate
one index too many after a checkpoint promotion.
Using `rabbitmq.com` works and redirects to `www.rabbitmq.com`, but it
is preferable to use the canonical domain to have cleaner search
results.
This is important for manpages because we have an HTML copy in the
website.
Given that the default max_message_size got decreased from 128 MiB to 16
MiB in RabbitMQ 4.0 in https://github.com/rabbitmq/rabbitmq-server/pull/11455,
it makes sense to also decrease the default MQTT Maximum Packet Size from 256 MiB to 16 MiB.
Since this change was missed in RabbitMQ 4.0, it is scheduled for RabbitMQ 4.1.
build(deps): bump org.springframework.boot:spring-boot-starter-parent from 3.3.3 to 3.3.4 in /deps/rabbitmq_auth_backend_http/examples/rabbitmq_auth_backend_spring_boot_kotlin
```
<field name="message-annotations" type="fields"/>
```
Prior to this commit, integration tests succeeded because both the Erlang
client and the RabbitMQ server contained a bug.
This bug was noticed by a Java client test suite.
--experimental is no longer particularly fair to Khepri,
which is not enabled by default because of its enormous
scope, and because once enabled, it cannot be disabled.
--opt-in would be a better name, but --experimental
remains for backwards compatibility.
When both are specified, we consider that the
user opts in if at least one of the flags is
set to true.
`delegate:invoke/2` catches errors but not exits of the delegate
process. Another process might query a classic queue's consumers
while the classic queue is being deleted or otherwise terminating, and
that would previously result in an exit of the calling process.
This return value was already possible since a classic queue will return
it during termination if `rabbit_amqqueue:internal_delete/2` fails with
that value.
`rabbit_amqqueue:delete/4` already handles this value and converts it
into a protocol error and channel exit. The other caller (MQTT
processor) will be updated in a child commit.
This commit also replaces eager conversions to protocol errors in
rabbit_classic_queue, rabbit_quorum_queue and rabbit_stream_coordinator:
we should return `{error, timeout}` consistently and not hide it in
protocol errors.
Before:
```
FORMATTER CRASH: {"Waiting for ~ts queues and streams to have quorum+1 replicas online.You can list them with `rabbitmq-diagnostics check_if_node_is_quorum_critical`","\t"}
```
After:
```
Waiting for 9 queues and streams to have quorum+1 replicas online. You can list them with `rabbitmq-diagnostics check_if_node_is_quorum_critical`
```
This reverts commit 620fff22f1.
It introduced a regression in another area - a TCP health check,
such as the default (with cluster-operator) readinessProbe,
on a TLS-enabled instance would log a `rabbit_reader` crash
every few seconds:
```
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> crasher:
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> initial call: rabbit_reader:init/3
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> pid: <0.999.0>
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> registered_name: []
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> exception error: no match of right hand side value {error, handshake_failed}
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> in function rabbit_reader:init/3 (rabbit_reader.erl, line 171)
```
This updates the Khepri FF description to be more correct
and to the point.
It also tweaks the management UI copy so
that it does not recommend against the use of
Khepri in production, as it is much more mature
in 4.0.
interference from other tests sometimes makes
it fail because there is more than one connection.
Compared to most other AMQP 1.0 tests, this one can be
dropped.
For cases where users want to live a bit more dangerously this commit
maps a delivery limit of -1 (or any negative value) such that it
disables the delivery limit and restores the 3.13.x behaviour.
`khepri:fence/0,1,2` queries the leader's Raft index and blocks the
caller for the given (or default) timeout until the local member has
caught up in log replication to that index. We want to do this during
Khepri init to ensure that the local Khepri store is reasonably up to
date before continuing in the boot process and starting listeners.
This is conceptually similar to the call to `mnesia:wait_for_tables/2`
during `rabbit_mnesia:init/0` and should have the same effect.
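A hedged sketch of how such a wait could look (the module name, store id, timeout, and error handling are assumptions; `khepri:fence/2` is the API named above):
```
-module(khepri_fence_sketch).
-export([wait_for_replication/2]).

%% Block until the local Khepri member has caught up with the leader's Raft
%% index, or give up after Timeout.
wait_for_replication(StoreId, Timeout) ->
    case khepri:fence(StoreId, Timeout) of
        ok    -> ok;
        Error -> Error
    end.
```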
This covers a specific case where we need to register projections not
covered by the enable callback of the `khepri_db` feature flag. The
feature flag may be enabled if a node has been part of a cluster which
enabled the flag, but the metadata store might be reset. Upon init the
feature flag will be enabled but the store will be empty and the
projections will not exist, so operations like inserting default data
will fail when asserting that a vhost exists for example.
This fixes the `cluster_management_SUITE:forget_cluster_node_in_khepri/1`
case when running the suite with `RABBITMQ_METADATA_STORE=khepri`, which
fails as mentioned above.
We could always run projection registration when using Khepri, but once
projections are registered the command is idempotent, so there's no need
to, and the commands are somewhat large.
This is a cosmetic change. `?RA_CLUSTER_NAME` is equivalent but is used
for clustering commands. Commands sent via the `khepri`/`khepri_adv`
APIs consistently use the `?STORE_ID` macro instead.
Previously this function threw errors. With this minor refactor we
return them instead so that `register_projection/0` is easier for
callers to work with. (In the child commit we will add another caller.)
This commit is only refactoring.
To avoid confusion with reply and noreply gen_server return values, this
commit uses different return values for handle_frame/2.
## What?
1. Support `handle-max` field in the AMQP 1.0 `begin` frame
2. Add a new setting `link_max_per_session` which defaults to 256.
3. Rename `session_max` to `session_max_per_connection` (see the config sketch below)
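A hedged example of how these limits could be set in an advanced config (the key names come from this commit and the values are the stated defaults; the `rabbit` application section and the exact config surface are assumptions):
```
%% advanced.config sketch
[
 {rabbit, [
           {session_max_per_connection, 64},
           {link_max_per_session,       256}
          ]}
].
```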
## Why?
1. Operators might want to limit the number of links per session. A
similar setting `consumer_max_per_channel` exists for AMQP 0.9.1.
2. We should use RabbitMQ 4.0 as an opportunity to set a sensible
default as to how many links can be active on a given session simultaneously.
The session code does iterate over every link in some scenarios (e.g. the
queue was deleted). At some point, it's better to just open a 2nd
session instead of attaching hundreds or thousands of links to a single session.
A default `link_max_per_session` of 256 should be more than enough given
that `session_max_per_connection` is 64. So, the defaults allow
`256 * 64 = 16,384` links to be active on an AMQP 1.0 connection.
(Operators might want to lower both defaults.)
3. The name is clearer given that we might introduce
`session_max_per_node` in the future since
`channel_max_per_node` exists for AMQP 0.9.1.
### Additional Context
> Link handles MAY be reused once a link is closed for both send and receive.
> To make it easier to monitor AMQP link attach frames, it is RECOMMENDED that
> implementations always assign the lowest available handle to this field.
* Enforce AMQP 1.0 channel-max
Enforce AMQP 1.0 field `channel-max` in the `open` frame by introducing
a new more user friendly setting called `session_max`:
> The channel-max value is the highest channel number that can be used on the connection.
> This value plus one is the maximum number of sessions that can be simultaneously active on the connection.
We set the default value of `session_max` to 64 such that, by
default, RabbitMQ 4.0 allows at most 64 AMQP 1.0 sessions per AMQP 1.0 connection.
More than 64 AMQP 1.0 sessions per connection make little sense.
See also https://www.rabbitmq.com/blog/2024/09/02/amqp-flow-control#session
Limiting the maximum number of sessions per connection can be useful to
protect against
* applications that accidentally open new sessions without ending old sessions
(session leaks)
* too many metrics being exposed, for example in the future via the
"/metrics/per-object" Prometheus endpoint with timeseries per session
being emitted.
This commit does not make use of the existing `channel_max` setting
because:
1. Given that `channel_max = 0` means "no limit", there is no way for an
operator to limit the number of sessions per connections to 1.
2. Operators might want to set different limits for maximum number of
AMQP 0.9.1 channels and maximum number of AMQP 1.0 sessions.
3. The default of `channel_max` is very high: It allows using more than
2,000 AMQP 0.9.1 channels per connection. Lowering this default might
break existing AMQP 0.9.1 applications.
This commit also fixes a bug in the AMQP 1.0 Erlang client which, prior
to this commit, used channel number 1 for the first session. That's wrong
if a broker allows maximum 1 session by replying with `channel-max = 0`
in the `open` frame. Additionally, the spec recommends:
> To make it easier to monitor AMQP sessions, it is RECOMMENDED that implementations always assign the lowest available unused channel number.
Note that in AMQP 0.9.1, channel number 0 has a special meaning:
> The channel number is 0 for all frames which are global to the connection and 1-65535 for frames that
refer to specific channels.
* Apply PR feedback
With the change in the parent commit we no longer set and clear a
runtime parameter when deleting an exchange as part of vhost deletion.
We need to adapt the `audit_vhost_internal_parameter` test case to test
that the parameter is set and cleared when the exchange is deleted
instead.
Currently we delete each exchange one-by-one which requires three
commands: the delete itself plus a put and delete for a runtime
parameter that acts as a lock to prevent a client from declaring an
exchange while it's being deleted. The lock is unnecessary during vhost
deletion because permissions are cleared for the vhost before any
resources are deleted.
We can use a transaction to delete all exchanges and bindings for a
vhost in a single command against the Khepri store. This minimizes the
number of commands we need to send against the store and therefore the
latency of the deletion.
In a quick test with a vhost containing only 10,000 exchanges (no
bindings, queues, users, etc.), this is an order of magnitude speedup:
the prior commit takes 22s to delete the vhost while with this commit
the vhost is deleted in 2s.
Currently we use a combination of `khepri_tx:get_many/1` and then either
`khepri_tx:delete/1` or `khepri_tx:delete_many/1`. This isn't a
functional change: switching to `khepri_tx_adv:delete_many/1` is
essentially equivalent but performs the deletion and lookup all in one
command and one traversal of the tree. This should improve performance
when deleting many bindings in an exchange.
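A hedged sketch of the single-command deletion (the module name and path pattern are illustrative; `khepri_tx_adv:delete_many/1` is the API named above and must run inside a Khepri transaction):
```
-module(binding_deletion_sketch).
-export([delete_bindings_in_tx/1]).

-include_lib("khepri/include/khepri.hrl").

%% Delete every binding under the given exchange path and read the deleted
%% tree nodes back in one command and one tree traversal.
delete_bindings_in_tx(ExchangePath) ->
    khepri_tx_adv:delete_many(ExchangePath ++ [bindings, ?KHEPRI_WILDCARD_STAR]).
```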
[Why]
The previous layout followed the flat structure we have in Mnesia:
* In Mnesia, we have tables named after each purpose (exchanges, queues,
runtime parameters and so on).
* In Khepri, we had about the same but the table names were replaced by
a tree node in the tree. We ended up with one tree node per purpose
at the root of the tree.
Khepri implements a tree. We could benefit from this and organize data
to reflect their relationship in RabbitMQ.
[How]
Here is the new hierarchy implemented by this commit:
rabbitmq
|-- users
| `-- $username
|-- vhosts
| `-- $vhost
| |-- user_permissions
| | `-- $username
| |-- exchanges
| | `-- $exchange
| | |-- bindings
| | | |-- queue
| | | | `-- $queue
| | | `-- exchange
| | | `-- $exchange
| | |-- consistent_hash_ring_state
| | |-- jms_topic
| | |-- recent_history
| | |-- serial
| | `-- user_permissions
| | `-- $username
| |-- queues
| | `-- $queue
| `-- runtime_params
| `-- $param_name
|-- runtime_params
| `-- $param_name
|-- mirrored_supervisors
| `-- $group
| `-- $id
`-- node_maintenance
`-- $node
We first define a root path in `rabbit/include/khepri.hrl` as
`[rabbitmq]`. This could be anything, including an empty path.
All paths are constructed either from this root path definition (users
and vhosts paths do that), or from a parent resource's path (exchanges
and queues paths are based on a vhost path).
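A hedged sketch of that path construction (the module, macro, and function names are illustrative; the root path is the one defined in `rabbit/include/khepri.hrl`):
```
-module(khepri_paths_sketch).
-export([vhost_path/1, exchange_path/2]).

-define(RABBITMQ_KHEPRI_ROOT_PATH, [rabbitmq]).

%% vhost paths are built from the root path definition ...
vhost_path(VHost) ->
    ?RABBITMQ_KHEPRI_ROOT_PATH ++ [vhosts, VHost].

%% ... and exchange paths are built on top of the parent vhost's path.
exchange_path(VHost, XName) ->
    vhost_path(VHost) ++ [exchanges, XName].
```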
[Why]
Currently, `rabbit_db_*` modules use and export the following kind of
functions to return the path to the resources they manage:
khepri_db_thing:khepri_things_path(),
khepri_db_thing:khepri_thing_path(Identifier).
Internally, `khepri_db_thing:khepri_thing_path(Identifier)` appends
`Identifier` to the list returned by
`khepri_db_thing:khepri_things_path()`. This works for the organization
of the records we have today in Khepri:
|-- thing
| |-- <<"identifier1">>
| | <<"identifier2">>
`-- other_thing
`-- <<"other_identifier1">>
However, with the upcoming organization that leverages the tree in
Khepri, identifiers may be in the middle of the path instead of a leaf
component. We may also put `other_thing` under `thing` in the tree.
That's why we can't really expose a parent directory for `thing` and
`other_thing`. Therefore, `khepri_db_thing:khepri_things_path/0` needs
to go away. Only `khepri_db_thing:khepri_thing_path/1` should be
exported and used.
In addition to that, there are several places where paths are hard-coded
(i.e. their definition is duplicated).
[How]
The patch does exactly that. Uses of
`khepri_db_thing:khepri_things_path()` are generally replaced by
`rabbit_db_thing:khepri_thing_path(?KHEPRI_WILDCARD_STAR)`.
Places where the path definitions were duplicated are fixed too by
calling the path building functions.
In the future, for a resource that depends on another one, the
corresponding module will call the `rabbit_db_thing:khepri_thing_path/1`
for that other resource and build its path on top of that.
A while back, @mkuratczyk noted that we keep the timestamp of when a
connection is established in the connection state and related ETS table.
This PR uses the `connected_at` timestamp to calculate the duration of
the connection, to make it easier to identify short-running connections
via the log files.
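A minimal sketch of the calculation, assuming `connected_at` is a millisecond system timestamp (the module name is hypothetical):
```
-module(connection_duration_sketch).
-export([duration_ms/1]).

%% Duration of the connection in milliseconds, given its `connected_at'
%% timestamp (milliseconds since the epoch).
duration_ms(ConnectedAt) ->
    erlang:system_time(millisecond) - ConnectedAt.
```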
By default Ra will use the cluster name as the metrics key. Currently
atom values are ignored by the prometheus plugin's tag rendering
functions, so if you have a QQ and Khepri running and request the
`/metrics/per-object` or `/metrics/detailed` endpoints you'll see values
that don't have labels set for the `ra_metrics` metrics:
# TYPE rabbitmq_raft_term_total counter
# HELP rabbitmq_raft_term_total Current Raft term number
rabbitmq_raft_term_total{vhost="/",queue="qq"} 9
rabbitmq_raft_term_total 10
With this change we map the name of the Ra cluster to a "raft_cluster"
tag, so instead an example metric might be:
# TYPE rabbitmq_raft_term_total counter
# HELP rabbitmq_raft_term_total Current Raft term number
rabbitmq_raft_term_total{vhost="/",queue="qq"} 9
rabbitmq_raft_term_total{raft_cluster="rabbitmq_metadata"} 10
This affects metrics for Khepri and the stream coordinator.
[Why]
If a node failed to join a cluster, `rabbit` was restarted, then the
feature flags were reset and the error was returned. I.e., the error
handling was in a single place at the end of the function.
We need to reset feature flags after a failure because the feature flags
states were copied from the remote node just before the join.
However, resetting them after restarting `rabbit` was incorrect because
feature flags were initialized in a way that didn't match the rest of
the state. This led to crashes during the start of `rabbit`.
[How]
The feature flags are now reset after the failure to join but before
starting `rabbit`.
A new testcase was added to test this scenario.
Because `ct_master` is yet another Erlang node, and it is used
to run multiple CT nodes, meaning it is in a cluster of CT
nodes, the tests that change the net_ticktime could not
work properly anymore. This is because net_ticktime must
be the same value across the cluster.
The same value had to be set for all tests in order to solve
this. This is why it was changed to 5s across the board. The
lower net_ticktime was used in most places to speed up tests
that must deal with cluster failures, so that value is good
enough for these cases.
One test in amqp_client was using the net_ticktime to test
the behavior of the direct connection timeout with varying
net_ticktime configurations. The test now mocks the
`net_kernel:get_net_ticktime()` function to achieve the
same result.
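A hedged sketch of that mocking approach using meck (the module name, setup shape, and the value 5 are illustrative; the actual test may differ):
```
-module(net_ticktime_mock_sketch).
-export([setup_mock/0, teardown_mock/0]).

%% net_kernel is a sticky kernel module, hence the `unstick' option.
setup_mock() ->
    ok = meck:new(net_kernel, [unstick, passthrough]),
    ok = meck:expect(net_kernel, get_net_ticktime, fun() -> 5 end).

teardown_mock() ->
    meck:unload(net_kernel).
```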
This has no real impact on performance[1] but should
make it clear which application can run the broker
and/or publish to Hex.pm. In particular, applications
that we can't run the broker from will now give up
early if we try to.
Note that while the broker can't normally run from the
amqp_client application's directory, it can run from
tests and some of the tests start the broker.
[1] on my machine
No real need to have two files, especially since it contains
only a few variable definitions. The plan is to only keep
separate files for larger features such as dist or run.