This revisits information unit conversion,
that is, support for suffixes like GiB and GB.
When configuration values like disk_free_limit.absolute,
vm_memory_high_watermark.absolute are set, the value
can contain an information unit (IU) suffix.
We now support several new suffixes, and the meaning
of a few existing ones has changed.
First, the changes:
* k, K now mean kilobytes and not kibibytes
* m, M now mean megabytes and not mebibytes
* g, G now mean gigabytes and not gibibytes
This is to match the system used by Kubernetes.
There is no consensus in the industry about how
"k", "m", "g", and similar single letter suffixes
should be treated. Previously they were treated as powers of 2;
now they are powers of 10, to align with a very popular OSS
project that explicitly documents what suffixes it supports.
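For example (illustrative values; with the new semantics the two suffixes
below differ by roughly 7%, since a gigabyte is 10^9 bytes while a gibibyte
is 2^30 bytes):

```
# decimal suffix: 2 * 10^9 bytes
disk_free_limit.absolute = 2GB
# binary suffix: 2 * 2^30 bytes
vm_memory_high_watermark.absolute = 2GiB
```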
Now, the additions:
Finally, the node will now validate these suffixes
at boot time, so an unsupported value will cause
the node to stop with a rabbitmq.conf validation
error.
The message logged will look like this:
```
2024-01-15 22:11:17.829272-05:00 [error] <0.164.0> disk_free_limit.absolute invalid, supported formats: 500MB, 500MiB, 10GB, 10GiB, 2TB, 2TiB, 10000000000
2024-01-15 22:11:17.829376-05:00 [error] <0.164.0> Error preparing configuration in phase validation:
2024-01-15 22:11:17.829387-05:00 [error] <0.164.0> - disk_free_limit.absolute invalid, supported formats: 500MB, 500MiB, 10GB, 10GiB, 2TB, 2TiB, 10000000000
```
Closes #10310
[Why]
Classic queue mirroring will be removed in RabbitMQ 4.0. Quorum queues
provide a better, safer alternative. Non-replicated classic queues remain
supported.
[How]
Classic queue mirroring is marked as deprecated in the code using the
Deprecated features subsystem (based on feature flags). See #7390 for a
description of that subsystem.
To test RabbitMQ behavior as if the feature was removed, the following
configuration setting can be used:
deprecated_features.permit.classic_queue_mirroring = false
To turn off classic queue mirroring, there must be no classic mirrored
queues declared and no HA policy defined. A node with classic mirrored
queues will refuse to start if classic queue mirroring is turned off.
Once classic queue mirroring is turned off, users will not be able to
declare HA policies. Trying to do that from the CLI or the management
API will be rejected with a warning in the logs. This impacts clustering
too: a node with classic queue mirroring turned off will only cluster
with another node which has no HA policy or has classic queue mirroring
turned off.
Note that given the marketing calendar, the deprecated feature will go
directly from "permitted by default" to "removed" in RabbitMQ 4.0. It
won't go through the gradual deprecation process.
V2: Renamed the deprecated feature from `classic_mirrored_queues` to
`classic_queue_mirroring` to better reflect the intention. Otherwise
it could be unclear whether only the mirroring property is
deprecated/removed or classic queues entirely.
This introduces a way to declare deprecated features in the code, not
only in our communication. The new module makes it possible to disallow
the use of a deprecated feature and/or warn users when they rely on such
a feature.
[Why]
Currently, we only tell people about deprecated features through blog
posts and the mailing-list. This might be insufficient to make our users
aware that a feature they use will be removed in a future version:
* They may not read our blog or mailing-list
* They may not understand that they use such a deprecated feature
* They might wait for the big removal before they plan testing
* They might not take it seriously enough
The idea behind this patch is to increase the chance that users notice
that they are using something which is about to be dropped from
RabbitMQ. Another benefit is that they should be able to test how
RabbitMQ will behave in the future before the actual removal. This
should allow them to test and plan changes.
[How]
When a feature is deprecated in other large projects (such as FreeBSD
where I took the idea from), it goes through a lifecycle:
1. The feature is still available, but users get a warning somehow when
they use it. They can disable it to test.
2. The feature is still available, but disabled out-of-the-box. Users
can re-enable it (and get a warning).
3. The feature is disconnected from the build. Therefore, the code
behind it is still there, but users have to recompile the thing to be
able to use it.
4. The feature is removed from the source code. Users have to adapt or
they can't upgrade anymore.
The solution in this patch offers the same lifecycle. A deprecated
feature will be in one of these deprecation phases:
1. `permitted_by_default`: The feature is available. Users get a warning
if they use it. They can disable it from the configuration.
2. `denied_by_default`: The feature is available but disabled by
default. Users get an error if they use it and RabbitMQ behaves like
the feature is removed. They can re-enable it from the configuration
and get a warning.
3. `disconnected`: The feature is present in the source code, but is
disabled and can't be re-enabled without recompiling RabbitMQ. Users
get the same behavior as if the code was removed.
4. `removed`: The feature's code is gone.
The whole thing is based on the feature flags subsystem, but it has the
following differences from other feature flags:
* The semantics are reversed: the feature flag behind a deprecated feature
is disabled when the deprecated feature is permitted, or enabled when
the deprecated feature is denied.
* The feature flag behind a deprecated feature is enabled out-of-the-box
(meaning the deprecated feature is denied):
* if the deprecation phase is `permitted_by_default` and the
configuration denies the deprecated feature
* if the deprecation phase is `denied_by_default` and the
configuration doesn't permit the deprecated feature
* if the deprecation phase is `disconnected` or `removed`
* Feature flags behind deprecated features don't appear in feature flag
listings.
Otherwise, deprecated features' feature flags are managed like other
feature flags, in particular inside clusters.
To declare a deprecated feature:
-rabbit_deprecated_feature(
   {my_deprecated_feature,
    #{deprecation_phase => permitted_by_default,
      msgs => #{when_permitted => "This feature will be removed in RabbitMQ X.0"}
     }}).
Then, to check the state of a deprecated feature in the code:
case rabbit_deprecated_features:is_permitted(my_deprecated_feature) of
    true ->
        %% The deprecated feature is still permitted.
        ok;
    false ->
        %% The deprecated feature is gone or should be considered
        %% unavailable.
        error
end.
Warnings and errors are logged automatically with a generated default
message, but it is possible to define a custom message in the deprecated
feature declaration, as in the example above.
Here is an example of a logged warning that was generated automatically:
Feature `my_deprecated_feature` is deprecated.
By default, this feature can still be used for now.
Its use will not be permitted by default in a future minor RabbitMQ version and the feature will be removed from a future major RabbitMQ version; actual versions to be determined.
To continue using this feature when it is not permitted by default, set the following parameter in your configuration:
"deprecated_features.permit.my_deprecated_feature = true"
To test RabbitMQ as if the feature was removed, set this in your configuration:
"deprecated_features.permit.my_deprecated_feature = false"
To override the default state of `permitted_by_default` and
`denied_by_default` deprecation phases, users can set the following
configuration:
# In rabbitmq.conf:
deprecated_features.permit.my_deprecated_feature = true # or false
The actual behavior protected by a deprecated feature check is out of
scope for this subsystem. It is the responsibility of each deprecated
feature code to determine what to do when the deprecated feature is
denied.
V1: Deprecated feature states are initially computed during the
initialization of the registry, based on their deprecation phase and
possibly the configuration. They don't go through the `enable/1`
code at all.
V2: Manage deprecated feature states as any other non-required
feature flag. This makes it possible to execute an `is_feature_used()`
callback to determine if a deprecated feature can be denied. This
also makes it possible to prevent the RabbitMQ node from starting if it
continues to use a deprecated feature.
V3: Manage deprecated feature states from the registry initialization
again. This is required because we need to know very early if some
of them are denied, so that an upgrade to a version of RabbitMQ
where a deprecated feature is disconnected or removed can be
performed.
To still prevent the start of a RabbitMQ node when a denied
deprecated feature is actively used, we run the `is_feature_used()`
callback of all denied deprecated features as part of the
`sync_cluster()` task. This task is executed as part of a feature
flag refresh executed when RabbitMQ starts or when plugins are
enabled. So even though a deprecated feature is marked as denied in
the registry early in the boot process, we will still abort the
start of a RabbitMQ node if the feature is used.
V4: Support context-dependent warnings. It is now possible to set a
specific message when a deprecated feature is permitted, when it is
denied and when it is removed. Generic per-context messages are
still generated.
V5: Improve default warning messages, thanks to @pstack2021.
V6: Rename the configuration variable from `permit_deprecated_features.*`
to `deprecated_features.permit.*`. As @michaelklishin said, we tend
to use shorter top-level names.
as it categorises more nicely if there is ever a future
"message_interceptors.outgoing.*" key.
We leave the advanced config file key because simple single-value
settings should not require using the advanced config file.
As reported in https://groups.google.com/g/rabbitmq-users/c/x8ACs4dBlkI/
plugins that implement rabbit_channel_interceptor break with
Native MQTT in 3.12 because Native MQTT does not use rabbit_channel anymore.
Specifically, these plugins don't work anymore in 3.12 when sending a message
from an MQTT publisher to an AMQP 0.9.1 consumer.
Two of these plugins are
https://github.com/rabbitmq/rabbitmq-message-timestamp
and
https://github.com/rabbitmq/rabbitmq-routing-node-stamp
This commit moves both plugins into rabbitmq-server.
Therefore, these plugins are deprecated starting in 3.12.
Instead of using these plugins, the user gets the same behaviour by
configuring rabbitmq.conf as follows:
```
incoming_message_interceptors.set_header_timestamp.overwrite = false
incoming_message_interceptors.set_header_routing_node.overwrite = false
```
While the two plugins could not be used together, this commit
allows setting both headers.
We name the top level configuration key `incoming_message_interceptors`
because only incoming messages are intercepted.
Currently, only `set_header_timestamp` and `set_header_routing_node` are
supported. (We might support more in the future.)
Both can set `overwrite` to `false` or `true`.
The meaning of `overwrite` is the same as documented in
https://github.com/rabbitmq/rabbitmq-message-timestamp#always-overwrite-timestamps
i.e. whether headers should be overwritten if they are already present
in the message.
Both `set_header_timestamp` and `set_header_routing_node` behave exactly
like the `rabbitmq-message-timestamp` and `rabbitmq-routing-node-stamp`
plugins, respectively.
Upon node boot, the configuration is put into persistent_term to not
cause any performance penalty in the default case where these settings
are disabled.
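A minimal sketch of that pattern, in the spirit of the examples above
(the persistent_term key name and term shape are illustrative, not
necessarily what the implementation uses internally):

```
%% At boot, after parsing rabbitmq.conf:
ok = persistent_term:put(incoming_message_interceptors,
                         [{set_header_timestamp, false}]).  %% {Name, Overwrite}

%% Per incoming message; persistent_term:get/2 is a constant-time,
%% copy-free read, so the default (empty) case costs almost nothing:
case persistent_term:get(incoming_message_interceptors, []) of
    []           -> no_interceptors_configured;
    Interceptors -> {run_interceptors, Interceptors}
end.
```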
The channel and MQTT connection process will intercept incoming messages
and - if configured - add the desired AMQP 0.9.1 headers.
For now, this allows using Native MQTT in 3.12 with the old plugins'
behaviour.
In the future, once "message containers" are implemented,
we can think about more generic message interceptors where plugins can be
written to modify arbitrary headers or message contents for various protocols.
Likewise, in the future, once MQTT 5.0 is implemented, we can think
about an MQTT connection interceptor which could function similarly to a
`rabbit_channel_interceptor`, allowing any MQTT packet to be modified.
This exposes the Raft failure detector poll interval as
`raft.adaptive_failure_detector.poll_interval`.
On systems under peak load, inter-node communication link congestion
can result in false positives and trigger QQ leader re-elections that
are unnecessary and could make the situation worse.
Using a higher poll interval would at least reduce the probability of
false positives.
Per discussion with @kjnilsson @mkuratczyk.
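For example, in rabbitmq.conf (the value is purely illustrative, and the
unit is assumed to be milliseconds):

```
raft.adaptive_failure_detector.poll_interval = 10000
```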
Limits are defined in the instance config:
default_limits.vhosts.1.pattern = ^device
default_limits.vhosts.1.max_connections = 10
default_limits.vhosts.1.max_queues = 10
default_limits.vhosts.2.pattern = ^system
default_limits.vhosts.2.max_connections = 100
default_limits.vhosts.3.pattern = .*
default_limits.vhosts.3.max_connections = 20
default_limits.vhosts.3.max_queues = 20
Here, `pattern` is a regular expression used to match limits to a newly
created vhost, and the limits are non-negative integers. The first matching
set of limits is applied, only once, during vhost creation.
This is meant to be used by deployment tools, core features and plugins
that expect a certain minimum number of cluster nodes to be present.
For example, certain setup steps in distributed plugins might require
at least three nodes to be available.
This is just a hint, not an enforced requirement. The default value is 1,
so there are no behavior changes for single node clusters.
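A minimal sketch of how such a hint could look in rabbitmq.conf, assuming
the setting is exposed as `cluster_formation.target_cluster_size_hint`
(the key name is an assumption, not stated above):

```
# assumed key name; tells the node that a 3-node cluster is expected
cluster_formation.target_cluster_size_hint = 3
```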
The classic local filesystem source is still supported
using the same traditional configuration key, load_definitions.
Configuration schema follows peer discovery in spirit:
* definitions.import_backend configures the mechanism to use,
which can be a module provided by a plugin
* definitions.* keys can be defined by plugins and contain any
keys a specific mechanism needs
For example, the classic local filesystem source can now be
configured like this:
``` ini
definitions.import_backend = local_filesystem
definitions.local.path = /path/to/definitions.d/definition.json
```
An HTTPS source can be configured similarly:

``` ini
definitions.import_backend = https
definitions.https.url = https://hostname/path/to/definitions.json
```
HTTPS may require additional configuration keys related to TLS/x.509
peer verification. Such extra keys will be added as the need for them
becomes evident.
References #3249
On initial cluster formation, only one node in a multi node cluster
should initialize the Mnesia database schema (i.e. form the cluster).
To ensure that for nodes starting up in parallel,
RabbitMQ peer discovery backends have used
either locks or randomized startup delays.
Locks work great: When a node holds the lock, it either starts a new
blank node (if there is no other node in the cluster), or it joins
an existing node. This makes it impossible to have two nodes forming
the cluster at the same time.
Consul and etcd peer discovery backends use locks. The lock is acquired
in the consul and etcd infrastructure, respectively.
For other peer discovery backends (classic, DNS, AWS), randomized
startup delays were used. They work well enough in most cases.
However, in https://github.com/rabbitmq/cluster-operator/issues/662 we
observed that in 1% - 10% of the cases (the more nodes or the
smaller the randomized startup delay range, the higher the chances), two
nodes decide to form the cluster. That's bad since it ends up as a
single Erlang cluster, but two RabbitMQ clusters. Even worse, no
obvious alert is triggered and no error message is logged.
To solve this issue, one could increase the randomized startup delay
range from e.g. 0m - 1m to 0m - 3m. However, this makes initial cluster
formation very slow since it will take up to 3 minutes until
every node is ready. In rare cases, we still end up with two nodes
forming the cluster.
Another way to solve the problem is to name a dedicated node to be the
seed node (forming the cluster). This was explored in
https://github.com/rabbitmq/cluster-operator/pull/689 and works well.
Two minor downsides to this approach are: 1. If the seed node never
becomes available, the whole cluster won't be formed (which is okay),
and 2. it doesn't integrate with existing dynamic peer discovery backends
(e.g. K8s, AWS) since nodes are not yet known at deploy time.
In this commit, we take a better approach: We remove randomized startup
delays altogether. We replace them with locks. However, instead of
implementing our own lock in an external system (e.g. in K8s),
we re-use Erlang's built-in locking mechanism, global:set_lock/3.
global:set_lock/3 has some convenient properties:
1. It accepts a list of nodes to set the lock on.
2. The nodes in that list connect to each other (i.e. create an Erlang
cluster).
3. The method is synchronous with a timeout (number of retries). It
blocks until the lock becomes available.
4. If a process that holds a lock dies, or the node goes down, the lock
held by the process is deleted.
The list of nodes passed to global:set_lock/3 corresponds to the nodes
the peer discovery backend discovers (lists).
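A minimal sketch of that locking pattern (the lock resource name, retry
count, and `form_or_join_cluster/1` helper are illustrative, not the actual
peer discovery code):

```
%% Nodes is the list returned by the peer discovery backend.
LockId = {rabbitmq_cluster_formation, node()},
case global:set_lock(LockId, Nodes, _Retries = 60) of
    true ->
        try
            %% safe to decide whether to form a new cluster or join one
            form_or_join_cluster(Nodes)
        after
            true = global:del_lock(LockId, Nodes)
        end;
    false ->
        {error, failed_to_acquire_startup_lock}
end.
```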
Two special cases worth mentioning:
1. That list can be all desired nodes in the cluster
(e.g. in classic peer discovery where nodes are known at
deploy time) while only a subset of nodes is available.
In that case, global:set_lock/3 still sets the lock without
blocking until all nodes can be connected to. This is good since
nodes might start sequentially (non-parallel).
2. In dynamic peer discovery backends (e.g. K8s, AWS), this
list can be just a subset of desired nodes since nodes might not start up
in parallel. That's also not a problem as long as the following
requirement is met: "The peer discovery backend does not list two disjoint
sets of nodes (on different nodes) at the same time."
For example, in a 2-node cluster, the peer discovery backend must not
list only node 1 on node 1 and only node 2 on node 2.
Existing peer discovery backends fulfil that requirement because the
resource the nodes are discovered from is global.
For example, in K8s, once node 1 is part of the Endpoints object, it
will be returned on both node 1 and node 2.
Likewise, in AWS, once node 1 started, the described list of instances
with a specific tag will include node 1 when the AWS peer discovery backend
runs on node 1 or node 2.
Removing randomized startup delays also makes cluster formation
considerably faster (up to 1 minute faster if that was the
upper bound in the range).
The design of the rabbit_ct_config_schema helper makes it impossible to
do pattern matching and thus handle default values in the schema. As a
consequence, the helper explicitly removes the `{rabbit, {log, _}}`
configuration key to work around this limitation until a proper solution
is implemented and all testsuites rewritten. See
rabbitmq/rabbitmq-ct-helpers@b1f1f1ce68.
Therefore, we can't test log configuration variables anymore using this
helper. That's ok because logging_SUITE already tests many things.