This makes `rename_cluster_node`, the command that renames cluster members,
a no-op. This command is really complex under
the hood and is fundamentally incompatible
with a few key Raft-based features:
* Khepri
* Quorum queues
* Streams
Because Khepri first ships in RabbitMQ 3.13,
now is the time to effectively eliminate this
command.
It will be permanently removed together with
other deprecated CLI commands in 4.0.
Per discussion with the team.
Closes #10367.
This revisits the information system conversion,
that is, support for suffixes like GiB, GB.
When configuration values like disk_free_limit.absolute,
vm_memory_high_watermark.absolute are set, the value
can contain an information unit (IU) suffix.
We now support several new suffixes, and the meaning
of a few existing ones has changed.
First, the changes:
* k, K now mean kilobytes and not kibibytes
* m, M now mean megabytes and not mebibytes
* g, G now mean gigabytes and not gibibytes
This is to match the system used by Kubernetes.
There is no consensus in the industry about how
"k", "m", "g", and similar single letter suffixes
should be treated. Previously they were interpreted as powers of 2;
now they are powers of 10, to align with a very popular OSS
project that explicitly documents which suffixes it supports.
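To make the new semantics concrete, here is a purely illustrative Erlang
sketch (the module and function are hypothetical, not RabbitMQ's actual
parser) of how decimal suffixes are now interpreted as powers of 10 while
the "i" (IEC) suffixes keep their power-of-2 meaning:
````
-module(iu_example).
-export([to_bytes/2]).

to_bytes(N, "k")   -> N * 1000;                %% kilobytes (was 1024 before)
to_bytes(N, "M")   -> N * 1000 * 1000;         %% megabytes
to_bytes(N, "GB")  -> N * 1000 * 1000 * 1000;  %% gigabytes
to_bytes(N, "MiB") -> N * 1024 * 1024;         %% mebibytes
to_bytes(N, "GiB") -> N * 1024 * 1024 * 1024;  %% gibibytes
to_bytes(N, "")    -> N.                       %% no suffix: plain bytes
````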
Now, the additions: several new suffixes are recognized; the supported
formats can be seen in the validation error message quoted below.
Finally, the node will now validate these suffixes
at boot time, so an unsupported value will cause
the node to stop with a rabbitmq.conf validation
error.
The message logged will look like this:
````
2024-01-15 22:11:17.829272-05:00 [error] <0.164.0> disk_free_limit.absolute invalid, supported formats: 500MB, 500MiB, 10GB, 10GiB, 2TB, 2TiB, 10000000000
2024-01-15 22:11:17.829376-05:00 [error] <0.164.0> Error preparing configuration in phase validation:
2024-01-15 22:11:17.829387-05:00 [error] <0.164.0> - disk_free_limit.absolute invalid, supported formats: 500MB, 500MiB, 10GB, 10GiB, 2TB, 2TiB, 10000000000
````
Closes #10310
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only, even though we had hit many
issues caused by the fact that `join_cluster` and peer discovery use
different code paths to create a cluster.
To add support for Khepri, the first version of this patch moved the
call to `rabbit_peer_discovery:maybe_create_cluster/1` so that it is made
from `rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.
Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could lead to situations where
multiple clusters are created instead of one, unless out-of-band
counter-measures are used, like the 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.
After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.
[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.
We wanted the new algorithm to fulfill the following properties:
1. `rabbit_peer_discovery` should provide the ability to re-trigger it
easily to re-evaluate the cluster. The new public API is
`rabbit_peer_discovery:sync_desired_cluster/0`.
2. The selection of the node to join should be designed in a way that
all nodes select the same one, regardless of the order in which they
become available. The adopted solution is to sort the list of
discovered nodes with the following criteria, in that order (a small
sorting sketch is shown after this list):
   1. the size of the cluster a discovered node is part of; sorted from
      bigger to smaller clusters
   2. the start time of a discovered node; sorted from older to younger
      nodes
   3. the name of a discovered node; sorted alphabetically
The first node in that list will not join anyone and simply proceeds
with its boot process. Other nodes will try to join the first node.
3. To reduce the chance of incorrectly having multiple standalone nodes
because the discovery backend returned only a single node, we want to
apply the following constraints to the list of nodes after it is
filtered and sorted (see property 2 above):
* The list must contain `node()` (i.e. the node running peer
discovery itself).
* If RabbitMQ's cluster size hint is greater than 1, the list
must have at least two nodes. The cluster size hint is the maximum
between the configured target cluster size hint and the number of
elements in the nodes list returned by the backend.
If one of the constraints is not met, the entire peer discovery
process is restarted after a delay.
4. The lock is acquired only to protect the actual join, not the
discovery step where the backend is queried to get the list of peers.
With the node selection described above, this lets the first node
start without acquiring the lock.
5. The cluster membership views queried as part of the algorithm to sort
the list of nodes will be used to detect additional clusters or
standalone nodes that did not cluster correctly. These nodes will be
asked to re-evaluate peer discovery to increase the chance of forming
a single cluster.
6. After some delay, peer discovery will be re-evaluated to further
eliminate the chances of having multiple clusters instead of one.
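Here is the small sorting sketch referenced in property 2. It is
illustrative only: the module name and map keys are hypothetical and do
not match the actual `rabbit_peer_discovery` data structures.
````
-module(pd_sort_example).
-export([sort_nodes/1]).

%% Each entry describes one discovered node:
%% #{node => Name, cluster_size => N, start_time => T}.
sort_nodes(Nodes) ->
    lists:sort(
      fun(#{cluster_size := SizeA, start_time := TimeA, node := NodeA},
          #{cluster_size := SizeB, start_time := TimeB, node := NodeB}) ->
              %% Bigger cluster first, then older node, then node name.
              {-SizeA, TimeA, NodeA} =< {-SizeB, TimeB, NodeB}
      end,
      Nodes).
````
Every node that computes this order over the same input arrives at the
same head of the list: that node proceeds with its boot on its own, and
all other nodes try to join it.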
This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.
If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval
The default parameters were bumped to 30 retries with a delay of 1
second between each.
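The retry semantics can be pictured with the following minimal sketch;
the module and function names are illustrative, not the real
`rabbit_peer_discovery` API.
````
-module(pd_retry_example).
-export([run_with_retries/3]).

%% Run Fun() up to Limit times, sleeping IntervalMs between attempts,
%% e.g. run_with_retries(Fun, 30, 1000) for the new default settings.
run_with_retries(Fun, Limit, IntervalMs) when Limit >= 1 ->
    case Fun() of
        ok ->
            ok;
        {error, _} = Error when Limit =:= 1 ->
            Error;
        {error, _} ->
            timer:sleep(IntervalMs),
            run_with_retries(Fun, Limit - 1, IntervalMs)
    end.
````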
The locking retries/interval parameters are not used by the new
algorithm anymore.
There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
make sure we use the same backend throughout the entire process and
when we call `maybe_unregister/0` even if the configuration changed
for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
the end of a successful peer discovery process. `rabbit_db_cluster`
had to call `maybe_register/0` if the node was not virgin anyway. So
make it simpler and always call it in `rabbit_db_cluster` regardless
of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
directly. There is no need to explicitly call another function for
that.
* Messages are logged using `?LOG_*()` macros instead of the old
`rabbit_log` module.
[Why]
Up until now, a user had to run the following three commands to expand a
cluster:
1. stop_app
2. join_cluster
3. start_app
Stopping and starting the `rabbit` application and taking care of the
underlying Mnesia application could be handled by `join_cluster`
directly.
[How]
After the call to `can_join/1` and before proceeding with the actual
join, the code remembers the state of `rabbit`, the feature flags
controller and Mnesia.
After the join, it restarts whatever needs to be restarted. It does
so regardless of the success or failure of the join. One exception is
when the node switched from Mnesia to Khepri as part of that join. In
this case, Mnesia is left stopped.
[Why]
When a Khepri-based node joins a Mnesia-based cluster, it is reset and
switches back from Khepri to Mnesia. If there are Mnesia files left in
its data directory, Mnesia will restart with stale/incorrect data and
the operation will fail.
After a migration to Khepri, we need to make sure there are no stale
Mnesia files.
[How]
We use `rabbit_mnesia` to query the Mnesia files and delete them.
Providing a pre-hashed and salted password is
not significantly more secure but satisfies those
who cannot pass clear text passwords on the command
line for regulatory reasons.
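For illustration, here is a minimal sketch of computing such a salted,
pre-hashed value, assuming the default SHA-256 password hashing scheme
(a random 4-byte salt prepended to the hash of salt plus password, then
Base64-encoded); the module name is hypothetical, so verify the scheme
against the password documentation for your RabbitMQ version.
````
-module(password_hash_example).
-export([hash/1]).

hash(Password) when is_binary(Password) ->
    Salt = crypto:strong_rand_bytes(4),
    Hash = crypto:hash(sha256, <<Salt/binary, Password/binary>>),
    base64:encode(<<Salt/binary, Hash/binary>>).
````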
Note that the optimal way of seeding users is still
definition import on node boot, not scripting with
CLI tools.
Closes #9166
Because both `add_member` and `grow` default to the membership status `promotable`,
new members will have to catch up before they are considered cluster members.
This can be overridden with either the `voter` or the (permanent) `non_voter` status.
The latter is useless without additional tooling, so it is kept undocumented.
- non-voters do not affect quorum size for election purposes
- `observer_cli` reports their status with lowercase 'f'
- `rabbitmq-queues check_if_node_is_quorum_critical` takes voter status into
account
[Why]
Mnesia is a very powerful and convenient tool for Erlang applications:
it is a persistent disc-based database, it handles replication across
multiple Erlang nodes and it is available out-of-the-box from the
Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
metadata:
* virtual hosts' properties
* internal users
* queue, exchange and binding declarations (not the data stored in queues)
* runtime parameters and policies
* ...
Unfortunately Mnesia makes it difficult to handle network partitions and,
as a consequence, the merge conflicts between Erlang nodes once the
network partition is resolved. RabbitMQ provides several partition
handling strategies but they are not bullet-proof. Users still hit
situations where it is a pain to repair a cluster following a network
partition.
[How]
@kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
already uses successfully to implement quorum queues and streams for
instance. Those queues do not suffer from network partitions.
We created Khepri [2], a new persistent and replicated database engine
based on Ra and we want to use it in place of Mnesia in RabbitMQ to
solve the problems with network partitions.
This patch integrates Khepri as an experimental feature. When enabled,
RabbitMQ will store all its metadata in Khepri instead of Mnesia.
This change comes with behavior changes. While Khepri remains disabled,
you should see no changes to the behavior of RabbitMQ. If there are
changes, it is a bug. After Khepri is enabled, there are significant
changes of behavior that you should be aware of.
Because it is based on the Raft consensus algorithm, when there is a
network partition, only the cluster members that are in the partition
with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes
can "make progress". In other words, only those nodes may write to the
Khepri database and read from the database and expect a consistent
result.
For instance in a cluster of 5 RabbitMQ nodes:
* If there are two partitions, one with 3 nodes, one with 2 nodes, only
the group of 3 nodes will be able to write to the database.
* If there are three partitions, two with 2 nodes, one with 1 node, none
of the group can write to the database.
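To make the arithmetic explicit, the rule can be written as a tiny
illustrative helper:
````
-module(quorum_example).
-export([can_make_progress/2]).

%% A partition can make progress only if it contains a majority of the
%% cluster members, i.e. at least (ClusterSize div 2) + 1 nodes.
%% can_make_progress(3, 5) -> true; can_make_progress(2, 5) -> false.
can_make_progress(PartitionSize, ClusterSize) ->
    PartitionSize >= (ClusterSize div 2) + 1.
````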
Because the Khepri database will be used for all kinds of metadata, it
means that RabbitMQ nodes that can't write to the database will be
unable to perform some operations. A list of operations and what to
expect is documented in the associated pull request and the RabbitMQ
website.
This requirement from Raft also affects the startup of RabbitMQ nodes in
a cluster. Indeed, at least a quorum number of nodes must be started at
once to allow nodes to become ready.
To enable Khepri, you need to enable the `khepri_db` feature flag:
    rabbitmqctl enable_feature_flag khepri_db
When the `khepri_db` feature flag is enabled, the migration code
performs the following two tasks:
1. It synchronizes the Khepri cluster membership from the Mnesia
cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
the `khepri_mnesia_migration` application [3].
2. It copies data from relevant Mnesia tables to Khepri, doing some
conversion if necessary on the way. Again, it uses
`mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
it.
This can be performed on a running standalone RabbitMQ node or cluster.
Data will be migrated from Mnesia to Khepri without any service
interruption. Note that during the migration, the performance may
decrease and the memory footprint may go up.
Because this feature flag is considered experimental, it is not enabled
by default even on a brand new RabbitMQ deployment.
More about the implementation details below:
In the past months, all accesses to Mnesia were isolated in a collection
of `rabbit_db*` modules. This is where the integration of Khepri mostly
takes place: we use a function called `rabbit_khepri:handle_fallback/1`
which selects the database and performs the query or the transaction.
Here is an example from `rabbit_db_vhost`:
* Up until RabbitMQ 3.12.x:
    get(VHostName) when is_binary(VHostName) ->
        get_in_mnesia(VHostName).
* Starting with RabbitMQ 3.13.0:
    get(VHostName) when is_binary(VHostName) ->
        rabbit_khepri:handle_fallback(
          #{mnesia => fun() -> get_in_mnesia(VHostName) end,
            khepri => fun() -> get_in_khepri(VHostName) end}).
This `rabbit_khepri:handle_fallback/1` function relies on two things:
1. the fact that the `khepri_db` feature flag is enabled, in which case
it always executes the Khepri-based variant.
2. the ability or not to read and write to Mnesia tables otherwise.
Before the feature flag is enabled, or during the migration, the
function will try to execute the Mnesia-based variant. If it succeeds,
then it returns the result. If it fails because one or more Mnesia
tables can't be used, it restarts from scratch: it means the feature
flag is being enabled and depending on the outcome, either the
Mnesia-based variant will succeed (the feature flag couldn't be enabled)
or the feature flag will be marked as enabled and it will call the
Khepri-based variant. The meat of this function really lives in the
`khepri_mnesia_migration` application [3] and
`rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
about the feature flag.
However, some calls to the database do not depend on the existence of
Mnesia tables, such as functions where we need to learn about the
members of a cluster. For those, we can't rely on exceptions from
Mnesia. Therefore, we just look at the state of the feature flag to
determine which database to use. There are two situations though:
* Sometimes, we need the feature flag state query to block because the
function interested in it can't return a valid answer during the
migration. Here is an example:
    case rabbit_khepri:is_enabled(RemoteNode) of
        true  -> can_join_using_khepri(RemoteNode);
        false -> can_join_using_mnesia(RemoteNode)
    end
* Sometimes, we need the feature flag state query to NOT block (for
instance because it would cause a deadlock). Here is an example:
    case rabbit_khepri:get_feature_state() of
        enabled -> members_using_khepri();
        _       -> members_using_mnesia()
    end
Direct accesses to Mnesia still exist. They are limited to code that is
specific to Mnesia, such as classic queue mirroring or network partition
handling strategies.
Now, to discover the Mnesia tables to migrate and how to migrate them,
we use an Erlang module attribute called
`rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
tables and an associated converter module. Here is an example in the
`rabbitmq_recent_history_exchange` plugin:
    -rabbit_mnesia_tables_to_khepri_db(
      [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).
The converter module — `rabbit_db_rh_exchange_m2k_converter` in this
example — is in fact a "sub" converter module called by
`rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
converter module to learn more about these modules.
[1] https://github.com/rabbitmq/ra
[2] https://github.com/rabbitmq/khepri
[3] https://github.com/rabbitmq/khepri_mnesia_migration
See #7206.
Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
[Why]
The CLI may be used against a remote node running a different version.
We took that into account in several uses of the `rabbit_db*` modules on
remote nodes, but not everywhere. Likewise in the
`clustering_management_SUITE` testsuite.
[How]
This patch falls back to the previous `rabbit_mnesia`-based calls if the
initial call throws an `undef` exception.
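A minimal Erlang sketch of that fallback pattern (the module, function
and arguments are illustrative; the real CLI code is written in Elixir):
````
-module(cli_fallback_example).
-export([cluster_members/1]).

cluster_members(Node) ->
    case rpc:call(Node, rabbit_db_cluster, members, []) of
        {badrpc, {'EXIT', {undef, _}}} ->
            %% The remote node runs an older RabbitMQ without
            %% rabbit_db_cluster: fall back to the Mnesia-specific API.
            rpc:call(Node, rabbit_mnesia, cluster_nodes, [all]);
        Members ->
            Members
    end.
````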
[Why]
`rabbit_mnesia` indirectly checks that `rabbit` is stopped on the remote
node because `mnesia:del_table_copy()` requires that Mnesia is stopped
to delete the schema. However, this is not specific to Mnesia and we
want `rabbit` to be stopped when we use Khepri in the future.
[How]
We use `rabbit:is_running(Node)` to query the status of RabbitMQ on the
remote node to forget. This is not atomic so there is a small chance
that RabbitMQ is restarted between the check and the actual forget.
Note: `rabbit_mnesia` also removes some queues and emits a "left cluster"
event after a successful forget. However, this part was not moved
because other parts of the module rely on this in RPC calls. To keep
nodes compatible, the calls are left in place. They will be duplicated
for Khepri.
[Why]
When a single node or a cluster is initialized, we go through a few
steps which are not Mnesia-specific or even related. For instance, we
verify that both ends are compatible w.r.t. feature flags.
When we introduce Khepri, we will have to go through the same
generic steps. Therefore, it makes sense to drive those steps from
`rabbit_db_cluster` and only call into `rabbit_mnesia` when needed.
[How]
The generic code is moved from `rabbit_mnesia` to `rabbit_db_cluster`.
Introduce 'ctl update_vhost_metadata'
that can be used to update the description, tags or default queue type of
any existing virtual host.
Closes #7912, #7857.
#7912 will need an HTTP API counterpart change.
Bazel build files are now maintained primarily with `bazel run
gazelle`. This will analyze and merge changes into the build files as
necessitated by certain code changes (e.g. the introduction of new
modules).
In some cases there are hints to gazelle in the build files, such as `#
gazelle:erlang...` or `# keep` comments. xref checks on plugins that
depend on the cli are a good example.
* Fetch all prod cli deps with bazel
This avoids issues with hex and OTP 26, and is needed for offline
bazel builds anyway
* Fetch test cli deps with bazel
* mix format
vhost_precondition_failed => vhost_limit_exceeded
vhost_limit_exceeded is the error type used by
definition import when a per-vhost limit is exceeded.
It feels appropriate for this case, too.
This new module sits on top of `rabbit_mnesia` and provides an API with
all cluster-related functions.
`rabbit_mnesia` should be called directly only from Mnesia-specific code,
for instance `rabbit_mnesia_rename` or classic mirrored queues.
Otherwise, `rabbit_db_cluster` must be used.
Several modules, in particular in `rabbitmq_cli`, continue to call
`rabbit_mnesia` as a fallback option if the `rabbit_db_cluster` module
is unavailable. This will be the case when the CLI interacts with an
older RabbitMQ version.
This will help with the introduction of a new database backend.
This is the latest commit in the series; it fixes (almost) all the
problems with missing and circular dependencies for typing.
The remaining unsolved problems are:
- `lg` dependency for `rabbit` - the problem is that it's the only
dependency that contains a NIF, and there is no way to make dialyzer
ignore it - it looks like the `unknown` check is not suppressible by
dialyzer directives. In the future, making `lg` a proper dependency could
be a good thing anyway.
- some missing Elixir functions in `rabbitmq_cli` (CSV, JSON and
logging related).
- `eetcd` dependency for `rabbitmq_peer_discovery_etcd` - this one
uses sub-directories in `src/`, which confuses dialyzer (or our bazel
machinery is not able to properly handle it). I've tried the latest
rules_erlang which flattens directory for .beam files, but it wasn't
enough for dialyzer - it wasn't able to find core erlang files. This
is a niche plugin and an unusual dependency, so probably not worth
investigating further.