[Why]
When a Khepri-based node joins a Mnesia-based cluster, it is reset and
switches back from Khepri to Mnesia. If there are Mnesia files left in
its data directory, Mnesia will restart with stale/incorrect data and
the operation will fail.
After a migration to Khepri, we need to make sure there is no stale
Mnesia files.
[How]
We use `rabbit_mnesia` to query the Mnesia files and delete them.
Providing a pre-hashed and salted password is
not significantly more secure but satisfies those
who cannot pass clear text passwords on the command
line for regulatory reasons.
Note that the optimal way of seeding users is still
definition import on node boot, not scripting with
CLI tools.
Closes#9166
Because both `add_member` and `grow` default to Membership status `promotable`,
new members will have to catch up before they are considered cluster members.
This can be overridden with either `voter` or (permanent `non_voter` statuses.
The latter one is useless without additional tooling so kept undocumented.
- non-voters do not affect quorum size for election purposes
- `observer_cli` reports their status with lowercase 'f'
- `rabbitmq-queues check_if_node_is_quorum_critical` takes voter status into
account
[Why]
Mnesia is a very powerful and convenient tool for Erlang applications:
it is a persistent disc-based database, it handles replication accross
multiple Erlang nodes and it is available out-of-the-box from the
Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
metadata:
* virtual hosts' properties
* intenal users
* queue, exchange and binding declarations (not queues data)
* runtime parameters and policies
* ...
Unfortunately Mnesia makes it difficult to handle network partition and,
as a consequence, the merge conflicts between Erlang nodes once the
network partition is resolved. RabbitMQ provides several partition
handling strategies but they are not bullet-proof. Users still hit
situations where it is a pain to repair a cluster following a network
partition.
[How]
@kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
already uses successfully to implement quorum queues and streams for
instance. Those queues do not suffer from network partitions.
We created Khepri [2], a new persistent and replicated database engine
based on Ra and we want to use it in place of Mnesia in RabbitMQ to
solve the problems with network partitions.
This patch integrates Khepri as an experimental feature. When enabled,
RabbitMQ will store all its metadata in Khepri instead of Mnesia.
This change comes with behavior changes. While Khepri remains disabled,
you should see no changes to the behavior of RabbitMQ. If there are
changes, it is a bug. After Khepri is enabled, there are significant
changes of behavior that you should be aware of.
Because it is based on the Raft consensus algorithm, when there is a
network partition, only the cluster members that are in the partition
with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes
can "make progress". In other words, only those nodes may write to the
Khepri database and read from the database and expect a consistent
result.
For instance in a cluster of 5 RabbitMQ nodes:
* If there are two partitions, one with 3 nodes, one with 2 nodes, only
the group of 3 nodes will be able to write to the database.
* If there are three partitions, two with 2 nodes, one with 1 node, none
of the group can write to the database.
Because the Khepri database will be used for all kind of metadata, it
means that RabbitMQ nodes that can't write to the database will be
unable to perform some operations. A list of operations and what to
expect is documented in the associated pull request and the RabbitMQ
website.
This requirement from Raft also affects the startup of RabbitMQ nodes in
a cluster. Indeed, at least a quorum number of nodes must be started at
once to allow nodes to become ready.
To enable Khepri, you need to enable the `khepri_db` feature flag:
rabbitmqctl enable_feature_flag khepri_db
When the `khepri_db` feature flag is enabled, the migration code
performs the following two tasks:
1. It synchronizes the Khepri cluster membership from the Mnesia
cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
the `khepri_mnesia_migration` application [3].
2. It copies data from relevant Mnesia tables to Khepri, doing some
conversion if necessary on the way. Again, it uses
`mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
it.
This can be performed on a running standalone RabbitMQ node or cluster.
Data will be migrated from Mnesia to Khepri without any service
interruption. Note that during the migration, the performance may
decrease and the memory footprint may go up.
Because this feature flag is considered experimental, it is not enabled
by default even on a brand new RabbitMQ deployment.
More about the implementation details below:
In the past months, all accesses to Mnesia were isolated in a collection
of `rabbit_db*` modules. This is where the integration of Khepri mostly
takes place: we use a function called `rabbit_khepri:handle_fallback/1`
which selects the database and perform the query or the transaction.
Here is an example from `rabbit_db_vhost`:
* Up until RabbitMQ 3.12.x:
get(VHostName) when is_binary(VHostName) ->
get_in_mnesia(VHostName).
* Starting with RabbitMQ 3.13.0:
get(VHostName) when is_binary(VHostName) ->
rabbit_khepri:handle_fallback(
#{mnesia => fun() -> get_in_mnesia(VHostName) end,
khepri => fun() -> get_in_khepri(VHostName) end}).
This `rabbit_khepri:handle_fallback/1` function relies on two things:
1. the fact that the `khepri_db` feature flag is enabled, in which case
it always executes the Khepri-based variant.
4. the ability or not to read and write to Mnesia tables otherwise.
Before the feature flag is enabled, or during the migration, the
function will try to execute the Mnesia-based variant. If it succeeds,
then it returns the result. If it fails because one or more Mnesia
tables can't be used, it restarts from scratch: it means the feature
flag is being enabled and depending on the outcome, either the
Mnesia-based variant will succeed (the feature flag couldn't be enabled)
or the feature flag will be marked as enabled and it will call the
Khepri-based variant. The meat of this function really lives in the
`khepri_mnesia_migration` application [3] and
`rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
about the feature flag.
However, some calls to the database do not depend on the existence of
Mnesia tables, such as functions where we need to learn about the
members of a cluster. For those, we can't rely on exceptions from
Mnesia. Therefore, we just look at the state of the feature flag to
determine which database to use. There are two situations though:
* Sometimes, we need the feature flag state query to block because the
function interested in it can't return a valid answer during the
migration. Here is an example:
case rabbit_khepri:is_enabled(RemoteNode) of
true -> can_join_using_khepri(RemoteNode);
false -> can_join_using_mnesia(RemoteNode)
end
* Sometimes, we need the feature flag state query to NOT block (for
instance because it would cause a deadlock). Here is an example:
case rabbit_khepri:get_feature_state() of
enabled -> members_using_khepri();
_ -> members_using_mnesia()
end
Direct accesses to Mnesia still exists. They are limited to code that is
specific to Mnesia such as classic queue mirroring or network partitions
handling strategies.
Now, to discover the Mnesia tables to migrate and how to migrate them,
we use an Erlang module attribute called
`rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
tables and an associated converter module. Here is an example in the
`rabbitmq_recent_history_exchange` plugin:
-rabbit_mnesia_tables_to_khepri_db(
[{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).
The converter module — `rabbit_db_rh_exchange_m2k_converter` in this
example — is is fact a "sub" converter module called but
`rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
converter module to learn more about these modules.
[1] https://github.com/rabbitmq/ra
[2] https://github.com/rabbitmq/khepri
[3] https://github.com/rabbitmq/khepri_mnesia_migration
See #7206.
Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
[Why]
The CLI may be used against a remote node running a different version.
We took that into account in several uses of the `rabbit_db*` modules on
remote nodes, but not everywhere. Likewise in the
`clustering_management_SUITE` testsuite.
[How]
This patch falls back to previous `rabbit_mnesia`-based calls if the
initial calls throws an `undef` exception.
[Why]
`rabbit_mnesia` indirectly checks that `rabbit` is stopped on the remote
node because `mnesia:del_table_copy()` requires that Mnesia is stopped
to delete the schema. However, this is not specific to Mnesia and we
want `rabbit` to be stopped when we use Khepri in the future.
[How]
We use `rabbit:is_running(Node)` to query the status of RabbitMQ on the
remote node to forget. This is not atomic so there is a small chance
that RabbitMQ is restarted between the check and the actual forget.
Note: `rabbit_mnesia` also removes some queues and emit a "left cluster"
event after a successful forget. However, this part was not moved
because other parts of the module rely on this in RPC calls. To keep
nodes compatibles, the calls are left in place. They will be duplicated
for Khepri.
[Why]
When a single node or a cluster is initialized, we go through a few
steps which are not Mnesia-specific or even related. For instance, we
verify that both ends are compatible w.r.t. feature flags.
When we will introduce Khepri, we will have to go through the same
generic steps. Therefore, it makes sense to drive those steps from
`rabbit_db_cluster` and only call into `rabbit_mnesia` when needed.
[How]
The generic code is moved from `rabbit_mnesia` to `rabbit_db_cluster`.
Introduce 'ctl update_vhost_metadata'
that can be used to update the description, tags or default queue type of
any existing virtual hosts.
Closes#7912, #7857.
#7912 will need an HTTP API counterpart change.
Bazel build files are now maintained primarily with `bazel run
gazelle`. This will analyze and merge changes into the build files as
necessitated by certain code changes (e.g. the introduction of new
modules).
In some cases there hints to gazelle in the build files, such as `#
gazelle:erlang...` or `# keep` comments. xref checks on plugins that
depend on the cli are a good example.
* Fetch all prod cli deps with bazel
This avoids issues with hex and OTP 26, and is needed for offline
bazel builds anyway
* Fetch test cli deps with bazel
* mix format
vhost_precondition_failed => vhost_limit_exceeded
vhost_limit_exceeded is the error type used by
definition import when a per-vhost is exceeded.
It feels appropriate for this case, too.
This new module sits on top of `rabbit_mnesia` and provide an API with
all cluster-related functions.
`rabbit_mnesia` should be called directly inside Mnesia-specific code
only, `rabbit_mnesia_rename` or classic mirrored queues for instance.
Otherwise, `rabbit_db_cluster` must be used.
Several modules, in particular in `rabbitmq_cli`, continue to call
`rabbit_mnesia` as a fallback option if the `rabbit_db_cluster` module
unavailable. This will be the case when the CLI will interact with an
older RabbitMQ version.
This will help with the introduction of a new database backend.
This is the latest commit in the series, it fixes (almost) all the
problems with missing and circular dependencies for typing.
The only 2 unsolved problems are:
- `lg` dependency for `rabbit` - the problem is that it's the only
dependency that contains NIF. And there is no way to make dialyzer
ignore it - looks like unknown check is not suppressable by dialyzer
directives. In the future making `lg` a proper dependency can be a
good thing anyway.
- some missing elixir function in `rabbitmq_cli` (CSV, JSON and
logging related).
- `eetcd` dependency for `rabbitmq_peer_discovery_etcd` - this one
uses sub-directories in `src/`, which confuses dialyzer (or our bazel
machinery is not able to properly handle it). I've tried the latest
rules_erlang which flattens directory for .beam files, but it wasn't
enough for dialyzer - it wasn't able to find core erlang files. This
is a niche plugin and an unusual dependency, so probably not worth
investigating further.
So far, we had the following functions to list nodes in a RabbitMQ
cluster:
* `rabbit_mnesia:cluster_nodes/1` to get members of the Mnesia cluster;
the argument was used to select members (all members or only those
running Mnesia and participating in the cluster)
* `rabbit_nodes:all/0` to get all members of the Mnesia cluster
* `rabbit_nodes:all_running/0` to get all members who currently run
Mnesia
Basically:
* `rabbit_nodes:all/0` calls `rabbit_mnesia:cluster_nodes(all)`
* `rabbit_nodes:all_running/0` calls `rabbit_mnesia:cluster_nodes(running)`
We also have:
* `rabbit_node_monitor:alive_nodes/1` which filters the given list of
nodes to only select those currently running Mnesia
* `rabbit_node_monitor:alive_rabbit_nodes/1` which filters the given
list of nodes to only select those currently running RabbitMQ
Most of the code uses `rabbit_mnesia:cluster_nodes/1` or the
`rabbit_nodes:all*/0` functions. `rabbit_mnesia:cluster_nodes(running)`
or `rabbit_nodes:all_running/0` is often used as a close approximation
of "all cluster members running RabbitMQ". This list might be incorrect
in times where a node is joining the clustered or is being worked on
(i.e. Mnesia is running but not RabbitMQ).
With Khepri, there won't be the same possible approximation because we
will try to keep Khepri/Ra running even if RabbitMQ is stopped to
expand/shrink the cluster.
So in order to clarify what we want when we query a list of nodes, this
patch introduces the following functions:
* `rabbit_nodes:list_members/0` to get all cluster members, regardless
of their state
* `rabbit_nodes:list_reachable/0` to get all cluster members we can
reach using Erlang distribution, regardless of the state of RabbitMQ
* `rabbit_nodes:list_running/0` to get all cluster members who run
RabbitMQ, regardless of the maintenance state
* `rabbit_nodes:list_serving/0` to get all cluster members who run
RabbitMQ and are accepting clients
In addition to the list functions, there are the corresponding
`rabbit_nodes:is_*(Node)` checks and `rabbit_nodes:filter_*(Nodes)`
filtering functions.
The code is modified to use these new functions. One possible
significant change is that the new list functions will perform RPC calls
to query the nodes' state, unlike `rabbit_mnesia:cluster_nodes(running)`.
Testcases are executed in a random order. Unfortunately, this testcase
depended on side effects of other testcases. If this testcase was
executed first, then there were no permissions set and the testcase
would fail.
It now lists permissions before and after the actual test and compare
both.
* Add rabbitmq_cli dialyze to bazel
and fix a number of warnings
Because we stop mix from recompiling rabbit_common in bazel, many
unknown functions are reported, so this dialyzer analysis is somewhat
incomplete.
* Use erlang dialyzer for rabbitmq_cli rather than mix dialyzer
Since this resolves all of the rabbit functions, there are far fewer
unknown functions.
Requires yet to be released rules_erlang 3.9.2
* Temporarily use pre-release rules_erlang
So that checks can run on this PR without a release
* Fix additional dialyzer warnings in rabbitmq_cli
* rabbitmq_cli: mix format
* Additional fixes for ignored return values
* Revert "Temporarily use pre-release rules_erlang"
This reverts commit c16b5b6815.
* Use rules_erlang 3.9.2
This allows us to stop ignorning undefined callback warnings
When mix compiles rabbitmqctl, it produces a 'consolidated' directory
alongside the 'ebin' dir. Some of the modules in consolidated are
intended to be used instead of those provided by elixir. We now handle
the conflicts properly in the bazel build.
For scenarios where node hostname resolution on the invoking
host is not a suitable option and a particular address
should be tried instead.
Closes#6853.
`rabbitmqctl add_vhost myvhost --tags "my_tag"` would not previously
conform "my_tag" to a list before setting vhost metadata, which could
cause crashes when the list was read.
RabbitMQ lacks argument verifications in many places unfortunately.
Several testcases were passing bad arguments to the command but were
also comparing the result to bad data. This happened to "work"...
For instance, if '-n ...' has been used to specify a node, it will be
used to discover plugins and print additional commands made available
by plugins, as the "help" command does
Now it's only visible in the management UI.
One can craft a series of calls to `rabbitmqctl list_queues` and
`rabbitmqctl list_policies` to achieve similiar result. But it's more
difficult, and also doesn't take operator policy (if any) into account.
This function returns the data directory where all subsystems should
store their files.
Historically, this was the Mnesia directory. But semantically, this
should be the reverse: RabbitMQ owns the data directory and Mnesia is
configured to put its files there too.
`rabbit_mnesia:dir/0` now calls `rabbit:data_dir/0`.
Other subsystems will be modified in a follow-up commit to call
`rabbit:data_dir/0` instead of `rabbit_mnesia:dir/0`.
The location and name of this directory remains the same for
compatibility reasons. Therefore, it sill contains "mnesia" in its name.
However, semantically, we want this directory to be unrelated to Mnesia.
In the end, many subsystems write files and directories there, including
Mnesia, all Ra systems and in the future, Khepri.
Allow the restart_stream command / API to provide a preferrred leader
hint. If this leader replies to the coordinator stop phase as one of the
first n/2+1 members and has a sufficientl up-to-date stream tail it will be
selected as the leader, else it will fall back and use the modulus logic to
select the next leader.
sufficiently up to
As of #6316/#6020 all plugin directories are scanned earlier and
unconditionally. In some environments, the path won't be set or
will be set incorrectly. Make sure that 'rabbitmqctl help'
and friends work in such environments, even if no commands from
plugins would be available.
Per discussion with @pjk25.
When I ran it manually, all files were reported as mis-formatted. I
didn't investigate further, may line endings or encoding is an issue?
It seems worth it to skip the check for now since we don't run
integration test on windows with bazel yet anyway.
In many tests, we do not care what is the complete set of plugins
running on a node (or present, or in a specific state). We only
care that a few select ones are running, are of the expected version,
in a certain state, and so on.
So list comparison assertions are counterproductive and lead to
test interference that is difficult to track down. In many cases
we can do more fine grained assertions and ignore the rest of
the plugins present on the node.
References #6289, #6020.
`bazel test //deps/rabbitmq_cli:all` runs tests and format check
`bazel test //deps/rabbitmq_cli:tests` runs just the tests
`bazel test //deps/rabbitmq_cli:check_formatted` runs just the format
check
Bazel seems to be cleaning up files after following directory symlinks
after the test. Not what I would have expected, and maybe a
consequence of lighter sandboxing used on macos with rabbitmq. In any
case copying seems to be a reasonable workaround.
For reasons which are not yet fully understood, this file can go
missing and require a `bazel clean` to get the suite working
again. Having this check in place should make future flakes easier to
diagnose, and help determine if subsequent symlinking in the
underlying bazel rule should be replaced with copying, or if something
else altogether is happening.
This category should be unused with the decommissioning of the old
upgrade subsystem (in favor of the feature flags subsystem). It means:
1. The upgrade log file will not be created by default anymore.
2. The `$RABBITMQ_UPGRADE_LOG` environment variable is now unsupported.
The configuration variables remain to avoid breaking an existing and
working configuration.
This gen_statem-based process is responsible for handling concurrency
when feature flags are enabled and synchronized when a cluster is
expanded.
This clarifies and stabilizes the behavior of the feature flag subsystem
w.r.t. situations where e.g. a feature flag migration function takes
time to update data and a new node joins a cluster and synchronizes its
feature flag states with the cluster. There was a chance that the
feature flag was marked as enabled on the joining node, even though the
migration function didn't take care of that node.
With this new feature flags controller, enabling or synchronizing
feature flags blocks and delays any concurrent operations which try to
modify feature flags states too.
This change also clarifies where and when the migration function is
called: it is called at least once on each node who knows the feature
flag and when the state goes from "disabled" to "enabled" on that node.
Note that even if the feature flag is being enabled on a subset of the
nodes (because other nodes already have it enabled), it is marked as
"state_changing" everywhere during the migration. This is to prevent
that a node where it is enabled assumes it is enabled on all nodes who
know the feature flag.
There is a new feature as well: just after a feature flag is enabled,
the migration function is called a second time for any post-enable
actions. The feature flag is marked as enabled between these "enable"
and "post-enable" steps. The success or failure of this "post-enable"
run does not affect the state of the feature flag (i.e. it is ignored).
A new migration function API is introduced to allow more advanced
things. The new API is:
my_migration_function(
#ffcommand{name = ...,
props = ...,
command = enable | post_enable,
extra = #{...}})
The record is defined in `include/feature_flags.hrl`. Here is the
meaning of each field:
* `name` and `props` are the equivalent of the `FeatureName` and
`FeatureProps` arguments of the previous migration function API.
* `command` is basically the same as the previous `Arg` arguments.
* `extra` is map containing context-specific information. For instance, it
contains the list of nodes where the feature flag state changes.
This whole new behavior is behind a new feature flag called
`feature_flags_v2`. If a feature flag uses the new migration function
API, `feature_flags_v2` will be automatically enabled.
If many feature flags are enabled at once (like when a fresh RabbitMQ
node is started for the first time), `feature_flags_v2` will be enabled
first if it is in the list.
This unbreaks the build of rabbitmq_cli on platforms where GNU Make is
installed under another name than `make`. This is the case on Mac OSX
and *BSD for instance where GNU Make is available as `gmake`.
Also rework elixir dependency handling, so we no longer rely on mix to
fetch the rabbitmq_cli deps
Also:
- Specify ra version with a commit rather than a branch
- Fixup compilation options for erlang 23
- Add missing ra reference in MODULE.bazel
- Add missing flag in oci.yaml
- Reduce bazel rbe jobs to try to save memory
- Use bazel built erlang for erlang git master tests
- Use the same cache for all the workflows but windows
- Avoid using `mix local.hex --force` in elixir rules
- Fetching seems blocked in CI, and this should reduce hex api usage in
all builds, which is always nice
- Remove xref and dialyze tags since rules_erlang 3 includes them in
the defaults