[Why]
Mnesia is a very powerful and convenient tool for Erlang applications:
it is a persistent disc-based database, it handles replication across
multiple Erlang nodes and it is available out-of-the-box from the
Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
metadata:
* virtual hosts' properties
* internal users
* queue, exchange and binding declarations (not queue data)
* runtime parameters and policies
* ...
Unfortunately Mnesia makes it difficult to handle network partitions
and, as a consequence, the merge conflicts between Erlang nodes once the
network partition is resolved. RabbitMQ provides several partition
handling strategies but they are not bullet-proof. Users still hit
situations where it is a pain to repair a cluster following a network
partition.
[How]
@kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
already uses successfully to implement quorum queues and streams for
instance. Those queues do not suffer from network partitions.
We created Khepri [2], a new persistent and replicated database engine
based on Ra and we want to use it in place of Mnesia in RabbitMQ to
solve the problems with network partitions.
This patch integrates Khepri as an experimental feature. When enabled,
RabbitMQ will store all its metadata in Khepri instead of Mnesia.
This change comes with behavior changes. While Khepri remains disabled,
you should see no changes to the behavior of RabbitMQ. If there are
changes, it is a bug. After Khepri is enabled, there are significant
changes of behavior that you should be aware of.
Because Khepri is based on the Raft consensus algorithm, when there is a
network partition, only the cluster members that are in the partition
with at least `(Number of nodes in the cluster ÷ 2) + 1` nodes (using
integer division) can "make progress". In other words, only those nodes
may write to the Khepri database and read from the database and expect a
consistent result.
For instance in a cluster of 5 RabbitMQ nodes:
* If there are two partitions, one with 3 nodes, one with 2 nodes, only
the group of 3 nodes will be able to write to the database.
* If there are three partitions, two with 2 nodes, one with 1 node, none
of the groups can write to the database.
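The quorum rule above can be sketched as a small Erlang module. This is
only an illustration of the arithmetic; the module and function names
(`quorum_sketch`, `quorum/1`, `can_make_progress/2`) are hypothetical
and are not part of RabbitMQ, Khepri or Ra.

```erlang
-module(quorum_sketch).
-export([quorum/1, can_make_progress/2]).

%% Hypothetical helper: the smallest number of members that forms a
%% majority in a cluster of ClusterSize nodes (integer division).
quorum(ClusterSize) when is_integer(ClusterSize), ClusterSize > 0 ->
    ClusterSize div 2 + 1.

%% Hypothetical helper: true if a partition holding PartitionSize
%% members still contains a majority and can therefore write.
can_make_progress(PartitionSize, ClusterSize) ->
    PartitionSize >= quorum(ClusterSize).
```

With a cluster of 5 nodes, `quorum_sketch:quorum(5)` is 3, so a group
of 3 nodes can make progress while groups of 2 or 1 cannot.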
Because the Khepri database will be used for all kinds of metadata, it
means that RabbitMQ nodes that can't write to the database will be
unable to perform some operations. A list of operations and what to
expect is documented in the associated pull request and the RabbitMQ
website.
This requirement from Raft also affects the startup of RabbitMQ nodes in
a cluster. Indeed, at least a quorum number of nodes must be started at
once to allow nodes to become ready.
To enable Khepri, you need to enable the `khepri_db` feature flag:
rabbitmqctl enable_feature_flag khepri_db
When the `khepri_db` feature flag is enabled, the migration code
performs the following two tasks:
1. It synchronizes the Khepri cluster membership from the Mnesia
cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
the `khepri_mnesia_migration` application [3].
2. It copies data from relevant Mnesia tables to Khepri, doing some
conversion if necessary on the way. Again, it uses
`mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
it.
This can be performed on a running standalone RabbitMQ node or cluster.
Data will be migrated from Mnesia to Khepri without any service
interruption. Note that during the migration, the performance may
decrease and the memory footprint may go up.
Because this feature flag is considered experimental, it is not enabled
by default even on a brand new RabbitMQ deployment.
More about the implementation details below:
In the past months, all accesses to Mnesia were isolated in a collection
of `rabbit_db*` modules. This is where the integration of Khepri mostly
takes place: we use a function called `rabbit_khepri:handle_fallback/1`
which selects the database and performs the query or the transaction.
Here is an example from `rabbit_db_vhost`:
* Up until RabbitMQ 3.12.x:
get(VHostName) when is_binary(VHostName) ->
get_in_mnesia(VHostName).
* Starting with RabbitMQ 3.13.0:
get(VHostName) when is_binary(VHostName) ->
rabbit_khepri:handle_fallback(
#{mnesia => fun() -> get_in_mnesia(VHostName) end,
khepri => fun() -> get_in_khepri(VHostName) end}).
This `rabbit_khepri:handle_fallback/1` function relies on two things:
1. the fact that the `khepri_db` feature flag is enabled, in which case
it always executes the Khepri-based variant.
2. the ability or not to read and write to Mnesia tables otherwise.
Before the feature flag is enabled, or during the migration, the
function will try to execute the Mnesia-based variant. If it succeeds,
then it returns the result. If it fails because one or more Mnesia
tables can't be used, it restarts from scratch: it means the feature
flag is being enabled and depending on the outcome, either the
Mnesia-based variant will succeed (the feature flag couldn't be enabled)
or the feature flag will be marked as enabled and it will call the
Khepri-based variant. The meat of this function really lives in the
`khepri_mnesia_migration` application [3] and
`rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
about the feature flag.
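The retry behavior described above can be sketched as follows. This is
a simplified illustration, not the actual implementation in
`khepri_mnesia_migration` or `rabbit_khepri`: the module name, the
extra `IsKhepriEnabled` argument and the
`{mnesia_table_unavailable, Table}` error term are all assumptions made
for the example.

```erlang
-module(fallback_sketch).
-export([handle_fallback/2]).

%% Funs is a map with a `mnesia` and a `khepri` fun, as in the
%% rabbit_db_vhost example above.
%% IsKhepriEnabled is a hypothetical fun() -> boolean() standing in for
%% the feature flag check.
handle_fallback(#{mnesia := MnesiaFun, khepri := KhepriFun} = Funs,
                IsKhepriEnabled) ->
    case IsKhepriEnabled() of
        true ->
            %% Feature flag enabled: always use the Khepri variant.
            KhepriFun();
        false ->
            try
                MnesiaFun()
            catch
                %% Hypothetical error raised when a Mnesia table can't
                %% be used because a migration is in progress.
                error:{mnesia_table_unavailable, _Table} ->
                    %% Restart from scratch: the flag is re-checked, so
                    %% the outcome depends on how the migration ended.
                    %% This sketch retries without any delay or limit.
                    handle_fallback(Funs, IsKhepriEnabled)
            end
    end.
```

Passing the flag check as a fun keeps the sketch self-contained; the
real wrapper knows about the feature flag itself.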
However, some calls to the database do not depend on the existence of
Mnesia tables, such as functions where we need to learn about the
members of a cluster. For those, we can't rely on exceptions from
Mnesia. Therefore, we just look at the state of the feature flag to
determine which database to use. There are two situations though:
* Sometimes, we need the feature flag state query to block because the
function interested in it can't return a valid answer during the
migration. Here is an example:
case rabbit_khepri:is_enabled(RemoteNode) of
true -> can_join_using_khepri(RemoteNode);
false -> can_join_using_mnesia(RemoteNode)
end
* Sometimes, we need the feature flag state query to NOT block (for
instance because it would cause a deadlock). Here is an example:
case rabbit_khepri:get_feature_state() of
enabled -> members_using_khepri();
_ -> members_using_mnesia()
end
Direct accesses to Mnesia still exist. They are limited to code that is
specific to Mnesia such as classic queue mirroring or network partition
handling strategies.
Now, to discover the Mnesia tables to migrate and how to migrate them,
we use an Erlang module attribute called
`rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
tables and an associated converter module. Here is an example in the
`rabbitmq_recent_history_exchange` plugin:
-rabbit_mnesia_tables_to_khepri_db(
[{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).
The converter module (`rabbit_db_rh_exchange_m2k_converter` in this
example) is in fact a "sub" converter module called by
`rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
converter module to learn more about these modules.
[1] https://github.com/rabbitmq/ra
[2] https://github.com/rabbitmq/khepri
[3] https://github.com/rabbitmq/khepri_mnesia_migration
See #7206.
Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
This includes a new ra:key_metrics/1 API that is more available
than parsing the output of sys:get_status/1.
The rabbit_quorum_queue:status/1 function has been ported to use
this API instead and now includes a few new fields.
This version contains a bit more error handling during replica start
to avoid logging process crashes during RabbitMQ shutdown.
rabbit_stream_queue: avoid double local pid query
When starting a new consumer. Also move some metrics
registration a bit later when we know we are likely to succeed.
* Check additional applications when comparing bazel and make results
* Sync bazel/make for amqp_client
* Do not fail-fast in build system comparison
* promethus -> prometheus
* Regenerate BUILD.redbug
* When comparing build systems & .app files ignore empty 'registered'
It's listed as a required key in
https://www.erlang.org/doc/man/app.html, but the same docs state the
default is "[]". It seems to ignore it if it's empty.
* Copy bazel/BUILD.osiris from BUILD.bazel in the osiris repo
Normally it would be generated with `bazel run gazelle-update-repos --
-args osiris@1.5.1=github.com/rabbitmq/osiris@v1.5.1`, but in this
case we just want to match its compilation with erlang.mk with some
manual tweaks.
* Use elixir 1.15, otherwise mix format fails
* Sync bazel/make for rabbitmq_web_dispatch, rabbitmq_management_agent
emqtt repos:
emqx/emqtt PR #196 is based on rabbitmq:otp-26-compatibility
emqx/emqtt PR #198 is based on ansd:master
rabbitmq/master contains both of these 2 PRs cherry-picked.
rabbitmq-server repos:
main branch points emqtt to rabbitmq:otp-26-compatibility
mqtt5 branch points emqtt to rabbitmq:master
Therefore, the current mqtt5 branch is OTP 26 compatible and can support
multiple subscription identifiers.
Includes minor fixes and improvements such as:
* Don't overwrite Ra member config file in place to avoid potential
corruption scenario
* Make logging unicode compatible
* Optimisation to avoid spawning node connector process on ra member init
when nodes are already connected.
* Catch recovery failures in the Ra WAL rather than crashing hard.
We already were using Cowlib 2.12.1 and therefore were
compatible with OTP-26. This simply updates Cowboy to
the version that depends on Cowlib 2.12.1.
This change should be reverted once emqx/emqtt is OTP26 compatible.
Our fork/branch isn't either at this point, but at least partially
works. Let's use this branch for now to uncover server-side OTP26
incompatibilities (and continue working on OTP26 support for emqtt of
course).
* Fetch all prod cli deps with bazel
This avoids issues with hex and OTP 26, and is needed for offline
bazel builds anyway
* Fetch test cli deps with bazel
* mix format
Previously osiris did not support uncorrelated writes which means
we could not use a "stateless" queue type delivery and these were
silently dropped.
This had the impact that at-most-once dead letter was not possible
where the dead letter target is a stream.
This change bumps the osiris version that has the required API
to allow for uncorrelated writes (osiris:write/2).
Currently there is no feature flag to control this as osiris writer
processes just log and drop any messages they don't understand.
Returns reaching a Ra member that used to be leader but now has stepped
down would cause that follower to crash and restart.
This commit avoids this scenario and gives the return commands a good
chance of being resent to the new leader in a timely manner
(see the Ra release for this).
deps/rabbitmq_ct_helpers depends on proper and meck, so unfortunately
if proper and meck are marked as dev dependencies, bazel modules
depending on rabbitmq-server cannot build it
Another way of putting it is that they are not actually "dev"
dependencies for all components that rabbitmq-server exposes
This Ra release includes improvements to Ra server GC behaviour when receiving a lot
of low priority commands with large binary payloads (e.g. quorum queue messages).
Practically this allows quorum queues to accept large amounts of messages in a more predictable and performant manner.
This change also removes ra_file_handle cache that was used as a bridge between ra file operations and RabbitMQ io metrics. Lots of components in RabbitMQ such as streams and CQv2s do not record io metrics in the previous manner due to overhead incurred for every file io operation. These metrics are better inspected at the OS level anyway.
This Ra release
* Improves election availability in certain mixed-version failure
scenarios
* Optimises segment reference compaction which may become expensive
in quorum queues with very long backlogs
* Various log message improvements and level tweaks
* Better cleans up machine monitor records after quorum queue rebalancing
* Add rabbitmq_cli dialyze to bazel
and fix a number of warnings
Because we stop mix from recompiling rabbit_common in bazel, many
unknown functions are reported, so this dialyzer analysis is somewhat
incomplete.
* Use erlang dialyzer for rabbitmq_cli rather than mix dialyzer
Since this resolves all of the rabbit functions, there are far fewer
unknown functions.
Requires yet to be released rules_erlang 3.9.2
* Temporarily use pre-release rules_erlang
So that checks can run on this PR without a release
* Fix additional dialyzer warnings in rabbitmq_cli
* rabbitmq_cli: mix format
* Additional fixes for ignored return values
* Revert "Temporarily use pre-release rules_erlang"
This reverts commit c16b5b6815.
* Use rules_erlang 3.9.2
* Add gazelle for use with update-repos command
* Use explicit BUILD.app_name files for erlang app deps
This allows us to remove the duplicate definitions in
workspace_helpers.bzl
These files are generated with gazelle. For instance:
BUILD.ra is generated with `bazel run gazelle -- update-repos
--verbose --build_files_dir=bazel hex.pm/ra@2.4.6`
Running gazelle this way will modify the WORKSPACE file, as gazelle
does not yet support MODULE.bazel files. Such changes to the WORKSPACE
can be dropped, and should not be committed. It may also update the
`moduleindex.yaml` file. Changes to `moduleindex.yaml` should be
committed.
However
* skip the explicit bazel/BUILD.osiris file, as osiris already contains the file in its repo
* skip the explicit BUILD.inet_tcp_proxy_dist file, since the repo already contains a bazel BUILD.bazel file
gazelle command: `bazel run gazelle -- update-repos --verbose --build_files_dir=bazel
inet_tcp_proxy_dist=github.com/rabbitmq/inet_tcp_proxy@master`
* jose is imported with `bazel run gazelle -- update-repos --verbose --build_files_dir=bazel
jose=github.com/michaelklishin/erlang-jose@d63c1c5c8f9c1a4f1438e234b886de8607a0034e`
* Move the bats dep directly to WORKSPACE, drop workspace_helpers.bzl
* Use bzlmod in windows tests
Also change consumer credit top-ups to delay calling send_chunks until there is a "batch"
of credit to consume. Most clients at the time of writing send single credit updates after receiving each chunk so here we won't enter the send loop unless there are more than half the initial credits available.
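The batching rule above can be sketched as a single predicate. This is
only an illustration of the threshold described in this message; the
module and function names are hypothetical and do not match the actual
stream reader code.

```erlang
-module(credit_sketch).
-export([should_send/2]).

%% Hypothetical helper: enter the send loop only when strictly more
%% than half of the initial credits are available. The comparison is
%% written with a multiplication to avoid rounding on odd values.
should_send(AvailableCredit, InitialCredit)
  when is_integer(AvailableCredit), is_integer(InitialCredit) ->
    AvailableCredit * 2 > InitialCredit.
```

For a consumer that started with 10 credits, single credit top-ups
would not trigger sending until 6 credits have accumulated.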
osiris v1.4.3
Since 4.10.0 was released specifically to address an issue we
encountered in RabbitMQ integration with prometheus.erl, a new test was
added to validate this functionality in the future.
Currently ra is not published to hex.pm with the bazel files that are
available when fetched directly from github, so additional hints must
be provided in the MODULE.bazel for now
This includes:
* async replica initialisation - making definitions import
and bulk restarts faster.
* Faster end of stream handling - lowering cpu use in clusters
with many low throughput streams
* Makes osiris binary string friendly so that paths do not have to be lists
anymore.
* Periodic max_age retention evaluation. Streams with max_age retention settings
are now re-evaluated every 1hr to reclaim disk space for streams that are idle but
have segments that only have old message data. Previously, retention was only evaluated
when streams were written to and a new segment was opened.
When the version of osiris is a published version, there is no longer
a need to inject the git sha as its version. Other patches are no
longer needed now that osiris is caught up to the same rules_erlang
major.
The quality of auto-detection of properties of a hex dependency was
improved with bzlmod, thus in the MODULE.bazel file, ra is handled
correctly with no hints. In WORKSPACE.bazel/workspace_helpers.bzl,
this is not the case, so a full build_file_content is needed.
Bazel 6, due this month, takes bzlmod out of experimental status, so I
don't expect to close up the difference between the systems.
The easier solution is to publish ra to hex.pm with the BUILD.bazel
file included, as it exists in the ra source, and is correct,
eliminating the need for any auto-generation of it when
imported/referenced by rabbitmq-server
If it's a dev dependency, projects that depend on rabbitmq-server as a
bzlmod module cannot borrow rabbitmq-server's platform definitions, as
the rbe repo won't be visible to rabbitmq-server in such a scenario