Commit Graph

34 Commits

Michal Kuratczyk 175ba70e8c
[skip ci] Remove rabbit_log and switch to LOG_ macros 2025-07-18 08:42:59 +02:00
Iliia Khaprov e408c9e0f2
Queues with plugins - core 2025-05-17 20:47:43 +02:00
Michael Klishin 968eefa1bb
Bump (c) line year
There are no functional changes to this massive diff.
2025-01-01 17:54:10 -05:00
Karl Nilsson 2dcced6967 Maintenance mode: change revive to use quorum queue recovery function.
As this already does the job.
2024-08-16 10:05:53 +01:00
Michael Klishin 2f165e02f2 rabbitmq-upgrade revive: handle more errors
returned by Ra, e.g. when a replica cannot be
restarted because of a concurrent delete
or because a QQ was inserted into a schema data
store but not yet registered as a process on
the node.

References #12013.
2024-08-15 10:02:02 -04:00
Diana Parra Corbacho 3bbda5bdba Remove classic mirror queues 2024-06-04 13:00:31 +02:00
deterclosed 21f7da6dfd chore: fix some typos
Signed-off-by: deterclosed <fliter@outlook.com>
2024-03-23 13:44:55 +08:00
Jean-Sébastien Pédron af0728c33f
rabbit_db_maintenance: Use local queries with Khepri
[Why]
Ra consistent queries are currently fragile in the sense that the query
function may run on a remote node and the function reference or MFA may
not be valid on that node. See previous commit for more details.

[How]
We perform local queries in `rabbit_db_maintenance:get_consistent/1`
when Khepri is enabled. This violates the expectation behind this API,
which is why it is a temporary measure until a proper solution is
found.
2024-01-31 19:12:28 +01:00
Michael Klishin fb775afcc8
More (c) source header updates #9969 2024-01-19 19:53:28 -05:00
Diana Parra Corbacho 5f0981c5a3
Allow using the Khepri database to store metadata instead of Mnesia
[Why]

Mnesia is a very powerful and convenient tool for Erlang applications:
it is a persistent disc-based database, it handles replication across
multiple Erlang nodes and it is available out-of-the-box from the
Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
metadata:

* virtual hosts' properties
* internal users
* queue, exchange and binding declarations (not queue data)
* runtime parameters and policies
* ...

Unfortunately Mnesia makes it difficult to handle network partitions and,
as a consequence, the merge conflicts between Erlang nodes once the
network partition is resolved. RabbitMQ provides several partition
handling strategies but they are not bullet-proof. Users still hit
situations where it is a pain to repair a cluster following a network
partition.

[How]

@kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
already uses successfully to implement quorum queues and streams for
instance. Those queues do not suffer from network partitions.

We created Khepri [2], a new persistent and replicated database engine
based on Ra and we want to use it in place of Mnesia in RabbitMQ to
solve the problems with network partitions.

This patch integrates Khepri as an experimental feature. When enabled,
RabbitMQ will store all its metadata in Khepri instead of Mnesia.

This change comes with behavior changes. While Khepri remains disabled,
you should see no changes to the behavior of RabbitMQ. If there are
changes, it is a bug. After Khepri is enabled, there are significant
changes in behavior that you should be aware of.

Because it is based on the Raft consensus algorithm, when there is a
network partition, only the cluster members in a partition that holds at
least `(Number of nodes in the cluster ÷ 2) + 1` nodes (a majority) can
"make progress". In other words, only those nodes may write to the
Khepri database, or read from it and expect a consistent result.

For instance in a cluster of 5 RabbitMQ nodes:
* If there are two partitions, one with 3 nodes, one with 2 nodes, only
  the group of 3 nodes will be able to write to the database.
* If there are three partitions, two with 2 nodes, one with 1 node, none
  of the groups can write to the database.
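
To make the majority rule concrete, here is a minimal sketch (plain
Erlang, not the actual Ra code) using integer division:

    majority(ClusterSize) ->
        ClusterSize div 2 + 1.

    %% majority(5) =:= 3: a partition of 3 nodes can make progress,
    %% a partition of 2 nodes cannot.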

Because the Khepri database will be used for all kinds of metadata,
RabbitMQ nodes that can't write to the database will be
unable to perform some operations. A list of operations and what to
expect is documented in the associated pull request and the RabbitMQ
website.

This requirement from Raft also affects the startup of RabbitMQ nodes in
a cluster. Indeed, at least a quorum number of nodes must be started at
once to allow nodes to become ready.

To enable Khepri, you need to enable the `khepri_db` feature flag:

    rabbitmqctl enable_feature_flag khepri_db

When the `khepri_db` feature flag is enabled, the migration code
performs the following two tasks:
1. It synchronizes the Khepri cluster membership from the Mnesia
   cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
   the `khepri_mnesia_migration` application [3].
2. It copies data from relevant Mnesia tables to Khepri, doing some
   conversion if necessary on the way. Again, it uses
   `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
   it.
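
For illustration only, the two steps roughly correspond to the
following sketch; the argument names (store id, table list, converter
module and its arguments) are assumptions, not the actual RabbitMQ call
sites:

    %% Hedged sketch; see the khepri_mnesia_migration documentation for
    %% the real API and arguments.
    migrate_metadata(StoreId, Tables, ConverterMod, ConverterArgs) ->
        %% 1. Align the Khepri cluster membership with the Mnesia cluster.
        ok = mnesia_to_khepri:sync_cluster_membership(StoreId),
        %% 2. Copy and convert the relevant Mnesia tables into Khepri.
        mnesia_to_khepri:copy_tables(StoreId, Tables, ConverterMod,
                                     ConverterArgs).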

This can be performed on a running standalone RabbitMQ node or cluster.
Data will be migrated from Mnesia to Khepri without any service
interruption. Note that during the migration, the performance may
decrease and the memory footprint may go up.

Because this feature flag is considered experimental, it is not enabled
by default even on a brand new RabbitMQ deployment.

More about the implementation details below:

In the past months, all accesses to Mnesia were isolated in a collection
of `rabbit_db*` modules. This is where the integration of Khepri mostly
takes place: we use a function called `rabbit_khepri:handle_fallback/1`
which selects the database and performs the query or the transaction.
Here is an example from `rabbit_db_vhost`:

* Up until RabbitMQ 3.12.x:

        get(VHostName) when is_binary(VHostName) ->
            get_in_mnesia(VHostName).

* Starting with RabbitMQ 3.13.0:

        get(VHostName) when is_binary(VHostName) ->
            rabbit_khepri:handle_fallback(
              #{mnesia => fun() -> get_in_mnesia(VHostName) end,
                khepri => fun() -> get_in_khepri(VHostName) end}).

This `rabbit_khepri:handle_fallback/1` function relies on two things:
1. the fact that the `khepri_db` feature flag is enabled, in which case
   it always executes the Khepri-based variant;
2. otherwise, the ability (or not) to read from and write to Mnesia
   tables.

Before the feature flag is enabled, or during the migration, the
function will try to execute the Mnesia-based variant. If it succeeds,
then it returns the result. If it fails because one or more Mnesia
tables can't be used, it restarts from scratch: such a failure means the
feature flag is being enabled, and depending on the outcome, either the
Mnesia-based variant will succeed (the feature flag couldn't be enabled)
or the feature flag will be marked as enabled and the function will call
the Khepri-based variant. The meat of this function really lives in the
`khepri_mnesia_migration` application [3] and
`rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
about the feature flag.
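
Conceptually, the wrapper behaves like the following sketch (the
`khepri_db_is_enabled/0` helper is hypothetical; the real logic lives in
`khepri_mnesia_migration` as noted above):

    handle_fallback(#{mnesia := MnesiaFun, khepri := KhepriFun} = Funs) ->
        case khepri_db_is_enabled() of      %% hypothetical helper
            true ->
                KhepriFun();
            false ->
                try
                    MnesiaFun()
                catch
                    exit:{aborted, {no_exists, _}} ->
                        %% A Mnesia table is gone: the feature flag is
                        %% being enabled, so re-check it and start over.
                        handle_fallback(Funs)
                end
        end.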

However, some calls to the database do not depend on the existence of
Mnesia tables, such as functions where we need to learn about the
members of a cluster. For those, we can't rely on exceptions from
Mnesia. Therefore, we just look at the state of the feature flag to
determine which database to use. There are two situations though:

* Sometimes, we need the feature flag state query to block because the
  function interested in it can't return a valid answer during the
  migration. Here is an example:

        case rabbit_khepri:is_enabled(RemoteNode) of
            true  -> can_join_using_khepri(RemoteNode);
            false -> can_join_using_mnesia(RemoteNode)
        end

* Sometimes, we need the feature flag state query to NOT block (for
  instance because it would cause a deadlock). Here is an example:

        case rabbit_khepri:get_feature_state() of
            enabled -> members_using_khepri();
            _       -> members_using_mnesia()
        end

Direct accesses to Mnesia still exist. They are limited to code that is
specific to Mnesia, such as classic queue mirroring or network partition
handling strategies.

Now, to discover the Mnesia tables to migrate and how to migrate them,
we use an Erlang module attribute called
`rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
tables and an associated converter module. Here is an example in the
`rabbitmq_recent_history_exchange` plugin:

    -rabbit_mnesia_tables_to_khepri_db(
       [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).

The converter module (`rabbit_db_rh_exchange_m2k_converter` in this
example) is in fact a "sub" converter module called by
`rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
converter module to learn more about these modules.
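
For illustration, here is one way such attributes can be collected at
run time (a sketch; the actual discovery code scans applications rather
than taking an explicit module list):

    %% Collect the {MnesiaTable, ConverterMod} pairs declared through
    %% the module attribute shown above.
    table_converters(Modules) ->
        [Pair
         || Mod <- Modules,
            {rabbit_mnesia_tables_to_khepri_db, Decls}
                <- Mod:module_info(attributes),
            Decl <- Decls,
            Pair <- Decl].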

[1] https://github.com/rabbitmq/ra
[2] https://github.com/rabbitmq/khepri
[3] https://github.com/rabbitmq/khepri_mnesia_migration

See #7206.

Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
2023-09-29 16:00:11 +02:00
Jean-Sébastien Pédron d65637190a
rabbit_nodes: Add list functions to clarify which nodes we are interested in
So far, we had the following functions to list nodes in a RabbitMQ
cluster:
* `rabbit_mnesia:cluster_nodes/1` to get members of the Mnesia cluster;
  the argument was used to select members (all members or only those
  running Mnesia and participating in the cluster)
* `rabbit_nodes:all/0` to get all members of the Mnesia cluster
* `rabbit_nodes:all_running/0` to get all members who currently run
  Mnesia

Basically:
* `rabbit_nodes:all/0` calls `rabbit_mnesia:cluster_nodes(all)`
* `rabbit_nodes:all_running/0` calls `rabbit_mnesia:cluster_nodes(running)`

We also have:
* `rabbit_node_monitor:alive_nodes/1` which filters the given list of
  nodes to only select those currently running Mnesia
* `rabbit_node_monitor:alive_rabbit_nodes/1` which filters the given
  list of nodes to only select those currently running RabbitMQ

Most of the code uses `rabbit_mnesia:cluster_nodes/1` or the
`rabbit_nodes:all*/0` functions. `rabbit_mnesia:cluster_nodes(running)`
or `rabbit_nodes:all_running/0` is often used as a close approximation
of "all cluster members running RabbitMQ". This list might be incorrect
in times where a node is joining the clustered or is being worked on
(i.e. Mnesia is running but not RabbitMQ).

With Khepri, the same approximation won't be possible because we
will try to keep Khepri/Ra running even when RabbitMQ is stopped, in
order to expand/shrink the cluster.

So in order to clarify what we want when we query a list of nodes, this
patch introduces the following functions:
* `rabbit_nodes:list_members/0` to get all cluster members, regardless
  of their state
* `rabbit_nodes:list_reachable/0` to get all cluster members we can
  reach using Erlang distribution, regardless of the state of RabbitMQ
* `rabbit_nodes:list_running/0` to get all cluster members who run
  RabbitMQ, regardless of the maintenance state
* `rabbit_nodes:list_serving/0` to get all cluster members who run
  RabbitMQ and are accepting clients

In addition to the list functions, there are the corresponding
`rabbit_nodes:is_*(Node)` checks and `rabbit_nodes:filter_*(Nodes)`
filtering functions.
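
As a usage sketch (the function names come from this patch; the
surrounding code is illustrative only):

    cluster_overview() ->
        Members = rabbit_nodes:list_members(),  %% all members, any state
        Serving = rabbit_nodes:list_serving(),  %% running and accepting clients
        {Members -- Serving, Serving}.          %% {not serving, serving}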

The code is modified to use these new functions. One possible
significant change is that the new list functions will perform RPC calls
to query the nodes' state, unlike `rabbit_mnesia:cluster_nodes(running)`.
2023-02-13 12:58:40 +01:00
Diana Parra Corbacho 9cf10ed8a7 Unit test rabbit_db_* modules, spec and API updates 2023-02-02 15:01:42 +01:00
Diana Parra Corbacho c95ac0a9e6 Move maintenance mode Mnesia-specific code to rabbit_db_maintenance module 2023-01-31 10:23:16 +01:00
Alexey Lebedeff 7c0f04067f Fix all dialyzer warnings in `rabbit` 2023-01-24 17:26:58 +01:00
Michael Klishin ec4f1dba7d
(c) year bump: 2022 => 2023 2023-01-01 23:17:36 -05:00
Michal Kuratczyk 23e40b53ed
Transfer coordinator leadership when draining a node 2022-12-02 14:55:06 +01:00
David Ansari 694501b923 Close local MQTT connections when draining node
When a node gets drained (i.e. goes into maintenance mode), only local
connections should be terminated.

However, prior to this commit, all MQTT connections got terminated
cluster-wide when a single node was drained.
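
The idea of the fix, as a sketch (the `connection_pids/0` and `close/1`
helpers are hypothetical): close only the connection processes hosted on
the node being drained.

    close_local_connections() ->
        Self = node(),
        [close(Pid)                      %% hypothetical close helper
         || Pid <- connection_pids(),    %% hypothetical cluster-wide listing
            node(Pid) =:= Self],         %% keep only local connections
        ok.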
2022-10-13 11:14:01 +00:00
Luke Bakken 7fe159edef
Yolo-replace format strings
Replaces `~s` and `~p` with their unicode-friendly counterparts.

```
git ls-files *.erl | xargs sed -i.ORIG -e s/~s>/~ts>/g -e s/~p>/~tp>/g
```
2022-10-10 10:32:03 +04:00
Jean-Sébastien Pédron 43a525f4d0
Remove pre-maintenance_mode_status compatibility code
Maintenance mode, introduced in RabbitMQ 3.8.x, was a breaking change
protected behind a feature flag. This allowed a RabbitMQ cluster to be
upgraded one node at a time, without having to stop the entire cluster.

The compatibility code is in the wild for long enough. The
`maintenance_mode_status` feature flag was marked as required in a
previous commit (see #5202). This allows us to remove code in this
patch.

References #5215.
2022-07-29 11:51:26 +02:00
Michael Klishin a633312450
Maintenance mode state: don't force a local replica on boot
It breaks forced node removal in some tricky scenarios
because node schemas do not match, and Mnesia is sensitive to this.
2022-05-08 21:34:10 +04:00
Michael Klishin 2d6964d89e Add a local replica of maintenance mode state table on boot
Some users were surprised by the current approach to
state table setup.

Per discussion in #4755
2022-05-07 21:33:09 +04:00
Felix Huettner fa0a8de9ac Fix migrate leadership timeout
The code of `migrate_leadership_to_existing_replica` previously assumed
that it can control the target node of a failover of mirrored queues.

While its sibling function `transfer_leadership` can do that (as it drops all
mirrors besides the one on the target node),
`migrate_leadership_to_existing_replica` tries to be less intrusive and
only tries to migrate the queue away from the current primary.

However, it then used `wait_for_new_master` to wait for the queue to actually
transfer the primary to the specified target node. As the specified target
node might not ever become the primary of the queue (as this decision is
only made between the remaining mirrors), this can cause the migrate operation
to run into a timeout (10 seconds by default).

As `migrate_leadership_to_existing_replica` is only used during
`transfer_leadership_of_classic_mirrored_queues`, its only goal is to get the
primary away from the current node. Therefore, we can just wait for the
queue to become active on some other node instead of expecting a specific
node to become the primary.
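
A minimal sketch of the revised wait (helper names hypothetical): poll
until the queue reports a primary on any node other than the one being
drained, rather than waiting for one specific node to win the promotion.

    wait_until_primary_moved(_QName, _Self, 0) ->
        {error, timeout};
    wait_until_primary_moved(QName, Self, Retries) ->
        case primary_node(QName) of      %% hypothetical lookup
            {ok, Node} when Node =/= Self ->
                ok;                      %% any other node will do
            _ ->
                timer:sleep(1000),
                wait_until_primary_moved(QName, Self, Retries - 1)
        end.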
2022-04-08 15:58:42 +02:00
Michael Klishin c38a3d697d
Bump (c) year 2022-03-21 01:21:56 +04:00
Michael Klishin 49fd6b3e8d
Make sure 'rabbitmq-upgrade {drain, revive}' do not produce scary log messages
"scary" here means log messages that show up as "errors". Warnings
and info messages make more sense.

Per discussion with @mkuratczyk.
2021-08-05 16:26:13 +03:00
Philip Kuryloski 388654c542
Add a partial Bazel build (#2938)
Adds WORKSPACE.bazel, BUILD.bazel & *.bzl files for partial build & test with Bazel. Introduces a build-time dependency on https://github.com/rabbitmq/bazel-erlang
2021-03-29 11:01:43 +02:00
Michael Klishin 647b2ad453
Revisit what drain and revive do when their feature flag is not enabled
If the maintenance mode feature flag is not enabled, drain and revive should return an error
2021-03-24 23:11:59 +03:00
kjnilsson f6f02a5d2d
ra systems wip 2021-03-22 21:44:15 +03:00
Michael Klishin 589352c31a
Remove a leftover binding
The line that used it was removed in #2765
2021-02-01 15:05:07 +03:00
Michal Kuratczyk 7cc2c7889d
Remove a log line related to CMQ leader transfer
We've removed this functionality in c7b9c39352
so we shouldn't log that we would do that.
2021-01-28 20:55:11 +01:00
Michael Klishin c7b9c39352
Don't perform CMQ leadership transfer when entering maintenance mode
The time this operation can take in clusters with a lot of classic
mirrored queues (say, tens or hundreds of thousands) can be prohibitive
for upgrades.

Upgrades can use a health check to ensure that there are in-sync
replicas before entering maintenance mode, in which case
the transfer is not really necessary.

All of the above is more obvious with the recent changes in #2749.
2021-01-27 19:11:26 +03:00
Michael Klishin bc87b9a1bb
Less intrusive CMQ leadership transfer
Since we only consider nodes hosting
in-sync replicas for transfer candidates,
we can drop only one mirror instead of N,
and reduce the load caused by this operation.

This does not affect CMQ leadership
transfer when performed in the context of
'rabbitmq-queues rebalance'.

Pair: @dcorbacho, @mkuratczyk
2021-01-26 15:35:16 +03:00
Michael Klishin 429f87913e
Synchronously add mirrors when transferring ownership
of classic mirrored queues.

There are cases when asynchronously adding mirrors makes
a lot of sense: e.g. when a new node joins the cluster.
In this case, if we add mirrors asynchronously, this
operation will race with the step that removes mirrors.
As a result, we can end up with a queue that decided
that it had no promotable replicas => data loss
from the transfer.

Closes #2749.

Pairs: @dcorbacho, @mkuratczyk
2021-01-26 14:47:15 +03:00
Michael Klishin 52479099ec
Bump (c) year 2021-01-22 09:00:14 +03:00
Philip Kuryloski a1fe3ab061 Change repo "root" to deps/rabbit
rabbit must not be the monorepo root application, as other applications depend on it
2020-11-13 14:34:42 +01:00