Commit Graph

37 Commits

Author SHA1 Message Date
Michael Klishin 968eefa1bb
Bump (c) line year
There are no functional changes to this massive diff.
2025-01-01 17:54:10 -05:00
Diana Parra Corbacho 02bca637e1 queue_SUITE: use a different upstream for each queue on multi-federation tests 2024-10-29 16:01:07 +01:00
Michael Klishin 01092ff31f
(c) year bumps 2024-01-01 22:02:20 -05:00
Michael Klishin 1b642353ca
Update (c) according to [1]
1. https://investors.broadcom.com/news-releases/news-release-details/broadcom-and-vmware-intend-close-transaction-november-22-2023
2023-11-21 23:18:22 -05:00
Diana Parra Corbacho 5f0981c5a3
Allow to use Khepri database to store metadata instead of Mnesia
[Why]

Mnesia is a very powerful and convenient tool for Erlang applications:
it is a persistent disc-based database, it handles replication accross
multiple Erlang nodes and it is available out-of-the-box from the
Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
metadata:

* virtual hosts' properties
* intenal users
* queue, exchange and binding declarations (not queues data)
* runtime parameters and policies
* ...

Unfortunately Mnesia makes it difficult to handle network partition and,
as a consequence, the merge conflicts between Erlang nodes once the
network partition is resolved. RabbitMQ provides several partition
handling strategies but they are not bullet-proof. Users still hit
situations where it is a pain to repair a cluster following a network
partition.

[How]

@kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
already uses successfully to implement quorum queues and streams for
instance. Those queues do not suffer from network partitions.

We created Khepri [2], a new persistent and replicated database engine
based on Ra and we want to use it in place of Mnesia in RabbitMQ to
solve the problems with network partitions.

This patch integrates Khepri as an experimental feature. When enabled,
RabbitMQ will store all its metadata in Khepri instead of Mnesia.

This change comes with behavior changes. While Khepri remains disabled,
you should see no changes to the behavior of RabbitMQ. If there are
changes, it is a bug. After Khepri is enabled, there are significant
changes of behavior that you should be aware of.

Because it is based on the Raft consensus algorithm, when there is a
network partition, only the cluster members that are in the partition
with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes
can "make progress". In other words, only those nodes may write to the
Khepri database and read from the database and expect a consistent
result.

For instance in a cluster of 5 RabbitMQ nodes:
* If there are two partitions, one with 3 nodes, one with 2 nodes, only
  the group of 3 nodes will be able to write to the database.
* If there are three partitions, two with 2 nodes, one with 1 node, none
  of the group can write to the database.

Because the Khepri database will be used for all kind of metadata, it
means that RabbitMQ nodes that can't write to the database will be
unable to perform some operations. A list of operations and what to
expect is documented in the associated pull request and the RabbitMQ
website.

This requirement from Raft also affects the startup of RabbitMQ nodes in
a cluster. Indeed, at least a quorum number of nodes must be started at
once to allow nodes to become ready.

To enable Khepri, you need to enable the `khepri_db` feature flag:

    rabbitmqctl enable_feature_flag khepri_db

When the `khepri_db` feature flag is enabled, the migration code
performs the following two tasks:
1. It synchronizes the Khepri cluster membership from the Mnesia
   cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
   the `khepri_mnesia_migration` application [3].
2. It copies data from relevant Mnesia tables to Khepri, doing some
   conversion if necessary on the way. Again, it uses
   `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
   it.

This can be performed on a running standalone RabbitMQ node or cluster.
Data will be migrated from Mnesia to Khepri without any service
interruption. Note that during the migration, the performance may
decrease and the memory footprint may go up.

Because this feature flag is considered experimental, it is not enabled
by default even on a brand new RabbitMQ deployment.

More about the implementation details below:

In the past months, all accesses to Mnesia were isolated in a collection
of `rabbit_db*` modules. This is where the integration of Khepri mostly
takes place: we use a function called `rabbit_khepri:handle_fallback/1`
which selects the database and perform the query or the transaction.
Here is an example from `rabbit_db_vhost`:

* Up until RabbitMQ 3.12.x:

        get(VHostName) when is_binary(VHostName) ->
            get_in_mnesia(VHostName).

* Starting with RabbitMQ 3.13.0:

        get(VHostName) when is_binary(VHostName) ->
            rabbit_khepri:handle_fallback(
              #{mnesia => fun() -> get_in_mnesia(VHostName) end,
                khepri => fun() -> get_in_khepri(VHostName) end}).

This `rabbit_khepri:handle_fallback/1` function relies on two things:
1. the fact that the `khepri_db` feature flag is enabled, in which case
   it always executes the Khepri-based variant.
4. the ability or not to read and write to Mnesia tables otherwise.

Before the feature flag is enabled, or during the migration, the
function will try to execute the Mnesia-based variant. If it succeeds,
then it returns the result. If it fails because one or more Mnesia
tables can't be used, it restarts from scratch: it means the feature
flag is being enabled and depending on the outcome, either the
Mnesia-based variant will succeed (the feature flag couldn't be enabled)
or the feature flag will be marked as enabled and it will call the
Khepri-based variant. The meat of this function really lives in the
`khepri_mnesia_migration` application [3] and
`rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
about the feature flag.

However, some calls to the database do not depend on the existence of
Mnesia tables, such as functions where we need to learn about the
members of a cluster. For those, we can't rely on exceptions from
Mnesia. Therefore, we just look at the state of the feature flag to
determine which database to use. There are two situations though:

* Sometimes, we need the feature flag state query to block because the
  function interested in it can't return a valid answer during the
  migration. Here is an example:

        case rabbit_khepri:is_enabled(RemoteNode) of
            true  -> can_join_using_khepri(RemoteNode);
            false -> can_join_using_mnesia(RemoteNode)
        end

* Sometimes, we need the feature flag state query to NOT block (for
  instance because it would cause a deadlock). Here is an example:

        case rabbit_khepri:get_feature_state() of
            enabled -> members_using_khepri();
            _       -> members_using_mnesia()
        end

Direct accesses to Mnesia still exists. They are limited to code that is
specific to Mnesia such as classic queue mirroring or network partitions
handling strategies.

Now, to discover the Mnesia tables to migrate and how to migrate them,
we use an Erlang module attribute called
`rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
tables and an associated converter module. Here is an example in the
`rabbitmq_recent_history_exchange` plugin:

    -rabbit_mnesia_tables_to_khepri_db(
       [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).

The converter module  — `rabbit_db_rh_exchange_m2k_converter` in this
example  — is is fact a "sub" converter module called but
`rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
converter module to learn more about these modules.

[1] https://github.com/rabbitmq/ra
[2] https://github.com/rabbitmq/khepri
[3] https://github.com/rabbitmq/khepri_mnesia_migration

See #7206.

Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
2023-09-29 16:00:11 +02:00
Michael Klishin ec4f1dba7d
(c) year bump: 2022 => 2023 2023-01-01 23:17:36 -05:00
Jean-Sébastien Pédron 6e9ee4d0da
Remove test code which depended on the `quorum_queue` feature flags
These checks are now irrelevant as the feature flag is required.
2022-08-01 12:41:30 +02:00
Michael Klishin b18cd3e749
Erlang 25 compat: drop a duplicate flaky test
The test largely does what it's pattern-based version does.
During our Erlang 24 and 25 upgrade cycles, this version required
extra attention even though the feature works as advertised in
other tests and during manual testing.
2022-04-05 07:44:25 +04:00
Michael Klishin 7b57750e4e
Federated queue suite: make it easier to identify messages 2022-03-29 09:03:02 +04:00
Michael Klishin c38a3d697d
Bump (c) year 2022-03-21 01:21:56 +04:00
Loïc Hoguin dfe0f4b0a6
Clear federation before deleting message_flow queues
Without this change the test could take a very long time to
cleanup the queues and finish because of a race condition
between the queue deletion and the federation link being
restarted and declaring the queue again.

(The test bidirectional was renamed to message_flow to better
 represent what it is doing.)
2021-09-14 14:46:36 +02:00
Philip Kuryloski d6399bbb5b
Mixed version testing in bazel (#3200)
Unlike with gnu make, mixed version testing with bazel uses a package-generic-unix for the secondary umbrella rather than the source. This brings the benefit of being able to mixed version test releases built with older erlang versions (even though all nodes will run under the single version given to bazel)

This introduces new test labels, adding a `-mixed` suffix for every existing test. They can be skipped if necessary with `--test_tag_filters` (see the github actions workflow for an example)

As part of the change, it is now possible to run an old release of rabbit with rabbitmq_run rule, such as:

`bazel run @rabbitmq-server-generic-unix-3.8.17//:rabbitmq-run run-broker`
2021-07-19 14:33:25 +02:00
Philip Kuryloski 8a339aae6c Fixup rabbitmq_federation:queue_SUITE for mixed version testing 2021-07-13 18:05:04 +02:00
Michael Klishin 62e24460b3
Avoid conflicting queue declarations from tests
this is important now that different federation sides
use queues of different types in the mixed group.
2021-06-16 22:32:44 +08:00
Karl Nilsson c9026daecc test case 2021-06-16 13:49:16 +02:00
Michael Klishin 98724eff09
pg2 => pg for OTP 24 compatibility
there is still one failing queue federation test.
2021-03-03 19:01:12 +03:00
dcorbacho 699cd1ab29 Add federation support for quorum queues 2021-02-18 17:15:47 +01:00
Michael Klishin 52479099ec
Bump (c) year 2021-01-22 09:00:14 +03:00
Philip Kuryloski ad401bb9dd Increase rabbitmq_federation test timeouts
To address test flakiness
2020-10-23 10:32:59 +02:00
Philip Kuryloski 4323c643e7 rabbitmq_federation: use dynamic waits where possible
to address spurious failures in CI

Building upon recent changes and discussion with @michaelklishin
2020-10-13 11:19:31 +02:00
Michael Klishin 1aa2e059bb Rework two more tests
Per discussion with @pjk25.
2020-10-13 06:26:23 +03:00
Philip Kuryloski 287d0c67e6 Replace a fixed test suite delay with a dynamic
queue_SUITE > without_disambiguate > cluster_size_1 >
dynamic_plugin_stop_start fails often in CI
2020-09-29 16:07:37 +02:00
dcorbacho 26a4365188 Switch to Mozilla Public License 2.0 (MPL 2.0) 2020-07-12 22:52:53 +01:00
Jean-Sébastien Pédron fab830b862 queue_SUITE: Wait for federation in `dynamic_plugin_stop_start` 2020-03-30 17:07:07 +02:00
Jean-Sébastien Pédron c278947717 Update copyright (year 2020) 2020-03-10 16:10:02 +01:00
Michael Klishin 8b5bd10c43 (c) bump 2019-12-29 05:50:28 +03:00
Michael Klishin 05c89aae39 Minor test improvements 2019-03-22 14:59:52 +03:00
Luke Bakken dd10964e48 Change regex to pattern 2019-03-21 12:03:35 -07:00
Diego Herrera Celedón 1ab3c5071c Implement some tests 2019-03-20 19:11:07 -03:00
Spring Operator ef766e32f0 URL Cleanup
This commit updates URLs to prefer the https protocol. Redirects are not followed to avoid accidentally expanding intentionally shortened URLs (i.e. if using a URL shortener).

# HTTP URLs that Could Not Be Fixed
These URLs were unable to be fixed. Please review them to see if they can be manually resolved.

* http://blog.listincomprehension.com/search/label/procket (200) with 1 occurrences could not be migrated:
   ([https](https://blog.listincomprehension.com/search/label/procket) result ClosedChannelException).
* http://dozzie.jarowit.net/trac/wiki/TOML (200) with 1 occurrences could not be migrated:
   ([https](https://dozzie.jarowit.net/trac/wiki/TOML) result SSLHandshakeException).
* http://dozzie.jarowit.net/trac/wiki/subproc (200) with 1 occurrences could not be migrated:
   ([https](https://dozzie.jarowit.net/trac/wiki/subproc) result SSLHandshakeException).
* http://e2project.org (200) with 1 occurrences could not be migrated:
   ([https](https://e2project.org) result AnnotatedConnectException).
* http://erlang.2086793.n4.nabble.com/initializing-library-applications-without-processes-td2094473.html (200) with 1 occurrences could not be migrated:
   ([https](https://erlang.2086793.n4.nabble.com/initializing-library-applications-without-processes-td2094473.html) result SSLHandshakeException).
* http://nitrogenproject.com/ (200) with 2 occurrences could not be migrated:
   ([https](https://nitrogenproject.com/) result ConnectTimeoutException).
* http://proper.softlab.ntua.gr (200) with 1 occurrences could not be migrated:
   ([https](https://proper.softlab.ntua.gr) result SSLHandshakeException).
* http://yaws.hyber.org (200) with 1 occurrences could not be migrated:
   ([https](https://yaws.hyber.org) result AnnotatedConnectException).
* http://choven.ca (503) with 1 occurrences could not be migrated:
   ([https](https://choven.ca) result ConnectTimeoutException).

# Fixed URLs

## Fixed But Review Recommended
These URLs were fixed, but the https status was not OK. However, the https status was the same as the http request or http redirected to an https URL, so they were migrated. Your review is recommended.

* http://fixprotocol.org/ (301) with 1 occurrences migrated to:
  https://fixtrading.org ([https](https://fixprotocol.org/) result SSLHandshakeException).
* http://erldb.org (UnknownHostException) with 1 occurrences migrated to:
  https://erldb.org ([https](https://erldb.org) result UnknownHostException).

## Fixed Success
These URLs were switched to an https URL with a 2xx status. While the status was successful, your review is still recommended.

* http://cloudi.org/ with 27 occurrences migrated to:
  https://cloudi.org/ ([https](https://cloudi.org/) result 200).
* http://erlware.org/ with 1 occurrences migrated to:
  https://erlware.org/ ([https](https://erlware.org/) result 200).
* http://inaka.github.io/cowboy-trails/ with 1 occurrences migrated to:
  https://inaka.github.io/cowboy-trails/ ([https](https://inaka.github.io/cowboy-trails/) result 200).
* http://ninenines.eu with 6 occurrences migrated to:
  https://ninenines.eu ([https](https://ninenines.eu) result 200).
* http://www.actordb.com/ with 2 occurrences migrated to:
  https://www.actordb.com/ ([https](https://www.actordb.com/) result 200).
* http://www.cs.kent.ac.uk/projects/wrangler/Home.html with 1 occurrences migrated to:
  https://www.cs.kent.ac.uk/projects/wrangler/Home.html ([https](https://www.cs.kent.ac.uk/projects/wrangler/Home.html) result 200).
* http://www.rabbitmq.com/federation.html with 1 occurrences migrated to:
  https://www.rabbitmq.com/federation.html ([https](https://www.rabbitmq.com/federation.html) result 200).
* http://www.rebar3.org with 1 occurrences migrated to:
  https://www.rebar3.org ([https](https://www.rebar3.org) result 200).
* http://contributor-covenant.org with 1 occurrences migrated to:
  https://contributor-covenant.org ([https](https://contributor-covenant.org) result 301).
* http://contributor-covenant.org/version/1/3/0/ with 1 occurrences migrated to:
  https://contributor-covenant.org/version/1/3/0/ ([https](https://contributor-covenant.org/version/1/3/0/) result 301).
* http://inaka.github.com/apns4erl with 1 occurrences migrated to:
  https://inaka.github.com/apns4erl ([https](https://inaka.github.com/apns4erl) result 301).
* http://inaka.github.com/edis/ with 1 occurrences migrated to:
  https://inaka.github.com/edis/ ([https](https://inaka.github.com/edis/) result 301).
* http://lasp-lang.org/ with 1 occurrences migrated to:
  https://lasp-lang.org/ ([https](https://lasp-lang.org/) result 301).
* http://saleyn.github.com/erlexec with 1 occurrences migrated to:
  https://saleyn.github.com/erlexec ([https](https://saleyn.github.com/erlexec) result 301).
* http://www.mozilla.org/MPL/ with 27 occurrences migrated to:
  https://www.mozilla.org/MPL/ ([https](https://www.mozilla.org/MPL/) result 301).
* http://zhongwencool.github.io/observer_cli with 1 occurrences migrated to:
  https://zhongwencool.github.io/observer_cli ([https](https://zhongwencool.github.io/observer_cli) result 301).
2019-03-20 08:16:24 -05:00
Michael Klishin 898cf32100 Merge branch 'stable' 2017-04-02 21:57:33 +03:00
Michael Klishin ab6ccc5f1f (c) year 2017-04-02 21:47:58 +03:00
Diana Corbacho 9139e8e43f Merge branch 'stable' 2017-01-03 15:11:37 +01:00
Diana Corbacho 600b29df43 Add tests for federation status/lookup 2016-12-29 10:06:56 +01:00
Diana Corbacho 9b70d870b8 Test refactor 2016-11-21 10:06:04 +00:00
Daniil Fedotov a4c514f39a Test cleanup 2016-10-14 14:44:53 +01:00
Jean-Sébastien Pédron 4e70c06538 Switch testsuite to common_test
While here, update Erlang.mk. It fixes the use of the `t=` argument to
`make ct-$suite`.

[#121412011]
2016-06-24 12:49:37 +02:00