Commit Graph

65 Commits

Author SHA1 Message Date
Michael Klishin 968eefa1bb
Bump (c) line year
Despite the size of this diff, there are no functional changes.
2025-01-01 17:54:10 -05:00
David Ansari e2a113605d Disallow transient entities in RabbitMQ AMQP 1.0 Erlang client
Transient (i.e. `durable=false`) exchanges and queues are deprecated.

Khepri will store all entities durably.
(Even exclusive queues will be stored durably. Exclusive queues are
still deleted when the declaring connection is closed.)

Similar to how the RabbitMQ AMQP 1.0 Java client already disallows the
creation of transient exchanges and queues, this commit will prohibit
the declaration of transient exchanges and queues in the RabbitMQ
AMQP 1.0 Erlang client starting with RabbitMQ 4.1.
2024-12-16 16:17:55 +01:00
David Ansari b1eb354385 Strictly validate annotations 2024-09-18 12:42:27 +02:00
David Ansari cdc5b886f8 Fix crash in consistent hash exchange
Prior to this commit, a crash occurred when a consistent hash exchange
got declared with a `hash-header` argument, but the publishing client
didn't set that header on the message.

This bug is present in RabbitMQ 3.13.0 - 3.13.6.
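
For context, a minimal reproduction sketch using the AMQP 0-9-1 Erlang
client (`amqp_client`); the module, exchange, queue and header names are
illustrative and not part of this commit:

```
%% Hypothetical reproduction module; connection parameters are defaults.
-module(hash_header_repro).
-export([run/0]).

-include_lib("amqp_client/include/amqp_client.hrl").

run() ->
    {ok, Conn} = amqp_connection:start(#amqp_params_network{}),
    {ok, Ch} = amqp_connection:open_channel(Conn),
    %% Consistent hash exchange that hashes on the "hash-on" message header.
    #'exchange.declare_ok'{} =
        amqp_channel:call(Ch, #'exchange.declare'{
                                 exchange  = <<"e.hash">>,
                                 type      = <<"x-consistent-hash">>,
                                 arguments = [{<<"hash-header">>, longstr,
                                               <<"hash-on">>}]}),
    #'queue.declare_ok'{queue = Q} =
        amqp_channel:call(Ch, #'queue.declare'{exclusive = true}),
    #'queue.bind_ok'{} =
        amqp_channel:call(Ch, #'queue.bind'{queue       = Q,
                                            exchange    = <<"e.hash">>,
                                            routing_key = <<"1">>}),
    %% Publish WITHOUT the "hash-on" header; on 3.13.0 - 3.13.6 this
    %% publish triggered the crash fixed by this commit.
    ok = amqp_channel:cast(Ch, #'basic.publish'{exchange = <<"e.hash">>},
                           #amqp_msg{payload = <<"no hash header set">>}),
    ok = amqp_channel:close(Ch),
    ok = amqp_connection:close(Conn).
```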

Fixes https://github.com/rabbitmq/rabbitmq-server/discussions/11671
2024-07-24 11:42:59 +02:00
Michael Davis da83358a4a
Respect RABBITMQ_METADATA_STORE in consistent hash exchange suite 2024-07-10 13:46:22 -04:00
David Ansari 1d02ea9e55 Fix crashes when message gets dead lettered
Fix crashes occurring when a message originally sent via AMQP is
stored in a classic or quorum queue and subsequently dead lettered,
and the dead letter exchange needs access to the message annotations,
properties, or application-properties.
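
As a hedged illustration of the kind of topology involved (AMQP 0-9-1
Erlang client; the queue and exchange names are made up for this sketch),
a quorum queue configured to dead letter into another exchange:

```
%% Hypothetical helper; assumes an open channel Ch and an existing "dlx" exchange.
-include_lib("amqp_client/include/amqp_client.hrl").

declare_with_dlx(Ch) ->
    %% Messages rejected or expired in this quorum queue are forwarded to
    %% the dead letter exchange "dlx" with routing key "expired".
    #'queue.declare_ok'{} =
        amqp_channel:call(
          Ch, #'queue.declare'{
                 queue     = <<"q.orders">>,
                 durable   = true,
                 arguments = [{<<"x-queue-type">>, longstr, <<"quorum">>},
                              {<<"x-dead-letter-exchange">>, longstr, <<"dlx">>},
                              {<<"x-dead-letter-routing-key">>, longstr,
                               <<"expired">>}]}),
    ok.
```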
2024-05-02 07:56:00 +00:00
Michael Klishin 01092ff31f
(c) year bumps 2024-01-01 22:02:20 -05:00
Michael Klishin 1b642353ca
Update (c) according to [1]
1. https://investors.broadcom.com/news-releases/news-release-details/broadcom-and-vmware-intend-close-transaction-november-22-2023
2023-11-21 23:18:22 -05:00
Diana Parra Corbacho 5f0981c5a3
Allow using the Khepri database to store metadata instead of Mnesia
[Why]

Mnesia is a very powerful and convenient tool for Erlang applications:
it is a persistent disc-based database, it handles replication across
multiple Erlang nodes and it is available out-of-the-box from the
Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
metadata:

* virtual hosts' properties
* internal users
* queue, exchange and binding declarations (not the queues' data)
* runtime parameters and policies
* ...

Unfortunately, Mnesia makes it difficult to handle network partitions
and, as a consequence, the merge conflicts between Erlang nodes once the
network partition is resolved. RabbitMQ provides several partition
handling strategies but they are not bullet-proof. Users still hit
situations where it is a pain to repair a cluster following a network
partition.

[How]

@kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
already uses successfully to implement, for instance, quorum queues and
streams. Those queue types do not suffer from the network partition
issues described above.

We created Khepri [2], a new persistent and replicated database engine
based on Ra and we want to use it in place of Mnesia in RabbitMQ to
solve the problems with network partitions.

This patch integrates Khepri as an experimental feature. When enabled,
RabbitMQ will store all its metadata in Khepri instead of Mnesia.

This patch brings behavior changes. While Khepri remains disabled, you
should see no changes to RabbitMQ's behavior; if you do see changes, it
is a bug. After Khepri is enabled, there are significant behavior
changes that you should be aware of.

Because it is based on the Raft consensus algorithm, when there is a
network partition, only the cluster members that are in the partition
with at least `(Number of nodes in the cluster ÷ 2) + 1` nodes can
"make progress". In other words, only those nodes may write to the
Khepri database and read from the database and expect a consistent
result.
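
As a small sketch of that computation in plain Erlang (integer division
assumed, as used for Raft majorities):

```
%% Minimum number of cluster members a partition needs to make progress.
quorum_size(NumberOfNodes) ->
    NumberOfNodes div 2 + 1.
%% quorum_size(5) =:= 3,  quorum_size(4) =:= 3,  quorum_size(3) =:= 2.
```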

For instance in a cluster of 5 RabbitMQ nodes:
* If there are two partitions, one with 3 nodes, one with 2 nodes, only
  the group of 3 nodes will be able to write to the database.
* If there are three partitions, two with 2 nodes and one with 1 node, none
  of the groups can write to the database.

Because the Khepri database will be used for all kinds of metadata,
RabbitMQ nodes that can't write to the database will be
unable to perform some operations. A list of operations and what to
expect is documented in the associated pull request and the RabbitMQ
website.

This requirement from Raft also affects the startup of RabbitMQ nodes in
a cluster. Indeed, at least a quorum number of nodes must be started at
once to allow nodes to become ready.

To enable Khepri, you need to enable the `khepri_db` feature flag:

    rabbitmqctl enable_feature_flag khepri_db

When the `khepri_db` feature flag is enabled, the migration code
performs the following two tasks:
1. It synchronizes the Khepri cluster membership from the Mnesia
   cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
   the `khepri_mnesia_migration` application [3].
2. It copies data from relevant Mnesia tables to Khepri, doing some
   conversion if necessary on the way. Again, it uses
   `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
   it.

This can be performed on a running standalone RabbitMQ node or cluster.
Data will be migrated from Mnesia to Khepri without any service
interruption. Note that during the migration, the performance may
decrease and the memory footprint may go up.

Because this feature flag is considered experimental, it is not enabled
by default even on a brand new RabbitMQ deployment.

More about the implementation details below:

In the past months, all accesses to Mnesia were isolated in a collection
of `rabbit_db*` modules. This is where the integration of Khepri mostly
takes place: we use a function called `rabbit_khepri:handle_fallback/1`
which selects the database and performs the query or the transaction.
Here is an example from `rabbit_db_vhost`:

* Up until RabbitMQ 3.12.x:

        get(VHostName) when is_binary(VHostName) ->
            get_in_mnesia(VHostName).

* Starting with RabbitMQ 3.13.0:

        get(VHostName) when is_binary(VHostName) ->
            rabbit_khepri:handle_fallback(
              #{mnesia => fun() -> get_in_mnesia(VHostName) end,
                khepri => fun() -> get_in_khepri(VHostName) end}).

This `rabbit_khepri:handle_fallback/1` function relies on two things:
1. the fact that the `khepri_db` feature flag is enabled, in which case
   it always executes the Khepri-based variant;
2. otherwise, whether the Mnesia tables can still be read from and
   written to.

Before the feature flag is enabled, or during the migration, the
function will try to execute the Mnesia-based variant. If it succeeds,
then it returns the result. If it fails because one or more Mnesia
tables can't be used, it restarts from scratch: such a failure means the
feature flag is being enabled and, depending on the outcome, either the
Mnesia-based variant will succeed (the feature flag could not be enabled)
or the feature flag will be marked as enabled and the Khepri-based
variant will be called. The meat of this function really lives in the
`khepri_mnesia_migration` application [3] and
`rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
about the feature flag.

However, some calls to the database do not depend on the existence of
Mnesia tables, such as functions where we need to learn about the
members of a cluster. For those, we can't rely on exceptions from
Mnesia. Therefore, we just look at the state of the feature flag to
determine which database to use. There are two situations though:

* Sometimes, we need the feature flag state query to block because the
  function interested in it can't return a valid answer during the
  migration. Here is an example:

        case rabbit_khepri:is_enabled(RemoteNode) of
            true  -> can_join_using_khepri(RemoteNode);
            false -> can_join_using_mnesia(RemoteNode)
        end

* Sometimes, we need the feature flag state query to NOT block (for
  instance because it would cause a deadlock). Here is an example:

        case rabbit_khepri:get_feature_state() of
            enabled -> members_using_khepri();
            _       -> members_using_mnesia()
        end

Direct accesses to Mnesia still exist. They are limited to code that is
specific to Mnesia, such as classic queue mirroring or network partition
handling strategies.

Now, to discover the Mnesia tables to migrate and how to migrate them,
we use an Erlang module attribute called
`rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
tables and an associated converter module. Here is an example in the
`rabbitmq_recent_history_exchange` plugin:

    -rabbit_mnesia_tables_to_khepri_db(
       [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).

The converter module (`rabbit_db_rh_exchange_m2k_converter` in this
example) is in fact a "sub" converter module called by
`rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
converter module to learn more about these modules.

[1] https://github.com/rabbitmq/ra
[2] https://github.com/rabbitmq/khepri
[3] https://github.com/rabbitmq/khepri_mnesia_migration

See #7206.

Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
2023-09-29 16:00:11 +02:00
Michael Klishin ec4f1dba7d
(c) year bump: 2022 => 2023 2023-01-01 23:17:36 -05:00
Luke Bakken 7fe159edef
Yolo-replace format strings
Replaces `~s` and `~p` with their unicode-friendly counterparts.

```
git ls-files *.erl | xargs sed -i.ORIG -e s/~s>/~ts>/g -e s/~p>/~tp>/g
```
2022-10-10 10:32:03 +04:00
David Ansari 878f369b7a Make adding bindings idempotent
First binding wins.
Duplicate bindings, i.e. bindings with the same source exchange and the
same destination queue / exchange but a possibly different routing key
(weight), are from now on ignored by the consistent hash exchange.

This applies only to bindings being added.
For bindings being deleted, any duplicate binding (independent of its
routing key) will delete all buckets for the given source and
destination. (This is to ensure that buckets for a given source and
destination can be deleted when upgrading from a version prior
to this commit. This was also the behaviour prior to this commit,
so nothing changes in that regard.)

Note that duplicate bindings continue to be created in RabbitMQ.
(They are only ignored by the consistent hash exchange.)
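
To illustrate, a hedged sketch with the AMQP 0-9-1 Erlang client (the
channel, queue and exchange names are made up); the second binding below
differs only in its routing key (weight), so the consistent hash exchange
ignores it:

```
-include_lib("amqp_client/include/amqp_client.hrl").

bind_twice(Ch) ->
    Bind = fun(Weight) ->
                   amqp_channel:call(
                     Ch, #'queue.bind'{queue       = <<"q1">>,
                                       exchange    = <<"e.hash">>,
                                       routing_key = Weight})
           end,
    %% First binding wins: two buckets are created for e.hash -> q1.
    #'queue.bind_ok'{} = Bind(<<"2">>),
    %% Same source and destination: the binding is still created in RabbitMQ
    %% but ignored by the consistent hash exchange's ring.
    #'queue.bind_ok'{} = Bind(<<"5">>),
    ok.
```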

Adding a binding performs a linear search in the bucket map.
This is already stated in the README:
"These two operations use linear algorithms to update the ring."

The linear search when adding a binding could be optimised by adding
another Mnesia table field, which would require a new migration and
feature flag. Hence, such an optimisation is left out of this commit.

Fixes #3386.
2022-06-30 09:24:02 +00:00
Michael Klishin c38a3d697d
Bump (c) year 2022-03-21 01:21:56 +04:00
Philip Kuryloski 078321cce5 Attempt to reduce test flakes in consistent_hash_exchange suite 2022-03-11 15:37:00 +01:00
Falcon Taylor-Carter 4dab02289d Add tests for duplicate binding scenarios 2021-10-18 23:25:06 -04:00
Michael Klishin 52479099ec
Bump (c) year 2021-01-22 09:00:14 +03:00
Jean-Sébastien Pédron 1cfb526a48 rabbit_exchange_type_consistent_hash_SUITE: Bump wait_for_confirms to 5 minutes
We still get failures in CI. Let's see how it goes with a very large
timeout value.
2020-11-04 17:29:49 +01:00
Jean-Sébastien Pédron de1cccff7a rabbit_exchange_type_consistent_hash_SUITE: Use ?assertEqual instead of matching
The reported error will provide more information.
2020-11-04 17:28:42 +01:00
Jean-Sébastien Pédron cd6c8e25cf rabbit_exchange_type_consistent_hash_SUITE: Remove trailing whitespaces 2020-11-04 17:28:18 +01:00
Jean-Sébastien Pédron c7354f0f45 rabbit_exchange_type_consistent_hash_SUITE: Wait for confirms for 60 seconds
Switching from 5000 seconds to 5 seconds, after we discovered that this
API expects seconds instead of milliseconds, made the wait too short.
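
For reference, a hedged one-liner (Ch is an assumed open channel): at the
time of these commits, the timeout argument of
`amqp_channel:wait_for_confirms/2` was interpreted as seconds, not
milliseconds.

```
%% Wait up to 60 seconds (not 60 milliseconds) for publisher confirms.
amqp_channel:wait_for_confirms(Ch, 60).
```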
2020-11-04 11:06:13 +01:00
Luke Bakken 868bd77859 wait_for_confirms timeout is in seconds
References rabbitmq/rabbitmq-erlang-client#138

cc @dumbbell
2020-11-02 10:49:50 -08:00
dcorbacho 5d348bd3a1 Switch to Mozilla Public License 2.0 (MPL 2.0) 2020-07-11 19:45:03 +01:00
Michael Klishin 2d0adc176b Integration tests for #45 2020-06-12 13:00:21 +03:00
Jean-Sébastien Pédron f73587775a Update copyright (year 2020) 2020-03-10 16:08:09 +01:00
Michael Klishin 8c28dff573 (c) bump 2019-12-29 05:50:26 +03:00
Spring Operator f1ac305a24 URL Cleanup
This commit updates URLs to prefer the https protocol. Redirects are not followed to avoid accidentally expanding intentionally shortened URLs (i.e. if using a URL shortener).

# HTTP URLs that Could Not Be Fixed
These URLs were unable to be fixed. Please review them to see if they can be manually resolved.

* http://blog.listincomprehension.com/search/label/procket (200) with 1 occurrences could not be migrated:
   ([https](https://blog.listincomprehension.com/search/label/procket) result ClosedChannelException).
* http://dozzie.jarowit.net/trac/wiki/TOML (200) with 1 occurrences could not be migrated:
   ([https](https://dozzie.jarowit.net/trac/wiki/TOML) result SSLHandshakeException).
* http://dozzie.jarowit.net/trac/wiki/subproc (200) with 1 occurrences could not be migrated:
   ([https](https://dozzie.jarowit.net/trac/wiki/subproc) result SSLHandshakeException).
* http://e2project.org (200) with 1 occurrences could not be migrated:
   ([https](https://e2project.org) result AnnotatedConnectException).
* http://michaelnielsen.org/blog/consistent-hashing/ (200) with 1 occurrences could not be migrated:
   ([https](https://michaelnielsen.org/blog/consistent-hashing/) result SSLHandshakeException).
* http://nitrogenproject.com/ (200) with 2 occurrences could not be migrated:
   ([https](https://nitrogenproject.com/) result ConnectTimeoutException).
* http://proper.softlab.ntua.gr (200) with 1 occurrences could not be migrated:
   ([https](https://proper.softlab.ntua.gr) result SSLHandshakeException).
* http://rubybunny.info (200) with 1 occurrences could not be migrated:
   ([https](https://rubybunny.info) result AnnotatedConnectException).
* http://www.martinbroadhurst.com/Consistent-Hash-Ring.html (200) with 1 occurrences could not be migrated:
   ([https](https://www.martinbroadhurst.com/Consistent-Hash-Ring.html) result SSLHandshakeException).
* http://yaws.hyber.org (200) with 1 occurrences could not be migrated:
   ([https](https://yaws.hyber.org) result AnnotatedConnectException).
* http://choven.ca (503) with 1 occurrences could not be migrated:
   ([https](https://choven.ca) result ConnectTimeoutException).

# Fixed URLs

## Fixed But Review Recommended
These URLs were fixed, but the https status was not OK. However, the https status was the same as the http request or http redirected to an https URL, so they were migrated. Your review is recommended.

* http://fixprotocol.org/ (301) with 1 occurrences migrated to:
  https://fixtrading.org ([https](https://fixprotocol.org/) result SSLHandshakeException).
* http://erldb.org (UnknownHostException) with 1 occurrences migrated to:
  https://erldb.org ([https](https://erldb.org) result UnknownHostException).

## Fixed Success
These URLs were switched to an https URL with a 2xx status. While the status was successful, your review is still recommended.

* http://cloudi.org/ with 27 occurrences migrated to:
  https://cloudi.org/ ([https](https://cloudi.org/) result 200).
* http://en.wikipedia.org/wiki/Consistent_hashing with 1 occurrences migrated to:
  https://en.wikipedia.org/wiki/Consistent_hashing ([https](https://en.wikipedia.org/wiki/Consistent_hashing) result 200).
* http://erlware.org/ with 1 occurrences migrated to:
  https://erlware.org/ ([https](https://erlware.org/) result 200).
* http://inaka.github.io/cowboy-trails/ with 1 occurrences migrated to:
  https://inaka.github.io/cowboy-trails/ ([https](https://inaka.github.io/cowboy-trails/) result 200).
* http://ninenines.eu with 6 occurrences migrated to:
  https://ninenines.eu ([https](https://ninenines.eu) result 200).
* http://www.actordb.com/ with 2 occurrences migrated to:
  https://www.actordb.com/ ([https](https://www.actordb.com/) result 200).
* http://www.cs.kent.ac.uk/projects/wrangler/Home.html with 1 occurrences migrated to:
  https://www.cs.kent.ac.uk/projects/wrangler/Home.html ([https](https://www.cs.kent.ac.uk/projects/wrangler/Home.html) result 200).
* http://www.rabbitmq.com/plugins.html with 1 occurrences migrated to:
  https://www.rabbitmq.com/plugins.html ([https](https://www.rabbitmq.com/plugins.html) result 200).
* http://www.rebar3.org with 1 occurrences migrated to:
  https://www.rebar3.org ([https](https://www.rebar3.org) result 200).
* http://contributor-covenant.org with 1 occurrences migrated to:
  https://contributor-covenant.org ([https](https://contributor-covenant.org) result 301).
* http://contributor-covenant.org/version/1/3/0/ with 1 occurrences migrated to:
  https://contributor-covenant.org/version/1/3/0/ ([https](https://contributor-covenant.org/version/1/3/0/) result 301).
* http://inaka.github.com/apns4erl with 1 occurrences migrated to:
  https://inaka.github.com/apns4erl ([https](https://inaka.github.com/apns4erl) result 301).
* http://inaka.github.com/edis/ with 1 occurrences migrated to:
  https://inaka.github.com/edis/ ([https](https://inaka.github.com/edis/) result 301).
* http://lasp-lang.org/ with 1 occurrences migrated to:
  https://lasp-lang.org/ ([https](https://lasp-lang.org/) result 301).
* http://saleyn.github.com/erlexec with 1 occurrences migrated to:
  https://saleyn.github.com/erlexec ([https](https://saleyn.github.com/erlexec) result 301).
* http://www.mozilla.org/MPL/ with 3 occurrences migrated to:
  https://www.mozilla.org/MPL/ ([https](https://www.mozilla.org/MPL/) result 301).
* http://zhongwencool.github.io/observer_cli with 1 occurrences migrated to:
  https://zhongwencool.github.io/observer_cli ([https](https://zhongwencool.github.io/observer_cli) result 301).
2019-03-20 03:13:58 -05:00
Diana Corbacho 1b75f7beec More tests for #40
Use server-named queues
2019-01-07 14:03:19 +00:00
Diana Corbacho 81836b3531 Refactor test code
References #40.
2019-01-07 13:20:23 +00:00
Michael Klishin 235bacb3b4 More tests for #40 2019-01-07 15:56:17 +03:00
Michael Klishin 93626ee0c8 Add a failing test for the scenario outlined in #40 2019-01-07 15:24:12 +03:00
Michael Klishin 0638c70552 Make chi squared test an observation we log, not an assertion
Due to the randomness of the inputs and other characteristics that vary
between environments, it doesn't always end up below the expected
value, but there's plenty of evidence that in most environments
the resulting distribution is very uniform (for all intents and
purposes of this plugin anyway).

References #37, #39.
2018-08-31 23:51:36 +02:00
Michael Klishin 6ace19d972 Use only a subset of queues in routing tests 2018-08-28 20:01:25 +03:00
Michael Klishin 0b1776d59d More tests, more idempotent binding management operations
[#159822323]
2018-08-28 19:53:52 +03:00
Michael Klishin b368ee922e Increase sample count to pass Chi squared test in more environments, reorganise tests
We still depend on the PRNG to provide a reasonably uniform distribution
of inputs (e.g. routing keys) but things pass in at least 3 different environments
reliably with 150K iterations.

Pair: @dcorbacho.

References #37, #38.
2018-08-21 16:40:21 +03:00
Michael Klishin ab5f54ee8f Bring back the Chi squared test assertion, bump the number of samples 2018-08-21 16:23:10 +03:00
Michael Klishin 67fe821b79 Fix a warning 2018-08-21 16:03:18 +03:00
Michael Klishin 05e7cc756f Don't assert on Chi squared test value
In some environments, namely our Concourse containers, with *some* iterations
of the test the value exceeds the reference value of p-value = 0.01.

This may be specific to OTP 19.3 or certain platforms. This is not
something that I can reproduce in a number of OTP 21 environments.

References #37, #38.
2018-08-21 06:50:04 +03:00
Michael Klishin d6e9fd9b9e Test suite improvements
* Use publisher confirms; that's what the test really needs
* Clean up exchanges before setting up topology to make sure failing tests
  do not leave anything behind
2018-08-20 19:47:43 +03:00
Michael Klishin e132a0a865 A typo 2018-08-20 18:31:56 +03:00
Michael Klishin b887efdcf3 Extract a few test helpers 2018-08-20 15:05:01 +03:00
Diana Corbacho f319c84343 Test different bucket sizes 2018-08-20 11:07:12 +01:00
Diana Corbacho 02c5be2d54 Test - and fix - binding cleanup 2018-08-20 08:57:35 +01:00
Diana Corbacho d82a77cecc Verify distribution using chi-square test 2018-08-17 12:12:21 +01:00
Michael Klishin 72623501a4 Merge branch 'stable' 2017-04-02 21:56:37 +03:00
Michael Klishin c30520abbd (c) year 2017-04-02 21:47:35 +03:00
Jean-Sébastien Pédron a5323b7379 Use `rand` directly in master because we require Erlang 18.3
References rabbitmq/rabbitmq-server#860.
[#122335241]
2016-06-30 17:30:20 +02:00
Jean-Sébastien Pédron 9aa6728140 Use the new `rand_compat` module to transition from `random` to `rand`
References #860.
[#122335241]
2016-06-29 13:25:19 +02:00
Michael Klishin 54d88af579 Not really probabilistic 2016-06-23 00:17:31 +03:00
Michael Klishin 619160c4b4 Switch test suite to Common Test
Fixes #21.
2016-06-20 16:30:26 +03:00
Michael Klishin 3e06644577 Update (c) info 2016-01-01 12:59:16 +03:00