Commit Graph

240 Commits

Author SHA1 Message Date
Karl Nilsson a2c3610fb5 osiris v1.8.0 2024-02-27 19:29:15 -05:00
Jean-Sébastien Pédron 2aa7e92818
Bump Khepri from 0.12.0 to 0.12.1
Release notes:
https://github.com/rabbitmq/khepri/releases/tag/v0.12.1
2024-02-21 11:34:53 +01:00
Rin Kuryloski 5dc6df2841 Add back osiris from BCR
gazelle update-repos does not correctly generate the bazel build for
it, because it does not pick up the application's env property
2024-02-20 14:19:54 +01:00
Rin Kuryloski dcfb05774e Add a comment for the osiris update-repos command 2024-02-20 11:13:51 +01:00
Rin Kuryloski de2305992f Do not use BCR for ra, osiris, or seshat
Because khepri is not bazel-native, ra and seshat needed to be
declared twice and manually synchronized. This allows them to be
declared just once.

looking_glass remains a bazel_dep, since it has native extensions
2024-02-20 11:07:49 +01:00
Jean-Sébastien Pédron 53139ce99c
Bump Khepri from 0.11.0 to 0.12.0
... and khepri_mnesia_migration from 0.3.0 to 0.4.0.

Release notes:
* Khepri: https://github.com/rabbitmq/khepri/releases/tag/v0.12.0
* khepri_mnesia_migration: https://github.com/rabbitmq/khepri_mnesia_migration/releases/tag/v0.4.0
2024-02-12 15:01:18 +01:00
GitHub 5faa9937ef Adopt otp 25.3.2.9 2024-02-10 03:06:04 +00:00
GitHub 0e275eead4 Adopt otp 26.2.2 2024-02-09 18:48:25 -05:00
Karl Nilsson 2df3fc16aa Ra 2.9.1
This Ra release contains fixes for leaderboard updates as well
as a fix for a long-standing bug that meant the latest cluster may not
be recovered correctly after an unclean shutdown.
2024-02-09 14:42:44 +00:00
Jean-Sébastien Pédron d5624d976f
Bump khepri_mnesia_migration from 0.2.1 to 0.3.0
Release notes:
https://github.com/rabbitmq/khepri_mnesia_migration/releases/tag/v0.3.0
2024-01-31 19:12:27 +01:00
Jean-Sébastien Pédron d8abd3ac2b
Bump Khepri from 0.10.0 to 0.11.0
Release notes:
https://github.com/rabbitmq/khepri/releases/tag/v0.11.0
2024-01-31 19:12:27 +01:00
Jean-Sébastien Pédron 967533967e
Bump Ra from 2.7.1 to 2.7.3
Release notes:
* https://github.com/rabbitmq/ra/releases/tag/v2.7.2
* https://github.com/rabbitmq/ra/releases/tag/v2.7.3
2024-01-31 19:12:23 +01:00
Loïc Hoguin 865bc22fb8
Update Cowboy to 2.11 2024-01-29 16:27:59 +01:00
Michal Kuratczyk 8d2657ef2f Adopt OTP 26.2.1 2024-01-09 10:14:08 +01:00
Karl Nilsson e9317bd31e Ra v2.7.1
Includes:

Update to aten 0.6.0 which includes a double notification fix and
use of monotonic time instead of system time.
2024-01-03 14:21:54 +00:00
Michael Klishin 54ae40692a
Merge pull request #9656 from cloudamqp/prometheus_escape_label
Escape prometheus core metric label values
2023-12-20 19:28:53 -05:00
GitHub 4bcd46376e Adopt otp 25.3.2.8 2023-12-19 03:07:08 +00:00
Michael Klishin c9f69b43b2 Bump Osiris to 1.7.2 2023-12-12 17:14:50 -05:00
Michael Davis dea4769fed
Update khepri to 0.10.1
Khepri 0.10.0 replaces `khepri:wait_for_async_ret/2,3` with
`khepri:handle_async_ret/1,2`. This will be used by the child commit:
the child commit will use Khepri's async interface and handle async
write events from Ra.

Changes to the bazel build files were done automatically with gazelle:

    bazel run gazelle -- update-repos --verbose \
        --build_files_dir=bazel github.com/rabbitmq/khepri@v0.10.1
2023-12-12 12:01:59 -05:00
Karl Nilsson 81a57feb6b Osiris 1.7.1 2023-12-04 10:09:28 +00:00
Péter Gömöri 8c787609de Bump prometheus dependency to 4.11.0 2023-12-03 01:14:44 +01:00
Michael Klishin cb27208c06
Osiris 1.7.0 2023-12-01 16:57:23 -05:00
Michal Kuratczyk 7ef4bec607
Revert "Remove Elixir json package in one more place #9926 #9932"
This reverts commit 342af9ab96.
2023-11-17 13:08:19 +01:00
Michal Kuratczyk 5a51547d9e
Revert "Remove hex.pm json package #9926 #9932"
This reverts commit dfe4f6fd70.
2023-11-17 13:07:58 +01:00
Michael Klishin 342af9ab96
Remove Elixir json package in one more place #9926 #9932 2023-11-16 10:13:08 -05:00
Michael Klishin dfe4f6fd70
Remove hex.pm json package #9926 #9932 2023-11-16 10:12:45 -05:00
Michal Kuratczyk 0f0076a025
Run gazelle for updated deps 2023-11-10 16:47:39 +01:00
Michal Kuratczyk b2c01e3e8e
Remove dialyxir from bazel 2023-11-10 15:37:11 +01:00
Rin Kuryloski 49f2da63c2 Use rules_erlang 3.14.0
This version detects major version mismatches in transient
dependencies

In this case, it will notice if, for instance, ra and osiris ask for
different major versions of seshat
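
The kind of check described above can be illustrated with a short Python sketch. It is not the rules_erlang implementation; the function name `major_version_conflicts`, the tuple layout and the version numbers in the usage example are all assumptions made for illustration.

```python
from collections import defaultdict


def major_version_conflicts(requests):
    """Return dependencies that are requested with more than one major version.

    `requests` is an iterable of (requester, dependency, version) tuples.
    The result maps each conflicting dependency to a dict of
    {major_version: [requesters]}.
    """
    majors = defaultdict(dict)
    for requester, dependency, version in requests:
        # Only the leading component of "X.Y.Z" matters for this check.
        major = int(version.split(".")[0])
        majors[dependency].setdefault(major, []).append(requester)
    # Keep only the dependencies seen with at least two distinct majors.
    return {dep: by_major for dep, by_major in majors.items() if len(by_major) > 1}


# Hypothetical versions, chosen only to trigger a mismatch on seshat.
requests = [
    ("ra", "seshat", "0.6.1"),
    ("osiris", "seshat", "1.0.0"),
    ("ra", "aten", "0.6.0"),
]
conflicts = major_version_conflicts(requests)
```

Here `conflicts` contains a single entry for `seshat`, listing which requester asked for which major version.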
2023-11-02 16:50:10 +01:00
Jean-Sébastien Pédron af0bce1764
Upgrade khepri_mnesia_migration from 0.1.1 to 0.2.1 2023-10-27 16:08:43 +02:00
Rin Kuryloski 231465f35e Use rules_erlang 3.13.1
This version of rules_erlang adds coverage support

Bazel has sort of standardized on lcov for coverage, so that is what
we use.

Example:
1. `bazel coverage //deps/rabbit:eunit -t-`
2. `genhtml --output genhtml "$(bazel info
output_path)/_coverage/_coverage_report.dat"`
3. `open genhtml/index.html`

Multiple tests can be run with results aggregated, e.g. `bazel
coverage //deps/rabbit:all -t-`

Running coverage with RBE has a lot of caveats,
https://bazel.build/configure/coverage#remote-execution, so the above
commands won't work as is with RBE.
2023-10-17 11:22:36 -04:00
Rin Kuryloski fe07f4c930 Always reference seshat 0.6.1
Which is code-wise identical to 0.6.0
2023-10-16 16:17:44 +02:00
GitHub 117cffb0f6 Adopt elixir 1.15.7 2023-10-15 03:05:00 +00:00
Karl Nilsson 2494dbf678 Osiris v1.6.9
This contains a fix for a situation where a replica may not discover
the current commit offset until the next entry is written to the
stream.

Should help with a frequent flake in rabbit_stream_queue_SUITE:add_replicas
2023-10-13 12:01:57 +01:00
Michael Klishin 203bdf45ae
Merge pull request #9692 from rabbitmq/bump-otp-25.3
Adopt otp 25.3.2.7
2023-10-12 23:20:37 -04:00
GitHub b250d36388 Adopt otp 26.1.2 2023-10-13 03:07:34 +00:00
GitHub b79314e371 Adopt otp 25.3.2.7 2023-10-13 03:07:01 +00:00
Karl Nilsson 89be37f403 Osiris v1.6.8
This osiris release contains optimisations and bug fixes:

* Various index scanning operations have been substantially improved
  resulting in up to 10x improvement for certain cases.
* A bug which meant stream replication listener would fail if the
  TLS version was limited to `tlsv1.3` has been fixed.
* A bug where the log may be incorrectly truncated when filters are
  used has been fixed.
* Startup handles one more case where a file has been corrupted after
  an unclean shutdown.
2023-10-11 11:19:55 +01:00
GitHub 55f534cd66 Adopt otp 26.1.1 2023-09-30 03:13:16 +00:00
Rin Kuryloski 0bbb188aa9
Partially revert commit 3253fe433b
Khepri needs ra, and unless khepri is a native bazel dep, we still
need to declare ra in the classic fashion
2023-09-29 16:00:11 +02:00
Diana Parra Corbacho 5f0981c5a3
Allow using the Khepri database to store metadata instead of Mnesia
[Why]

Mnesia is a very powerful and convenient tool for Erlang applications:
it is a persistent disc-based database, it handles replication across
multiple Erlang nodes and it is available out-of-the-box from the
Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
metadata:

* virtual hosts' properties
* internal users
* queue, exchange and binding declarations (not queues data)
* runtime parameters and policies
* ...

Unfortunately Mnesia makes it difficult to handle network partitions and,
as a consequence, the merge conflicts between Erlang nodes once the
network partition is resolved. RabbitMQ provides several partition
handling strategies but they are not bullet-proof. Users still hit
situations where it is a pain to repair a cluster following a network
partition.

[How]

@kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
already uses successfully to implement quorum queues and streams for
instance. Those queues do not suffer from network partitions.

We created Khepri [2], a new persistent and replicated database engine
based on Ra and we want to use it in place of Mnesia in RabbitMQ to
solve the problems with network partitions.

This patch integrates Khepri as an experimental feature. When enabled,
RabbitMQ will store all its metadata in Khepri instead of Mnesia.

This change comes with behavior changes. While Khepri remains disabled,
you should see no changes to the behavior of RabbitMQ. If there are
changes, it is a bug. After Khepri is enabled, there are significant
changes of behavior that you should be aware of.

Because it is based on the Raft consensus algorithm, when there is a
network partition, only the cluster members that are in the partition
with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes
can "make progress". In other words, only those nodes may write to the
Khepri database and read from the database and expect a consistent
result.

For instance in a cluster of 5 RabbitMQ nodes:
* If there are two partitions, one with 3 nodes, one with 2 nodes, only
  the group of 3 nodes will be able to write to the database.
* If there are three partitions, two with 2 nodes, one with 1 node, none
  of the groups can write to the database.
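
The majority rule and the two 5-node cases above can be checked with a few lines of arithmetic. This is a minimal Python sketch, not RabbitMQ or Ra code: `quorum_size`, `can_make_progress` and `writable_partitions` are hypothetical names, and the division in the formula is assumed to be integer division.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    # Assumption: "÷ 2" in the text means integer (floor) division.
    return cluster_size // 2 + 1


def can_make_progress(cluster_size: int, partition_size: int) -> bool:
    """True if a partition holds enough nodes to keep writing."""
    return partition_size >= quorum_size(cluster_size)


def writable_partitions(cluster_size: int, partition_sizes: list) -> list:
    """Sizes of the partitions that still have a majority."""
    return [size for size in partition_sizes
            if can_make_progress(cluster_size, size)]


# The two scenarios from the text, for a cluster of 5 nodes.
split_3_2 = writable_partitions(5, [3, 2])
split_2_2_1 = writable_partitions(5, [2, 2, 1])
```

With 5 nodes the quorum is 3, so only the 3-node group qualifies in the first split and no group qualifies in the second.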

Because the Khepri database will be used for all kinds of metadata, it
means that RabbitMQ nodes that can't write to the database will be
unable to perform some operations. A list of operations and what to
expect is documented in the associated pull request and the RabbitMQ
website.

This requirement from Raft also affects the startup of RabbitMQ nodes in
a cluster. Indeed, at least a quorum number of nodes must be started at
once to allow nodes to become ready.

To enable Khepri, you need to enable the `khepri_db` feature flag:

    rabbitmqctl enable_feature_flag khepri_db

When the `khepri_db` feature flag is enabled, the migration code
performs the following two tasks:
1. It synchronizes the Khepri cluster membership from the Mnesia
   cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
   the `khepri_mnesia_migration` application [3].
2. It copies data from relevant Mnesia tables to Khepri, doing some
   conversion if necessary on the way. Again, it uses
   `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
   it.

This can be performed on a running standalone RabbitMQ node or cluster.
Data will be migrated from Mnesia to Khepri without any service
interruption. Note that during the migration, the performance may
decrease and the memory footprint may go up.

Because this feature flag is considered experimental, it is not enabled
by default even on a brand new RabbitMQ deployment.

More about the implementation details below:

In the past months, all accesses to Mnesia were isolated in a collection
of `rabbit_db*` modules. This is where the integration of Khepri mostly
takes place: we use a function called `rabbit_khepri:handle_fallback/1`
which selects the database and performs the query or the transaction.
Here is an example from `rabbit_db_vhost`:

* Up until RabbitMQ 3.12.x:

        get(VHostName) when is_binary(VHostName) ->
            get_in_mnesia(VHostName).

* Starting with RabbitMQ 3.13.0:

        get(VHostName) when is_binary(VHostName) ->
            rabbit_khepri:handle_fallback(
              #{mnesia => fun() -> get_in_mnesia(VHostName) end,
                khepri => fun() -> get_in_khepri(VHostName) end}).

This `rabbit_khepri:handle_fallback/1` function relies on two things:
1. the fact that the `khepri_db` feature flag is enabled, in which case
   it always executes the Khepri-based variant.
2. the ability or not to read and write to Mnesia tables otherwise.

Before the feature flag is enabled, or during the migration, the
function will try to execute the Mnesia-based variant. If it succeeds,
then it returns the result. If it fails because one or more Mnesia
tables can't be used, it restarts from scratch: it means the feature
flag is being enabled and depending on the outcome, either the
Mnesia-based variant will succeed (the feature flag couldn't be enabled)
or the feature flag will be marked as enabled and it will call the
Khepri-based variant. The meat of this function really lives in the
`khepri_mnesia_migration` application [3] and
`rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
about the feature flag.
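
The control flow described above can be modelled in a few lines. The following Python sketch is only an illustration of that description, not the Erlang code in `rabbit_khepri` or `khepri_mnesia_migration`: the exception class, the `is_khepri_enabled` callback and the `max_attempts` bound are assumptions introduced here.

```python
class MnesiaTableUnavailable(Exception):
    """Hypothetical signal that a Mnesia table can no longer be used."""


def handle_fallback(variants, is_khepri_enabled, max_attempts=10):
    """Pick the Khepri or Mnesia variant as described in the text.

    `variants` maps "mnesia" and "khepri" to zero-argument callables.
    `is_khepri_enabled` is a hypothetical callback returning the flag state.
    `max_attempts` is an assumption so that this sketch always terminates.
    """
    for _ in range(max_attempts):
        if is_khepri_enabled():
            # Flag enabled: always run the Khepri-based variant.
            return variants["khepri"]()
        try:
            # Flag not (yet) enabled: try the Mnesia-based variant.
            return variants["mnesia"]()
        except MnesiaTableUnavailable:
            # Tables are unusable, so a migration is presumably in
            # progress: restart from scratch and re-check the flag.
            continue
    raise RuntimeError("database variant could not be selected")


# Usage: the first Mnesia call fails while the flag is being enabled,
# so the retry ends up in the Khepri variant.
state = {"enabled": False}
calls = []


def get_in_mnesia():
    calls.append("mnesia")
    state["enabled"] = True  # simulate the migration finishing
    raise MnesiaTableUnavailable()


def get_in_khepri():
    calls.append("khepri")
    return "from_khepri"


result = handle_fallback(
    {"mnesia": get_in_mnesia, "khepri": get_in_khepri},
    lambda: state["enabled"],
)
```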

However, some calls to the database do not depend on the existence of
Mnesia tables, such as functions where we need to learn about the
members of a cluster. For those, we can't rely on exceptions from
Mnesia. Therefore, we just look at the state of the feature flag to
determine which database to use. There are two situations though:

* Sometimes, we need the feature flag state query to block because the
  function interested in it can't return a valid answer during the
  migration. Here is an example:

        case rabbit_khepri:is_enabled(RemoteNode) of
            true  -> can_join_using_khepri(RemoteNode);
            false -> can_join_using_mnesia(RemoteNode)
        end

* Sometimes, we need the feature flag state query to NOT block (for
  instance because it would cause a deadlock). Here is an example:

        case rabbit_khepri:get_feature_state() of
            enabled -> members_using_khepri();
            _       -> members_using_mnesia()
        end

Direct accesses to Mnesia still exist. They are limited to code that is
specific to Mnesia such as classic queue mirroring or network partition
handling strategies.

Now, to discover the Mnesia tables to migrate and how to migrate them,
we use an Erlang module attribute called
`rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
tables and an associated converter module. Here is an example in the
`rabbitmq_recent_history_exchange` plugin:

    -rabbit_mnesia_tables_to_khepri_db(
       [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).

The converter module (`rabbit_db_rh_exchange_m2k_converter` in this
example) is in fact a "sub" converter module called by
`rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
converter module to learn more about these modules.

[1] https://github.com/rabbitmq/ra
[2] https://github.com/rabbitmq/khepri
[3] https://github.com/rabbitmq/khepri_mnesia_migration

See #7206.

Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
2023-09-29 16:00:11 +02:00
GitHub d79c8a5051 Adopt otp 26.1.1 2023-09-29 03:06:09 +00:00
Karl Nilsson 882e0c1749 Ra 2.7.0
This includes a new ra:key_metrics/1 API that is more available
than parsing the output of sys:get_status/1.

The rabbit_quorum_queue:status/1 function has been ported to use
this API instead and now includes a few new fields.
2023-09-28 11:46:39 -04:00
Rin Kuryloski a482edcaa7
Merge pull request #8341 from rabbitmq/rin/use-3.12.0-rc.3-for-mixed-versions
Use 3.12.6 for the secondary umbrella
2023-09-22 11:38:11 +02:00
Rin Kuryloski c6672b9dea Use the latest 3.12.x for the secondary umbrella 2023-09-21 09:35:49 +02:00
GitHub 6b1635d323 Adopt elixir 1.15.6 2023-09-21 03:05:05 +00:00
Rin Kuryloski 75eb0621fc Use OTP 26.1 as OTP 26 in CI 2023-09-20 15:33:34 +02:00
Rin Kuryloski 90f9656ee7 Use seshat from our bzlmod registry
0.6.1 matches the version currently used in osiris
2023-09-20 12:07:44 +02:00
Michael Klishin ed88e3820d
Merge pull request #9418 from rabbitmq/osiris-1.6.6
Osiris 1.6.7
2023-09-15 10:07:54 -04:00
Karl Nilsson 739214bc3f osiris 1.6.7 2023-09-15 11:26:02 +01:00