Commit Graph

666 Commits

Author SHA1 Message Date
Diana Parra Corbacho 5f0981c5a3
Allow to use Khepri database to store metadata instead of Mnesia
[Why]

Mnesia is a very powerful and convenient tool for Erlang applications:
it is a persistent disc-based database, it handles replication across
multiple Erlang nodes and it is available out-of-the-box from the
Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
metadata:

* virtual hosts' properties
* internal users
* queue, exchange and binding declarations (not queue data)
* runtime parameters and policies
* ...

Unfortunately, Mnesia makes it difficult to handle network partitions and,
as a consequence, the merge conflicts between Erlang nodes once the
network partition is resolved. RabbitMQ provides several partition
handling strategies but they are not bullet-proof. Users still hit
situations where it is a pain to repair a cluster following a network
partition.

[How]

@kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
already uses successfully to implement quorum queues and streams for
instance. Those queues do not suffer from network partitions.

We created Khepri [2], a new persistent and replicated database engine
based on Ra and we want to use it in place of Mnesia in RabbitMQ to
solve the problems with network partitions.

This patch integrates Khepri as an experimental feature. When enabled,
RabbitMQ will store all its metadata in Khepri instead of Mnesia.

This change comes with behavior changes. While Khepri remains disabled,
you should see no changes to the behavior of RabbitMQ. If there are
changes, it is a bug. After Khepri is enabled, there are significant
changes of behavior that you should be aware of.

Because it is based on the Raft consensus algorithm, when there is a
network partition, only the cluster members that are in a partition
with at least `(Number of nodes in the cluster ÷ 2) + 1` nodes (the
division being rounded down) can "make progress". In other words, only
those nodes may write to the Khepri database and read from the database
and expect a consistent result.

For instance in a cluster of 5 RabbitMQ nodes:
* If there are two partitions, one with 3 nodes, one with 2 nodes, only
  the group of 3 nodes will be able to write to the database.
* If there are three partitions, two with 2 nodes, one with 1 node, none
  of the groups can write to the database.
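
The quorum rule above can be sketched as follows. This is only an
illustration of the arithmetic; the module and function names are
hypothetical and are not part of RabbitMQ, Khepri or Ra:

```erlang
-module(quorum_sketch).
-export([quorum/1, can_make_progress/2]).

%% Smallest number of members that forms a majority in a cluster of
%% ClusterSize nodes. `div` is integer division, so 5 div 2 + 1 = 3.
quorum(ClusterSize) when is_integer(ClusterSize), ClusterSize > 0 ->
    ClusterSize div 2 + 1.

%% True if a partition holding PartitionSize members still has a quorum
%% and is therefore allowed to write.
can_make_progress(PartitionSize, ClusterSize) ->
    PartitionSize >= quorum(ClusterSize).
```

With a cluster of 5 nodes, `can_make_progress(3, 5)` is `true` while
`can_make_progress(2, 5)` and `can_make_progress(1, 5)` are `false`,
which matches the two cases listed above.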

Because the Khepri database will be used for all kinds of metadata, it
means that RabbitMQ nodes that can't write to the database will be
unable to perform some operations. A list of operations and what to
expect is documented in the associated pull request and the RabbitMQ
website.

This requirement from Raft also affects the startup of RabbitMQ nodes in
a cluster. Indeed, at least a quorum number of nodes must be started at
once to allow nodes to become ready.

To enable Khepri, you need to enable the `khepri_db` feature flag:

    rabbitmqctl enable_feature_flag khepri_db

When the `khepri_db` feature flag is enabled, the migration code
performs the following two tasks:
1. It synchronizes the Khepri cluster membership from the Mnesia
   cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
   the `khepri_mnesia_migration` application [3].
2. It copies data from relevant Mnesia tables to Khepri, doing some
   conversion if necessary on the way. Again, it uses
   `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
   it.

This can be performed on a running standalone RabbitMQ node or cluster.
Data will be migrated from Mnesia to Khepri without any service
interruption. Note that during the migration, the performance may
decrease and the memory footprint may go up.

Because this feature flag is considered experimental, it is not enabled
by default even on a brand new RabbitMQ deployment.

More about the implementation details below:

In the past months, all accesses to Mnesia were isolated in a collection
of `rabbit_db*` modules. This is where the integration of Khepri mostly
takes place: we use a function called `rabbit_khepri:handle_fallback/1`
which selects the database and performs the query or the transaction.
Here is an example from `rabbit_db_vhost`:

* Up until RabbitMQ 3.12.x:

        get(VHostName) when is_binary(VHostName) ->
            get_in_mnesia(VHostName).

* Starting with RabbitMQ 3.13.0:

        get(VHostName) when is_binary(VHostName) ->
            rabbit_khepri:handle_fallback(
              #{mnesia => fun() -> get_in_mnesia(VHostName) end,
                khepri => fun() -> get_in_khepri(VHostName) end}).

This `rabbit_khepri:handle_fallback/1` function relies on two things:
1. the fact that the `khepri_db` feature flag is enabled, in which case
   it always executes the Khepri-based variant.
2. the ability or not to read and write to Mnesia tables otherwise.

Before the feature flag is enabled, or during the migration, the
function will try to execute the Mnesia-based variant. If it succeeds,
then it returns the result. If it fails because one or more Mnesia
tables can't be used, it restarts from scratch: it means the feature
flag is being enabled and depending on the outcome, either the
Mnesia-based variant will succeed (the feature flag couldn't be enabled)
or the feature flag will be marked as enabled and it will call the
Khepri-based variant. The meat of this function really lives in the
`khepri_mnesia_migration` application [3] and
`rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
about the feature flag.
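
The retry logic described above can be sketched roughly as follows. This
is NOT the code from `khepri_mnesia_migration`: the module name, the
`IsEnabled` argument, the `Retries` bound and the
`{mnesia_table_unavailable, Table}` exception are all assumptions made
for the sake of a self-contained example:

```erlang
-module(fallback_sketch).
-export([handle_fallback/3]).

%% Funs is a map with `mnesia` and `khepri` zero-arity funs, as in the
%% `rabbit_db_vhost` example. IsEnabled (hypothetical) is a zero-arity
%% fun reporting the feature flag state. Retries (hypothetical) bounds
%% the number of restarts so that the sketch always terminates.
handle_fallback(#{khepri := KhepriFun} = Funs, IsEnabled, Retries) ->
    case IsEnabled() of
        true ->
            KhepriFun();
        false ->
            try_mnesia(Funs, IsEnabled, Retries)
    end.

try_mnesia(#{mnesia := MnesiaFun} = Funs, IsEnabled, Retries) ->
    try
        MnesiaFun()
    catch
        %% Hypothetical signal that a Mnesia table can't be used because
        %% a migration is in progress: restart from scratch.
        throw:{mnesia_table_unavailable, _Table} when Retries > 0 ->
            handle_fallback(Funs, IsEnabled, Retries - 1)
    end.
```

If the Mnesia variant fails while the flag is still disabled, the
function starts over; once the flag is reported as enabled, the Khepri
variant is called instead.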

However, some calls to the database do not depend on the existence of
Mnesia tables, such as functions where we need to learn about the
members of a cluster. For those, we can't rely on exceptions from
Mnesia. Therefore, we just look at the state of the feature flag to
determine which database to use. There are two situations though:

* Sometimes, we need the feature flag state query to block because the
  function interested in it can't return a valid answer during the
  migration. Here is an example:

        case rabbit_khepri:is_enabled(RemoteNode) of
            true  -> can_join_using_khepri(RemoteNode);
            false -> can_join_using_mnesia(RemoteNode)
        end

* Sometimes, we need the feature flag state query to NOT block (for
  instance because it would cause a deadlock). Here is an example:

        case rabbit_khepri:get_feature_state() of
            enabled -> members_using_khepri();
            _       -> members_using_mnesia()
        end

Direct accesses to Mnesia still exist. They are limited to code that is
specific to Mnesia such as classic queue mirroring or network partition
handling strategies.

Now, to discover the Mnesia tables to migrate and how to migrate them,
we use an Erlang module attribute called
`rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
tables and an associated converter module. Here is an example in the
`rabbitmq_recent_history_exchange` plugin:

    -rabbit_mnesia_tables_to_khepri_db(
       [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).

The converter module, `rabbit_db_rh_exchange_m2k_converter` in this
example, is in fact a "sub" converter module called by
`rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
converter module to learn more about these modules.

[1] https://github.com/rabbitmq/ra
[2] https://github.com/rabbitmq/khepri
[3] https://github.com/rabbitmq/khepri_mnesia_migration

See #7206.

Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
2023-09-29 16:00:11 +02:00
Simon Unge 91948c6936 Improve consumer metric cleanup when a channel goes down 2023-09-08 19:09:45 +00:00
Luke Bakken d4d1f880b0 Update incorrect warning message 2023-08-01 05:56:43 -07:00
Luke Bakken 30673071ac Remove Powershell as a backup to `handle.exe`
If a user does not have `handle.exe` installed in the `PATH` of their
Windows system, a message will be logged once, and then the total
handles being used will be set to `0`.

Fixes #8700

Follow-up to:
* #6614
* #6613
* #8541
2023-08-01 05:56:43 -07:00
Rin Kuryloski ca1806dbcd
Check additional applications when comparing bazel and make results (#8209)
* Check additional applications when comparing bazel and make results

* Sync bazel/make for amqp_client

* Do not fail-fast in build system comparison

* promethus -> prometheus

* Regenerate BUILD.redbug

* When comparing build systems & .app files ignore empty 'registered'

It's listed as a required key in
https://www.erlang.org/doc/man/app.html, but the same docs state the
default is "[]". It seems to ignore it if it's empty.

* Copy bazel/BUILD.osiris from BUILD.bazel in the osiris repo

Normally it would be generated with `bazel run gazelle-update-repos --
-args osiris@1.5.1=github.com/rabbitmq/osiris@v1.5.1`, but in this
case we just want to match its compilation with erlang.mk with some
manual tweaks.

* Use elixir 1.15, otherwise mix format fails

* Sync bazel/make for rabbitmq_web_dispatch, rabbitmq_management_agent
2023-07-12 17:26:16 +02:00
Jean-Sébastien Pédron f3be0118c6
Mark management metrics collection as deprecated
[Why]
Management metrics collection will be removed in RabbitMQ 4.0. The
prometheus plugin provides a better and more scalable alternative.

[How]
The management metrics collection is marked as deprecated in the code
using the Deprecated features subsystem (based on feature flags). See
pull request #7390 for a description of that subsystem.

To test RabbitMQ behavior as if the feature was removed, the following
configuration setting can be used:
deprecated_features.permit.management_metrics_collection = false

Management metrics collection can be turned off at any time; there are
no conditions to do that.

Once management metrics collection is turned off, the management API
will not report any metrics and the UI will show empty graphs.

Note that given the marketing calendar, the deprecated feature will go
directly from "permitted by default" to "removed" in RabbitMQ 4.0. It
won't go through the gradual deprecation process.
2023-07-06 11:02:45 +02:00
Simon Unge 8b3ca4c972 See #8605. Add authentication support to prometheus. 2023-06-23 13:54:45 -07:00
Michael Klishin 55442aa914 Replace @rabbitmq.com addresses with rabbitmq-core@groups.vmware.com
Don't ask why we have to do it. Because reasons!
2023-06-20 15:40:13 +04:00
Iliia Khaprov 00b3a895f1 UI bits for consumer timeout 2023-05-22 11:59:30 +02:00
Rin Kuryloski a944439fba Replace globs in bazel with explicit lists of files
As this is preferred in rules_erlang 3.9.14
2023-04-25 17:29:12 +02:00
Rin Kuryloski 854d01d9a5 Restore the original -include_lib statements from before #6466
since this broke erlang_ls

requires rules_erlang 3.9.13
2023-04-20 12:40:45 +02:00
Rin Kuryloski 8de8f59d47 Use gazelle generated bazel files
Bazel build files are now maintained primarily with `bazel run
gazelle`. This will analyze and merge changes into the build files as
necessitated by certain code changes (e.g. the introduction of new
modules).

In some cases there are hints to gazelle in the build files, such as `#
gazelle:erlang...` or `# keep` comments. xref checks on plugins that
depend on the cli are a good example.
2023-04-17 18:13:18 +02:00
Rin Kuryloski 8a7eee6a86 Ignore warnings when building plt files for dependencies
As we don't generally care if a dependency has warnings, only the
target
2023-04-17 10:09:24 +02:00
Jean-Sébastien Pédron d65637190a
rabbit_nodes: Add list functions to clarify which nodes we are interested in
So far, we had the following functions to list nodes in a RabbitMQ
cluster:
* `rabbit_mnesia:cluster_nodes/1` to get members of the Mnesia cluster;
  the argument was used to select members (all members or only those
  running Mnesia and participating in the cluster)
* `rabbit_nodes:all/0` to get all members of the Mnesia cluster
* `rabbit_nodes:all_running/0` to get all members who currently run
  Mnesia

Basically:
* `rabbit_nodes:all/0` calls `rabbit_mnesia:cluster_nodes(all)`
* `rabbit_nodes:all_running/0` calls `rabbit_mnesia:cluster_nodes(running)`

We also have:
* `rabbit_node_monitor:alive_nodes/1` which filters the given list of
  nodes to only select those currently running Mnesia
* `rabbit_node_monitor:alive_rabbit_nodes/1` which filters the given
  list of nodes to only select those currently running RabbitMQ

Most of the code uses `rabbit_mnesia:cluster_nodes/1` or the
`rabbit_nodes:all*/0` functions. `rabbit_mnesia:cluster_nodes(running)`
or `rabbit_nodes:all_running/0` is often used as a close approximation
of "all cluster members running RabbitMQ". This list might be incorrect
in times where a node is joining the cluster or is being worked on
(i.e. Mnesia is running but not RabbitMQ).

With Khepri, there won't be the same possible approximation because we
will try to keep Khepri/Ra running even if RabbitMQ is stopped to
expand/shrink the cluster.

So in order to clarify what we want when we query a list of nodes, this
patch introduces the following functions:
* `rabbit_nodes:list_members/0` to get all cluster members, regardless
  of their state
* `rabbit_nodes:list_reachable/0` to get all cluster members we can
  reach using Erlang distribution, regardless of the state of RabbitMQ
* `rabbit_nodes:list_running/0` to get all cluster members who run
  RabbitMQ, regardless of the maintenance state
* `rabbit_nodes:list_serving/0` to get all cluster members who run
  RabbitMQ and are accepting clients

In addition to the list functions, there are the corresponding
`rabbit_nodes:is_*(Node)` checks and `rabbit_nodes:filter_*(Nodes)`
filtering functions.

The code is modified to use these new functions. One possible
significant change is that the new list functions will perform RPC calls
to query the nodes' state, unlike `rabbit_mnesia:cluster_nodes(running)`.
2023-02-13 12:58:40 +01:00
David Ansari db107500c5 Remove compatibility code for management agent feature flags
Remove compatibility code for feature flags
* drop_unroutable_metric
* empty_basic_get_metric

since they are required in 3.12.0.

See https://github.com/rabbitmq/rabbitmq-server/pull/7219
2023-02-09 09:53:46 +00:00
David Ansari 5045fce6de Require all feature flags introduced before 3.11.1
RabbitMQ 3.12 requires feature flag `feature_flags_v2` which got
introduced in 3.11.0 (see
https://github.com/rabbitmq/rabbitmq-server/pull/6810).

Therefore, we can mark all feature flags that got introduced in 3.11.0
or before 3.11.0 as required because users will have to upgrade to
3.11.x first, before upgrading to 3.12.x

The advantage of marking these feature flags as required is that we can
start deleting any compatibility code for these feature flags, similarly
as done in https://github.com/rabbitmq/rabbitmq-server/issues/5215

This list shows when a given feature flag was first introduced:

```
classic_mirrored_queue_version 3.11.0
stream_single_active_consumer 3.11.0
direct_exchange_routing_v2 3.11.0
listener_records_in_ets 3.11.0
tracking_records_in_ets 3.11.0

empty_basic_get_metric 3.8.10
drop_unroutable_metric 3.8.10
```

In this commit, we also force all required feature flags in Erlang
application `rabbit` to be enabled in mixed version cluster testing
and delete any tests that were about a feature flag starting as disabled.

Furthermore, this commit already deletes the callback (migration) functions
given they do not run anymore in 3.12.x.

All other clean up (i.e. branching depending on whether a feature flag
is enabled) will be done in separate commits.
2023-02-08 16:00:03 +00:00
Alexey Lebedeff c7da0da8b8 Cleanup dialyzer calls
- Use the same base .plt everywhere, so there is no need to list
standard apps everywhere
- Fix typespecs: some typos and the use of not-exported types
2023-02-06 17:05:30 +01:00
syuparn e3e6c97466 Fix empty channel detail format in management api
Signed-off-by: syuparn <s.hello.spagetti@gmail.com>
2023-01-26 07:20:53 +09:00
Rin Kuryloski b84e746ee9 Rework plt/dialyze for rabbitmqctl and plugins that depend on it
This allows us to stop ignoring undefined callback warnings

When mix compiles rabbitmqctl, it produces a 'consolidated' directory
alongside the 'ebin' dir. Some of the modules in consolidated are
intended to be used instead of those provided by elixir. We now handle
the conflicts properly in the bazel build.
2023-01-19 17:29:23 +01:00
Michael Klishin c87bffc1d0
Merge pull request #6938 from rabbitmq/dialyzer-warnings-rabbitmq_management
Fix all dialyzer warnings in rabbitmq_management
2023-01-19 07:53:04 -06:00
Alexey Lebedeff a10d63404a Fix all dialyzer warnings in rabbitmq_management_agent
Case is removed from `rabbit_mgmt_external_stats:get_used_fd/1`,
because `get_used_fd/2` always returns a number.
2023-01-19 14:10:38 +01:00
Alexey Lebedeff 1865284ab3 Fix all dialyzer warnings in rabbitmq_management 2023-01-19 12:47:34 +01:00
Rin Kuryloski 5ef8923462 Avoid the need to pass package name to rabbitmq_integration_suite 2023-01-18 15:25:27 +01:00
Rin Kuryloski a317b30807 Use improved assert_suites2 macro from rules_erlang 3.9.0 2023-01-18 15:07:06 +01:00
Michael Klishin ec4f1dba7d
(c) year bump: 2022 => 2023 2023-01-01 23:17:36 -05:00
Luke Bakken b067aa612e
Handle handle.exe output correctly 2022-12-10 14:34:23 -08:00
Luke Bakken 10fcc64b74
Use win32_cmd/2 to run handle.exe because it is great. 2022-12-10 14:20:02 -08:00
Luke Bakken f7b65abc15
Execute powershell directly 2022-12-10 13:19:45 -08:00
Luke Bakken 9eb993eb65
Use powershell as a backup to handle.exe
Fixes #6613
2022-12-10 13:19:45 -08:00
Jean-Sébastien Pédron 15d9cdea61
Call `rabbit:data_dir/0` instead of `rabbit_mnesia:dir/0`
This is a follow-up commit to the parent commit. To quote part of the
parent commit's message:

> Historically, this was the Mnesia directory. But semantically, this
> should be the reverse: RabbitMQ owns the data directory and Mnesia is
> configured to put its files there too.

Now all subsystems call `rabbit:data_dir/0`. They are not tied to Mnesia
anymore.
2022-11-30 14:41:32 +01:00
Luke Bakken 7fe159edef
Yolo-replace format strings
Replaces `~s` and `~p` with their unicode-friendly counterparts.

```
git ls-files *.erl | xargs sed -i.ORIG -e s/~s>/~ts/g -e s/~p>/~tp/g
```
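
As an illustration of why the `t` modifier matters, here is a minimal
sketch. The module name and the sample string are made up for this
example; only `io_lib:format/2` comes from Erlang/OTP:

```erlang
-module(format_sketch).
-export([demo/0]).

demo() ->
    %% A made-up vhost name whose code points are all above 255
    %% (Cyrillic letters).
    VHost = [1090, 1077, 1089, 1090],
    %% `~ts` accepts any Unicode code point.
    Formatted = lists:flatten(io_lib:format("vhost ~ts", [VHost])),
    %% `~s` only accepts characters up to 255 and raises an error
    %% (badarg in current OTP releases) for anything above that.
    PlainFails = try io_lib:format("vhost ~s", [VHost]) of
                     _ -> false
                 catch
                     error:_ -> true
                 end,
    {Formatted, PlainFails}.
```

`demo/0` returns the formatted charlist together with `true`, showing
that the `~s` variant fails on the same input.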
2022-10-10 10:32:03 +04:00
Luke Bakken 755ac7176b
Follow-up to #5486
Discovered by @dumbbell

Ensure externally read strings are saved as utf-8 encoded binaries. This
is necessary since `cmd.exe` on Windows uses ISO-8859-1 encoding and
directories can have latin1 characters, like `RabbitMQ Sérvér`.

The `é` is represented by decimal `233` in the ISO-8859-1 encoding. The
unicode code point is the same decimal value, `233`, so you will see
this in the charlist data. However, when encoded using utf-8, this
becomes the two-byte sequence `C3 A9` (hexadecimal).

When reading strings from env variables and configuration, they will be
unicode charlists, with each list item representing a unicode code
point. All of Erlang's string functions can handle strings in this form.
Once these strings are written to ETS or Mnesia, they will be converted
to utf-8 encoded binaries. Prior to these changes just
`list_to_binary/1` was used.
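
The difference between the two conversions can be seen with a small
sketch. The module name is hypothetical; `list_to_binary/1` and
`unicode:characters_to_binary/1` are the standard Erlang/OTP functions:

```erlang
-module(utf8_sketch).
-export([demo/0]).

demo() ->
    %% "Sérvér" as a list of Unicode code points; 233 is `é`.
    Name = [$S, 233, $r, $v, 233, $r],
    %% One byte per code point: `é` stays the single byte 233 (latin1).
    Latin1 = list_to_binary(Name),
    %% UTF-8 encoding: `é` becomes the two bytes 16#C3, 16#A9.
    Utf8 = unicode:characters_to_binary(Name),
    {Latin1, Utf8}.
```

`demo/0` returns `{<<83,233,114,118,233,114>>,
<<83,195,169,114,118,195,169,114>>}`: the first binary is 6 bytes long,
the second one 8 bytes.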

Fix xref error

re:replace requires an iodata, which is not a list of unicode code points

Correctly parse unicode vhost tags

Fix many format strings to account for utf8 input. Try again to fix unicode vhost tags

More format string fixes, try to get the CONFIG_FILE var correct

Be sure to use the `unicode` option for re:replace when necessary

More unicode format strings, add unicode option to re:split

More format strings updated

Change ~s to ~ts for vhost format strings

Change ~s to ~ts for more vhost format strings

Change ~s to ~ts for more vhost format strings

Add unicode format chars to disk monitor

Quote the directory on unix

Finally figure out the correct way to pass unicode to the port
2022-09-24 11:19:59 -07:00
Rin Kuryloski 575c5f9975 Remove all of the .travis.yml files
since we no longer use them
2022-08-16 09:46:31 +02:00
Jean-Sébastien Pédron 6e9ee4d0da
Remove test code which depended on the `quorum_queue` feature flags
These checks are now irrelevant as the feature flag is required.
2022-08-01 12:41:30 +02:00
Michael Klishin aa8b0dc778
Merge pull request #4985 from rabbitmq/rabbitmq-server-4981
Fix cluster links statistic
2022-06-10 13:50:03 +04:00
Philip Kuryloski a250a533a4 Remove elixir related -ignore_xref calls
As they are no longer necessary with xref2 and the erlang.mk updates
2022-06-09 23:18:40 +02:00
Philip Kuryloski 15a79466b1 Use the new xref2 macro from rules_erlang
That adopts the modern erlang.mk xref behaviour
2022-06-09 23:18:28 +02:00
Luke Bakken 86a509df80
Fix cluster links statistic
Use the `sys_dist` ets table to get distribution port information.

Fixes #4981

Get cluster links stats for TLS dist

Use code from prometheus.erl to get dist links info
2022-06-09 07:20:45 -07:00
Philip Kuryloski 327f075d57 Make rabbitmq-server work with rules_erlang 3
Also rework elixir dependency handling, so we no longer rely on mix to
fetch the rabbitmq_cli deps

Also:

- Specify ra version with a commit rather than a branch
- Fixup compilation options for erlang 23
- Add missing ra reference in MODULE.bazel
- Add missing flag in oci.yaml
- Reduce bazel rbe jobs to try to save memory
- Use bazel built erlang for erlang git master tests
- Use the same cache for all the workflows but windows
- Avoid using `mix local.hex --force` in elixir rules
  - Fetching seems blocked in CI, and this should reduce hex api usage in
    all builds, which is always nice
- Remove xref and dialyze tags since rules_erlang 3 includes them in
  the defaults
2022-06-08 14:04:53 +02:00
Luke Bakken b172bca19b
Add rabbit_consult module
This will be used to fix rabbitmq/osiris#78

If a RabbitMQ `advanced.config` file contains the following:

```
{customize_hostname_check, [
    {match_fun, public_key:pkix_verify_hostname_match_fun(https)}
]}
```

...`file:consult/1` will fail because it does not evaluate terms in the
file.

The code in `rabbit_consult` was copied from this OTP module:

https://github.com/erlang/otp/blob/master/lib/ssl/src/ssl_dist_sup.erl

...and then modified for our use.
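
To illustrate the general idea only (this is not the code of
`rabbit_consult`, and the module and function names are hypothetical),
a string can be evaluated instead of merely parsed as a term like this:

```erlang
-module(consult_sketch).
-export([parse_as_term/1, eval_string/1]).

%% Parse a dot-terminated string as a plain term. Function calls are
%% rejected, which is why such content can't be read as a literal term.
parse_as_term(String) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(String),
    erl_parse:parse_term(Tokens).

%% Parse the same string as expressions and evaluate them, so that
%% function calls are executed and their result is returned.
eval_string(String) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(String),
    {ok, Exprs} = erl_parse:parse_exprs(Tokens),
    {value, Value, _Bindings} =
        erl_eval:exprs(Exprs, erl_eval:new_bindings()),
    Value.
```

For instance, `parse_as_term("lists:seq(1, 3).")` returns an
`{error, _}` tuple, whereas `eval_string("lists:seq(1, 3).")` returns
`[1,2,3]`. Evaluating file content executes arbitrary code, so this
should only be applied to trusted configuration files.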

Add Bazel suite

Use the same license as Erlang/OTP, add link to source cc @dumbbell

Add test and ensure value returned matches file:consult/1

Add test data file

Ensure that Funs are converted to binaries before jsx:encode is called

Add a check that customize_hostname_check can be JSON encoded

Ensure that customize_hostname_check and match_fun are filtered out from listener data
2022-06-05 06:13:49 -07:00
Loïc Hoguin dc70cbf281
Update Erlang.mk and switch to new xref code 2022-05-31 13:51:12 +02:00
David Ansari 20677395cd Check queue and exchange existence with ets:member/2
This reduces memory usage and improves code readability.
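
A minimal sketch of the difference, using a throwaway table (the module,
table and key names are made up and do not reflect RabbitMQ's actual
tables):

```erlang
-module(ets_member_sketch).
-export([demo/0]).

demo() ->
    Tab = ets:new(queues_sketch, [set]),
    true = ets:insert(Tab, {<<"q1">>, some_large_record}),
    %% lookup/2 copies the whole stored tuple to the calling process
    %% just to test whether the result is empty.
    ViaLookup = ets:lookup(Tab, <<"q1">>) =/= [],
    %% member/2 only reports whether the key exists.
    ViaMember = ets:member(Tab, <<"q1">>),
    Missing = ets:member(Tab, <<"q2">>),
    true = ets:delete(Tab),
    {ViaLookup, ViaMember, Missing}.
```

`demo/0` returns `{true, true, false}`: both approaches agree on the
result, but `ets:member/2` avoids copying the record.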
2022-05-10 10:16:40 +00:00
Péter Gömöri 52cb5796a3 Remove leftover compiler option for get_stacktrace 2022-05-03 18:40:49 +02:00
Luke Bakken dba25f6462
Replace files with symlinks
This prevents duplicated and out-of-date instructions.
2022-04-15 06:04:29 -07:00
Philip Kuryloski 2dd9bde891 Bring over PROJECT_APP_EXTRA_KEYS values from make to bazel 2022-04-07 17:39:33 +02:00
Loïc Hoguin 499e0b9197
Remove the CQv1 disabled stats from management/Prometheus 2022-04-05 12:37:54 +02:00
Luke Bakken b3c35b4073
Fix return value of now_to_str
Prior to 331ff8e851 the print/2 function was used to format output which also converted to binary. This restores the conversion to binary.

Follow up to #4277 and #4278
2022-03-22 07:09:19 -07:00
Michael Klishin c38a3d697d
Bump (c) year 2022-03-21 01:21:56 +04:00
Luke Bakken 455c01fccf
Update some logging timestamps
Uses the same format as the OTP logger

Fixes #4276
2022-03-15 12:40:44 -07:00
Philip Kuryloski dabf053cf8 Additional dialyzer warning fixes
Currently loading of the rabbitmq_cli defined behaviors compiled with
Elixir does not work, so we ignore the callback definitions contained therein
2022-02-25 18:14:35 +01:00