Commit Graph

192 Commits

Author SHA1 Message Date
Péter Gömöri a877ecb1a3 Remove observer_cli from CLI escritps
observer_cli (and its dependency recon) was declared as a dependency
of rabbitmq_cli and as a consequence included in all escritps. However
the major part of observer_cli runs in the broker. The cli side only
used `observer_cli:rpc_start/2` which is just an rpc call into the
target node.

By using common rpc call we can remove observer_cli and recon from the
escripts. This can be considered a minor improvement based on the
philosophy "simpler is better".

As an additional benefit auto-completing functions of the recon app
now works in `rabbitmq-diagnostics remote_shell`.
(eg. `recon:proc_c<TAB>`)

(cherry picked from commit f9d3ed732b)
2025-03-12 19:19:31 +00:00
Michael Klishin e1d748131c By @Ayanda-D: new CLI health check that detects QQs without an elected reachable leader #13433 (#13487)
* Implement rabbitmq-queues leader_health_check command for quorum queues

(cherry picked from commit c26edbef33)

* Tests for rabbitmq-queues leader_health_check command

(cherry picked from commit 6cc03b0009)

* Ensure calling ParentPID in leader health check execution and
reuse and extend formatting API, with amqqueue:to_printable/2

(cherry picked from commit 76d66a1fd7)

* Extend core leader health check tests and update badrpc error handling in cli tests

(cherry picked from commit 857e2a73ca)

* Refactor leader_health_check command validators and ignore vhost arg

(cherry picked from commit 6cf9339e49)

* Update leader_health_check_command description and banner

(cherry picked from commit 96b8bced2d)

* Improve output formatting for healthy leaders and support
silent mode in rabbitmq-queues leader_health_check command

(cherry picked from commit 239a69b404)

* Support global flag to run leader health check for
all queues in all vhosts on local node

(cherry picked from commit 48ba3e161f)

* Return immediately for leader health checks on empty vhosts

(cherry picked from commit 7873737b35)

* Rename leader health check timeout refs

(cherry picked from commit b7dec89b87)

* Update banner message for global leader health check

(cherry picked from commit c7da4d5b24)

* QQ leader-health-check: check_process_limit_safety before spawning leader checks

(cherry picked from commit 17368454c5)

* Log leader health check result in broker logs (if any leaderless queues)

(cherry picked from commit 1084179a2c)

* Ensure check_passed result for leader health internal calls)

(cherry picked from commit 68739a6bd2)

* Extend CLI format output to process check_passed payload

(cherry picked from commit 5f5e9922bd)

* Format leader healthcheck result log and function exports

(cherry picked from commit ebffd7d8a4)

* Change leader_health_check command scope from queues to diagnostics

(cherry picked from commit 663fc9846e)

* Update (c) line year

(cherry picked from commit df82f12a70)

* Rename command to check_for_quorum_queues_without_an_elected_leader
and use across_all_vhosts option for global checks

(cherry picked from commit b2acbae28e)

* Use rabbit_db_queue for qq leader health check lookups
and introduce rabbit_db_queue:get_all_by_type_and_vhost/2.
Update leader health check timeout to 5s and process limit
threshold to 20% of node's process_limit.

(cherry picked from commit 7a8e166ff6)

* Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE

(cherry picked from commit 9bdb81fd79)

* Fix typo (cli test module)

(cherry picked from commit 615856853a)

* Small refactor - simpler final leader health check result return on function head match

(cherry picked from commit ea07938f3d)

* Clear dialyzer warning & fix type spec

(cherry picked from commit a45aa81bd2)

* Ignore result without strict match to avoid diayzer warning

(cherry picked from commit bb43c0b929)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits

(cherry picked from commit 845230b0b380a5f5bad4e571a759c10f5cc93b91)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting

(cherry picked from commit 235f43bad58d3a286faa0377b8778fcbe6f8705d)

* diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting

(cherry picked from commit db7376797581e4716e659fad85ef484cc6f0ea15)

* check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent

plus simplify function heads.

References #13433.

(cherry picked from commit 7b392315d5e597e5171a0c8196230d92b8ea8e92)

---------

Co-authored-by: Ayanda Dube <adube14@bloomberg.net>
(cherry picked from commit 09f1ab47b7)
2025-03-12 04:33:37 +00:00
Michael Klishin b8fccd96ed
'rabbitmq diagnostics check_if_metadata_store_is_initialized*': support --formatter=json 2025-01-28 19:26:54 -05:00
Michael Klishin 07ec1b4b50
New health check CLI commands 2025-01-28 16:44:37 -05:00
Michael Klishin 5c7d643ab3
CLI: drop an unused import 2025-01-07 18:50:02 -05:00
Michael Klishin bcb080179c
Improve documentation of certain health checks
in the HTTP API documentation reference
as well as one CLI command that apparently
needed a clarification.
2025-01-07 18:48:39 -05:00
Michael Klishin 3f5b13d47f
Merge branch 'main' into mk-virtual-host-protection-from-accidental-deletion 2025-01-02 17:01:54 -05:00
Michael Klishin c95a95f822
CLI: mix format 2025-01-02 17:00:46 -05:00
Michael Klishin 968eefa1bb
Bump (c) line year
There are no functional changes to this massive diff.
2025-01-01 17:54:10 -05:00
Michael Klishin d0b66b8c8f
check_certificate_expiration_command.ex: squash a compiler warning 2024-11-27 18:26:34 -05:00
Jean-Sébastien Pédron ddaea6facb
CLI: Finish `check_if_any_deprecated_features_are_used` implementation
[Why]
The previous implementation bypassed the deprecated features subsystem.
It only cared about classic mirrored queues and called some
queue-related code directly to determine if this specific feature was
used.

[How]
The command code is simplified by calling the deprecated subsystem to
list used deprecated features instead.

References #12619.
2024-11-11 10:13:25 +01:00
Michal Kuratczyk b48d4bf29a
Don't return khepri status when khperi_db is disabled
When khepri_db feature flag is disabled, Khepri servers
are running but are not clustered. In this case `rabbit_khepri:status/0`
shows that all nodes are leaders, which is confusing and scary
(even though actually harmless). Instead, we now just print that mnesia
is in use.
2024-08-22 11:01:46 +02:00
Michael Klishin f414c2d512
More missed license header updates #9969 2024-02-05 11:53:50 -05:00
Michael Davis c285651636
Resolve elixirc warnings
* Remove unused aliases/imports
* Remove or underscore unused bindings
* Fix variables that should be atoms (`unavailable` -> `:unavailable`)

Also, `Logger.warn/1` has been replaced by `Logger.warning/1`. It should
be safe to just replace the call with `Logger.warning/1` since it's
been in the standard library since Elixir 1.11.
2024-02-01 16:30:14 -05:00
Ariel Otilibili a9e488dbed
Fix for #8557, removed ANSI codes in JSON output
Add missing newline chars
2024-01-09 09:15:14 -08:00
Michael Klishin 1b642353ca
Update (c) according to [1]
1. https://investors.broadcom.com/news-releases/news-release-details/broadcom-and-vmware-intend-close-transaction-november-22-2023
2023-11-21 23:18:22 -05:00
Michael Klishin 229a9fb7bd
list_policies_that_match: correctly format 'not found' errors as JSON 2023-11-13 21:35:48 -05:00
Michael Klishin 1ff5f6077b
mix format 2023-11-13 21:18:27 -05:00
Michael Klishin fd0488516b
diagnostics list_policies_that_match: support JSON formatting 2023-11-13 20:46:06 -05:00
Michael Klishin c4db560e0e
CLI: mix format 2023-11-13 11:21:29 -05:00
Michal Kuratczyk 408c33ec49
Add list_policies_that_match command 2023-11-13 13:47:54 +01:00
Michael Klishin a88c22144d CLI: mix format 2023-11-06 23:03:41 -05:00
Michael Klishin 114f9b90c9 CLI: refactor 'diagnostics check_if_any_deprecated_features_are_used' 2023-11-06 22:50:35 -05:00
Diana Parra Corbacho 51783e9464 CLI: check if any deprecated features are used returns just the list of features 2023-11-06 17:45:11 +01:00
Michael Klishin cbe2756cbd CLI: tests and refactoring for 'diagnostics check_if_cluster_has_classic_queue_mirroring_policy' 2023-11-06 07:20:08 -05:00
Diana Parra Corbacho 6288e9aa2c CTL: check if any deprecated features are used command 2023-11-06 07:20:08 -05:00
Diana Parra Corbacho 13e88ced92 Use Diagnostics group 2023-11-06 07:20:08 -05:00
Diana Parra Corbacho a06698d43d CLI: command to check if cluster has a classic queue mirroring policy 2023-11-06 07:20:08 -05:00
Diana Parra Corbacho 5f0981c5a3
Allow to use Khepri database to store metadata instead of Mnesia
[Why]

Mnesia is a very powerful and convenient tool for Erlang applications:
it is a persistent disc-based database, it handles replication accross
multiple Erlang nodes and it is available out-of-the-box from the
Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
metadata:

* virtual hosts' properties
* intenal users
* queue, exchange and binding declarations (not queues data)
* runtime parameters and policies
* ...

Unfortunately Mnesia makes it difficult to handle network partition and,
as a consequence, the merge conflicts between Erlang nodes once the
network partition is resolved. RabbitMQ provides several partition
handling strategies but they are not bullet-proof. Users still hit
situations where it is a pain to repair a cluster following a network
partition.

[How]

@kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
already uses successfully to implement quorum queues and streams for
instance. Those queues do not suffer from network partitions.

We created Khepri [2], a new persistent and replicated database engine
based on Ra and we want to use it in place of Mnesia in RabbitMQ to
solve the problems with network partitions.

This patch integrates Khepri as an experimental feature. When enabled,
RabbitMQ will store all its metadata in Khepri instead of Mnesia.

This change comes with behavior changes. While Khepri remains disabled,
you should see no changes to the behavior of RabbitMQ. If there are
changes, it is a bug. After Khepri is enabled, there are significant
changes of behavior that you should be aware of.

Because it is based on the Raft consensus algorithm, when there is a
network partition, only the cluster members that are in the partition
with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes
can "make progress". In other words, only those nodes may write to the
Khepri database and read from the database and expect a consistent
result.

For instance in a cluster of 5 RabbitMQ nodes:
* If there are two partitions, one with 3 nodes, one with 2 nodes, only
  the group of 3 nodes will be able to write to the database.
* If there are three partitions, two with 2 nodes, one with 1 node, none
  of the group can write to the database.

Because the Khepri database will be used for all kind of metadata, it
means that RabbitMQ nodes that can't write to the database will be
unable to perform some operations. A list of operations and what to
expect is documented in the associated pull request and the RabbitMQ
website.

This requirement from Raft also affects the startup of RabbitMQ nodes in
a cluster. Indeed, at least a quorum number of nodes must be started at
once to allow nodes to become ready.

To enable Khepri, you need to enable the `khepri_db` feature flag:

    rabbitmqctl enable_feature_flag khepri_db

When the `khepri_db` feature flag is enabled, the migration code
performs the following two tasks:
1. It synchronizes the Khepri cluster membership from the Mnesia
   cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
   the `khepri_mnesia_migration` application [3].
2. It copies data from relevant Mnesia tables to Khepri, doing some
   conversion if necessary on the way. Again, it uses
   `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
   it.

This can be performed on a running standalone RabbitMQ node or cluster.
Data will be migrated from Mnesia to Khepri without any service
interruption. Note that during the migration, the performance may
decrease and the memory footprint may go up.

Because this feature flag is considered experimental, it is not enabled
by default even on a brand new RabbitMQ deployment.

More about the implementation details below:

In the past months, all accesses to Mnesia were isolated in a collection
of `rabbit_db*` modules. This is where the integration of Khepri mostly
takes place: we use a function called `rabbit_khepri:handle_fallback/1`
which selects the database and perform the query or the transaction.
Here is an example from `rabbit_db_vhost`:

* Up until RabbitMQ 3.12.x:

        get(VHostName) when is_binary(VHostName) ->
            get_in_mnesia(VHostName).

* Starting with RabbitMQ 3.13.0:

        get(VHostName) when is_binary(VHostName) ->
            rabbit_khepri:handle_fallback(
              #{mnesia => fun() -> get_in_mnesia(VHostName) end,
                khepri => fun() -> get_in_khepri(VHostName) end}).

This `rabbit_khepri:handle_fallback/1` function relies on two things:
1. the fact that the `khepri_db` feature flag is enabled, in which case
   it always executes the Khepri-based variant.
4. the ability or not to read and write to Mnesia tables otherwise.

Before the feature flag is enabled, or during the migration, the
function will try to execute the Mnesia-based variant. If it succeeds,
then it returns the result. If it fails because one or more Mnesia
tables can't be used, it restarts from scratch: it means the feature
flag is being enabled and depending on the outcome, either the
Mnesia-based variant will succeed (the feature flag couldn't be enabled)
or the feature flag will be marked as enabled and it will call the
Khepri-based variant. The meat of this function really lives in the
`khepri_mnesia_migration` application [3] and
`rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
about the feature flag.

However, some calls to the database do not depend on the existence of
Mnesia tables, such as functions where we need to learn about the
members of a cluster. For those, we can't rely on exceptions from
Mnesia. Therefore, we just look at the state of the feature flag to
determine which database to use. There are two situations though:

* Sometimes, we need the feature flag state query to block because the
  function interested in it can't return a valid answer during the
  migration. Here is an example:

        case rabbit_khepri:is_enabled(RemoteNode) of
            true  -> can_join_using_khepri(RemoteNode);
            false -> can_join_using_mnesia(RemoteNode)
        end

* Sometimes, we need the feature flag state query to NOT block (for
  instance because it would cause a deadlock). Here is an example:

        case rabbit_khepri:get_feature_state() of
            enabled -> members_using_khepri();
            _       -> members_using_mnesia()
        end

Direct accesses to Mnesia still exists. They are limited to code that is
specific to Mnesia such as classic queue mirroring or network partitions
handling strategies.

Now, to discover the Mnesia tables to migrate and how to migrate them,
we use an Erlang module attribute called
`rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
tables and an associated converter module. Here is an example in the
`rabbitmq_recent_history_exchange` plugin:

    -rabbit_mnesia_tables_to_khepri_db(
       [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).

The converter module  — `rabbit_db_rh_exchange_m2k_converter` in this
example  — is is fact a "sub" converter module called but
`rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
converter module to learn more about these modules.

[1] https://github.com/rabbitmq/ra
[2] https://github.com/rabbitmq/khepri
[3] https://github.com/rabbitmq/khepri_mnesia_migration

See #7206.

Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
2023-09-29 16:00:11 +02:00
Rin Kuryloski 42d29a5ca3 Run 'mix format' with elixir 1.15.2 2023-07-04 17:45:32 +02:00
Michael Klishin 57cd750a1a mix format 2023-04-24 21:16:41 +04:00
Michael Klishin 4e41246415 CLI: more OTP 26 compatibility for remote_shell
shell:start_interactive/1 takes a {node(), mfa()}, not
{node(), atom(), atom(), non_neg_integer()}
2023-04-24 20:54:13 +04:00
Rin Kuryloski 5d27b3c246 Fix a compiler warning in the cli 2023-03-21 15:24:27 +01:00
Loïc Hoguin 74ce1a4bcc
Fix rabbitmq-diagnostics remote_shell for 26.0 (#7441) 2023-02-27 15:27:32 +01:00
Rin Kuryloski bdb2046185
Add rabbitmq_cli dialyze to bazel (#7066)
* Add rabbitmq_cli dialyze to bazel

and fix a number of warnings

Because we stop mix from recompiling rabbit_common in bazel, many
unknown functions are reported, so this dialyzer analysis is somewhat
incomplete.

* Use erlang dialyzer for rabbitmq_cli rather than mix dialyzer

Since this resolves all of the rabbit functions, there are far fewer
unknown functions.

Requires yet to be released rules_erlang 3.9.2

* Temporarily use pre-release rules_erlang

So that checks can run on this PR without a release

* Fix additional dialyzer warnings in rabbitmq_cli

* rabbitmq_cli: mix format

* Additional fixes for ignored return values

* Revert "Temporarily use pre-release rules_erlang"

This reverts commit c16b5b6815.

* Use rules_erlang 3.9.2
2023-01-31 15:05:52 +01:00
Michael Klishin 9d69a4b325
CLI: mix format 2023-01-16 09:24:37 -08:00
Michael Klishin 226a7603e8
diagnostics check_port_connectivity: support --address
For scenarios where node hostname resolution on the invoking
host is not a suitable option and a particular address
should be tried instead.

Closes #6853.
2023-01-16 09:24:37 -08:00
Michael Klishin ec4f1dba7d
(c) year bump: 2022 => 2023 2023-01-01 23:17:36 -05:00
Ayanda Dube 4cbbaad2df mix format rabbitmq_cli 2022-10-02 18:54:11 +01:00
Michael Klishin c38a3d697d
Bump (c) year 2022-03-21 01:21:56 +04:00
Michael Klishin d7e1336741
rabbitmq-diagnostics remote_shell: squash a compiler warning 2021-03-11 04:12:23 +03:00
Michael Klishin 0d29cbb116
Cosmetics 2021-03-03 17:17:59 +03:00
Loïc Hoguin 5c829ff599
Add rabbitmq-diagnostics remote_shell 2021-03-03 11:28:54 +01:00
Michael Klishin 17526987c6
Bump (c) year 2021-02-14 00:54:01 +03:00
Michael Klishin c43db9d4d9 Auth attempt command naming, add JSON --formatter support 2020-10-14 23:32:16 +03:00
dcorbacho 49616a709a Add --by-source option to list auth attempts command 2020-09-24 12:29:52 +01:00
dcorbacho 156df91ae2 Rename auth attempt commands 2020-09-23 11:49:13 +01:00
dcorbacho bb005b1ce6 Add enable/disable and list auth attempt metrics 2020-08-28 15:22:19 +01:00
dcorbacho 679ca254f3 Switch to Mozilla Public License 2.0 (MPL 2.0) 2020-07-11 19:23:07 +01:00
Michael Klishin db662cfa59 Re-categorize some rabbitmq-diagnostics commands
For a couple of reasons:

 * The Observability, Monitoring and Health Checks group has grown too large
 * Some commands in it clearly have to do with exploring effective node
   or CLI tool configuration, not its dynamically changing operational state
2020-07-05 15:35:32 +07:00