Commit Graph

110 Commits

Author SHA1 Message Date
Michael Klishin 09f1ab47b7
By @Ayanda-D: new CLI health check that detects QQs without an elected reachable leader #13433 (#13487)
* Implement rabbitmq-queues leader_health_check command for quorum queues

(cherry picked from commit c26edbef33)

* Tests for rabbitmq-queues leader_health_check command

(cherry picked from commit 6cc03b0009)

* Ensure calling ParentPID in leader health check execution and
reuse and extend formatting API, with amqqueue:to_printable/2

(cherry picked from commit 76d66a1fd7)

* Extend core leader health check tests and update badrpc error handling in cli tests

(cherry picked from commit 857e2a73ca)

* Refactor leader_health_check command validators and ignore vhost arg

(cherry picked from commit 6cf9339e49)

* Update leader_health_check_command description and banner

(cherry picked from commit 96b8bced2d)

* Improve output formatting for healthy leaders and support
silent mode in rabbitmq-queues leader_health_check command

(cherry picked from commit 239a69b404)

* Support global flag to run leader health check for
all queues in all vhosts on local node

(cherry picked from commit 48ba3e161f)

* Return immediately for leader health checks on empty vhosts

(cherry picked from commit 7873737b35)

* Rename leader health check timeout refs

(cherry picked from commit b7dec89b87)

* Update banner message for global leader health check

(cherry picked from commit c7da4d5b24)

* QQ leader-health-check: check_process_limit_safety before spawning leader checks

(cherry picked from commit 17368454c5)

* Log leader health check result in broker logs (if any leaderless queues)

(cherry picked from commit 1084179a2c)

* Ensure check_passed result for leader health internal calls

(cherry picked from commit 68739a6bd2)

* Extend CLI format output to process check_passed payload

(cherry picked from commit 5f5e9922bd)

* Format leader healthcheck result log and function exports

(cherry picked from commit ebffd7d8a4)

* Change leader_health_check command scope from queues to diagnostics

(cherry picked from commit 663fc9846e)

* Update (c) line year

(cherry picked from commit df82f12a70)

* Rename command to check_for_quorum_queues_without_an_elected_leader
and use across_all_vhosts option for global checks

(cherry picked from commit b2acbae28e)

* Use rabbit_db_queue for qq leader health check lookups
and introduce rabbit_db_queue:get_all_by_type_and_vhost/2.
Update leader health check timeout to 5s and process limit
threshold to 20% of node's process_limit.

(cherry picked from commit 7a8e166ff6)

* Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE

(cherry picked from commit 9bdb81fd79)

* Fix typo (cli test module)

(cherry picked from commit 615856853a)

* Small refactor - simpler final leader health check result return on function head match

(cherry picked from commit ea07938f3d)

* Clear dialyzer warning & fix type spec

(cherry picked from commit a45aa81bd2)

* Ignore result without strict match to avoid dialyzer warning

(cherry picked from commit bb43c0b929)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits

(cherry picked from commit 845230b0b380a5f5bad4e571a759c10f5cc93b91)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting

(cherry picked from commit 235f43bad58d3a286faa0377b8778fcbe6f8705d)

* diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting

(cherry picked from commit db7376797581e4716e659fad85ef484cc6f0ea15)

* check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent

plus simplify function heads.

References #13433.

(cherry picked from commit 7b392315d5e597e5171a0c8196230d92b8ea8e92)

---------

Co-authored-by: Ayanda Dube <adube14@bloomberg.net>
2025-03-12 00:32:59 -04:00
Michael Klishin 07ec1b4b50
New health check CLI commands 2025-01-28 16:44:37 -05:00
Michael Klishin 3f5b13d47f
Merge branch 'main' into mk-virtual-host-protection-from-accidental-deletion 2025-01-02 17:01:54 -05:00
Michael Klishin c95a95f822
CLI: mix format 2025-01-02 17:00:46 -05:00
Michael Klishin 968eefa1bb
Bump (c) line year
There are no functional changes to this massive diff.
2025-01-01 17:54:10 -05:00
Michael Klishin 9b6ab77f87
CLI: test/diagnostics cosmetics #12894 2024-12-04 13:44:29 -05:00
Michael Davis c328922a2a
CLI: Fix match of discover peers command test
This previously emitted a warning because Elixir will rebind `this_node`
by default, so the `this_node` binding in the line above was unused.
(As opposed to Erlang which would treat this as a match - rejecting
the binding if `this_node` was not equal to the value being matched.)
The node needed to be adjusted as well - `node()` returned the ExUnit
runner's node while the command returned the remote node, which is
stored in the context under `opts.node`.
2024-12-04 11:07:46 -05:00
Michael Davis d58d874a0b
CLI: Resolve elixirc warnings 2024-12-04 11:07:34 -05:00
Jean-Sébastien Pédron 112ff3f3f5
rabbitmq_cli: Prepare tests to run against a node with Khepri enabled by default 2024-12-02 13:55:41 +01:00
Michael Klishin f414c2d512
More missed license header updates #9969 2024-02-05 11:53:50 -05:00
Jean-Sébastien Pédron 84cede17e1
rabbit_peer_discovery: Rewrite core logic
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact that `join_cluster` and
peer discovery use different code paths to create a cluster.

To add support for Khepri, the first version of this patch moved
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_mnesia` to `rabbit_db_cluster`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same node, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criteria (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraints is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.
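
The node ordering in property 2 and the constraints in property 3 can be sketched as follows. This is an illustrative Python sketch, not the Erlang code in `rabbit_peer_discovery`: the `DiscoveredNode` record, the `RetryDiscovery` exception, and the function and parameter names are all hypothetical, and start times are assumed to be comparable integers where a smaller value means an older node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscoveredNode:
    name: str          # node name, e.g. "rabbit@host-a"
    cluster_size: int  # size of the cluster this node is currently part of
    start_time: int    # assumed: smaller value means the node started earlier


class RetryDiscovery(Exception):
    """Raised when a constraint is not met and discovery should be restarted."""


def sort_discovered_nodes(nodes):
    # Bigger clusters first, then older nodes first, then alphabetical by name.
    return sorted(nodes, key=lambda n: (-n.cluster_size, n.start_time, n.name))


def select_node_to_join(nodes, this_node, backend_node_count,
                        target_cluster_size_hint=1):
    """Return the name of the node to join, or None if this node is first
    in the sorted list and should simply proceed with its boot process."""
    names = [n.name for n in sort_discovered_nodes(nodes)]
    # The effective hint is the maximum of the configured hint and the number
    # of nodes the backend returned (before any filtering).
    size_hint = max(target_cluster_size_hint, backend_node_count)
    if this_node not in names:
        raise RetryDiscovery("list does not contain the local node")
    if size_hint > 1 and len(names) < 2:
        raise RetryDiscovery("expected at least two nodes")
    selected = names[0]
    return None if selected == this_node else selected
```

Because every node applies the same deterministic ordering to the same inputs, they all pick the same first node regardless of the order in which they start.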

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.
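
The retry behaviour can be pictured with a small Python sketch. It is only an illustration: the `run_with_retries` helper and its parameters are hypothetical, the limit is assumed to count total attempts, and the interval is expressed in seconds here rather than in whatever unit the configuration key uses.

```python
import time


def run_with_retries(attempt, retry_limit=30, retry_interval=1.0,
                     sleep=time.sleep):
    """Call attempt() until it returns True or the limit is exhausted.

    `attempt` stands for one full pass of discovery, filtering/sorting,
    locking and joining; it returns True on success, False on failure.
    """
    for remaining in range(retry_limit, 0, -1):
        if attempt():
            return True
        if remaining > 1:
            # Wait before restarting the entire process.
            sleep(retry_interval)
    return False
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real delays.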

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
2023-12-07 15:51:54 +01:00
Michael Klishin 1b642353ca
Update (c) according to [1]
1. https://investors.broadcom.com/news-releases/news-release-details/broadcom-and-vmware-intend-close-transaction-november-22-2023
2023-11-21 23:18:22 -05:00
Michael Klishin 8a76e903a3
One more test renaming to follow CLI conventions 2023-11-13 20:46:31 -05:00
Michael Klishin 2ebc23ef23
Use a standard CLI test suite file naming convention 2023-11-13 19:51:58 -05:00
Michael Klishin c4db560e0e
CLI: mix format 2023-11-13 11:21:29 -05:00
Michal Kuratczyk 408c33ec49
Add list_policies_that_match command 2023-11-13 13:47:54 +01:00
Michael Klishin 114f9b90c9 CLI: refactor 'diagnostics check_if_any_deprecated_features_are_used' 2023-11-06 22:50:35 -05:00
Michael Klishin cbe2756cbd CLI: tests and refactoring for 'diagnostics check_if_cluster_has_classic_queue_mirroring_policy' 2023-11-06 07:20:08 -05:00
Rin Kuryloski 42d29a5ca3 Run 'mix format' with elixir 1.15.2 2023-07-04 17:45:32 +02:00
Michael Klishin 98c85f367f Bump (c) year 2023-07-04 00:21:40 +04:00
Michal Kuratczyk 699af2c8c3
Don't rely on implicit order in a test 2023-04-13 14:37:18 +02:00
Michael Klishin c3c4665970
Update tests 2023-01-16 09:24:37 -08:00
Jean-Sébastien Pédron 4b132daaba
Remove upgrade-specific log file
This category should be unused with the decommissioning of the old
upgrade subsystem (in favor of the feature flags subsystem). It means:
1. The upgrade log file will not be created by default anymore.
2. The `$RABBITMQ_UPGRADE_LOG` environment variable is now unsupported.

The configuration variables remain to avoid breaking an existing and
working configuration.
2022-10-06 21:28:50 +02:00
Ayanda Dube 4cbbaad2df mix format rabbitmq_cli 2022-10-02 18:54:11 +01:00
Jean-Sébastien Pédron cdcf602749
Switch from Lager to the new Erlang Logger API for logging
The configuration remains the same for the end-user. The only exception
is the log root directory: it is now set through the `log_root`
application env. variable in `rabbit`. People using the Cuttlefish-based
configuration file are not affected by this exception.

The main change is how the logging facility is configured. It now
happens in `rabbit_prelaunch_logging`. The `rabbit_lager` module is
removed.

The supported outputs remain the same: the console, text files, the
`amq.rabbitmq.log` exchange and syslog.

The message text format slightly changed: the timestamp is more precise
(now to the microsecond) and the level can be abbreviated to always be
4 characters long to align all messages and improve readability. Here is
an example:

    2021-03-03 10:22:30.377392+01:00 [dbug] <0.229.0> == Prelaunch DONE ==
    2021-03-03 10:22:30.377860+01:00 [info] <0.229.0>
    2021-03-03 10:22:30.377860+01:00 [info] <0.229.0>  Starting RabbitMQ 3.8.10+115.g071f3fb on Erlang 23.2.5
    2021-03-03 10:22:30.377860+01:00 [info] <0.229.0>  Licensed under the MPL 2.0. Website: https://rabbitmq.com

The example above also shows that multiline messages are supported and
each line is prepended with the same prefix (the timestamp, the level
and the Erlang process PID).
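
The prefixing rule can be illustrated with a short Python sketch. It is not the formatter used by RabbitMQ: the `format_log_lines` function is hypothetical, only the `dbug` and `info` abbreviations come from the example above, and truncating other level names to their first four characters is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Only "dbug" is taken from the example above; other levels fall back to
# their first four characters, which is an assumption of this sketch.
LEVEL_ABBREVIATIONS = {"debug": "dbug"}


def format_log_lines(timestamp, level, pid, message):
    """Return one formatted line per line of `message`, all sharing the
    same prefix (timestamp, abbreviated level, process PID)."""
    abbreviated = LEVEL_ABBREVIATIONS.get(level, level[:4])
    stamp = timestamp.isoformat(sep=" ", timespec="microseconds")
    prefix = f"{stamp} [{abbreviated}] {pid}"
    return [f"{prefix} {line}".rstrip() for line in message.split("\n")]


ts = datetime(2021, 3, 3, 10, 22, 30, 377860,
              tzinfo=timezone(timedelta(hours=1)))
for line in format_log_lines(ts, "info", "<0.229.0>", "\n Starting RabbitMQ"):
    print(line)
```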

JSON is also supported as a message format, now for any output.
Indeed, it is possible to use it with e.g. syslog or the exchange. Here
is an example of a JSON-formatted message sent to syslog:

    Mar  3 11:23:06 localhost rabbitmq-server[27908] <0.229.0> - {"time":"2021-03-03T11:23:06.998466+01:00","level":"notice","msg":"Logging: configured log handlers are now ACTIVE","meta":{"domain":"rabbitmq.prelaunch","file":"src/rabbit_prelaunch_logging.erl","gl":"<0.228.0>","line":311,"mfa":["rabbit_prelaunch_logging","configure_logger",1],"pid":"<0.229.0>"}}
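
As a rough illustration of the JSON payload shape shown above, here is a Python sketch. The `format_json_log` function is hypothetical; it only mirrors the top-level keys visible in the example (`time`, `level`, `msg`, `meta`) and makes no claim about how the real formatter builds or orders them.

```python
import json


def format_json_log(time, level, msg, meta):
    """Serialize one log event as a compact, single-line JSON object."""
    event = {"time": time, "level": level, "msg": msg, "meta": meta}
    return json.dumps(event, separators=(",", ":"))


print(format_json_log(
    "2021-03-03T11:23:06.998466+01:00",
    "notice",
    "Logging: configured log handlers are now ACTIVE",
    {"domain": "rabbitmq.prelaunch", "pid": "<0.229.0>"},
))
```

A single-line object is convenient for outputs such as syslog, where each event is expected to fit on one line.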

For quick testing, the values accepted by the `$RABBITMQ_LOGS`
environment variable were extended:
  * `-` still means stdout
  * `-stderr` means stderr
  * `syslog:` means syslog on localhost
  * `exchange:` means logging to `amq.rabbitmq.log`
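
The mapping above can be summarized with a small Python sketch. The `describe_log_output` function is hypothetical, and treating any other value as a file path is an assumption of this sketch rather than something stated in this commit.

```python
def describe_log_output(value):
    """Map a $RABBITMQ_LOGS value to the output it selects."""
    special_values = {
        "-": "stdout",
        "-stderr": "stderr",
        "syslog:": "syslog on localhost",
        "exchange:": "amq.rabbitmq.log exchange",
    }
    # Assumption: anything else is interpreted as a log file path.
    return special_values.get(value, f"file {value}")


for value in ["-", "-stderr", "syslog:", "exchange:"]:
    print(value, "->", describe_log_output(value))
```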

`$RABBITMQ_LOG` was also extended. It now accepts a `+json` modifier (in
addition to the existing `+color` one). With that modifier, messages are
formatted as JSON instead of plain text.

The `rabbitmqctl rotate_logs` command is deprecated. The reason is that
Logger does not expose a function to force log rotation. However, it
will detect when a file was rotated by an external tool.

From a developer point of view, the old `rabbit_log*` API remains
supported, though it is now deprecated. It is implemented as regular
modules: there is no `parse_transform` involved anymore.

In the code, it is recommended to use the new Logger macros. For
instance, `?LOG_INFO(Format, Args)`. If possible, messages should be
augmented with some metadata. For instance (note the map after the
message):

    ?LOG_NOTICE("Logging: switching to configured handler(s); following "
                "messages may not be visible in this log output",
                #{domain => ?RMQLOG_DOMAIN_PRELAUNCH}),

Domains in Erlang Logger parlance are the way to categorize messages.
Some predefined domains, matching previous categories, are currently
defined in `rabbit_common/include/logging.hrl` or headers in the
relevant plugins for plugin-specific categories.

At this point, very few messages have been converted from the old
`rabbit_log*` API to the new macros. It can be done gradually when
working on a particular module or logging.

The Erlang builtin console/file handler, `logger_std_h`, has been forked
because it lacks date-based file rotation. The configuration of
date-based rotation is identical to Lager. Once the dust has settled for
this feature, the goal is to submit it upstream for inclusion in Erlang.
The forked module is called `rabbit_logger_std_h` and is based on
`logger_std_h` in Erlang 23.0.
2021-03-11 15:17:36 +01:00
Loïc Hoguin 5c829ff599
Add rabbitmq-diagnostics remote_shell 2021-03-03 11:28:54 +01:00
Michael Klishin c43db9d4d9 Auth attempt command naming, add JSON --formatter support 2020-10-14 23:32:16 +03:00
dcorbacho 679ca254f3 Switch to Mozilla Public License 2.0 (MPL 2.0) 2020-07-11 19:23:07 +01:00
Michael Klishin db299967e0 Introduce 'rabbitmq-diagnostics erlang_cookie_sources'
to help troubleshoot authentication issues.

Inspired by an idea from @gerhard.
2020-07-05 03:19:50 +07:00
Michael Klishin f11384fe86 Introduce 'rabbitmq-diagnostics resolver_info'
To inspect effective inetrc [1] settings used
by a node or CLI tools.

1. https://erlang.org/doc/apps/erts/inet_cfg.html
2020-06-21 15:09:21 +03:00
Michael Klishin b17fda724b Introduce 'rabbitmq-diagnostics resolve_hostname'
Helps with troubleshooting hostname resolution behavior
on nodes and locally for CLI tools. This is obviously not meant
to be a replacement for existing tools such as dig, only
a way to quickly spot obvious irregularities, e.g. those
in environments that use custom Erlang inetrc files.

Per discussion @harshac.
2020-06-20 16:55:21 +03:00
Michael Klishin 3003b9e615 Introduce 'rabbitmq-diagnostics list_network_interfaces'
To make it easier to discover them without using eval and
obscure functions.

Part of rabbitmq/rabbitmq-cli#424
2020-06-05 17:16:10 +03:00
Michael Klishin 0ff2e3fb77 Explain 2020-05-16 19:12:31 +03:00
Michael Klishin a2ec22023f Handle variable case in this test 2020-05-16 00:39:07 +03:00
Michael Klishin 0537a9ca36 Don't depend on a single env variable in this test 2020-05-16 00:36:33 +03:00
Michael Klishin 947940ccd5 Introduce 'rabbitmq-diagnostics os_env'
It prints RabbitMQ-specific environment variables that
are set on the target node. Can be used to inspect env variable-based
configuration without access to the target host.
2020-05-06 23:19:04 +03:00
Michael Klishin a3d60d35c5 Be less generous in empty whitespace use in setup_all functions 2020-03-24 19:08:00 +03:00
Jean-Sébastien Pédron 0e15591bf5 Update copyright (year 2020) 2020-03-10 15:39:56 +01:00
Michael Klishin 5b7063d07d More sensible JSON formatting for some commands
Fall back to a JSON document if command returns a bitstring
(does not do any preformatting for JSON).

Per discussion with @lukebakken

Closes #394.
2020-01-18 02:06:04 +03:00
Michael Klishin 73776fbf04 (c) bump 2019-12-29 05:50:26 +03:00
Michael Klishin 01e950fd18 Make tests that mess with node or quorum state sequential
As most tests already are. It's highly unlikely that these
were meant to execute in parallel by design.
2019-12-11 14:51:10 +01:00
Michael Klishin f3a06eda0d Squash a warning 2019-09-30 23:19:56 +03:00
Jean-Sébastien Pédron 0cb95e7b55 log_tail_stream_command_test: Bump stream duration to 15 seconds
... from 5 seconds. Hopefully this will increase the chance of seeing
the messages logged by the testcase.
2019-09-24 11:48:47 +02:00
Michael Klishin 5a137480b3 Merge pull request #378 from rabbitmq/consume-events-command
Consume event command
2019-09-24 01:55:57 +03:00
Michael Klishin 99f1790ac3 Update test expectations 2019-09-24 01:52:31 +03:00
Michael Klishin 4c33ce0961 Move command_line_arguments to rabbitmq-diagnostics 2019-09-24 00:54:11 +03:00
dcorbacho 15d7eb2858 Diagnostics: test consume_event_stream_command
[#168224266]
2019-09-23 17:19:07 +01:00
Michael Klishin 5b1086156e diagnostics log_tail_stream: remove a fragile test
It makes a lot of assumptions about Lager's log flush
timing and can be tripped by the peak rate protection
mechanism. This test module has a high rate of false
positives on Concourse.

There is another test that asserts over a "folded" stream, so
code coverage is kept about the same.
2019-08-11 13:20:19 +10:00
Michael Klishin 462b480f16 Same as d3c01b3a1f1a65d1d935c3e6e0441388da44ba57 in more places
(cherry picked from commit 68c8d204c08eb9956925e0fb71608a0737f3e771)
2019-07-06 20:31:43 +03:00
Michael Klishin 535f00e08f Let Lager's log message rate lapse before logging in these tests
Otherwise some log messages we assert on might be dropped.

(cherry picked from commit d3c01b3a1f1a65d1d935c3e6e0441388da44ba57)
2019-07-06 18:47:11 +03:00