Commit Graph

571 Commits

Author SHA1 Message Date
Arnaud Cogoluègnes d1aab61566
Prevent blocked groups in stream SAC with fine-grained status
A boolean status in the stream SAC coordinator is not enough to follow
the evolution of a consumer. For example a former active consumer that
is stepping down can go down before another consumer in the group is
activated, letting the coordinator expect an activation request that
will never arrive, leaving the group without any active consumer.

This commit introduces 3 statuses: active (formerly "true"), waiting
(formerly "false"), and deactivating. The coordinator will now know when
a deactivating consumer goes down and will trigger a rebalancing to
avoid a stuck group.

This commit also introduces a status related to the connectivity state
of a consumer. The possible values are: connected, disconnected, and
presumed_down. Consumers are connected by default; they can become
disconnected if the coordinator receives a down event with a
noconnection reason, meaning the node of the consumer has been
disconnected from the other nodes. Consumers can become connected again when
their node joins the other nodes again.

Disconnected consumers are still considered part of a group, as they are
expected to come back at some point. For example there is no rebalancing
in a group if the active consumer got disconnected.

The coordinator sets a timer when a disconnection occurs. When the timer
expires, corresponding disconnected consumers pass into the "presumed
down" state. At this point they are no longer considered part of their
respective group and are excluded from rebalancing decisions. They are expected
to get removed from the group by the appropriate down event of a
monitor.

So the consumer status is now a tuple, e.g. {connected, active}. Note
this is an implementation detail: only the stream SAC coordinator deals with
the status of stream SAC consumers.
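
The connectivity transitions described above could be sketched roughly as
follows. This is a minimal illustration, not the coordinator's actual code:
the module name, the function names, and the choice to leave the status
unchanged for events the text does not cover are all assumptions.

```erlang
-module(sac_status_sketch).
-export([on_noconnection/1, on_node_up/1, on_timeout/1, in_group/1]).

-type connectivity() :: connected | disconnected | presumed_down.
-type activity() :: active | waiting | deactivating.
-type status() :: {connectivity(), activity()}.

%% Down event with a noconnection reason: a connected consumer becomes
%% disconnected. Other statuses are left unchanged (assumption).
-spec on_noconnection(status()) -> status().
on_noconnection({connected, Activity}) -> {disconnected, Activity};
on_noconnection(Status) -> Status.

%% The consumer's node joins the other nodes again.
%% Other statuses are left unchanged (assumption).
-spec on_node_up(status()) -> status().
on_node_up({disconnected, Activity}) -> {connected, Activity};
on_node_up(Status) -> Status.

%% The disconnection timer expires: consumers that are still
%% disconnected pass into the presumed_down state.
-spec on_timeout(status()) -> status().
on_timeout({disconnected, Activity}) -> {presumed_down, Activity};
on_timeout(Status) -> Status.

%% Disconnected consumers still count as group members;
%% presumed_down consumers do not.
-spec in_group(status()) -> boolean().
in_group({presumed_down, _Activity}) -> false;
in_group({_Connectivity, _Activity}) -> true.
```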

2 new configuration entries are introduced:
 * rabbit.stream_sac_disconnected_timeout: this is the duration in ms of the
   disconnected-to-forgotten timer.
 * rabbit.stream_cmd_timeout: this is the timeout in ms to apply RA commands
   in the coordinator. It used to be a fixed value of 30 seconds. The
   default value is still the same. The setting has been introduced to
   make integration tests faster.
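
As an illustration, the two entries might be set in advanced.config as
shown below. This assumes they are read from the `rabbit` application
environment, as the `rabbit.` prefix suggests. The 60000 value is an
arbitrary example, not a documented default; 30000 matches the 30-second
default mentioned above.

```erlang
[
  {rabbit, [
    %% Time in ms before a disconnected consumer is presumed down.
    %% 60000 is an arbitrary illustrative value.
    {stream_sac_disconnected_timeout, 60000},
    %% Timeout in ms to apply Ra commands in the coordinator.
    %% 30000 corresponds to the former fixed value of 30 seconds.
    {stream_cmd_timeout, 30000}
  ]}
].
```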

Fixes #14070
2025-06-17 11:56:20 +02:00
Iliia Khaprov 9a2f702f4f
Log queue_utils ra's local_query rpc error 2025-06-13 14:51:41 +02:00
Iliia Khaprov 2f3bed5a5b
Move clustering_utils and queue_utils to ct_helpers 2025-06-13 14:43:23 +02:00
Diana Parra Corbacho e1d71b185c CT broker helpers: use rabbitmq-plugins from the given node with a secondary umbrella 2025-06-10 18:14:30 +02:00
David Ansari eccf9fee1e Run Quorum Queue property test on different OTP versions
## What?

PR #13971 added a property test that applies the same quorum queue Raft
command on different quorum queue members on different Erlang nodes
ensuring that the state machine ends up in exactly the same state.
The different Erlang nodes run the **same** Erlang/OTP version however.

This commit adds another property test where the different Erlang nodes
run **different** Erlang/OTP versions.

## Why?

This test allows spotting any non-determinism that could occur when
running quorum queue members in a mixed version cluster, where mixed
version means in our context different Erlang/OTP versions.

## How?

CI currently runs tests with Erlang 27.

This commit starts an Erlang 26 node in docker, specifically for the
`rabbit_fifo_prop_SUITE`.

Test case `two_nodes_different_otp_version` running Erlang 27 then transfers
a few Erlang modules (e.g. module `rabbit_fifo`) to the Erlang 26 node.
The test case then runs the Ra commands on its own node in Erlang 27 and
on the Erlang 26 node in Docker.

By default, this test case is skipped locally.
However, to run this test case locally, simply start an Erlang node as
follows:
```
erl -sname rabbit_fifo_prop@localhost
```
2025-06-06 17:08:28 +02:00
Iliia Khaprov 76a5531d8c
Move test_utils.erl from rabbit to rabbitmq_ct_helpers
fake_pid function is useful for other plugins
2025-05-27 18:06:59 +02:00
Iliia Khaprov 6b528e2caf
Replace ct:pal with ct:log in select places 2025-05-26 16:57:41 +02:00
Iliia Khaprov 8512a4459b
Hardcode rabbit_ct_hook and cth_styledout inside our ct_master_fork.
Helps clean up and color stdout for parallel targets
TODO: there are obvious races between the outputs of different nodes
In the next iteration I hope to implement cursor tracking for each node
2025-05-26 16:57:40 +02:00
Iliia Khaprov 8dcad8a4fd
Run rabbit_ct_hook for management, and mqtt 2025-05-26 16:57:40 +02:00
Jean-Sébastien Pédron 124467e620
rabbitmq_ct_helpers: Use node 2 as seed node, even with secondary umbrella
[Why]
This makes sure that nodes are clustered the same way, whether the tests
are executed with or without a secondary umbrella.
2025-04-08 18:47:27 +02:00
Arnaud Cogoluègnes b8244f70f4
Pull from socket up to 10 times in stream test utils (#13588)
To make sure to have enough data to complete a command.
2025-03-24 09:13:31 +01:00
Loïc Hoguin c5d150a7ef
Use Erlang.mk's native Elixir support for CLI
This avoids using Mix while compiling which simplifies
a number of things and lets us do further build improvements
later on.

Elixir is only enabled from within rabbitmq_cli currently.

Eunit is disabled since there are only Elixir tests.

Dialyzer will force-enable Elixir in order to process
Elixir-compiled beam files.

This commit also includes a few changes that are
related:

 * The Erlang distribution will now be started for parallel-ct

 * Many unnecessary PROJECT_MOD lines have been removed

 * `eunit_formatters` has been removed, it provides little value

 * The new `maybe_flock` Erlang.mk function is used where possible

 * Build test deps when testing rabbitmq_cli (Mix won't do it anymore)

 * rabbitmq_ct_helpers now uses the early plugins to have Dialyzer
   properly set up
2025-03-18 10:02:49 +01:00
Aitor Perez 07adc3e571
Remove Bazel files 2025-03-13 13:42:34 +00:00
Diana Parra Corbacho c0bd1f5202 Tests: add rabbitmq_diagnostics to test helpers 2025-02-20 15:58:04 +01:00
Jean-Sébastien Pédron 1f1a13521b
Skip peer discovery clustering tests if multiple Khepri machine versions
... are being used at the same time.

[Why]
Depending on which node clusters with which, a node running an older
version of the Khepri Ra machine may not be able to apply Ra commands
and could be stuck.

There is no real solution and this is clearly an unsupported scenario. An
old node won't always be able to join a newer cluster.

[How]
In the testsuites, we skip clustering tests if we detect that multiple
Khepri Ra machine versions are being used.
2025-02-12 17:13:24 +01:00
Jean-Sébastien Pédron c78aec7d48
rabbit_db: `force_reset` command is unsupported with Khepri
[Why]
The `force_reset` command simply removes local files on disk for the
local node.

In the case of Ra, this can't work because the rest of the cluster does
not know about the forced-reset node. Therefore the leader will continue
to send `append_entry` commands to the reset node.

If that forced-reset node restarts and receives these messages, it will
either join the cluster again (because it's on an older Raft term) or it
will hit an assertion and exit (because it's on the same Raft term).

[How]
Given we can't really support this scenario and it has little value, the
command will now return an error if someone attempts a `force_reset` with
a node running Khepri.

This also deprecates the command: once Mnesia support is removed, the
command will be removed at the same time. This is noted in the
rabbitmqctl.8 manpage.
2025-02-10 15:09:36 +01:00
David Ansari 579c58603e Support AMQP over WebSocket (OSS part) 2025-01-27 17:50:47 +01:00
Jean-Sébastien Pédron f549425615
rabbitmq_ct_broker_helpers: Use node 2 as the cluster seed node
[Why]
When running mixed-version tests, nodes 1/3/5/... are using the primary
umbrella, so usually the newest version. Nodes 2/4/6/... are using the
secondary umbrella, thus the old version.

When clustering, we used to use node 1 (running a new version) as the
seed node, meaning other nodes would join it.

This complicates things with feature flags because we have to make sure
that we start node 1 with new stable feature flags disabled to allow old
nodes to join.

This is also a problem with Khepri machine versions because the cluster
would start with the latest version, which old nodes might not have.

[How]
This patch changes the logic to use a node running the secondary
umbrella as the seed node instead. If there is no node running it, we
pick the first node as before.
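
The selection rule described above can be sketched like this. It is a
hypothetical illustration, not the helper's real code: the module name, the
function name, and the `{NodeName, Umbrella}` representation are
assumptions made for the example.

```erlang
-module(seed_node_sketch).
-export([pick_seed/1]).

%% Nodes is a non-empty list of {NodeName, Umbrella} tuples, where
%% Umbrella is `primary` or `secondary` (hypothetical representation).
-spec pick_seed([{atom(), primary | secondary}, ...]) -> atom().
pick_seed([{First, _Umbrella} | _] = Nodes) ->
    %% Prefer the first node running the secondary umbrella.
    case lists:keyfind(secondary, 2, Nodes) of
        {Name, secondary} -> Name;
        %% No node runs the secondary umbrella: pick the first node.
        false -> First
    end.
```

With `[{n1, primary}, {n2, secondary}, {n3, primary}]` this returns `n2`;
with only primary nodes it returns the first node in the list.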

V2: Revert part of "rabbitmq_ct_helpers: Fix how we set
    `$RABBITMQ_FEATURE_FLAGS` in tests" (commit
    57ed962ef6). These changes are no
    longer needed with the new logic.

V3: The check that verifies that the correct metadata store is used has
    a special case for nodes that use the secondary umbrella: if Khepri
    is supposed to be used but it's not, the feature flag is enabled.
    The reason is that the `v4.0.x` branch doesn't know about the `rel`
    configuration of `forced_feature_flags_on_init`. The nodes will
    have ignored this parameter and booted with the stable feature
    flags only.

    Many testsuites are adapted to the new clustering order. If they
    manage which node joins which node, either the order is changed in
    the testcases, or nodes are started with only required feature
    flags. For testsuites that rely on peer discovery where the order is
    unknown, nodes are started with only required feature flags.
2025-01-27 12:08:12 +01:00
Karl Nilsson d6865a648e Ct helpers: add "** killed" to the default log crash ignore list.
Exits with the reason "killed" only occur "naturally" in OTP
when a supervisor tries to shut a child down and it times out.

It is used for failure simulation in tests quite frequently however.
2025-01-23 13:26:41 +00:00
Arnaud Cogoluègnes b3b0940024
Fix wait-for-confirms sequence in stream test utils
And refine the implementation and its usage.
2025-01-21 17:38:58 +01:00
Jean-Sébastien Pédron 57ed962ef6
rabbitmq_ct_helpers: Fix how we set `$RABBITMQ_FEATURE_FLAGS` in tests
[Why]
In order to make `khepri_db` the default in the future, the handling of
`$RABBITMQ_FEATURE_FLAGS` had to be adapted to be able to *disable*
Khepri instead.

Unfortunately I broke the behavior with stable feature flags that are
only available in the primary umbrella. In this case, they were
automatically enabled and thus, clustering with an old umbrella that did
not have these feature flags failed with `incompatible_feature_flags`.

[How]
The solution is to always use an absolute list of feature flags, not the
new relative list.

V2: Allow a testsuite to skip the configuration of the metadata store.
    This is needed for the feature_flags_SUITE testsuite because it
    tests the default behavior and the configuration of the metadata
    store changes that behavior.

    While here, fix a ct log message where variables were swapped
    compared to the format string expectation.

V3: Enable `rabbitmq_4.0.0` feature flag in rabbit_mgmt_http_SUITE. This
    testsuite apparently requires it and if it's not enabled, it fails.
2025-01-15 20:43:41 +01:00
Michael Klishin 3f5b13d47f
Merge branch 'main' into mk-virtual-host-protection-from-accidental-deletion 2025-01-02 17:01:54 -05:00
Michael Klishin f62d46c286
Introduce a way to protect a virtual host from deletion
Accidental "fat finger" virtual host deletions
would be easier to avoid if there was a protection mechanism
that would apply equally even to CLI tools and external
applications that do not use confirmations for deletion
operations.

This introduces the following changes:

 * Virtual host metadata now supports a new key,
   'protected_from_deletion', which, when set,
   will be considered by key virtual host deletion function(s)
 * DELETE /api/vhosts/{name} was adapted to handle
   such blocked deletion attempts to respond with
   a 412 Precondition Failed status
 * 'rabbitmqctl list_vhosts' and 'rabbitmqctl delete_vhost'
   were adapted accordingly
 * DELETE /api/vhosts/{name}/deletion/protection
   is a new endpoint that can be used to remove
   the protective seal (the metadata key)
 * POST /api/vhosts/{name}/deletion/protection
   marks the virtual host as protected

In the case of the HTTP API, all operations on
virtual host metadata require administrative
privileges from the target user.

Other considerations:

 * When a virtual host does not exist, the behavior
  remains the same: the original, protection-unaware
  code path is used to preserve backwards compatibility

References #12772.
2025-01-02 16:50:51 -05:00
Michael Klishin 968eefa1bb
Bump (c) line year
There are no functional changes to this massive diff.
2025-01-01 17:54:10 -05:00
Jean-Sébastien Pédron debe2a118c
rabbitmq_ct_helpers: Change how Mnesia/Khepri is selected
[Why]
Once `khepri_db` is enabled by default, we need another way to disable it
to select Mnesia instead.

[How]
We use the new relative forced feature flags mechanism to indicate if we
want to explicitly enable or disable `khepri_db`. This way, we don't
touch other stable feature flags and only mess with Khepri.

However, this mechanism is not supported by RabbitMQ 4.0.x and older.
They will ignore the setting. Therefore, to make this work in
mixed-version testing, we set the `$RABBITMQ_FEATURE_FLAGS` variable for
the secondary umbrella. This part will go away once we test against
RabbitMQ 4.1.x as the secondary umbrella in the future.

At the end, we compare the effective metadata store to the expected one.
If they don't match, we skip the test.

While here, change `rjms_topic_selector_SUITE` to only choose Khepri
without specifying any feature flags.
2024-12-17 09:56:54 +01:00
Michael Klishin 1cae417dbf
Merge pull request #12821 from rabbitmq/rabbitmq-server-12776
Definition export: inject default queue type into virtual host metadata
2024-11-27 14:53:25 -05:00
Michael Klishin 090d11818f
HTTP API tests for injected default queue type 2024-11-26 18:00:37 -05:00
Diana Parra Corbacho ca0a450f3b Tests: SSL certificates
Parallel/sharding groups often fail to create certificates in CI.
Most likely it is related to the fact they use the same directory
for certificates. This commit uses shard/node name and unique id
for each SSL certificate.
2024-11-25 14:46:05 +01:00
GitHub 873d54a088 bazel run gazelle 2024-11-21 04:02:30 +00:00
Péter Gömöri 9bb7530d04
Move client-side stream protocol test helpers to a separate module
So that they can be used from multiple test suites.

(cherry picked from commit cf8a00c5db)
2024-11-19 19:13:59 -05:00
Michael Klishin 961e5c5a21
Undo the Bazel-related change from #12696
(cherry picked from commit a66c926985)
2024-11-09 17:47:06 -05:00
Michael Klishin 673826425a
Merge pull request #12696 from rabbitmq/mk-http-api-lower-body-length-limit-for-binding-creation
HTTP API: reduce body size limit for the endpoint used to bind queues/streams/exchanges
2024-11-09 17:13:03 -05:00
Michael Klishin 3dc5c463a4
Pass Dialyzer 2024-11-09 16:53:45 -05:00
Marcial Rosales e7cb2420a7 Verify non-zero DNS and email SAN 2024-10-29 16:41:20 +01:00
Loïc Hoguin f68fc8bb94
Make CI: Add mixed version testing
This is enabled on main and for pull requests. Bazel remains
used in previous branches.
2024-10-25 13:50:05 +02:00
Loïc Hoguin 4127f15676
Make CI: Bazel updates following ct_master work 2024-10-15 14:57:42 +02:00
Loïc Hoguin 8d411c7cda
Make CI: Print auto-skipped and failed test cases at the end
Of a ct_master run. This uses the builtin CT Master event
handler to gather the results.
2024-10-15 14:57:42 +02:00
Loïc Hoguin 655caf6d1a
Make CI: Have ct_master return the test results
Instead of having a CT hook just to know whether our tests failed.
2024-10-15 14:57:42 +02:00
Loïc Hoguin dddf917378
Make CI: Sort the results printout from ct_master
It makes more sense to sort by node name, than to have
the results in the order they finished.
2024-10-15 14:57:42 +02:00
Loïc Hoguin 6cdc32f558
Make CI: Make ct_master handle all testspec instructions 2024-10-15 14:57:42 +02:00
Loïc Hoguin 77ab5eddcb
Reduce the amount of printing to the terminal during tests 2024-10-15 14:57:42 +02:00
Loïc Hoguin 1897e02764
Make CI: Fix a small issue in master_runs.html 2024-10-15 14:57:42 +02:00
Loïc Hoguin ce7184598c
Make CI: Fix the master_runs.html css file paths
Needed to file:set_cwd like in normal CT.
2024-10-15 14:57:42 +02:00
Loïc Hoguin 37c2f9f675
Make CI: Don't refresh logs at the end of ct_master run
The ct_run:run_test function already takes care of the
node's logs. The ct_master_logs module takes care of
ct_master itself.
2024-10-15 14:57:41 +02:00
Loïc Hoguin 807c8f8a0b
Make CI: Add forks of ct_master_event and ct_master_logs 2024-10-15 14:57:41 +02:00
GitHub b9bb3014c0 bazel run gazelle 2024-10-08 04:02:25 +00:00
Loïc Hoguin 9645fb1275
Make parallel-ct properly detect test failures
The problem comes from `ct_master` which doesn't tell us
in the return value whether the tests succeeded. In order
to get that information a CT hook was created. But then
we run into another problem: despite its documentation
claiming otherwise, `ct_master` does not handle `ct_hooks`
instructions in the test spec.

So for the time being we fork `ct_master` into a new
`ct_master_fork` module and insert our hook directly
in the code. Later on we will submit patches to OTP.
2024-10-07 13:30:32 +02:00
Loïc Hoguin 7fe78a3af9
Better fix for a Dialyzer warning
The previous fix was leading to a badmatch in some cases,
including when trying to stop a node that was already stopped.
2024-09-30 14:25:01 +02:00
Loïc Hoguin f54e307aee
CT: No longer wait 3 minutes for node start
Reverting back to the default 1 minute. The problem with
3 minutes is that this is exceedingly long and when there
are problems the test time increases exponentially.
2024-09-30 12:35:44 +02:00
Loïc Hoguin 67eee5602c
Fix OTP-27 Dialyzer errors in rabbitmq_ct_helpers 2024-09-30 12:35:43 +02:00