Commit Graph

1149 Commits

Author SHA1 Message Date
Loïc Hoguin cc8892df3b
CQv2 prop: Tweak command weights 2022-02-09 14:24:39 +01:00
Loïc Hoguin ae51c6fbaf
CQv2 prop: Refactor duplicate code
Now that the test suite is solid some of the duplicated code
can be merged together.
2022-02-09 12:46:02 +01:00
David Ansari d6a81a0bbc Add lqueue:get/2
as a less-garbage alternative to lqueue:peek/1
2022-02-04 17:43:16 +01:00
David Ansari 6b9e501516 Add lqueue:get/1 and lqueue:get_r/1 2022-02-04 17:43:03 +01:00
David Ansari 676d0dfd85 Allocate 2 bytes less per queue operation
by changing the lqueue state.

This results in less memory usage and hence less garbage collection
when queue functions are called many thousands of times per second.

The subset of functions lqueue supports is copied from
OTP's queue module and extended to update the length.

lqueue's foldr/3 is deleted since there is no usage.
lqueue's foldl/3 is renamed to fold/3 to match OTP's queue naming.

lqueue accepts both old and new state, but returns only new state.
2022-02-04 17:42:42 +01:00
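The lqueue commits above describe a smaller state and the new get functions. A minimal sketch of the idea (a hypothetical, simplified module, not RabbitMQ's actual lqueue): the state is a plain {Length, Queue} pair so len/1 is O(1), and get/2 returns a default on an empty queue instead of allocating a {value, V} wrapper the way peek/1 does.

```erlang
%% Hypothetical sketch of the lqueue idea: a {Length, Queue} pair.
-module(lqueue_sketch).
%% get/1 shadows the auto-imported BIF erlang:get/1.
-compile({no_auto_import, [get/1]}).
-export([new/0, in/2, out/1, len/1, get/1, get/2]).

new() -> {0, queue:new()}.

%% Every operation keeps the length in sync with the queue.
in(Item, {Len, Q}) -> {Len + 1, queue:in(Item, Q)}.

out({0, _} = LQ) -> {empty, LQ};
out({Len, Q}) ->
    {{value, V}, Q2} = queue:out(Q),
    {{value, V}, {Len - 1, Q2}}.

%% O(1), no traversal of the queue.
len({Len, _}) -> Len.

%% Like queue:get/1, crashes on an empty queue.
get({_, Q}) -> queue:get(Q).

%% Returns Default on an empty queue, producing no garbage,
%% unlike peek/1 which wraps the value in a {value, V} tuple.
get({0, _}, Default) -> Default;
get({_, Q}, _Default) -> queue:get(Q).
```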
Loïc Hoguin 4a7cce831c
CQv2 prop: Clean restarts don't drop transient messages 2022-02-04 15:22:25 +01:00
Loïc Hoguin 9e59c6a698
CQv2 prop: Properly handle messages sent before enabling confirms 2022-02-04 14:34:31 +01:00
Loïc Hoguin c6092cacaa
CQv2 prop: Keep track of confirms for more accuracy
We now know that we will not be expecting transient messages
after clean restarts, and we can preserve the order of
persistent messages that were confirmed.
2022-02-04 10:45:01 +01:00
Loïc Hoguin e042660393
CQv2 prop: Add commands to discard one/many messages
Uses basic.reject with requeue=false.
2022-02-03 12:28:32 +01:00
Loïc Hoguin b77bc8e67a
CQv2 prop: Be more specific: reject is really requeue 2022-02-03 11:27:47 +01:00
Loïc Hoguin 36b2466ef7
CQv2 prop: Use a single limiter pid per test case
Instead of one per rabbit_amqqueue:basic_get.
2022-02-03 10:42:56 +01:00
Loïc Hoguin fd8380ebdc
CQv2 prop: Don't make cmd_publish_msg use confirms
The test suite was not checking them anyway, and channels are
a better fit for this kind of test.
2022-02-03 10:21:40 +01:00
Loïc Hoguin c3ded33872
CQv2 prop: Ensure we can fill up entire segments
This is done by changing cmd_channel_publish_many to send up
to 512 messages and reducing the segment_entry_count of both
v1 and v2 to 512.
2022-02-03 09:01:23 +01:00
Loïc Hoguin 6916c1bb7e
CQv2 prop: Send transient messages as well 2022-02-02 16:40:57 +01:00
Loïc Hoguin e93e0a4c57
CQv2 prop: Only use the "many" commands for 2+ messages 2022-02-02 16:24:09 +01:00
Loïc Hoguin 02d0cbd1de
CQv2 property suite: Rename to classic_queue_prop_SUITE 2022-02-02 15:11:06 +01:00
Loïc Hoguin a76e3294fd
CQv2 property suite: Add acks/rejects out of order
Also adds a cmd_channel_publish_many command to make the
scenarios more likely.
2022-02-02 14:56:01 +01:00
Loïc Hoguin 895c18e563
CQv2 property suite: Add command to disable v2 CRC32 check 2022-02-01 15:41:58 +01:00
Loïc Hoguin 52d577dd8e
CQv2 property suite: Enable cmd_purge again 2022-02-01 13:27:18 +01:00
Loïc Hoguin 6197f54fa5
CQv2 property suite: Fix vhost restart cancelling consumers 2022-02-01 13:26:38 +01:00
Loïc Hoguin 02bc9f9a32
CQv2 property suite: Enable cmd_restart_queue_dirty 2022-02-01 11:00:49 +01:00
Loïc Hoguin ab559b4714
CQv2 property suite: Register a default consumer
This gets around race-condition-related problems caused by
the client sending a basic.cancel while the server sends
a basic.cancel itself. It also solves a similar issue,
which I have not investigated thoroughly, where the server
delivers a message.

Details about the basic.cancel race condition issue can
be found in https://github.com/rabbitmq/rabbitmq-server/issues/4070
2022-02-01 10:54:21 +01:00
Loïc Hoguin af24b4b4c4
CQv2 property suite: Handle shutdown exits on teardown 2022-02-01 10:52:31 +01:00
Michael Klishin b0a60cc779
Don't skip definition import if target node count has not been reached
Otherwise delayed definition import [of queues and bindings] would
never run in a scenario where the target node count is greater than one.
2022-01-31 10:52:49 +03:00
Michael Klishin 3837b252b0
New tests for definitions.skip_if_unchanged 2022-01-30 00:32:53 +03:00
Loïc Hoguin 8c9f417959
Account for state differences in channel_operation_timeout_test_queue
The state has differences across versions and the test was no
longer compatible with previous versions in mixed version testing.
2022-01-28 14:10:56 +01:00
Loïc Hoguin a74f0a354e
CQ property suite: Fix cmd_basic_get_msg
I mistakenly only edited cmd_channel_basic_get before. This
brings the other command up to speed.
2022-01-25 12:47:34 +01:00
Loïc Hoguin 777e0fd6ea
Fix a consistency issue in the v1 index after dirty restarts
The issue was found via classic_queue_SUITE using a currently
disabled command that kills the queue. A function has also
been added to convert the set of commands PropEr returns as
a counterexample into Erlang code that can be put in the
`do_manual` function. Some tips have also been added.

The Erlang code that could reproduce the issue follows.
The issue never needed a loop on my machine for what it's
worth, but it might on other machines. Commands that were
not necessary were commented out. The timer:sleep(1) calls
were added as the issue did not seem to trigger without them.

do_manual(Config) ->
    St0 = #cq{name=prop_classic_queue_v1, mode=lazy, version=1,
              config=minimal_config(Config)},

    Res1 = cmd_setup_queue(St0),
    St1 = St0#cq{amq=Res1},

    do_manual_loop(St1).

do_manual_loop(St1) ->

%    Res2 = cmd_set_mode(St1, lazy),
%    true = postcondition(St1, {call, undefined, cmd_set_mode, [St1, lazy]}, Res2),
%    St2 = next_state(St1, Res2, {call, undefined, cmd_set_mode, [St1, lazy]}),
    St2 = St1,

timer:sleep(1),

%    Res3 = cmd_basic_get_msg(St2),
%    true = postcondition(St2, {call, undefined, cmd_basic_get_msg, [St2]}, Res3),
%    St3 = next_state(St2, Res3, {call, undefined, cmd_basic_get_msg, [St2]}),
    St3 = St2,

timer:sleep(1),

    Res4 = cmd_channel_open(St3),
    true = postcondition(St3, {call, undefined, cmd_channel_open, [St3]}, Res4),
    St4 = next_state(St3, Res4, {call, undefined, cmd_channel_open, [St3]}),

timer:sleep(1),

    Res5 = cmd_channel_publish(St4, Res4, 22, false, undefined),
    true = postcondition(St4, {call, undefined, cmd_channel_publish, [St4, Res4, 22, false, undefined]}, Res5),
    St5 = next_state(St4, Res5, {call, undefined, cmd_channel_publish, [St4, Res4, 22, false, undefined]}),

timer:sleep(1),

    Res6 = cmd_restart_vhost_clean(St5),
    true = postcondition(St5, {call, undefined, cmd_restart_vhost_clean, [St5]}, Res6),
    St6 = next_state(St5, Res6, {call, undefined, cmd_restart_vhost_clean, [St5]}),

timer:sleep(1),

%    Res7 = cmd_channel_publish(St6, Res4, 13, false, 71),
%    true = postcondition(St6, {call, undefined, cmd_channel_publish, [St6, Res4, 13, false, 71]}, Res7),
%    St7 = next_state(St6, Res7, {call, undefined, cmd_channel_publish, [St6, Res4, 13, false, 71]}),
    St7 = St6,

timer:sleep(1),

%    Res8 = cmd_channel_open(St7),
%    true = postcondition(St7, {call, undefined, cmd_channel_open, [St7]}, Res8),
%    St8 = next_state(St7, Res8, {call, undefined, cmd_channel_open, [St7]}),
    St8 = St7,

timer:sleep(1),

%    Res9 = cmd_channel_close(Res8),
%    true = postcondition(St8, {call, undefined, cmd_channel_close, [Res8]}, Res9),
%    St9 = next_state(St8, Res9, {call, undefined, cmd_channel_close, [Res8]}),
    St9 = St8,

timer:sleep(1),

%    Res10 = cmd_channel_close(Res4),
%    true = postcondition(St9, {call, undefined, cmd_channel_close, [Res4]}, Res10),
%    St10 = next_state(St9, Res10, {call, undefined, cmd_channel_close, [Res4]}),
    St10 = St9,

timer:sleep(1),

    Res11 = cmd_restart_queue_dirty(St10),
    true = postcondition(St10, {call, undefined, cmd_restart_queue_dirty, [St10]}, Res11),
    St11 = next_state(St10, Res11, {call, undefined, cmd_restart_queue_dirty, [St10]}),

timer:sleep(1),

    Res12 = cmd_restart_vhost_clean(St11),
    true = postcondition(St11, {call, undefined, cmd_restart_vhost_clean, [St11]}, Res12),
    St12 = next_state(St11, Res12, {call, undefined, cmd_restart_vhost_clean, [St11]}),

timer:sleep(1),

%    Res13 = cmd_set_version(St12, 1),
%    true = postcondition(St12, {call, undefined, cmd_set_version, [St12, 1]}, Res13),
%    St13 = next_state(St12, Res13, {call, undefined, cmd_set_version, [St12, 1]}),
    St13 = St12,

timer:sleep(1),

    Res14 = cmd_restart_vhost_clean(St13),
    true = postcondition(St13, {call, undefined, cmd_restart_vhost_clean, [St13]}, Res14),
    St14 = next_state(St13, Res14, {call, undefined, cmd_restart_vhost_clean, [St13]}),

timer:sleep(1),
logger:error("loop~n"),

    do_manual_loop(St14).
2022-01-25 11:23:24 +01:00
Luke Bakken c352525e0c
Rename `variable_queue_default_version` to `classic_queue_default_version` 2022-01-25 11:23:23 +01:00
Luke Bakken 5da7396bf3
Add rabbit.variable_queue_default_version to the cuttlefish schema 2022-01-25 11:23:23 +01:00
Loïc Hoguin 087045e319
CQ property suite: Temporarily disable dirty queue restart 2022-01-25 11:23:23 +01:00
Loïc Hoguin 61f2c972eb
CQ property suite: Fix race condition after closing consumers 2022-01-25 11:23:22 +01:00
Loïc Hoguin 08d78f0885
CQ property suite: Add queue crash
The added command resulted in the following additional changes:

- Allow more queue restarts for test purposes

- Fix crashes that may happen following a queue crash+restart

- Fix reading from the index after crash+restart where messages
  may be both in q1 and read from the index

- Fix a race condition when stopping a node while a queue
  crashes and restarts: make message store lookup use pg
2022-01-25 11:23:22 +01:00
Loïc Hoguin 673bdecea2
CQ property suite: Add vhost restart
This helps us test cases where the queue restarts cleanly.
Because it is not completely deterministic, there is some
clever handling of messages: we accept messages that were
acked by the client but whose acks the server did not process
before the restart, and messages published without confirms
that never made it to the server before it restarted.

There is potential to improve confirms handling for that
scenario but that is left as an exercise for a later time.
2022-01-25 11:23:22 +01:00
Loïc Hoguin 678ae3c83a
Cleanup the CQ property suite 2022-01-25 11:23:21 +01:00
Loïc Hoguin eb69bc04ff
Set backtrace_depth to 16 in CQ property suite
I had this locally for the longest time; it's time to commit it.
It helps figure out exactly what triggers crashes in the logs.
2022-01-25 11:23:21 +01:00
Loïc Hoguin 8cceeb1248
CQ property suite: queue:all/2 and queue:delete/2 not in OTP 23 2022-01-25 11:23:21 +01:00
Loïc Hoguin 3587c1756d
CQ property suite: queue:fold/3 not available in OTP 23 2022-01-25 11:23:21 +01:00
Loïc Hoguin 16f1843725
CQ property suite: Add publisher confirms 2022-01-25 11:23:20 +01:00
Loïc Hoguin 47f1198cb1
CQ property: test messages with expiration 2022-01-25 11:23:20 +01:00
Loïc Hoguin a32e2f053f
CQ property: Set "mandatory" with channel publish too 2022-01-25 11:23:20 +01:00
Loïc Hoguin 3de6b8a73d
Tweak command weights in CQ property suite 2022-01-25 11:23:20 +01:00
Loïc Hoguin 59ca114b61
Add channel_receive_and_reject to CQ property suite 2022-01-25 11:23:19 +01:00
Loïc Hoguin a357bd4fa8
Add channel_receive_and_ack to CQ property suite
Also improves queue name handling (now has a random element
that lets us do shrinking properly) and clean up channels
when tearing down a queue.
2022-01-25 11:23:19 +01:00
Loïc Hoguin bc23a05d60
Add basic.cancel to CQ property test suite 2022-01-25 11:23:19 +01:00
Loïc Hoguin aafd4c6e14
Add channel consume to the CQ property suite
The messages are not being received/acked at the moment.
2022-01-25 11:23:18 +01:00
Loïc Hoguin 44fd112e6d
Implement resuming v2->v1 conversion during dirty recovery 2022-01-25 11:23:16 +01:00
Loïc Hoguin 390bffb4cd
Fix classic_queue_SUITE for OTP < 24
It doesn't have rand:bytes/1, so we call
crypto:strong_rand_bytes/1 instead, until such a time comes
that we can drop support for those OTP versions.
2022-01-25 11:23:15 +01:00
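The fallback described in this commit can be sketched as a small compatibility wrapper (hypothetical helper name; the suite presumably branches on the OTP version rather than probing at runtime):

```erlang
%% Hypothetical sketch: use rand:bytes/1 where it exists (OTP 24+),
%% otherwise fall back to crypto:strong_rand_bytes/1.
-module(otp_compat_sketch).
-export([rand_bytes/1]).

rand_bytes(N) ->
    %% function_exported/3 also returns false if the module is not
    %% yet loaded; for a sketch the crypto fallback is then harmless.
    case erlang:function_exported(rand, bytes, 1) of
        true  -> rand:bytes(N);
        false -> crypto:strong_rand_bytes(N)
    end.
```

Either branch returns a binary of N random bytes; only the quality/cost of the randomness differs.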
Loïc Hoguin 92d95bf5c7
Fix unused vars warning in classic_queue_SUITE 2022-01-25 11:23:14 +01:00
Loïc Hoguin 469788a820
Ensure index files with holes get removed
On dirty recovery the count in the segment file was already
accurate. It was not accurate otherwise as it assumed that
all messages would be written to the index, which is not
the case in the current implementation.
2022-01-25 11:23:14 +01:00
Loïc Hoguin e5f51e960c
Add missing callback in channel_operation_timeout_test_queue 2022-01-25 11:23:14 +01:00
Loïc Hoguin 467309418c
Reenable some checks in backing_queue_SUITE 2022-01-25 11:23:13 +01:00
Loïc Hoguin ffe3be34d1
Update channel_operation_timeout_test_queue 2022-01-25 11:23:13 +01:00
Loïc Hoguin 4231143900
Remove commented-out code, todos, unused _Vars 2022-01-25 11:23:12 +01:00
Loïc Hoguin c63a75ca66
Reenable internal publish/basic_get 2022-01-25 11:23:11 +01:00
Loïc Hoguin 0f9d36a73b
Test using concurrent channels
Queue purge and direct publish/get no longer work for the time
being as a result.
2022-01-25 11:23:11 +01:00
Loïc Hoguin 216025f192
Test queue purge 2022-01-25 11:23:11 +01:00
Loïc Hoguin 94cb595b50
Set mandatory/confirm flags in proper test suite 2022-01-25 11:23:11 +01:00
Loïc Hoguin 3595cbd34d
Do a basic_get even when the queue is empty 2022-01-25 11:23:11 +01:00
Loïc Hoguin dff6e2ad82
Test changing both mode and version at the same time 2022-01-25 11:23:10 +01:00
Loïc Hoguin 97a7e8128c
Tweak the property-based test suite efficacy
Removed the now-pointless command checking for process liveness.

Increased the number of checks to 500 (5x the default).

Added weights for the commands: publish/get at 900 and
set mode/version at 100.
2022-01-25 11:23:10 +01:00
Loïc Hoguin 9f15f86252
CQ version switch via policies + proper test for this 2022-01-25 11:23:10 +01:00
Loïc Hoguin 006996014c
Small cleanup of CQ property suite 2022-01-25 11:23:10 +01:00
Loïc Hoguin 3ac5a678fc
Switch queue mode 2022-01-25 11:23:10 +01:00
Loïc Hoguin fcaefd786b
Test all combinations of classic/lazy v1/v2
Confirmed manually that the right mode/version are picked.
2022-01-25 11:23:10 +01:00
Loïc Hoguin d7925af1eb
Initial classic queue property based suite
Much remains to be added but there's some publish/basic_get
going on now. It is starting to look good.
2022-01-25 11:23:09 +01:00
Loïc Hoguin 3fc1eb14de
Fix remaining tests for CQ v1 2022-01-25 11:23:09 +01:00
Loïc Hoguin c4672b6f2c
Test both indexes 2022-01-25 11:23:09 +01:00
Loïc Hoguin 6dfe6a7be8
Test both CQ v1 and v2 2022-01-25 11:23:09 +01:00
Loïc Hoguin ad67f787ab
Reenable embed 0/1024 groups and fix embed 0 recovery 2022-01-25 11:23:06 +01:00
Loïc Hoguin d1b8f623dc
Fix channel_operation_timeout test
The records have to be kept in sync.
2022-01-25 11:23:05 +01:00
Loïc Hoguin 2473ff7328
Reenable some tests that were commented out 2022-01-25 11:23:05 +01:00
Loïc Hoguin b0b9b46313
Fix remaining tests 2022-01-25 11:23:04 +01:00
Loïc Hoguin c02de4d252
Some cleanup and fix most tests
Still need to improve recovery and do some sort of check in the
store so we know the file isn't corrupted.
2022-01-25 11:23:04 +01:00
Loïc Hoguin fc9846d01d
Fix obvious mistakes in previous commit 2022-01-25 11:23:04 +01:00
Loïc Hoguin 33fada8847
Track delivers per-queue rather than per-message
Because queues deliver messages sequentially, we do not need to
keep track of delivers per message; we just need to keep track
of the highest message that was delivered, via its seq_id().

This allows us to avoid updating the index and storing data
unnecessarily and can help simplify the code (not seen in this
WIP commit because the code was left there or commented out
for the time being).

Includes a few small bug fixes.
2022-01-25 11:23:03 +01:00
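The per-queue tracking described above can be sketched in a few lines (hypothetical names; the real state lives in the classic queue implementation): a single "highest delivered" seq_id() high-water mark replaces a per-message delivered flag.

```erlang
%% Hypothetical sketch of per-queue deliver tracking via a
%% seq_id() high-water mark.
-module(deliver_tracking_sketch).
-export([mark_delivered/2, is_delivered/2]).

%% Raise the high-water mark when a message is delivered.
mark_delivered(SeqId, HighestDelivered) ->
    max(SeqId, HighestDelivered).

%% Because delivery is sequential, any seq_id() at or below the
%% mark has been delivered.
is_delivered(SeqId, HighestDelivered) ->
    SeqId =< HighestDelivered.
```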
Loïc Hoguin b439d2e6bb
Per-queue classic queue store
This currently works both with and without confirms.
It currently always writes to the per-queue store,
it would be good to write fan-out messages to the
shared store though.

It would be good to remove the usage of MsgId except
when the shared store is needed.
2022-01-25 11:23:02 +01:00
Loïc Hoguin cf080b9937
Add file_handle_cache FD reservations 2022-01-25 11:23:02 +01:00
Loïc Hoguin 0102191e2b
Rename to rabbit_classic_queue_index_v2 2022-01-25 11:23:00 +01:00
Loïc Hoguin 0f431876f2
No longer tests with different embed settings
Since messages are no longer embedded, those settings are ignored.
2022-01-25 11:22:58 +01:00
Loïc Hoguin 98f64f2fa8
Replace classic queue index with a modern implementation 2022-01-25 11:22:56 +01:00
Philip Kuryloski efcd881658 Use rules_erlang v2
bazel-erlang has been renamed rules_erlang. v2 is a substantial
refactor that brings Windows support. While this alone isn't enough to
run all rabbitmq-server suites on Windows, one can at least now start
the broker (bazel run broker) and run the tests that do not start a
background broker process.
2022-01-18 13:43:46 +01:00
Karl Nilsson 9a5d0f9d85 Make stream coordinator machine versioned
In order to retain deterministic results of state machine applications
during upgrades we need to make the stream coordinator versioned such
that we only use the new logic once the stream coordinator switches to
machine version 1.
2022-01-07 12:11:11 +00:00
dcorbacho 0bd8d41b72 Skip new import testcase on mixed environments 2022-01-03 17:37:06 +01:00
Michael Klishin 19ae35aa14
#3925 follow-up: don't include Erlang client headers 2021-12-28 01:24:32 +03:00
Michael Klishin b569ab5d74
Rename two newly introduced test modules 2021-12-28 00:35:55 +03:00
dcorbacho c88605aab4
Import definitions: support user limits 2021-12-26 04:32:00 +03:00
Luke Bakken d1496a2c7c
Fix tests 2021-12-26 04:32:00 +03:00
Luke Bakken 043641c99f
Use protected ets so that data can be read quickly 2021-12-26 04:31:59 +03:00
Thuan Duong Ba dc6fb24761 Minor fix to the condition that stops batching when the total batch size is large 2021-12-20 17:39:06 -08:00
Thuan Duong Ba 1ab485b44c Minor update for batching messages when sync throughput is 0 2021-12-20 17:39:06 -08:00
Thuan Duong Ba 157bffa332 Support configure max sync throughput in CMQs 2021-12-20 17:39:06 -08:00
polaris-alioth 6431584a10 Prevent creating unnamed policy when loading definition 2021-12-19 12:52:26 +08:00
Philip Kuryloski 249e8c853c Adjust the way rabbit_fifo.hrl is referenced in rabbit_fifo_SUITE
For erlang_ls convenience
2021-12-16 16:41:15 +01:00
Michael Klishin ebd79836c1 Revisit operator policy merging rules for boolean fields
For booleans, we can prefer the operator policy value
unconditionally, without any safety implications.

Per discussion with @binarin @pjk25

(cherry picked from commit 6edb7396fd)
2021-12-10 19:48:16 +00:00
Loïc Hoguin 1b0eb9a4a3
Fix case where confirms may not be sent
A channel that first sends a mandatory publish before enabling
confirms mode may not receive confirms for messages published
after that. This is because the publish_seqno was increased
also for mandatory publishes even if confirms were disabled.
But the mandatory feature has nothing to do with publish_seqno.

The issue exists since at least
38e5b687de

The test case introduced focuses on multiple=false. The issue
also exists for multiple=true but it has a different impact:
sending multiple=true,delivery_tag=2 results in both messages
1 and 2 being acked, even if message 2 doesn't exist as far
as the client is concerned. If the message does exist
it might get confirmed earlier than it should have been. The
issue is a bigger problem the more mandatory messages were
sent before enabling confirms mode.
2021-12-08 15:53:47 +01:00
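The fix described in this commit can be sketched as follows (hypothetical module and state shape; the real code lives in the channel implementation): the publish sequence number only advances when confirm mode is enabled, and the mandatory flag has no effect on it.

```erlang
%% Hypothetical sketch of the publish_seqno fix: the sequence
%% number must only advance in confirm mode; "mandatory" is an
%% unrelated routing feature and must not touch it.
-module(confirm_seqno_sketch).
-export([maybe_next_seqno/2]).

%% State is {ConfirmEnabled, PublishSeqno}.
maybe_next_seqno({true, Seqno}, _Mandatory) ->
    %% Confirms on: every publish consumes a sequence number.
    {Seqno, {true, Seqno + 1}};
maybe_next_seqno({false, Seqno}, _Mandatory) ->
    %% Confirms off: no seqno is assigned, even for mandatory
    %% publishes (the buggy code incremented it here as well,
    %% shifting later confirms out of step).
    {undefined, {false, Seqno}}.
```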
Luke Bakken 9ff201c3ab
Remove flaky assertion
Thanks @kjnilsson
2021-12-01 06:57:25 -08:00
dcorbacho 5e9664f9e7 Query total number of messages on stream leader on queue.declare 2021-11-30 15:09:30 +01:00
David Ansari 45f69f8829 Add missing Ra commands to the log
Before this commit, the tests were not including any settle, return, or
discard Ra commands.

Do not pattern match against 'ra_event' because nowadays:
_Opts = [local, ra_event]
2021-11-26 16:16:45 +01:00
Michael Klishin 4f09fd109c
quorum_queue_SUITE: bump some timeouts 2021-11-24 18:04:35 +03:00
Michael Klishin 6a08e143e9
quorum_queue_SUITE: drop a debug line 2021-11-24 16:47:20 +03:00
Luke Bakken 6d545447b9
Fix quorum queue crash during consumer cancel with return
Fixes #3729
2021-11-23 08:59:47 -08:00
Michael Klishin e22e667a10
Do not count unroutable message in global totals 2021-11-23 16:37:46 +03:00
Luke Bakken 6aaf7ec597
Merge pull request #3740 from rabbitmq/rabbitmq-server-3739
Distribution listener settings support in rabbitmq.conf
2021-11-16 06:36:48 -08:00
Michael Klishin 8a30cf1c86
Distribution listener settings support in rabbitmq.conf
* distribution.listener.interface
* distribution.listener.port_range.min
* distribution.listener.port_range.max

Closes #3739
2021-11-16 16:37:28 +03:00
Karl Nilsson bc7b339e7a Stream coordinator: only update amqqueue record if stream id matches
From the coordinator's POV each stream has a unique id consisting of the
vhost, queue name and a high-resolution timestamp, even if several stream ids
relate to the same queue record.

When performing the mnesia update the coordinator now checks that the current stream id
matches that of the update_mnesia action and does not change the queue record if
the stream id is not the same.

This should avoid "old" incarnations of a stream queue updating newer ones
with incorrect information.
2021-11-16 12:32:33 +00:00
Karl Nilsson 1c6e45257d QQ: set better timeouts for commands
Refactor how the single active consumer check is performed when consuming.

Improve timeouts in rabbit_fifo_client.
2021-11-08 11:07:41 +00:00
Michael Klishin 686dccf410 Introduce a target cluster size hint setting
This is meant to be used by deployment tools,
core features and plugins
that expect a certain minimum
number of cluster nodes
to be present.

For example, certain setup steps
in distributed plugins might require
at least three nodes to be available.

This is just a hint, not an enforced
requirement. The default value is 1
so that for single node clusters,
there would be no behavior changes.
2021-11-03 08:42:58 +00:00
Karl Nilsson 691de2bea4 Take all clustered nodes into account when declaring stream.
Deriving a max-cluster-size only from running nodes would create situations
where, in a three-node cluster with only two nodes running, it would select
a non-running node as follower.
2021-10-18 15:44:53 +01:00
Karl Nilsson 5520c6cafe Stream queue: handle unsupported header value types
As AMQP 0.9.1 headers are translated into AMQP 1.0 application properties
they are not able to contain complex values such as arrays or tables.

RabbitMQ federation does use array and table values so to avoid crashing when
delivering a federated message to a stream queue we drop them. These header values
should be considered internal however so dropping them before a final queue deliver should not be a huge problem.
2021-10-13 10:27:00 +01:00
Philip Kuryloski 9c9fb7ffb0 Shard cluster_management_SUITE by testcase to better manage timeouts
The suite level timeout the .erl I've learned is actually per
case. By sharding bu testcase, we can better match the common test
level and bazel level timeouts, such that we can get logs from remote
test run failures.
2021-09-30 10:38:39 +02:00
Philip Kuryloski 860653c97a Adjust the clustering_management_SUITE timeout at the ct level
Previously the bazel timeout and common test timeout were equal, which
meant that in practice the bazel timeout was often reached first, in
which case we would not receive the test logs.
2021-09-23 13:55:18 +02:00
Philip Kuryloski 7dc0c29227 Use only 3 nodes for feature_flags_with_unpriveleged_user_SUITE
The test does not appear reliable when it runs in GitHub Actions. This
is currently the only test that does so; other tests run on BuildBuddy workers.
2021-09-22 17:22:49 +02:00
Philip Kuryloski 6e6279eb2b Reduce a test timeout
The original value of 15 minutes was inherited from a larger suite. 5
minutes should be sufficient, as a passing run is typically around 2 minutes.
2021-09-21 10:16:38 +02:00
Karl Nilsson eaa216da82 QQ: emit release cursors after consumer cancel
If this is not done, apps that consume/cancel from empty queues in a loop
will grow the raft log in an unbounded manner. This could also be the
case for the garbage_collect command.
2021-09-17 17:09:30 +01:00
Karl Nilsson 5779059bd5 QQ: fix memory leak when cancelling consumer
If the queue is empty when a consumer is cancelled, it would leave the
consumer id inside the service queue. If an application subscribes/unsubscribes
in a loop from an empty queue, this would cause the service queue to never be
cleaned up.

NB: whenever we make a change to how the quorum queue state machine is
calculated, we need to consider how this affects determinism, as during an
upgrade different members may calculate a different service queue state.
In this case it should be ok as they will eventually converge on the same
state once all "dead" consumer ids have been removed from the queue.

In any case it should not affect how messages are assigned to consumers.
2021-09-17 14:53:33 +01:00
Philip Kuryloski eea99e1cd5 Split the feature_flags_SUITE into two parts for CI/Bazel
Two testcases in the original suite fail if the test is run as the
root user. Currently under remote execution with bazel this is the
only working option. There is a workaround in place, but the entire
suite when run that way takes around 12 minutes. This splits the suite
so that the minimal set of cases is executed using the slower workaround.
2021-09-17 11:08:48 +02:00
Michal Kuratczyk 624767281f Enable metrics collection in run_tests
The proposed `min-masters` implementation relies on metrics, so they need
to be collected during queue_master_location tests.
2021-09-10 14:51:11 +02:00
Gerhard Lazu 6a1faa6fd6
Keep checking that replica recovered in rabbit_stream_queue
Rather than sleeping for 6 seconds, we want to check multiple times
within 30 seconds that the replica recovered, and either eventually
succeed, or fail if it does not recover within 30 seconds, the default
await_condition time interval.

Pair: @kjnilsson

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-08-31 17:02:21 +01:00
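The polling approach described in the commit above can be sketched like this (hypothetical helper; the suite presumably uses the common test helpers' await_condition rather than this loop): retry a boolean condition every 100 ms until it holds or the deadline passes, instead of one fixed sleep.

```erlang
%% Hypothetical sketch of an await_condition-style helper.
-module(await_sketch).
-export([await_condition/2]).

%% Deadline passed without the condition holding: fail loudly.
await_condition(_Fun, TimeoutMs) when TimeoutMs =< 0 ->
    error(await_condition_timeout);
await_condition(Fun, TimeoutMs) ->
    case Fun() of
        true  -> ok;
        false ->
            %% Poll again after a short pause.
            timer:sleep(100),
            await_condition(Fun, TimeoutMs - 100)
    end.
```

Compared with a fixed 6-second sleep, this succeeds as soon as the replica recovers and still bounds the total wait.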
Philip Kuryloski 09fb5c5321 Skip additional tests in mixed versions
The tests in question won't pass consistently as they are at the mercy
of how the quorum queue is placed across the mixed version nodes
2021-08-30 17:17:25 +02:00
Michael Klishin 83f007be54
Merge pull request #3341 from rabbitmq/local-exclusive-queues
Always place exclusive queues on the local node
2021-08-28 09:54:49 +03:00
Michael Klishin 3b4b4dc222
Exclude roundtrip definition import cases from mixed version runs
References #3333
2021-08-26 19:10:11 +03:00
Michael Klishin 54f7b6d77c
Re-format two definition import input files 2021-08-26 19:03:14 +03:00
Michael Klishin 42a3dfa81b
Exclude the #3333 test case from mixed version runs 2021-08-26 17:25:07 +03:00
Michal Kuratczyk d3dcd48ea5 Always place exclusive queues on the local node
Prior to this change, exclusive queues have been subject to the queue
location process, just like other queues. Therefore, if
queue_master_locator was not client-local and x-queue-master-locator was
not set to client-local, an exclusive queue was likely to be located on
a different node than the connection it is exclusive to.  This is
suboptimal and may lead to inconsistencies when the queue's node goes
down while the connection's node is still up.
2021-08-26 13:05:55 +02:00
Michael Klishin 2e61f51773
Commit definition import case16 file 2021-08-24 04:41:51 +03:00
Michael Klishin 6f97707dac
Definition import: correctly import vhost metadata 2021-08-24 04:41:04 +03:00
Michael Klishin 6a0058fe7c
Introduce TLS-related rabbitmq.conf settings for definition import
currently only used by the HTTPS mechanism but can be used by
any other.
2021-08-17 20:42:53 +03:00
Michael Klishin f3a5235408
Refactor definition import to allow for arbitrary sources
The classic local filesystem source is still supported
using the same traditional configuration key, load_definitions.

Configuration schema follows peer discovery in spirit:

 * definitions.import_backend configures the mechanism to use,
   which can be a module provided by a plugin
 * definitions.* keys can be defined by plugins and contain any
   keys a specific mechanism needs

For example, the classic local filesystem source can now be
configured like this:

``` ini
definitions.import_backend = local_filesystem
definitions.local.path = /path/to/definitions.d/definition.json
```

``` ini
definitions.import_backend = https
definitions.https.url = https://hostname/path/to/definitions.json
```

HTTPS may require additional configuration keys related to TLS/x.509
peer verification. Such extra keys will be added as the need for them
becomes evident.

References #3249
2021-08-14 14:53:45 +03:00
Loïc Hoguin 24c25ab3cc
Add tests for the regression introduced in #3041 2021-08-11 12:50:04 +02:00
Jean-Sébastien Pédron 6c8cf4c510
Logging: Fix crash when Epoch-based timestamps are used with JSON
The code was passing a number (the timestamp) to
unicode:characters_to_binary/1 which expects an iolist to convert to
UTF-8.

We now verify if we have a number before calling that function. If this
is a number (integer or float), we keep it as is because JSON supports
that type.
2021-08-10 12:34:11 +02:00
Michael Klishin 2efc3d22fa
Merge pull request #3176 from rabbitmq/stream-error-handling
Better error handling for streams
2021-07-27 22:25:06 +03:00
David Ansari 6d968718c8 Fix function_clause error in tracking_status/2
Before this commit:

> ./sbin/rabbitmq-streams stream_status --tracking s1
Status of stream s1 on node rabbit@localhost ...
Error:
{:function_clause,
[{:rabbit_stream_queue, :"-tracking_status/2-fun-0-",
[:offsets, %{"s1-1" => 5}, []],
[file: 'src/rabbit_stream_queue.erl', line: 608]},
{:maps, :fold_1, 3, [file: 'maps.erl', line: 410]},
{:rabbit_stream_queue, :tracking_status, 2, []}]}

After this commit:

> ./sbin/rabbitmq-streams stream_status --tracking s1
Status of stream s1 on node rabbit@localhost ...
┌────────┬───────────┬───────┐
│ type   │ reference │ value │
├────────┼───────────┼───────┤
│ offset │ s1-1      │ 51    │
└────────┴───────────┴───────┘
2021-07-23 19:34:31 +02:00
Michael Klishin c1e3710140
Squash a compiler warning 2021-07-20 00:55:40 +03:00
Philip Kuryloski d6399bbb5b
Mixed version testing in bazel (#3200)
Unlike with GNU Make, mixed version testing with bazel uses a package-generic-unix for the secondary umbrella rather than the source. This brings the benefit of being able to mixed-version test releases built with older Erlang versions (even though all nodes will run under the single version given to bazel).

This introduces new test labels, adding a `-mixed` suffix for every existing test. They can be skipped if necessary with `--test_tag_filters` (see the github actions workflow for an example)

As part of the change, it is now possible to run an old release of rabbit with rabbitmq_run rule, such as:

`bazel run @rabbitmq-server-generic-unix-3.8.17//:rabbitmq-run run-broker`
2021-07-19 14:33:25 +02:00
Philip Kuryloski 0f4cf2755d Increase a timeout for flakiness' sake 2021-07-19 14:24:46 +02:00
Philip Kuryloski 0a78484999 Make things a little more consistent between per_*_limit suites 2021-07-16 14:40:51 +02:00
Philip Kuryloski 4f514f435b Try to reduce flakes in per_user_connection_channel_limit_partitions_SUITE 2021-07-16 14:32:35 +02:00
Philip Kuryloski 97e8037b80 Replace some static sleeps in tests with dynamic waits
This should help with flakiness
2021-07-15 16:42:14 +02:00
dcorbacho e65ba8347c Fix delete_replica bug
It caused a lot of flakiness on the rabbit_stream_queue_SUITE, both on `delete_replica`
and `delete_last_replica` test cases.
2021-07-14 17:18:20 +02:00
dcorbacho 6052ecdc9c Split cluster_size_3_parallel in two groups
Makes it faster to test the flaky tests locally and to isolate them
2021-07-14 17:18:20 +02:00
Philip Kuryloski 923d87f847 Avoid using a duplicate group name in rabbit_stream_queue_SUITE
Since bazel-erlang doesn't support this with sharding
2021-07-09 16:26:56 +02:00
Karl Nilsson 284809e750
Merge pull request #3170 from rabbitmq/stream-flaky
Fix restart of stream coordinator when there are no stream queues
2021-07-05 15:03:45 +01:00
Philip Kuryloski da6da8d6c7 Clear memory alarms on all nodes in the memory_alarm_rolls_wal test
If the alarm is triggered directly with `rabbit_alarm` it has to be
cleared on all nodes
2021-07-05 15:57:56 +02:00
dcorbacho deaa42ecac Fix restart of stream coordinator when there are no stream queues
Recovering from an existing queue is fine but if a node is restarted when
there are no longer stream queues on the system, the recovery process won't
restart the pre-existing coordinator as that's only performed on queue recovery.
The first attempt to declare a new stream queue on this cluster will crash with
`coordinator unavailable` error, as it only restarts the local coordinator
and not the whole ra cluster, thus lacking quorum.

Recovering the coordinator during the boot process ensures that a pre-existing
coordinator cluster is restarted in any case, and does nothing if there was
never a coordinator on the node.
2021-07-05 15:34:05 +02:00
Philip Kuryloski 390a00b828 Handle feature flag enablement failure more gracefully in test setup 2021-07-05 11:22:38 +02:00
Philip Kuryloski 1b92fadd80 Skip additional quorum_queue_SUITE cases under mixed versions 2021-06-30 12:41:59 +02:00
Philip Kuryloski d086af8070 Reduce test case flakiness in quorum_queue_SUITE
for the clustered/cluster_size_3/confirm_availability_on_leader_change case
2021-06-30 10:05:23 +02:00
Philip Kuryloski ef9647671f Introduce dynamic wait in parts of the quorum_queue_SUITE
to help with test flakes
2021-06-29 18:32:53 +02:00
Philip Kuryloski a8ae32e2f7 Skip an additional quorum_queue_SUITE case in mixed versions 2021-06-29 16:43:19 +02:00
Michael Klishin cf147ebfe5
Merge branch 'master' into mk-stricter-stop-start-assertions-in-quorum-queue-suite 2021-06-29 12:49:31 +03:00
Michael Klishin a1ab7452ef
Improve assertions in a QQ suite test 2021-06-29 12:16:23 +03:00
Philip Kuryloski 3cb8ff1ab9 Mixed version testing skip updates 2021-06-29 10:49:06 +02:00
dcorbacho c9305d948a
Use number of publishing channels as global publishers in amqp091 2021-06-29 08:10:42 +01:00
Michael Klishin bed64f2cc9
Reduce priority_queue_SUITE to single node tests
Other tests (the ones that produce flakes) arguably test classic mirrored
queues, a deprecated feature already reasonably well
covered in other suites.

Per discussion with @gerhard.
2021-06-28 21:59:16 +03:00
Michael Klishin a19a0f924a
quorum_queue_SUITE: don't unconditionally skip node_removal_is_not_quorum_critical
Unintentionally introduced in a3c97d491f
2021-06-28 13:02:55 +03:00
Philip Kuryloski a3c97d491f Update additional test skipping for 3.8/3.9 mixed versions 2021-06-25 11:17:46 +02:00
Philip Kuryloski dca208abce Additional skipping of unsupported tests in mixed version clusters
Also consolidate the mixed version check on
rabbit_ct_helpers:is_mixed_versions/1 as much as possible
2021-06-23 14:27:41 +02:00
Gerhard Lazu c7971252cd
Global counters per protocol + protocol AND queue_type
This way we can show how many messages were received via a certain
protocol (stream is the second real protocol besides the default amqp091
one), as well as by queue type, which is something many have asked for
for a really long time.

The most important aspect is that we can also see them by protocol AND
queue_type, which becomes very important for Streams, which have
different rules from regular queues (for example, consuming
messages is non-destructive, and deep queue backlogs - think billions of
messages - are normal). Alerting and consumer scaling due to deep
backlogs will now work correctly, as we can distinguish between regular
queues & streams.

This has gone through a few cycles, with @mkuratczyk & @dcorbacho
covering most of the ground. @dcorbacho had most of this in
https://github.com/rabbitmq/rabbitmq-server/pull/3045, but the main
branch went through a few changes in the meantime. Rather than resolving
all the conflicts, and then making the necessary changes, we (@gerhard +
@kjnilsson) took all learnings and started re-applying a lot of the
existing code from #3045. We are confident in this approach and would
like to see it through. We continued working on this with @dumbbell, and
the most important changes are captured in
https://github.com/rabbitmq/seshat/pull/1.

We expose these global counters in rabbitmq_prometheus via a new
collector. We don't want to keep modifying the existing collector, which
grew really complex in parts, especially since we introduced
aggregation, but start with a new namespace, `rabbitmq_global_`, and
continue building on top of it. The idea is to build in parallel, and
slowly transition to the new metrics, because semantically the changes
are too big since streams, and we have been discussing protocol-specific
metrics with @kjnilsson, which makes me think that this approach is
least disruptive and... simple.

While at it, we removed redundant empty return value handling in the
channel: the function called no longer returns such a value.

Also removed all DONE / TODO & other comments - we'll handle them when
the time comes, no need to leave TODO reminders.

Pairs @kjnilsson @dcorbacho @dumbbell
(this is multiple commits squashed into one)

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
2021-06-22 14:14:21 +01:00
Philip Kuryloski 2d06a67921 Fix test for rabbitmq-ct-helpers change
rabbit_ct_broker_helpers:rpc now raises an exception if the remote
call fails. The assertion is updated to match the new behavior
2021-06-22 11:12:17 +02:00
Michael Klishin 0e6298d1ae
Explain
(cherry picked from commit 2387022e8c)
2021-06-21 14:43:54 +08:00
Michael Klishin 6550cd1752
Reduce flakiness in rabbitmq_queues_cli_integration_SUITE
In case removed node hosts a leader, it takes a moment for
the QQ to elect a new one and begin accepting cluster
membership change operations again.

(cherry picked from commit a9d8816c6a)
2021-06-21 14:41:10 +08:00
Philip Kuryloski 40a7a1c24c Bring rabbit:logger_SUITE online in bazel and bump mismatched deps 2021-06-18 14:41:14 +02:00
Philip Kuryloski cff7516317 Skip some tests that are not mixed version compatible
Mark per_user_connection_channel_tracking_SUITE:cluster_size_2_network
as not mixed version compatible.

In a mixed 3.8/3.9 cluster, changes to rabbit_core_ff.erl imply that
some feature flag related migrations cannot occur, and therefore
user_limits cannot be enabled as required by the test
2021-06-17 15:36:22 +02:00
Karl Nilsson 8a4f4c6d45 Ignore dynamic_qq test that isn't mixed version compatible
quorum_unaffected_after_vhost_failure isn't mixed-version compatible as
it tries to declare a queue in a mixed cluster from a node running Ra 1.x while all other
nodes are running Ra 2.0.
2021-06-17 11:52:34 +01:00
Karl Nilsson d8ac46d745 Mark quorum queue test as non-mixed-version compatible
simple_confirm_availability_on_leader_change can't be made forwards compatible
as when running in mixed mode the queue declaration happens on an old node in
a cluster of mostly new nodes. As new nodes run Ra 2.0 and Ra 1.x does not know
how to create members on Ra 2.0 nodes, this test fails. This is an acceptable limitation
for a transient mixed versions cluster.
2021-06-17 10:48:20 +01:00
Michal Kuratczyk 437d8aa8c5 Don't run policy tests in parallel
Now that a policy overwrites queue arguments, running policy tests in
parallel with other tests leads to non-deterministic test results with
some tests randomly failing.
2021-06-07 16:46:14 +02:00
David Ansari 0876746d5f Remove randomized startup delays
On initial cluster formation, only one node in a multi node cluster
should initialize the Mnesia database schema (i.e. form the cluster).
To ensure that for nodes starting up in parallel,
RabbitMQ peer discovery backends have used
either locks or randomized startup delays.

Locks work great: When a node holds the lock, it either starts a new
blank node (if there is no other node in the cluster), or it joins
an existing node. This makes it impossible to have two nodes forming
the cluster at the same time.
Consul and etcd peer discovery backends use locks. The lock is acquired
in the consul and etcd infrastructure, respectively.

For other peer discovery backends (classic, DNS, AWS), randomized
startup delays were used. They work well enough in most cases.
However, in https://github.com/rabbitmq/cluster-operator/issues/662 we
observed that in 1% - 10% of the cases (the more nodes or the
smaller the randomized startup delay range, the higher the chances), two
nodes decide to form the cluster. That's bad since it ends up with a
single Erlang cluster but two RabbitMQ clusters. Even worse, no
obvious alert got triggered or error message logged.

To solve this issue, one could increase the randomized startup delay
range from e.g. 0m - 1m to 0m - 3m. However, this makes initial cluster
formation very slow since it will take up to 3 minutes until
every node is ready. In rare cases, we still end up with two nodes
forming the cluster.

Another way to solve the problem is to name a dedicated node to be the
seed node (forming the cluster). This was explored in
https://github.com/rabbitmq/cluster-operator/pull/689 and works well.
Two minor downsides to this approach are: 1. If the seed node never
becomes available, the whole cluster won't be formed (which is okay),
and 2. it doesn't integrate with existing dynamic peer discovery backends
(e.g. K8s, AWS) since nodes are not yet known at deploy time.

In this commit, we take a better approach: We remove randomized startup
delays altogether. We replace them with locks. However, instead of
implementing our own lock implementation in an external system (e.g. in K8s),
we re-use Erlang's locking mechanism global:set_lock/3.

global:set_lock/3 has some convenient properties:
1. It accepts a list of nodes to set the lock on.
2. The nodes in that list connect to each other (i.e. create an Erlang
cluster).
3. The method is synchronous with a timeout (number of retries). It
blocks until the lock becomes available.
4. If a process that holds a lock dies, or the node goes down, the lock
held by the process is deleted.

The list of nodes passed to global:set_lock/3 corresponds to the nodes
the peer discovery backend discovers (lists).

Two special cases worth mentioning:

1. That list can be all desired nodes in the cluster
(e.g. in classic peer discovery where nodes are known at
deploy time) while only a subset of nodes is available.
In that case, global:set_lock/3 still sets the lock without
blocking until all nodes can be connected to. This is good since
nodes might start sequentially (non-parallel).

2. In dynamic peer discovery backends (e.g. K8s, AWS), this
list can be just a subset of desired nodes since nodes might not startup
in parallel. That's also not a problem as long as the following
requirement is met: "The peer discovery backend does not list two disjoint
sets of nodes (on different nodes) at the same time."
For example, in a 2-node cluster, the peer discovery backend must not
list only node 1 on node 1 and only node 2 on node 2.

Existing peer discovery backends fulfil that requirement because the
resource the nodes are discovered from is global.
For example, in K8s, once node 1 is part of the Endpoints object, it
will be returned on both node 1 and node 2.
Likewise, in AWS, once node 1 started, the described list of instances
with a specific tag will include node 1 when the AWS peer discovery backend
runs on node 1 or node 2.

Removing randomized startup delays also makes cluster formation
considerably faster (up to 1 minute faster if that was the
upper bound in the range).
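
The locking approach described above can be sketched with
global:set_lock/3 (a minimal sketch, not RabbitMQ's actual
implementation; the module, function names and retry count are made up
for illustration):

```erlang
%% Minimal sketch: hold a cluster-wide lock across the nodes returned by
%% peer discovery while joining or seeding the cluster. If the holder
%% dies or its node goes down, global releases the lock automatically.
-module(cluster_formation_lock).
-export([with_lock/2]).

with_lock(DiscoveredNodes, Fun) ->
    LockId = {rabbitmq_cluster_formation, self()},
    Retries = 60,
    %% Blocks (retrying) until the lock is available on the reachable
    %% subset of DiscoveredNodes; connecting to them forms the Erlang
    %% cluster as a side effect.
    case global:set_lock(LockId, DiscoveredNodes, Retries) of
        true ->
            try
                Fun() %% join an existing node, or start a blank one
            after
                global:del_lock(LockId, DiscoveredNodes)
            end;
        false ->
            {error, could_not_acquire_lock}
    end.
```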
2021-06-03 08:01:28 +02:00
Michael Klishin 12253d2fb4
Merge pull request #2954 from rabbitmq/new-segment-entry-count-default
Set segment_entry_count per vhost and use a better default
2021-05-27 01:56:02 +03:00
Karl Nilsson 1ea7bf5519 quorum_queue_SUITE restructure tests
Run more tests with 3 node cluster and have only one group definition
for cluster_size_2
2021-05-21 15:13:18 +01:00
Karl Nilsson 355b1cbe21 quorum_queue_SUITE only configure dist proxy when needed
Only configure the dist proxy for groups that require it.
2021-05-21 12:39:07 +01:00
Arnaud Cogoluègnes c30e013d7a
Rename max-segment-size to stream-max-segment-size-bytes 2021-05-20 10:16:19 +02:00
Michael Klishin 040f8cc912
Replace a few more leftover MPLv1.1 license headers
Most files have been using the MPLv2 headers for months now.
These were detected by the OSL process.
2021-05-19 21:20:47 +03:00
Michael Klishin 09a4ad411e
Merge pull request #3046 from rabbitmq/mk-extra-bcc-routing-target-in-queue-metadata
Make it possible for queues to have extra BCC targets specified as options
2021-05-19 17:56:39 +03:00
Karl Nilsson a96670b6c6 Fix stream x-stream-offset regression
x-stream-offset supports "friendly" relative timebase specifications
such as 100s. A recent change introduced a validation of the x-stream-offset
that disallowed such specs.
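
As a hedged illustration of the kind of spec involved (the commit's
actual parser is not shown here, and the accepted unit letters are an
assumption), a relative timebase spec like `100s` could be recognized
along these lines:

```erlang
%% Hedged sketch: recognize "friendly" relative specs such as <<"100s">>.
%% The set of unit letters accepted here is illustrative only.
-module(offset_spec).
-export([is_relative_spec/1]).

is_relative_spec(Bin) when is_binary(Bin) ->
    case re:run(Bin, <<"^[0-9]+[YMDhms]$">>) of
        {match, _} -> true;
        nomatch -> false
    end.
```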
2021-05-19 11:23:37 +01:00
Michael Klishin 38c15d691d
Make it possible for queues to have extra BCC targets specified as options
This introduces a backup mechanism that can be controlled
by plugins via policies.

Benchmarks suggest the cost of this change on
Erlang 24 is well under 1%. With a stream target, it is less
than routing to one extra queue of the same type (e.g. a quorum queue).
2021-05-18 22:22:16 +03:00
dcorbacho 733f5fb367 Report stream coordinator unavailable as an amqp error
Uses code 506: resource_error
2021-05-12 17:12:09 +01:00
Karl Nilsson 94e943692b
Merge pull request #3022 from rabbitmq/relative-time-offset
Support relative time based offset specs
2021-05-11 13:50:00 +01:00
Loïc Hoguin d9344b2b58
Set segment_entry_count per vhost and use a better default
The new default of 2048 was chosen based on various scenarios.
It provides much better memory usage when many queues are used
(allowing one host to go from 500 queues to 800+ queues) and
there seems to be none or negligible performance cost (< 1%)
for single queues.
2021-05-11 10:45:28 +02:00
dcorbacho 464bf69cc4 Support relative time based offset specs 2021-05-03 17:55:43 +02:00
dcorbacho bcac37d442 Disallow removal of the last stream member 2021-04-30 17:25:06 +02:00
Michael Klishin 62df3b7ebc
Reduce log output 2021-04-28 00:24:06 +03:00
Karl Nilsson 63e33aef6d
Merge pull request #2996 from rabbitmq/stream-add-replica-check
Streams: safer replica addition
2021-04-27 10:37:57 +01:00
kjnilsson a827275a43 Streams: safer replica addition
Disallow replica additions if any of the existing replicas are more than
10 seconds out of date.
2021-04-27 09:38:39 +01:00
Ayanda-D d78e14ad3b Allow #amqp_error{} responses in channel interceptors 2021-04-20 14:57:55 +01:00
Michael Klishin d147a08aee
Correct parse tags provided as a list
Discovered while testing a PR for rabbit-hole
2021-04-16 18:35:47 +03:00
kjnilsson b35c29d7b2 QQ: ensure that messages are delivered in order
In the case where some messages are kept in memory mixed with
some that are not, it is possible that messages are delivered to the
consuming channel with gaps/out of order, which would in some cases cause
the channel to treat them as re-sends it has already seen and just
discard them. When this happens, the messages get stuck in the consumer
state inside the queue and are never seen by the client consumer and
thus never acked. When this happens, the release cursors can't be emitted
as the smallest raft index will be one of the stuck messages.
2021-04-15 15:01:22 +01:00
Michael Klishin 6a4ee16b79
Merge pull request #2968 from rabbitmq/longer-qq-names
Allow quorum queue names to exceed atom max chars
2021-04-12 18:45:22 +03:00
kjnilsson 432edb11fc Allow quorum queue names to exceed atom max chars
If the concatenation of the vhost and the queue name exceeds 255 chars,
we generate an arbitrary atom name instead of throwing an
exception.
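
The naming rule described above could be sketched like this (a
hypothetical illustration, not the commit's code; the function name and
the fallback scheme are assumptions):

```erlang
%% Hypothetical sketch: use the "VHost_QueueName" atom when it fits the
%% 255-character atom limit, otherwise generate an arbitrary unique atom.
queue_name_to_atom(VHost, QName) ->
    Name = <<VHost/binary, "_", QName/binary>>,
    case byte_size(Name) of
        N when N =< 255 -> binary_to_atom(Name, utf8);
        _ -> binary_to_atom(base64:encode(crypto:strong_rand_bytes(12)), utf8)
    end.
```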
2021-04-12 14:14:26 +01:00
Michael Klishin 7f98bc3d1c
Add more VM memory monitor tests, pass Dialyzer
(cherry picked from commit 57ec1f8768)
2021-04-11 11:36:30 +03:00
Michael Klishin 30cbbba167
High VM watermark: support {relative, N} values set via advanced.config
for usability. It is not any different from when a float value
is used and only exists as a counterpart to '{absolute, N}'.

Also nothing changes for rabbitmq.conf users as that format performs
validation and correct value translation.

See #2694, #2965 for background.
2021-04-11 10:28:35 +03:00
Philip Kuryloski 3644ed58ee Test sharding and flaky annotations
Also rename a nested common test group in quorum_queue_SUITE to avoid
a name collision that prevented running the duplicates individually
2021-04-08 15:33:19 +02:00
kjnilsson e2fd14b996 Bump timeouts for peer discovery suite 2021-04-07 10:00:07 +01:00
kjnilsson b576242952 Increase rabbit_stream_queue_SUITE timetrap
And set the default of make start-cluster to 3 nodes.
2021-04-06 15:50:22 +01:00
Jean-Sébastien Pédron 95f9e92caa
unit_log_management_SUITE: Use $RABBITMQ_LOGS to configure logging
Now that the Cuttlefish schema sets default values for the application
environment in `{rabbit, [{log, ...}]}`, the values set in the testsuite
using application:setenv() are overwritten.

By using the $RABBITMQ_LOGS environment variable, we can override those
default values.
2021-04-06 11:52:55 +02:00
Philip Kuryloski 0caeb65d04 Shard the eager_sync_SUITE by case
This suite contains only one group, but is long enough to warrant
sharding. This is probably a bit of a time penalty in absolute terms
because init_per_suite and init_per_group re-run in each shard.
2021-03-31 15:47:36 +02:00
Jean-Sébastien Pédron 571b97513f
Logging: Allow to set timezone in rfc3339- and format-string-based time formats
This is not exposed to the end user (yet) through the Cuttlefish
configuration. But this is required to make logging_SUITE timezone
agnostic (i.e. the timezone of the host running the testsuite should not
affect the formatted times).
2021-03-31 14:13:40 +02:00
Carl Hörberg 330b820a0f Update proxy protocol test cases 2021-03-30 16:55:36 +02:00
Jean-Sébastien Pédron 2f648da118
config_schema_SUITE: Stop testing log configuration
The design of the rabbit_ct_config_schema helper makes it impossible to
do pattern matching and thus handle default values in the schema. As a
consequence, the helper explicitly removes the `{rabbit, {log, _}}`
configuration key to work around this limitation until a proper solution
is implemented and all testsuites rewritten. See
rabbitmq/rabbitmq-ct-helpers@b1f1f1ce68.

Therefore, we can't test log configuration variables anymore using this
helper. That's ok because logging_SUITE already tests many things.
2021-03-30 10:21:26 +02:00
Jean-Sébastien Pédron aca638abbb
Logging: Add configuration variables to set various formats
In addition to the existing configuration variables to configure
logging, the following variables were added to extend the settings.

log.*.formatter = plaintext | json
  Selects between the plain text (default) and JSON formatters.

log.*.formatter.time_format = rfc3339_space | rfc3339_T | epoch_usecs | epoch_secs | lager_default
  Configures how the timestamp should be formatted. It has several
  values to get RFC3339 date & time, Epoch-based integers and Lager
  default format.

log.*.formatter.level_format = lc | uc | lc3 | uc3 | lc4 | uc4
  Configures how to format the level. Things like uppercase vs.
  lowercase, full vs. truncated.
  Examples:
    lc: debug
    uc: DEBUG
    lc3: dbg
    uc3: DBG
    lc4: dbug
    uc4: DBUG

log.*.formatter.single_line = on | off
  Indicates if multi-line messages should be reformatted as a
  single-line message. A multi-line message is converted to a
  single-line message by joining all lines and separating them
  with ", ".

log.*.formatter.plaintext.format
  Set to a pattern to indicate the format of the entire message. The
  format pattern is a string with $-based variables. Each variable
  corresponds to a field in the log event. Here is a non-exhaustive list
  of common fields:
    time
    level
    msg
    pid
    file
    line
  Example:
    $time [$level] $pid $msg

log.*.formatter.json.field_map
  Indicates if fields should be renamed or removed, and the ordering
  which they should appear in the final JSON object. The order is set by
  the order of fields in that configuration variable.
  Example:
    time:ts level msg *:-
  In this example, `time` is renamed to `ts`. `*:-` says to remove all
  fields not mentioned in the list. In the end, the JSON object will
  contain the fields in the following order: ts, level, msg.

log.*.formatter.json.verbosity_map
  Indicates if a verbosity field should be added and how it should be
  derived from the level. If the verbosity map is not set, no verbosity
  field is added to the JSON object.
  Example:
    debug:2 info:1 notice:1 *:0
  In this example, debug verbosity is 2, info and notice verbosity is 1,
  other levels have a verbosity of 0.

All of them work with the console, exchange, file and syslog outputs.

The console output has specific variables too:

log.console.stdio = stdout | stderr
  Indicates if stdout or stderr should be used. The default is stdout.

log.console.use_colors = on | off
  Indicates if colors should be used in log messages. The default
  depends on the environment.

log.console.color_esc_seqs.*
  Indicates how each level is mapped to a color. The value can be any
  string but the idea is to use an ANSI escape sequence.
  Example:
    log.console.color_esc_seqs.error = \033[1;31m

V2: A custom time format pattern was introduced, first using variables,
    then a reference date & time (e.g. "Mon 2 Jan 2006"), thanks to
    @ansd. However, we decided to remove it for now until we have a
    better implementation of the reference date & time parser.

V3: The testsuite was extended to cover new settings as well as the
    syslog output. To test it, a fake syslogd server was added (Erlang
    process, part of the testsuite).

V4: The dependency to cuttlefish is moved to rabbitmq_prelaunch which
    actually uses the library. The version is updated to 3.0.1 because
    we need Kyorai/cuttlefish#25.
2021-03-29 17:39:50 +02:00
Philip Kuryloski 388654c542
Add a partial Bazel build (#2938)
Adds WORKSPACE.bazel, BUILD.bazel & *.bzl files for partial build & test with Bazel. Introduces a build-time dependency on https://github.com/rabbitmq/bazel-erlang
2021-03-29 11:01:43 +02:00
Philip Kuryloski 09e85d2e3d
Merge pull request #2935 from rabbitmq/rabbitmq-queue-int-tests
Fix integration tests to wait until ra cluster is ready
2021-03-26 17:28:06 +01:00
dcorbacho a1caff2a86 Fix integration tests to wait until ra cluster is ready
Publishing with confirms before growing/shrinking members is enough
2021-03-26 17:04:50 +01:00
Philip Kuryloski 1ead01081a Increase startup delay range in peer_discovery_classic_config_SUITE
I suspect the second ra system for coordination requires a bit more
time in boot, as this seems to flake more often since the merge
2021-03-26 14:11:36 +01:00
Philip Kuryloski 3c0c0901b1 Restore retry in peer_discovery_classic_config_SUITE
It was accidentally left commented out
2021-03-25 20:05:36 +01:00
Philip Kuryloski c313f36b57 Fix Makefile for feature_flags_SUITE_data/my_plugin
It was not updated for the rabbitmq-components.mk consolidation
2021-03-25 19:43:48 +01:00
Philip Kuryloski 008e47ef3c Fixup the behavior of rabbit_mnesia:is_virgin_node/0
Given the addition of the Coord ra system (and additional files on disk)
2021-03-25 10:49:17 +01:00
kjnilsson 8d8b67bb34 fix rabbit_fifo_int_SUITE 2021-03-24 14:17:34 +00:00
Michael Klishin 8eac876bc8
Use "quorum_queues" for QQ Ra system
"quorum" and "coordination" are not very distinctive
2021-03-22 21:44:19 +03:00
kjnilsson 75cea78415
fixes 2021-03-22 21:44:19 +03:00
kjnilsson f6f02a5d2d
ra systems wip 2021-03-22 21:44:15 +03:00
Philip Kuryloski a63f169fcb Remove duplicate rabbitmq-components.mk and erlang.mk files
Also adjust the references in rabbitmq-components.mk to account for
post monorepo locations
2021-03-22 15:40:19 +01:00
Michael Klishin 373285093e
Merge pull request #2899 from rabbitmq/parallel-stream-suite
Run most stream tests in parallel
2021-03-19 22:21:18 +03:00
Jean-Sébastien Pédron 9fd2d68e7a
rabbit_prelaunch_logging: $RABBITMQ_LOGS doesn't override log level
... if it is set in the configuration file.

Here is an example of that use case:
* The official Docker image sets RABBITMQ_LOGS=- in the environment
* A user of that image adds a configuration file with:
      log.console.level = debug

The initial implementation, introduced in rabbitmq/rabbitmq-server#2861,
considered that if the output is overriden in the environment (through
$RABBITMQ_LOGS), any output configuration in the configuration file is
ignored.

The problem is that the output-specific configuration could also set the
log level which is not changed by $RABBITMQ_LOGS. This patch fixes that
by keeping the log level from the configuration (if it is set obviously)
even if the output is overridden in the environment.
2021-03-19 15:43:28 +01:00
dcorbacho 9b3b5d48ec Run most stream tests in parallel
The test suite isn't faster (I suspect some contention on the coordinator),
but it is finding some bugs.
2021-03-17 21:32:42 +01:00
kjnilsson cbf0107605 Stream coordinator bug fix
Fix issue where a deleted replica could be restarted if the leader went
down whilst the replica was still running its start phase.
2021-03-17 13:54:28 +00:00
kjnilsson 9d83e0c5d9 Add logging to config decryption test
To possibly get a bit more information on failure reasons on GH Actions.
2021-03-16 16:28:41 +00:00
kjnilsson 3a26cf8654 Stream coordinator: handle commands for unknown streams
To avoid crashing.
2021-03-12 15:04:40 +00:00
kjnilsson 1709208105 Throw resource error when no local stream member
As well as some additional tests
2021-03-12 15:04:40 +00:00
dcorbacho e19aca8075 Use right map fields to compute streams info 2021-03-12 15:04:40 +00:00
kjnilsson 7fa3f6b6e1 Stream Coordinator: primitive backoff
Sleep for 5s after a failure due to a node being down before reporting
back to stream coordinator (which will immediately retry).

stream coordinator: correct command type spec

tidy up

fix rabbit_fifo_prop tests

stream coord: add function for member state query
2021-03-12 15:03:47 +00:00
kjnilsson bb3e0a7674 Move stream coordinator unit tests into ct suite 2021-03-12 15:03:10 +00:00
kjnilsson 9fb2e6d2dd Stream Coordinator refactor 2021-03-12 15:03:08 +00:00
Jean-Sébastien Pédron cdcf602749
Switch from Lager to the new Erlang Logger API for logging
The configuration remains the same for the end-user. The only exception
is the log root directory: it is now set through the `log_root`
application env. variable in `rabbit`. People using the Cuttlefish-based
configuration file are not affected by this exception.

The main change is how the logging facility is configured. It now
happens in `rabbit_prelaunch_logging`. The `rabbit_lager` module is
removed.

The supported outputs remain the same: the console, text files, the
`amq.rabbitmq.log` exchange and syslog.

The message text format slightly changed: the timestamp is more precise
(now to the microsecond) and the level can be abbreviated to always be
4-character long to align all messages and improve readability. Here is
an example:

    2021-03-03 10:22:30.377392+01:00 [dbug] <0.229.0> == Prelaunch DONE ==
    2021-03-03 10:22:30.377860+01:00 [info] <0.229.0>
    2021-03-03 10:22:30.377860+01:00 [info] <0.229.0>  Starting RabbitMQ 3.8.10+115.g071f3fb on Erlang 23.2.5
    2021-03-03 10:22:30.377860+01:00 [info] <0.229.0>  Licensed under the MPL 2.0. Website: https://rabbitmq.com

The example above also shows that multiline messages are supported and
each line is prepended with the same prefix (the timestamp, the level
and the Erlang process PID).

JSON is also supported as a message format and now for any outputs.
Indeed, it is possible to use it with e.g. syslog or the exchange. Here
is an example of a JSON-formatted message sent to syslog:

    Mar  3 11:23:06 localhost rabbitmq-server[27908] <0.229.0> - {"time":"2021-03-03T11:23:06.998466+01:00","level":"notice","msg":"Logging: configured log handlers are now ACTIVE","meta":{"domain":"rabbitmq.prelaunch","file":"src/rabbit_prelaunch_logging.erl","gl":"<0.228.0>","line":311,"mfa":["rabbit_prelaunch_logging","configure_logger",1],"pid":"<0.229.0>"}}

For quick testing, the values accepted by the `$RABBITMQ_LOGS`
environment variables were extended:
  * `-` still means stdout
  * `-stderr` means stderr
  * `syslog:` means syslog on localhost
  * `exchange:` means logging to `amq.rabbitmq.log`

`$RABBITMQ_LOG` was also extended. It now accepts a `+json` modifier (in
addition to the existing `+color` one). With that modifier, messages are
formatted as JSON instead of plain text.

The `rabbitmqctl rotate_logs` command is deprecated. The reason is
Logger does not expose a function to force log rotation. However, it
will detect when a file was rotated by an external tool.

From a developer point of view, the old `rabbit_log*` API remains
supported, though it is now deprecated. It is implemented as regular
modules: there is no `parse_transform` involved anymore.

In the code, it is recommended to use the new Logger macros. For
instance, `?LOG_INFO(Format, Args)`. If possible, messages should be
augmented with some metadata. For instance (note the map after the
message):

    ?LOG_NOTICE("Logging: switching to configured handler(s); following "
                "messages may not be visible in this log output",
                #{domain => ?RMQLOG_DOMAIN_PRELAUNCH}),

Domains in Erlang Logger parlance are the way to categorize messages.
Some predefined domains, matching previous categories, are currently
defined in `rabbit_common/include/logging.hrl` or headers in the
relevant plugins for plugin-specific categories.

At this point, very few messages have been converted from the old
`rabbit_log*` API to the new macros. It can be done gradually when
working on a particular module or logging.

The Erlang builtin console/file handler, `logger_std_h`, has been forked
because it lacks date-based file rotation. The configuration of
date-based rotation is identical to Lager. Once the dust has settled for
this feature, the goal is to submit it upstream for inclusion in Erlang.
The forked module is called `rabbit_logger_std_h` and is based on
`logger_std_h` in Erlang 23.0.
2021-03-11 15:17:36 +01:00
Michael Klishin d77609bba4
Merge pull request #2846 from rabbitmq/cleanup-rabbit-fifo-usage
Clean up rabbit_fifo_usage table on queue.delete
2021-03-03 18:33:33 +03:00
Michael Klishin a2f98f25e9
Merge pull request #2804 from rabbitmq/rabbitmq-server-2756
Add federation support for quorum queues
2021-02-25 19:10:15 +03:00
dcorbacho a147cc4877 Clean up rabbit_fifo_usage table on queue.delete 2021-02-25 16:57:43 +01:00
Michael Klishin cd1a271499
As of Lager 3.8.2, Lager has a log_root default
so override it unconditionally.
2021-02-25 00:43:02 +03:00
dcorbacho 699cd1ab29 Add federation support for quorum queues 2021-02-18 17:15:47 +01:00
Carl Hörberg 413bfe7b37 Disable Erlang busy wait by default
By disabling the Erlang busy wait threshold, CPU usage with 5000 idle connections
drops from 110% to 14%. Throughput does not seem to be affected at all;
if anything it actually goes up a bit when you have 5000 idle connections
(because fewer CPU cycles are wasted polling idle connections).

rabbitmq-perf-test-2.13.0/bin/runjava com.rabbitmq.perf.PerfTest -s 8000 -z 15

With default erlang busy wait threshold:
id: test-115706-497, sending rate avg: 39589 msg/s
id: test-115706-497, receiving rate avg: 39570 msg/s

With busy wait disabled:
id: test-115807-719, sending rate avg: 40340 msg/s
id: test-115807-719, receiving rate avg: 40301 msg/s

rabbitmq-diagnostics runtime_thread_stats output while running the
PerfTest:

with default busy wait threshold:

Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
           aux    0.01%    0.00%    0.00%    0.00%    0.00%    0.00%   99.98%
dirty_cpu_sche    0.00%    0.00%    0.00%    0.03%    0.05%    0.00%   99.92%
dirty_io_sched    0.00%    0.00%    0.00%    0.00%    0.01%    0.00%   99.99%
          poll    0.00%    0.67%    0.00%    0.00%    0.00%    0.00%   99.33%
     scheduler    0.69%    0.18%   28.41%    5.49%    9.50%    7.43%   48.29%

without busy wait threshold:

Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
           aux    0.01%    0.00%    0.00%    0.00%    0.01%    0.00%   99.98%
dirty_cpu_sche    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_sched    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
          poll    0.00%    0.77%    0.00%    0.00%    0.00%    0.00%   99.23%
     scheduler    0.70%    0.14%   28.29%    5.41%    0.86%    7.22%   57.38%
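Disabling busy waiting amounts to passing scheduler busy-wait emulator flags to the runtime. One possible way to do this for RabbitMQ (assuming the usual `RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS` mechanism for extra emulator flags) is:

```shell
# Disable busy waiting for regular, dirty CPU and dirty IO schedulers
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +sbwtdcpu none +sbwtdio none"
```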
2021-02-10 12:35:12 +01:00
Michael Klishin ad20bfbc40
Use new crypto API cipher name here
References rabbitmq/credentials-obfuscation#10
2021-02-09 11:22:48 +03:00
Michael Klishin 0939cec51a
Exclude aes_ige256 in one more test suite 2021-02-08 11:21:16 +03:00
Michael Klishin c7b9c39352
Don't perform CMQ leadership transfer when entering maintenance mode
The time this operation can take in clusters with a lot of classic
mirrored queues (say, tens or hundreds of thousands) can be prohibitive
for upgrades.

Upgrades can instead use a health check to ensure that there are in-sync
replicas before entering maintenance mode, in which case
the transfer is not really necessary.

All of the above is more obvious with the recent changes in #2749.
2021-01-27 19:11:26 +03:00
Michael Klishin 52479099ec
Bump (c) year 2021-01-22 09:00:14 +03:00
kjnilsson f2418cfe4c Fix crash bug in QQ state conversion
When there are consumers in the service queue.
2021-01-20 14:19:33 +00:00
kjnilsson 2f0dba45d8 Stream: Channel resend on leader change
Detect when a new stream leader is elected and make stream_queues
re-send any unconfirmed, pending messages to ensure they did not get
lost during the leader change. This is done using the osiris
deduplication feature to ensure the resend does not create duplicates of
messages in the stream.
2021-01-13 12:09:44 +00:00
dcorbacho 9ef9dde6ce Apply retention policy in all osiris members 2021-01-12 12:18:13 +00:00
dcorbacho e5a2eaaa0d Update retention when only stream retention policy has changed
In any other case, the worker needs to be restarted
2021-01-12 12:18:13 +00:00
Michal Kuratczyk 6a81589c11 Expose `bypass_pem_cache` through rabbitmq.conf
Bypassing PEM cache may speed up TLS handshakes in some cases as described
here:
https://blog.heroku.com/how-we-sped-up-sni-tls-handshakes-by-5x
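`bypass_pem_cache` is an application parameter of Erlang/OTP's `ssl` application, so independent of the new rabbitmq.conf key it can be sketched in classic `advanced.config` terms like this:

```erlang
%% advanced.config sketch: sets the ssl application's bypass_pem_cache
%% parameter so PEM files are read directly instead of being served
%% from the PEM cache.
[
  {ssl, [{bypass_pem_cache, true}]}
].
```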
2020-12-17 16:53:14 +01:00
Michael Klishin 4ea9ce1c0b
Clarify what version will be the first to use this format 2020-12-09 12:48:56 +03:00
Michael Klishin e4c37db689
Support importing users with arrays of tags
as opposed to a comma-separated binary.

Part of #2667.
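A sketch of the two tag shapes a definitions import would now accept (field set trimmed for illustration; `password_hash` values elided):

```json
{
  "users": [
    {"name": "user1", "password_hash": "…", "tags": "administrator,management"},
    {"name": "user2", "password_hash": "…", "tags": ["administrator", "management"]}
  ]
}
```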
2020-12-08 18:22:56 +03:00
kjnilsson 6fdb7d29ec Handle errors in crashing_queues_SUITE
The connection may crash during the previous declaration, in which case
a caught error is returned from amqp_connection:open_channel/1 that
wasn't handled previously. Exactly how things fail in this test is most
likely very timing-dependent and may vary.

Also fixes an MQTT test where the process that set up a mock auth ETS
table was transient once an rpc timeout was introduced
2020-12-03 13:56:09 +00:00
Luke Bakken ccf624211a
Add test that fails prior to the change for #2668 2020-12-02 12:33:02 -08:00
Arnaud Cogoluègnes ffd66027af
Merge pull request #2506 from rabbitmq/stream-timestamp-offset
Support timestamp offsets for stream consumers
2020-11-27 14:49:38 +01:00
Arnaud Cogoluègnes 43cfb45a74
Convert AMQP 0-9-1 timestamps to milliseconds
For start offset in stream queue.
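The conversion itself is a factor of 1000, since AMQP 0-9-1 timestamps carry POSIX seconds while stream offsets are expressed in milliseconds. A hypothetical sketch (the function name and the `{timestamp, Millis}` offset shape are assumptions, not the actual code):

```erlang
%% Hypothetical helper: AMQP 0-9-1 timestamps are POSIX seconds,
%% stream start offsets are expressed in milliseconds.
amqp_timestamp_to_stream_offset(Seconds) when is_integer(Seconds) ->
    {timestamp, Seconds * 1000}.
```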
2020-11-27 14:47:36 +01:00
kjnilsson ea7c9e9b61 QQ: Emit release cursor for empty basic gets
Otherwise an application that polled an empty quorum queue frequently
using basic.get would never cause a snapshot to be taken, resulting in
unlimited log growth.
2020-11-19 15:59:51 +00:00
dcorbacho f23a51261d Merge remote-tracking branch 'origin/master' into stream-timestamp-offset 2020-11-18 14:27:41 +00:00
kjnilsson d88b623c18 Use correct credit mode x-credit
When the x-credit consumer arg is defined, Quorum Queues should use
credit mode `credited` and not `simple_prefetch`.
2020-11-16 10:45:10 +01:00
Philip Kuryloski a1fe3ab061 Change repo "root" to deps/rabbit
rabbit must not be the monorepo root application, as other applications depend on it
2020-11-13 14:34:42 +01:00