If a delete happens shortly after a declare or other stream change,
there is a chance that the spawned mnesia update process will crash
when the amqqueue record cannot be recovered from durable storage.
This isn't harmful but does pollute the logs.
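A minimal sketch of the idea, with a hypothetical helper (the usual return values of rabbit_amqqueue:lookup/1 are assumed): treat a missing record as a no-op instead of letting the spawned update process crash.

    update_mnesia(QName, Fun) ->
        %% The queue may already have been deleted by the time this runs;
        %% a missing record is expected and should not crash the process.
        case rabbit_amqqueue:lookup(QName) of
            {ok, Q}            -> Fun(Q);
            {error, not_found} -> ok
        end.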
For booleans, we can prefer the operator policy value
unconditionally, without any safety implications.
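A minimal sketch of the merge rule being described (the function name and arity are illustrative, not the real policy-merge callback):

    %% For boolean keys the operator policy value always wins; other keys
    %% keep whatever merge strategy they already use (elided here).
    merge_policy_value(OpValue, _UserValue) when is_boolean(OpValue) ->
        OpValue;
    merge_policy_value(OpValue, UserValue) ->
        merge_other(OpValue, UserValue).   %% hypothetical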
Per discussion with @binarin @pjk25
(cherry picked from commit 6edb7396fd)
A channel that first sends a mandatory publish before enabling
confirms mode may not receive confirms for messages published
after that. This is because publish_seqno was also increased
for mandatory publishes even when confirms were disabled,
although the mandatory feature has nothing to do with publish_seqno.
The issue has existed since at least
38e5b687de
The test case introduced focuses on multiple=false. The issue
also exists for multiple=true but has a different impact:
sending multiple=true,delivery_tag=2 results in both messages
1 and 2 being acked, even if message 2 doesn't exist as far
as the client is concerned. If the message does exist,
it might get confirmed earlier than it should have been. The
more mandatory messages were sent before enabling confirms
mode, the bigger the problem.
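A rough sketch of the fix being described (record and helper names are illustrative): only bump publish_seqno when confirms are enabled; the mandatory flag alone must not advance it.

    handle_publish(Msg, Mandatory, State = #ch{confirm_enabled = Confirms,
                                               publish_seqno   = SeqNo}) ->
        %% Mandatory only controls basic.return handling, so it must not
        %% advance the confirm sequence number.
        State1 = case Confirms of
                     true  -> State#ch{publish_seqno = SeqNo + 1};
                     false -> State
                 end,
        deliver(Msg, Mandatory, State1).   %% hypothetical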
Only a couple of fields of the stream consumer record change
very frequently (credits and Osiris log reference), so this commit
introduces a nested record in the main consumer record that
contains the immutable fields. This potentially avoids producing
a lot of garbage, especially when the consumer state contains
several properties (consumer name, or single active consumer information
in the future).
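A sketch of the record layout this describes (record and field names are illustrative):

    %% The rarely-changing fields live in a nested record; updating credit
    %% or the log reference only copies the small outer tuple, while the
    %% nested configuration tuple is shared between versions of the state.
    -record(configuration, {member_pid,
                            offset_spec,
                            consumer_name}).

    -record(consumer, {configuration :: #configuration{},
                       credit        :: non_neg_integer(),
                       log}).   %% Osiris log reference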
Fixes #3841
Note that Lager itself doesn't handle certain combinations:
* $W0H45 is fine
* $W0D1H45 fails with an error
but hopefully what we have now is enough for
a minimalistic built-in log rotation feature.
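For reference, a date spec like the working one above ends up in a Lager file backend configuration roughly like this (a sketch only; the handler configuration RabbitMQ actually generates may differ):

    {lager, [
        {handlers, [
            {lager_file_backend, [{file,  "rabbit.log"},
                                  {level, info},
                                  {date,  "$W0H45"},   %% the combination that works
                                  {count, 5}]}
        ]}
    ]}.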
Otherwise, the QPids will still ask the limiter whether a message can be sent before delivering it.
This will degrade performance, especially when the limiter and QPid are on different nodes.
When 'can_send' is deactivated, the test results are as follows:
id: test-100147-150, time: 400.016s, sent: 17654 msg/s, returned: 0 msg/s, confirmed: 17658 msg/s, nacked: 0 msg/s, received: 17663 msg/s, min/median/75th/95th/99th consumer latency: 1775/5899/6486/7369/8440 μs, confirm latency: 2171/5581/6127/7026/7911 μs
test stopped (Reached time limit)
id: test-100147-150, sending rate avg: 17630 msg/s
id: test-100147-150, receiving rate avg: 17630 msg/s
When limiter and QPid are on the same node and 'can_send' is activated, the test results are as follows:
id: test-095229-474, time: 400.015s, sent: 13246 msg/s, returned: 0 msg/s, confirmed: 13247 msg/s, nacked: 0 msg/s, received: 13245 msg/s, min/median/75th/95th/99th consumer latency: 3777/7316/8345/10447/11392 μs, confirm latency: 4074/7308/8257/10336/11341 μs
test stopped (Reached time limit)
id: test-095229-474, sending rate avg: 13317 msg/s
id: test-095229-474, receiving rate avg: 13317 msg/s
As these results show, the message rate dropped by about 24% with 'can_send' activated.
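A sketch of the delivery-path check this refers to (all names are hypothetical, not the actual rabbit_limiter API): when the limiter is inactive, deliver without asking, avoiding a potentially cross-node call per message.

    maybe_deliver(Msg, Consumer, Limiter) ->
        case limiter_is_active(Limiter) of
            false ->
                %% No prefetch/flow limit in effect: skip the limiter call.
                deliver(Msg, Consumer);
            true ->
                case limiter_can_send(Limiter, Consumer) of
                    true  -> deliver(Msg, Consumer);
                    false -> hold(Msg, Consumer)
                end
        end.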
We're also typically storing the encoded properties, but we only
really need one representation. E.g. an enqueue command with a 2 byte payload
serialises to 290 bytes compared to 463. A nice saving.
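A sketch of the idea, with illustrative record fields: keep a single representation of the properties (the encoded one) and decode lazily, instead of serialising both forms into every enqueue command.

    -record(msg, {payload       :: binary(),
                  encoded_props :: binary()}).   %% the one representation kept

    properties(#msg{encoded_props = Bin}) ->
        %% Decode only when something actually needs the properties.
        decode_properties(Bin).                  %% hypothetical decoder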
Before this commit, the tests did not include any settle, return, or
discard Ra commands.
Do not pattern match against 'ra_event' because nowadays:
    _Opts = [local, ra_event]
The most recent description of the Osiris chunk format no longer
describes the timestamp field as "posix-ish". That was a bit misleading,
as it is Erlang's system time.
Add a link to the Erlang system time documentation to the subscription command
description to avoid confusion about the timestamp field.
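In other words, the chunk timestamp is comparable to what Erlang's system time returns, e.g. (millisecond resolution assumed here):

    %% See the erlang:system_time/1 documentation linked from the command
    %% description.
    Timestamp = erlang:system_time(millisecond).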
From the coordinator's POV each stream has a unique id consisting of the
vhost, queue name and a high-resolution timestamp, even if several stream ids
relate to the same queue record.
When performing the mnesia update the coordinator now checks that the current stream id
matches that of the update_mnesia action and does not change the queue record if
the stream id is not the same.
This should avoid "old" incarnations of a stream queue updating newer ones
with incorrect information.
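A rough sketch of the guard (helper names are illustrative, not the actual coordinator code):

    maybe_update_mnesia(ActionStreamId, Q) ->
        case current_stream_id(Q) of               %% hypothetical lookup
            ActionStreamId ->
                update_queue_record(Q);            %% hypothetical
            _NewerStreamId ->
                %% The action comes from an old incarnation of the stream;
                %% leave the newer queue record untouched.
                ok
        end.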
When the suite passes, it's about 120 seconds total, so 5 minutes per
case seems to be too much. Additionally, if the suite times out at the
bazel level, we get no logs, so the cause of the timeout is unclear.
Avoids multiple calls to `application:get_env`, which can be very expensive.
Also limits the filter to vhost_msg_stats, as queue_msg_stats are required for
individual queue metrics.
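A sketch of the kind of caching meant here (the application and key names are illustrative): read the setting once and keep it in the state, instead of hitting application:get_env for every sample.

    init(State) ->
        %% application:get_env on a hot path can become surprisingly
        %% expensive; look the filter up once at startup.
        Filter = application:get_env(rabbitmq_management_agent,
                                     stats_filter, none),   %% hypothetical key
        State#{stats_filter => Filter}.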
So that a reply is sent to the caller immediately after the command has
been processed, as intended. Previously, if reply_to was already set, it was
possible that a reply was never sent to the caller and the caller
timed out. This should reduce some flakiness in the rabbit_stream_queue suite
as well.
Strictly speaking, this change introduces non-determinism into the coordinator
state machine, as during an upgrade different members may run different code
for this command. But as this only affects side effects (replies), and
the state for the affected streams will shortly be removed, it is very
unlikely to cause any real issues.
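A heavily simplified sketch of the intent (not the actual coordinator code; it only relies on the usual ra_machine convention that the reply returned from apply/3 goes back to the caller):

    apply(_Meta, {some_command, StreamId}, State0) ->
        {State, Effects} = do_apply(StreamId, State0),   %% hypothetical
        %% Returning the reply here means the caller is answered as soon as
        %% the command is applied, instead of depending on a previously
        %% stored reply_to that might never fire.
        {State, ok, Effects}.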
Fixes #2941
This adds proper exception handlers in the right places, and tests
ensure that it indeed produces nice, neat logs without large
stacktraces for every AMQP operation.
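An illustrative example of the kind of handler meant (not the plugin's exact code): catch an expected channel failure and log a concise message rather than a crash report with a full stacktrace.

    try
        amqp_channel:call(Channel, Method)
    catch
        exit:{{shutdown, {server_initiated_close, Code, Text}}, _} ->
            rabbit_log:warning("AMQP operation failed: ~p ~s", [Code, Text]),
            {error, {Code, Text}}
    end.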
Unnecessary checking for subscribe permissions on topic was dropped,
as `queue.bind` does exactly the same check. Topic permissions tests
were also added, and they indeed confirm that there was no change in
behaviour.
Ideally the same explicit topic permission check should be dropped for
publishing, but that is more complicated, so for now there is only a
detailed comment in the source code explaining it.
A few other things were also optimized away:
- Using the AMQP client to test for queue existence
- Creating queues/starting consumption too eagerly, even if not yet
requested by the client
debugging of situations where messages may be stuck.
Also cancel the rabbit_fifo_client timer after a message resend to avoid
resending the messages again when the timer triggers.
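A sketch of the timer handling (state fields and helpers are illustrative): once the pending messages have been resent, cancel the resend timer so they are not sent a third time when it fires.

    resend_all(#state{timer_ref = TRef} = State0) ->
        State = resend_pending(State0),              %% hypothetical
        _ = case TRef of
                undefined -> ok;
                _         -> erlang:cancel_timer(TRef)
            end,
        State#state{timer_ref = undefined}.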
Some plugins might create internal queues that should not be counted
towards the total number of messages on the system. These can now be filtered
out using a regular expression on the queue name. Individual queue stats
are still available.
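A sketch of the filtering idea (function names are illustrative): exclude queues whose name matches the configured pattern from the aggregated totals, while per-queue stats remain untouched.

    total_messages(Queues, Pattern) ->
        {ok, RE} = re:compile(Pattern),
        lists:sum([messages(Q) || Q <- Queues,                  %% hypothetical
                                  re:run(name(Q), RE) =:= nomatch]).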
As we only need to make sure the rabbit_queues table is populated,
use a dirty write function that only does this instead. This could potentially
halve recovery times for many QQ scenarios.
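Roughly the idea (the table and record variables are placeholders, not a verbatim copy of the change):

    recover_queue(Tab, Q) ->
        %% Recovery only has to make sure the row exists; a full mnesia
        %% transaction is unnecessary for that.
        ok = mnesia:dirty_write(Tab, Q).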
Technically duplicate names are supported by Common Test, but we have
seen them contribute to flakiness in our suites in practice
(cherry picked from commit 513446b6d1)