Stop sending connection_stats from protocol readers to rabbit_event.
Stop sending queue_stats from queues to rabbit_event.
Sending these stats every 5 seconds to the event manager process is
superfluous because noone handles these events.
They seem to be a relict from before rabbit_core_metrics ETS tables got
introduced in 2016.
Delete test head_message_timestamp_statistics because it tests that
head_message_timestamp is set correctly in queue_stats events
although queue_stats events are used nowhere.
The functionality of head_message_timestamp itself is still tested in
deps/rabbit/test/priority_queue_SUITE.erl and
deps/rabbit/test/temp/head_message_timestamp_tests.py
`rabbit_ff_registry` is an internal module of the feature flags
subsystem, it is not meant to be called outside of it. In particular, it
may lead to crashes very difficult to debug when a feature flag is
enabled. The reason is that this module is compiled and reloaded at
runtime, so the module may disappear for a small period of time. Another
reason is that a process calling `rabbit_ff_registry` may hold that
module, preventing the feature flags subsystem from performing its task.
The correct public API to query the state of a feature flag is
`rabbit_feature_flags:is_enabled/1`. It will always return a boolean. If
the feature flag being queried is in `state_changing` state, this
function takes care of waiting for the outcome.
See #5018.
This line gets logged when the client closes the connection to the
stream port before it authenticates successfully.
Some external load balancers for example connect to the stream port to
do health checks without sending any stream protocol frame.
This commits prevents the RabbitMQ log from being polluted.
Also rework elixir dependency handling, so we no longer rely on mix to
fetch the rabbitmq_cli deps
Also:
- Specify ra version with a commit rather than a branch
- Fixup compilation options for erlang 23
- Add missing ra reference in MODULE.bazel
- Add missing flag in oci.yaml
- Reduce bazel rbe jobs to try to save memory
- Use bazel built erlang for erlang git master tests
- Use the same cache for all the workflows but windows
- Avoid using `mix local.hex --force` in elixir rules
- Fetching seems blocked in CI, and this should reduce hex api usage in
all builds, which is always nice
- Remove xref and dialyze tags since rules_erlang 3 includes them in
the defaults
Some tools such as nvim + erlang_ls sometimes change the cwd if
they encounter a rebar.config. Here we move all rebar.config files
into the root of the project to avoid this and also have a single
point for formatting configuration.
Single active consumer must have a name, which is used as the reference
for storing offsets and as the name of the group the consumer belongs
to in case the stream is a partition of a super stream.
References #3753
A formerly active consumer can have in-flight credit
requests when it becomes inactive. This commit checks
the state of consumer on credit requests and make sure
not to dispatch messages if it's inactive.
Stream consumers can be active or not with SAC, so these 2 fields
are added to the stream metrics. This is the same as with
regular consumers.
References #3753
Separate logic between single SAC and SAC in a partition of a super
stream (makes code clearer), wait for former active consumer
notification to select the new active consumer (avoids a lock
of some sort on the group, so consumers can come and go).
References #3753
In SAC coordinator. The messages to connections cannot be sent
from the state machine or they will be sent also during the
replication. Only the leader enforces side effects.
References #3753
The SAC group coordination should happen as part of raft state
machine, to make it more reliable and robust.
It would also benefit from different "services" the coordinator
and RA provide.
References #3753
Deprecate queue-leader-locator values 'random' and 'least-leaders'.
Both become value 'balanced'.
From now on only queue-leader-locator values 'client-local', and
'balanced' should be set.
'balanced' will place the leader on the node with the least leaders if
there are few queues and will select a random leader if there are many
queues.
This avoid expensive least leaders calculation if there are many queues.
This change also allows us to change the implementation of 'balanced' in
the future. For example 'balanced' could place a leader on a node
depending on resource usage or available node resources.
There is no need to expose implementation details like 'random' or
'least-leaders' as configuration to users.
rabbitmq_cli uses some private rules_erlang apis that have changed in
the upcoming release
Additionally:
- Avoid including both standard and test versions of amqp_client in
integration test suites
- Eliminate most of the compilation order hints (explicit first_srcs)
in the bazel build
- Fix an include statement - in bazel, an app is not available to
itself as a library at compilation time
Several listeners for the same offset can be registered,
which causes problems for low rate, where consumers hit
the end of the stream often. Then Osiris messages accumulate
in the process mailbox, which slows it down considerably after
some time.
This commit makes the listener registration idempotent for a given
offset.
A stream may not be available or in an inconsistent state and the
stream coordinator reports this with an error that the stream
manager handles like an appropriate response. The stream protocol
adapter then tries to extract the topology from the error.
This commit makes the stream manager handles the error correctly
so that the stream protocol adapter reports the unavailability
of the stream to the client.
This is useful in case new fields are needed in further
versions. A new version node can ask for new fields to an
old version node in a mixed-version cluster.
If the old version node returns a known
value for unknown fields instead of failing, the new
node can set up appropriate default value for these
fields in the result of the CLI commands.
bazel-erlang has been renamed rules_erlang. v2 is a substantial
refactor that brings Windows support. While this alone isn't enough to
run all rabbitmq-server suites on windows, one can at least now start
the broker (bazel run broker) and run the tests that do not start a
background broker process
Only a couple of fields of the stream consumer record change
very frequently (credits and Osiris log reference), so this commit
introduces a nested record in the main consumer record that
contains the immutable fields. This potentially avoids producing
a lot of garbage, especially when the consumer state contains
several properties (consumer name, or single active consumer information
in the future).
Fixes#3841
The most recent description of Osiris chunk format does not reference
the timestamp field to be "posix-ish" anymore. This was bit misleading
as it is Erlang's system time.
Add link to Erlang system time documentation to the subscription command
description to avoid confusion about the timestamp field.
Technically duplicate names is supported by common test, but we have
seen it contribute to flakiness in our suite in practice
(cherry picked from commit 513446b6d1)
In test suite. Note a snapshot for 1.0-SNAPSHOT has been
pushed by mistake a while ago, so this one must be
excluded. It's unlikely it will be erased, as the snapshots
for the first stable version should be 1.0.0-SNAPSHOT.
The protocol documentation uses decimal values for error and request key
codes.
Let's use hex values instead. This helps when looking at a request and
its response - 0x0006 and 0x8006 vs. 6 and 32774.
Also, when looking at output of protocol analysis tools like Wireshark,
a hexadecimal value will be printed, for example:
"Nov 1, 2021 23:05:19.395825508 GMT","60216,5552","00000009000600010000000701"
"Nov 1, 2021 23:05:19.396069528 GMT","5552,60216","0000000a80060001000000070001"
Above, we can visually identify delete publisher request and response
(0x0006 and 0x8006) and easily match them in the documentation of the
protocol.
Finally, above argument applies to logging as it is common to log
hex values, not decimal.
1. Response for publisher declaration request does not contain
publisher id.
2. Add mechanism entry to the details of SASL handshake request.
3. SASL handshake response contains list of mechanisms, not just single
mechanism.
Use case: Allow plain connections over one (internal IP), and TLS
connections over another IP (eg. internet routable IP). Without this
patch a cluster can only support access over one or the other IP, not
both.
(cherry picked from commit b9e6aad035)
We expect to have 1 stream for each routing key, but
as binding can return several queues for a given key we
let that possibility open in the stream protocol.
Instead of injecting it into varios places inside the code.
When the osiris log is closed it will decrement the global "readers"
counter which is why it is much safer to do this in terminate.
Otherwise metrics will not get cleaned up correctly when processes crash.
It's also tidier to do this in a single place, in terminate/3
Pair: @kjnilsson
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
since it fails with Bazel.
As discussed with @pjk25, let's set this value via application env,
make it configurable to the test, but not configurable to the user.
Add state timeouts.
If the client takes more than 10s for a single step in the authentication
protocol, make the server close the TCP connection.
Also close the TCP connection if the server times out in state
close_sent. That's the case when the client sends an invalid command
(after successful authentication), the server requests the client to
close the connection, but the client doesn't respond anymore.
Default gen_server timeout is not enough to list busy connections.
Setting it to infinity allows the caller to decide the timeout,
as classic queues do. The `emit_info` function's family sets its
own timeout for all the cli commands.
Calling ensure_stats_timer after init_stats_timer and reset_stats_timer
is enough.
The idea is to call stop_stats_timer before hibernation and
ensure_stats_timer on wakeup. However, since we never call
stop_stats_timer in rabbit_stream_reader, we don't need to call
ensure_stats_timer on every network activity.
Before this commit test AlarmsTest.diskAlarmShouldNotPreventConsumption
of the Java client was failing.
When executing that test, the server failed with:
2021-06-25 16:11:02.886935+02:00 [error] <0.1301.0> exception exit: {unexpected_message,resume}
2021-06-25 16:11:02.886935+02:00 [error] <0.1301.0> in function rabbit_heartbeat:heartbeater/3 (src/rabbit_heartbeat.erl, line 138
because the heartbeater was tried to be resumed without being paused
before.
Above exception exit also happens on master branch when executing this
test. However, the test falsely succeeds on master because the following FIXME was
never implemented:
8e569ad8bf/deps/rabbitmq_stream/src/rabbit_stream_reader.erl (L778)
This is pure refactoring - no functional change.
Benefits:
* code is more maintainable
* smaller methods (instead of previous 350 lines listen_loop_post_auth function)
* well defined state transitions (e.g. useful to enforce authentication protocol)
* we get some gen_statem helper functions for free (e.g. debug utilities)
Useful doc: https://ninenines.eu/docs/en/ranch/2.0/guide/protocols/
This way we can show how many messages were received via a certain
protocol (stream is the second real protocol besides the default amqp091
one), as well as by queue type, which is something that many asked for a
really long time.
The most important aspect is that we can also see them by protocol AND
queue_type, which becomes very important for Streams, which have
different rules from regular queues (e.g. for example, consuming
messages is non-destructive, and deep queue backlogs - think billions of
messages - are normal). Alerting and consumer scaling due to deep
backlogs will now work correctly, as we can distinguish between regular
queues & streams.
This has gone through a few cycles, with @mkuratczyk & @dcorbacho
covering most of the ground. @dcorbacho had most of this in
https://github.com/rabbitmq/rabbitmq-server/pull/3045, but the main
branch went through a few changes in the meantime. Rather than resolving
all the conflicts, and then making the necessary changes, we (@gerhard +
@kjnilsson) took all learnings and started re-applying a lot of the
existing code from #3045. We are confident in this approach and would
like to see it through. We continued working on this with @dumbbell, and
the most important changes are captured in
https://github.com/rabbitmq/seshat/pull/1.
We expose these global counters in rabbitmq_prometheus via a new
collector. We don't want to keep modifying the existing collector, which
grew really complex in parts, especially since we introduced
aggregation, but start with a new namespace, `rabbitmq_global_`, and
continue building on top of it. The idea is to build in parallel, and
slowly transition to the new metrics, because semantically the changes
are too big since streams, and we have been discussing protocol-specific
metrics with @kjnilsson, which makes me think that this approach is
least disruptive and... simple.
While at this, we removed redundant empty return value handling in the
channel. The function called no longer returns this.
Also removed all DONE / TODO & other comments - we'll handle them when
the time comes, no need to leave TODO reminders.
Pairs @kjnilsson @dcorbacho @dumbbell
(this is multiple commits squashed into one)
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Most tests that can start rabbitmq nodes have some chance of
flaking. Rather than chase individual flakes for now, this commit
changes the default (though it can still be overriden, as is the case
for config_scheme_SUITE in many places, since I have yet to see that
particular suite flake).
Before this commit sending garbarge data to the server stream port
caused the RabbitMQ node to eat more and more memory.
In this commit, we fix it by expecting the client to go through the
proper authentication sequence. Otherwise, the server closes the socket.
Co-authored-by: Michal Kuratczyk <mkuratczyk@pivotal.io>
This is an important metric to keep track of and be aware (maybe even
alert on) when consumers fall behind consuming stream messages. While
they should be able to catch up, if they fall behind too much and the
stream gets truncated, they may miss on messages.
This is something that we want to expose via Prometheus metrics as well,
but we've started closer to the core, CLI & Management.
This should be merged as soon as it passes CI, we shouldn't wait on the
Prometheus changes - they can come later.
Pair: @kjnilsson
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>