The rabbitmq_stream.advertised_tls_host setting is not used in the
metadata frame of the stream protocol, even when it is set. This commit
makes sure the setting is used when set.
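For reference, a minimal sketch of supplying the value through
advanced.config; the hostname and the companion advertised_tls_port
entry are illustrative:
```
%% advanced.config (sketch; values are illustrative)
[
 {rabbitmq_stream, [
   {advertised_tls_host, "streams.example.com"},
   {advertised_tls_port, 5551}
 ]}
].
```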
References rabbitmq/rabbitmq-stream-java-client#803
Retry unregistering a stream from its group in case of stream
coordinator timeout/unavailability. The operation can fail during or
after a network partition, which is normal, but it is harmless to
retry it to clean up the SAC group. The operation is idempotent anyway.
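A minimal sketch of the retry loop, where unregister_from_group/1 is a
hypothetical stand-in for the actual coordinator call:
```
%% Retry an idempotent unregister operation a few times when the stream
%% coordinator times out or is unavailable; unregister_from_group/1 is
%% a hypothetical stand-in for the real coordinator call.
retry_unregister(Args) ->
    retry_unregister(Args, 3).

retry_unregister(_Args, 0) ->
    {error, coordinator_unavailable};
retry_unregister(Args, Attempts) ->
    case unregister_from_group(Args) of
        ok ->
            ok;
        {error, Reason} when Reason =:= timeout;
                             Reason =:= coordinator_unavailable ->
            timer:sleep(1000),
            retry_unregister(Args, Attempts - 1);
        {error, _} = Err ->
            Err
    end.
```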
Switch from ra_metrics to ra_counters
* Expose many more metrics (they are also kept up to date)
* Bump Seshat, Ra, Osiris, Prometheus.erl
* Switch from proplists to maps
New CLI command to trigger a rebalancing in a SAC group and activate a
consumer. This is a last-resort solution if all consumers in a group
accidentally end up in {connected, waiting} state.
The command re-uses an existing function, which only picks the consumer
that should be active. This means it does not try to "fix" the state
(e.g. removing a disconnected consumer because its node is definitely
gone from the cluster).
Fixes #14055
Calls to the stream SAC coordinator can fail for various reasons
(e.g. a timeout because of a network partition). The stream reader does not
take into account what the SAC coordinator returns and moves on even
in case of errors. This can lead to inconsistent state for SAC groups.
This commit changes this behavior by handling unexpected errors from the
SAC coordinator and closing the connection. The client is expected to
reconnect. This is safer than risking inconsistent state.
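A sketch of the new handling, with illustrative function names (not the
actual rabbit_stream_reader code):
```
%% Treat an unexpected SAC coordinator result as fatal for the
%% connection instead of moving on; close_connection/1 is an
%% illustrative stand-in.
handle_sac_result(ok, Connection) ->
    {ok, Connection};
handle_sac_result({error, Reason}, Connection) ->
    rabbit_log:warning("SAC coordinator call failed: ~tp; closing connection",
                       [Reason]),
    close_connection(Connection).
```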
Fixes #14040
The clean-up of a stream connection state when a stream member goes down can
remove subscriptions not affected by the member. The subscription state is
removed from the connection, but the subscription is not removed from
the SAC state (if the subscription is a SAC), because the subscription member
PID does not match the down member PID.
When the actual member of the subscription goes down, the subscription is no
longer part of the state, so the clean-up does not find the subscription
and does not remove it from the SAC state. This leaves a ghost consumer in
the corresponding SAC group.
This commit makes sure only the affected subscriptions are removed from
the state when a stream member goes down.
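The core of the fix can be sketched as filtering subscriptions by the
down member's PID (the map shape is illustrative):
```
%% Keep only the subscriptions whose stream member is the one that went
%% down; everything else stays in the connection state.
affected_subscriptions(DownPid, Subscriptions) ->
    maps:filter(fun(_SubId, #{member_pid := Pid}) ->
                        Pid =:= DownPid
                end, Subscriptions).
```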
Fixes #13961
A boolean status in the stream SAC coordinator is not enough to follow
the evolution of a consumer. For example, a former active consumer that
is stepping down can go down before another consumer in the group is
activated, leaving the coordinator waiting for an activation request
that will never arrive and the group without any active consumer.
This commit introduces three statuses: active (formerly "true"), waiting
(formerly "false"), and deactivating. The coordinator will now know when
a deactivating consumer goes down and will trigger a rebalancing to
avoid a stuck group.
This commit also introduces a status related to the connectivity state
of a consumer. The possible values are: connected, disconnected, and
presumed_down. Consumers are connected by default; they can become
disconnected if the coordinator receives a down event with a
noconnection reason, meaning the node of the consumer has been
disconnected from the other nodes. Consumers become connected again when
their node rejoins the cluster.
Disconnected consumers are still considered part of a group, as they are
expected to come back at some point. For example, there is no rebalancing
in a group if the active consumer gets disconnected.
The coordinator sets a timer when a disconnection occurs. When the timer
expires, corresponding disconnected consumers pass into the "presumed
down" state. At this point they are no longer considered part of their
respective group and are excluded from rebalancing decisions. They are expected
to get removed from the group by the appropriate down event of a
monitor.
So the consumer status is now a tuple, e.g. {connected, active}. Note
this is an implementation detail: only the stream SAC coordinator deals with
the status of stream SAC consumers.
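As a type sketch (the type names are illustrative, not the coordinator's
actual ones):
```
%% The two dimensions of the consumer status described above.
-type connectivity() :: connected | disconnected | presumed_down.
-type activity()     :: active | waiting | deactivating.
-type consumer_status() :: {connectivity(), activity()}.
```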
Two new configuration entries are introduced:
* rabbit.stream_sac_disconnected_timeout: this is the duration in ms of the
disconnected-to-forgotten timer.
* rabbit.stream_cmd_timeout: this is the timeout in ms to apply RA commands
in the coordinator. It used to be a fixed value of 30 seconds. The
default value is still the same. The setting has been introduced to
make integration tests faster.
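In advanced.config form (60000 is an illustrative value; 30000 is the
historical 30-second command timeout):
```
%% advanced.config sketch; both values are in milliseconds.
[
 {rabbit, [
   {stream_sac_disconnected_timeout, 60000},
   {stream_cmd_timeout, 30000}
 ]}
].
```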
Fixes #14070
A stream may not have a leader temporarily for several reasons, e.g.
after it has been restarted. The stream manager may return undefined in
this case. Some client code may crash because it expects a PID or an
error, but not undefined.
This commit makes sure the leader PID is an actual Erlang PID and
returns {error, not_available} if it is not.
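A sketch of the guard, where lookup_leader/1 is an illustrative stand-in
for the stream manager call:
```
%% Only return the leader when it is an actual PID; `undefined` (or any
%% other value) maps to {error, not_available}.
leader_pid(Stream) ->
    case lookup_leader(Stream) of
        Pid when is_pid(Pid) -> {ok, Pid};
        _                    -> {error, not_available}
    end.
```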
References #13962
The consumer deletion event is now emitted when the stream connection
goes down, not only when the consumer is removed explicitly by the
client. This is necessary to make sure the consumer record is removed
from the management ETS tables (consumer_stats) and to avoid ghost
consumers.
For other protocols like AMQP 0-9-1, the consumer_stats ETS table is
cleaned up when a channel goes down, but there is no channel concept in
the stream protocol.
This is not consistent with other protocols or queue implementations
(which emit the event only on explicit consumer cancellation),
but it is necessary to clean up stats correctly.
References #13092
Consumers with the same name consuming from the same stream should have
the same partition index. This commit adds a check to enforce this rule
and makes the subscription fail if it does not comply.
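The check itself is simple; a sketch with illustrative names:
```
%% Reject a subscription whose partition index differs from the one the
%% group already uses; the same variable in both head positions forces
%% an exact match.
check_partition_index(GroupIndex, GroupIndex) ->
    ok;
check_partition_index(_GroupIndex, _SubscriptionIndex) ->
    {error, precondition_failed}.
```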
Fixes #13835
A connection which terminated before it was fully established
would lead to a function_clause error, since the metadata needed to
call notify_connection_closed is not available. We can just ignore such
connections and not notify about them.
Resolves https://github.com/rabbitmq/rabbitmq-server/discussions/13670
The same connection can contain several consumers belonging to a SAC
group (group key = vhost + stream + consumer name). The whole group
must be re-evaluated to select a new active consumer after the consumers
of the down connection are removed from it.
The previous behavior did not re-evaluate the group and could
select a consumer from the down connection, leaving the group with only
inactive consumers, as the selected active consumer would never receive
the activation message from the stream SAC coordinator.
This commit fixes this problem by removing the consumers of the down
connection from the affected groups and then performing the
appropriate operations for the groups to keep on consuming (e.g.
notifying an active consumer that it needs to step down).
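The shape of the fix, with illustrative names and data shapes
(consumers as maps carrying their connection PID):
```
%% Remove the down connection's consumers from the group first, then
%% re-evaluate what is left so a live consumer gets activated.
%% select_active_consumer/1 is an illustrative stand-in.
handle_connection_down(ConnPid, Consumers0) ->
    Consumers = [C || C <- Consumers0,
                      maps:get(connection_pid, C) =/= ConnPid],
    select_active_consumer(Consumers).
```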
References #13372
The connection cannot return the requested information while it is
initializing, so we just return no information in that case.
The CLI info call was supported only in the open gen_statem callback, so
such a call during connection init would make it crash. This can
happen when several stream connections get closed and the user calls
list_stream_consumers or list_stream_connections while the connections
are recovering.
This commit adds a clause for CLI info calls in all the gen_statem
callbacks and returns actual information only when appropriate.
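A sketch of the added clauses in gen_statem handle_event/4 form
(infos/2 and the message shape are illustrative):
```
%% Answer CLI info calls from any state; only the open state returns
%% actual data.
handle_event({call, From}, {info, Items}, open, Data) ->
    {keep_state_and_data, [{reply, From, infos(Items, Data)}]};
handle_event({call, From}, {info, _Items}, _State, _Data) ->
    %% connection not fully initialized or being torn down: no info
    {keep_state_and_data, [{reply, From, []}]}.
```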
The stream manager does not need to be a gen_server (no cast, no state)
and the gen_server can create contention for large stream deployments
(some functions make cluster-wide calls that can take some time).
This reverts commit 620fff22f1.
It introduced a regression in another area - a TCP health check,
such as the default (with cluster-operator) readinessProbe,
on a TLS-enabled instance would log a `rabbit_reader` crash
every few seconds:
```
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> crasher:
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> initial call: rabbit_reader:init/3
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> pid: <0.999.0>
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> registered_name: []
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> exception error: no match of right hand side value {error, handshake_failed}
tls-server-0 rabbitmq 2024-09-13 09:03:13.010115+00:00 [error] <0.999.0> in function rabbit_reader:init/3 (rabbit_reader.erl, line 171)
```
It was observed that `rabbit_core_metrics_gc` and
`rabbit_stream_metrics_gc` processes can grow to several MBs of
memory (probably because they fetch the list of all queues). As they
execute infrequently (every 2 minutes by default), it can save some
memory to hibernate them in between runs (as is done for other
infrequently-running processes).
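The standard OTP idiom suffices here; a sketch for a gen_server-based
process (the helper name and interval are illustrative):
```
%% Run the GC, schedule the next run, then hibernate so the process
%% releases memory between infrequent executions.
handle_info(start_gc, State) ->
    gc_metrics(State),                         %% illustrative helper
    erlang:send_after(120000, self(), start_gc),
    {noreply, State, hibernate}.
```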
The spec of `rabbit_exchange:declare/7` needs to be updated to return
`{ok, Exchange} | {error, Reason}` instead of the old return value of
`rabbit_types:exchange()`. This is safe to do since `declare/7` is not
called by RPC - from the CLI or otherwise - outside of test suites, and
in test suites only through the CLI's `TestHelper.declare_exchange/7`.
Callers of this helper are updated in this commit.
Otherwise this commit updates callers to unwrap the `{ok, Exchange}`
and bubble up errors.
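An updated caller then looks roughly like this (the wrapper name and
continuation are illustrative):
```
%% Illustrative caller updated for the new return value: unwrap
%% {ok, Exchange} and bubble up errors.
declare_or_fail(XName, Type, Durable, AutoDelete, Internal, Args, Username) ->
    case rabbit_exchange:declare(XName, Type, Durable, AutoDelete,
                                 Internal, Args, Username) of
        {ok, Exchange}   -> Exchange;
        {error, _} = Err -> Err
    end.
```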
A common case for exchange deletion is that callers want the deletion
to be idempotent: they treat the `ok` and `{error, not_found}` returns
from `rabbit_exchange:delete/3` the same way. To simplify these
callsites we add a `rabbit_exchange:ensure_deleted/3` that wraps
`rabbit_exchange:delete/3` and returns `ok` when the exchange did not
exist. Part of this commit is to update callsites to use this helper.
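The wrapper can be sketched directly from that description:
```
%% Deletion is treated as successful when the exchange is already gone;
%% other errors (including {error, timeout}) are passed through.
ensure_deleted(XName, IfUnused, Username) ->
    case rabbit_exchange:delete(XName, IfUnused, Username) of
        ok                 -> ok;
        {error, not_found} -> ok;
        {error, _} = Err   -> Err
    end.
```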
The other part is to handle the `rabbit_khepri:timeout()` error possible
when Khepri is in a minority. For most callsites this is just a matter
of adding a branch to their `case` clauses with an appropriate error and
message.
This ensures that the call graph of `rabbit_db_binding:create/2` and
`rabbit_db_binding:delete/2` handle the `{error, timeout}` error
possible when Khepri is in a minority.
Since the feature flag `message_containers`, introduced in 3.13.0, is required in 4.0,
we can also require all other feature flags introduced in or before 3.13.0
and remove their compatibility code for 4.0:
* restart_streams
* stream_sac_coordinator_unblock_group
* stream_filtering
* stream_update_config_command
Fixes #11171
An MQTT user encountered TLS handshake timeouts with their IoT device,
and the actual error from `ssl:handshake` / `ranch:handshake` was not
caught and logged.
At this time, `ranch` uses `exit(normal)` in the case of timeouts, but
that should change in the future
(https://github.com/ninenines/ranch/issues/336)
This commit will print
```
[debug] <0.725.0> Transitioned from tcp_connected to peer_properties_exchanged
[debug] <0.725.0> Transitioned from peer_properties_exchanged to authenticating
[debug] <0.725.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal
[debug] <0.725.0> Transitioned from authenticating to authenticated
[debug] <0.725.0> Tuning response 1048576 0
[debug] <0.725.0> Open frame received for fakevhost
[warning] <0.725.0> Opening connection failed: access to vhost 'fakevhost' refused for user 'guest'
[debug] <0.725.0> Transitioned from tuning to failure
[warning] <0.725.0> Closing socket #Port<0.48>. Invalid transition from tuning to failure.
[debug] <0.725.0> rabbit_stream_reader terminating in state 'tuning' with reason 'normal'
```
when the user doesn't have access to the vhost.