This configuration is not guaranteed to be safe to change after a stream has bee n
declared and thus we'll remove the ability to change it after the initial
declaration. Users should favour the x- queue arg for this config but it will still
be possible to configure it as a policy but it will only be evaluated at
declara tion time.
This means that if a policy is set for a stream that re-configures the
`stream-m ax-segment-size-bytes` key it will show in the UI as updated but
the pre-existing stream will not use the updated configuration.
The key has been removed from the UI but for backwards compatibility it is still
settable.
NB: this PR adds a new command `update_config` to the stream coordinator state
machine. Strictly speaking this should require a new machine version but we're by
passing that by relying on the feature flag instead which avoids this command
being committed before all nodes have the new code version. A new machine version
can lower the availability properties during a rolling cluster upgrade so in
this case it is preferable to avoid that given the simplicity of the change.
Per discussion in #10415, this introduces a new module,
rabbit_mgmt_nodes, which provides a couple of helpers
that can be used to implement Cowboy REST's
resource_exists/2 in the modules that return
information about cluster members.
This configuration is not guaranteed to be safe to change after a stream has bee n
declared and thus we'll remove the ability to change it after the initial
declaration. Users should favour the x- queue arg for this config but it will still
be possible to configure it as a policy but it will only be evaluated at
declara tion time.
This means that if a policy is set for a stream that re-configures the
`stream-m ax-segment-size-bytes` key it will show in the UI as updated but
the pre-existing stream will not use the updated configuration.
The key has been removed from the UI but for backwards compatibility it is still
settable.
NB: this PR adds a new command `update_config` to the stream coordinator state
machine. Strictly speaking this should require a new machine version but we're by
passing that by relying on the feature flag instead which avoids this command
being committed before all nodes have the new code version. A new machine version
can lower the availability properties during a rolling cluster upgrade so in
this case it is preferable to avoid that given the simplicity of the change.
Per discussion in #10415, this introduces a new module,
rabbit_mgmt_nodes, which provides a couple of helpers
that can be used to implement Cowboy REST's
resource_exists/2 in the modules that return
information about cluster members.
Listing queues with the HTTP API when there are many (1000s) of
quorum queues could be excessively slow compared to the same scenario
with classic queues.
This optimises various aspects of HTTP API queue listings.
For QQs it removes the expensive cluster wide rpcs used to get the
"online" status of each quorum queue. This was previously done _before_
paging and thus would perform a cluster-wide query for _each_ quorum queue in
the vhost/system. This accounted for most of the slowness compared to
classic queues.
Secondly the query to separate the running from the down queues
consisted of two separate queries that later were combined when a single
query would have sufficed.
This commit also includes a variety of other improvements and minor
fixes discovered during testing and optimisation.
MINOR BREAKING CHANGE: quorum queues would previously only display one
of two states: running or down. Now there is a new state called minority
which is emitted when the queue has at least one member running but
cannot commit entries due to lack of quorum.
Also the quorum queue may transiently enter the down state when a node
goes down and before its elected a new leader.
Introduce GET /api/queues/detailed endpoint
Just removed garbage_collection, idle_since and any 'null' value
/api/queues with 10k classic queues returns 7.4MB of data
/api/queues/detailed with 10k classic queues returns 11MB of data
This sits behind a new feature flag, required to collect data from
all nodes: detailed_queues_endpoint
The default is 20 MiB, which is enough to upload
a definition file with 200K queues, a few virtual host
and a few users. In other words, it should accomodate
a lot of environments.
A previous PR removed backing_queue_status as it is mostly unused,
but classic queue version is still useful. This PR returns version
as a top-level key in queue objects.
[Why]
Mnesia is a very powerful and convenient tool for Erlang applications:
it is a persistent disc-based database, it handles replication accross
multiple Erlang nodes and it is available out-of-the-box from the
Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
metadata:
* virtual hosts' properties
* intenal users
* queue, exchange and binding declarations (not queues data)
* runtime parameters and policies
* ...
Unfortunately Mnesia makes it difficult to handle network partition and,
as a consequence, the merge conflicts between Erlang nodes once the
network partition is resolved. RabbitMQ provides several partition
handling strategies but they are not bullet-proof. Users still hit
situations where it is a pain to repair a cluster following a network
partition.
[How]
@kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
already uses successfully to implement quorum queues and streams for
instance. Those queues do not suffer from network partitions.
We created Khepri [2], a new persistent and replicated database engine
based on Ra and we want to use it in place of Mnesia in RabbitMQ to
solve the problems with network partitions.
This patch integrates Khepri as an experimental feature. When enabled,
RabbitMQ will store all its metadata in Khepri instead of Mnesia.
This change comes with behavior changes. While Khepri remains disabled,
you should see no changes to the behavior of RabbitMQ. If there are
changes, it is a bug. After Khepri is enabled, there are significant
changes of behavior that you should be aware of.
Because it is based on the Raft consensus algorithm, when there is a
network partition, only the cluster members that are in the partition
with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes
can "make progress". In other words, only those nodes may write to the
Khepri database and read from the database and expect a consistent
result.
For instance in a cluster of 5 RabbitMQ nodes:
* If there are two partitions, one with 3 nodes, one with 2 nodes, only
the group of 3 nodes will be able to write to the database.
* If there are three partitions, two with 2 nodes, one with 1 node, none
of the group can write to the database.
Because the Khepri database will be used for all kind of metadata, it
means that RabbitMQ nodes that can't write to the database will be
unable to perform some operations. A list of operations and what to
expect is documented in the associated pull request and the RabbitMQ
website.
This requirement from Raft also affects the startup of RabbitMQ nodes in
a cluster. Indeed, at least a quorum number of nodes must be started at
once to allow nodes to become ready.
To enable Khepri, you need to enable the `khepri_db` feature flag:
rabbitmqctl enable_feature_flag khepri_db
When the `khepri_db` feature flag is enabled, the migration code
performs the following two tasks:
1. It synchronizes the Khepri cluster membership from the Mnesia
cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
the `khepri_mnesia_migration` application [3].
2. It copies data from relevant Mnesia tables to Khepri, doing some
conversion if necessary on the way. Again, it uses
`mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
it.
This can be performed on a running standalone RabbitMQ node or cluster.
Data will be migrated from Mnesia to Khepri without any service
interruption. Note that during the migration, the performance may
decrease and the memory footprint may go up.
Because this feature flag is considered experimental, it is not enabled
by default even on a brand new RabbitMQ deployment.
More about the implementation details below:
In the past months, all accesses to Mnesia were isolated in a collection
of `rabbit_db*` modules. This is where the integration of Khepri mostly
takes place: we use a function called `rabbit_khepri:handle_fallback/1`
which selects the database and perform the query or the transaction.
Here is an example from `rabbit_db_vhost`:
* Up until RabbitMQ 3.12.x:
get(VHostName) when is_binary(VHostName) ->
get_in_mnesia(VHostName).
* Starting with RabbitMQ 3.13.0:
get(VHostName) when is_binary(VHostName) ->
rabbit_khepri:handle_fallback(
#{mnesia => fun() -> get_in_mnesia(VHostName) end,
khepri => fun() -> get_in_khepri(VHostName) end}).
This `rabbit_khepri:handle_fallback/1` function relies on two things:
1. the fact that the `khepri_db` feature flag is enabled, in which case
it always executes the Khepri-based variant.
4. the ability or not to read and write to Mnesia tables otherwise.
Before the feature flag is enabled, or during the migration, the
function will try to execute the Mnesia-based variant. If it succeeds,
then it returns the result. If it fails because one or more Mnesia
tables can't be used, it restarts from scratch: it means the feature
flag is being enabled and depending on the outcome, either the
Mnesia-based variant will succeed (the feature flag couldn't be enabled)
or the feature flag will be marked as enabled and it will call the
Khepri-based variant. The meat of this function really lives in the
`khepri_mnesia_migration` application [3] and
`rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
about the feature flag.
However, some calls to the database do not depend on the existence of
Mnesia tables, such as functions where we need to learn about the
members of a cluster. For those, we can't rely on exceptions from
Mnesia. Therefore, we just look at the state of the feature flag to
determine which database to use. There are two situations though:
* Sometimes, we need the feature flag state query to block because the
function interested in it can't return a valid answer during the
migration. Here is an example:
case rabbit_khepri:is_enabled(RemoteNode) of
true -> can_join_using_khepri(RemoteNode);
false -> can_join_using_mnesia(RemoteNode)
end
* Sometimes, we need the feature flag state query to NOT block (for
instance because it would cause a deadlock). Here is an example:
case rabbit_khepri:get_feature_state() of
enabled -> members_using_khepri();
_ -> members_using_mnesia()
end
Direct accesses to Mnesia still exists. They are limited to code that is
specific to Mnesia such as classic queue mirroring or network partitions
handling strategies.
Now, to discover the Mnesia tables to migrate and how to migrate them,
we use an Erlang module attribute called
`rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
tables and an associated converter module. Here is an example in the
`rabbitmq_recent_history_exchange` plugin:
-rabbit_mnesia_tables_to_khepri_db(
[{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).
The converter module — `rabbit_db_rh_exchange_m2k_converter` in this
example — is is fact a "sub" converter module called but
`rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
converter module to learn more about these modules.
[1] https://github.com/rabbitmq/ra
[2] https://github.com/rabbitmq/khepri
[3] https://github.com/rabbitmq/khepri_mnesia_migration
See #7206.
Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
Using GET /api/queues?disable_stats=true&enable_queue_totals=true is far more efficient than the standard GET /api/queues and in many cases will suffice for monitoring and operating purposes.
Rather than relying on queue name conventions, allow applying policies
based on the queue type. For example, this allows multiple policies that
apply to all queue names (".*") that specify different parameters for
different queue types.
The issue is that users retrieved with
the intention to list in the limits view
are not paged hence they are not wrapped
around a paging struct where users would be
under items attribute.
Pending selenium tests
1. Allow to inspect an (web) MQTT connection.
2. Show MQTT client ID on connection page as part of client_properties.
3. Handle force_event_refresh (when management_plugin gets enabled
after (web) MQTT connections got created).
4. Reduce code duplication between protocol readers.
5. Display '?' instead of 'NaN' in UI for absent queue metrics.
6. Allow an (web) MQTT connection to be closed via management_plugin.
For 6. this commit takes the same approach as already done for the stream
plugin:
The stream plugin registers neither with {type, network} nor {type,
direct}.
We cannot use gen_server:call/3 anymore to close the connection
because the web MQTT connection cannot handle gen_server calls (only
casts).
Strictly speaking, this commit requires a feature flag to allow to force
closing stream connections from the management plugin during a rolling
update. However, given that this is rather an edge case, and there is a
workaround (connect to the node directly hosting the stream connection),
this commit will not introduce a new feature flag.
1. Allow to create queues without `x-queue-type` argument, which give
default queue type logic a chance to run. What's more, those queues
definitions will be exported without `x-queue-type`, so they can be
loaded into another vhost and default queue logic will be applied
again.
2. Show default queue type on the vhost page and the vhosts list pages
The issue was primarily that UAA was
not properly configured. We had to whitelist
the uri used for logout otherwise UAA redirects
to its login page
WIP verify that logout.js works when running in
headless mode. For that we need a docker image
and at the moment, make docker-image is not
working because it is still using old otp 24.0.2
When the oauth user has a token without enough
credentials to access the management ui, the
rest request `/api/whoami` returns a 401 with
www-authentication response header which instructs
the browser to show a popup dialog box for basic
auth. With this change, we had to remove the response
header so that we could use the same mechanism we
use to show other oauth errors, i.e. use the login-status
panel instead.
It is important that RabbitMQ specifies which
scopes it has to request. We control that via the
management.oauth_scopes field. If we have enable_uaa = true,
the scopes are automatically configured for us as follows:
"openid profile " + authSettings.oauth_resource_id + ".*"
Else we have to configure oauth_scopes field.
Format code
Fix whitespace, fix warning
Update API docs
Remove blank lines
Add get all connections by username
Fix method name issue
Enable GET method to get connections by username
Update API documentation
Modify list all connections of username method
Remove list_by_username method and modify get all connections of user API
Code formatting, break up lines for readability
Refactor code to use pattern matching more effectively
Typo
The UI handles the case where the 2 fields are not present.
This can happen in a mixed-version cluster, where a node
of a previous version returns records without the fields.
The UI uses default values (active = true, activity status = up),
which is valid as the consumers of the node are "standalone"
consumers (not part of a group).
References #3753