Quorum queues were introduced in RabbitMQ 3.8.0. This was the first time
we added a breaking change protected behind a feature flag. This allowed a
RabbitMQ cluster to be upgraded one node at a time, without having to
stop the entire cluster.
The breaking change was a new field in the `#amqqueue{}` record. This
broke the API and the ABI because records are a compile-time construct
in Erlang.
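For illustration, here is a minimal sketch (with made-up fields; the
real record has many more) of why adding a record field is a breaking
change: records compile down to fixed-size tuples, so code built against
the old definition no longer matches the new one.

    %% Records are expanded to tuples at compile time. With this
    %% definition, #amqqueue{} compiles to the 3-element tuple
    %% {amqqueue, Name, Durable}.
    -record(amqqueue, {name, durable}).

    %% A module compiled with an extra field produces the 4-element
    %% tuple {amqqueue, Name, Durable, Type}, which no longer matches
    %% patterns compiled against the old record definition.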
The compatibility code has been in the wild long enough that we want to
support only the new `#amqqueue{}` record from now on. The
`quorum_queue` feature flag was marked as required in a previous commit
(see #5202). This allows us to remove code in this patch.
References #5215.
Thoas is more efficient both in terms of encoding
time and peak memory footprint.
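As a minimal sketch of the API we now rely on (assuming the usual
`thoas:encode/1` and `thoas:decode/1` entry points):

    %% Encode a term to a JSON binary and decode it back.
    Json = thoas:encode(#{<<"queue">> => <<"q1">>}),
    {ok, #{<<"queue">> := <<"q1">>}} = thoas:decode(Json).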
In the process we have discovered an issue:
https://github.com/lpil/thoas/issues/15
Pair: @pjk25
In the initial Feature flags subsystem implementation, we used a
migration function taking three arguments:
* the name of the feature flag
* its properties
* a command (`enable` or `is_enabled`)
However we wanted to implement a command (`post_enable`) and pass more
variables to the migration function. With the rework in #3940, the
migration function was therefore changed to take a single argument. That
argument was a record containing the command and much more information.
The content of the record could be different, depending on the command.
This improved the extensibility and flexibility of how we could call
the migration function. Unfortunately, this didn't improve its return
value as we wanted to return different things depending on the command.
In this patch, we change this completely by using a map of callbacks,
one per command, instead of that single migration function.
So before, where we had:

    #{migration_fun => {Module, Function}}

The new property is now:

    #{callbacks => #{enable => {Module, Function},
                     post_enable => {Module, Function}}}
All callbacks are still optional. You don't have to implement a fallback
empty function clause to skip commands you don't want to use or don't
support, as you would have to with the migration function.
Callbacks may be called with a callback-specific map of arguments and
they should return the expected callback-specific return values.
Everything is defined with specs.
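As an illustrative sketch (the argument keys and names here are
hypothetical; the specs define the actual shapes), an `enable` callback
could look like this:

    %% Called when the feature flag transitions to "enabled"; returns
    %% `ok' on success or `{error, Reason}' to abort the state change.
    my_enable(#{feature_name := FeatureName} = _Args) ->
        rabbit_log:info("Enabling ~ts", [FeatureName]),
        ok.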
If a feature flag uses this new callbacks map, it will automatically
depend on `feature_flags_v2`, like the previous arity-1 migration
function. A feature flag can't define both the old migration function
and the new callbacks map.
Note that this arity-1 migration function never made it to a release of
RabbitMQ, so it is dropped entirely with no backward compatibility.
Likewise for the `#ffcommand{}` record.
The `feature_flags_v2` test group is only relevant if all nodes support
the `feature_flags_v2` feature flag. When doing mixed-version testing,
we must ensure that both umbrellas support that feature flag. If they
do not, we skip the entire test group.
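A minimal sketch of that check, assuming a helper along the lines of
`is_feature_flag_supported/2` in the CT broker helpers:

    init_per_group(feature_flags_v2, Config) ->
        case rabbit_ct_broker_helpers:is_feature_flag_supported(
               Config, feature_flags_v2) of
            true  -> Config;
            false -> {skip, "feature_flags_v2 unsupported by an umbrella"}
        end.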
While here, add a dot at the end of a comment title.
We had this setup step to work around the by-design circular dependency
in the CLI (the broker depends on the CLI to start and the CLI depends
on the broker to work).
Unfortunately, it pollutes the code path and breaks the testsuite when
doing mixed-version testing: the copied `rabbit` taken from the main
umbrella ends up in the code path of the secondary umbrella, overriding
the `rabbit` there.
This patch removes this workaround to see what breaks, but it seems to
work fine so far. Let's see how it goes!
Up to RabbitMQ 3.10.x, this function took an application attribute. In
`master`, it takes the feature flags map only.
The testsuite now verifies if the remote node supports
`feature_flags_v2` to determine if it should call the function with the
feature flags map directly or convert it to an application attribute
first.
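A sketch of that decision (all helper names here are hypothetical):

    %% Pick the argument shape based on what the remote node supports.
    call_on_node(Node, FeatureFlagsMap) ->
        case supports_feature_flags_v2(Node) of
            true  -> do_call(Node, FeatureFlagsMap);
            false -> do_call(Node, map_to_app_attribute(FeatureFlagsMap))
        end.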
This fixes some failure of the testsuite in mixed-version cluster
testing.
Set the policy _before_ creating the stream as there is a current
limitation which means streams won't be updated immediately when
only changing the segment size.
Prior to this commit, when bindings were deleted while enabling feature
flag direct_exchange_routing_v2 - specifically AFTER
rabbit_binding:populate_index_route_table/0 ran and BEFORE the feature
flags controller changed the feature state from 'state_changing' to 'enabled' -
some bindings were incorrectly present in table rabbit_index_route.
This commit fixes this issue.
If the state is 'state_changing', bindings must be deleted from table
rabbit_index_route only once that table has been created.
(Somewhat unexpectedly) checking for table existence within a Mnesia
transaction can return 'true' although the subsequent Mnesia table operation
will fail with {no_exists, rabbit_index_route}.
Therefore, a lock on the schema table must be set when checking for
table existence.
(Mnesia itself creates a write lock on the schema table when creating a
table, see
09c601fa21/lib/mnesia/src/mnesia_schema.erl (L1525) )
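A condensed sketch of the resulting pattern, with simplified helper and
variable names (the real code lives in the rabbit_binding module):

    %% Must run inside an Mnesia transaction.
    delete_index_routes(Bindings) ->
        %% Lock the schema table first; without it, the existence check
        %% can return true while the delete still fails with
        %% {no_exists, rabbit_index_route}.
        mnesia:lock({table, schema}, read),
        case lists:member(rabbit_index_route, mnesia:system_info(tables)) of
            true ->
                [mnesia:delete_object(rabbit_index_route, B, write)
                 || B <- Bindings],
                ok;
            false ->
                ok
        end.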
An alternative fix would have been to catch {no_exists,
rabbit_index_route} outside the Mnesia transaction, i.e. in all the callers of
the rabbit_binding:remove... functions and then retry the
rabbit_binding:remove... function.
This allows operators to override the default queue type on a per-vhost
basis by setting the default_queue_type metadata key on the vhost record.
When a queue is declared without specifying a queue type (x-queue-type)
and there is a default set for the vhost we will augment the declare arguments
with the default. This allows future declares with and without the x-queue-type
argument to succeed.
Also, the default is only applied _if_ the queue is durable and not
exclusive.
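An illustrative sketch of the augmentation
(`vhost_default_queue_type/1` is a hypothetical helper):

    maybe_add_default_queue_type(VHost, Args, Durable, Exclusive) ->
        HasType = rabbit_misc:table_lookup(Args, <<"x-queue-type">>) =/= undefined,
        case {HasType, Durable andalso not Exclusive} of
            {false, true} ->
                case vhost_default_queue_type(VHost) of
                    undefined -> Args;
                    Type      -> [{<<"x-queue-type">>, longstr, Type} | Args]
                end;
            _ ->
                Args
        end.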
Changes in this commit:

1. Use the feature_flags_v2 API for feature flag
   direct_exchange_routing_v2. Both feature flags feature_flags_v2 and
   direct_exchange_routing_v2 will be introduced in 3.11. Therefore,
   direct_exchange_routing_v2 can depend on feature_flags_v2 (see the
   sketch after this list).

2. rabbit_binding:populate_index_route_table/0 should be run only during
   the feature flag migration. Thereafter, upon node start-ups, binding
   recovery takes care of populating the table.

3. Merge direct_exchange_routing_v2_post_3_11_SUITE into
   direct_exchange_routing_v2_SUITE. We don't need two separate test
   suites with almost identical setup and teardown.
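A sketch of the dependency declaration (the description string is
illustrative):

    -rabbit_feature_flag(
       {direct_exchange_routing_v2,
        #{desc       => "v2 direct exchange routing implementation",
          stability  => stable,
          depends_on => [feature_flags_v2]}}).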
`quorum_queue` is now required and can't be used in tests.
Also, part of the `enable_feature_flag_when_ff_file_is_unwritable`
testcase was commented out because it relied on the `is_enabled`
command, which was dropped in `rabbit_ff_controller`. This should be
reintroduced at some point with a more robust design.
Our test framework can inject feature flags, but we would need a special
handling for required injected feature flags which would not test the
regular code path.
Therefore we rely on the `quorum_queue` feature flag, now required, to
test that it is correctly enabled on boot and when clustering.
This gen_statem-based process is responsible for handling concurrency
when feature flags are enabled and when they are synchronized as a
cluster is expanded.
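As a rough sketch of the shape of such a process (heavily simplified;
not the actual controller code):

    -module(ff_controller_sketch).
    -behaviour(gen_statem).
    -export([start_link/0, enable/1]).
    -export([callback_mode/0, init/1, standing_by/3]).

    start_link() -> gen_statem:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Callers block here until the controller has finished any
    %% in-flight state change.
    enable(FeatureName) -> gen_statem:call(?MODULE, {enable, FeatureName}).

    callback_mode() -> state_functions.
    init([]) -> {ok, standing_by, #{}}.

    standing_by({call, From}, {enable, FeatureName}, Data) ->
        %% Requests are handled one at a time, which serializes
        %% concurrent enable/sync operations.
        Reply = do_enable(FeatureName),
        {keep_state, Data, [{reply, From, Reply}]}.

    do_enable(_FeatureName) -> ok.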
This clarifies and stabilizes the behavior of the feature flag subsystem
w.r.t. situations where e.g. a feature flag migration function takes
time to update data and a new node joins a cluster and synchronizes its
feature flag states with the cluster. There was a chance that the
feature flag was marked as enabled on the joining node, even though the
migration function didn't take care of that node.
With this new feature flags controller, enabling or synchronizing
feature flags blocks and delays any concurrent operations which try to
modify feature flags states too.
This change also clarifies where and when the migration function is
called: it is called at least once on each node that knows the feature
flag, when the state goes from "disabled" to "enabled" on that node.
Note that even if the feature flag is being enabled on a subset of the
nodes (because other nodes already have it enabled), it is marked as
"state_changing" everywhere during the migration. This is to prevent
that a node where it is enabled assumes it is enabled on all nodes who
know the feature flag.
There is a new feature as well: just after a feature flag is enabled,
the migration function is called a second time for any post-enable
actions. The feature flag is marked as enabled between these "enable"
and "post-enable" steps. The success or failure of this "post-enable"
run does not affect the state of the feature flag (i.e. it is ignored).
A new migration function API is introduced to allow more advanced
things. The new API is:
    my_migration_function(
      #ffcommand{name = ...,
                 props = ...,
                 command = enable | post_enable,
                 extra = #{...}})
The record is defined in `include/feature_flags.hrl`. Here is the
meaning of each field:
* `name` and `props` are the equivalent of the `FeatureName` and
`FeatureProps` arguments of the previous migration function API.
* `command` is basically the same as the previous `Arg` argument.
* `extra` is a map containing context-specific information. For instance, it
contains the list of nodes where the feature flag state changes.
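For illustration, a migration function using this API could look like
the following (the `nodes` key in `extra` and the helper functions are
hypothetical):

    my_migration_function(#ffcommand{name = FeatureName,
                                     command = enable,
                                     extra = #{nodes := Nodes}}) ->
        %% Migrate data on the nodes whose state changes.
        migrate_data(FeatureName, Nodes);
    my_migration_function(#ffcommand{command = post_enable}) ->
        %% Failures here are ignored; the feature flag stays enabled.
        cleanup_after_enable().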
This whole new behavior is behind a new feature flag called
`feature_flags_v2`. If a feature flag uses the new migration function
API, `feature_flags_v2` will be automatically enabled.
If many feature flags are enabled at once (like when a fresh RabbitMQ
node is started for the first time), `feature_flags_v2` will be enabled
first if it is in the list.
The format string started with "~s" but there was no corresponding
argument. I just removed the "~s".
While here, the log message now contains the stacktrace too.
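For illustration only (the real call site differs), the class of bug
looked like this:

    %% Before: the leading "~s" had no matching argument, so the
    %% formatting call failed.
    %% rabbit_log:error("~sError: ~tp", [Reason]),
    %% After: stray "~s" removed and the stacktrace logged too.
    rabbit_log:error("Error: ~tp~n~tp", [Reason, Stacktrace]).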