As mentioned in discussion #14426, the way `cacerts` is handled by
cuttlefish schemas simply cannot work when the value is set.
If `cacerts` were set to a string value containing one X509 certificate,
it would eventually result in a crash, because the `cacerts` SSL option
must be of [this type](https://www.erlang.org/doc/apps/ssl/ssl.html#t:client_option_cert/0):
```
{cacerts, CACerts :: [public_key:der_encoded()] | [public_key:combined_cert()]}
```
Neither of those types is a string, of course.
This PR removes all use of `cacerts` in cuttlefish schemas. In addition,
it filters `cacerts` and `certs_keys` out of the data JSON-encoded for
the `/api/overview` HTTP API call. It _is_ technically possible to set
`cacerts` via `advanced.config`; if it were set, it would crash this API
call, as would `certs_keys`.
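A minimal sketch of that filtering, assuming the SSL options arrive as a
proplist; `filter_ssl_opts/1` is a hypothetical helper, not the actual
management plugin code:
```
%% Sketch: drop SSL options whose values cannot be represented as JSON
%% before the options are encoded for /api/overview.
filter_ssl_opts(SslOpts) when is_list(SslOpts) ->
    lists:filter(fun({cacerts, _})    -> false;
                    ({certs_keys, _}) -> false;
                    (_Other)          -> true
                 end, SslOpts).
```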
This avoids using Mix while compiling, which simplifies a number of
things and lets us make further build improvements later on.
Elixir is only enabled from within rabbitmq_cli currently.
Eunit is disabled since there are only Elixir tests.
Dialyzer will force-enable Elixir in order to process
Elixir-compiled beam files.
This commit also includes a few changes that are
related:
* The Erlang distribution will now be started for parallel-ct
* Many unnecessary PROJECT_MOD lines have been removed
* `eunit_formatters` has been removed; it provided little value
* The new `maybe_flock` Erlang.mk function is used where possible
* Build test deps when testing rabbitmq_cli (Mix won't do it anymore)
* rabbitmq_ct_helpers now uses the early plugins so that Dialyzer is
  properly set up
... are being used at the same time.
[Why]
Depending on which node clusters with which, a node running an older
version of the Khepri Ra machine may not be able to apply Ra commands
and could be stuck.
There is no real solution and this is clearly an unsupported scenario.
An old node won't always be able to join a newer cluster.
[How]
In the testsuites, we skip clustering tests if we detect that multiple
Khepri Ra machine versions are being used.
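A rough sketch of such a check; `khepri_machine_version/1` is an assumed
helper that returns the Khepri Ra machine version reported by a node,
not an existing API:
```
%% Hypothetical sketch: skip clustering tests when the nodes under test
%% report different Khepri Ra machine versions.
init_per_group(clustering, Config) ->
    Nodes = rabbit_ct_broker_helpers:get_node_configs(Config, nodename),
    case lists:usort([khepri_machine_version(Node) || Node <- Nodes]) of
        [_SingleVersion] -> Config;
        _Several         -> {skip, "Mixed Khepri Ra machine versions"}
    end.
```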
[Why]
When running mixed-version tests, nodes 1/3/5/... are using the primary
umbrella, so usually the newest version. Nodes 2/4/6/... are using the
secondary umbrella, thus the old version.
When clustering, we used to use node 1 (running a new version) as the
seed node, meaning other nodes would join it.
This complicates things with feature flags because we have to make sure
that we start node 1 with new stable feature flags disabled to allow old
nodes to join.
This is also a problem with Khepri machine versions because the cluster
would start with the latest version, which old nodes might not have.
[How]
This patch changes the logic to use a node running the secondary
umbrella as the seed node instead. If there is no node running it, we
pick the first node as before.
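The selection can be pictured like this; `uses_secondary_umbrella/1` is
an assumed predicate, not the actual helper name:
```
%% Illustrative sketch of the seed node selection described above.
pick_seed_node(Nodes) ->
    case [Node || Node <- Nodes, uses_secondary_umbrella(Node)] of
        [Seed | _] -> Seed;      %% prefer a node running the old version
        []         -> hd(Nodes)  %% fall back to the first node as before
    end.
```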
V2: Revert part of "rabbitmq_ct_helpers: Fix how we set
`$RABBITMQ_FEATURE_FLAGS` in tests" (commit
57ed962ef6). These changes are no
longer needed with the new logic.
V3: The check that verifies that the correct metadata store is used has
a special case for nodes that use the secondary umbrella: if Khepri
is supposed to be used but it's not, the feature flag is enabled.
The reason is that the `v4.0.x` branch doesn't know about the `rel`
configuration of `forced_feature_flags_on_init`. The nodes will
    have ignored this parameter and booted with the stable feature
flags only.
Many testsuites are adapted to the new clustering order. If they
manage which node joins which node, either the order is changed in
the testcases, or nodes are started with only required feature
flags. For testsuites that rely on peer discovery where the order is
unknown, nodes are started with only required feature flags.
[Why]
The test configuration was querying a network interface IP address based
on its name. However, the name, "eth0", is very specific to Linux. This
broke the test on other systems.
[How]
We still have to set an explicit `bind_addr` because Consul refuses to
start if the host has multiple private IPv4 addresses, as is the case
in CI.
Therefore, we hard-code 127.0.0.1 as the IPv4 address to use because it
is almost certain to exist everywhere.
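In Erlang terms, the agent configuration written by the testsuite ends
up containing something like the following; the structure is
illustrative, only the hard-coded address matters:
```
%% Sketch: the relevant part of the generated Consul agent configuration.
consul_agent_config() ->
    #{<<"bind_addr">> => <<"127.0.0.1">>}.
```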
[How]
We must check the return value of `rabbit_ct_broker_helpers:run_steps/2`
because it may indicate that the testsuite/testgroup/testcase should be
skipped.
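The pattern looks roughly like this in a testsuite; `setup_steps/0`
stands in for whichever steps the suite actually runs:
```
%% Sketch of the return-value check described above.
init_per_suite(Config) ->
    case rabbit_ct_broker_helpers:run_steps(Config, setup_steps()) of
        {skip, _Reason} = Skip -> Skip;
        Config1                -> Config1
    end.
```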
We don't need to duplicate so many patterns in so many
files since we have a monorepo (and want to keep it).
If I managed to miss something or remove something that
should stay, please put it back. Note that monorepo-wide
patterns should go in the top-level .gitignore file.
Other .gitignore files are for application- or folder-specific
patterns.
[Why]
The default node selection of the peer discovery subsystem doesn't work
well with Consul. The reason is that the selection is based on the
nodes' uptime, but the node with the highest uptime may not be the
first to register in Consul.
When this happens, the node that registered first will only discover
itself and boot as a standalone node. Then, the node with the highest
uptime will discover both of them, but will select itself as the node to
join because of its uptime. In the end, we end up with two clusters
instead of one.
[How]
We use the `CreateIndex` property in the Consul response to sort
services. We then derive the name of the node to join from the service
that has the lowest `CreateIndex`, meaning it was the first to register.
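A sketch of that sorting step, assuming the decoded Consul response is a
list of maps; `service_to_node/1` is a placeholder for the node name
derivation:
```
%% Illustrative sketch: pick the service with the lowest CreateIndex,
%% i.e. the first one that registered, and join the node it maps to.
select_node_to_join(Services) ->
    [First | _] = lists:sort(
                    fun(#{<<"CreateIndex">> := A},
                        #{<<"CreateIndex">> := B}) ->
                            A =< B
                    end, Services),
    service_to_node(First).
```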
[Why]
The new implementation of `rabbit_peer_discovery` acquires the lock only
when a node needs to join another one. This is meant to disappear in the
medium/long term anyway.
Here, we need to lock the query to Consul to make sure that queries
happen sequentially, not concurrently. This is a work in progress and we
may not keep it either.
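Conceptually, the query is wrapped in the existing lock, something like
the sketch below; `lock/0`, `unlock/1` and `query_consul/0` are
placeholders, not the plugin's actual functions:
```
%% Sketch: serialize queries to Consul by taking the lock around them.
list_nodes_locked() ->
    {ok, LockRef} = lock(),
    try
        query_consul()
    after
        unlock(LockRef)
    end.
```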
[Why]
Add a `system_SUITE` testsuite, copied from
rabbitmq_peer_discovery_etcd, that attempts to start a RabbitMQ cluster
where nodes use a Consul server to discover each other.
[How]
The new testcases try to create a cluster using the local Consul node
started by the testsuite. The first one starts one RabbitMQ node at a
time; the second one starts all of them concurrently.
While here, use the Consul source code added as a Git submodule in a
previous commit to compile Consul locally just for the testsuite.
[Why]
This allows other nodes to discover the actual node names, instead of
deriving one from the Consul agent node name and their own node name.
This makes it possible to register several RabbitMQ nodes with the same
Consul agent, which is handy for testsuites at least.
[How]
The Erlang node name is added to the `Meta` properties list along with
the RabbitMQ cluster name.
Note that this also fixes when the cluster name is added to `Meta`:
before this commit, a non-default cluster name was not added if the
user-configured properties list was empty at the beginning.
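The resulting `Meta` list can be sketched as follows; the key names and
the `cluster_name/0` helper are illustrative, not necessarily the exact
ones used by the plugin:
```
%% Sketch of the Meta properties registered with Consul.
meta_properties(UserMeta) ->
    UserMeta ++
        [{<<"erlang-node-name">>, atom_to_binary(node(), utf8)},
         {<<"cluster">>, cluster_name()}].
```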
[Why]
The `consul_svc` parameter is used as the service name and to construct
the service ID. The problem with the way the service ID is constructed
is that it doesn't allow registering several distinct RabbitMQ nodes
with the same Consul agent.
This is a problem for testsuites where we want to run several RabbitMQ
nodes on the same host with a single local Consul agent.
[How]
The service ID now has its own parameter, `consul_svc_id`. If it is
unset, it falls back to the previous construction from the service
name. This keeps the behaviour 100% compatible with previous versions.
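The fallback can be sketched like this; `get_config/1` and
`derive_service_id/1` are placeholders for the plugin's configuration
lookup and the previous ID construction:
```
%% Sketch of the consul_svc_id fallback described above.
service_id() ->
    case get_config(consul_svc_id) of
        undefined -> derive_service_id(get_config(consul_svc));
        SvcId     -> SvcId
    end.
```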
[Why]
A lock is acquired to protect against concurrent cluster joins.
Some backends used to take the lock on the entire list of discovered
nodes, using `global` as the lock implementation. This was a problem
because, as a side effect, all discovered Erlang nodes became connected
to each other. This led to conflicts in the global process name registry
and thus processes were killed randomly.
This was the case with the feature flags controller, for instance.
Nodes run some feature flags operations early in boot, before they are
ready to cluster or run the peer discovery code. But if another node
was executing peer discovery, it could make all nodes connected.
Unrelated feature flags controller instances were thus killed because
another node was running peer discovery.
[How]
Acquiring a lock on the joining and the joined nodes only is enough to
achieve the goal of protecting against concurrent joins. This is
possible because of the new core logic, which ensures the same node is
used as the "seed node", i.e. all nodes will join the same node.
Therefore, the API of `rabbit_peer_discovery_backend:lock/1` is changed
to take a list of nodes (the two nodes mentioned above) instead of one
node (which was the current node, so not that helpful in the first
place).
These backends also used to check if the current node was part of the
discovered nodes, but that's already handled in the generic peer
discovery code.
CAUTION: This brings a breaking change in the peer discovery backend
API. The `Backend:lock/1` callback now takes a list of node names
instead of a single node name. This list will contain the current node
name.
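For reference, the changed callback now has roughly this shape; the
return type shown is a sketch, not necessarily the exact spec:
```
%% Sketch of the updated peer discovery backend callback: it receives the
%% two nodes involved in the join instead of only the current node.
-callback lock(Nodes :: [node()]) ->
    {ok, Data :: term()} | not_supported | {error, Reason :: string()}.
```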
Bazel build files are now maintained primarily with `bazel run
gazelle`. This will analyze and merge changes into the build files as
necessitated by certain code changes (e.g. the introduction of new
modules).
In some cases there are hints to gazelle in the build files, such as
`# gazelle:erlang...` or `# keep` comments. xref checks on plugins that
depend on the CLI are a good example.
- Use the same base .plt everywhere, so there is no need to list
standard apps everywhere
- Fix typespecs: some typos and the use of non-exported types
Also rework Elixir dependency handling, so we no longer rely on Mix to
fetch the rabbitmq_cli deps
Also:
- Specify ra version with a commit rather than a branch
- Fixup compilation options for erlang 23
- Add missing ra reference in MODULE.bazel
- Add missing flag in oci.yaml
- Reduce bazel rbe jobs to try to save memory
- Use bazel built erlang for erlang git master tests
- Use the same cache for all the workflows but windows
- Avoid using `mix local.hex --force` in elixir rules
- Fetching seems blocked in CI, and this should reduce hex api usage in
all builds, which is always nice
- Remove xref and dialyze tags since rules_erlang 3 includes them in
the defaults