* AWS peer discovery: ensure consistent hostname path ordering
AWS EC2 API returns networkInterfaceSet and privateIpAddressesSet in
arbitrary order, causing non-deterministic hostname resolution during
peer discovery. This leads to inconsistent cluster formation.
Changes:
- Sort network interfaces by deviceIndex (0 first for primary ENI)
- Sort private IP addresses by primary flag (primary=true first)
- Add debug logging to show hostname path selection and sorting results
- Add comprehensive unit tests for sorting behavior
The sorting ensures deviceIndex=0 and primary=true IPs are consistently
selected first, making peer discovery deterministic across deployments.
* AWS peer discovery: ensure consistent hostname path ordering (address feedback on debug logs and sorting helper functions)
We don't need to duplicate so many patterns in so many
files since we have a monorepo (and want to keep it).
If I managed to miss something or remove something that
should stay, please put it back. Note that monorepo-wide
patterns should go in the top-level .gitignore file.
Other .gitignore files are for application or folder-
specific patterns.
[Why]
A lock is acquired to protect against concurrent cluster joins.
Some backends used to use the entire list of discovered nodes and used
`global` as the lock implementation. This was a problem because a side
effect was that all discovered Erlang nodes were connected to each
other. This led to conflicts in the global process name registry and
thus processes were killed randomly.
This was the case with the feature flags controller for instance. Nodes
are running some feature flags operation early in boot before they are
ready to cluster or run the peer discovery code. But if another node was
executing peer discovery, it could make all nodes connected. Feature
flags controller unrelated instances were thus killed because of another
node running peer discovery.
[How]
Acquiring a lock on the joining and the joined nodes only is enough to
achieve the goal of protecting against concurrent joins. This is
possible because of the new core logic which ensures the same node is
used as the "seed node". I.e. all nodes will join the same node.
Therefore the API of `rabbit_peer_discovery_backend:lock/1` is changed
to take a list of nodes (the two nodes mentionned above) instead of one
node (which was the current node, so not that helpful in the first
place).
These backends also used to check if the current node was part of the
discovered nodes. But that's already handled in the generic peer
discovery code already.
CAUTION: This brings a breaking change in the peer discovery backend
API. The `Backend:lock/1` callback now takes a list of node names
instead of a single node name. This list will contain the current node
name.
Bazel build files are now maintained primarily with `bazel run
gazelle`. This will analyze and merge changes into the build files as
necessitated by certain code changes (e.g. the introduction of new
modules).
In some cases there hints to gazelle in the build files, such as `#
gazelle:erlang...` or `# keep` comments. xref checks on plugins that
depend on the cli are a good example.
Also rework elixir dependency handling, so we no longer rely on mix to
fetch the rabbitmq_cli deps
Also:
- Specify ra version with a commit rather than a branch
- Fixup compilation options for erlang 23
- Add missing ra reference in MODULE.bazel
- Add missing flag in oci.yaml
- Reduce bazel rbe jobs to try to save memory
- Use bazel built erlang for erlang git master tests
- Use the same cache for all the workflows but windows
- Avoid using `mix local.hex --force` in elixir rules
- Fetching seems blocked in CI, and this should reduce hex api usage in
all builds, which is always nice
- Remove xref and dialyze tags since rules_erlang 3 includes them in
the defaults
bazel-erlang has been renamed rules_erlang. v2 is a substantial
refactor that brings Windows support. While this alone isn't enough to
run all rabbitmq-server suites on windows, one can at least now start
the broker (bazel run broker) and run the tests that do not start a
background broker process
AWS, Kubernetes and Classic peer discovery plugins use list_nodes and
Erlang global:set_lock to create a mutex lock. To unlock, these plugins
get the latest list with list_nodes and call global:del_lock.
However, if list_nodes within unlock fails, RabbitMQ will throw an
uncaught exception and the lock will not be released until the node
holding the lock is restarted. This prevents new nodes from joining the
cluster.
This failure can be avoided by passing the list of nodes from lock to
unlock. If a node goes away (and comes back) between the lock and unlock
calls, del_lock could still successfully remove the lock. Similarly, if
a new node starts up between the lock and unlock calls, del_lock
wouldn't need to inform the new node.
Unlike with gnu make, mixed version testing with bazel uses a package-generic-unix for the secondary umbrella rather than the source. This brings the benefit of being able to mixed version test releases built with older erlang versions (even though all nodes will run under the single version given to bazel)
This introduces new test labels, adding a `-mixed` suffix for every existing test. They can be skipped if necessary with `--test_tag_filters` (see the github actions workflow for an example)
As part of the change, it is now possible to run an old release of rabbit with rabbitmq_run rule, such as:
`bazel run @rabbitmq-server-generic-unix-3.8.17//:rabbitmq-run run-broker`
Most tests that can start rabbitmq nodes have some chance of
flaking. Rather than chase individual flakes for now, this commit
changes the default (though it can still be overriden, as is the case
for config_scheme_SUITE in many places, since I have yet to see that
particular suite flake).
On initial cluster formation, only one node in a multi node cluster
should initialize the Mnesia database schema (i.e. form the cluster).
To ensure that for nodes starting up in parallel,
RabbitMQ peer discovery backends have used
either locks or randomized startup delays.
Locks work great: When a node holds the lock, it either starts a new
blank node (if there is no other node in the cluster), or it joins
an existing node. This makes it impossible to have two nodes forming
the cluster at the same time.
Consul and etcd peer discovery backends use locks. The lock is acquired
in the consul and etcd infrastructure, respectively.
For other peer discovery backends (classic, DNS, AWS), randomized
startup delays were used. They work good enough in most cases.
However, in https://github.com/rabbitmq/cluster-operator/issues/662 we
observed that in 1% - 10% of the cases (the more nodes or the
smaller the randomized startup delay range, the higher the chances), two
nodes decide to form the cluster. That's bad since it will end up in a
single Erlang cluster, but in two RabbitMQ clusters. Even worse, no
obvious alert got triggered or error message logged.
To solve this issue, one could increase the randomized startup delay
range from e.g. 0m - 1m to 0m - 3m. However, this makes initial cluster
formation very slow since it will take up to 3 minutes until
every node is ready. In rare cases, we still end up with two nodes
forming the cluster.
Another way to solve the problem is to name a dedicated node to be the
seed node (forming the cluster). This was explored in
https://github.com/rabbitmq/cluster-operator/pull/689 and works well.
Two minor downsides to this approach are: 1. If the seed node never
becomes available, the whole cluster won't be formed (which is okay),
and 2. it doesn't integrate with existing dynamic peer discovery backends
(e.g. K8s, AWS) since nodes are not yet known at deploy time.
In this commit, we take a better approach: We remove randomized startup
delays altogether. We replace them with locks. However, instead of
implementing our own lock implementation in an external system (e.g. in K8s),
we re-use Erlang's locking mechanism global:set_lock/3.
global:set_lock/3 has some convenient properties:
1. It accepts a list of nodes to set the lock on.
2. The nodes in that list connect to each other (i.e. create an Erlang
cluster).
3. The method is synchronous with a timeout (number of retries). It
blocks until the lock becomes available.
4. If a process that holds a lock dies, or the node goes down, the lock
held by the process is deleted.
The list of nodes passed to global:set_lock/3 corresponds to the nodes
the peer discovery backend discovers (lists).
Two special cases worth mentioning:
1. That list can be all desired nodes in the cluster
(e.g. in classic peer discovery where nodes are known at
deploy time) while only a subset of nodes is available.
In that case, global:set_lock/3 still sets the lock not
blocking until all nodes can be connected to. This is good since
nodes might start sequentially (non-parallel).
2. In dynamic peer discovery backends (e.g. K8s, AWS), this
list can be just a subset of desired nodes since nodes might not startup
in parallel. That's also not a problem as long as the following
requirement is met: "The peer disovery backend does not list two disjoint
sets of nodes (on different nodes) at the same time."
For example, in a 2-node cluster, the peer discovery backend must not
list only node 1 on node 1 and only node 2 on node 2.
Existing peer discovery backends fullfil that requirement because the
resource the nodes are discovered from is global.
For example, in K8s, once node 1 is part of the Endpoints object, it
will be returned on both node 1 and node 2.
Likewise, in AWS, once node 1 started, the described list of instances
with a specific tag will include node 1 when the AWS peer discovery backend
runs on node 1 or node 2.
Removing randomized startup delays also makes cluster formation
considerably faster (up to 1 minute faster if that was the
upper bound in the range).