rabbitmq-server/deps/rabbitmq_peer_discovery_aws/test/unit_SUITE.erl

%% This Source Code Form is subject to the terms of the Mozilla Public
%% License, v. 2.0. If a copy of the MPL was not distributed with this
%% file, You can obtain one at https://mozilla.org/MPL/2.0/.
%%
%% Copyright (c) 2007-2025 Broadcom. All Rights Reserved. The term “Broadcom” refers to Broadcom Inc. and/or its subsidiaries. All rights reserved.
%%
-module(unit_SUITE).
-compile(export_all).
-include_lib("common_test/include/ct.hrl").
-include_lib("eunit/include/eunit.hrl").
all() ->
[
{group, unit},
{group, lock}
].
groups() ->
[
{unit, [], [
maybe_add_tag_filters,
get_hostname_name_from_reservation_set,
registration_support,
network_interface_sorting,
private_ip_address_sorting
]},
{lock, [], [
lock_single_node,
lock_multiple_nodes,
lock_local_node_not_discovered
]}
].
%%%
%%% Testcases
%%%
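
%% maybe_add_tag_filters/3 turns a map of tags into EC2 DescribeInstances
%% query parameters ("Filter.N.Name" = "tag:Key", "Filter.N.Value.1" = Value).
%% Both the expectation and the result are sorted because map iteration order
%% is not defined.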
maybe_add_tag_filters(_Config) ->
Tags = maps:from_list([{"region", "us-west-2"}, {"service", "rabbitmq"}]),
Expectation = lists:sort(
[{"Filter.2.Name", "tag:service"},
{"Filter.2.Value.1", "rabbitmq"},
{"Filter.1.Name", "tag:region"},
{"Filter.1.Value.1", "us-west-2"}]),
Result = lists:sort(rabbit_peer_discovery_aws:maybe_add_tag_filters(Tags, [], 1)),
?assertEqual(Expectation, Result).
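
%% Hostname resolution is controlled by environment variables: the default is
%% the instance's private DNS name, AWS_USE_PRIVATE_IP switches to the private
%% IP, and AWS_HOSTNAME_PATH selects an arbitrary comma-separated path into the
%% DescribeInstances response. All cases below run against reservation_set/0.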
get_hostname_name_from_reservation_set(_Config) ->
ok = eunit:test({
foreach,
fun on_start/0,
fun on_finish/1,
[{"from private DNS",
fun() ->
Expectation = ["ip-10-0-16-29.eu-west-1.compute.internal",
"ip-10-0-16-31.eu-west-1.compute.internal"],
?assertEqual(Expectation,
rabbit_peer_discovery_aws:get_hostname_name_from_reservation_set(
reservation_set(), []))
end},
{"from arbitrary path",
fun() ->
os:putenv("AWS_HOSTNAME_PATH", "networkInterfaceSet,1,association,publicDnsName"),
Expectation = ["ec2-203-0-113-11.eu-west-1.compute.amazonaws.com",
"ec2-203-0-113-21.eu-west-1.compute.amazonaws.com"],
?assertEqual(Expectation,
rabbit_peer_discovery_aws:get_hostname_name_from_reservation_set(
reservation_set(), []))
end},
{"from private IP",
fun() ->
os:putenv("AWS_USE_PRIVATE_IP", "true"),
Expectation = ["10.0.16.29", "10.0.16.31"],
?assertEqual(Expectation,
rabbit_peer_discovery_aws:get_hostname_name_from_reservation_set(
reservation_set(), []))
end},
{"from private IP DNS in network interface",
fun() ->
os:putenv("AWS_HOSTNAME_PATH", "networkInterfaceSet,2,privateIpAddressesSet,1,privateDnsName"),
Expectation = ["ip-10-0-15-100.eu-west-1.compute.internal",
"ip-10-0-16-31.eu-west-1.compute.internal"],
?assertEqual(Expectation,
rabbit_peer_discovery_aws:get_hostname_name_from_reservation_set(
reservation_set(), []))
end}]
}).
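
%% The AWS backend is discovery-only (peers are listed via the EC2 API), so it
%% reports that it does not support registration.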
registration_support(_Config) ->
?assertEqual(false, rabbit_peer_discovery_aws:supports_registration()).
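
%% EC2 does not guarantee the order of networkInterfaceSet items, so the
%% backend sorts ENIs by their attachment deviceIndex to make indexed
%% AWS_HOSTNAME_PATH lookups deterministic.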
network_interface_sorting(_Config) ->
%% Test ENI sorting by deviceIndex (DescribeInstances only returns attached ENIs)
NetworkInterfaces = [
{"item", [
{"networkInterfaceId", "eni-secondary"},
{"attachment", [{"deviceIndex", "1"}]}
]},
{"item", [
{"networkInterfaceId", "eni-primary"},
{"attachment", [{"deviceIndex", "0"}]}
]},
{"item", [
{"networkInterfaceId", "eni-tertiary"},
{"attachment", [{"deviceIndex", "2"}]}
]}
],
%% Should sort ENIs by deviceIndex
Sorted = rabbit_peer_discovery_aws:sort_ec2_hostname_path_set_members("networkInterfaceSet", NetworkInterfaces),
%% Should have all 3 ENIs
?assertEqual(3, length(Sorted)),
%% Primary ENI (deviceIndex=0) should be first
{"item", FirstENI} = lists:nth(1, Sorted),
?assertEqual("eni-primary", proplists:get_value("networkInterfaceId", FirstENI)),
%% Secondary ENI (deviceIndex=1) should be second
{"item", SecondENI} = lists:nth(2, Sorted),
?assertEqual("eni-secondary", proplists:get_value("networkInterfaceId", SecondENI)),
%% Tertiary ENI (deviceIndex=2) should be third
{"item", ThirdENI} = lists:nth(3, Sorted),
?assertEqual("eni-tertiary", proplists:get_value("networkInterfaceId", ThirdENI)).
private_ip_address_sorting(_Config) ->
%% Test private IP address sorting by primary flag
PrivateIpAddresses = [
{"item", [
{"privateIpAddress", "10.0.14.176"},
{"privateDnsName", "ip-10-0-14-176.us-west-2.compute.internal"},
{"primary", "false"}
]},
{"item", [
{"privateIpAddress", "10.0.12.112"},
{"privateDnsName", "ip-10-0-12-112.us-west-2.compute.internal"},
{"primary", "true"}
]},
{"item", [
{"privateIpAddress", "10.0.15.200"},
{"privateDnsName", "ip-10-0-15-200.us-west-2.compute.internal"},
{"primary", "false"}
]}
],
Sorted = rabbit_peer_discovery_aws:sort_ec2_hostname_path_set_members("privateIpAddressesSet", PrivateIpAddresses),
?assertEqual(3, length(Sorted)),
%% Primary IP (primary=true) should be first
{"item", FirstIP} = lists:nth(1, Sorted),
?assertEqual("10.0.12.112", proplists:get_value("privateIpAddress", FirstIP)),
?assertEqual("true", proplists:get_value("primary", FirstIP)),
%% Non-primary IPs should maintain relative order
{"item", SecondIP} = lists:nth(2, Sorted),
?assertEqual("10.0.14.176", proplists:get_value("privateIpAddress", SecondIP)),
?assertEqual("false", proplists:get_value("primary", SecondIP)),
{"item", ThirdIP} = lists:nth(3, Sorted),
?assertEqual("10.0.15.200", proplists:get_value("privateIpAddress", ThirdIP)),
?assertEqual("false", proplists:get_value("primary", ThirdIP)).
lock_single_node(_Config) ->
LocalNode = node(),
Nodes = [LocalNode],
{ok, {LockId, Nodes}} = rabbit_peer_discovery_aws:lock([LocalNode]),
?assertEqual(ok, rabbit_peer_discovery_aws:unlock({LockId, Nodes})).
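
%% rabbit_nodes:lock_id/1 is mocked so that each acquisition uses a distinct
%% lock requester. While the first lock is held, a competing acquisition gives
%% up after the configured internal_lock_retries; once the first lock is
%% released, the second acquisition succeeds.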
lock_multiple_nodes(_Config) ->
application:set_env(rabbit, cluster_formation, [{internal_lock_retries, 2}]),
LocalNode = node(),
OtherNodeA = a@host,
OtherNodeB = b@host,
meck:expect(rabbit_nodes, lock_id, 1, {rabbit_nodes:cookie_hash(), OtherNodeA}),
{ok, {{LockResourceId, OtherNodeA}, [LocalNode, OtherNodeA]}} = rabbit_peer_discovery_aws:lock([LocalNode, OtherNodeA]),
meck:expect(rabbit_nodes, lock_id, 1, {rabbit_nodes:cookie_hash(), OtherNodeB}),
?assertEqual({error, "Acquiring lock taking too long, bailing out after 2 retries"}, rabbit_peer_discovery_aws:lock([LocalNode, OtherNodeB])),
?assertEqual(ok, rabbit_peer_discovery_aws:unlock({{LockResourceId, OtherNodeA}, [LocalNode, OtherNodeA]})),
?assertEqual({ok, {{LockResourceId, OtherNodeB}, [LocalNode, OtherNodeB]}}, rabbit_peer_discovery_aws:lock([LocalNode, OtherNodeB])),
?assertEqual(ok, rabbit_peer_discovery_aws:unlock({{LockResourceId, OtherNodeB}, [LocalNode, OtherNodeB]})),
meck:unload(rabbit_nodes).
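
%% lock/1 refuses to proceed when the local node is missing from the list of
%% discovered nodes, since a node that cannot discover itself should not try
%% to form or join a cluster.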
lock_local_node_not_discovered(_Config) ->
Expectation = {error, "Local node " ++ atom_to_list(node()) ++ " is not part of discovered nodes [me@host]"},
?assertEqual(Expectation, rabbit_peer_discovery_aws:lock([me@host])).
%%%
%%% Implementation
%%%
on_start() ->
reset().
on_finish(_Config) ->
reset().
reset() ->
application:unset_env(rabbit, cluster_formation),
os:unsetenv("AWS_HOSTNAME_PATH"),
os:unsetenv("AWS_USE_PRIVATE_IP").
reservation_set() ->
[{"item", [{"reservationId","r-006cfdbf8d04c5f01"},
{"ownerId","248536293561"},
{"groupSet",[]},
{"instancesSet",
[{"item",
[{"instanceId","i-0c6d048641f09cad2"},
{"imageId","ami-ef4c7989"},
{"instanceState",
[{"code","16"},{"name","running"}]},
{"privateDnsName",
"ip-10-0-16-29.eu-west-1.compute.internal"},
{"dnsName",[]},
{"instanceType","c4.large"},
{"launchTime","2017-04-07T12:05:10"},
{"subnetId","subnet-61ff660"},
{"vpcId","vpc-4fe1562b"},
{"networkInterfaceSet", [
{"item",
[{"attachment", [{"deviceIndex", "1"}]},
{"association",
[{"publicIp","203.0.113.12"},
{"publicDnsName",
"ec2-203-0-113-12.eu-west-1.compute.amazonaws.com"},
{"ipOwnerId","amazon"}]},
{"privateIpAddressesSet", [
{"item", [
{"privateIpAddress", "10.0.15.101"},
{"privateDnsName", "ip-10-0-15-101.eu-west-1.compute.internal"},
{"primary", "false"}
]},
{"item", [
{"privateIpAddress", "10.0.15.100"},
{"privateDnsName", "ip-10-0-15-100.eu-west-1.compute.internal"},
{"primary", "true"}
]}
]}]},
{"item",
[{"attachment", [{"deviceIndex", "0"}]},
{"association",
[{"publicIp","203.0.113.11"},
{"publicDnsName",
"ec2-203-0-113-11.eu-west-1.compute.amazonaws.com"},
{"ipOwnerId","amazon"}]}]}]},
{"privateIpAddress","10.0.16.29"}]}]}]},
{"item", [{"reservationId","r-006cfdbf8d04c5f01"},
{"ownerId","248536293561"},
{"groupSet",[]},
{"instancesSet",
[{"item",
[{"instanceId","i-1c6d048641f09cad2"},
{"imageId","ami-af4c7989"},
{"instanceState",
[{"code","16"},{"name","running"}]},
{"privateDnsName",
"ip-10-0-16-31.eu-west-1.compute.internal"},
{"dnsName",[]},
{"instanceType","c4.large"},
{"launchTime","2017-04-07T12:05:10"},
{"subnetId","subnet-61ff660"},
{"vpcId","vpc-4fe1562b"},
{"networkInterfaceSet", [
{"item",
[{"attachment", [{"deviceIndex", "0"}]},
{"association",
[{"publicIp","203.0.113.21"},
{"publicDnsName",
"ec2-203-0-113-21.eu-west-1.compute.amazonaws.com"},
{"ipOwnerId","amazon"}]}]},
{"item",
[{"attachment", [{"deviceIndex", "1"}]},
{"association",
[{"publicIp","203.0.113.22"},
{"publicDnsName",
"ec2-203-0-113-22.eu-west-1.compute.amazonaws.com"},
{"ipOwnerId","amazon"}]},
{"privateIpAddressesSet", [
{"item", [
{"privateIpAddress", "10.0.16.31"},
{"privateDnsName", "ip-10-0-16-31.eu-west-1.compute.internal"},
{"primary", "true"}
]}
]}]}]},
{"privateIpAddress","10.0.16.31"}]}]}]}].