Commit Graph

23 Commits

Author SHA1 Message Date
Michael Klishin 968eefa1bb
Bump (c) line year
There are no functional changes to this massive diff.
2025-01-01 17:54:10 -05:00
Michael Klishin 725ddaa43d
AWS peer discovery tests: log at debug level 2024-01-19 12:33:29 -05:00
Michael Klishin 01092ff31f
(c) year bumps 2024-01-01 22:02:20 -05:00
Rin Kuryloski 51b99544c5 Capture container logs in rabbitmq_peer_discovery_aws suite
Using CloudWatch Logs
2023-12-04 12:12:56 +01:00
Michael Klishin 1b642353ca
Update (c) according to [1]
1. https://investors.broadcom.com/news-releases/news-release-details/broadcom-and-vmware-intend-close-transaction-november-22-2023
2023-11-21 23:18:22 -05:00
Michael Klishin ec4f1dba7d
(c) year bump: 2022 => 2023 2023-01-01 23:17:36 -05:00
Michael Klishin 4ced055341
AWS peer discovery integration suite: squash a compiler warning 2022-04-10 09:38:06 +04:00
Philip Kuryloski 23226ad705 Attempt to reduce flakes in aws integration tests 2022-04-01 09:40:14 +02:00
Michael Klishin c38a3d697d
Bump (c) year 2022-03-21 01:21:56 +04:00
Philip Kuryloski a0a5bf3c01 Update peer discovery aws tests for docker image changes 2021-07-28 10:19:11 +02:00
Philip Kuryloski 0593e8307e Increase timeouts in aws peer discovery integration suite 2021-07-20 11:41:40 +02:00
Philip Kuryloski d6399bbb5b
Mixed version testing in bazel (#3200)
Unlike with GNU Make, mixed version testing with bazel uses a package-generic-unix for the secondary umbrella rather than the source. This brings the benefit of being able to run mixed version tests against releases built with older Erlang versions (even though all nodes will run under the single version given to bazel).

This introduces new test labels, adding a `-mixed` suffix for every existing test. They can be skipped if necessary with `--test_tag_filters` (see the GitHub Actions workflow for an example).

As part of the change, it is now possible to run an old release of rabbit with the rabbitmq_run rule, such as:

`bazel run @rabbitmq-server-generic-unix-3.8.17//:rabbitmq-run run-broker`
2021-07-19 14:33:25 +02:00
David Ansari 0876746d5f Remove randomized startup delays
On initial cluster formation, only one node in a multi node cluster
should initialize the Mnesia database schema (i.e. form the cluster).
To ensure that for nodes starting up in parallel,
RabbitMQ peer discovery backends have used
either locks or randomized startup delays.

Locks work great: When a node holds the lock, it either starts a new
blank node (if there is no other node in the cluster), or it joins
an existing node. This makes it impossible to have two nodes forming
the cluster at the same time.
Consul and etcd peer discovery backends use locks. The lock is acquired
in the consul and etcd infrastructure, respectively.

For other peer discovery backends (classic, DNS, AWS), randomized
startup delays were used. They work well enough in most cases.
However, in https://github.com/rabbitmq/cluster-operator/issues/662 we
observed that in 1% - 10% of the cases (the more nodes or the
smaller the randomized startup delay range, the higher the chances), two
nodes decide to form the cluster. That's bad since the nodes will end up
in a single Erlang cluster, but in two RabbitMQ clusters. Even worse, no
obvious alert was triggered and no error message was logged.

To solve this issue, one could increase the randomized startup delay
range from e.g. 0m - 1m to 0m - 3m. However, this makes initial cluster
formation very slow since it will take up to 3 minutes until
every node is ready. In rare cases, we still end up with two nodes
forming the cluster.
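
As a rough illustration of why widening the range only lowers the odds, the toy model below estimates how often two nodes both decide to form the cluster. It is a sketch in Python, not RabbitMQ code: the function `split_rate`, the `window_s` parameter (how long the first node takes to become visible to its peers) and all numbers are assumptions made for the example.

```python
import random


def split_rate(nodes, delay_range_s, window_s, trials=20000, seed=1):
    """Estimate how often two nodes both decide to form the cluster.

    Toy model (an assumption, not measured behaviour): every node sleeps for
    a uniform random delay in [0, delay_range_s]. The earliest node forms the
    cluster but only becomes visible to the others window_s seconds later.
    A split happens when the second-earliest node wakes up inside that window.
    Requires nodes >= 2.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    splits = 0
    for _ in range(trials):
        delays = sorted(rng.uniform(0, delay_range_s) for _ in range(nodes))
        if delays[1] - delays[0] < window_s:
            splits += 1
    return splits / trials


if __name__ == "__main__":
    # Hypothetical settings: 3 or 9 nodes, 1 or 3 minute range, 1 second window.
    for nodes, delay_range_s in [(3, 60), (3, 180), (9, 60)]:
        rate = split_rate(nodes, delay_range_s, window_s=1.0)
        print(f"nodes={nodes} range={delay_range_s}s split rate ~ {rate:.3f}")
```

In this model the estimate falls as `delay_range_s` grows and rises with `nodes`, but the underlying probability stays above zero for any finite range, which is consistent with the observation above.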

Another way to solve the problem is to name a dedicated node to be the
seed node (forming the cluster). This was explored in
https://github.com/rabbitmq/cluster-operator/pull/689 and works well.
Two minor downsides to this approach are: 1. If the seed node never
becomes available, the whole cluster won't be formed (which is okay),
and 2. it doesn't integrate with existing dynamic peer discovery backends
(e.g. K8s, AWS) since nodes are not yet known at deploy time.

In this commit, we take a better approach: We remove randomized startup
delays altogether. We replace them with locks. However, instead of
implementing our own lock in an external system (e.g. in K8s),
we re-use Erlang's locking mechanism global:set_lock/3.

global:set_lock/3 has some convenient properties:
1. It accepts a list of nodes to set the lock on.
2. The nodes in that list connect to each other (i.e. create an Erlang
cluster).
3. The call is synchronous with a timeout (number of retries). It
blocks until the lock becomes available.
4. If a process that holds a lock dies, or the node goes down, the lock
held by the process is deleted.

The list of nodes passed to global:set_lock/3 corresponds to the nodes
the peer discovery backend discovers (lists).
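
The lock-based flow can be sketched as follows. This is a Python analogue, not the Erlang implementation: `threading.Lock` stands in for the lock taken with global:set_lock/3, threads stand in for nodes, and the names `ClusterState`, `start_node` and `boot` are hypothetical.

```python
import threading


class ClusterState:
    """Shared state standing in for the cluster that nodes form or join."""

    def __init__(self):
        self.lock = threading.Lock()  # stand-in for the cluster-wide lock
        self.members = []             # nodes that are part of the cluster
        self.formed_by = []           # nodes that initialised a blank cluster


def start_node(name, state, retries=10, retry_interval_s=0.5):
    """Form or join the cluster while holding the lock; retry a bounded number of times."""
    for _ in range(retries):
        if state.lock.acquire(timeout=retry_interval_s):
            try:
                if not state.members:
                    # No other node yet: this node forms the cluster.
                    state.formed_by.append(name)
                # In both cases the node ends up as a cluster member.
                state.members.append(name)
                return True
            finally:
                state.lock.release()
    return False  # lock never became available within the retry budget


def boot(names):
    """Start all nodes in parallel and return the resulting state."""
    state = ClusterState()
    threads = [threading.Thread(target=start_node, args=(n, state)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


if __name__ == "__main__":
    result = boot(["rabbit@a", "rabbit@b", "rabbit@c"])
    print("formed by:", result.formed_by, "members:", sorted(result.members))
```

Whichever thread wins the lock first forms the cluster and every other thread joins it, so exactly one entry ends up in `formed_by` regardless of scheduling.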

Two special cases worth mentioning:

1. That list can be all desired nodes in the cluster
(e.g. in classic peer discovery where nodes are known at
deploy time) while only a subset of nodes is available.
In that case, global:set_lock/3 still sets the lock without
blocking until all nodes can be connected to. This is good since
nodes might start sequentially (non-parallel).

2. In dynamic peer discovery backends (e.g. K8s, AWS), this
list can be just a subset of desired nodes since nodes might not start up
in parallel. That's also not a problem as long as the following
requirement is met: "The peer discovery backend does not list two disjoint
sets of nodes (on different nodes) at the same time."
For example, in a 2-node cluster, the peer discovery backend must not
list only node 1 on node 1 and only node 2 on node 2.
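
The requirement above can be expressed as a small check. The sketch below is in Python and purely illustrative: `disjoint_views` and the `views` mapping (node name to the list that node's backend returned) are hypothetical names, not part of any peer discovery backend.

```python
from itertools import combinations


def disjoint_views(views):
    """Return the pairs of nodes whose discovered lists share no node at all.

    views maps a node name to the list of nodes its peer discovery backend
    returned at a given moment. An empty result means the requirement holds.
    """
    return [
        (a, b)
        for (a, seen_by_a), (b, seen_by_b) in combinations(sorted(views.items()), 2)
        if not set(seen_by_a) & set(seen_by_b)
    ]


if __name__ == "__main__":
    # Violates the requirement: each node only sees itself.
    print(disjoint_views({"node1": ["node1"], "node2": ["node2"]}))
    # Satisfies it: node 2 started later and also sees node 1.
    print(disjoint_views({"node1": ["node1"], "node2": ["node1", "node2"]}))
```

Overlapping lists are enough here because any shared node gives both lock requests a common place to contend.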

Existing peer discovery backends fulfill that requirement because the
resource the nodes are discovered from is global.
For example, in K8s, once node 1 is part of the Endpoints object, it
will be returned on both node 1 and node 2.
Likewise, in AWS, once node 1 started, the described list of instances
with a specific tag will include node 1 when the AWS peer discovery backend
runs on node 1 or node 2.

Removing randomized startup delays also makes cluster formation
considerably faster (up to 1 minute faster if that was the
upper bound in the range).
2021-06-03 08:01:28 +02:00
Philip Kuryloski 73e3196e1f Use non-conflicting aws resource names in integration suite 2021-05-06 10:08:39 +02:00
Philip Kuryloski 999bed402c Add rabbitmq_peer_discovery_aws to bazel
For mysterious reasons, it turns out that using rabbit_ct_helpers:exec
with the `[binary]` option causes `aws ec2 describe-instances ...` to
fail with "bad argument". Removing the option and using
`list_to_binary/1` on the response before JSON parsing seems to
alleviate the issue.

Also make RABBITMQ_IMAGE non-optional for integration_SUITE
2021-05-05 09:51:00 +02:00
Philip Kuryloski 60ba1fffcd Use Amazon ECS to test rabbitmq_peer_discovery_aws
in the integration_SUITE
2021-05-05 09:48:46 +02:00
Philip Kuryloski 441550c58b Remove the integration_SUITE from peer discovery aws
#3015 will replace it with a working version as the current copy will block release pipelines
2021-05-04 15:52:16 +02:00
Michael Klishin 52479099ec
Bump (c) year 2021-01-22 09:00:14 +03:00
Jean-Sébastien Pédron b7783cad04 integration_SUITE: Bump timetrap to one hour
It takes more than 30 minutes to compile Erlang from source, even with
the larger VM we selected for CI.
2020-07-27 11:19:51 +02:00
Michael Klishin 2c0846e994 Switch to MPL2 2020-07-14 15:32:09 +03:00
Jean-Sébastien Pédron a2b90448be Update copyright (year 2020) 2020-03-10 16:40:23 +01:00
Spring Operator 53b971f487 URL Cleanup
This commit updates URLs to prefer the https protocol. Redirects are not followed to avoid accidentally expanding intentionally shortened URLs (i.e. if using a URL shortener).

# HTTP URLs that Could Not Be Fixed
These URLs were unable to be fixed. Please review them to see if they can be manually resolved.

* http://blog.listincomprehension.com/search/label/procket (200) with 1 occurrences could not be migrated:
   ([https](https://blog.listincomprehension.com/search/label/procket) result ClosedChannelException).
* http://dozzie.jarowit.net/trac/wiki/TOML (200) with 1 occurrences could not be migrated:
   ([https](https://dozzie.jarowit.net/trac/wiki/TOML) result SSLHandshakeException).
* http://dozzie.jarowit.net/trac/wiki/subproc (200) with 1 occurrences could not be migrated:
   ([https](https://dozzie.jarowit.net/trac/wiki/subproc) result SSLHandshakeException).
* http://e2project.org (200) with 1 occurrences could not be migrated:
   ([https](https://e2project.org) result AnnotatedConnectException).
* http://nitrogenproject.com/ (200) with 2 occurrences could not be migrated:
   ([https](https://nitrogenproject.com/) result ConnectTimeoutException).
* http://proper.softlab.ntua.gr (200) with 1 occurrences could not be migrated:
   ([https](https://proper.softlab.ntua.gr) result SSLHandshakeException).
* http://yaws.hyber.org (200) with 1 occurrences could not be migrated:
   ([https](https://yaws.hyber.org) result AnnotatedConnectException).
* http://choven.ca (503) with 1 occurrences could not be migrated:
   ([https](https://choven.ca) result ConnectTimeoutException).

# Fixed URLs

## Fixed But Review Recommended
These URLs were fixed, but the https status was not OK. However, the https status was the same as the http request or http redirected to an https URL, so they were migrated. Your review is recommended.

* http://fixprotocol.org/ (301) with 1 occurrences migrated to:
  https://fixtrading.org ([https](https://fixprotocol.org/) result SSLHandshakeException).
* http://169.254.169.254/latest/meta-data/instance-id (AnnotatedConnectException) with 1 occurrences migrated to:
  https://169.254.169.254/latest/meta-data/instance-id ([https](https://169.254.169.254/latest/meta-data/instance-id) result ConnectTimeoutException).
* http://erldb.org (UnknownHostException) with 1 occurrences migrated to:
  https://erldb.org ([https](https://erldb.org) result UnknownHostException).

## Fixed Success
These URLs were switched to an https URL with a 2xx status. While the status was successful, your review is still recommended.

* http://cloudi.org/ with 27 occurrences migrated to:
  https://cloudi.org/ ([https](https://cloudi.org/) result 200).
* http://erlware.org/ with 1 occurrences migrated to:
  https://erlware.org/ ([https](https://erlware.org/) result 200).
* http://inaka.github.io/cowboy-trails/ with 1 occurrences migrated to:
  https://inaka.github.io/cowboy-trails/ ([https](https://inaka.github.io/cowboy-trails/) result 200).
* http://ninenines.eu with 6 occurrences migrated to:
  https://ninenines.eu ([https](https://ninenines.eu) result 200).
* http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html with 1 occurrences migrated to:
  https://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html ([https](https://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html) result 200).
* http://www.actordb.com/ with 2 occurrences migrated to:
  https://www.actordb.com/ ([https](https://www.actordb.com/) result 200).
* http://www.cs.kent.ac.uk/projects/wrangler/Home.html with 1 occurrences migrated to:
  https://www.cs.kent.ac.uk/projects/wrangler/Home.html ([https](https://www.cs.kent.ac.uk/projects/wrangler/Home.html) result 200).
* http://www.rabbitmq.com/blog/2018/02/12/peer-discovery-subsystem-in-rabbitmq-3-7/ with 1 occurrences migrated to:
  https://www.rabbitmq.com/blog/2018/02/12/peer-discovery-subsystem-in-rabbitmq-3-7/ ([https](https://www.rabbitmq.com/blog/2018/02/12/peer-discovery-subsystem-in-rabbitmq-3-7/) result 200).
* http://www.rabbitmq.com/cluster-formation.html with 1 occurrences migrated to:
  https://www.rabbitmq.com/cluster-formation.html ([https](https://www.rabbitmq.com/cluster-formation.html) result 200).
* http://www.rabbitmq.com/github.html with 1 occurrences migrated to:
  https://www.rabbitmq.com/github.html ([https](https://www.rabbitmq.com/github.html) result 200).
* http://www.rabbitmq.com/plugins.html with 1 occurrences migrated to:
  https://www.rabbitmq.com/plugins.html ([https](https://www.rabbitmq.com/plugins.html) result 200).
* http://www.rebar3.org with 1 occurrences migrated to:
  https://www.rebar3.org ([https](https://www.rebar3.org) result 200).
* http://inaka.github.com/apns4erl with 1 occurrences migrated to:
  https://inaka.github.com/apns4erl ([https](https://inaka.github.com/apns4erl) result 301).
* http://inaka.github.com/edis/ with 1 occurrences migrated to:
  https://inaka.github.com/edis/ ([https](https://inaka.github.com/edis/) result 301).
* http://lasp-lang.org/ with 1 occurrences migrated to:
  https://lasp-lang.org/ ([https](https://lasp-lang.org/) result 301).
* http://saleyn.github.com/erlexec with 1 occurrences migrated to:
  https://saleyn.github.com/erlexec ([https](https://saleyn.github.com/erlexec) result 301).
* http://www.mozilla.org/MPL/ with 5 occurrences migrated to:
  https://www.mozilla.org/MPL/ ([https](https://www.mozilla.org/MPL/) result 301).
* http://zhongwencool.github.io/observer_cli with 1 occurrences migrated to:
  https://zhongwencool.github.io/observer_cli ([https](https://zhongwencool.github.io/observer_cli) result 301).
2019-03-20 03:19:31 -05:00
Jean-Sébastien Pédron 932c36ec2f integration_SUITE: Initial cluster formation testsuite
It creates a cluster of 2 RabbitMQ nodes on AWS EC2 VMs with the AWS
peer discovery plugin configured, and verifies that the cluster is
correctly formed.

The plugin is tested with two configurations:
 * cluster formation based on tags
 * cluster formation based on an autoscaling group

[#153749132]
2018-03-08 17:18:56 +01:00