The m4.large could build Erlang and the testsuite could run in 28
minutes. That's an improvement, but we are still close to the limit.
Rather than bump the limit, try with an m5.large. It's also a bit
cheaper to my surprise.
The previous default of t2.micro was insufficient to compile Erlang from
sources in under 30 minutes. This caused the integration testsuite to
time out.
Hopefully an m4.large instance type will be enough.
and add a VMware copyright notice.
We did not mean to make this code Incompatible with Secondary Licenses
as defined in [1].
1. https://www.mozilla.org/en-US/MPL/2.0/FAQ/
This is sometimes failing in GitHub Actions and we don't know why:
https://github.com/rabbitmq/rabbitmq-server/runs/877118328?check_suite_focus=true#step:6:6099
We confirmed in the logs that only 1 out of 3 nodes gets unblocked. The
CT suite hits the 15 minute time trap and fails. We don't know whether
this rpc call doesn't make it through to the first or second node, or if
it does and the rpc call simply doesn't return within the time window.
We can't address this if we don't know where the problem lies, so this
will give us more insight when it fails again.
Signed-off-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
awaitMatch(Guard, Expr, Timeout) reevaluates Expr until it matches
Guard up to Timeout milliseconds from now. It returns the matched
value, in the event said value is useful later in a test.
Additionally simplify an instance of ?assertEqual(true, ... to ?assert(
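
The polling behaviour described above can be sketched as follows. This
is an illustration in Python, not the Erlang macro itself: the name
`await_match`, the `interval_ms` parameter and the use of
`AssertionError` on timeout are assumptions made for the example.

```python
import time


def await_match(guard, expr, timeout_ms, interval_ms=50):
    """Re-evaluate expr() until guard(value) holds or timeout_ms elapses.

    Returns the matching value so the caller can reuse it later in a test.
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        value = expr()
        if guard(value):
            return value
        if time.monotonic() >= deadline:
            # Hypothetical failure mode: report the last value seen.
            raise AssertionError(
                f"no match within {timeout_ms} ms, last value: {value!r}"
            )
        time.sleep(interval_ms / 1000.0)
```

The expression is passed as a callable so that it is evaluated again on
every iteration rather than once at the call site.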
`make test-dist` was already executed for project being tested,
therefore we can skip the build to save time when a RabbitMQ node is
started from there.
However, if the node is to be started from another place (i.e. `rabbit`
when plugins are disabled), we must not skip the build because the
project might have no .ez files created at this point.
This reverts part of
dc5a04a503
because tests started failing in GitHub Actions with:
2020-04-30 16:38:32.238 [error] <0.228.0> Supervisor inet_tcp_proxy_dist_conn_sup had child
{undefined,false,#Ref<0.301488671.524812289.157484>}
started with
{inet_tcp_proxy_dist,dist_proc_start_link,undefined} at <0.776.0>
exit with reason net_tick_timeout in context child_terminated
We suspect that this is due to CPU contention on GitHub Actions shared runners.
When 5 Erlang VM nodes with 2 schedulers each start at the same time on a
host with 2 CPUs and then try to cluster via rabbitmqctl (which starts
5 more Erlang VMs), the 5 second net_tick_time is not long enough.
Rather than increasing the net_tick_time, we are choosing to put less
pressure on the host by clustering nodes one-by-one rather than all at
once.
Pair @dumbbell
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
I don't remember why I used `stop-rabbit-on-node` then `stop-node`. But
this was two CLI runs, which cost a lot of time.
Now, `stop-node` does not use the CLI anymore and getting rid of
`stop-rabbit-on-node` reduces the number of CLI runs to 0, improving the
time it takes to stop RabbitMQ significantly: it shaved about 1 second,
giving a stop time of about 3 seconds now on my laptop.
For cases where a condition can materialize eventually but
we do not know when exactly since we have to observe it from
the outside.
E.g. a cluster of nodes can be formed in a second or two; there's
a randomized delay on startup involved by design.
If the `start-background-broker` recipe fails, it is possible that the
node was started but the follow-up wait/status failed.
To avoid leaving an unused node around, we try to stop it in case of a
failure and ignore the result of that stop recipe.
This situation happened in CI where Elixir seems to crash (one of the
two CLI commands we run after starting the node):
Logger - error: {removed_failing_handler,'Elixir.Logger'}
Logger - error: {removed_failing_handler,'Elixir.Logger'}
Logger - error: {removed_failing_handler,'Elixir.Logger'}
escript: exception error: undefined function 'Elixir.Exception':blame/3
in function 'Elixir.Kernel.CLI':format_error/3 (lib/kernel/cli.ex, line 82)
in call from 'Elixir.Kernel.CLI':print_error/3 (lib/kernel/cli.ex, line 173)
in call from 'Elixir.Kernel.CLI':exec_fun/2 (lib/kernel/cli.ex, line 150)
in call from 'Elixir.Kernel.CLI':run/1 (lib/kernel/cli.ex, line 47)
in call from escript:run/2 (escript.erl, line 758)
in call from escript:start/1 (escript.erl, line 277)
in call from init:start_em/1
.../rabbit_common/mk/rabbitmq-run.mk:323: recipe for target 'start-background-broker' failed
In this case, rabbit_ct_broker_helpers tried again to start the node and
it worked. But it affected an unrelated testcase later because it tried
to use a TCP port already used by that left-over node.
rabbitmq_ct_helpers ensures everything is built earlier, so no need to
try again. This saves a bit of time and hopefully fixes a few
situations where RabbitMQ is recompiled without test code.
This helps in situations where rabbitmq-env guesses it wrong. I saw this
situation in the `feature_flags_SUITE` testsuite of rabbitmq-server, but
couldn't really explain why the guess was wrong.
At least on the Windows Server 2019 AWS EC2 image, the `taskkill`
command is unavailable.
If that's the case, we fall back to using a PowerShell one-liner. It's
not the default, just in case PowerShell is unavailable.
`rmp_plugins_dir` is set by testsuites to indicate an extra plugins
directory now (instead of the full value of `$RABBITMQ_PLUGINS_DIR`).
This helps when we are using a secondary Umbrella: the testsuite does
not have to mess with the computation of the regular plugins directory.
It was restored in rabbitmq-server's master branch to allow backward
compatibility. Therefore now, the same call works with the master branch
and supported release branches (v3.7.x, v3.8.x).
The definition of the `security_groups` variable in the `direct-vms`
module was incorrect. This may explain the error seen in CI. No idea why
the same error didn't appear locally though.
The variable is declared as read-only, but we overwrite it erroneously
in `kiex_install_elixir`. In this function, we didn't intend to use the
global variable, so let's rename the local variable to
`$latest_elixir_version`.
The idea is to reduce the load when testing RabbitMQ. It is especially
useful in CI where we might run multiple testsuites in parallel in
different containers.
The value of "2" is currently hard-coded.
At the same time, we change the "scheduler busy wait time" parameter to
"very_short" so that unused schedulers are put to sleep quickly.
Discussed with: @michaelklishin @gerhard
* `rabbit_ct_vm_helpers.erl`: Switching to HTTPS broke the test
framework. We don't set up the remote HTTP server with certificates and
we do not care about making this connection secure.
* `erlang.mk` is a generated file: there is no point in changing URLs
because that change will be lost with the next update of the file.
Furthermore, it makes it more difficult to track changes compared to
upstream.
This reverts part of commit 70428fb0d0.
Before this patch, we were always setting the secondary
Umbrella-specific environment variables (like `DEPS_DIR` and paths to
`rabbitmq*` scripts) based on the number of the node, regardless of the
fact that a secondary Umbrella was configured or not.
This was fine when used inside an Umbrella because the computed paths
were working by chance. However, when using a standalone clone of
rabbitmq-server, the computed paths to `rabbitmq*` scripts were
incorrect, causing the odd-numbered nodes to fail to start.
The fix is easy: do not set the secondary Umbrella-specific environment
variables when there is none configured.
... from a RabbitMQ 3.7.x node.
So if we get an `undef` exception from the queried node, we assume there
is no feature flag file and we don't set it in `rabbitmqctl` commands
targeting that node.
... when we want to execute code on the RabbitMQ node (as opposed to the
Common Test node).
The accepted paths are the ones in `rabbitmq_ct_helpers` and
`rabbitmq_ct_client_helpers`, as well as the `test` directory (if the
module is inside a testsuite).
This should fix the situation where we push the path to
`deps/rabbit/ebin` from the main Umbrella to a node running from the
secondary Umbrella.
... for the secondary Umbrella.
I couldn't find a way with GNU Make to undefine variables on the
command line, so that the internal default values are used. Therefore,
we need to explicitly set their value so that the one from the parent
make(1) instance is not passed to the child instance.
... not environment variables.
The latter does not work if the parent make(1) instance already has
those variables (e.g. `$DEPS_DIR`) as make variables on the command
line, because they are passed to children as make variables which have
precedence over environment variables.
This is the only place this function is used and this removes a
dependency cycle: rabbit_ct_broker_helpers can't depend on the broker:
the broker already depends on it as a test dependency.
[#159298729]
Before, we were relying on the `secondary_erlang_mk_depsdir` config key
value to compute the secondary path of an application. When there is no
secondary Umbrella configured, the computed value was based on the
primary `deps` dir.
However, when the caller asks for the currently tested app, we might not
be in an Umbrella at all: it's perfectly fine to e.g. clone the
application alone and test it. In this case, the computed value would
point to a non-existing directory (e.g. rabbitmq-server/deps/rabbit).
So instead of relying on the default secondary Umbrella when there is
none configured, we directly check if there is one configured. If there
is none, the secondary path for the application is set to the primary
path. This way, we are sure the directory exists and is correct, no
matter how the project was cloned.
Fixes #24.
... to start two different versions of RabbitMQ.
The codebase where the testsuite is executed is used to start even
RabbitMQ nodes (counting from 0) and the secondary Umbrella is used to
start odd RabbitMQ nodes.
This should help us test mixed-version clusters.
[#160169569]
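
The even/odd selection rule can be sketched as follows. This is a
Python illustration only: `umbrella_for_node` and its string return
values are hypothetical names, and the fallback to the primary codebase
when no secondary Umbrella is configured is an assumption about how the
rule is applied.

```python
def umbrella_for_node(node_index, secondary_configured):
    """Pick the codebase used to start the node with the given index.

    Nodes are counted from 0: even nodes run from the codebase under
    test, odd nodes run from the secondary Umbrella when one exists.
    """
    if secondary_configured and node_index % 2 == 1:
        return "secondary"
    return "primary"


# A four-node mixed-version cluster alternates between the two codebases.
layout = [umbrella_for_node(i, True) for i in range(4)]
```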
This helps for Java client hostname verification tests on CI: the CI
containers resolve the hostname to an external IP address and the broker
doesn't accept the connection for guest because it's not from localhost.
By using localhost in the server certificate SAN, hostname verification
is enforced and the connection is from localhost.
We were still building Erlang 21.0 from sources. Elixir was also
compiled from sources, but 1.7.0-rc.1 was automatically selected and the
build fails.
Instead of trying to fix this issue with Elixir, let's just install
Erlang 21.0 and Elixir from Debian packages and be done with it.
This could happen if the setup run steps failed before we start the
testsuite monitor. In this case, `?config()` was returning `undefined`
and we tried to use it as a PID.
... instead of installing Debian packages. To use this feature, the
caller has to specify an Erlang Git reference (a branch, a tag or a
plain commit hash).
This is useful now to test with Erlang 21.0 which is not released yet at
the time of this commit. In the case of Erlang 21.0, if no Git reference
is specified, it defaults to `OTP-21.0-rc2`.
The time we wait for VMs to be ready is bumped from 7 to 20 minutes
because compiling Erlang and Elixir takes time.
rabbit_ct_vm_helpers takes a look at the return value of
`erlang:system_info(system_version)` because it may contain the commit
hash of the running Erlang system (if it was built from sources). It
allows us to use the same commit on the remote VMs.
[#157964874]
OTP 21 deprecated erlang:get_stacktrace/0 in favor of a new
try/catch syntax. Unfortunately that's not realistic for projects
that support multiple Erlang versions (like us) until OTP 21 can be
the minimum version requirement. In order to compile we have to ignore
the warning. The broad compiler option seems to be the most common
way to support compilation on multiple OTP versions with warnings_as_errors.
[#157964874]
generate_config() was logging a few variables which are already
available in the config logged earlier.
Furthermore, it used ct:pal() from the RabbitMQ node, not from the
common_test controller node. Thus it didn't work as expected and another
warning was logged in addition to the debug message.
... when we don't know if the parameter is set.
?config() logs a warning when the parameter is undefined. So even though
it will return `undefined` when this is the case, it's better to use it
when we expect the parameter to be set.
When we don't know, it's best to use rabbit_ct_helpers:get_config(). In
particular, it can handle a default value.
The code we replace in this function was doing exactly that: use a
default value if the parameter is undefined.
Per discussion with @dumbbell.
This helped us find out the root cause of test suite failures on OTP 21.
References rabbitmq/rabbitmq-server#1616.
[#157964874]
... when we want to update the code path of a remote Erlang node. I.e.
if we fail to contact it, we must not try to use the return value of
`rpc:call()` in the code following.
If we pass directories to `erl_tar:create()`, it will recurse and
include all children of those directories. It causes the archive to be
included in itself, possibly causing an infinite archiving loop. That
loop is stopped when the disc is full.
Now, all directories are excluded from the list passed to
`erl_tar:create()`, effectively disabling recursion. This fixes the
ENOSPC errors we see sometimes.
[#153749132]
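
The filtering step can be sketched as follows, using Python's `tarfile`
module as a stand-in for `erl_tar`. The function name `create_archive`
and the gzip compression are assumptions made for the example; only the
idea of dropping directories from the input list comes from the change
described above.

```python
import os
import tarfile


def create_archive(archive_path, paths):
    """Archive only the regular entries in paths, never directories.

    Dropping directories from the list means nothing is walked
    recursively, so the archive cannot end up containing itself.
    """
    files = [p for p in paths if not os.path.isdir(p)]
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in files:
            # recursive=False is redundant once directories are
            # filtered out, but it makes the intent explicit.
            tar.add(path, recursive=False)
    return files
```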
This greatly reduces code duplication. It also allows us to add Erlang
20.2 support easily, because it shares everything with Erlang 20.1
setup, except the package revision obviously.
The single setup template knows the relationship between a version of
Erlang and the corresponding revision of a Debian package.
The relationship between a version of Erlang and the system to use
(Debian Wheezy, Jessie or Stretch) is already recorded in a Terraform
variable. Thus now, we pass it to the template as a variable. The script
is responsible for installing backports repositories and extra packages
as needed.
[#153749132]
It allows one to run a common_test testsuite on Erlang nodes running on
remote Amazon EC2 VMs. It configures Erlang distribution so that remote
nodes can communicate with each other and also communicate with the
common_test master node.
rabbit_ct_broker_helpers also offers new setup and teardown steps to
work with VMs: it allows one to start RabbitMQ nodes on those VMs and
possibly cluster them. The configuration is unchanged compared to local
nodes. The number of RabbitMQ nodes doesn't have to match the number of
VMs: they are spread using round-robin on the available VMs.
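
The round-robin placement can be sketched as follows. This is a Python
illustration, not the helper's Erlang code: `spread_nodes` and the VM
names are hypothetical.

```python
def spread_nodes(node_count, vms):
    """Assign node_count RabbitMQ nodes to the given VMs in round-robin order.

    Returns a list where entry i is the VM hosting node i.
    """
    if not vms:
        raise ValueError("at least one VM is required")
    return [vms[i % len(vms)] for i in range(node_count)]


# Five nodes on two VMs: the first VM hosts nodes 0, 2 and 4.
placement = spread_nodes(5, ["vm-a", "vm-b"])
```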
v2: Add support to start RabbitMQ nodes spread on remote VMs.
v3: Various improvements to allow parallel executions of testcases. I.e.
several sets of VMs can be spawned in parallel without interference.
v4: Support user-specified VM names. If the name is missing, use the
unique ID generated for per-VM-set resources.
v5: Use a unique local node name when trying to ping the remote ct-peer.
While here use `rabbit_misc:random()` to create Terraform unique ID.
The previous base64-encoded string didn't make a valid node name.
Accept `$ERLANG_VERSION` environment/make variable to force the
Erlang version to use on VMs.
Add setup scripts for Erlang 19.3 and 20.1.
v6: Use Amazon S3 to upload the directories archive. Configure a VPC to
access it from the VMs.
Use `user_data` to provide the setup script. The setup script itself
is now a template.
Those changes allow us to get rid of all `exec` or `file` provisioners
in the `aws_instance`. This means it can now be created using a
launch configuration which is the way to create instances via an
autoscaling group.
v7: Export hostnames, nodenames and IP addresses from Terraform state,
and generate `inetrc` in Erlang. This makes it possible to work with
a "two-step Terraform manifest". For instance, with an autoscaling
group, Terraform doesn't start instances. However, we can use a
second manifest to query the created instances.
Export `$HOME` in setup scripts. This fixes the use of `~/...` paths
and the start of the remote Erlang node.
v8: Add support to query Amazon EC2 VMs, based on tags, instead of
relying on the outputs of the manifest. This will allow us to query
VMs created with an autoscaling group for instance. This change is
based on a new query-only manifest called `vms-query`.
This new query-only manifest is used in a loop until we have enough
VMs (compared to the requested numbers) or we reach a timeout of 5
minutes.
v9: Add an autoscaling-group-based module to deploy VMs. The testsuite
is extended to use it, in addition to the `direct-vms` module.
Fix the setup scripts to handle the case where there are no
directories to upload (i.e. the archive is an empty file).
v10: Download log files from remote VMs before destroying them. This
allows further debugging if something fails.
v11: Use a per-VPC CIDR block. This resolves a possible conflict when
VMs in different VPCs get the same private IP address: this breaks
name resolution on the local common_test node.
Use inet_db:add_host() to reconfigure name resolution, instead of
calling inet_config:init(). We still write the `inetrc` files: they
are used by sub-processes such as rabbitmqctl(8) and
rabbitmq-plugins(8).
Download each VM's common_test priv_dir before destroying the VMs.
They are useful because they contain the RabbitMQ nodes logs for
instance.
Fix several concurrency bugs around global resources accessed or
shared by several setups of rabbit_vm_helpers, in case of parallel
testing.
The upload dirs archive is now created by Erlang, not Terraform. It
allows us to create a single archive per directories set, which
saves time, I/O and CPU (for compression).
Configure an EBS root block device for each VM because the default
internal storage of a `t2.micro` instance type is too small.
v12: Verify that terraform(1) is available and working before doing
anything else.
v13: Guess the Erlang application name being tested (using the value of
the `$DIALYZER_PLT` environment variable, lacking a better way). We
use it now as the instance name prefix.
Allow the caller to set the AWS EC2 region.
Allow the caller to set the files suffix. Also, we record it in the
instance and launch configuration tags. This allows the caller to
do things based on a known instance tag.
Install rsync, zip and vim-nox on VMs. They are useful when one
needs to connect to the VMs and try things.
Install Elixir on 19.3+ VMs. It's not used, but it silences a
warning from `rabbitmq-build.mk` which calls it to initialize
`$ELIXIR_LIB_DIR`.
[#153749132]
get_message_store_pid() might be called while the vhost is shutting
down. If this happens, we loop again to double-check it's actually gone.
This should fix races in the vhost_SUITE testsuite in rabbitmq-server
seen on Travis CI.
These modules use export_all which halts the build due to `-Werror`. It
is useful to be able to build this project to add it to the code path
for debugging.
... instead of calling rabbit_plugins_main module directly. This way we
are sure to test the same command as a regular user. Moreover, this also
works with the new CLI where rabbit_plugins_main is gone.