MINOR: Add serialized vagrant rsync until upstream fixes broken parallelism

See https://github.com/mitchellh/vagrant/issues/7531. The core of the issue is that `vagrant rsync` picks the SSH ControlMaster file it uses to cache SSH connections for rsyncing from a fixed set of only 1000 possible temp file entries, so concurrent runs can end up sharing one. A few notes:

* We can't break the steps down further and still maintain performance, due to various limitations in vagrant/vagrant-aws: rsync is only executed on `vagrant up`/`vagrant reload`/`vagrant rsync`, you can't enable/disable an rsync shared folder for only some of those stages, and provisioning only runs in parallel with vagrant-aws during `vagrant up`.
* We need to isolate each of the serialized rsync calls, because even separate calls could randomly choose the same temporary file. (If we could assume `parallel` was available, we could actually get the parallelism back.)
* If multiple instances run on the same server at the same or nearly the same time, they can conflict, since the same temp file entries are used globally. This means anything running on shared CI servers might end up syncing data between different CI jobs (!!), which could lead to some very strange results, especially if the jobs aren't even of the same type.
* The provisioning error check needs to be removed because it catches rsync errors, and those can still happen in the initial `vagrant up` rsync step, which runs before the `vagrant up` provisioning step. It seems likely this bug was the cause of the missing files anyway, so the check might not be as valuable anymore.
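
For a rough sense of how often the collisions described in these notes could occur, here is a minimal sketch of a birthday-style estimate. It assumes each rsync run picks one of the 1000 entries uniformly at random and independently of the others; that assumption, the function name `collision_probability`, and the default pool size argument are illustrative and not part of this change.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this commit.
# Estimates the chance that at least two of N concurrent rsync runs pick
# the same temp file entry.
# ASSUMPTION: each run chooses uniformly at random from the pool,
# independently of the other runs.
collision_probability() {
    local runs=$1
    local pool=${2:-1000}   # pool size; 1000 per the upstream issue
    awk -v n="$runs" -v m="$pool" 'BEGIN {
        p_unique = 1
        for (i = 0; i < n; i++) {
            # Probability that run i avoids the i entries already taken.
            p_unique *= (i < m) ? (m - i) / m : 0
        }
        printf "%.4f\n", 1 - p_unique
    }'
}

# Example: ten workers brought up in parallel.
collision_probability 10
```

Under that assumption, ten parallel runs collide about 4% of the time, which would be enough to show up as occasional missing files across many CI jobs.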

Author: Ewen Cheslack-Postava <me@ewencp.org>

Reviewers: Ismael Juma <ismael@juma.me.uk>

Closes #3380 from ewencp/deparallelize-rsync
Ewen Cheslack-Postava 2017-06-20 12:21:43 -07:00
parent cb2fdbd6c1
commit ffa8100457
2 changed files with 18 additions and 10 deletions

@@ -57,15 +57,6 @@ if [ -h /opt/kafka-dev ]; then
 fi
 ln -s /vagrant /opt/kafka-dev
-# Verification to catch provisioning errors.
-if [[ ! -x /opt/kafka-dev/bin/kafka-run-class.sh ]]; then
-    echo "ERROR: kafka-run-class.sh not found/executable in /opt/kafka-dev/bin"
-    find /opt/kafka-dev
-    ls -la /opt/kafka-dev/bin/kafka-run-class.sh || true
-    exit 1
-fi
 get_kafka() {
     version=$1

@@ -226,8 +226,25 @@ function bring_up_aws {
     if [[ ! -z "$worker_machines" ]]; then
         echo "Bringing up test worker machines in parallel"
-        vagrant_batch_command "vagrant up $debug --provider=aws" "$worker_machines" "$max_parallel"
+        # Currently it seems that the AWS provider will always run
+        # rsync as part of vagrant up. However,
+        # https://github.com/mitchellh/vagrant/issues/7531 means
+        # it is not safe to do so. Since the bug doesn't seem to
+        # cause any direct errors, just missing data on some
+        # nodes, follow up with serial rsyncing to ensure we're in
+        # a clean state. Use custom TMPDIR values to ensure we're
+        # isolated from any other instances of this script that
+        # are running/ran recently and may cause different
+        # instances to sync to the wrong nodes
+        local vagrant_rsync_temp_dir=$(mktemp -d);
+        TMPDIR=$vagrant_rsync_temp_dir vagrant_batch_command "vagrant up $debug --provider=aws" "$worker_machines" "$max_parallel"
+        rm -rf $vagrant_rsync_temp_dir
         vagrant hostmanager
+        for worker in $worker_machines; do
+            local vagrant_rsync_temp_dir=$(mktemp -d);
+            TMPDIR=$vagrant_rsync_temp_dir vagrant rsync $worker;
+            rm -rf $vagrant_rsync_temp_dir
+        done
     fi
 else
     vagrant up --provider=aws --no-parallel --no-provision $debug
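
The mktemp/TMPDIR/rm pattern that this change repeats around each vagrant call could be factored into a small wrapper. The following is only a sketch of that idea: the name `run_isolated` is hypothetical and does not exist in the script, and the quoting and exit-status handling are additions beyond what the commit does.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this commit.
# Runs the given command with a private TMPDIR, removes the directory
# afterwards, and returns the command's own exit status.
run_isolated() {
    local tmp_dir status
    tmp_dir=$(mktemp -d) || return 1
    # The assignment prefix applies only to this one command, so the
    # caller's TMPDIR is left untouched.
    TMPDIR="$tmp_dir" "$@"
    status=$?
    rm -rf "$tmp_dir"
    return "$status"
}

# Intended use, mirroring the serialized loop in the diff:
#   for worker in $worker_machines; do
#       run_isolated vagrant rsync "$worker"
#   done
```

Because the wrapper propagates the exit status, a failed rsync would still be visible to the caller, while the temporary directory is cleaned up either way.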