This is only a part of what #541 is supposed to cover, but it already
helped with a particular node shutdown lockup we observed => worth
merging earlier.
Per discussion with @dcorbacho.
If the directory or file does not exist before RabbitMQ starts, we can't let
RabbitMQ create it; otherwise it is created under its short filename, not its
long one.
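As a hedged illustration (the variable name is only an example, not the
actual script code), the idea is roughly:

    rem Create the directory with its long name before RabbitMQ starts;
    rem if RabbitMQ created it later, the on-disk name would be the short
    rem form that the scripts pass to it.
    if not exist "%RABBITMQ_BASE%" md "%RABBITMQ_BASE%"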
With this new correction, we can "escape" all variables instead of only
RABBITMQ_BASE.
Fixes #493.
Note that at the time of this commit, Lager does not support logging
to stdout on Windows. This commit still improves consistency between
Unix and Windows.
References #493.
On Windows, cmd.exe and batch scripts apparently do not support Unicode.
However, Windows uses UTF-16 to encode filenames on disk. In batch
scripts, filenames are converted to some one-byte-wide charset. Once
passed to Erlang and RabbitMQ, those filenames are incorrect. In
particular, the management UI is unhappy because the filenames obviously
contain invalid UTF-8 characters.
Using short filenames makes sure filenames only contain US-ASCII
characters.
To convert them, we use "for" expansion. At the same time, filenames are
made absolute. This works even better than realpath.exe, because the
latter converts filenames to yet another charset.
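For illustration, this is roughly the kind of "for" expansion meant here
(the variable name is just an example):

    rem Convert to an absolute path made of short (8.3) names only, so it
    rem contains nothing but US-ASCII (f = full path, s = short names).
    for %%F in ("%RABBITMQ_BASE%") do set RABBITMQ_BASE=%%~fsF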
Fixes #493.
Currently the external Erlang library `sd_notify` is used to make a systemd
unit with `Type=notify` work correctly. This library contains some C code
and thus cannot be built into an architecture-independent package.
But it is not actually needed, as systemd provides the systemd-notify(1)
helper for shell scripts, which serves exactly the same purpose.
The only thing you need is to add `NotifyAccess=all` to your unit file to
make everything work well.
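A minimal sketch of the idea (where exactly the notify call ends up in the
startup scripts is an assumption here):

    # Unit file additions:
    # [Service]
    # Type=notify
    # NotifyAccess=all

    # Startup wrapper: tell systemd we are ready once the node is really up.
    rabbitmqctl wait "$RABBITMQ_PID_FILE" && systemd-notify --ready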
Currently a time-out when running 'rabbitmqctl list_channels' is treated
as a sign that the current node is unhealthy. But that may not be the
case, as the hanging channel could actually be on some other node. Given
that we currently have more than one bug related to 'list_channels', it
makes sense to improve diagnostics here.
This patch doesn't change any behaviour; it only improves logging after
a time-out happens. If time-outs continue to occur (even with the latest
rabbitmq versions or with backported fixes), we could switch to this
improved list_channels and kill rabbitmq only if stuck channels are
located on the current node. But I hope that all related rabbitmq bugs
have already been closed.
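Purely for illustration (the helper and variable names below are
assumptions, not the agent's actual code), the kind of diagnostic this is
about looks like:

    # Run the check with a timeout and log a hint instead of assuming the
    # local node is at fault.
    if ! timeout "${LIST_CHANNELS_TIMEOUT:-30}" rabbitmqctl -q list_channels >/dev/null 2>&1; then
        ocf_log warn "list_channels timed out; the stuck channel may live on another node"
    fi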
When the monitor detects the node as OCF_RUNNING_MASTER, this status may
be lost while the monitor checks are in progress.
* Rework the prev_rc handling with rc_check to fix this.
* Also add an info log when the node is detected as a running master.
* Break the monitor check loop early if it shall be exiting to be
restarted by pacemaker.
* Do not recheck the master status and do not update the master score if
the node was already detected by the monitor as OCF_RUNNING_MASTER. By
that point, the running and healthy master shall not be checked against
other nodes' uptime, as it is pointless and only takes more time and
resources for the monitor action to finish.
* Fail early if the monitor detected the node as OCF_RUNNING_MASTER but
the rabbit beam process is not running (see the sketch after this list).
* For OCF_CHECK_LEVEL>20, exclude the current node from the check loop,
as we already checked it before.
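A rough, purely illustrative sketch of the fail-early case (function and
helper names are made up for the example, they are not the agent's real
ones):

    get_monitor() {
        local rc
        rc=$(check_master_status)                  # hypothetical helper
        if [ "$rc" -eq "$OCF_RUNNING_MASTER" ]; then
            ocf_log info "Node is detected as a running master"
            if ! pgrep -f beam >/dev/null; then
                return $OCF_ERR_GENERIC            # master reported, but no beam process
            fi
            return $OCF_RUNNING_MASTER             # skip the uptime comparison entirely
        fi
        # ... regular checks against the other nodes follow ...
    }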
Related Fuel bug:
https://launchpad.net/bugs/1531838
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
When 'rabbitmqctl rotate_logs' is called without any parameters, it
clears logs unconditionally. And given that this form is used in
logrotate config files, this could result in data loss.
This can be reproduced with the following scenario:
1) 'max_size' is set globally in the logrotate config
2) One of the two rabbitmq logs is greater than that limit
3) The daily logrotate run was already performed today, and now we are
calling it manually. In this case logrotate will copy only the file that
is bigger than max_size, but 'rabbitmqctl rotate_logs' will clear both
of them, leading to data loss.
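An illustrative reproduction, assuming a Debian-style setup where the
logrotate postrotate script calls 'rabbitmqctl rotate_logs' (paths are
examples):

    # rabbit@host.log is above max_size, rabbit@host-sasl.log is not, and
    # the daily rotation has already run today.
    logrotate /etc/logrotate.d/rabbitmq-server   # rotates only the oversized file
    # ... its postrotate script then runs:
    rabbitmqctl rotate_logs                      # clears BOTH live logs, losing the
                                                 # contents of the one logrotate skipped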
* Fix get_status() to catch beam state and output errors
* Fix action_stop() to force name-based matching when there is no
pidfile and the beam is unresponsive
* Fix proc_stop to use name-based matching if no pidfile is found
* Fix proc_stop to retry sending the signal when using the name-based
match as well
Without this patch, the following situations are possible:
- the beam process is running and cannot process signals, but is reported
as "not running" by get_status(), while in fact it shall be reported as a
generic error
- which_applications() returned an error, but its output is still being
parsed for the "what" match, while it shall not be
- action_stop and proc_stop give up when there is no pidfile and the beam
process is running but unresponsive.
The solution is to make get_status return a generic error and action_stop
use the rabbit process name matching for killing it.
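A minimal sketch of the get_status part (names and layout are illustrative,
not the agent's exact code):

    get_status() {
        if ! rabbitmqctl -q status >/dev/null 2>&1; then
            if pgrep -f beam >/dev/null; then
                return $OCF_ERR_GENERIC    # beam is there but unresponsive
            fi
            return $OCF_NOT_RUNNING
        fi
        return $OCF_SUCCESS
    }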
Related Fuel bug:
https://bugs.launchpad.net/fuel/+bug/1529897
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
Without this fix, the rabbit OCF agent cannot make proc_stop try to kill
the pid-less beam process by name matching, because proc_kill()'s 1st
parameter cannot be passed empty.
The fix is to use the "none" value when the pid-less process must be
matched by the service_name instead.
Also, fix proc_kill to deal with multi-process pidfiles as well (where
there are many pids, space separated).
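Roughly the intent (the argument conventions and names here are
assumptions made for the example):

    proc_kill() {
        local pid="$1" service_name="$2" signal="${3:-TERM}"
        if [ "$pid" = "none" ]; then
            # No pid known: match the beam process by the service name instead.
            pkill "-$signal" -f "$service_name"
        else
            # A pidfile may contain several space-separated pids.
            local p
            for p in $pid; do
                kill "-$signal" "$p"
            done
        fi
    }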
Related Fuel bugs:
https://launchpad.net/bugs/1529897
https://launchpad.net/bugs/1532723
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>