OTP 27 reset all assumptions on how the vm reacts to processes that
buffer and process a lot of large binaries.
Substantially increasing the vheap sizes for such process restores
most of the same performance by allowing processes to hold more binary
data before major garbage collections are triggered.
This introduces a new module to capture process flag configurations.
The new vheap sizes are only applied when running on OTP 27 or
above.
It was still possible, although rare, to have message store
files lose message data, when the following conditions were
met:
* the message data contains byte values 255
(255 is used as an OK marker after a message)
* the message is located after a 0-filled hole in the file
* the length of the data is at least 4096 bytes and
if we misread it (as detailed below) we encounter
a 255 byte where we expect the OK marker
The trick for the code to previously misread the length can
be explained as follow:
A message is stored in the following format:
<<Len:64, MsgIdAndMsg:Len/unit:8, 255>>
With MsgId always being 16 bytes in length. So Len is always
at least 16, if the message data Msg is empty. But technically
it never is.
Now if we have a zero filled hole just before this message,
we may end up with this:
<<0, Len:64, MsgIdAndMsg:Len/unit:8, 255>>
When we are scanning we are testing bytes to see if there is
a message there or not. We look for a Len that gives us byte
255 after MsgIdAndMsg.
Len of value 4096 looks like this in binary:
<<0:48, 16, 0>>
Problem is if we have leading zeroes, Len may look like this:
<<0, 0:48, 16, 0>>
If we take the first 64 bits we get a potential length of 16.
We look at the byte after the next 16 bytes. If it is 255, we
think this is a message and skip by this amount of bytes, and
mistakenly miss the real message.
Solving this by changing the file format would be simple enough,
but we don't have the luxury to afford that. A different solution
was found, which is to combine file scanning with checking that
the message exists in the message store index (populated from
queues at startup, and kept up to date over the life time of
the store). Then we know for sure that the message above
doesn't exist, because the MsgId won't be found in the index.
If it is, then the file number and offset will not match,
and the check will fail.
There remains a small chance that we get it wrong during dirty
recovery. Only a better file format would improve that.