Many multi-lingual and newer models use a tokenization scheme similar to
sentence-piece. This PR adds support for one of those tokenization
schemes, XLMRoBERTa.
The main changes are: - Support for xlm_roberta tokenization
configuration - Adding `scores` to the vocabulary document stored,
requiring that scores be the same size as the vocabulary - Adding a new
flat text file to resources that is the spm char normalizer.
This prevents docs files from *starting* with a "response" because when
that happens the response is converted to an assertion and appended
to the last snippet that was processed. If that last snipper was in a
different file then it's very hard to reason about the tests. That goes
double because the order we iterate files isn't defined....
Anyway! This adds a guard in the build, removes the offending
"response", and reenables the tests that we'd thought we failing here.
Closes#91081