mirror of https://github.com/ollama/ollama.git
When trimming whitespace at the end of every chunk, we were iterating backwards over the string byte-by-byte instead of rune-by-rune.

As an example of how this can cause corruption, suppose we have the multi-byte character ✅ (`"\u2705"`), which is represented in UTF-8 as the three bytes `0xE2 0x9C 0x85`. Taken on its own as a code point, `0x85` is NEL (U+0085), which passes `unicode.IsSpace()`. Because we were iterating byte-by-byte, we mistakenly sliced in the middle of the rune, removing `0x85` and leaving `0xE2 0x9C`, which is not only the wrong place to slice but also not a valid UTF-8 sequence.

`trailingWhitespaceLen()` was modified to count from the end in a rune-aware way. Tests with various multi-byte Unicode characters were also added.

Fixes: #12414

| | | |
|---|---|---|
| imageproc | ||
| input | ||
| models | ||
| parsers | ||
| renderers | ||
| testdata | ||
| bytepairencoding.go | ||
| bytepairencoding_test.go | ||
| model.go | ||
| model_test.go | ||
| sentencepiece.go | ||
| sentencepiece_test.go | ||
| textprocessor.go | ||
| vocabulary.go | ||
| vocabulary_test.go | ||
| wordpiece.go | ||
| wordpiece_test.go | ||