ollama/model
Devon Rifkin 05ba4ca1f4 parsers: fix unicode handling for qwen3-coder
When trimming whitespace at the end of every chunk, we were iterating
backwards over the string byte-by-byte instead of rune-by-rune.

As an example of how this can cause corruption, suppose we have the
multi-byte character ✅ (`"\u2705"`), which is represented in UTF-8 as
the three bytes `0xE2 0x9C 0x85`. The byte `0x85`, read on its own as a
code point, is NEL (U+0085), which passes `unicode.IsSpace()`. Because
we were iterating byte-by-byte, we mistakenly sliced in the middle of
the rune, removing `0x85` and leaving `0xE2 0x9C`, which, beyond being
the wrong place to slice, is not even valid UTF-8.

`trailingWhitespaceLen()` was modified to count from the end in a
rune-aware way. Tests with various multibyte unicode characters were
also added.
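
A minimal sketch of the difference, for illustration only: the two
helper names below are hypothetical, and the real
`trailingWhitespaceLen()` in the parsers package may be written
differently. The first helper reproduces the byte-by-byte scan described
above; the second walks backwards one rune at a time using
`utf8.DecodeLastRuneInString`.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// trailingWhitespaceLenBytes is a hypothetical reconstruction of the
// buggy approach: it treats every trailing byte as if it were a rune.
func trailingWhitespaceLenBytes(s string) int {
	n := 0
	for i := len(s) - 1; i >= 0; i-- {
		if !unicode.IsSpace(rune(s[i])) {
			break
		}
		n++
	}
	return n
}

// trailingWhitespaceLenRunes is a hypothetical rune-aware version: it
// decodes the last rune, checks it, and steps back by its encoded size.
func trailingWhitespaceLenRunes(s string) int {
	total := 0
	for len(s) > 0 {
		r, size := utf8.DecodeLastRuneInString(s)
		if !unicode.IsSpace(r) {
			break
		}
		total += size
		s = s[:len(s)-size]
	}
	return total
}

func main() {
	s := "ok \u2705" // ends in the bytes 0xE2 0x9C 0x85

	bad := trailingWhitespaceLenBytes(s)
	good := trailingWhitespaceLenRunes(s)

	// Trimming by the byte-wise count cuts the rune in half.
	fmt.Println(bad, utf8.ValidString(s[:len(s)-bad]))
	// Trimming by the rune-wise count leaves the string intact.
	fmt.Println(good, utf8.ValidString(s[:len(s)-good]))
}
```

Both helpers return a length in bytes, so the caller can still slice
with `s[:len(s)-n]`; only the way the boundary is found changes.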


Fixes: #12414
2025-09-25 15:47:46 -07:00
..
imageproc imageproc mllama refactor (#7537) 2024-12-14 19:50:15 -08:00
input batch: use tensors for outputs (#12185) 2025-09-15 14:33:06 -07:00
models Grace/deepseek v3 migration (#12385) 2025-09-24 15:19:47 -07:00
parsers parsers: fix unicode handling for qwen3-coder 2025-09-25 15:47:46 -07:00
renderers address comments 2025-09-15 11:46:25 -07:00
testdata gemma2 impl 2025-03-11 14:35:08 -07:00
bytepairencoding.go multi-regexp pretokenizer (#12325) 2025-09-23 13:21:47 -07:00
bytepairencoding_test.go multi-regexp pretokenizer (#12325) 2025-09-23 13:21:47 -07:00
model.go fix: leaf alt name (#12390) 2025-09-23 17:50:53 -07:00
model_test.go fix: leaf alt name (#12390) 2025-09-23 17:50:53 -07:00
sentencepiece.go model: implement bert in ollama engine (#9080) 2025-09-15 15:35:59 -07:00
sentencepiece_test.go model: implement bert in ollama engine (#9080) 2025-09-15 15:35:59 -07:00
textprocessor.go model: handle multiple eos tokens (#10577) 2025-05-16 13:40:23 -07:00
vocabulary.go embedding gemma model (#12181) 2025-09-04 09:09:07 -07:00
vocabulary_test.go model: treat 'user defined' tokens as special tokens (#11077) 2025-06-16 16:03:16 -07:00
wordpiece.go model: implement bert in ollama engine (#9080) 2025-09-15 15:35:59 -07:00
wordpiece_test.go model: implement bert in ollama engine (#9080) 2025-09-15 15:35:59 -07:00