ollama/ml
Jesse Gross f8c4f106cc kvcache: Use Cast instead of Copy for flash attention masks
Flash attention kernels require that the mask of the KV cache be an F16
rather than an F32. We can use the GGML operation ggml_cast to do
this rather than doing it ourselves, which allows reuse of a
preallocated buffer in the graph rather than allocating a new one
for each batch. This improves token generation performance with
flash attention by 10-30% (with gpt-oss). This also makes performance
with flash attention better than without it, as expected.
2025-08-20 16:57:03 +02:00
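As a rough illustration of the technique the commit message describes, the sketch below is hypothetical code (not taken from the Ollama or GGML sources; the wrapper name build_flash_attn_mask is assumed). It shows how ggml_cast can convert an F32 mask tensor to the F16 layout that flash attention kernels expect, keeping the conversion as an operation inside the compute graph so a preallocated buffer can be reused instead of allocating a fresh tensor for every batch.

```c
// Hypothetical sketch: cast the KV-cache mask to F16 for flash attention.
// ggml_cast adds a cast node to the graph, letting the graph allocator
// reuse its preallocated buffer rather than copying into a new tensor
// on each batch.
#include "ggml.h"

struct ggml_tensor * build_flash_attn_mask(struct ggml_context * ctx,
                                           struct ggml_tensor * mask_f32) {
    // Flash attention kernels expect the mask in F16 rather than F32.
    return ggml_cast(ctx, mask_f32, GGML_TYPE_F16);
}
```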
backend kvcache: Use Cast instead of Copy for flash attention masks 2025-08-20 16:57:03 +02:00
nn update vendored llama.cpp and ggml (#11823) 2025-08-20 16:41:49 +02:00
backend.go kvcache: Use Cast instead of Copy for flash attention masks 2025-08-20 16:57:03 +02:00