mirror of https://github.com/ollama/ollama.git
Flash attention kernels require the KV cache mask to be F16 rather than F32. We can use the GGML operation ggml_cast to do this conversion rather than doing it ourselves, which allows reuse of a preallocated buffer in the graph rather than allocating a new one for each batch. This improves token generation performance with flash attention by 10-30% (with gpt-oss). It also makes performance with flash attention better than without it, as expected.
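Below is a minimal sketch, in C against the public GGML API, of how the mask could be cast to F16 at graph-build time. The helper name and tensor arguments are illustrative assumptions, not taken from the ollama source.

```c
#include "ggml.h"

// Illustrative sketch only: insert a cast node so the flash-attention kernel
// receives an F16 mask. Because the cast is part of the graph, the graph
// allocator can reuse a preallocated buffer for it instead of allocating a
// fresh F16 copy on every batch.
static struct ggml_tensor * build_flash_attn_mask(struct ggml_context * ctx,
                                                  struct ggml_tensor  * mask_f32) {
    return ggml_cast(ctx, mask_f32, GGML_TYPE_F16);
}
```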
cache.go
causal.go
causal_test.go
encoder.go
wrapper.go