mirror of https://github.com/ollama/ollama.git
Flash attention kernels require the KV cache mask to be F16 rather than F32. We can use the GGML operation ggml_cast to do this conversion rather than doing it ourselves, which allows reuse of a preallocated buffer in the graph rather than allocating a new one for each batch. This improves token generation performance with flash attention by 10-30% (with gpt-oss). It also makes performance with flash attention better than without it, as expected.
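Below is a minimal sketch, in C against the public GGML API, of how the mask could be cast to F16 at graph-build time. The helper name and tensor arguments are illustrative assumptions, not taken from the ollama source.

```c
#include "ggml.h"

// Illustrative sketch only: insert a cast node so the flash-attention kernel
// receives an F16 mask. Because the cast is part of the graph, the graph
// allocator can reuse a preallocated buffer for it instead of allocating a
// fresh F16 copy on every batch.
static struct ggml_tensor * build_flash_attn_mask(struct ggml_context * ctx,
                                                  struct ggml_tensor  * mask_f32) {
    return ggml_cast(ctx, mask_f32, GGML_TYPE_F16);
}
```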
cache.go
causal.go
causal_test.go
encoder.go
wrapper.go