ollama

Jesse Gross ea79003180 kvcache: Skip computing causal mask for worst case graph reservation		2025-05-27 14:25:15 -07:00

Computing an attention mask for a large context and max batch is expensive - over 100ms. Models like Gemma3 that have multiple types of caches and custom attention masks need to do this 4 times, so this adds approximately 500ms to startup time when using 128k context.

When we are reserving the worst case graph, we don't need the mask, only its shape, so we can skip this.
cache.go	ollamarunner: Preallocate worst case graph at startup	2025-04-08 10:01:28 -07:00
causal.go	kvcache: Skip computing causal mask for worst case graph reservation	2025-05-27 14:25:15 -07:00
causal_test.go	ml: Panic rather than return error on tensor allocation failure	2025-05-22 14:38:09 -07:00
encoder.go	ollamarunner: Preallocate worst case graph at startup	2025-04-08 10:01:28 -07:00
wrapper.go	ollamarunner: Preallocate worst case graph at startup	2025-04-08 10:01:28 -07:00