Commit Graph

4227 Commits

Author SHA1 Message Date
Jesse Gross 06007c0a18 Allow models to force a new batch
This is useful for a few things:
 - Work around bugs, such as having 2 images in one batch
 - Keep the image in a single batch for fully connected attention
 - Improve performance by not evaluating embeddings multiple times
2025-03-11 14:49:20 -07:00
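The batch-splitting behavior this commit describes can be sketched roughly as follows. All names here (`Input`, `SameBatch`, `splitBatches`) are hypothetical illustrations, not the actual ollama API: the idea is that an input (such as an image) can ask for a window of following inputs to stay in its batch, forcing an early batch break if the window would not fit.

```go
package main

import "fmt"

// Input is a hypothetical token or image embedding in the input stream.
type Input struct {
	Token      int32
	Multimodal any // non-nil for image embeddings
	SameBatch  int // number of following inputs to keep in this batch
}

// splitBatches breaks inputs into batches of at most maxBatch entries,
// starting a new batch early when an input asks to keep a window of
// following inputs together (e.g. a full image for fully connected
// attention, or to avoid evaluating an embedding twice).
func splitBatches(inputs []Input, maxBatch int) [][]Input {
	var batches [][]Input
	var cur []Input
	for i := range inputs {
		need := 1 + inputs[i].SameBatch // this input plus its window
		if len(cur) > 0 && len(cur)+need > maxBatch {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, inputs[i])
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	inputs := make([]Input, 6)
	inputs[2].SameBatch = 3 // keep inputs 2..5 together
	for _, b := range splitBatches(inputs, 4) {
		fmt.Println(len(b)) // batch sizes: 2, then 4
	}
}
```

With a batch limit of 4, the image window at inputs 2..5 forces a break after the first two inputs so the whole window lands in a single batch.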
Jesse Gross a8e83a7654 Disable causal attention based on batch index
Currently we are using positions, which are relative to a
sequence and may not be unique.
2025-03-11 14:49:20 -07:00
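The distinction this commit draws can be sketched with a toy predicate (illustrative only, not the actual kvcache code): when two sequences share a batch, their per-sequence positions repeat, but each entry's index within the batch is unique, so batch index is a safe key for causal masking.

```go
package main

import "fmt"

// canAttend reports whether entry i may attend to entry j under causal
// masking keyed by batch index rather than sequence position. Batch
// indices stay unique even when two sequences in the same batch share
// position values.
func canAttend(batchIdx, seq []int, i, j int) bool {
	if seq[i] != seq[j] {
		return false // never attend across sequences
	}
	return batchIdx[j] <= batchIdx[i]
}

func main() {
	// Two sequences interleaved in one batch: sequence positions would
	// repeat (0, 0, 1, 1), but batch indices (0, 1, 2, 3) are unique.
	seq := []int{0, 1, 0, 1}
	batchIdx := []int{0, 1, 2, 3}
	fmt.Println(canAttend(batchIdx, seq, 2, 0)) // true: same sequence, earlier entry
	fmt.Println(canAttend(batchIdx, seq, 0, 2)) // false: later entry
}
```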
Jesse Gross 475005504e Restrict Gemma to a single image per request 2025-03-11 14:49:20 -07:00
Jesse Gross 2c40c4d35e Fix follow up images and images split across batches 2025-03-11 14:49:19 -07:00
Michael Yang e95278932b use non-causal mask only for image positions 2025-03-11 14:49:19 -07:00
Michael Yang 9d2a20a763 use non-causal mask for inputs with images 2025-03-11 14:49:19 -07:00
Patrick Devine 2e54d72fc3 fix gemma3 1b conversion 2025-03-11 14:49:19 -07:00
Michael Yang 6b32a2d549 compat with upstream gguf 2025-03-11 14:49:19 -07:00
Michael Yang c5cbe4fc2a fallback to cpu 2025-03-11 14:49:19 -07:00
Michael Yang f888912870 fix vision encoder 2025-03-11 14:49:19 -07:00
Michael Yang 9e4642e9b3 ollama debug tensor 2025-03-11 14:49:19 -07:00
Michael Yang 6b0486c216 duplicate token_embd to output 2025-03-11 14:49:19 -07:00
Michael Yang d368c039f0 skip repacking vision tensors 2025-03-11 14:49:19 -07:00
Patrick Devine 9b54267e69 fix configs 2025-03-11 14:49:19 -07:00
Michael Yang 46bb0169c4 update model 2025-03-11 14:49:19 -07:00
Michael Yang 8934324b72 use fast attention 2025-03-11 14:49:18 -07:00
Jesse Gross 0e886595bf Fix tests and drift from main 2025-03-11 14:49:18 -07:00
Patrick Devine c62861f4fa fix conversion 2025-03-11 14:49:18 -07:00
Michael Yang 0df1800436 set non-causal attention 2025-03-11 14:49:18 -07:00
Patrick Devine 631fecc6d9 temporary work around for converting spm 2025-03-11 14:49:18 -07:00
Jesse Gross 4346c2409d fix drift from main 2025-03-11 14:49:18 -07:00
Michael Yang 4b037a97dc add gemma vision encoder 2025-03-11 14:49:17 -07:00
Patrick Devine 5f74d1fd47 gemma2 impl 2025-03-11 14:35:08 -07:00
Daniel Hiltgen 4dcf80167a Build release for windows with local script (#9636) 2025-03-11 08:34:20 -07:00
Michael Yang 26a26998fb Merge pull request #9590 from ollama/mxyng/dump-pad
fix: pad tensor item if ge zero
2025-03-10 16:34:55 -07:00
Michael Yang 9926eae015 fix: pad tensor item if ge zero
this produces nicer output since positive and negative values
produce the same width
2025-03-10 16:18:12 -07:00
Vincent Koc 8585b7b151 docs: add opik to observability integrations (#9626) 2025-03-10 16:15:10 -07:00
Parth Sareen 7e34f4fbfa sample: add numerical stability to temperature/softmax transform (#9631) 2025-03-10 14:43:53 -07:00
Michael Yang fe776293f7 Merge pull request #9569 from dwt/patch-1
Better WantedBy declaration
2025-03-10 14:09:37 -07:00
frob d8a5d96b98 docs: Add OLLAMA_CONTEXT_LENGTH to FAQ. (#9545) 2025-03-10 11:02:54 -07:00
Xiaowei Zhu 757668c42f docs: add SwiftChat (#9540) 2025-03-10 11:01:09 -07:00
Sam 96ec8afd09 docs(tool): add mcp-llm (#9537) 2025-03-10 09:52:02 -07:00
Jeffrey Morgan e093db92c4 sample: temporarily use grammars for constrained generation in new engine (#9586) 2025-03-10 16:17:39 +01:00
Jesse Gross a1cda80bcb model: Update encoder cache to use multimodal input processing handler
The encoder cache needs to know the position of images in the input
stream so that it knows when to delete them. Previously images didn't
have a position, so we implied one by breaking batches before an
image and then assuming the image was in the first position. However,
multimodal objects are now given explicit positions in the input
stream, so we can use that instead.

Breaking batches was also a way to simulate a cross attention mask
for mllama. However, given that it only supports a single sequence
and a single image, this mask doesn't serve any real purpose.
Removing the batch break does not appear to affect the quality of
the output.

Most of this is simply moving the input data structures to a new
package to avoid import cycles.
2025-03-09 17:05:26 -07:00
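The eviction logic this commit enables can be sketched as a toy cache (all names here, `EncoderCache`, `Put`, `Evict`, are hypothetical, not the actual ollama types): because multimodal inputs now carry explicit positions in the input stream, the cache can drop an image once decoding has moved past it, instead of inferring image boundaries from batch breaks.

```go
package main

import "fmt"

// entry is a hypothetical encoder-cache record: a processed image
// embedding plus the position it occupies in the input stream.
type entry struct {
	pos       int
	embedding []float32
}

// EncoderCache evicts images once processing has moved past their
// explicit position in the input stream.
type EncoderCache struct{ entries []entry }

func (c *EncoderCache) Put(pos int, emb []float32) {
	c.entries = append(c.entries, entry{pos, emb})
}

// Evict drops entries whose position is before the given stream position.
func (c *EncoderCache) Evict(before int) {
	kept := c.entries[:0]
	for _, e := range c.entries {
		if e.pos >= before {
			kept = append(kept, e)
		}
	}
	c.entries = kept
}

func main() {
	var c EncoderCache
	c.Put(3, make([]float32, 4))
	c.Put(10, make([]float32, 4))
	c.Evict(5)
	fmt.Println(len(c.entries)) // the image at position 3 is gone
}
```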
Jesse Gross 4614fafae0 ollamarunner: Don't panic for unimplemented features at runtime.
It's ok to fail on startup but we shouldn't panic during runtime
based on user input. Downgrade the panic to a warning.
2025-03-08 18:58:18 -08:00
Jesse Gross 4100ed7bdd ml: Add support for quantized KV cache
Similar to the llama engine, quantizing the KV cache requires
flash attention to be enabled through the Ollama server.
2025-03-07 18:43:39 -08:00
Jesse Gross f52b2615ef kvcache: Set context for shift offsets 2025-03-07 18:43:39 -08:00
Jesse Gross 25f9b152f9 ggml-backend: Ensure allocations meet backend requirements
Backends can impose additional alignment requirements on buffer sizes.
We should ensure that we meet these or allocations can fail.
2025-03-07 18:43:39 -08:00
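Meeting a backend's size alignment requirement usually comes down to rounding the requested size up to the next multiple of the alignment before allocating. A minimal sketch (the function name is illustrative, not the actual ggml-backend API):

```go
package main

import "fmt"

// alignUp rounds size up to the next multiple of align (a power of
// two). Allocating the rounded-up size satisfies a backend that
// rejects or mis-handles buffers of unaligned sizes.
func alignUp(size, align uintptr) uintptr {
	return (size + align - 1) &^ (align - 1)
}

func main() {
	fmt.Println(alignUp(13, 8)) // 16
	fmt.Println(alignUp(16, 8)) // 16: already aligned sizes are unchanged
}
```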
Jesse Gross 6da8b6a879 kvcache: Support non-causal attention
Models can disable causality for all or part of their processing
while continuing to store data in the KV cache.
2025-03-07 18:39:27 -08:00
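Disabling causality for part of a batch while keeping everything in one KV cache amounts to building an additive attention mask with a per-query causal flag. A sketch under assumed names (`buildMask` and the large-negative stand-in for -inf are illustrative, not the kvcache package's actual code):

```go
package main

import "fmt"

// buildMask returns an additive attention mask: 0 where attention is
// allowed, a large negative value (standing in for -inf) where it is
// blocked. Queries with causal[i] == false attend to the whole
// context, as image tokens do, while still sharing the same KV cache.
func buildMask(n int, causal []bool) [][]float32 {
	const neg = float32(-1e9)
	mask := make([][]float32, n)
	for i := range mask {
		mask[i] = make([]float32, n)
		for j := range mask[i] {
			if causal[i] && j > i {
				mask[i][j] = neg
			}
		}
	}
	return mask
}

func main() {
	// Query 0 is non-causal (e.g. an image token); query 1 is causal.
	m := buildMask(2, []bool{false, true})
	fmt.Println(m[0][1] == 0) // true: non-causal query sees the future
	fmt.Println(m[1][0] == 0) // true: causal query sees the past
}
```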
Jesse Gross 0daaaef8c9 ollamarunner: Quiet debug logging and panic on unimplemented features
Debug logging of every token has previously caused test timeouts
on slower machines.
2025-03-07 18:38:02 -08:00
Jesse Gross 98272fbd58 additional review comments 2025-03-07 14:08:21 -08:00
Michael Yang b27e8f3f10 ml/backend/ggml: use backend buffer type
this ensures the tensor is created on the right buffer type for backends
such as cpu
2025-03-07 14:08:21 -08:00
Michael Yang 45df786f09 comments 2025-03-07 14:08:21 -08:00
Michael Yang daaf42e4a4 ml/backend/ggml: clean up 2025-03-07 14:08:21 -08:00
Michael Yang 2dc60d4620 ml/backend/ggml: offload vision to cpu
temporary until tensor loading can accurately account for vision models
2025-03-07 14:08:21 -08:00
Michael Yang b5312f30e8 ml/backend/ggml: handle tensor split 2025-03-07 14:08:21 -08:00
Michael Yang 26c2e0bd35 ml/backend/ggml: handle user specified cpu offloading 2025-03-07 14:08:21 -08:00
Michael Yang bf920883d5 ml/backend/ggml: set cpu n_threads 2025-03-07 14:08:21 -08:00
Michael Yang 58b9ec1f6b kvcache: update tests 2025-03-07 14:08:21 -08:00
Michael Yang 7bae7fa5ce ml/backend/ggml: create tensor on specific backend
some tensors should be created on specific backends to reduce number of
copies and improve performance
2025-03-07 14:08:21 -08:00