Commit Graph

4227 Commits

Author SHA1 Message Date
Jesse Gross 06007c0a18 Allow models to force a new batch
This is useful for a few things:
 - Work around bugs, such as having 2 images in one batch
 - Keep the image in a single batch for fully connected attention
 - Improve performance by not evaluating embeddings multiple times
2025-03-11 14:49:20 -07:00
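The batch-splitting behavior this commit describes can be sketched roughly as follows. All names here (`Input`, `SameBatch`, `splitBatches`) are hypothetical illustrations, not the actual ollama API: the idea is that an input (such as an image) can ask for a window of following inputs to stay in its batch, forcing an early batch break if the window would not fit.

```go
package main

import "fmt"

// Input is a hypothetical token or image embedding in the input stream.
type Input struct {
	Token      int32
	Multimodal any // non-nil for image embeddings
	SameBatch  int // number of following inputs to keep in this batch
}

// splitBatches breaks inputs into batches of at most maxBatch entries,
// starting a new batch early when an input asks to keep a window of
// following inputs together (e.g. a full image for fully connected
// attention, or to avoid evaluating an embedding twice).
func splitBatches(inputs []Input, maxBatch int) [][]Input {
	var batches [][]Input
	var cur []Input
	for i := range inputs {
		need := 1 + inputs[i].SameBatch // this input plus its window
		if len(cur) > 0 && len(cur)+need > maxBatch {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, inputs[i])
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	inputs := make([]Input, 6)
	inputs[2].SameBatch = 3 // keep inputs 2..5 together
	for _, b := range splitBatches(inputs, 4) {
		fmt.Println(len(b)) // batch sizes: 2, then 4
	}
}
```

With a batch limit of 4, the image window at inputs 2..5 forces a break after the first two inputs so the whole window lands in a single batch.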
Jesse Gross a8e83a7654 Disable causal attention based on batch index
Currently we are using positions, which are relative to a
sequence and may not be unique.
2025-03-11 14:49:20 -07:00
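The distinction this commit draws can be sketched with a toy predicate (illustrative only, not the actual kvcache code): when two sequences share a batch, their per-sequence positions repeat, but each entry's index within the batch is unique, so batch index is a safe key for causal masking.

```go
package main

import "fmt"

// canAttend reports whether entry i may attend to entry j under causal
// masking keyed by batch index rather than sequence position. Batch
// indices stay unique even when two sequences in the same batch share
// position values.
func canAttend(batchIdx, seq []int, i, j int) bool {
	if seq[i] != seq[j] {
		return false // never attend across sequences
	}
	return batchIdx[j] <= batchIdx[i]
}

func main() {
	// Two sequences interleaved in one batch: sequence positions would
	// repeat (0, 0, 1, 1), but batch indices (0, 1, 2, 3) are unique.
	seq := []int{0, 1, 0, 1}
	batchIdx := []int{0, 1, 2, 3}
	fmt.Println(canAttend(batchIdx, seq, 2, 0)) // true: same sequence, earlier entry
	fmt.Println(canAttend(batchIdx, seq, 0, 2)) // false: later entry
}
```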
Jesse Gross 475005504e Restrict Gemma to a single image per request 2025-03-11 14:49:20 -07:00
Jesse Gross 2c40c4d35e Fix follow up images and images split across batches 2025-03-11 14:49:19 -07:00
Michael Yang e95278932b use non-causal mask only for image positions 2025-03-11 14:49:19 -07:00
Michael Yang 9d2a20a763 use non-causal mask for inputs with images 2025-03-11 14:49:19 -07:00
Patrick Devine 2e54d72fc3 fix gemma3 1b conversion 2025-03-11 14:49:19 -07:00
Michael Yang 6b32a2d549 compat with upstream gguf 2025-03-11 14:49:19 -07:00
Michael Yang c5cbe4fc2a fallback to cpu 2025-03-11 14:49:19 -07:00
Michael Yang f888912870 fix vision encoder 2025-03-11 14:49:19 -07:00
Michael Yang 9e4642e9b3 ollama debug tensor 2025-03-11 14:49:19 -07:00
Michael Yang 6b0486c216 duplicate token_embd to output 2025-03-11 14:49:19 -07:00
Michael Yang d368c039f0 skip repacking vision tensors 2025-03-11 14:49:19 -07:00
Patrick Devine 9b54267e69 fix configs 2025-03-11 14:49:19 -07:00
Michael Yang 46bb0169c4 update model 2025-03-11 14:49:19 -07:00
Michael Yang 8934324b72 use fast attention 2025-03-11 14:49:18 -07:00
Jesse Gross 0e886595bf Fix tests and drift from main 2025-03-11 14:49:18 -07:00
Patrick Devine c62861f4fa fix conversion 2025-03-11 14:49:18 -07:00
Michael Yang 0df1800436 set non-causal attention 2025-03-11 14:49:18 -07:00
Patrick Devine 631fecc6d9 temporary work around for converting spm 2025-03-11 14:49:18 -07:00
Jesse Gross 4346c2409d fix drift from main 2025-03-11 14:49:18 -07:00
Michael Yang 4b037a97dc add gemma vision encoder 2025-03-11 14:49:17 -07:00
Patrick Devine 5f74d1fd47 gemma2 impl 2025-03-11 14:35:08 -07:00
Daniel Hiltgen 4dcf80167a Build release for windows with local script (#9636) 2025-03-11 08:34:20 -07:00
Michael Yang 26a26998fb Merge pull request #9590 from ollama/mxyng/dump-pad
fix: pad tensor item if ge zero
2025-03-10 16:34:55 -07:00
Michael Yang 9926eae015 fix: pad tensor item if ge zero
this produces nicer output since positive and negative values
produce the same width
2025-03-10 16:18:12 -07:00
Vincent Koc 8585b7b151 docs: add opik to observability integrations (#9626) 2025-03-10 16:15:10 -07:00
Parth Sareen 7e34f4fbfa sample: add numerical stability to temperature/softmax transform (#9631) 2025-03-10 14:43:53 -07:00
Michael Yang fe776293f7 Merge pull request #9569 from dwt/patch-1
Better WantedBy declaration
2025-03-10 14:09:37 -07:00
frob d8a5d96b98 docs: Add OLLAMA_CONTEXT_LENGTH to FAQ. (#9545) 2025-03-10 11:02:54 -07:00
Xiaowei Zhu 757668c42f docs: add SwiftChat (#9540) 2025-03-10 11:01:09 -07:00
Sam 96ec8afd09 docs(tool): add mcp-llm (#9537) 2025-03-10 09:52:02 -07:00
Jeffrey Morgan e093db92c4 sample: temporarily use grammars for constrained generation in new engine (#9586) 2025-03-10 16:17:39 +01:00
Jesse Gross a1cda80bcb model: Update encoder cache to use multimodal input processing handler
The encoder cache needs to know the position of images in the input
stream so that it knows when to delete them. Previously images didn't
have a position, so we implied one by breaking batches before an
image and then assuming the image was in the first position. However,
multimodal objects are now given explicit positions in the input
stream, so we can use that instead.

Breaking batches was also a way to simulate a cross attention mask
for mllama. However, given that it only supports a single sequence
and a single image, this mask doesn't serve any real purpose.
Removing the batch break does not appear to affect the quality of
the output.

Most of this is simply moving the input data structures to a new
package to avoid import cycles.
2025-03-09 17:05:26 -07:00
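The eviction logic this commit enables can be sketched as a toy cache (all names here, `EncoderCache`, `Put`, `Evict`, are hypothetical, not the actual ollama types): because multimodal inputs now carry explicit positions in the input stream, the cache can drop an image once decoding has moved past it, instead of inferring image boundaries from batch breaks.

```go
package main

import "fmt"

// entry is a hypothetical encoder-cache record: a processed image
// embedding plus the position it occupies in the input stream.
type entry struct {
	pos       int
	embedding []float32
}

// EncoderCache evicts images once processing has moved past their
// explicit position in the input stream.
type EncoderCache struct{ entries []entry }

func (c *EncoderCache) Put(pos int, emb []float32) {
	c.entries = append(c.entries, entry{pos, emb})
}

// Evict drops entries whose position is before the given stream position.
func (c *EncoderCache) Evict(before int) {
	kept := c.entries[:0]
	for _, e := range c.entries {
		if e.pos >= before {
			kept = append(kept, e)
		}
	}
	c.entries = kept
}

func main() {
	var c EncoderCache
	c.Put(3, make([]float32, 4))
	c.Put(10, make([]float32, 4))
	c.Evict(5)
	fmt.Println(len(c.entries)) // the image at position 3 is gone
}
```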
Jesse Gross 4614fafae0 ollamarunner: Don't panic for unimplemented features at runtime.
It's ok to fail on startup but we shouldn't panic during runtime
based on user input. Downgrade the panic to a warning.
2025-03-08 18:58:18 -08:00
Jesse Gross 4100ed7bdd ml: Add support for quantized KV cache
Similar to the llama engine, quantizing the KV cache requires
flash attention to be enabled through the Ollama server.
2025-03-07 18:43:39 -08:00
Jesse Gross f52b2615ef kvcache: Set context for shift offsets 2025-03-07 18:43:39 -08:00
Jesse Gross 25f9b152f9 ggml-backend: Ensure allocations meet backend requirements
Backends can impose additional alignment requirements on buffer sizes.
We should ensure that we meet these or allocations can fail.
2025-03-07 18:43:39 -08:00
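Meeting a backend's size alignment requirement usually comes down to rounding the requested size up to the next multiple of the alignment before allocating. A minimal sketch (the function name is illustrative, not the actual ggml-backend API):

```go
package main

import "fmt"

// alignUp rounds size up to the next multiple of align (a power of
// two). Allocating the rounded-up size satisfies a backend that
// rejects or mis-handles buffers of unaligned sizes.
func alignUp(size, align uintptr) uintptr {
	return (size + align - 1) &^ (align - 1)
}

func main() {
	fmt.Println(alignUp(13, 8)) // 16
	fmt.Println(alignUp(16, 8)) // 16: already aligned sizes are unchanged
}
```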
Jesse Gross 6da8b6a879 kvcache: Support non-causal attention
Models can disable causality for all or part of their processing
while continuing to store data in the KV cache.
2025-03-07 18:39:27 -08:00
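Disabling causality for part of a batch while keeping everything in one KV cache amounts to building an additive attention mask with a per-query causal flag. A sketch under assumed names (`buildMask` and the large-negative stand-in for -inf are illustrative, not the kvcache package's actual code):

```go
package main

import "fmt"

// buildMask returns an additive attention mask: 0 where attention is
// allowed, a large negative value (standing in for -inf) where it is
// blocked. Queries with causal[i] == false attend to the whole
// context, as image tokens do, while still sharing the same KV cache.
func buildMask(n int, causal []bool) [][]float32 {
	const neg = float32(-1e9)
	mask := make([][]float32, n)
	for i := range mask {
		mask[i] = make([]float32, n)
		for j := range mask[i] {
			if causal[i] && j > i {
				mask[i][j] = neg
			}
		}
	}
	return mask
}

func main() {
	// Query 0 is non-causal (e.g. an image token); query 1 is causal.
	m := buildMask(2, []bool{false, true})
	fmt.Println(m[0][1] == 0) // true: non-causal query sees the future
	fmt.Println(m[1][0] == 0) // true: causal query sees the past
}
```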
Jesse Gross 0daaaef8c9 ollamarunner: Quiet debug logging and panic on unimplemented features
Debug logging of every token has previously caused test timeouts
on slower machines.
2025-03-07 18:38:02 -08:00
Jesse Gross 98272fbd58 additional review comments 2025-03-07 14:08:21 -08:00
Michael Yang b27e8f3f10 ml/backend/ggml: use backend buffer type
this ensures the tensor is created on the right buffer type for backends
such as cpu
2025-03-07 14:08:21 -08:00
Michael Yang 45df786f09 comments 2025-03-07 14:08:21 -08:00
Michael Yang daaf42e4a4 ml/backend/ggml: clean up 2025-03-07 14:08:21 -08:00
Michael Yang 2dc60d4620 ml/backend/ggml: offload vision to cpu
temporary until tensor loading can accurately account for vision models
2025-03-07 14:08:21 -08:00
Michael Yang b5312f30e8 ml/backend/ggml: handle tensor split 2025-03-07 14:08:21 -08:00
Michael Yang 26c2e0bd35 ml/backend/ggml: handle user specified cpu offloading 2025-03-07 14:08:21 -08:00
Michael Yang bf920883d5 ml/backend/ggml: set cpu n_threads 2025-03-07 14:08:21 -08:00
Michael Yang 58b9ec1f6b kvcache: update tests 2025-03-07 14:08:21 -08:00
Michael Yang 7bae7fa5ce ml/backend/ggml: create tensor on specific backend
some tensors should be created on specific backends to reduce number of
copies and improve performance
2025-03-07 14:08:21 -08:00