ollama

Commit Graph

Author	SHA1	Message	Date
Michael Yang	c890011322	feat: port qwen2 model (#10782)	2025-05-21 10:21:24 -07:00
Michael Yang	e0ed984cde	feat: qwen3 dense and sparse models (#10708) * feat: qwen3 dense * feat: qwen3moe * fix llama4 moe	2025-05-21 10:21:07 -07:00
Michael Yang	69b2fe9282	fix: qwen25vl assign samebatch in multimodal input (#10789) setting samebatch on the vision start token is problematic because it will be shared with other inputs that also use images. this will cause the input to be cached and the runner will not see SameBatch. SameBatch will also be incorrect since it may be for a different image. assigning samebatch to the input tokens resolves this by ensuring it's assigned correctly to inputs corresponding to the image. not setting same batch correctly may cause panics during inference since images are no longer guaranteed to be in the same batch.	2025-05-21 09:39:20 -07:00
Michael Yang	9ed8bf14cb	ml: add more rope options (#10775)	2025-05-20 15:51:08 -07:00
Michael Yang	ff180c3466	fix llama and mistral3 models (#10774) * fix llama model * fix mistral3.1 model do not set default vision layers	2025-05-19 15:06:35 -07:00
Jesse Gross	94ab428e3f	ggml: Separate tensor load from backend creation Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them into two steps: - Create backend, including enumerating tensors and memory allocation - Loading tensor data This allows more flexibility in managing model loading.	2025-05-19 09:54:22 -07:00
Michael Yang	333e360422	model: handle multiple eos tokens (#10577) * get eos_token_id from generation_config.json * refactor * include both ids and strings in trace * comments * remove special case for gemma3 special vocab (#10743)	2025-05-16 13:40:23 -07:00
Jesse Gross	3c14461d5d	ollamarunner: Separate text and multimodal graphs For some multimodal models (such as gemma3), we create a single graph that generates the image embedding and then use this in the text model. The embedding tensor is completely opaque to the runner. However, this doesn't work if we need to use the embedding in multiple batches. This can arise if the embedding is larger than the batch size. In these cases (as with llama4), we would like to create views that are more appropriately sized. However, if we do this then the original source tensor is used in multiple graphs, which isn't allowed. To avoid that problem, models with this pattern compute the embedding tensor on first use and recreate the individual views. There is no longer a single vision and text graph. This codifies the pattern of separating vision and text graphs. The logic of computing tensors on demand is moved to the runner, so models no longer have to worry about this. It also gives the runner visibility into the multimodal tensors, which is important for memory management.	2025-05-15 13:46:20 -07:00
Michael Yang	ef202789fa	fix pixel values padding (#10718) * panic if trying to pad 4d * fix pixel values padding	2025-05-15 13:44:44 -07:00
Bruce MacDonald	0aa8b371dd	model: add Qwen2.5-VL support (#10385)	2025-05-13 20:58:02 -07:00
Michael Yang	23125648b8	chore: update mllama to use ollama engine (#10637)	2025-05-13 17:36:02 -07:00
Michael Yang	526b2ed102	fix vocabulary (#10679)	2025-05-12 17:29:46 -07:00
Bruce MacDonald	a7240c6d63	models: remove unused qwen2vl processing (#10677)	2025-05-12 16:08:42 -07:00
Michael Yang	f95a1f2bef	feat: add trace log level (#10650) reduce prompt log to trace level	2025-05-12 11:43:00 -07:00
Michael Yang	5cfc1c39f3	model: fix build (#10416)	2025-04-25 19:24:48 -07:00
Michael Yang	7ba9fa9c7d	fixes for maverick	2025-04-25 16:59:20 -07:00
Michael Yang	8bf11b84c1	chunked attention	2025-04-25 16:59:20 -07:00
Michael Yang	470af8ab89	connect vision to text	2025-04-25 16:59:20 -07:00
Michael Yang	178761aef3	image processing Co-authored-by: Patrick Devine <patrick@infrahq.com>	2025-04-25 16:59:20 -07:00
Michael Yang	f0c66e6dea	llama4	2025-04-25 16:59:20 -07:00
Michael Yang	d26c18e25c	fix token type	2025-04-25 16:59:01 -07:00
Parth Sareen	a53d744b01	llama: remove model loading for grammar (#10096)	2025-04-24 11:51:19 -07:00
Michael Yang	40b8fdbdca	arange	2025-04-18 11:45:44 -07:00
Jesse Gross	dbb149e6f7	ollamarunner: Preallocate worst case graph at startup Currently, the KV cache and graph are lazily allocated as needed. The cache is fully allocated on first use of the corresponding layer whereas the graph grows with the size of the context. This can be an issue if another application allocates more VRAM after we do our calculations - Ollama will crash in the middle of inference. If we instead allocate the maximum needed memory at startup of the runner, we will either succeed or fail at that point rather than at some surprising time in the future. Currently, this only generates a worst case batch for text, which means that vision models may get a partial allocation and continue to lazily allocate the rest.	2025-04-08 10:01:28 -07:00
Bruce MacDonald	6bd0a983cd	model: support for mistral-small in the ollama runner Mistral is a popular research lab making open source models. This updates the forward pass of llama architecture models to support both llama models and mistral models by accounting for additional metadata present in mistral models, and finding the correct dimensions for the output projection.	2025-04-03 16:57:36 -07:00
Michael Yang	3b96a93672	fs: move ml.Config to fs package	2025-04-03 13:12:24 -07:00
Jeffrey Morgan	b51e0f397c	model: fix issues with spm tokenizer for Gemma 3 (#10081)	2025-04-02 13:22:56 -07:00
Michael Yang	74bd09652d	ml/backend/ggml: load tensors in 32KiB chunks	2025-03-21 14:43:52 -07:00
Jesse Gross	0fbfcf3c9c	model: Pass input tensor instead of raw data to models Rather than directly giving the input data to models, we can pass a tensor instead. In the short term, this saves some duplicated code. Longer term, we will want to overlap setting up the next batch with processing of the current one. In this case, we will only have the shape of tensor but it will not be loaded with data at the time of graph generation. By passing only a tensor to models now, we set up this possibility and prevent them from relying on data that they won't have in the future. Although the same could be done for Positions and Outputs, in some cases we either need the raw input data or don't use them at all. Therefore, for now we leave them as they are and allow models to convert them to tensors as needed.	2025-03-20 13:28:13 -07:00
Jesse Gross	0c220935bd	input: Rename Options to Batch Options is no longer very descriptive of this struct.	2025-03-20 13:28:13 -07:00
Jesse Gross	b078dd157c	gemma2: Remove second call to Rows Looks like a merge conflict that broke the model.	2025-03-19 17:28:49 -07:00
Jeffrey Morgan	da0e345200	ml: use input context for extracting outputs (#9875)	2025-03-18 18:08:19 -07:00
Jesse Gross	282bfaaa95	ollamarunner: Use a separate context per multimodal input Currently there is a single context per sequence, shared by all multimodal inputs. Since we build a vision encoder graph per image, with a large number of inputs we can eventually hit the maximum number of graph nodes per context. This changes to use a separate context for each image, ensuring that available resource limits are consistent.	2025-03-14 15:38:54 -07:00
Jesse Gross	9679f40146	ml: Allow models to constrain inputs to a single batch Models may require that a set of inputs all be processed as part of the same batch. For example, if an image has multiple patches with fully connected attention between them, we should not split the batch in the middle of an image. Fixes #9697	2025-03-14 15:38:54 -07:00
Michael Yang	3e102b7dad	Update model/model.go Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>	2025-03-13 13:11:52 -07:00
Michael Yang	5e2e0b46b1	fix: error if image requested without vision model	2025-03-13 10:52:09 -07:00
Bruce MacDonald	a70820daa0	models/gemma3: remove final logit softcap (#9692) Softcap isn't in the whitepaper/implementation for the language model so we should remove it. There is no discernible difference in output with it removed.	2025-03-12 10:17:57 -07:00
jmorganca	83f0ec8269	all: address linter errors	2025-03-11 14:49:20 -07:00
jmorganca	fb4664fcec	model: add more spm tokenizer tests	2025-03-11 14:49:20 -07:00
jmorganca	20e3593863	model: validate left and right pairs before merging them	2025-03-11 14:49:20 -07:00
Michael Yang	63a394068c	use 2d pooling	2025-03-11 14:49:20 -07:00
jmorganca	11bfa62796	add trailing \n\n after <end_of_image> to match reference implementation	2025-03-11 14:49:20 -07:00
jmorganca	f63e62e546	reduce kernel size, add TODO for loading from config	2025-03-11 14:49:20 -07:00
jmorganca	65b0f329d1	Revert "Allow models to force a new batch" This reverts commit c7eae586b899083acebcd9b3847b89ea78c2850c.	2025-03-11 14:49:20 -07:00
Jesse Gross	06007c0a18	Allow models to force a new batch This is useful for a few things: - Work around bugs, such as having 2 images in one batch - Keep the image in a single batch for fully connected attention - Improve performance by not evaluating embeddings multiple times	2025-03-11 14:49:20 -07:00
Jesse Gross	a8e83a7654	Disable causal attention based on batch index Currently we are using positions, which are relative to a sequence and may not be unique.	2025-03-11 14:49:20 -07:00
Jesse Gross	2c40c4d35e	Fix follow up images and images split across batches	2025-03-11 14:49:19 -07:00
Michael Yang	e95278932b	use non-causal mask only for image positions	2025-03-11 14:49:19 -07:00
Michael Yang	9d2a20a763	use non-causal mask for inputs with images	2025-03-11 14:49:19 -07:00
Michael Yang	6b32a2d549	compat with upstream gguf	2025-03-11 14:49:19 -07:00


84 Commits