Commit Graph

86 Commits

Author SHA1 Message Date
Michael Yang adff143bcd
fix: mllama quality (#10807)
* fix mllama convert

- transform attn_gate and ffn_gate
- swap attention heads for vision models

* fix mllama

the mlp gate which was applied in the wrong place
2025-05-22 11:30:49 -07:00
Jesse Gross 94ab428e3f ggml: Seperate tensor load from backend creation
Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Loading tensor data

This allows more flexibility in managing model loading.
2025-05-19 09:54:22 -07:00
Michael Yang 333e360422
model: handle multiple eos tokens (#10577)
* get eos_token_id from generation_config.json

* refactor

* include both ids and strings in trace

* comments

* remove special case for gemma3 special vocab (#10743)
2025-05-16 13:40:23 -07:00
Michael Yang 55760195e6
fix mllama conversion (#10716)
cross attention Q and K projections needs to have their heads swapped, similar to non-cross attention Q and K tensors
2025-05-15 12:15:01 -07:00
Bruce MacDonald 0aa8b371dd
model: add Qwen2.5-VL support (#10385) 2025-05-13 20:58:02 -07:00
Michael Yang 23125648b8
chore: update mllama to use ollama engine (#10637) 2025-05-13 17:36:02 -07:00
Michael Yang b585a58121
chore: remove unused ZipReader type (#10621) 2025-05-08 11:17:41 -07:00
Daniel Hiltgen 424810450f
Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-05-06 11:20:48 -07:00
湛露先生 7e5c8eee5c
file close check and close. (#10554)
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-05-04 15:37:59 -07:00
Michael Yang 7ba9fa9c7d fixes for maverick 2025-04-25 16:59:20 -07:00
Michael Yang 8bf11b84c1 chunked attention 2025-04-25 16:59:20 -07:00
Michael Yang f0c66e6dea llama4 2025-04-25 16:59:20 -07:00
Michael Yang dc1e81f027 convert: use -1 for read all 2025-04-25 16:59:01 -07:00
Michael Yang 4892872c18 convert: change to colmajor 2025-04-25 15:27:39 -07:00
Michael Yang 2fec73eef6 fix write gguf padding 2025-04-16 10:24:35 -07:00
Bruce MacDonald 6bd0a983cd model: support for mistral-small in the ollama runner
Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
2025-04-03 16:57:36 -07:00
Bruce MacDonald 9876c9faa4
chore(all): replace instances of interface with any (#10067)
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
2025-04-02 09:44:27 -07:00
Bruce MacDonald 61a8825216
convert: return name of unsupported architecture (#9862)
When a model's architecture cannot be converted return the name of the unsupported arch in the error message.
2025-03-18 10:38:28 -07:00
Patrick Devine 80c7ce381b
fix: change default context size for gemma3 (#9744) 2025-03-13 13:59:19 -07:00
jmorganca 83f0ec8269 all: address linter errors 2025-03-11 14:49:20 -07:00
Michael Yang 63a394068c use 2d pooling 2025-03-11 14:49:20 -07:00
Patrick Devine 2e54d72fc3 fix gemma3 1b conversion 2025-03-11 14:49:19 -07:00
Michael Yang 6b32a2d549 compat with upstream gguf 2025-03-11 14:49:19 -07:00
Michael Yang d368c039f0 skip repacking vision tensors 2025-03-11 14:49:19 -07:00
Patrick Devine 9b54267e69 fix configs 2025-03-11 14:49:19 -07:00
Michael Yang 46bb0169c4 update model 2025-03-11 14:49:19 -07:00
Patrick Devine c62861f4fa fix conversion 2025-03-11 14:49:18 -07:00
Michael Yang 0df1800436 set non-causal attention 2025-03-11 14:49:18 -07:00
Patrick Devine 631fecc6d9 temporary work around for converting spm 2025-03-11 14:49:18 -07:00
Michael Yang 4b037a97dc add gemma vision encoder 2025-03-11 14:49:17 -07:00
Patrick Devine 5f74d1fd47 gemma2 impl 2025-03-11 14:35:08 -07:00
Michael Yang 58245413f4
next ollama runner (#7913)
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-02-13 16:31:21 -08:00
Josh 93a8daf285
convert: import support for command-r models from safetensors (#6063)
---------

Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-01-15 16:31:22 -08:00
Bruce MacDonald f6f3713001
convert: qwen2 from safetensors (#8408)
Add native support for converting Qwen2 family models (including Qwen2.5)
from safetensors to gguf format so we can run it.
2025-01-14 10:34:37 -08:00
Stefan Weil abfdc4710f
all: fix typos in documentation, code, and comments (#7021) 2024-12-10 12:58:06 -08:00
Michael Yang 4456012956 fix unmarshaling merges 2024-12-04 09:21:56 -08:00
Patrick Devine c7cb0f0602
image processing for llama3.2 (#6963)
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Jesse Gross <jesse@ollama.com>
2024-10-18 16:12:35 -07:00
Patrick Devine 84b84ce2db
catch when model vocab size is set correctly (#6714) 2024-09-09 17:18:54 -07:00
Patrick Devine 608e87bf87
Fix gemma2 2b conversion (#6645) 2024-09-05 17:02:28 -07:00
Michael Yang 9cfd2dd3e3
Merge pull request #6522 from ollama/mxyng/detect-chat
detect chat template from configs that contain lists
2024-08-28 11:04:18 -07:00
Patrick Devine 6c1c1ad6a9
throw an error when encountering unsupport tensor sizes (#6538) 2024-08-27 17:54:04 -07:00
Michael Yang 60e47573a6 more tokenizer tests 2024-08-27 14:51:10 -07:00
Michael Yang eae3af6807 clean up convert tokenizer 2024-08-27 11:11:43 -07:00
Michael Yang 3eb08377f8 detect chat template from configs that contain lists 2024-08-27 10:49:33 -07:00
Patrick Devine 0c819e167b
convert safetensor adapters into GGUF (#6327) 2024-08-23 11:29:56 -07:00
Michael Yang 77903ab8b4 llama3.1 2024-08-21 11:49:31 -07:00
Michael Yang 3546bbd08c convert gemma2 2024-08-20 17:27:51 -07:00
Michael Yang 5a28b9cf5f bert 2024-08-20 17:27:34 -07:00
Bruce MacDonald aec77d6a05 support new "longrope" attention factor 2024-08-12 15:13:29 -07:00
Michael Yang 6ffb5cb017 add conversion for microsoft phi 3 mini/medium 4k, 128 2024-08-12 15:13:29 -07:00