42 Commits

Author SHA1 Message Date
Michael Yang fa7776fd24
gpt-oss (#11672)
* bf16

* tests

* gpt-oss

* enable gptoss for engine

* rough estimate

* convert to mxfp4

* handle safetensors U8

* clamp glu/linear

* update tokenizer

* MXFP4 support

This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.
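
As a rough illustration of what the format stores, the sketch below dequantizes one MX block in Go: 32 FP4 (E2M1) elements sharing a single E8M0 power-of-two scale, as defined by the OCP Microscaling specification. The function names and the nibble packing order (low nibble first) are assumptions made for this example and may not match the layout the ggml backends actually use.

```go
package main

import (
	"fmt"
	"math"
)

// e2m1 lists the eight non-negative magnitudes an FP4 (E2M1) code can encode.
var e2m1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// decodeFP4 converts one 4-bit code (sign bit plus 3-bit magnitude index) to float32.
func decodeFP4(code byte) float32 {
	v := e2m1[code&0x7]
	if code&0x8 != 0 {
		return -v
	}
	return v
}

// dequantizeBlock expands one MX block: a shared E8M0 scale byte and
// 16 bytes holding 32 FP4 codes.
// ASSUMPTION: element 2i sits in the low nibble of byte i and element 2i+1
// in the high nibble; the real tensor layout may differ.
func dequantizeBlock(scale byte, packed [16]byte) [32]float32 {
	// E8M0 is a pure exponent with bias 127; 0xFF encodes NaN.
	s := float32(math.NaN())
	if scale != 0xFF {
		s = float32(math.Ldexp(1, int(scale)-127))
	}
	var out [32]float32
	for i, b := range packed {
		out[2*i] = s * decodeFP4(b&0x0F)
		out[2*i+1] = s * decodeFP4(b>>4)
	}
	return out
}

func main() {
	var packed [16]byte
	packed[0] = 0xF2                    // low nibble 0x2 is 1.0, high nibble 0xF is -6.0
	out := dequantizeBlock(128, packed) // scale = 2^(128-127) = 2
	fmt.Println(out[0], out[1])
}
```

Each block therefore costs 17 bytes for 32 values, which is where the format's roughly 4.25 bits per weight comes from.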

* Unit tests for MXFP4 support

This exercises various operations and shapes on both CPU and GPU (if detected
on the system)

* cuda graph

* unit test adjustments

* cuda: optimize memory access

Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

cblas_sgemm is only supported on v13.3 and up, however bf16 is
only supported on v14+ so we were falling back to ggml-blas and
crashing on bf16 tensors.  Checking for the function being null
seems to be the simplest way to conditionally avoid registering the
backend.

* server: Minimum context length for gptoss

This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.
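
The reset behaviour described above amounts to a simple clamp. The sketch below is illustrative only: `effectiveNumCtx` and `minGptossContext` are hypothetical names, not identifiers from the server code.

```go
package main

import "fmt"

// minGptossContext is the minimum context length stated in the commit message.
const minGptossContext = 8192

// effectiveNumCtx is a hypothetical helper: requests below the minimum are
// raised to it without an error, larger requests pass through unchanged.
func effectiveNumCtx(requested int) int {
	if requested < minGptossContext {
		return minGptossContext
	}
	return requested
}

func main() {
	fmt.Println(effectiveNumCtx(2048))  // raised to the minimum
	fmt.Println(effectiveNumCtx(16384)) // kept as requested
}
```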

* ggml: Multiply by numParallel for gptoss sliding window

When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.
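
A minimal sketch of the accounting, assuming a hypothetical helper `kvCacheTokens` (not a function from the codebase): full-attention layers use the context size as given, since it already includes the `numParallel` factor, while sliding-window layers must multiply their fixed window explicitly.

```go
package main

import "fmt"

// kvCacheTokens estimates how many token slots one layer's cache needs.
// ASSUMPTION: numCtx has already been multiplied by numParallel by the caller,
// and slidingWindow == 0 marks a full-attention layer.
func kvCacheTokens(numCtx, numParallel, slidingWindow int) int {
	if slidingWindow > 0 {
		// The window is a fixed per-sequence size, so scale it here.
		return slidingWindow * numParallel
	}
	return numCtx
}

func main() {
	numParallel := 4
	numCtx := 8192 * numParallel
	fmt.Println(kvCacheTokens(numCtx, numParallel, 0))   // full-attention layer
	fmt.Println(kvCacheTokens(numCtx, numParallel, 128)) // sliding-window layer
}
```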

* gpt-oss integration

includes harmony parser and thinking levels, etc.

* fix sync

* fix tests

* fix lint

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
2025-08-05 12:21:16 -07:00
Michael Yang 73b642e6f3
add new gemma model (#11204)
* update patches

* cherry pick metal mean kernel

* cherry pick cuda mean kernel

* gemma3n
2025-06-25 21:47:09 -07:00
Michael Yang 333e360422
model: handle multiple eos tokens (#10577)
* get eos_token_id from generation_config.json

* refactor

* include both ids and strings in trace

* comments

* remove special case for gemma3 special vocab (#10743)
2025-05-16 13:40:23 -07:00
Bruce MacDonald 0aa8b371dd
model: add Qwen2.5-VL support (#10385) 2025-05-13 20:58:02 -07:00
Michael Yang 23125648b8
chore: update mllama to use ollama engine (#10637) 2025-05-13 17:36:02 -07:00
Daniel Hiltgen 424810450f
Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-05-06 11:20:48 -07:00
Michael Yang f0c66e6dea llama4 2025-04-25 16:59:20 -07:00
Michael Yang 4892872c18 convert: change to colmajor 2025-04-25 15:27:39 -07:00
Bruce MacDonald 6bd0a983cd model: support for mistral-small in the ollama runner
Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
2025-04-03 16:57:36 -07:00
Bruce MacDonald 61a8825216
convert: return name of unsupported architecture (#9862)
When a model's architecture cannot be converted, return the name of the unsupported arch in the error message.
2025-03-18 10:38:28 -07:00
jmorganca 83f0ec8269 all: address linter errors 2025-03-11 14:49:20 -07:00
Patrick Devine c62861f4fa fix conversion 2025-03-11 14:49:18 -07:00
Michael Yang 4b037a97dc add gemma vision encoder 2025-03-11 14:49:17 -07:00
Patrick Devine 5f74d1fd47 gemma2 impl 2025-03-11 14:35:08 -07:00
Michael Yang 58245413f4
next ollama runner (#7913)
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It adds three high-level interfaces and a number of smaller helper interfaces to support them.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`.
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc.) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`.
- `ml.Tensor` defines the interface for a tensor and tensor operations.
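
To make the division of responsibilities concrete, here is a toy sketch of how three such interfaces could fit together. Every method signature below is an assumption invented for illustration; the actual definitions live in `model/model.go` and `ml/backend.go` and may look quite different.

```go
package main

import "fmt"

// Tensor is a hypothetical, minimal stand-in for ml.Tensor.
type Tensor interface {
	Shape() []int
	Add(other Tensor) Tensor
}

// Backend is a hypothetical stand-in for ml.Backend: it hands out loaded tensors by name.
type Backend interface {
	Get(name string) Tensor
}

// Model is a hypothetical stand-in for model.Model: Forward runs one forward pass.
type Model interface {
	Forward(b Backend, input Tensor) (Tensor, error)
}

// vec is a toy one-dimensional tensor kept in host memory.
type vec []float32

func (v vec) Shape() []int { return []int{len(v)} }

// Add returns the element-wise sum; it assumes both operands are vecs of equal length.
func (v vec) Add(other Tensor) Tensor {
	o := other.(vec)
	out := make(vec, len(v))
	for i := range v {
		out[i] = v[i] + o[i]
	}
	return out
}

// mapBackend is a toy backend that "loads" tensors into a map.
type mapBackend map[string]Tensor

func (m mapBackend) Get(name string) Tensor { return m[name] }

// biasModel is a toy architecture whose forward pass adds a stored bias.
type biasModel struct{}

func (biasModel) Forward(b Backend, input Tensor) (Tensor, error) {
	bias := b.Get("bias")
	if bias == nil {
		return nil, fmt.Errorf("tensor %q not loaded", "bias")
	}
	return input.Add(bias), nil
}

func main() {
	var m Model = biasModel{}
	out, err := m.Forward(mapBackend{"bias": vec{1, 2}}, vec{10, 20})
	fmt.Println(out, err)
}
```

The model only ever talks to the `Backend` and `Tensor` interfaces, which is what lets the same architecture code run on whichever tensor library implements them.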

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-02-13 16:31:21 -08:00
Josh 93a8daf285
convert: import support for command-r models from safetensors (#6063)
---------

Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-01-15 16:31:22 -08:00
Bruce MacDonald f6f3713001
convert: qwen2 from safetensors (#8408)
Add native support for converting Qwen2 family models (including Qwen2.5)
from safetensors to GGUF format so they can be run.
2025-01-14 10:34:37 -08:00
Patrick Devine 84b84ce2db
catch when model vocab size is set correctly (#6714) 2024-09-09 17:18:54 -07:00
Patrick Devine 0c819e167b
convert safetensor adapters into GGUF (#6327) 2024-08-23 11:29:56 -07:00
Michael Yang 3546bbd08c convert gemma2 2024-08-20 17:27:51 -07:00
Michael Yang 5a28b9cf5f bert 2024-08-20 17:27:34 -07:00
Michael Yang 6ffb5cb017 add conversion for microsoft phi 3 mini/medium 4k, 128 2024-08-12 15:13:29 -07:00
Michael Yang eafc607abb convert: only extract large files 2024-07-31 15:58:55 -07:00
Michael Yang df993fa37b comments 2024-07-31 15:58:55 -07:00
Michael Yang 5e9db9fb0b refactor convert 2024-07-31 15:58:33 -07:00
Michael Yang e40145a39d lint 2024-06-04 11:13:30 -07:00
Michael Yang bbbd9f20f3 cleanup 2024-05-20 16:13:57 -07:00
Michael Yang 547132e820 bpe pretokenizer 2024-05-20 16:13:57 -07:00
Patrick Devine d355d2020f add fixes for llama 2024-05-20 16:13:57 -07:00
Patrick Devine c8cf0d94ed llama3 conversion 2024-05-20 16:13:57 -07:00
Patrick Devine d88582dffd some changes for llama3 2024-05-20 16:13:57 -07:00
Michael Yang 9685c34509 quantize any fp16/fp32 model
- FROM /path/to/{safetensors,pytorch}
- FROM /path/to/fp{16,32}.bin
- FROM model:fp{16,32}
2024-05-06 15:24:01 -07:00
Patrick Devine ce8ce82567
add mixtral 8x7b model conversion (#3859) 2024-04-23 20:17:04 -07:00
Patrick Devine 9f8691c6c8
Add llama2 / torch models for `ollama create` (#3607) 2024-04-15 11:26:42 -07:00
Michael Yang be517e491c no rope parameters 2024-04-05 18:05:27 -07:00
Patrick Devine 3b6a9154dd
Simplify model conversion (#3422) 2024-04-01 16:14:53 -07:00
Patrick Devine 5a5efee46b
Add gemma safetensors conversion (#3250)
Co-authored-by: Michael Yang <mxyng@pm.me>
2024-03-28 18:54:01 -07:00
Patrick Devine 1b272d5bcd
change `github.com/jmorganca/ollama` to `github.com/ollama/ollama` (#3347) 2024-03-26 13:04:17 -07:00
Michael Yang 9ea492f1ce convert: fix shape 2024-03-11 09:41:01 -07:00
Michael Yang 76bdebbadf decode ggla 2024-03-08 15:46:25 -08:00
Michael Yang 18979ad4a1 convert: fix default shape 2024-03-08 15:42:48 -08:00
Patrick Devine 2c017ca441
Convert Safetensors to an Ollama model (#2824) 2024-03-06 21:01:51 -08:00