Commit Graph

29 Commits

Author SHA1 Message Date
Jesse Gross 94ab428e3f ggml: Seperate tensor load from backend creation
Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Loading tensor data

This allows more flexibility in managing model loading.
2025-05-19 09:54:22 -07:00
Bruce MacDonald bd68d3ae50
ggml: update qwen25vl vision size estimate (#10711) 2025-05-14 16:42:30 -07:00
Bruce MacDonald 0aa8b371dd
model: add Qwen2.5-VL support (#10385) 2025-05-13 20:58:02 -07:00
Michael Yang 23125648b8
chore: update mllama to use ollama engine (#10637) 2025-05-13 17:36:02 -07:00
Daniel Hiltgen 424810450f
Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-05-06 11:20:48 -07:00
Jesse Gross 7073600797 ggml: Reduce log level of "key not found"
Most of the time this is not an error.
2025-05-05 11:17:32 -07:00
Michael Yang f0ad49ea17 memory 2025-04-25 16:59:20 -07:00
Michael Yang f0c66e6dea llama4 2025-04-25 16:59:20 -07:00
Michael Yang ced7d0e53d fix parameter count 2025-04-25 16:59:01 -07:00
Michael Yang a0dba0f8ae default slice values 2025-04-25 16:59:01 -07:00
Michael Yang 5e20b170a7 update comment 2025-04-25 16:59:01 -07:00
Michael Yang d26c18e25c fix token type 2025-04-25 16:59:01 -07:00
Michael Yang 8d376acc9b zero means zero
use a default of 1024 when asking for zero is confusing since most calls
seem to assume 0 means do not ready any data
2025-04-25 16:59:01 -07:00
Michael Yang 5d0279164c generic ggml.array 2025-04-25 16:59:01 -07:00
Bruce MacDonald 6bd0a983cd model: support for mistral-small in the ollama runner
Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
2025-04-03 16:57:36 -07:00
Jesse Gross f66216e399 ggml: Support heterogeneous KV cache layer sizes in memory estimation
Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.

Llama3.2-vision is also inconsistent between self attention and cross
attention layers - at moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size
and take this account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.

Fixes #9730
Fixes #9890
2025-03-26 13:16:03 -07:00
Michael Yang 4ea4d2b189
Merge pull request #9703 from ollama/mxyng/gemma3-memory
count gemma3 vision tensors
2025-03-13 16:56:34 -07:00
Michael Yang 8d76fa23ef count non-repeating vision layers 2025-03-13 16:53:29 -07:00
Michael Yang 65b88c544f fix divide by zero 2025-03-13 16:35:00 -07:00
Michael Yang a422ba39c9 roughly count gemma3 graph
the largest operation is by far (q @ k) so just count that for
simplicity
2025-03-13 16:35:00 -07:00
Michael Yang d2ec22371e count all vision tensors 2025-03-13 16:35:00 -07:00
Michael Yang 033cec232a count gemma3 vision tensors 2025-03-13 16:34:42 -07:00
Patrick Devine 4bed739259
add verbose mode to the show command (#9640)
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
2025-03-13 14:24:27 -07:00
Daniel Hiltgen ab39e08eb9 llm: auto detect models that require Ollama Engine (#1) 2025-03-11 14:49:20 -07:00
Patrick Devine 5f74d1fd47 gemma2 impl 2025-03-11 14:35:08 -07:00
Daniel Hiltgen 1fdb351c37
New engine: vision models and auto-fallback (#9113)
* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine
2025-03-04 09:03:46 -08:00
Michael Yang 53d2990d9b model: add bos token if configured 2025-02-27 21:04:59 +00:00
Michael Yang b16367b4b2 fix: add back bf16 support
this was accidentally removed when moving fs/ggml from its previous
location
2025-02-25 19:26:14 +00:00
Michael Yang 58245413f4
next ollama runner (#7913)
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-02-13 16:31:21 -08:00