Commit Graph

4271 Commits

Author SHA1 Message Date
Blake Mizerany 8294676150
server/internal/client/ollama: set User-Agent for registry client (#9775)
This sets the agent header in DefaultRegistry to include the version of
the client, OS, and architecture in the previous format, with a minor
twist.

Note: The version is obtained from the build info, instead of the
version in version.Version, which should not longer be necessary, but we
can remove in a future commit. Using the build info is more accurate and
also provides extra build information if the build is not tagged, and if
it is "dirty". Previously, the version was just "0.0.0" with no other
helpful information. The ollama.com registry and others handle this
swimmingly.
2025-03-14 18:33:07 -07:00
Patrick Devine ef378ad673
gemma3 quantization (#9776) 2025-03-14 17:41:07 -07:00
Daniel Hiltgen 2d2247e59e
Align versions for local builds (#9635)
Darwin was using a different pattern for the version string
than linux or windows.
2025-03-14 15:44:08 -07:00
Jesse Gross 7bf793a600 gemma3: Allow multiple image in a single input
Previously processing multiple images in a batch would trigger
segfaults so sending images together was disabled as a way to
mitigate this. The trigger was processing one image on the CPU
and one on the GPU.

This can no longer happen:
 - The vision encoder is now on the GPU so both images would be
   processed on the GPU.
 - We require images to be fully contained in a batch and each
   image including its special tokens is over half the batch size.
   As a result, we will never get two images in the same batch.

Fixes #9731
2025-03-14 15:38:54 -07:00
Jesse Gross 282bfaaa95 ollamarunner: Use a separate context per multimodal input
Currently there is a single context per sequence, shared all by
all multimodal inputs. Since we build a vision encoder graph per
image, with a large number of inputs we can eventually hit the
maximum number of graph nodes per context.

This changes to use a separate context for each image, ensuring
that available resource limits are consistent.
2025-03-14 15:38:54 -07:00
Jesse Gross 9679f40146 ml: Allow models to constrain inputs to a single batch
Models may require that a set of inputs all be processed as part
of the same batch. For example, if an image has multiple patches
with fully connected attention between them, we should not split
the batch in the middle of an image.

Fixes #9697
2025-03-14 15:38:54 -07:00
Bruce MacDonald 3892c3a703
llm: remove internal subprocess req and resp types (#9324)
This commit refactors the LLM subsystem by removing internal subprocess
request and response types. It consolidates duplicate type definitions
across the codebase, moving them to centralized locations. The change also
standardizes interfaces between components, simplifies the ServerStatusResp
struct, and moves the ParseDurationMs function to a common package. This
cleanup reduces code duplication between different runner implementations
(llamarunner and ollamarunner).
2025-03-14 15:21:53 -07:00
Blake Mizerany 4e320b8b90
server/internal/chunks: remove chunks package (#9755) 2025-03-14 08:57:59 -07:00
Blake Mizerany eb2b22b042
server/internal/client: use chunksums for concurrent blob verification (#9746)
Replace large-chunk blob downloads with parallel small-chunk
verification to solve timeout and performance issues. Registry users
experienced progressively slowing download speeds as large-chunk
transfers aged, often timing out completely.

The previous approach downloaded blobs in a few large chunks but
required a separate, single-threaded pass to read the entire blob back
from disk for verification after download completion.

This change uses the new chunksums API to fetch many smaller
chunk+digest pairs, allowing concurrent downloads and immediate
verification as each chunk arrives. Chunks are written directly to their
final positions, eliminating the entire separate verification pass.

The result is more reliable downloads that maintain speed throughout the
transfer process and significantly faster overall completion, especially
over unstable connections or with large blobs.
2025-03-13 22:18:29 -07:00
Michael Yang 4ea4d2b189
Merge pull request #9703 from ollama/mxyng/gemma3-memory
count gemma3 vision tensors
2025-03-13 16:56:34 -07:00
Michael Yang 8d76fa23ef count non-repeating vision layers 2025-03-13 16:53:29 -07:00
Bradley Erickson 74b44fdf8f
docs: Add OLLAMA_ORIGINS for browser extension support (#9643) 2025-03-13 16:35:20 -07:00
Michael Yang 65b88c544f fix divide by zero 2025-03-13 16:35:00 -07:00
Michael Yang a422ba39c9 roughly count gemma3 graph
the largest operation is by far (q @ k) so just count that for
simplicity
2025-03-13 16:35:00 -07:00
Michael Yang d2ec22371e count all vision tensors 2025-03-13 16:35:00 -07:00
Michael Yang 033cec232a count gemma3 vision tensors 2025-03-13 16:34:42 -07:00
Michael Yang 543240fb5f
Merge pull request #9741 from ollama/mxyng/visionless
fix: error if image requested without vision model
2025-03-13 15:03:25 -07:00
Patrick Devine 4bed739259
add verbose mode to the show command (#9640)
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
2025-03-13 14:24:27 -07:00
Patrick Devine 80c7ce381b
fix: change default context size for gemma3 (#9744) 2025-03-13 13:59:19 -07:00
Michael Yang ccfd41c4f0
Merge pull request #9742 from ollama/mxyng/engine-error-embeddings
fix: error on models that don't support embeddings
2025-03-13 13:12:33 -07:00
Michael Yang 3e102b7dad
Update model/model.go
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
2025-03-13 13:11:52 -07:00
Michael Yang ec46f3286c engine: error on embeddings; not currently implemented 2025-03-13 11:40:55 -07:00
Michael Yang 5e2e0b46b1 fix: error if image requested without vision model 2025-03-13 10:52:09 -07:00
Michael Yang 45a13b1dec
Merge pull request #9688 from Shane-XB-Qian/debug_mistype_lld
ollama-debug.c: correct mistype
2025-03-13 10:12:44 -07:00
Parth Sareen 5c0b663969
sample: separate softmax and temperature transforms (#9732) 2025-03-13 09:53:27 -07:00
shane.xb.qian 30d7a59ba8 ollama-debug.c: change 'ld' to 'PRIi64'
* macOS has different definition per info from @mxyng
2025-03-13 17:10:37 +08:00
ParthSareen 4aeb67ef4c sample: do all sorting in topK 2025-03-12 11:59:17 -07:00
ParthSareen 3ba91634c1 sample: simplify top_k=0 sorting 2025-03-12 11:59:17 -07:00
ParthSareen 1b7433b71e sample: use container/heap for top_k 2025-03-12 11:59:17 -07:00
Bruce MacDonald a70820daa0
models/gemma3: remove final logit softcap (#9692)
Softcap isn't in the whitepaper/implementation for the language model so we should remove it. There is no discernible difference in output with it removed.
2025-03-12 10:17:57 -07:00
Shane-XB-Qian 6b45b1d6b4
cli: adding support ctrl-n/p like general cli (#9136)
Signed-off-by: shane.xb.qian <shane.qian@foxmail.com>
2025-03-12 08:51:56 -07:00
shane.xb.qian 85ab552028 ollama-debug.c: correct mistype
Signed-off-by: shane.xb.qian <shane.qian@foxmail.com>
2025-03-12 22:32:30 +08:00
frob b3af953a55
cli: don't exit for invalid model during /load. (#9576)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-03-11 23:42:53 -07:00
Michael ad4e0bf3be
Adding Gemma 3 to readme (#9671) 2025-03-12 07:39:25 +01:00
Michael Yang aee28501b5
Merge pull request #9661 from ollama/gemma
engine: add gemma support
2025-03-11 15:07:50 -07:00
jmorganca 83f0ec8269 all: address linter errors 2025-03-11 14:49:20 -07:00
jmorganca c6b6938b3a kvcache: fix tests by adding AvgPool2D stub 2025-03-11 14:49:20 -07:00
jmorganca fb4664fcec model: add more spm tokenizer tests 2025-03-11 14:49:20 -07:00
jmorganca 20e3593863 model: validate left and right pairs before merging them 2025-03-11 14:49:20 -07:00
Michael Yang 63a394068c use 2d pooling 2025-03-11 14:49:20 -07:00
Daniel Hiltgen ab39e08eb9 llm: auto detect models that require Ollama Engine (#1) 2025-03-11 14:49:20 -07:00
jmorganca 11bfa62796 add trailing \n\n after <end_of_image> to match reference implementation 2025-03-11 14:49:20 -07:00
jmorganca f63e62e546 reduce kernel size, add TODO for loading from config 2025-03-11 14:49:20 -07:00
jmorganca 65b0f329d1 Revert "Allow models to force a new batch"
This reverts commit c7eae586b899083acebcd9b3847b89ea78c2850c.
2025-03-11 14:49:20 -07:00
Jesse Gross 06007c0a18 Allow models to force a new batch
This is useful for a few things:
 - Work around bugs, such as having 2 images in one batch
 - Keep the image in a single batch for fully connected attention
 - Improve performance by not evaluating embeddings multiple times
2025-03-11 14:49:20 -07:00
Jesse Gross a8e83a7654 Disable causal attention based on batch index
Currently we are using positions, which are relative to a
sequence and may not be unique.
2025-03-11 14:49:20 -07:00
Jesse Gross 475005504e Restrict Gemma to a single image per request 2025-03-11 14:49:20 -07:00
Jesse Gross 2c40c4d35e Fix follow up images and images split across batches 2025-03-11 14:49:19 -07:00
Michael Yang e95278932b use non-causal mask only for image positions 2025-03-11 14:49:19 -07:00
Michael Yang 9d2a20a763 use non-causal mask for inputs with images 2025-03-11 14:49:19 -07:00