There are two bugs when using `/load <model>` for a model that doesn't exist, namely:
1. it will not restore the current model settings if the current model is a thinking model; and
2. it will crash if the current model is a non-thinking model
This bug fix saves the current runOptions and then restores them if the model load
doesn't happen. It also fixes the crash happening for non-thinking models.
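
The save-and-restore approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual change: the `runOptions` fields, `loadModel`, and `switchModel` are hypothetical stand-ins for the real CLI state and loader.

```go
package main

import (
	"errors"
	"fmt"
)

// runOptions is a hypothetical stand-in for the interactive session state.
type runOptions struct {
	Model string
	Think *bool // nil means "not set"; only meaningful for thinking models
}

// loadModel is a hypothetical loader that fails for unknown models.
func loadModel(name string, known map[string]bool) error {
	if !known[name] {
		return errors.New("model not found: " + name)
	}
	return nil
}

// switchModel tries to load name. If the load fails, the options that were
// in effect before the attempt are put back.
func switchModel(opts *runOptions, name string, known map[string]bool) error {
	saved := *opts // shallow copy of the struct taken before any mutation
	opts.Model = name
	opts.Think = nil // per-model settings are reset before the load attempt
	if err := loadModel(name, known); err != nil {
		*opts = saved // the load did not happen, so restore the snapshot
		return err
	}
	return nil
}

func main() {
	think := true
	opts := &runOptions{Model: "qwen3", Think: &think}
	err := switchModel(opts, "does-not-exist", map[string]bool{"qwen3": true})
	fmt.Println(err, opts.Model, *opts.Think)
}
```

Keeping `Think` as a pointer lets the nil case cover non-thinking models without a dereference, which is the kind of crash the second bug describes.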
* auth: fix problems with the ollama keypairs
This change adds several fixes including:
- reading in the pubkey files correctly
- fixing the push unit test to create a keypair file in a temp directory
- not returning 500 errors for normal status errors
* bf16
* tests
* gpt-oss
* enable gptoss for engine
* rough estimate
* convert to mxfp4
* handle safetensors U8
* clamp glu/linear
* update tokenizer
* MXFP4 support
This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.
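
For readers unfamiliar with the format, here is a rough Go sketch of dequantizing one MXFP4 block. Per the MX specification a block holds 32 FP4 (E2M1) elements that share one E8M0 scale equal to 2^(e-127). The nibble packing order (low nibble first) and the function names are assumptions made for illustration and may not match the layout used by the backends. The reserved NaN scale encoding (255) is not handled.

```go
package main

import (
	"fmt"
	"math"
)

// e2m1 lists the magnitudes encoded by the low three bits of an FP4 (E2M1) code.
var e2m1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// decodeFP4 converts a 4-bit E2M1 code (bit 3 is the sign) to float32.
func decodeFP4(code byte) float32 {
	v := e2m1[code&0x7]
	if code&0x8 != 0 {
		return -v
	}
	return v
}

// decodeBlock dequantizes one block: a shared E8M0 scale and 32 FP4 codes
// packed two per byte. Low-nibble-first ordering is an assumption.
func decodeBlock(scale byte, packed [16]byte) [32]float32 {
	s := float32(math.Ldexp(1, int(scale)-127)) // E8M0: 2^(e-127)
	var out [32]float32
	for i, b := range packed {
		out[2*i] = decodeFP4(b&0x0F) * s
		out[2*i+1] = decodeFP4(b>>4) * s
	}
	return out
}

func main() {
	var packed [16]byte
	packed[0] = 0x72 // low nibble 0x2 (1.0), high nibble 0x7 (6.0)
	out := decodeBlock(128, packed) // scale 2^(128-127) = 2
	fmt.Println(out[0], out[1])
}
```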
* Unit tests for MXFP4 support
This exercises various operations and shapes on both CPU and GPU (if detected
on the system)
* cuda graph
* unit test adjustments
* cuda: optimize memory access
Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
* mac: fix crash on old macos versions
cblas_sgemm is only supported on v13.3 and up, however bf16 is
only supported on v14+ so we were falling back to ggml-blas and
crashing on bf16 tensors. Checking for the function being null
seems to be the simplest way to conditionally avoid registering the
backend.
* server: Minimum context length for gptoss
This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.
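
A minimal sketch of the clamping behavior described above. The function name and the `"gptoss"` architecture string are hypothetical; only the 8192 minimum comes from the text.

```go
package main

import "fmt"

// gptossMinContext is the minimum context length stated in the commit message.
const gptossMinContext = 8192

// effectiveContext returns the context length actually used. For the
// (hypothetically named) "gptoss" architecture, values below the minimum
// are raised to it without reporting an error; higher values pass through.
func effectiveContext(arch string, requested int) int {
	if arch == "gptoss" && requested < gptossMinContext {
		return gptossMinContext
	}
	return requested
}

func main() {
	fmt.Println(effectiveContext("gptoss", 4096))
	fmt.Println(effectiveContext("gptoss", 16384))
	fmt.Println(effectiveContext("llama", 4096))
}
```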
* ggml: Multiply by numParallel for gptoss sliding window
When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.
* gpt-oss integration
includes harmony parser and thinking levels, etc.
* fix sync
* fix tests
* fix lint
---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
- Both `/api/generate` and `/api/chat` now accept a `"think"`
option that allows specifying whether thinking mode should be on or
not
- Templates get passed this new option so, e.g., qwen3's template can
put `/think` or `/no_think` in the system prompt depending on the
value of the setting
- Models' thinking support is inferred by inspecting model templates.
The prefix and suffix the parser uses to identify thinking blocks are
also automatically inferred from templates
- Thinking control & parsing is opt-in via the API to prevent breaking
existing API consumers. If the `"think"` option is not specified, the
behavior is unchanged from previous versions of ollama
- Add parsing for thinking blocks in both streaming/non-streaming mode
in both `/generate` and `/chat`
- Update the CLI to make use of these changes. Users can pass `--think`
or `--think=false` to control thinking, or during an interactive
session they can use the commands `/set think` or `/set nothink`
- A `--hidethinking` option has also been added to the CLI. This makes
it easy to use thinking in scripting scenarios like
`ollama run qwen3 --think --hidethinking "my question here"` where you
just want to see the answer but still want the benefits of thinking
models
* Move quantization logic to GGML via new backend
This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
* Remove "add model quantizations"
This is no longer needed now that quantization is implemented in Go+GGML code directly.
* increase default context length to 4096
We lower the default numParallel from 4 to 2 and use these "savings" to
double the default context length from 2048 to 4096.
We're memory neutral in cases when we previously would've used
numParallel == 4, but we add the following mitigation to handle some
cases where we would have previously fallen back to 1x2048 due to low
VRAM: we decide between 2048 and 4096 using a runtime check, choosing
2048 if we're on a one GPU system with total VRAM of <= 4 GB. We
purposefully don't check the available VRAM because we don't want the
context window size to change unexpectedly based on the available VRAM.
We plan on making the default even larger, but this is a relatively
low-risk change we can make to quickly double it.
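
The runtime check described above might look roughly like this. The function name and the slice-of-totals input are assumptions for the sketch. Note that it looks only at total VRAM, never at available VRAM, and that a system with no GPUs falls through to 4096 here because the text only singles out the one-GPU case.

```go
package main

import "fmt"

const gib = 1 << 30

// defaultContextLength picks the default context window from the total VRAM
// of each detected GPU (in bytes). It returns 2048 only on a single-GPU
// system whose total VRAM is at most 4 GiB, and 4096 otherwise.
func defaultContextLength(totalVRAM []uint64) int {
	if len(totalVRAM) == 1 && totalVRAM[0] <= 4*gib {
		return 2048
	}
	return 4096
}

func main() {
	fmt.Println(defaultContextLength([]uint64{4 * gib}))
	fmt.Println(defaultContextLength([]uint64{8 * gib}))
	fmt.Println(defaultContextLength([]uint64{4 * gib, 4 * gib}))
}
```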
* fix tests
add an explicit context length so they don't get truncated. The code
that converts -1 from being a signal for doing a runtime check isn't
running as part of these tests.
* tweak small gpu message
* clarify context length default
also make it actually show up in `ollama serve --help`
This commit adds retry/backoff to the registry client for pull requests.
Also, revert progress indication to match the original client's until we can
"get it right."
Also, make WithTrace wrap existing traces instead of clobbering them.
This allows clients to compose traces.
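
A generic sketch of retry with exponential backoff, for illustration only. It does not reproduce the registry client's actual policy; `backoffDelays`, `retry`, and `errTemporary` are hypothetical names, and the delay values are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTemporary = errors.New("temporary failure")

// backoffDelays returns the wait before each retry: base, 2*base, 4*base, ...
// with every entry capped at limit. retries must not be negative.
func backoffDelays(base, limit time.Duration, retries int) []time.Duration {
	delays := make([]time.Duration, 0, retries)
	d := base
	for i := 0; i < retries; i++ {
		if d > limit {
			d = limit
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// retry calls fn once, then once more after each delay until fn succeeds or
// the delays run out. A nil sleep falls back to time.Sleep.
func retry(fn func() error, delays []time.Duration, sleep func(time.Duration)) error {
	if sleep == nil {
		sleep = time.Sleep
	}
	err := fn()
	for _, d := range delays {
		if err == nil {
			return nil
		}
		sleep(d)
		err = fn()
	}
	return err
}

func main() {
	calls := 0
	delays := backoffDelays(10*time.Millisecond, 80*time.Millisecond, 4)
	err := retry(func() error {
		calls++
		if calls < 3 {
			return errTemporary
		}
		return nil
	}, delays, nil)
	fmt.Println(calls, err)
}
```

Passing the sleep function as a parameter keeps the loop testable without real waiting.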
With support for multimodal models becoming more varied and common, it is important for clients to be able to easily see what capabilities a model has. Returning these from the show endpoint will allow clients to easily see what a model can do.
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
- output backend system info when initializing the backend. This ensures
the information is always present without needing to be requested
explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name
* Include unified vision layers in memory prediction
For newer vision models with a single gguf, include
the projection estimates.
* Adjust CLI to handle both styles of vision model metadata
* Wire up new tokenizers for new engine
If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp. This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.
This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.
* Lay foundation for auto selection of new engine
This provides integration with the new Ollama engine
(5824541 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.
In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models
Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:
Start the server with the OLLAMA_NEW_ENGINE environment variable set:
    OLLAMA_NEW_ENGINE=1 ./ollama serve
Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
    ./ollama run jessegross/llama3.1
* llama: wire up builtin runner
This adds a new entrypoint into the ollama CLI to run the cgo built runner.
On Mac arm64, this will have GPU support, but on all other platforms it will
be the lowest common denominator CPU build. After we fully transition
to the new Go runners more tech-debt can be removed and we can stop building
the "default" runner via make and rely on the builtin always.
* build: Make target improvements
Add a few new targets and help for building locally.
This also adjusts the runner lookup to favor local builds, then
runners relative to the executable, and finally payloads.
* Support customized CPU flags for runners
This implements a simplified custom CPU flags pattern for the runners.
When built without overrides, the runner name contains the vector flag
we check for (AVX) to ensure we don't try to run on unsupported systems
and crash. If the user builds a customized set, we omit the naming
scheme and don't check for compatibility. This avoids checking
requirements at runtime, so that logic has been removed as well. This
can be used to build GPU runners with no vector flags, or CPU/GPU
runners with additional flags (e.g. AVX512) enabled.
* Use relative paths
If the user checks out the repo in a path that contains spaces, make gets
really confused so use relative paths for everything in-repo to avoid breakage.
* Remove payloads from main binary
* install: clean up prior libraries
This removes support for v0.3.6 and older versions (before the tar bundle)
and ensures we clean up prior libraries before extracting the bundle(s).
Without this change, runners and dependent libraries could leak when we
update and lead to subtle runtime errors.
After a user pushes their model it is not clear what to do next. Add a link
to the output of `ollama push` that tells the user where their model can now
be found.
In the past the ollama.com server would return a JWT that contained
information about the user being authenticated. This was used to return
different error messages to the user. This is no longer possible since the
token used to authenticate does not contain information about the user
anymore. Remove this code, which no longer works.
Follow up changes will improve the error messages returned here, but good to
clean up first.
This avoids emitting the progress indicators to stderr, and the interactive
prompts to the output file or pipe. Running "ollama run model > out.txt"
now exits immediately, and "echo hello | ollama run model > out.txt"
produces zero stderr output and a typical response in out.txt