Commit Graph

4271 Commits

Author SHA1 Message Date
Bruce MacDonald 0aa8b371dd
model: add Qwen2.5-VL support (#10385) 2025-05-13 20:58:02 -07:00
Michael Yang 23125648b8
chore: update mllama to use ollama engine (#10637) 2025-05-13 17:36:02 -07:00
tej 0478d440f0
Fixed over vram allcation dure to small initial layer sizes.
Co-authored-by: Tej Kiran <kiran.tej@amd.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Tej Kiran <itej89@gmailcom>
2025-05-13 16:42:39 -07:00
Parth Sareen 8cc33f4c2b
llama: fix memory leak for grammar (#10696) 2025-05-13 15:39:27 -07:00
Jeffrey Morgan f46df4e5d2
llama: fix defrag patch to defragment when no slots are available (#10695) 2025-05-13 14:02:08 -07:00
Daniel Hiltgen c6bcdc4223
Revert "remove cuda v11 (#10569)" (#10692)
Bring back v11 until we can better warn users that their driver
is too old.

This reverts commit fa393554b9.
2025-05-13 13:12:54 -07:00
Jeffrey Morgan 4b903f088a
llama: fix crash on snowflake embedding model (#10690) 2025-05-13 13:11:11 -07:00
Jeffrey Morgan c7f4ae7b9c
server: add webp image input support (#10653) 2025-05-12 20:41:42 -07:00
Michael Yang 526b2ed102
fix vocabulary (#10679) 2025-05-12 17:29:46 -07:00
Bruce MacDonald a7240c6d63
models: remove unused qwen2vl processing (#10677) 2025-05-12 16:08:42 -07:00
Daniel Hiltgen 9d6df90805
Follow up to #10363 (#10647)
The quantization PR didn't block all unsupported file types,
which this PR fixes.  It also updates the API docs to reflect
the now reduced set of supported types.
2025-05-12 15:23:31 -07:00
Jeffrey Morgan 0cefd46f23
llama: update to commit de4c07f93 (#10655) 2025-05-12 12:17:26 -07:00
Bruce MacDonald ad035ad595
convert: quantize from safetensors needs kv (#10675)
When creating a quantized model from safetensors we
need the array KV values to be loaded.Changing this
value to -1 loads the KV values on the returned
layer to be used and saved during quantization.
2025-05-12 12:04:20 -07:00
Michael Yang f95a1f2bef
feat: add trace log level (#10650)
reduce prompt log to trace level
2025-05-12 11:43:00 -07:00
HardCodeDev 82a9e9462a
readme: add UnityCodeLama to community integrations (#10665) 2025-05-11 13:44:51 -07:00
HardCodeDev 76724e2f29
readme: add OllamaPlusPlus C++ library to community integrations (#10664) 2025-05-11 13:40:41 -07:00
frob ecf14a220f
llama: allocate grammar buffer based on schema length (#10649) 2025-05-10 11:57:30 -07:00
frob 69ce44b33c
envconfig: Remove no longer supported max vram var (#10623)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-05-10 11:31:04 -07:00
Michael Yang 5969674cf1
feat: add threshold to dump options (#10639)
ml.Dump will preserve default values if not specified
2025-05-10 11:27:15 -07:00
AliAhmedNada 867d75b21e
readme: add ojira to community integrations (#10648) 2025-05-10 10:36:40 -07:00
Bruce MacDonald 3fa78598a1
cmd: strip single quotes from image page (#10636) 2025-05-09 18:05:43 -07:00
Michael Yang 0d6e35d3c6
fix: stream accumulator exits early (#10593)
the stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")` which isn't strictly correctly
since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.
2025-05-08 13:17:30 -07:00
Michael Yang 6e9a7a2568
lint: enable usetesting, disable tenv (#10594) 2025-05-08 11:42:14 -07:00
Michael Yang b585a58121
chore: remove unused ZipReader type (#10621) 2025-05-08 11:17:41 -07:00
Jeffrey Morgan fa9973cd7f
api: remove unused sampling parameters (#10581) 2025-05-08 08:31:08 -07:00
Jesse Gross 3d9498a425 ollamarunner: Use correct constant to remove cache entries
The correct constant to remove all entries to the end of the sequence
for the Ollama engine is math.MaxInt32. -1 is used by the old engine.

The impact of this is currently minimal because it would only occur
in situations that are not supported by the implemented models or
rarely used options.
2025-05-07 17:26:15 -07:00
Daniel Hiltgen 3098c8b29b
CI: trigger downstream release process (#10508) 2025-05-07 10:35:12 -07:00
Daniel Hiltgen 5e380c3b42
sched: fix race leading to orphaned runners (#10599)
If a model is loading, and the request context is canceled during the load
by a client closing the connection, and another request is inbound for the
same model with a different configuration (context size, etc.) thus requiring
a reload, two unload events can be in flight.  The first shuts down the
original model load, but the second one caused the loss of the new
reloading runner reference, thus triggering the leak.

The primary fix is detecting the duplicate unload and ignoring the second
instance.  The load routine is also hardened to ensure we detect
clobbering an already present runner and unload it with a warning.
2025-05-07 09:38:17 -07:00
Jeffrey Morgan 392de84031
api: remove unused RetrieveModelResponse type (#10603) 2025-05-06 23:08:03 -07:00
Daniel Hiltgen af31ccefc0
fix data race in WriteGGUF (#10598)
err in the go routine should not be shared with the outer scope
2025-05-06 17:36:38 -07:00
Daniel Hiltgen fa393554b9
remove cuda v11 (#10569)
This reduces the size of our Windows installer payloads by ~256M by dropping
support for nvidia drivers older than Feb 2023.  Hardware support is unchanged.

Linux default bundle sizes are reduced by ~600M to 1G.
2025-05-06 17:33:19 -07:00
Aharon Bensadoun 307e3b3e1d
readme: add Flufy to community integrations (#9719) 2025-05-06 14:47:35 -07:00
Devon Rifkin 4090aca97b
server: send 405 instead of 404 for unallowed methods (#10275)
Fixes: #5483
2025-05-06 14:45:37 -07:00
Michael Yang 92ce438de0
server: remove internal cmd (#10595) 2025-05-06 13:05:01 -07:00
Daniel Hiltgen 424810450f
Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-05-06 11:20:48 -07:00
Michael Yang 95e744beeb
discover: fix compiler warnings (#10572) 2025-05-06 10:49:22 -07:00
Jeffrey Morgan 3b2d2c8326
api: remove unused or unsupported api options (#10574)
Some options listed in api/types.go are not supported in
newer models, or have been deprecated in the past. This is
the first of a series of PRs to clean up the API options
2025-05-05 14:54:40 -07:00
Michael Yang d931ee8f22
create blobs in parallel (#10135)
* default max term height
* error on out of tree files
2025-05-05 11:59:26 -07:00
Jesse Gross 7073600797 ggml: Reduce log level of "key not found"
Most of the time this is not an error.
2025-05-05 11:17:32 -07:00
Daniel Hiltgen b1c40138da
win: lint fix (#10571) 2025-05-05 11:08:12 -07:00
Ashok Gelal 17466217e5
Hide empty terminal window (#8668)
This hides the LlamaServer blank window when chatting outside of the terminal (say like with an app like Msty). This has no other side effects when invoking it the regular way.
2025-05-05 09:06:46 -07:00
Jeffrey Morgan 1703d1472e
server: fix panic when runner.Options is nil (#10566) 2025-05-05 09:01:33 -07:00
Jeffrey Morgan 913905028b
all: fix cgo compiler warnings on windows (#10563) 2025-05-05 08:02:39 -07:00
湛露先生 7e5c8eee5c
file close check and close. (#10554)
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-05-04 15:37:59 -07:00
Daniel Hiltgen 6a74bba7e7
win: ensure ollama paths come first (#10549)
For all search path env vars make sure our dirs are first
to avoid potentially finding other incompatible libraries
on the users system.

Also fixes a minor build script glitch for windows rocm
2025-05-03 13:11:48 -07:00
Daniel Hiltgen 76ea735aaf
sched: logging improvements (#10550)
This enhances our logging in the scheduler.  The initial "waiting for server" log
no longer claims an initial error state (now "not responding" which better reflects
the actual state).  Runners now have slog wiring to report more details about the
runner, including PID.
2025-05-03 12:01:56 -07:00
aritra saha dd1d4e99e7
readme: add llama 4 models (#10530) 2025-05-02 19:45:02 -07:00
Jesse Gross a6ef73f4f2 ggml: Fix race that resulted in "context canceled" when loading
Successfully completing processing with an errgroup cancels the
associated context. However, we also have a goroutine that is checking
for cancelation of the context. As a result, there is a race where
the goroutine can pick up the cancelation and report an error,
replacing the sucessful error message.

To avoid that, this replaces the goroutine with a cancelation check
when we are reading files. This also has the advantage of stopping
all reads relatively quickly on error and also ensuring that there are
no outstanding I/O operations when we return in this case.

The downside is that if a file read blocks forever (for example, over
the network) then cancelation of the context effectively won't be
honored. However, this is also true for other smaller files we read
and the tensors are read in small chunks (128K), so it's consistent
and better on balance overall.
2025-05-02 13:43:25 -07:00
Jesse Gross c2f5d6662b ollamarunner: Re-enable worst case graph preallocation.
Worst case graph preallocation was disabled by a27462b
"ollamarunner: Temporarily disable worst case graph preallocation"
since it caused crashes with large batches when not using the GPU.

This backports upstream llama.cpp commit f057808
"ggml: Don't assert fail when tensor data changes (#13222)", which
fixes the underlying bug and allows reverting the previous workaround.
2025-05-02 12:22:47 -07:00
Harsh Nevse 57fb759f3c
readme: update link to langchain in community integrations (#10465) 2025-05-01 23:08:51 -07:00