Commit Graph

178 Commits

Daniel Hiltgen ccd7785859
Merge pull request #5243 from dhiltgen/modelfile_use_mmap
Fix use_mmap for Modelfiles
2024-07-03 13:59:42 -07:00
Daniel Hiltgen 0e982bc1f4 Fix corner cases on tmp cleaner on mac
When ollama has been running a long time, tmp cleaners can remove the
runners.  This tightens up a few corner cases on ARM Macs where
we failed with "server cpu not listed in available servers map[]"
2024-07-03 13:10:14 -07:00
Josh Yan 33a65e3ba3 error 2024-07-01 16:04:13 -07:00
Daniel Hiltgen 97c9e11768 Switch use_mmap to a pointer type
This uses nil as undefined for a cleaner implementation.
2024-07-01 08:44:59 -07:00
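In Go terms, the tri-state boils down to a nullable boolean; a minimal sketch of the idea (the struct and field names are illustrative, not ollama's actual types):

```go
package main

import "fmt"

// Options uses *bool so three states are distinguishable:
// nil (user said nothing), explicit true, and explicit false.
type Options struct {
	UseMMap *bool
}

func resolveMMap(o Options) bool {
	if o.UseMMap != nil {
		return *o.UseMMap // user set it explicitly; honor that
	}
	return true // unset: fall back to a platform default (assumed true here)
}

func main() {
	off := false
	fmt.Println(resolveMMap(Options{}))              // default applies
	fmt.Println(resolveMMap(Options{UseMMap: &off})) // explicit false wins
}
```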
Daniel Hiltgen 3518aaef33
Merge pull request #4218 from dhiltgen/auto_parallel
Enable concurrency by default
2024-07-01 08:32:29 -07:00
Blake Mizerany cb42e607c5
llm: speed up gguf decoding by a lot (#5246)
Previously, some costly things were causing the loading of GGUF files
and their metadata and tensor information to be VERY slow:

  * Too many allocations when decoding strings
  * Hitting disk for each read of each key and value, resulting in a
    not-okay amount of syscalls/disk I/O.

The show API is now down to 33ms from 800ms+ for llama3 on an M3 MacBook
Pro.

This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
2024-06-24 21:47:52 -07:00
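The syscall half of that fix amounts to reading through one large buffer instead of hitting the disk per key/value; a minimal sketch of the idea, not the actual decoder (the path and buffer size are assumptions):

```go
package main

import (
	"bufio"
	"encoding/binary"
	"log"
	"os"
)

func main() {
	f, err := os.Open("model.gguf") // hypothetical GGUF path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// One big buffered reader: each key/value read now hits memory,
	// collapsing thousands of tiny syscalls into a few large ones.
	r := bufio.NewReaderSize(f, 1<<20)

	var magic uint32
	if err := binary.Read(r, binary.LittleEndian, &magic); err != nil {
		log.Fatal(err)
	}
	log.Printf("magic: %#x", magic)
}
```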
Daniel Hiltgen 17b7186cd7 Enable concurrency by default
This adjusts our default settings to enable multiple models and parallel
requests to a single model.  Users can still override these by the same
env var settings as before.  Parallel has a direct impact on
num_ctx, which in turn can have a significant impact on small VRAM GPUs,
so this change also refines the algorithm: when parallel is not
explicitly set by the user, we try to find a reasonable default that fits
the model on their GPU(s).  As before, multiple models will only load
concurrently if they fully fit in VRAM.
2024-06-21 15:45:05 -07:00
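A hedged sketch of that "pick a parallel default that fits" heuristic (the real logic lives in the scheduler; the candidate values and sizes here are made up):

```go
package main

import "fmt"

// pickParallel walks candidate parallel values downward until the
// resulting per-slot footprint fits in free VRAM. Purely illustrative.
func pickParallel(freeVRAM, perSlotBytes uint64, userSet int) int {
	if userSet > 0 {
		return userSet // an explicit OLLAMA_NUM_PARALLEL always wins
	}
	for n := 4; n > 1; n-- {
		if uint64(n)*perSlotBytes <= freeVRAM {
			return n
		}
	}
	return 1
}

func main() {
	// 8 GiB free, 3 GiB per slot, nothing set by the user: picks 2.
	fmt.Println(pickParallel(8<<30, 3<<30, 0))
}
```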
Daniel Hiltgen 5bf5aeec01 Refine mmap default logic on linux
If we try to use mmap when the model is larger than the system's free memory, loading is slower than the no-mmap approach.
2024-06-20 11:07:04 -07:00
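The check itself is a one-liner; a sketch under the assumption that free memory is already known (the function shape is illustrative):

```go
package main

import "fmt"

// useMMap decides the Linux default: a model larger than currently
// free memory would thrash the page cache under mmap and load slower,
// so fall back to a plain read in that case. Illustrative only.
func useMMap(modelSize, freeMemory uint64) bool {
	return modelSize <= freeMemory
}

func main() {
	fmt.Println(useMMap(7<<30, 16<<30))  // true: model fits in free RAM
	fmt.Println(useMMap(40<<30, 16<<30)) // false: bigger than free RAM
}
```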
Daniel Hiltgen 96624aa412
Merge pull request #5072 from dhiltgen/windows_path
Move libraries out of the user's PATH
2024-06-19 09:13:39 -07:00
Daniel Hiltgen 7784ca33ce Tighten up memory prediction logging
Prior to this change, we logged the memory prediction multiple times
as the scheduler iterates to find a suitable configuration, which can be
confusing since only the last log before the server starts is actually valid.
This now logs once just before starting the server on the final configuration.
It also reports which library is in use instead of always saying "offloading to gpu" when
running on the CPU.
2024-06-18 09:15:35 -07:00
Daniel Hiltgen 171796791f Adjust mmap logic for cuda windows for faster model load
On Windows, recent llama.cpp changes make mmap slower in most
cases, so default to off.  This also implements a tri-state for
use_mmap so we can detect the difference between a user-provided
value of true/false and unspecified.
2024-06-17 16:54:30 -07:00
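Combined with the pointer tri-state above, the resolution order might look like this sketch (the platform check is an assumption about the shape of the code, not the actual decision tree):

```go
package main

import (
	"fmt"
	"runtime"
)

// resolveMMapDefault honors an explicit user value; otherwise it
// defaults mmap off on Windows (where recent llama.cpp changes made
// it slower) and on elsewhere. Illustrative only.
func resolveMMapDefault(userValue *bool) bool {
	if userValue != nil {
		return *userValue
	}
	return runtime.GOOS != "windows"
}

func main() {
	fmt.Println(resolveMMapDefault(nil)) // platform default applies
}
```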
Daniel Hiltgen b2799f111b Move libraries out of the user's PATH
We update the PATH on Windows to get the CLI mapped, but this has
an unintended side effect of causing other apps that may use our bundled
DLLs to get terminated when we upgrade.
2024-06-17 13:12:18 -07:00
Daniel Hiltgen da3bf23354 Workaround gfx900 SDMA bugs
Implement support for GPU env var workarounds, and leverage
this for the RX Vega 56, which needs
HSA_ENABLE_SDMA=0 set to work properly
2024-06-14 15:38:13 -07:00
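One way such per-GPU workarounds could be wired into the runner's environment; the gfx900/HSA_ENABLE_SDMA=0 pair comes from the commit, while the map shape and binary name are assumptions for illustration:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// gpuWorkarounds maps a GPU identifier to env vars its runner
// subprocess needs to behave correctly.
var gpuWorkarounds = map[string][]string{
	"gfx900": {"HSA_ENABLE_SDMA=0"},
}

func main() {
	cmd := exec.Command("./runner") // hypothetical runner binary
	cmd.Env = append(os.Environ(), gpuWorkarounds["gfx900"]...)
	fmt.Println(cmd.Env[len(cmd.Env)-1]) // HSA_ENABLE_SDMA=0
}
```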
Daniel Hiltgen 6f351bf586 review comments and coverage 2024-06-14 14:55:50 -07:00
Daniel Hiltgen fc37c192ae Refine CPU load behavior with system memory visibility 2024-06-14 14:51:40 -07:00
Daniel Hiltgen 6fd04ca922 Improve multi-gpu handling at the limit
Still not complete; our prediction needs refinement to understand each
discrete GPU's available space so we can see how many layers fit in each one.
Since we can't split one layer across multiple GPUs, we can't treat free space
as one logical block
2024-06-14 14:51:40 -07:00
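The constraint is easiest to see in a sketch: layers are placed whole, GPU by GPU, so each device's free space is checked individually rather than summed (all names and sizes are illustrative):

```go
package main

import "fmt"

// fitLayers assigns whole layers to GPUs one at a time. Because a
// single layer cannot be split across GPUs, free space cannot be
// treated as one logical block.
func fitLayers(free []uint64, layerSize uint64, total int) int {
	fitted := 0
	for _, f := range free {
		for f >= layerSize && fitted < total {
			f -= layerSize
			fitted++
		}
	}
	return fitted
}

func main() {
	// Two 8 GiB GPUs and 500 MiB layers: 16 layers fit on each GPU,
	// so only 32 of 33 layers offload even though 16 GiB total would
	// naively hold all of them.
	fmt.Println(fitLayers([]uint64{8 << 30, 8 << 30}, 500<<20, 33))
}
```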
Craig Hughes b84aea1685
Critical fix from llama.cpp JSON grammar to forbid un-escaped escape characters inside strings, which break parsing. (#3782) 2024-06-09 10:57:09 -07:00
Michael Yang e40145a39d lint 2024-06-04 11:13:30 -07:00
Michael Yang c895a7d13f some gocritic 2024-06-04 11:13:30 -07:00
Michael Yang 829ff87bd1
revert tokenize ffi (#4761)
* Revert "use `int32_t` for call to tokenize (#4738)"

This reverts commit 763bb65dbb.

* Revert "vocab only"

This reverts commit bf54c845e9.

* Revert "use ffi for tokenizing/detokenizing"

This reverts commit 26a00a0410.
2024-05-31 18:54:21 -07:00
Jeffrey Morgan a50a87a7b8
partial offloading: allow flash attention and disable mmap (#4734)
* partial offloading: allow flash attention and disable mmap

* allow mmap with num_gpu=0
2024-05-30 16:58:01 -07:00
Michael Yang 26a00a0410 use ffi for tokenizing/detokenizing 2024-05-29 11:26:47 -07:00
Daniel Hiltgen 92c81e8117 Give the final model loading more time
On some systems, 1 minute isn't sufficient to finish the load after it
hits 100%.  This creates 2 distinct timers, although they're both set to
the same value for now so we can refine the timeouts further.
2024-05-28 09:08:10 -07:00
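A sketch of the two-timer shape (the names, values, and the stand-in for "server became ready" are all assumptions):

```go
package main

import (
	"fmt"
	"time"
)

// Two distinct timeouts, currently equal, kept separate so they can
// be tuned independently later.
const (
	stallTimeout = 5 * time.Minute // load making no progress at all
	finalTimeout = 5 * time.Minute // stuck at 100% before becoming ready
)

func main() {
	stall := time.NewTimer(stallTimeout)
	final := time.NewTimer(finalTimeout)
	defer stall.Stop()
	defer final.Stop()

	select {
	case <-stall.C:
		fmt.Println("giving up: load made no progress")
	case <-final.C:
		fmt.Println("giving up: load never finished after reaching 100%")
	case <-time.After(10 * time.Millisecond): // stand-in for "server ready"
		fmt.Println("model loaded")
	}
}
```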
Lei Jitang 7487229c34
llm/server.go: Fix 2 minor typos (#4661)
Signed-off-by: Lei Jitang <leijitang@outlook.com>
2024-05-27 17:21:10 -07:00
Daniel Hiltgen 0165ba1651
Merge pull request #4638 from dhiltgen/better_error
Report better warning on client closed abort of load
2024-05-25 14:32:28 -07:00
Daniel Hiltgen c4209d6d21 Report better warning on client closed abort of load
If the client closes the connection before we finish loading the model,
we abort, so let's make the log message clearer about why, to help users
understand this failure mode
2024-05-25 09:23:28 -07:00
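Detecting that hang-up is a context-cancellation check; a minimal sketch, assuming the load is driven by the request context (the function and log wording are illustrative):

```go
package main

import (
	"context"
	"errors"
	"log/slog"
	"time"
)

// loadModel aborts when the request context is canceled, i.e. the
// client hung up mid-load. The explicit warning is the point of the
// commit; everything else here is scaffolding.
func loadModel(ctx context.Context) error {
	select {
	case <-ctx.Done():
		if errors.Is(ctx.Err(), context.Canceled) {
			slog.Warn("client closed the connection before the model finished loading; aborting load")
		}
		return ctx.Err()
	case <-time.After(50 * time.Millisecond): // stand-in for the real load
		return nil
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the client disconnecting immediately
	_ = loadModel(ctx)
}
```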
Patrick Devine 4cc3be3035
Move envconfig and consolidate env vars (#4608) 2024-05-24 14:57:15 -07:00
Daniel Hiltgen b37b496a12 Wire up load progress
This doesn't expose a UX yet, but wires up the initial server portion
of progress reporting during load
2024-05-23 13:36:48 -07:00
Jeffrey Morgan 38255d2af1
Use flash attention flag for now (#4580)
* put flash attention behind flag for now

* add test

* remove print

* up timeout for scheduler tests
2024-05-22 21:52:09 -07:00
Sam e15307fdf4
feat: add support for flash_attn (#4120)
* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: add flash_attn support
2024-05-20 13:36:03 -07:00
Patrick Devine d1692fd3e0
fix the cpu estimatedTotal memory + get the expiry time for loading models (#4461) 2024-05-15 15:43:16 -07:00
Daniel Hiltgen 853ae490e1 Sanitize the env var debug log
Only dump env vars we care about in the logs
2024-05-15 14:42:57 -07:00
Patrick Devine 6845988807
Ollama `ps` command for showing currently loaded models (#4327) 2024-05-13 17:17:36 -07:00
jmorganca 92ca2cca95 Revert "only forward some env vars"
This reverts commit ce3b212d12.
2024-05-10 22:53:21 -07:00
Daniel Hiltgen c4014e73a2 Fall back to CPU runner with zero layers 2024-05-10 15:09:48 -07:00
Jeffrey Morgan bb6fd02298
Don't clamp ctx size in `PredictServerFit` (#4317)
* don't clamp ctx size in `PredictServerFit`

* minimum 4 context

* remove context warning
2024-05-10 10:17:12 -07:00
Michael Yang cf442cd57e fix typo 2024-05-09 16:23:37 -07:00
Michael Yang ce3b212d12 only forward some env vars 2024-05-09 15:16:09 -07:00
Michael Yang 58876091f7 log clean up 2024-05-09 14:55:36 -07:00
Daniel Hiltgen d0425f26cf
Merge pull request #4294 from dhiltgen/harden_subprocess_reaping
Harden subprocess reaping
2024-05-09 14:02:16 -07:00
Bruce MacDonald cfa84b8470
add done_reason to the api (#4235) 2024-05-09 13:30:14 -07:00
Daniel Hiltgen 84ac7ce139 Refine subprocess reaping 2024-05-09 11:21:31 -07:00
Daniel Hiltgen 920a4b0794 Merge remote-tracking branch 'upstream/main' into pr3702 2024-05-08 16:44:35 -07:00
Daniel Hiltgen ee49844d09
Merge pull request #4153 from dhiltgen/gpu_verbose_response
Add GPU usage
2024-05-08 16:39:11 -07:00
Daniel Hiltgen bee2f4a3b0 Record GPU usage information
This records more GPU usage information for eventual UX inclusion.
2024-05-08 14:45:39 -07:00
Daniel Hiltgen 72700279e2 Detect noexec and report a better error
This will bubble up a much more informative error message if noexec
is preventing us from running the subprocess
2024-05-07 16:46:15 -07:00
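A sketch of bubbling that up: on Linux, exec from a noexec mount fails with a permission error even when the file mode is executable, so the error can be wrapped with a hint (the check and wording are assumptions, not the actual code):

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os/exec"
)

// runRunner wraps exec failures: a permission error on an executable
// payload usually means the directory is mounted noexec.
func runRunner(path string) error {
	err := exec.Command(path).Run()
	if errors.Is(err, fs.ErrPermission) {
		return fmt.Errorf("unable to execute %s: the directory may be mounted noexec; "+
			"try a different temp dir: %w", path, err)
	}
	return err
}

func main() {
	fmt.Println(runRunner("/tmp/ollama-runner")) // hypothetical payload path
}
```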
Daniel Hiltgen 380378cc80 Use our libraries first
Trying to live off the land for CUDA libraries was not the right strategy.  We need to use the version we compiled against to ensure things work properly
2024-05-06 14:23:29 -07:00
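"Use ours first" comes down to library search-path ordering when launching the runner; a sketch assuming Linux and an illustrative bundled directory and binary name:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Put the bundled directory first so the CUDA libraries we
	// compiled against beat any system copies the dynamic loader
	// might otherwise pick up.
	bundled := "/usr/local/lib/ollama" // illustrative location
	ldPath := bundled + string(os.PathListSeparator) + os.Getenv("LD_LIBRARY_PATH")

	cmd := exec.Command("./runner") // hypothetical runner binary
	cmd.Env = append(os.Environ(), "LD_LIBRARY_PATH="+ldPath)
	fmt.Println(ldPath)
}
```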
Jeffrey Morgan ed740a2504
Fix `no slots available` error with concurrent requests (#4160) 2024-05-06 14:22:53 -07:00
Jeffrey Morgan 1b0e6c9c0e
Fix llava models not working after first request (#4164)
* fix llava models not working after first request

* individual requests only for llava models
2024-05-05 20:50:31 -07:00
Daniel Hiltgen f56aa20014 Centralize server config handling
This moves all the env var reading into one central module
and logs the loaded config once at startup, which should
help when troubleshooting user server logs
2024-05-05 16:49:50 -07:00
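The shape of such a central module, sketched with two of the real variable names (OLLAMA_NUM_PARALLEL, OLLAMA_MAX_LOADED_MODELS); the struct, defaults, and loader are illustrative:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config gathers env var reading in one place so values are read,
// validated, and logged exactly once at startup.
type Config struct {
	NumParallel     int
	MaxLoadedModels int
}

func Load() Config {
	c := Config{NumParallel: 1, MaxLoadedModels: 1} // assumed defaults
	if v, err := strconv.Atoi(os.Getenv("OLLAMA_NUM_PARALLEL")); err == nil {
		c.NumParallel = v
	}
	if v, err := strconv.Atoi(os.Getenv("OLLAMA_MAX_LOADED_MODELS")); err == nil {
		c.MaxLoadedModels = v
	}
	return c
}

func main() {
	fmt.Printf("loaded config: %+v\n", Load()) // the one startup log line
}
```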
Mark Ward 321d57e1a0 Remove the goroutine calling .Wait() from load. 2024-05-01 18:51:10 +00:00
Mark Ward ba26c7aa00 It will always return an error because Kill() discards Wait() errors 2024-05-01 18:51:10 +00:00
Mark Ward 63c763685f Log while waiting for the process to stop, to help debug when other tasks execute during this wait.
The expire timer clears the timer reference because it will not be reused.
Close will clean up expireTimer if the calling code has not already done so.
2024-05-01 18:51:10 +00:00
Mark Ward 948114e3e3 fix sched to wait for the runner to terminate so the following VRAM check is more accurate 2024-05-01 18:51:10 +00:00
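The Kill()/Wait() interplay these shutdown commits deal with is visible in a few lines of Go: after Kill(), Wait() always reports an error ("signal: killed"), so the teardown path has to expect it rather than treat it as a failure. A standalone sketch (assumes a Unix sleep binary):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	time.Sleep(100 * time.Millisecond)
	_ = cmd.Process.Kill()

	// Wait always returns an error here, so shutdown code must not
	// surface it as a real failure.
	err := cmd.Wait()
	fmt.Println("wait after kill:", err) // "signal: killed"
}
```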
Jeffrey Morgan 7aa08a77ca
llm: don't cap context window limit to training context window (#3988) 2024-04-29 10:07:30 -04:00
Jeffrey Morgan bb31def011
return code `499` when user cancels request while a model is loading (#3955) 2024-04-26 17:38:29 -04:00
Jeffrey Morgan 993cf8bf55
llm: limit generation to 10x context size to avoid run-on generations (#3918)
* llm: limit generation to 10x context size to avoid run-on generations

* add comment

* simplify condition statement
2024-04-25 19:02:30 -04:00
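The 10x multiplier comes from the commit; the shape of the check is a sketch:

```go
package main

import "fmt"

// keepGenerating caps a response at ten times the context size so a
// model that never emits a stop token cannot run on forever.
func keepGenerating(generated, numCtx int) bool {
	return generated < 10*numCtx
}

func main() {
	numCtx := 2048
	fmt.Println(keepGenerating(20479, numCtx)) // true: still under the cap
	fmt.Println(keepGenerating(20480, numCtx)) // false: 10x reached, stop
}
```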
Daniel Hiltgen 6e76348df7
Merge pull request #3834 from dhiltgen/not_found_in_path
Report errors on server lookup instead of path lookup failure
2024-04-24 10:50:48 -07:00
Daniel Hiltgen 58888a74bc Detect and recover if runner removed
Tmp cleaners can nuke the file out from underneath us.  This detects the missing
runner, and re-initializes the payloads.
2024-04-23 10:05:26 -07:00
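Detect-and-recover can be as small as a stat-then-reextract check; a sketch with illustrative names and path:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// ensureRunner re-extracts the payloads if a tmp cleaner deleted the
// runner binary out from under a long-running server.
func ensureRunner(path string, reextract func() error) error {
	if _, err := os.Stat(path); errors.Is(err, fs.ErrNotExist) {
		return reextract()
	}
	return nil
}

func main() {
	err := ensureRunner("/tmp/ollama/runner", func() error {
		fmt.Println("runner missing, re-initializing payloads")
		return nil
	})
	fmt.Println(err)
}
```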
Daniel Hiltgen 34b9db5afc Request and model concurrency
This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
2024-04-22 19:29:12 -07:00
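Per-model request parallelism maps naturally onto a buffered-channel semaphore in Go; a sketch of the mechanism, not the scheduler itself (OLLAMA_NUM_PARALLEL is hardcoded here for brevity):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const numParallel = 2 // stand-in for OLLAMA_NUM_PARALLEL
	slots := make(chan struct{}, numParallel)

	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			slots <- struct{}{}        // acquire a slot (blocks at capacity)
			defer func() { <-slots }() // release it when done
			fmt.Println("handling request", id)
		}(i)
	}
	wg.Wait()
}
```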
Daniel Hiltgen 8711d03df7 Report errors on server lookup instead of path lookup failure 2024-04-22 19:08:47 -07:00
Daniel Hiltgen aa72281eae Trim spaces and quotes from llm lib override 2024-04-22 17:11:14 -07:00
ManniX-ITA c496967e56
Merge branch 'ollama:main' into mannix-server 2024-04-18 18:45:15 +02:00
Michael Yang 3cf483fe48 add stablelm graph calculation 2024-04-17 13:57:19 -07:00
Michael Yang a8b9b930b4 account for all non-repeating layers 2024-04-17 11:21:21 -07:00
ManniX-ITA bd54b08261
Streamlined WaitUntilRunning 2024-04-17 17:39:52 +02:00
Michael Yang 26df674785 scale graph based on gpu count 2024-04-16 14:44:13 -07:00
Michael Yang 41a272de9f darwin: no partial offloading if required memory greater than system 2024-04-16 11:22:38 -07:00
Jeffrey Morgan a0b8a32eb4
Terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading (#3653)
* terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading

* use `unload` in signal handler
2024-04-15 12:09:32 -04:00
Michael Yang 7e33a017c0 partial offloading 2024-04-10 11:37:20 -07:00
Michael Yang 8b2c10061c refactor tensor query 2024-04-10 11:37:20 -07:00
Daniel Hiltgen c5ff443b9f Handle very slow model loads
During testing, we're seeing some models take over 3 minutes.
2024-04-09 16:35:10 -07:00
Michael Yang be517e491c no rope parameters 2024-04-05 18:05:27 -07:00
Michael Yang 12e923e158 update graph size estimate 2024-04-03 13:34:12 -07:00
Daniel Hiltgen 464d817824
Merge pull request #3464 from dhiltgen/subprocess
Fix numgpu opt miscomparison
2024-04-02 20:10:17 -07:00
Daniel Hiltgen 6589eb8a8c Revert options as a ref in the server 2024-04-02 16:44:10 -07:00
Michael Yang 80163ebcb5 fix metal gpu 2024-04-02 16:06:45 -07:00
Daniel Hiltgen 58d95cc9bd Switch back to subprocessing for llama.cpp
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shut down when idle, and
gracefully restart if it has problems.  This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00
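The subprocess lifecycle in miniature: spawn, shut down after sitting idle, reap the child. A sketch only; the binary name, flags, and idle window are all illustrative:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Spawn the llama.cpp server out-of-process so leaks stay isolated
	// and a crash can be recovered by restarting it.
	cmd := exec.Command("./llama-server", "--port", "0") // hypothetical invocation
	if err := cmd.Start(); err != nil {
		fmt.Println("spawn failed:", err)
		return
	}

	idle := time.NewTimer(5 * time.Minute)
	defer idle.Stop()

	<-idle.C               // no requests arrived within the idle window
	_ = cmd.Process.Kill() // reclaim memory; a fresh copy restarts on demand
	_ = cmd.Wait()         // reap the child so it doesn't linger as a zombie
}
```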