Commit Graph

45 Commits

Author SHA1 Message Date
Inforithmics da65fb2722 ROCm Library is named ROCm 2025-10-06 19:26:16 +02:00
Inforithmics 0b6b03f671 Set FilteredID for Environment Filtering 2025-10-06 17:12:24 +02:00
Inforithmics 908b31814d fixed vulkan casing 2025-10-05 11:01:26 +02:00
Inforithmics 96e562f982 fixed build 2025-10-04 16:35:04 +02:00
Inforithmics 9ac9f3a952 fixed formatting 2025-10-04 16:32:39 +02:00
Inforithmics ac6ba7d44b Merge remote-tracking branch 'upstream/main' into VulkanV3Update 2025-10-04 14:53:59 +02:00
Daniel Hiltgen bc8909fb38
Use runners for GPU discovery (#12090)
This revamps how we discover GPUs in the system by leveraging the Ollama
runner.  This should eliminate inconsistency between our GPU discovery and the
runners' capabilities at runtime, particularly for cases where we try to filter
out unsupported GPUs.  Now the runner does that implicitly based on the actual
device list.  In some cases free VRAM reporting can be unreliable, which can
lead to scheduling mistakes, so this also includes a patch to leverage more
reliable VRAM reporting libraries if available.

Automatic workarounds have been removed, as only one GPU relied on them; that
case is now documented. This GPU will soon fall off the support matrix with the
next ROCm bump.

Additional cleanup of the scheduler and discovery packages can be done in the
future once we have switched on the new memory management code, and removed
support for the llama runner.
2025-10-01 15:12:32 -07:00
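
The filtering described above boils down to intersecting the host's device list
with what the runner itself reports. A minimal Go sketch of that idea, using
hypothetical DeviceInfo and filterByRunner names rather than the actual Ollama API:

```go
package discover

// DeviceInfo is a hypothetical, simplified stand-in for a discovered GPU;
// the real structure in Ollama differs.
type DeviceInfo struct {
	ID       string
	Library  string // e.g. "cuda", "rocm", "vulkan"
	FreeVRAM uint64
}

// filterByRunner keeps only the host devices the runner itself reported,
// so unsupported GPUs drop out implicitly instead of via manual filter lists.
func filterByRunner(host, runner []DeviceInfo) []DeviceInfo {
	seen := make(map[string]bool, len(runner))
	for _, d := range runner {
		seen[d.ID] = true
	}
	filtered := make([]DeviceInfo, 0, len(host))
	for _, d := range host {
		if seen[d.ID] {
			filtered = append(filtered, d)
		}
	}
	return filtered
}
```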
Daniel Hiltgen c86af47ac0 WIP - wire up Vulkan with the new engine based discovery
Not a complete implementation - free VRAM reporting is improved, but still not
accurate on Windows
2025-09-24 10:49:39 -07:00
Daniel Hiltgen 3a8ee62bd5 Merge remote-tracking branch 'inforithmics/vulkanV3' into engine_based_discovery_with_vulkan 2025-09-21 14:04:22 -07:00
Daniel Hiltgen f761292516 Use runners for GPU discovery
This revamps how we discover GPUs in the system by leveraging the Ollama
runner.  This should eliminate inconsistency between our GPU discovery and the
runners' capabilities at runtime, particularly for cases where we try to filter
out unsupported GPUs.  Now the runner does that implicitly based on the actual
device list.  In some cases free VRAM reporting can be unreliable, which can
lead to scheduling mistakes, so this also includes a patch to leverage more
reliable VRAM reporting libraries if available.

Automatic workarounds have been removed, as only one GPU relied on them; that
case is now documented. This GPU will soon fall off the support matrix with the
next ROCm bump.

Additional cleanup of the scheduler and discovery packages can be done in the
future once we have switched on the new memory management code, and removed
support for the llama runner.
2025-09-21 13:53:24 -07:00
Inforithmics 1cb70716bf revert debug code 2025-09-20 15:26:24 +02:00
Inforithmics d26d920fb2 Filter out already supported GPUs 2025-09-20 15:18:39 +02:00
Inforithmics 15eef5cc87 Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-09-17 23:06:02 +02:00
Daniel Hiltgen 9c5bf342bc
fix: multi-cuda version skew (#12318)
Ensure that in a version-skewed multi-CUDA setup we use the lowest version for all GPUs
2025-09-17 13:05:09 -07:00
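
"Use the lowest version for all GPUs" is a simple min-reduction over the
per-device driver versions. A hedged sketch, with cudaDevice as a hypothetical
type (not Ollama's real one):

```go
package discover

type cudaDevice struct {
	ID            string
	DriverVersion int // e.g. 12040 for CUDA 12.4
}

// lowestDriverVersion returns the minimum driver version across devices,
// so a version-skewed multi-GPU setup loads one consistent CUDA library.
func lowestDriverVersion(devs []cudaDevice) int {
	if len(devs) == 0 {
		return 0
	}
	min := devs[0].DriverVersion
	for _, d := range devs[1:] {
		if d.DriverVersion < min {
			min = d.DriverVersion
		}
	}
	return min
}
```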
Inforithmics d97c2ab8b9 Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-09-06 20:16:05 +02:00
Daniel Hiltgen ead4a9a1d0
Always filter devices (#12108)
* Always filter devices

Avoid crashing on unsupported AMD iGPUs

* Remove cuda device filtering

This interferes with mixed setups
2025-08-29 12:17:31 -07:00
Masato Nakasaka af5f5bdf60 Removed libcap related code
libcap is not directly related to Vulkan and should be added in its own PR. It adds additional library dependencies for building and also requires users to run setcap or run ollama as root, which is not ideal for ease of use.
2025-08-27 11:51:53 +09:00
Inforithmics 56050ad8ea Fix logging 2025-08-14 22:42:30 +02:00
Inforithmics d71c83f2ba Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-14 22:11:08 +02:00
Daniel Hiltgen 837379a94c
discovery: fix cudart driver version (#11614)
We prefer the nvcuda library, which reports driver versions. When we
dropped CUDA v11, we added a safety check for too-old drivers.  What
we missed was that the cudart fallback discovery logic didn't have the
driver version wired up.  This fixes cudart discovery to expose the driver
version as well, so we no longer reject all GPUs if nvcuda didn't work.
2025-08-13 15:43:33 -07:00
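
The fix amounts to giving the cudart fallback path a driver version too, so the
too-old-driver safety check has something real to compare. A minimal sketch
under assumed names (nvcudaVer and cudartVer are illustrative, not the actual fields):

```go
// effectiveDriverVersion prefers the version reported by nvcuda, but falls
// back to the one queried via cudart; before this fix the fallback path
// reported zero, which made every GPU look "too old" and got it rejected.
func effectiveDriverVersion(nvcudaVer, cudartVer int) int {
	if nvcudaVer > 0 {
		return nvcudaVer
	}
	return cudartVer
}
```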
Inforithmics 49c4d154ae Enable Vulkan Flash attention in FlashAttentionSupported 2025-08-12 21:55:19 +02:00
Inforithmics 0c27f472e7 Remove commented out code 2025-08-11 18:52:43 +02:00
Inforithmics d1f74e17d4 Update gpu.go 2025-08-10 21:28:59 +02:00
Inforithmics f6dd7070de vk_check_flash_attention: 0 means supported 2025-08-10 21:22:26 +02:00
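The title records a C-style convention worth making explicit: the native probe
returns 0 on success, so the Go-side check must compare against zero. A minimal
sketch, with vkCheckFlashAttention standing in for the cgo call:

```go
// vkCheckFlashAttention stands in for the native Vulkan probe; per the
// convention noted in the commit title, it returns 0 when flash attention
// is supported and a non-zero code otherwise.
func vkCheckFlashAttention(deviceIndex int) int { return 0 /* cgo call in reality */ }

// flashAttentionSupported must test == 0, not != 0; inverting this
// comparison is exactly the bug the surrounding commits fix.
func flashAttentionSupported(deviceIndex int) bool {
	return vkCheckFlashAttention(deviceIndex) == 0
}
```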
Inforithmics ee24b967f1 fixed flash attention logic enabling 2025-08-10 19:57:14 +02:00
Inforithmics a1393414ce revert remove parenthesis 2025-08-10 17:54:13 +02:00
Inforithmics 5270c4c5f7 enable flash attention on vulkan 2025-08-10 16:53:13 +02:00
Thomas Stocker 29b1ed0077
Revert whitespace changes in gpu.go 2025-08-09 22:30:13 +02:00
Thomas Stocker 57270767ac
Remove flashattention setting gpu.go 2025-08-09 22:26:54 +02:00
Inforithmics f8ed1541ed Merge remote-tracking branch 'upstream/main' into vulkanV3 2025-08-09 21:59:30 +02:00
Michael Yang f95a1f2bef
feat: add trace log level (#10650)
reduce prompt log to trace level
2025-05-12 11:43:00 -07:00
Vadim Grinco 45dbd14645
Merged latest ollama 0.6.2 and nasrally's Flash Attention patches (#5)
* readme: add Ellama to list of community integrations (#9800)

* readme: add screenpipe to community integrations (#9786)

* Add support for ROCm gfx1151 (#9773)

* conditionally enable parallel pipelines

* sample: make mutations in transforms explicit (#9743)

* updated minP to use early exit making use of sorted tokens

* ml/backend/ggml: allocate memory with malloc when loading model (#9822)

* runner: remove cache prompt flag from ollama runner (#9826)

We do not need to bypass the prompt caching in the ollama runner yet, as
only embedding models need to bypass the prompt caching. When embedding
models are implemented they can skip initializing this cache completely.

* ollamarunner: Check for minBatch of context space when shifting

Models can specify that a group of inputs needs to be handled as a single
batch. However, context shifting didn't respect this and could trigger
a break anyway. In this case, we should instead trigger a context
shift earlier so that it occurs before the grouped batch (see the sketch
after this commit entry).

Note that there are still some corner cases:
 - A long prompt that exceeds the context window can get truncated
   in the middle of an image. With the current models, this will
   result in the model not recognizing the image at all, which is
   pretty much the expected result with truncation.
 - The context window is set to less than the minimum batch size. The
   only solution to this is to refuse to load the model with these
   settings. However, this can never occur with current models and
   default settings.

Since users are unlikely to run into these scenarios, fixing them is
left as a follow-up.

* Applied latest patches from McBane87

See this for details: https://github.com/whyvl/ollama-vulkan/issues/7#issuecomment-2708820861

Signed-off-by: Vadim Grinco <vadim@grinco.eu>

* Add ability to enable flash attention on vulkan (#4)

* discover: add flash attention handling for vulkan
* envconfig: fix typo in config.go

As part of the process some code was refactored, and I added a new field
FlashAttention to GpuInfo since the previous solution didn't allow for a
granular check via Vulkan extensions. As a side effect, this now allows
for granular per-device FA support checking in other places.

---------

Signed-off-by: Vadim Grinco <vadim@grinco.eu>
Co-authored-by: zeo <108888572+zeozeozeo@users.noreply.github.com>
Co-authored-by: Louis Beaumont <louis.beaumont@gmail.com>
Co-authored-by: Daniel Hiltgen <dhiltgen@users.noreply.github.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Nikita <50599445+nasrally@users.noreply.github.com>
2025-03-23 12:27:37 +01:00
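
For the minBatch/context-shift bullet above, the sketch below shows the shape
of the check. Here numCtx, numPast, and minBatch are hypothetical names for the
context size, the tokens already in context, and the size of a group of inputs
that must stay in one batch:

```go
// shiftBeforeGroupedBatch reports whether a context shift should happen
// now, before processing a grouped batch, rather than being forced to
// break the group mid-way once the window runs out.
func shiftBeforeGroupedBatch(numCtx, numPast, minBatch int) bool {
	remaining := numCtx - numPast
	return remaining < minBatch
}
```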
Vadim Grinco 98f699773a Applied 00-fix-vulkan-building.patch
Work done by McBane87 here: https://github.com/whyvl/ollama-vulkan/issues/7#issuecomment-2660836871

Signed-off-by: Vadim Grinco <vadim@grinco.eu>
2025-03-10 12:34:37 +01:00
pufferffish 582d41e002 Merge github.com:ollama/ollama into vulkan 2025-02-03 14:44:30 +00:00
Michael Yang 548a9f56a6 Revert "cgo: use O3"
This reverts commit bea1f1fac6.
2025-01-31 10:25:39 -08:00
Michael Yang bea1f1fac6 cgo: use O3 2025-01-30 12:21:50 -08:00
Michael Yang dcfb7a105c
next build (#8539)
* add build to .dockerignore

* test: only build one arch

* add build to .gitignore

* fix ccache path

* filter amdgpu targets

* only filter if autodetecting

* Don't clobber gpu list for default runner

This ensures the GPU specific environment variables are set properly

* explicitly set CXX compiler for HIP

* Update build_windows.ps1

This isn't complete, but is close.  Dependencies are missing, and it only builds the "default" preset.

* build: add ollama subdir

* add .git to .dockerignore

* docs: update development.md

* update build_darwin.sh

* remove unused scripts

* llm: add cwd and build/lib/ollama to library paths

* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS

* add additional cmake output vars for msvc

* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12

* remove unnecessary filepath.Dir, cleanup

* add hardware-specific directory to path

* use absolute server path

* build: linux arm

* cmake install targets

* remove unused files

* ml: visit each library path once

* build: skip cpu variants on arm

* build: install cpu targets

* build: fix workflow

* shorter names

* fix rocblas install

* docs: clean up development.md

* consistent build dir removal in development.md

* silence -Wimplicit-function-declaration build warnings in ggml-cpu

* update readme

* update development readme

* llm: update library lookup logic now that there is one runner (#8587)

* tweak development.md

* update docs

* add windows cuda/rocm tests

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-01-29 15:03:38 -08:00
tomaThomas 0d277d32db
Fix variable name 2025-01-25 11:23:25 +01:00
yeongbba 9ac01e88dd Merge remote-tracking branch 'upstream/vulkan' into vulkan 2025-01-19 12:49:38 +09:00
pufferffish 9ad63a747b fix conflict 2025-01-12 01:00:41 +00:00
Bruce MacDonald 2d33c4e97d discover: remove leading new-line for linter 2025-01-03 12:03:58 -08:00
Daniel Hiltgen 4879a234c4
build: Make target improvements (#7499)
* llama: wire up builtin runner

This adds a new entrypoint into the ollama CLI to run the cgo built runner.
On Mac arm64, this will have GPU support, but on all other platforms it will
be the lowest common denominator CPU build.  After we fully transition
to the new Go runners more tech-debt can be removed and we can stop building
the "default" runner via make and rely on the builtin always.

* build: Make target improvements

Add a few new targets and help for building locally.
This also adjusts the runner lookup to favor local builds, then
runners relative to the executable, and finally payloads.

* Support customized CPU flags for runners

This implements a simplified custom CPU flags pattern for the runners.
When built without overrides, the runner name contains the vector flag
we check for (AVX) to ensure we don't try to run on unsupported systems
and crash.  If the user builds a customized set, we omit the naming
scheme and don't check for compatibility.  This avoids checking
requirements at runtime, so that logic has been removed as well.  This
can be used to build GPU runners with no vector flags, or CPU/GPU
runners with additional flags (e.g. AVX512) enabled.

* Use relative paths

If the user checks out the repo in a path that contains spaces, make gets
really confused, so we use relative paths for everything in-repo to avoid breakage.

* Remove payloads from main binary

* install: clean up prior libraries

This removes support for v0.3.6 and older versions (before the tar bundle)
and ensures we clean up prior libraries before extracting the bundle(s).
Without this change, runners and dependent libraries could leak when we
update and lead to subtle runtime errors.
2024-12-10 09:47:19 -08:00
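
The "customized CPU flags" bullet describes a naming convention: stock runner
names embed the vector flag they require, while custom builds omit the scheme
and skip the compatibility check entirely. A hedged Go sketch of that check
(the function and host-capability parameters are assumptions, not Ollama's API):

```go
package runners

import "strings"

// runnerCompatible returns whether a runner named per the stock scheme
// (e.g. "cpu_avx2") can run on this host. Custom builds carry no flag in
// the name, so they are trusted as-is and no runtime check is performed.
func runnerCompatible(name string, hostHasAVX, hostHasAVX2 bool) bool {
	switch {
	case strings.Contains(name, "avx2"):
		return hostHasAVX2
	case strings.Contains(name, "avx"):
		return hostHasAVX
	default:
		return true
	}
}
```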
Daniel Hiltgen df011054fa
Jetpack support for Go server (#7217)
This adds support for the Jetson JetPack variants into the Go runner
2024-11-12 10:31:52 -08:00
Daniel Hiltgen 29ab9fa7d7
nvidia libs have inconsistent ordering (#7473)
The runtime and management libraries may not always have
identical ordering, so use the device UUID to correlate instead of ID.
2024-11-02 16:35:41 -07:00
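
Correlating the two NVIDIA libraries by UUID instead of by enumeration index is
a straightforward map join. A sketch under assumed names (gpuRecord is
illustrative, not the real structure):

```go
package discover

// gpuRecord is a hypothetical, simplified record from either the runtime
// or the management library.
type gpuRecord struct {
	UUID  string
	Index int
}

// correlateByUUID matches runtime entries to management entries by device
// UUID, since the two libraries may not enumerate GPUs in the same order.
func correlateByUUID(runtime, mgmt []gpuRecord) map[int]int {
	byUUID := make(map[string]int, len(mgmt))
	for _, m := range mgmt {
		byUUID[m.UUID] = m.Index
	}
	pairs := make(map[int]int, len(runtime))
	for _, r := range runtime {
		if mi, ok := byUUID[r.UUID]; ok {
			pairs[r.Index] = mi
		}
	}
	return pairs
}
```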
Daniel Hiltgen 05cd82ef94
Rename gpu package discover (#7143)
Cleaning up go package naming
2024-10-16 17:45:00 -07:00