ollama

Commit Graph

Author	SHA1	Message	Date
Debarshi Ray	1cd53b57f1	Merge `7536f697ab` into `bc71278670`	2025-10-07 15:40:58 +02:00
Daniel Hiltgen	918231931c	win: fix build script (#12513)	2025-10-06 14:46:45 -07:00
Daniel Hiltgen	55ca827267	AMD: block running on unsupported gfx900/gfx906 (#12481)	2025-10-02 16:53:05 -07:00
Daniel Hiltgen	bc8909fb38	Use runners for GPU discovery (#12090) This revamps how we discover GPUs in the system by leveraging the Ollama runner. This should eliminate inconsistency between our GPU discovery and the runner's capabilities at runtime, particularly for cases where we try to filter out unsupported GPUs. Now the runner does that implicitly based on the actual device list. In some cases free VRAM reporting can be unreliable, which can lead to scheduling mistakes, so this also includes a patch to leverage more reliable VRAM reporting libraries if available. Automatic workarounds have been removed as only one GPU leveraged this, which is now documented. This GPU will soon fall off the support matrix with the next ROCm bump. Additional cleanup of the scheduler and discovery packages can be done in the future once we have switched on the new memory management code, and removed support for the llama runner.	2025-10-01 15:12:32 -07:00
Debarshi Ray	7536f697ab	docs, scripts: Prevent useradd(8) from failing on Fedora Silverblue On OSTree based operating systems like Fedora Silverblue [1], the /usr/share directory is part of the read-only /usr mount point. This causes the install.sh script to fail when adding the 'ollama' user with its home directory at /usr/share/ollama, because useradd(8) is unable to create the directory: $ curl -fsSL https://ollama.com/install.sh | sh >>> Installing ollama to /usr/local >>> Downloading Linux amd64 bundle ############################################################### 100.0% >>> Creating ollama user... useradd: cannot create directory /usr/share/ollama >>> The Ollama API is now available at 127.0.0.1:11434. >>> Install complete. Run "ollama" from the command line. The /var/lib directory is an alternative for this, because /var is a read-write mount point. E.g., this is used by Geoclue [2] and the GNOME Display Manager [3] for their users' home directories on Linux distributions like Arch, Fedora and Ubuntu. With this change the install.sh script is able to proceed further: $ sh scripts/install.sh >>> Installing ollama to /usr/local >>> Downloading Linux amd64 bundle ############################################################### 100.0% >>> Creating ollama user... >>> Adding ollama user to render group... >>> Adding ollama user to video group... >>> Adding current user to ollama group... >>> Creating ollama systemd service... >>> Enabling and starting ollama service... Created symlink '/etc/systemd/system/default.target.wants/ollama.service' → '/etc/systemd/system/ollama.service'. >>> The Ollama API is now available at 127.0.0.1:11434. >>> Install complete. Run "ollama" from the command line. WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode. The install.sh script is able to use /usr/local on Fedora Silverblue, because /usr/local is not considered part of the read-only OS image, and is a symbolic link to /var/usrlocal to make it read-write.
[1] https://fedoraproject.org/silverblue/ [2] https://gitlab.freedesktop.org/geoclue/geoclue/-/wikis/home [3] https://wiki.gnome.org/Projects/GDM/ https://github.com/ollama/ollama/pull/12455	2025-09-30 22:26:20 +02:00
Daniel Hiltgen	0c3d0e7533	build: avoid unbounded parallel builds (#12319 ) With the addition of cuda v13, on a clean setup, the level of parallelism was causing docker desktop to become overwhelmed and compilers were crashing. This limits to 8 parallel per build stage, with the ability to override if you have many more cores available.	2025-09-18 14:57:01 -07:00
Daniel Hiltgen	17a023f34b	Add v12 + v13 cuda support (#12000 ) * Add support for upcoming NVIDIA Jetsons The latest Jetsons with JetPack 7 are moving to an SBSA compatible model and will not require building a JetPack specific variant. * cuda: bring back dual versions This adds back dual CUDA versions for our releases, with v11 and v13 to cover a broad set of GPUs and driver versions. * win: break up native builds in build_windows.ps1 * v11 build working on windows and linux * switch to cuda v12.8 not JIT * Set CUDA compression to size * enhance manual install linux docs	2025-09-10 12:05:18 -07:00
Daniel Hiltgen	405d2f628f	ci: rocm parallel builds on windows (#11187 ) The preset CMAKE_HIP_FLAGS isn't getting used on Windows. This passes the parallel flag in through the C/CXX flags, along with suppression for some log spew warnings to quiet down the build.	2025-06-24 15:27:09 -07:00
Daniel Hiltgen	1c6669e64c	Re-remove cuda v11 (#10694) * Re-remove cuda v11 Revert the revert - drop v11 support requiring drivers newer than Feb 23 This reverts commit `c6bcdc4223`. * Simplify layout With only one version of the GPU libraries, we can simplify things down somewhat. (Jetsons still require special handling) * distinct sbsa variant for linux arm64 This avoids accidentally trying to load the sbsa cuda libraries on a jetson system which results in crashes. * temporarily prevent rocm+cuda mixed loading	2025-06-23 14:07:00 -07:00
Daniel Hiltgen	c6bcdc4223	Revert "remove cuda v11 (#10569 )" (#10692 ) Bring back v11 until we can better warn users that their driver is too old. This reverts commit `fa393554b9`.	2025-05-13 13:12:54 -07:00
Daniel Hiltgen	fa393554b9	remove cuda v11 (#10569 ) This reduces the size of our Windows installer payloads by ~256M by dropping support for nvidia drivers older than Feb 2023. Hardware support is unchanged. Linux default bundle sizes are reduced by ~600M to 1G.	2025-05-06 17:33:19 -07:00
Daniel Hiltgen	6a74bba7e7	win: ensure ollama paths come first (#10549 ) For all search path env vars make sure our dirs are first to avoid potentially finding other incompatible libraries on the users system. Also fixes a minor build script glitch for windows rocm	2025-05-03 13:11:48 -07:00
Daniel Hiltgen	2d2247e59e	Align versions for local builds (#9635) Darwin was using a different pattern for the version string than linux or windows.	2025-03-14 15:44:08 -07:00
Daniel Hiltgen	4dcf80167a	Build release for windows with local script (#9636)	2025-03-11 08:34:20 -07:00
Michael Yang	ba7d31240e	fix: own lib/ollama directory expand backend loading error handling to catch more problems and log them instead of panicking	2025-03-03 13:01:18 -08:00
Daniel Hiltgen	688925aca9	Windows ARM build (#9120 ) * Windows ARM build Skip cmake, and note it's unused in the developer docs. * Win: only check for ninja when we need it On windows ARM, the cim lookup fails, but we don't need ninja anyway.	2025-02-27 09:02:25 -08:00
Daniel Hiltgen	e12af460ed	Add cuda Blackwell architecture for v12 (#9350 ) * Add cuda Blackwell architecture for v12 * Win: Split rocm out to separate zip file * Reduce CC matrix The 6.2 and 7.2 architectures only appear on Jetsons, so they were wasting space. The 5.0 should be forward compatible with 5.2 and 5.3.	2025-02-26 09:20:52 -08:00
Daniel Hiltgen	e91ae3d47d	Update ROCm (6.3 linux, 6.2 windows) and CUDA v12.8 (#9304 ) * Bump cuda and rocm versions Update ROCm to linux:6.3 win:6.2 and CUDA v12 to 12.8. Yum has some silent failure modes, so largely switch to dnf. * Fix windows build script	2025-02-25 13:47:36 -08:00
Michael Yang	1f766c36fb	ci: use windows-2022 to sign and bundle (#8941 ) ollama requires vcruntime140_1.dll which isn't found on 2019. previously the job used the windows runner (2019) but it explicitly installs 2022 to build the app. since the sign job doesn't actually build anything, it can use the windows-2022 runner instead.	2025-02-08 13:07:00 -08:00
Jeffrey Morgan	4759ecae19	ml/backend/ggml: fix library loading on macOS amd64 (#8827 )	2025-02-04 15:05:39 -08:00
Michael Yang	f9d2d89135	fix linux archive	2025-02-03 16:12:33 -08:00
Michael Yang	669dc31cf3	fix build	2025-02-03 15:10:51 -08:00
Michael Yang	e806184023	fix release workflow	2025-02-03 13:19:57 -08:00
Michael Yang	dcfb7a105c	next build (#8539) * add build to .dockerignore * test: only build one arch * add build to .gitignore * fix ccache path * filter amdgpu targets * only filter if autodetecting * Don't clobber gpu list for default runner This ensures the GPU specific environment variables are set properly * explicitly set CXX compiler for HIP * Update build_windows.ps1 This isn't complete, but is close. Dependencies are missing, and it only builds the "default" preset. * build: add ollama subdir * add .git to .dockerignore * docs: update development.md * update build_darwin.sh * remove unused scripts * llm: add cwd and build/lib/ollama to library paths * default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS * add additional cmake output vars for msvc * interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12 * remove unnecessary filepath.Dir, cleanup * add hardware-specific directory to path * use absolute server path * build: linux arm * cmake install targets * remove unused files * ml: visit each library path once * build: skip cpu variants on arm * build: install cpu targets * build: fix workflow * shorter names * fix rocblas install * docs: clean up development.md * consistent build dir removal in development.md * silence -Wimplicit-function-declaration build warnings in ggml-cpu * update readme * update development readme * llm: update library lookup logic now that there is one runner (#8587) * tweak development.md * update docs * add windows cuda/rocm tests --------- Co-authored-by: jmorganca <jmorganca@gmail.com> Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2025-01-29 15:03:38 -08:00
Jeffrey Morgan	a72f2dce45	scripts: sign renamed macOS binary (#8131 )	2024-12-17 18:03:49 -08:00
Daniel Hiltgen	8f805dd74b	darwin: restore multiple runners for x86 (#8125 ) In 0.5.2 we simplified packaging to have avx only for macos x86. It looks like there may still be some non-AVX systems out there, so this puts back the prior logic of building no-AVX for the primary binary, and now 2 runners for avx and avx2. These will be packaged in the App bundle only, so the stand-alone binary will now be without AVX support on macos. On arm, we'll also see these runners reported as available in the log, but they're dormant and will never be used at runtime.	2024-12-16 18:45:02 -08:00
Daniel Hiltgen	4879a234c4	build: Make target improvements (#7499 ) * llama: wire up builtin runner This adds a new entrypoint into the ollama CLI to run the cgo built runner. On Mac arm64, this will have GPU support, but on all other platforms it will be the lowest common denominator CPU build. After we fully transition to the new Go runners more tech-debt can be removed and we can stop building the "default" runner via make and rely on the builtin always. * build: Make target improvements Add a few new targets and help for building locally. This also adjusts the runner lookup to favor local builds, then runners relative to the executable, and finally payloads. * Support customized CPU flags for runners This implements a simplified custom CPU flags pattern for the runners. When built without overrides, the runner name contains the vector flag we check for (AVX) to ensure we don't try to run on unsupported systems and crash. If the user builds a customized set, we omit the naming scheme and don't check for compatibility. This avoids checking requirements at runtime, so that logic has been removed as well. This can be used to build GPU runners with no vector flags, or CPU/GPU runners with additional flags (e.g. AVX512) enabled. * Use relative paths If the user checks out the repo in a path that contains spaces, make gets really confused so use relative paths for everything in-repo to avoid breakage. * Remove payloads from main binary * install: clean up prior libraries This removes support for v0.3.6 and older versions (before the tar bundle) and ensures we clean up prior libraries before extracting the bundle(s). Without this change, runners and dependent libraries could leak when we update and lead to subtle runtime errors.	2024-12-10 09:47:19 -08:00
R0CKSTAR	b7bddeebc1	env.sh: cleanup unused RELEASE_IMAGE_REPO (#6855 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-11-21 08:28:04 -08:00
frob	e66c29261a	Better error suppression when getting terminal colours (#7739) Co-authored-by: Richard Lyons <frob@cloudstaff.com>	2024-11-19 08:33:52 -08:00
frob	5c18e66384	Notify the user if systemd is not running (#6693 ) Co-authored-by: Richard Lyons <frob@cloudstaff.com>	2024-11-18 15:02:41 -08:00
Daniel Hiltgen	4759d879f2	Install support for jetpacks (#7632) Follow up to #7217 - merge after release	2024-11-15 16:47:54 -08:00
Daniel Hiltgen	3a5239e6bf	Set macos min version for all architectures (#7579)	2024-11-08 09:27:04 -08:00
Daniel Hiltgen	b8d5036e33	CI: omit unused tools for faster release builds (#7432 ) This leverages caching, and some reduced installer scope to try to speed up builds. It also tidies up some windows build logic that was only relevant for the older generate/cmake builds.	2024-11-02 13:56:54 -07:00
Daniel Hiltgen	b754f5a6a3	Remove submodule and shift to Go server - 0.4.0 (#7157 ) * Remove llama.cpp submodule and shift new build to top * CI: install msys and clang gcc on win Needed for deepseek to work properly on windows	2024-10-30 10:34:28 -07:00
Daniel Hiltgen	f86d00cd95	llama: add compiler tags for cpu features (#7137 ) This adds the ability to customize the default runner with user specified flags	2024-10-17 13:43:20 -07:00
Daniel Hiltgen	7d6eb0d4c3	Move macos v11 support flags to build script (#7203) Having v11 support hard-coded into the cgo settings causes warnings for newer Xcode versions. This should help keep the build clean for users building from source with the latest tools, while still allowing us to target the older OS via our CI processes.	2024-10-16 12:49:46 -07:00
Jeffrey Morgan	96efd9052f	Re-introduce the `llama` package (#5034 ) * Re-introduce the llama package This PR brings back the llama package, making it possible to call llama.cpp and ggml APIs from Go directly via CGo. This has a few advantages: - C APIs can be called directly from Go without needing to use the previous "server" REST API - On macOS and for CPU builds on Linux and Windows, Ollama can be built without a go generate ./... step, making it easy to get up and running to hack on parts of Ollama that don't require fast inference - Faster build times for AVX,AVX2,CUDA and ROCM (a full build of all runners takes <5 min on a fast CPU) - No git submodule making it easier to clone and build from source This is a big PR, but much of it is vendor code except for: - llama.go CGo bindings - example/: a simple example of running inference - runner/: a subprocess server designed to replace the llm/ext_server package - Makefile an as minimal as possible Makefile to build the runner package for different targets (cpu, avx, avx2, cuda, rocm) Co-authored-by: Jesse Gross <jesse@ollama.com> Co-authored-by: Daniel Hiltgen <daniel@ollama.com> * cache: Clear old KV cache entries when evicting a slot When forking a cache entry, if no empty slots are available we evict the least recently used one and copy over the KV entries from the closest match. However, this copy does not overwrite existing values but only adds new ones. Therefore, we need to clear the old slot first. 
This change fixes two issues: - The KV cache fills up and runs out of space even though we think we are managing it correctly - Performance gets worse over time as we use new cache entries that are not hot in the processor caches * doc: explain golang objc linker warning (#6830) * llama: gather transitive dependencies for rocm for dist packaging (#6848) * Refine go server makefiles to be more DRY (#6924) This breaks up the monolithic Makefile for the Go based runners into a set of utility files as well as recursive Makefiles for the runners. Files starting with the name "Makefile" are buildable, while files that end with ".make" are utilities to include in other Makefiles. This reduces the amount of nearly identical targets and helps set a pattern for future community contributions for new GPU runner architectures. When we are ready to switch over to the Go runners, these files should move to the top of the repo, and we should add targets for the main CLI, as well as a helper "install" (put all the built binaries on the local system in a runnable state) and "dist" target (generate the various tar/zip files for distribution) for local developer use. * llama: don't create extraneous directories (#6988) * llama: Exercise the new build in CI (#6989) Wire up some basic sanity testing in CI for the Go runner. GPU runners are not covered yet. * llama: Refine developer docs for Go server (#6842) This enhances the documentation for development focusing on the new Go server. After we complete the transition further doc refinements can remove the "transition" discussion. * runner.go: Allocate batches for all sequences during init We should tell the model that we could have full batches for all sequences. We already do this when we allocate the batches but it was missed during initialization. * llama.go: Don't return nil from Tokenize on zero length input Potentially receiving nil in a non-error condition is surprising to most callers - it's better to return an empty slice. 
* runner.go: Remove stop tokens from cache If the last token is EOG then we don't return this and it isn't present in the cache (because it was never submitted to Decode). This works well for extending the cache entry with a new sequence. However, for multi-token stop sequences, we won't return any of the tokens but all but the last one will be in the cache. This means when the conversation continues the cache will contain tokens that don't overlap with the new prompt. This works (we will pick up the portion where there is overlap) but it causes unnecessary cache thrashing because we will fork the original cache entry as it is not a perfect match. By trimming the cache to the tokens that we actually return this issue can be avoided. * runner.go: Simplify flushing of pending tokens * runner.go: Update TODOs * runner.go: Don't panic when processing sequences If there is an error processing a sequence, we should return a clean HTTP error back to Ollama rather than panicking. This will make us more resilient to transient failures. Panics can still occur during startup as there is no way to serve requests if that fails. Co-authored-by: jmorganca <jmorganca@gmail.com> * runner.go: More accurately capture timings Currently prompt processing time doesn't capture the time that it takes to tokenize the input, only decoding time. We should capture the full process to more accurately reflect reality. This is especially true once we start processing images where the initial processing can take significant time. This is also more consistent with the existing C++ runner. * runner.go: Support for vision models In addition to bringing feature parity with the C++ runner, this also incorporates several improvements: - Cache prompting works with images, avoiding the need to re-decode embeddings for every message in a conversation - Parallelism is supported, avoiding the need to restrict to one sequence at a time. 
(Though for now Ollama will not schedule them while we might need to fall back to the old runner.) Co-authored-by: jmorganca <jmorganca@gmail.com> * runner.go: Move Unicode checking code and add tests * runner.go: Export external cache members Runner and cache are in the same package so the change doesn't affect anything but it is more internally consistent. * runner.go: Image embedding cache Generating embeddings from images can take significant time (on my machine between 100ms and 8s depending on the model). Although we already cache the result of decoding these images, the embeddings need to be regenerated every time. This is not necessary if we get the same image over and over again, for example, during a conversation. This currently uses a very small cache with a very simple algorithm but it is easy to improve as is warranted. * llama: catch up on patches Carry forward solar-pro and cli-unicode patches * runner.go: Don't re-allocate memory for every batch We can reuse memory allocated from batch to batch since batch size is fixed. This both saves the cost of reallocation as well keeps the cache lines hot. This results in a roughly 1% performance improvement for token generation with Nvidia GPUs on Linux. * runner.go: Default to classic input cache policy The input cache as part of the go runner implemented a cache policy that aims to maximize hit rate in both single and multi- user scenarios. When there is a cache hit, the response is very fast. However, performance is actually slower when there is an input cache miss due to worse GPU VRAM locality. This means that performance is generally better overall for multi-user scenarios (better input cache hit rate, locality was relatively poor already). But worse for single users (input cache hit rate is about the same, locality is now worse). This defaults the policy back to the old one to avoid a regression but keeps the new one available through an environment variable OLLAMA_MULTIUSER_CACHE. 
This is left undocumented as the goal is to improve this in the future to get the best of both worlds without user configuration. For inputs that result in cache misses, on Nvidia/Linux this change improves performance by 31% for prompt processing and 13% for token generation. * runner.go: Increase size of response channel Generally the CPU can easily keep up with handling responses that are generated but there's no reason not to let generation continue and handle things in larger batches if needed. * llama: Add CI to verify all vendored changes have patches (#7066) Make sure we don't accidentally merge changes in the vendored code that aren't also reflected in the patches. * llama: adjust clip patch for mingw utf-16 (#7065) * llama: adjust clip patch for mingw utf-16 * llama: ensure static linking of runtime libs Avoid runtime dependencies on non-standard libraries * runner.go: Enable llamafile (all platforms) and BLAS (Mac OS) These are two features that are shown on llama.cpp's system info that are currently different between the two runners. On my test systems the performance difference is very small to negligible but it is probably still good to equalize the features. * llm: Don't add BOS/EOS for tokenize requests This is consistent with what server.cpp currently does. It affects things like token processing counts for embedding requests. * runner.go: Don't cache prompts for embeddings Our integration with server.cpp implicitly disables prompt caching because it is not part of the JSON object being parsed, this makes the Go runner behave similarly. Prompt caching has been seen to affect the results of text completions on certain hardware. The results are not wrong either way but they are non-deterministic. However, embeddings seem to be affected even on hardware that does not show this behavior for completions. For now, it is best to maintain consistency with the existing behavior. 
* runner.go: Adjust debug log levels Add system info printed at startup and quiet down noisier logging. * llama: fix compiler flag differences (#7082) Adjust the flags for the new Go server to more closely match the generate flow * llama: refine developer docs (#7121) * llama: doc and example clean up (#7122) * llama: doc and example clean up * llama: Move new dockerfile into llama dir Temporary home until we fully transition to the Go server * llama: runner doc cleanup * llama.go: Add description for Tokenize error case --------- Co-authored-by: Jesse Gross <jesse@ollama.com> Co-authored-by: Daniel Hiltgen <daniel@ollama.com> Co-authored-by: Daniel Hiltgen <dhiltgen@users.noreply.github.com>	2024-10-08 08:53:54 -07:00
Daniel Hiltgen	616c5eafee	CI: win arm adjustments (#6898 )	2024-09-20 16:58:56 -07:00
Daniel Hiltgen	d632e23fba	Add Windows arm64 support to official builds (#5712) * Unified arm/x86 windows installer This adjusts the installer payloads to be architecture aware so we can carry both amd64 and arm64 binaries in the installer, and install only the applicable architecture at install time. * Include arm64 in official windows build * Harden schedule test for slow windows timers This test seems to be a bit flaky on windows, so give it more time to converge	2024-09-20 13:09:38 -07:00
Daniel Hiltgen	7717bb6a84	CI: clean up naming, fix tagging latest (#6832 ) The rocm CI step for RCs was incorrectly tagging them as the latest rocm build. The multiarch manifest was incorrectly tagged twice (with and without the prefix "v"). Static windows artifacts weren't being carried between build jobs. This also fixes the latest tagging script.	2024-09-16 16:18:41 -07:00
Daniel Hiltgen	cd5c8f6471	Optimize container images for startup (#6547 ) * Optimize container images for startup This change adjusts how to handle runner payloads to support container builds where we keep them extracted in the filesystem. This makes it easier to optimize the cpu/cuda vs cpu/rocm images for size, and should result in faster startup times for container images. * Refactor payload logic and add buildx support for faster builds * Move payloads around * Review comments * Converge to buildx based helper scripts * Use docker buildx action for release	2024-09-12 12:10:30 -07:00
王卿	c7c845ec52	Update install.sh: Replace "command -v" with encapsulated functionality (#6035) Replace "command -v" with encapsulated functionality	2024-09-05 09:49:48 -07:00
Erkin Alp Güney	7d89e48f5c	install.sh: update instructions to use WSL2 (#6450 )	2024-09-04 09:34:53 -04:00
Daniel Hiltgen	a1cef4d0a5	Add findutils to base images (#6581 ) This caused missing internal files	2024-08-31 10:40:05 -07:00
Daniel Hiltgen	93ea9240ae	Move ollama executable out of bin dir (#6535 )	2024-08-27 16:19:00 -07:00
Daniel Hiltgen	a017cf2fea	Split rocm back out of bundle (#6432 ) We're over budget for github's maximum release artifact size with rocm + 2 cuda versions. This splits rocm back out as a discrete artifact, but keeps the layout so it can be extracted into the same location as the main bundle.	2024-08-20 07:26:38 -07:00
Daniel Hiltgen	88bb9e3328	Adjust layout to bin+lib/ollama	2024-08-19 09:38:53 -07:00
Daniel Hiltgen	927d98a6cd	Add windows cuda v12 + v11 support	2024-08-19 09:38:53 -07:00
Daniel Hiltgen	d470ebe78b	Add Jetson cuda variants for arm This adds new variants for arm64 specific to Jetson platforms	2024-08-19 09:38:53 -07:00
Daniel Hiltgen	c7bcb00319	Wire up ccache and pigz in the docker based build This should help speed things up a little	2024-08-19 09:38:53 -07:00


178 Commits