Commit Graph

145 Commits

Author SHA1 Message Date
Jeffrey Morgan 96efd9052f
Re-introduce the `llama` package (#5034)
* Re-introduce the llama package

This PR brings back the llama package, making it possible to call llama.cpp and
ggml APIs from Go directly via CGo. This has a few advantages:

- C APIs can be called directly from Go without needing to use the previous
  "server" REST API
- On macOS and for CPU builds on Linux and Windows, Ollama can be built without
  a go generate ./... step, making it easy to get up and running to hack on
  parts of Ollama that don't require fast inference
- Faster build times for AVX,AVX2,CUDA and ROCM (a full build of all runners
  takes <5 min on a fast CPU)
- No git submodule making it easier to clone and build from source

This is a big PR, but much of it is vendor code except for:

- llama.go CGo bindings
- example/: a simple example of running inference
- runner/: a subprocess server designed to replace the llm/ext_server package
- Makefile an as minimal as possible Makefile to build the runner package for
  different targets (cpu, avx, avx2, cuda, rocm)

Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>

* cache: Clear old KV cache entries when evicting a slot

When forking a cache entry, if no empty slots are available we
evict the least recently used one and copy over the KV entries
from the closest match. However, this copy does not overwrite
existing values but only adds new ones. Therefore, we need to
clear the old slot first.

This change fixes two issues:
 - The KV cache fills up and runs out of space even though we think
   we are managing it correctly
 - Performance gets worse over time as we use new cache entries that
   are not hot in the processor caches

* doc: explain golang objc linker warning (#6830)

* llama: gather transitive dependencies for rocm for dist packaging (#6848)

* Refine go server makefiles to be more DRY (#6924)

This breaks up the monolithic Makefile for the Go based runners into a
set of utility files as well as recursive Makefiles for the runners.
Files starting with the name "Makefile" are buildable, while files that
end with ".make" are utilities to include in other Makefiles.  This
reduces the amount of nearly identical targets and helps set a pattern
for future community contributions for new GPU runner architectures.

When we are ready to switch over to the Go runners, these files should
move to the top of the repo, and we should add targets for the main CLI,
as well as a helper "install" (put all the built binaries on the local
system in a runnable state) and "dist" target (generate the various
tar/zip files for distribution) for local developer use.

* llama: don't create extraneous directories (#6988)

* llama: Exercise the new build in CI (#6989)

Wire up some basic sanity testing in CI for the Go runner.  GPU runners are not covered yet.

* llama: Refine developer docs for Go server (#6842)

This enhances the documentation for development focusing on the new Go
server.  After we complete the transition further doc refinements
can remove the "transition" discussion.

* runner.go: Allocate batches for all sequences during init

We should tell the model that we could have full batches for all
sequences. We already do this when we allocate the batches but it was
missed during initialization.

* llama.go: Don't return nil from Tokenize on zero length input

Potentially receiving nil in a non-error condition is surprising to
most callers - it's better to return an empty slice.

* runner.go: Remove stop tokens from cache

If the last token is EOG then we don't return this and it isn't
present in the cache (because it was never submitted to Decode).
This works well for extending the cache entry with a new sequence.

However, for multi-token stop sequences, we won't return any of the
tokens but all but the last one will be in the cache. This means
when the conversation continues the cache will contain tokens that
don't overlap with the new prompt.

This works (we will pick up the portion where there is overlap) but
it causes unnecessary cache thrashing because we will fork the original
cache entry as it is not a perfect match.

By trimming the cache to the tokens that we actually return this
issue can be avoided.

* runner.go: Simplify flushing of pending tokens

* runner.go: Update TODOs

* runner.go: Don't panic when processing sequences

If there is an error processing a sequence, we should return a
clean HTTP error back to Ollama rather than panicing. This will
make us more resilient to transient failures.

Panics can still occur during startup as there is no way to serve
requests if that fails.

Co-authored-by: jmorganca <jmorganca@gmail.com>

* runner.go: More accurately capture timings

Currently prompt processing time doesn't capture the that it takes
to tokenize the input, only decoding time. We should capture the
full process to more accurately reflect reality. This is especially
true once we start processing images where the initial processing
can take significant time. This is also more consistent with the
existing C++ runner.

* runner.go: Support for vision models

In addition to bringing feature parity with the C++ runner, this also
incorporates several improvements:
 - Cache prompting works with images, avoiding the need to re-decode
   embeddings for every message in a conversation
 - Parallelism is supported, avoiding the need to restrict to one
   sequence at a time. (Though for now Ollama will not schedule
   them while we might need to fall back to the old runner.)

Co-authored-by: jmorganca <jmorganca@gmail.com>

* runner.go: Move Unicode checking code and add tests

* runner.go: Export external cache members

Runner and cache are in the same package so the change doesn't
affect anything but it is more internally consistent.

* runner.go: Image embedding cache

Generating embeddings from images can take significant time (on
my machine between 100ms and 8s depending on the model). Although
we already cache the result of decoding these images, the embeddings
need to be regenerated every time. This is not necessary if we get
the same image over and over again, for example, during a conversation.

This currently uses a very small cache with a very simple algorithm
but it is easy to improve as is warranted.

* llama: catch up on patches

Carry forward solar-pro and cli-unicode patches

* runner.go: Don't re-allocate memory for every batch

We can reuse memory allocated from batch to batch since batch
size is fixed. This both saves the cost of reallocation as well
keeps the cache lines hot.

This results in a roughly 1% performance improvement for token
generation with Nvidia GPUs on Linux.

* runner.go: Default to classic input cache policy

The input cache as part of the go runner implemented a cache
policy that aims to maximize hit rate in both single and multi-
user scenarios. When there is a cache hit, the response is
very fast.

However, performance is actually slower when there is an input
cache miss due to worse GPU VRAM locality. This means that
performance is generally better overall for multi-user scenarios
(better input cache hit rate, locality was relatively poor already).
But worse for single users (input cache hit rate is about the same,
locality is now worse).

This defaults the policy back to the old one to avoid a regression
but keeps the new one available through an environment variable
OLLAMA_MULTIUSER_CACHE. This is left undocumented as the goal is
to improve this in the future to get the best of both worlds
without user configuration.

For inputs that result in cache misses, on Nvidia/Linux this
change improves performance by 31% for prompt processing and
13% for token generation.

* runner.go: Increase size of response channel

Generally the CPU can easily keep up with handling reponses that
are generated but there's no reason not to let generation continue
and handle things in larger batches if needed.

* llama: Add CI to verify all vendored changes have patches (#7066)

Make sure we don't accidentally merge changes in the vendored code
that aren't also reflected in the patches.

* llama: adjust clip patch for mingw utf-16 (#7065)

* llama: adjust clip patch for mingw utf-16

* llama: ensure static linking of runtime libs

Avoid runtime dependencies on non-standard libraries

* runner.go: Enable llamafile (all platforms) and BLAS (Mac OS)

These are two features that are shown on llama.cpp's system info
that are currently different between the two runners. On my test
systems the performance difference is very small to negligible
but it is probably still good to equalize the features.

* llm: Don't add BOS/EOS for tokenize requests

This is consistent with what server.cpp currently does. It affects
things like token processing counts for embedding requests.

* runner.go: Don't cache prompts for embeddings

Our integration with server.cpp implicitly disables prompt caching
because it is not part of the JSON object being parsed, this makes
the Go runner behavior similarly.

Prompt caching has been seen to affect the results of text completions
on certain hardware. The results are not wrong either way but they
are non-deterministic. However, embeddings seem to be affected even
on hardware that does not show this behavior for completions. For
now, it is best to maintain consistency with the existing behavior.

* runner.go: Adjust debug log levels

Add system info printed at startup and quiet down noisier logging.

* llama: fix compiler flag differences (#7082)

Adjust the flags for the new Go server to more closely match the
generate flow

* llama: refine developer docs (#7121)

* llama: doc and example clean up (#7122)

* llama: doc and example clean up

* llama: Move new dockerfile into llama dir

Temporary home until we fully transition to the Go server

* llama: runner doc cleanup

* llama.go: Add description for Tokenize error case

---------

Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Daniel Hiltgen <dhiltgen@users.noreply.github.com>
2024-10-08 08:53:54 -07:00
Daniel Hiltgen e9e9bdb8d9
CI: Fix win arm version defect (#6940)
write-host in powershell writes directly to the console and will not be picked
up by a pipe.  Echo, or write-output will.
2024-09-24 15:18:10 -07:00
Daniel Hiltgen 2a038c1d7e
CI: win arm artifact dist dir (#6900)
The upload artifact is missing the dist prefix since all
payloads are in the same directory, so restore the prefix
on download.
2024-09-20 19:16:18 -07:00
Daniel Hiltgen 616c5eafee
CI: win arm adjustments (#6898) 2024-09-20 16:58:56 -07:00
Daniel Hiltgen f5ff917b1d
CI: adjust step ordering for win arm to match x64 (#6895) 2024-09-20 14:20:57 -07:00
Daniel Hiltgen d632e23fba
Add Windows arm64 support to official builds (#5712)
* Unified arm/x86 windows installer

This adjusts the installer payloads to be architecture aware so we can cary
both amd64 and arm64 binaries in the installer, and install only the applicable
architecture at install time.

* Include arm64 in official windows build

* Harden schedule test for slow windows timers

This test seems to be a bit flaky on windows, so give it more time to converge
2024-09-20 13:09:38 -07:00
Daniel Hiltgen 8f9ab5e14d
CI: dist directories no longer present (#6834)
The new buildx based build no longer leaves the dist/linux-* directories
around, so we don't have to clean them up before uploading.
2024-09-16 17:31:37 -07:00
Daniel Hiltgen 7717bb6a84
CI: clean up naming, fix tagging latest (#6832)
The rocm CI step for RCs was incorrectly tagging them as the latest rocm build.
The multiarch manifest was incorrectly tagged twice (with and without the
prefix "v").  Static windows artifacts weren't being carried between build
jobs.  This also fixes the latest tagging script.
2024-09-16 16:18:41 -07:00
Daniel Hiltgen 0ec2915ea7
CI: set platform build build_linux script to keep buildx happy (#6829)
The runners don't have emulation set up so the default multi-platform build
wont work.
2024-09-16 14:07:29 -07:00
Daniel Hiltgen cd5c8f6471
Optimize container images for startup (#6547)
* Optimize container images for startup

This change adjusts how to handle runner payloads to support
container builds where we keep them extracted in the filesystem.
This makes it easier to optimize the cpu/cuda vs cpu/rocm images for
size, and should result in faster startup times for container images.

* Refactor payload logic and add buildx support for faster builds

* Move payloads around

* Review comments

* Converge to buildx based helper scripts

* Use docker buildx action for release
2024-09-12 12:10:30 -07:00
Daniel Hiltgen a017cf2fea
Split rocm back out of bundle (#6432)
We're over budget for github's maximum release artifact size with rocm + 2 cuda
versions.  This splits rocm back out as a discrete artifact, but keeps the layout so it can
be extracted into the same location as the main bundle.
2024-08-20 07:26:38 -07:00
Daniel Hiltgen 19e5a890f7
CI: remove directories from dist dir before upload step (#6429) 2024-08-19 15:19:21 -07:00
Daniel Hiltgen f91c9e3709
CI: handle directories during checksum (#6427) 2024-08-19 13:48:45 -07:00
Daniel Hiltgen d8be22e47d Fix overlapping artifact name on CI 2024-08-19 12:07:18 -07:00
Daniel Hiltgen f9e31da946 Review comments 2024-08-19 10:36:15 -07:00
Daniel Hiltgen 927d98a6cd Add windows cuda v12 + v11 support 2024-08-19 09:38:53 -07:00
Daniel Hiltgen 74d45f0102 Refactor linux packaging
This adjusts linux to follow a similar model to windows with a discrete archive
(zip/tgz) to cary the primary executable, and dependent libraries. Runners are
still carried as payloads inside the main binary

Darwin retain the payload model where the go binary is fully self contained.
2024-08-19 09:38:53 -07:00
Daniel Hiltgen feedf49c71 Go back to a pinned Go version
Go version 1.22.6 is triggering AV false positives, so go back to 1.22.5
2024-08-13 11:45:44 -07:00
Michael Yang b732beba6a lint 2024-08-01 17:06:06 -07:00
Daniel Hiltgen 5d604eec5b Bump Go patch version 2024-07-22 16:16:28 -07:00
Daniel Hiltgen 224337b32f Bump linux ROCm to 6.1.2 2024-07-15 15:10:22 -07:00
Daniel Hiltgen 1f50356e8e Bump ROCm on windows to 6.1.2
This also adjusts our algorithm to favor our bundled ROCm.
I've confirmed VRAM reporting still doesn't work properly so we
can't yet enable concurrency by default.
2024-07-10 11:01:22 -07:00
Daniel Hiltgen b51e3b63ac Statically link c++ and thread lib
This makes sure we statically link the c++ and thread library on windows
to avoid unnecessary runtime dependencies on non-standard DLLs
2024-07-09 11:34:30 -07:00
jmorganca c12f1c5b99 release: move mingw library cleanup to correct job 2024-07-06 16:12:29 -04:00
jmorganca a08f20d910 release: remove unwanted mingw dll.a files 2024-07-06 15:21:15 -04:00
jmorganca 6cea036027 Revert "llm: only statically link libstdc++"
This reverts commit 5796bfc401.
2024-07-06 15:10:48 -04:00
jmorganca 5796bfc401 llm: only statically link libstdc++ 2024-07-06 14:06:20 -04:00
Daniel Hiltgen 9d30f9f8b3 Always go build in CI generate steps
With the recent cgo changes, bugs can sneak through
if we don't make sure to `go build` all the permutations
2024-07-05 15:31:52 -07:00
Daniel Hiltgen a12283e2ff Implement custom github release action
This implements the release logic we want via gh cli
to support updating releases with rc tags in place and retain
release notes and other community reactions.
2024-06-15 11:36:56 -07:00
Daniel Hiltgen 26ab67732b Bump ROCm linux to 6.1.1 2024-06-14 15:37:54 -07:00
Michael Yang 6297f85606 gofmt, goimports 2024-06-04 13:20:24 -07:00
Jeffrey Morgan 7ca9605f54
speed up tests by only building static lib (#4740) 2024-05-30 21:43:15 -07:00
Michael Yang 98085015d5 only generate on relevant changes 2024-05-30 16:54:11 -07:00
Jeffrey Morgan afd2b058b4
set codesign timeout to longer (#4605) 2024-05-23 22:46:23 -07:00
Michael Yang 4d4f75a8a8
Revert "fix golangci workflow missing gofmt and goimports (#4190)"
This reverts commit 04f971c84b.
2024-05-07 10:35:44 -07:00
alwqx 04f971c84b
fix golangci workflow missing gofmt and goimports (#4190) 2024-05-07 09:49:40 -07:00
Michael Yang 7fea1ecdf6
Merge pull request #3958 from ollama/mxyng/fix-workflow
use merge base for diff-tree
2024-04-26 14:17:56 -07:00
Blake Mizerany 054894271d
.github/workflows/test.yaml: add in-flight cancellations on new push (#3956)
Also, remove a superfluous 'go get'
2024-04-26 13:54:24 -07:00
Michael Yang 6fef042f0b use merge base for diff-tree 2024-04-26 13:54:15 -07:00
Daniel Hiltgen 8feb97dc0d Move cuda/rocm dependency gathering into generate script
This will make it simpler for CI to accumulate artifacts from prior steps
2024-04-25 22:38:44 -07:00
Daniel Hiltgen 8589d752ac Fix release CI
download-artifact path was being used incorrectly.  It is where to
extract the zip not the files in the zip to extract.  Default is
workspace dir which is what we want, so omit it
2024-04-25 17:27:11 -07:00
Daniel Hiltgen 058f6cd2cc Move nested payloads to installer and zip file on windows
Now that the llm runner is an executable and not just a dll, more users are facing
problems with security policy configurations on windows that prevent users
writing to directories and then executing binaries from the same location.
This change removes payloads from the main executable on windows and shifts them
over to be packaged in the installer and discovered based on the executables location.
This also adds a new zip file for people who want to "roll their own" installation model.
2024-04-23 16:14:47 -07:00
Daniel Hiltgen 939d6a8606 Make CI lint verbvose 2024-04-23 10:17:42 -07:00
jmorganca c8afe7168c use correct extension for feature and model request issue templates 2024-04-17 15:18:40 -04:00
jmorganca 28d3cd0148 simpler feature and model request forms 2024-04-17 15:17:08 -04:00
jmorganca eb5554232a simpler feature and model request forms 2024-04-17 15:14:49 -04:00
jmorganca 2bdc320216 add descriptions to issue templates 2024-04-17 15:08:36 -04:00
jmorganca 32561aed09 simplify github issue templates a bit 2024-04-17 15:07:03 -04:00
Michael Yang 2b4ca6cf36 fix ci 2024-04-10 11:35:12 -07:00
Blake Mizerany 1524f323a3
Revert "build.go: introduce a friendlier way to build Ollama (#3548)" (#3564) 2024-04-09 15:57:45 -07:00
Blake Mizerany fccf3eecaa
build.go: introduce a friendlier way to build Ollama (#3548)
This commit introduces a more friendly way to build Ollama dependencies
and the binary without abusing `go generate` and removing the
unnecessary extra steps it brings with it.

This script also provides nicer feedback to the user about what is
happening during the build process.

At the end, it prints a helpful message to the user about what to do
next (e.g. run the new local Ollama).
2024-04-09 14:18:47 -07:00
Michael Yang cb8352d6b4 ci: use go-version-file 2024-04-09 09:50:12 -07:00
Jeffrey Morgan b0e7d35db8
use an older version of the mac os sdk in release (#3484) 2024-04-04 09:48:54 -07:00
Daniel Hiltgen 883ec4d1ef CI missing archive 2024-04-04 07:23:27 -07:00
Daniel Hiltgen 08600d5bec CI subprocess path fix 2024-04-03 19:12:53 -07:00
Daniel Hiltgen e4a7e5b2ca Fix CI release glitches
The subprocess change moved the build directory
arm64 builds weren't setting cross-compilation flags when building on x86
2024-04-03 16:41:40 -07:00
Jeffrey Morgan cd135317d2
Fix macOS builds on older SDKs (#3467) 2024-04-03 10:45:54 -07:00
Daniel Hiltgen 841adda157 Fix windows lint CI flakiness 2024-04-02 12:22:16 -07:00
Daniel Hiltgen 58d95cc9bd Switch back to subprocessing for llama.cpp
This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems.  This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.
2024-04-01 16:48:18 -07:00
Michael Yang 1ec0df1069 fix generate output 2024-04-01 13:47:34 -07:00
Jeffrey Morgan 06a1508bfe
Update 90_bug_report.yml 2024-03-29 10:11:17 -04:00
Daniel Hiltgen 97ae517fbf
Merge pull request #3398 from dhiltgen/release_latest
CI automation for tagging latest images
2024-03-28 16:25:54 -07:00
Daniel Hiltgen 44b813e459
Merge pull request #3377 from dhiltgen/rocm_v6_bump
Bump ROCm to 6.0.2 patch release
2024-03-28 16:07:54 -07:00
Daniel Hiltgen 539043f5e0 CI automation for tagging latest images 2024-03-28 16:07:37 -07:00
Daniel Hiltgen c91a4ebcff Bump ROCm to 6.0.2 patch release 2024-03-28 15:58:57 -07:00
Daniel Hiltgen b79c7e4528 CI windows gpu builds
If we're doing generate, test windows cuda and rocm as well
2024-03-28 14:39:10 -07:00
Michael Yang 5255d0af8a fix: workflows 2024-03-27 16:30:01 -07:00
Michael Yang 8838ae787d stub stub 2024-03-27 13:59:12 -07:00
Michael Yang db75402ade mangle arch 2024-03-27 13:44:50 -07:00
Michael Yang 1e85a140a3 only generate on changes to llm subdirectory 2024-03-27 12:45:35 -07:00
Michael Yang 5b0c48d29e only generate cuda/rocm when changes to llm detected 2024-03-27 12:23:09 -07:00
Daniel Hiltgen f83e4db365 Switch runner for final release job
The manifest and tagging step use a lot of disk space
2024-03-25 20:51:40 -07:00
Daniel Hiltgen 540f4af45f Wire up more complete CI for releases
Flesh out our github actions CI so we can build official releaes.
2024-03-15 12:37:36 -07:00
Blake Mizerany 8546dd3d72
.github: fix model and feature request yml (#3155) 2024-03-14 15:26:06 -07:00
Blake Mizerany 87100be5e0
.github: add issue templates (#3143) 2024-03-14 15:19:10 -07:00
Michael Yang 2cb74e23fb fix ci 2024-03-07 11:33:49 -08:00
Daniel Hiltgen 3c8df3808b
Merge pull request #2885 from dhiltgen/rocm_v6_only
Revamp ROCm support
2024-03-07 10:51:00 -08:00
Michael Yang 72431031d9 no ci test on docs, examples 2024-03-07 10:44:48 -08:00
Daniel Hiltgen 6c5ccb11f9 Revamp ROCm support
This refines where we extract the LLM libraries to by adding a new
OLLAMA_HOME env var, that defaults to `~/.ollama` The logic was already
idempotenent, so this should speed up startups after the first time a
new release is deployed.  It also cleans up after itself.

We now build only a single ROCm version (latest major) on both windows
and linux.  Given the large size of ROCms tensor files, we split the
dependency out.  It's bundled into the installer on windows, and a
separate download on windows.  The linux install script is now smart and
detects the presence of AMD GPUs and looks to see if rocm v6 is already
present, and if not, then downloads our dependency tar file.

For Linux discovery, we now use sysfs and check each GPU against what
ROCm supports so we can degrade to CPU gracefully instead of having
llama.cpp+rocm assert/crash on us.  For Windows, we now use go's windows
dynamic library loading logic to access the amdhip64.dll APIs to query
the GPU information.
2024-03-07 10:36:50 -08:00
Jeffrey Morgan d481fb3cc8
update go to 1.22 in other places (#2975) 2024-03-07 07:39:49 -08:00
Michael Yang 46c847c4ad enable rocm builds 2024-02-06 13:36:13 -08:00
Michael Yang 92b1a21f79 use linux runners 2024-02-06 13:36:04 -08:00
Michael Yang f06b99a461 disable rocm builds 2024-02-06 09:29:42 -08:00
Michael Yang a8c5413d06 only generate gpu libs 2024-01-25 15:41:56 -08:00
Michael Yang 5580de4571 archive ollama binaries 2024-01-25 15:40:16 -08:00
Michael Yang 946431d5b0 build cuda and rocm 2024-01-25 15:40:15 -08:00
Michael Yang 0610126049 remove env setting 2024-01-25 15:39:43 -08:00
Michael Yang 8e5d359a03 stub generate outputs for lint 2024-01-24 17:36:10 -08:00
Michael Yang e299831e2c
Merge pull request #1958 from purificant/ci
ci: update setup-go action
2024-01-18 14:53:36 -08:00
Daniel Hiltgen ecbfc0182f Go bump to v1.21 to pick up slog 2024-01-18 14:12:57 -08:00
Daniel Hiltgen b992bf65fc Disable arm64 for test phase
The runners are x86 so we can only run binaries that match.
2024-01-17 19:26:13 -08:00
Daniel Hiltgen 1b249748ab Add multiple CPU variants for Intel Mac
This also refines the build process for the ext_server build.
2024-01-17 15:08:54 -08:00
Daniel Hiltgen b3035112a1 Add macos cross-compile CI coverage 2024-01-14 10:38:59 -08:00
purificant 6a5bfc2ed6 update actions/setup-go 2024-01-12 22:27:25 +00:00
Michael Yang 997253143f add lint and test on pull_request 2024-01-09 09:36:58 -08:00