jmorganca b42970063d kvcache: Add check for values that fall out of sliding window cache

The sliding window cache trims entries that are outside the window for the latest token. This works when we are extending the cache, such as when the conversation continues. However, if we have a partial overlap in conversation (including the BOS tokens), then we resume from a past point in the conversation and the needed tokens are no longer stored in memory. This verifies that the new window overlaps with the old one before reusing the cache.

Co-authored-by: Jesse Gross <jesse@ollama.com>

2025-04-02 11:55:48 -07:00
common	Runner for Ollama engine	2025-02-13 17:09:26 -08:00
llamarunner	runner: clear cache when shift is not possible (#9433)	2025-03-31 12:54:45 -07:00
ollamarunner	kvcache: Add check for values that fall out of sliding window cache	2025-04-02 11:55:48 -07:00
README.md	Runner for Ollama engine	2025-02-13 17:09:26 -08:00
runner.go	Runner for Ollama engine	2025-02-13 17:09:26 -08:00

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding