mirror of https://github.com/ollama/ollama.git
This changes the memory allocation strategy from upfront estimation to tracking the actual allocations made by the engine and reacting to them. The goal is to avoid issues caused by both under-estimation (crashing) and over-estimation (low performance due to under-utilized GPUs). It is currently opt-in and can be enabled for models running on the Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in all other cases is unchanged and will continue to use the existing estimates.
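For example, to opt in for a single server session, set the variable in the server's environment before starting it:

OLLAMA_NEW_ESTIMATES=1 ollama serve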
Contents of the runner package:

common/
llamarunner/
ollamarunner/
README.md
runner.go
runner
Note: this is a work in progress
A minimal runner for loading a model and running inference via an HTTP web server.
./runner -model <model binary>
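For example, with a model file at a hypothetical path (Ollama model binaries are GGUF files):

./runner -model /path/to/model.gguf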
Completion
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
Embeddings
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding