runner

Note: this is a work in progress.

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>
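For example, assuming a local model file (the path below is illustrative):

./runner -model ./model.gguf

The server then listens on localhost:8080, which the examples below assume.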

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
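For programmatic use, here is a minimal Go sketch of the same request. It assumes the server is running locally on port 8080 as in the curl example, and it simply prints the raw response body, since the response schema is not documented here.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Same JSON body as the curl example above.
	body := bytes.NewBufferString(`{"prompt": "hi"}`)

	resp, err := http.Post("http://localhost:8080/completion", "application/json", body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Print the raw response; the exact schema may vary.
	out, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```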

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding
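A similar Go sketch for the embedding endpoint. The "embedding" field name in the response struct is an assumption for illustration; verify it against the JSON the server actually returns.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type embeddingResponse struct {
	// "embedding" is an assumed field name, used here only for illustration.
	Embedding []float32 `json:"embedding"`
}

func main() {
	// Same JSON body as the curl example above.
	body := bytes.NewBufferString(`{"prompt": "turn me into an embedding"}`)

	resp, err := http.Post("http://localhost:8080/embedding", "application/json", body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var er embeddingResponse
	if err := json.NewDecoder(resp.Body).Decode(&er); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("embedding has %d dimensions\n", len(er.Embedding))
}
```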