Void is an open source AI code editor and Cursor alternative that supports
Ollama. It's built on VS Code and allows users to connect directly to Ollama
for private LLM usage without going through a middleman backend.
Key features:
- Open source Cursor alternative
- Direct Ollama integration
- VS Code fork with full compatibility
- Agent mode and MCP support
- Works with any open source model
Fixes#12919
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
* build: optimize dockerfile context for iterating
This moves the copy of the source into the layer AFTER
doing software installs so we don't have to go through
the RPM install for cuda, etc. every time you touch a
source file.
* amd: implement linux sysfs based VRAM lookup
This adds a C++ implementation of sysfs DRM VRAM discovery
for more accurate free VRAM data on linux for AMD GPUs.
This change adds a basic benchmarking test framework for Ollama which can
be used to determine the prefill, eval, load duration, and total duration
for running a given model or models.
Many failed GPU discovery issues recently can be traced to incorrect override settings.
This extra logging should help quickly spot these and guide users to try unsetting them first.
* docs: vulkan information
* Revert "CI: Set up temporary opt-out Vulkan support (#12614)"
This reverts commit 8b6e5baee7.
* vulkan: temporary opt-in for Vulkan support
Revert this once we're ready to enable by default.
* win: add vulkan CI build
Some route endpoints return an empty response with a 200 OK. These should be documented in the OpenAPI doc. Note that the previous deletion response was not correct.
We currently assign model layers to GPUs according to free VRAM,
which assumes that GPU performance is roughly equal. This does not
work well for mixed dGPU and iGPU systems because iGPUs typically
use system memory which is large but their performance is slow.
This instead assigns layers to dGPUs first and then iGPUs.
In the future, this could be generalized to have a more fine grained
notion of GPU performance but dGPU vs. iGPU performance is the most
extreme.
Originally, llamaServer represented old memory estimates, which
could be used with either the old or new engine. ollamaServer was
used only for the new estimates and new engine. Since these
implementations did not map directly to engine, there was engine-
specific code in common code paths.
Now that new estimates are always used for the new engine, there is
a direct mapping between server type and engine. This separates out
most of the engine-specific code into the correct implementation
to make things easier to understand.
Currently for both the old and new engines, there is code to
calculate how much memory is required for a model and lay out
the layers onto GPUs. This reuses the new engine's lay out code
for the old engine as well, bringing them closer together. The
old engine continues to use its current method of estimating
required memory.
This reduces maintainence effort and improves consistency, as new
features only need to be implemented in one place. The newer code
is also more accurate, especially with multiple GPUs.
We used to control the way that llama.cpp saw devices using
CUDA_VISIBLE_DEVICES or similar. This would ensure that the layers
offloaded to a device were actually the ones intended. This is
particularly important because we might reorder devices based on
free memory or performance.
When we started explicitly scheduling layers, this logic went
away but the llamarunner didn't have any way to set the correct
order of devices. This meant that the correct number of layers
would be assigned to a device but not necessarily the layers
that were expected. This change sets up the devices correctly
based on the offload information.
Adds logprobs support to Ollama's API including support for Ollama's
OpenAI-compatible API. By specifying the new 'logprobs' boolean parameter
in the API, Ollama will return the log probabilities for each token generated.
'top_logprobs', an integer value can also be specified up to the value 20.
When specified, the API will also provide the number of most likely tokens to
return at each token position
Co-authored-by: Baptiste Jamin <baptiste@crisp.chat>
* doc: re-add login autostart faq
This appears to have been accidentally dropped during the doc migration.
* docs: GPU updates lost on the doc update
* review comments: improve windows login disable instructions