* flash attn: add auto mode for llama engine
If the user does not specify flash attention (fa) in the environment, use auto mode.
* review comments
* ensure kv cache quantized types have FA explicitly enabled
additional review comments
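For illustration, a minimal Go sketch of the resolution behavior described above; the OLLAMA_FLASH_ATTENTION variable handling and the quantized-KV check shown here are assumptions, not the actual engine code:
```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// flashAttentionMode resolves the flash attention setting for the llama engine.
// If the user did not set the (assumed) OLLAMA_FLASH_ATTENTION variable, fall
// back to "auto" and let the engine decide per model. Quantized KV cache types
// are assumed to require flash attention to be explicitly enabled.
func flashAttentionMode(kvCacheType string) string {
	val, ok := os.LookupEnv("OLLAMA_FLASH_ATTENTION")
	if !ok || val == "" {
		// Quantized KV caches (e.g. q8_0, q4_0) only work with flash attention,
		// so auto mode must not silently disable it for them.
		if strings.HasPrefix(kvCacheType, "q") {
			return "on"
		}
		return "auto"
	}
	if val == "1" || strings.EqualFold(val, "true") {
		return "on"
	}
	return "off"
}

func main() {
	fmt.Println(flashAttentionMode("f16"))  // "auto" when unset
	fmt.Println(flashAttentionMode("q8_0")) // "on" for quantized KV cache
}
```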
This changes the default behavior to use the Ollama engine for supported
models, while retaining the ability to disable the Ollama engine and
fall back to the Llama engine. Models in the OllamaEngineRequired list
will always run on the Ollama engine.
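A rough sketch of that selection logic; the OllamaEngineRequired entries and function names here are illustrative, not the real implementation:
```go
package main

import "fmt"

// ollamaEngineRequired stands in for the OllamaEngineRequired list mentioned
// above; these entries are examples, not the actual list.
var ollamaEngineRequired = map[string]bool{
	"gemma3":     true,
	"qwen3-next": true,
}

// selectEngine picks the engine for a model. Supported models default to the
// Ollama engine; users may opt back into the llama engine unless the model
// requires the Ollama engine.
func selectEngine(arch string, supported, disableOllamaEngine bool) string {
	if ollamaEngineRequired[arch] {
		return "ollama"
	}
	if supported && !disableOllamaEngine {
		return "ollama"
	}
	return "llama"
}

func main() {
	fmt.Println(selectEngine("llama", true, false)) // ollama (new default)
	fmt.Println(selectEngine("llama", true, true))  // llama (user opted out)
	fmt.Println(selectEngine("gemma3", true, true)) // ollama (always required)
}
```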
* docs: add docs for v1/responses and rework openai compat section
I reworked the examples to be separated by topic and to be fully
runnable (i.e., they now log output instead of just suggesting how a
call might be made).
We now use `<CodeGroup>`s so that each example has a dropdown on the
docs site for users to choose, which makes the examples a lot more
digestible (since you only see approx 1/3 of the code you used to).
I also added a new tool to extract code examples into files so that it's
easier to actually run them and check that they work.
## Example
```shell
go run docs/tools/extract-examples/main.go docs/api/openai-compatibility.mdx
```
Output:
```
Extracting code examples to: /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368
- 01_basic.py
- 01_basic.js
- 01_basic.sh
- 02_responses.py
- 02_responses.js
- 02_responses.sh
- 03_vision.py
- 03_vision.js
- 03_vision.sh
Extracted 9 file(s) to /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368
To run examples:
cd /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368
npm install # for JS examples
then run individual files with `node file.js`, `python file.py`, `bash file.sh`
```
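For reference, a stripped-down sketch of the extraction idea (not the actual docs/tools/extract-examples tool), which just finds fenced code blocks in an MDX file:
```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// A toy version of the extraction approach: find fenced code blocks in an MDX
// file and report their language tags and sizes. The real tool also names and
// writes one file per <CodeGroup> example.
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: extract <file.mdx>")
		os.Exit(1)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Match ```lang ... ``` blocks; (?s) lets . span newlines.
	fence := regexp.MustCompile("(?s)```([a-zA-Z]+)\n(.*?)```")
	for i, m := range fence.FindAllSubmatch(data, -1) {
		fmt.Printf("block %d: language=%s, %d bytes\n", i+1, m[1], len(m[2]))
	}
}
```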
In the future we should consider actually running the examples in CI as a form
of acceptance test, so we can automatically detect when our examples break;
this is just a start in that direction.
* Update docs/api/openai-compatibility.mdx
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
* Update docs/api/openai-compatibility.mdx
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
---------
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
This PR detects embedding models and sets batch_size = context_size so the full input fits in a single batch.
Previously, if batch size was smaller than the input, tokens could be split across batches and cause a SIGTRAP crash.
This change ensures all tokens stay in one batch and prevents crashes.
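A minimal sketch of the adjustment, with illustrative names rather than the actual Ollama structs:
```go
package main

import "fmt"

// modelOptions holds the two sizes relevant here; the real structs in Ollama
// differ, this just illustrates the adjustment described above.
type modelOptions struct {
	NumCtx   int // context size
	NumBatch int // batch size
}

// adjustForEmbedding forces the batch size up to the context size for
// embedding models so the whole input is processed in a single batch and
// tokens are never split across batches.
func adjustForEmbedding(opts *modelOptions, isEmbedding bool) {
	if isEmbedding && opts.NumBatch < opts.NumCtx {
		opts.NumBatch = opts.NumCtx
	}
}

func main() {
	opts := modelOptions{NumCtx: 8192, NumBatch: 512}
	adjustForEmbedding(&opts, true)
	fmt.Println(opts.NumBatch) // 8192
}
```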
Fixes: #12938 #13054
Co-authored-by: Jesse Gross <jesse@ollama.com>
* feat: Bump llama.cpp to the latest master (17f7f4b)
This brings in significant improvements to prefill performance for all
models using the SSM_CONV and SSM_SCAN ops (granite4, jamba, falcon-h,
nemotron-h, Qwen3 Next) on Apple Metal.
See https://github.com/ggml-org/llama.cpp/pull/17876
Branch: LlamaCPPMetalSSMImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Update patches 1-4
Branch: LlamaCPPMetalSSMImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Update patches 5-12
Branch: LlamaCPPMetalSSMImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Update patches 13-18
Branch: LlamaCPPMetalSSMImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Update patch 20
Branch: LlamaCPPMetalSSMImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Update patches 21-31
Branch: LlamaCPPMetalSSMImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Sync vendored code
The two changes I'm not sure about here are the swap from gemma3-iswa.cpp to
gemma3.cpp (which I chose to include because I think it's required) and the
inclusion of `ggml-zendnn.h` (which I chose to omit).
Branch: LlamaCPPMetalSSMImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Although the vision component of multimodal models typically already
calls the optimized nn.Attention, it is converted into non-fused
operations. That is because the backend-specific fused kernels may
have requirements, such as padding, which are handled by the cache,
and vision encoders don't use the cache.
This implements a fallback path in the backend, softening the
requirements into optimizations. In turn, this allows flash attention
to be used for vision encoders, saving a significant amount of VRAM
and improving performance.
We currently use cache padding of 32 when not using flash attention
and 256 with flash attention, which is based on the historic alignment
requirements of these kernels. The restrictions have since been
loosened but there are still performance benefits, such as better
CUDA graph reuse.
Since the requirement is no longer kernel-specific, set the padding
uniformly to 256, as llama.cpp does.
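Conceptually the padding logic collapses from a conditional to a constant, along these lines (illustrative only, not the actual backend code):
```go
package main

import "fmt"

const kvCachePadding = 256 // uniform padding, matching llama.cpp

// padCacheSize rounds a requested KV cache length up to the padding multiple.
// Previously this was 32 without flash attention and 256 with it; the uniform
// 256 keeps the performance benefits (e.g. better CUDA graph reuse).
func padCacheSize(n int) int {
	return ((n + kvCachePadding - 1) / kvCachePadding) * kvCachePadding
}

func main() {
	fmt.Println(padCacheSize(1000)) // 1024
	fmt.Println(padCacheSize(4096)) // 4096
}
```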
* cmd/bench: support writing benchmark output to file
This changes Ollama to allow the bench command to write benchmark
results to a user-specified output file instead of stdout when the
--output flag is provided.
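A minimal sketch of the flag handling (not the actual cmd/bench code), writing to the file when --output is set and to stdout otherwise:
```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

func main() {
	output := flag.String("output", "", "write benchmark results to this file instead of stdout")
	flag.Parse()

	var w io.Writer = os.Stdout
	if *output != "" {
		f, err := os.Create(*output)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		defer f.Close()
		w = f
	}

	// Stand-in for real benchmark results.
	fmt.Fprintln(w, "model,tokens_per_second")
	fmt.Fprintln(w, "llama3.2,142.7")
}
```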
---------
Co-authored-by: Patrick Devine <patrick@infrahq.com>
* Revert "vulkan: temporary cary of vulkan fixes (#12971)"
This reverts commit 3a9e8e9fd4.
* ggml update to b7087
* fix argsort on metal
* update to b7108
* fix bakllava regression
This model lacks the metadata for the projector type.
* update to b7209
* fix TopK perf
* only build arm code on arm
This fixes a bug where disabling thinking on deepseek-v3.1 did not stop the model from thinking.
When thinking is not defined, it should not be sent to the server, since this can cause error responses in cases where the model does not support thinking. However, if it is explicitly defined as false, it should still be sent.
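One common Go pattern for distinguishing "unset" from "explicitly false" is a pointer field with omitempty; a sketch, not the exact Ollama API types:
```go
package main

import (
	"encoding/json"
	"fmt"
)

// chatRequest illustrates the distinction: a nil *bool is omitted from the
// JSON entirely, while an explicit false is still serialized and sent.
type chatRequest struct {
	Model string `json:"model"`
	Think *bool  `json:"think,omitempty"`
}

func main() {
	f := false

	unset, _ := json.Marshal(chatRequest{Model: "deepseek-v3.1"})
	explicit, _ := json.Marshal(chatRequest{Model: "deepseek-v3.1", Think: &f})

	fmt.Println(string(unset))    // {"model":"deepseek-v3.1"}
	fmt.Println(string(explicit)) // {"model":"deepseek-v3.1","think":false}
}
```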
We now do a deeper probe of CUDA devices to verify that the library version has
the correct compute capability coverage for the device. Because ROCm also
interprets the CUDA env var to filter AMD devices, we try to avoid setting it,
since doing so leads to problems in mixed-vendor systems. However, without
setting it for this deeper probe, each CUDA library subprocess discovers all
CUDA GPUs, and on systems with lots of GPUs this can lead to hitting timeouts.
The fix is to turn on the CUDA visibility env var just for this deeper probe use case.
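Scoping the visibility variable to just the probe subprocess could look roughly like this; the probe command here is a stand-in, not the real discovery code:
```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runDeepProbe launches a per-device probe subprocess with
// CUDA_VISIBLE_DEVICES restricted to one device. The variable is set only on
// the child's environment, so it never leaks into the parent process where it
// could also filter AMD devices via ROCm.
func runDeepProbe(deviceID string) error {
	// Stand-in for the real probe binary; just echoes what the child sees.
	cmd := exec.Command("sh", "-c", "echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES")
	cmd.Env = append(os.Environ(), "CUDA_VISIBLE_DEVICES="+deviceID)
	out, err := cmd.CombinedOutput()
	fmt.Printf("device %s: %s", deviceID, out)
	return err
}

func main() {
	_ = runDeepProbe("0")
}
```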
Model eviction happens when we have at least one other model
loaded and are unable to load all layers into VRAM. However, on
CPU-only systems we can never load layers into VRAM, so this
constantly triggered eviction.
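The guard amounts to something like this (illustrative logic, not the scheduler's actual code):
```go
package main

import "fmt"

// shouldEvict decides whether evicting another loaded model could help fit
// more layers into VRAM. On CPU-only systems there is no VRAM to reclaim, so
// eviction can never help and should not be triggered.
func shouldEvict(loadedModels int, gpuCount int, fitsInVRAM bool) bool {
	if gpuCount == 0 {
		return false // CPU-only: eviction cannot free VRAM
	}
	return loadedModels > 0 && !fitsInVRAM
}

func main() {
	fmt.Println(shouldEvict(1, 0, false)) // false on CPU-only systems
	fmt.Println(shouldEvict(1, 2, false)) // true when VRAM is the constraint
}
```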
Fixes #13227
This change:
* fixes rope scaling in the mistral converter
* updates ministral to include llama4 scaling
* includes a new ministral parser for parsing reasoning and tool calling
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
If the user has somehow installed another GGML based app which places a
ggml-base lib somewhere in their PATH, we can experience runtime problems
due to incompatibilities. This change adds a warning message if we detect
a ggml-base outside of our install location to aid in troubleshooting.
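A simplified sketch of that detection, with assumed names and paths; the real check handles platform-specific library filenames:
```go
package main

import (
	"log/slog"
	"os"
	"path/filepath"
	"strings"
)

// warnForeignGGML walks the directories in PATH and logs a warning if a
// ggml-base library is found outside our own install location, since mixing
// GGML builds from different apps can cause runtime incompatibilities.
func warnForeignGGML(installDir string) {
	for _, dir := range filepath.SplitList(os.Getenv("PATH")) {
		matches, _ := filepath.Glob(filepath.Join(dir, "ggml-base*"))
		for _, m := range matches {
			if !strings.HasPrefix(m, installDir) {
				slog.Warn("found ggml-base outside the Ollama install; this may cause runtime problems", "path", m)
			}
		}
	}
}

func main() {
	warnForeignGGML("/usr/local/lib/ollama") // assumed install location
}
```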