4884 Commits

Author SHA1 Message Date
Daniel Hiltgen bd6c1d6b49
flash attn: add auto mode for llama engine (#13052)
* flash attn: add auto mode for llama engine

If the user does not specify fa in the environment, use auto-mode.

* review comments

* ensure kv cache quantized types have FA explicitly enabled

additional review comments
2025-12-12 13:27:19 -08:00
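A minimal sketch of the tri-state idea behind #13052 above: when the flash attention setting is absent from the environment, the mode resolves to auto rather than off. The type, the function name, and the use of `OLLAMA_FLASH_ATTENTION` as the variable name are assumptions made for illustration; this is not the code from the commit.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// FlashAttnMode is a hypothetical tri-state used only in this sketch.
type FlashAttnMode int

const (
	FlashAttnAuto FlashAttnMode = iota // user expressed no preference
	FlashAttnOn
	FlashAttnOff
)

func (m FlashAttnMode) String() string {
	switch m {
	case FlashAttnOn:
		return "on"
	case FlashAttnOff:
		return "off"
	default:
		return "auto"
	}
}

// parseFlashAttn maps a raw environment value to a mode.
// set reports whether the variable was present at all.
func parseFlashAttn(raw string, set bool) FlashAttnMode {
	if !set || raw == "" {
		return FlashAttnAuto
	}
	on, err := strconv.ParseBool(raw)
	if err != nil {
		// Assumption of this sketch: an unparsable value falls back to auto.
		return FlashAttnAuto
	}
	if on {
		return FlashAttnOn
	}
	return FlashAttnOff
}

func main() {
	// The variable name is an assumption; the commit only says "fa in the environment".
	raw, set := os.LookupEnv("OLLAMA_FLASH_ATTENTION")
	fmt.Println("flash attention mode:", parseFlashAttn(raw, set))
}
```

Keeping "unset" distinct from "false" is what lets the engine choose per model in auto mode while still honoring an explicit opt-out.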
Jeffrey Morgan 3af5d3b738
model: force rope factor 1.0 for Gemma 3 (#13445) 2025-12-12 13:27:08 -08:00
Daniel Hiltgen 7730895158
Enable Ollama engine by default (#13443)
This changes the default behavior to use the Ollama engine for supported
models, while retaining the ability to disable the Ollama engine and
fall back to the Llama engine.  Models in the OllamaEngineRequired list
will always run on the Ollama engine.
2025-12-12 11:48:43 -08:00
Eva H de9ecfd01c
tidy up lint warnings on windows (#13430) 2025-12-12 11:43:35 -05:00
Eva H 95fdd8d619
fix: select and update models folder in settings (#13412) 2025-12-12 11:09:37 -05:00
Devon Rifkin 9f7822851c
docs: add docs for v1/responses and rework openai compat section (#13416)
* docs: add docs for v1/responses and rework openai compat section

I reworked the examples to be separated by topic and to be fully
runnable (i.e., they now log output instead of just suggesting how a
call might be made).

We now use `<CodeGroup>`s so that each example has a dropdown on the
docs site for users to choose, which makes the examples a lot more
digestible (since you only see approx 1/3 of the code you used to).

I also added a new tool to extract code examples into files so that it's
easier to actually run them and check that they work.

## Example

```shell
go run docs/tools/extract-examples/main.go docs/api/openai-compatibility.mdx
```

Output:

```
Extracting code examples to: /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368

  - 01_basic.py
  - 01_basic.js
  - 01_basic.sh
  - 02_responses.py
  - 02_responses.js
  - 02_responses.sh
  - 03_vision.py
  - 03_vision.js
  - 03_vision.sh

Extracted 9 file(s) to /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368

To run examples:

  cd /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368
  npm install   # for JS examples

then run individual files with `node file.js`, `python file.py`, `bash file.sh`
```

In the future we should consider actually running the examples in CI and
having some sort of acceptance test so we can automatically detect when
our examples break. So this is just a start in that direction.

* Update docs/api/openai-compatibility.mdx

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>

* Update docs/api/openai-compatibility.mdx

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>

---------

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
2025-12-11 17:39:40 -08:00
Parth Sareen 9b2035d194
openai: add tool call appending to previous assistant message (#13434)
* openai: add tool call appending to previous asst message

* add tests for thinking appending
2025-12-11 17:30:12 -08:00
Alexander Gusak 93d45d7a04
docs: fix link to modelfile.mdx (#13220) 2025-12-11 16:14:45 -08:00
JJ 709f842457
Update README.md (#13373)
Correct Markdown syntax for Swollama GitHub and DocC documentation links
2025-12-11 16:08:57 -08:00
Jeffrey Morgan 2dfb74410d
model: fix rotary embeddings for ministral 3 (#13432) 2025-12-11 16:02:05 -08:00
Devon Rifkin 1eb5e75972
openai: add v1/responses support (#13351)
Only supporting the stateless part of the API.

Doc updates to come once this is shipped.

Closes: #9659
2025-12-11 15:37:10 -08:00
nicole pardal 3475d915cb
embeddings: modified batch size (#13429)
This PR detects embedding models and sets batch_size = context_size so the full input fits in a single batch.
Previously, if batch size was smaller than the input, tokens could be split across batches and cause a SIGTRAP crash.
This change ensures all tokens stay in one batch and prevents crashes.
Fixes: #12938 #13054

Co-authored-by: Jesse Gross <jesse@ollama.com>
2025-12-11 15:36:31 -08:00
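The batch-size rule described in #13429 above can be illustrated with a short sketch. The function names and the example sizes are hypothetical; only the rule itself (batch size equals context size for embedding models, so the input is never split) comes from the commit message.

```go
package main

import "fmt"

// effectiveBatchSize returns the batch size to use when loading a model.
// For embedding models it is raised to the context size so that any input
// that fits in the context also fits in one batch.
func effectiveBatchSize(isEmbedding bool, batchSize, contextSize int) int {
	if isEmbedding {
		return contextSize
	}
	return batchSize
}

// batchesNeeded is the number of batches an input of the given token
// count is split into (ceiling division).
func batchesNeeded(tokens, batchSize int) int {
	return (tokens + batchSize - 1) / batchSize
}

func main() {
	const tokens, batch, ctx = 2000, 512, 8192 // illustrative values only

	before := batchesNeeded(tokens, batch)
	after := batchesNeeded(tokens, effectiveBatchSize(true, batch, ctx))
	fmt.Printf("batches before: %d, after: %d\n", before, after)
}
```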
Jeffrey Morgan 48e78e9be1
template: add yesterdayDate helper function (#13431) 2025-12-11 14:47:55 -08:00
Jeffrey Morgan a838421ea3
model: conversion and hyperparameter fixes for ministral and devstral (#13424) 2025-12-11 13:04:00 -08:00
EasonLin 1c4e85b4df
routes: add logprobs in tool calls (#13238) 2025-12-10 17:28:41 -08:00
Eloi Torrents dac4f17fea
cmd/bench: fix binary name in README (#13276) 2025-12-10 14:16:58 -08:00
Julia Scheaffer 56b8fb024c
cmd/bench: fix options table in cmd/bench/README.md (#13216) 2025-12-10 14:07:48 -08:00
Gabe Goodhart b95693056c
feat: llama.cpp bump (17f7f4) for SSM performance improvements (#13408)
* feat: Bump llama.cpp to the latest master (17f7f4b)

This brings in significant improvements to prefill performance for all
models using the SSM_CONV and SSM_SCAN ops (granite4, jamba, falcon-h,
nemotron-h, Qwen3 Next) on Apple Metal.

See https://github.com/ggml-org/llama.cpp/pull/17876

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Update patches 1-4

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update patches 5-12

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Update patches 13-18

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Update patch 20

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Update patches 21-31

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Sync vendored code

The two files I'm not sure about here are the swap from gemma3-iswa.cpp to
gemma3.cpp (I chose to include this because I think it's required), and the
inclusion of `ggml-zendnn.h` which I chose to omit.

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-12-10 12:59:27 -08:00
Eva H c34fc64688
app/ui: use requestAnimationFrame to prevent bottom line cutoff in streaming thinking display (#13137) 2025-12-10 15:29:48 -05:00
Eva H 7cf6f18c1f
app/ui: refactor to use Ollama endpoints for user auth and health checks (#13081) 2025-12-10 15:24:31 -05:00
Eva H bbbb6b2a01
app/ui: fix model capabilities not updating after download completion (#13179) 2025-12-10 14:40:02 -05:00
nicole pardal 76f88caf43
nomic-embed-text:v2: model implementation (#13162) 2025-12-09 14:24:51 -08:00
Parth Sareen 2bccf8c624
renderers/parsers: olmo3 instruct (#13383) 2025-12-09 11:12:27 -08:00
Parth Sareen 0c5e5f6630
parsers/renderers: olmo3 think (#13290) 2025-12-09 10:41:47 -08:00
Michael Yang d475d1f081 fix: qwen2.5vl metal argsort 2025-12-08 17:18:24 -08:00
Jeffrey Morgan d2f334c1f7
model: add rnj-1 inference support (#13354) 2025-12-08 16:49:17 -08:00
Michael Yang 603ceefaa6 refactor rope
change to a flatter directory structure and group the options with the
function

update models to call rope in one place
2025-12-08 14:42:22 -08:00
nicole pardal e082d60a24
truncation: fixed runner truncation logic + removed server truncation (#12839)
This PR consolidates all embedding prompt-length checking, truncation, and prompt token counting into the runner to ensure a single source of truth.
2025-12-08 11:20:28 -08:00
Daniel Hiltgen 5dae738067
CI: use vendor base commit in cache keys (#13348)
Prevent CGO from accidentally reusing old object files from the cache
across vendor updates
2025-12-08 09:48:49 -08:00
JJ 0c78723174
readme: fix broken Swollama link in community integrations (#13370) 2025-12-07 21:49:52 -08:00
Jeffrey Morgan 5a41d69b2a
fs/ggml: write int32 and int64 values to gguf files (#13335) 2025-12-07 21:49:14 -08:00
Daniel Hiltgen c146a138e3
ggml: handle all streams (#13350)
Follow up from #12992

Free all streams, and keep the alloc logic aligned across streams.
2025-12-05 16:10:33 -08:00
Sos Pogosyan 31b8c6a214
fix(api): correct Content-Type header for /api/chat and /api/generate when using cloud models (#13279)
---------

Co-authored-by: Pogosyan Sos <sos_pogosyan@MacBook-Pro-Sos.local>
Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-12-04 21:33:07 -08:00
Jesse Gross 9191dfaf05 llm: Enable flash attention for mistral3 by default 2025-12-04 15:19:06 -08:00
Jesse Gross 1108d8b34e ggml: Enable flash attention for vision encoders
Although the vision component of multimodal models typically already
calls the optimized nn.Attention, it is converted into non-fused
operations. That is because the backend-specific fused kernels may
have requirements, such as padding, that are normally satisfied by the
cache, and vision encoders don't use the cache.

This implements a fallback path in the backend, softening the
requirements into optimizations. In turn, this allows flash attention
to be used for vision encoders, saving a significant amount of VRAM
and improving performance.
2025-12-04 15:19:06 -08:00
Jesse Gross 7837a5bc7e ggml: Always set cache padding to 256
We currently use cache padding of 32 when not using flash attention
and 256 with flash attention, which is based on the historic alignment
requirements of these kernels. The restrictions have since been
loosened but there are still performance benefits, such as better
CUDA graph reuse.

Since the requirement is no longer kernel-specific, set the padding
uniformly to 256, as llama.cpp does.
2025-12-04 15:19:06 -08:00
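As a rough illustration of what the padding change above means for cache sizing, the sketch below rounds a requested length up to a multiple of the padding. `roundUp` and the example length are hypothetical; the values 32 and 256 are the ones named in the commit message.

```go
package main

import "fmt"

// roundUp returns the smallest multiple of `multiple` that is >= n.
// Both arguments are assumed to be positive.
func roundUp(n, multiple int) int {
	return (n + multiple - 1) / multiple * multiple
}

func main() {
	const requested = 1100 // illustrative cache length in tokens

	fmt.Println("padding 32: ", roundUp(requested, 32))
	fmt.Println("padding 256:", roundUp(requested, 256))
}
```

With a single padding value, the padded length no longer depends on whether flash attention is enabled.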
Patrick Devine 0a844f8e96
convert: add deepseek converter (#12980)
This change adds the ability for `ollama create` to convert models that use
the DeepSeek2 architecture (specifically DeepSeekV3 and DeepSeek-R1).
2025-12-04 13:49:30 -08:00
Eloi Torrents a03223b86f
cmd/bench: support writing benchmark output to file (#13263)
* cmd/bench: support writing benchmark output to file

This changes Ollama to allow the bench command to write benchmark
results to a user-specified output file instead of stdout when the
--output flag is provided.

---------

Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-12-04 13:22:41 -08:00
Daniel Hiltgen 0cf7794b16
ggml update to b7108 (#12992)
* Revert "vulkan: temporary cary of vulkan fixes (#12971)"

This reverts commit 3a9e8e9fd4.

* ggml update to b7087

* fix argsort on metal

* update to b7108

* fix bakllava regression

This model lacks the metadata for the projector type.

* update to b7209

* fix TopK perf

* only build arm code on arm
2025-12-03 19:43:29 -08:00
Jeffrey Morgan 854d40edc5
ci: restore previous linter rules (#13322) 2025-12-03 18:55:02 -08:00
Bruce MacDonald 84a2cedf18
app: relay thinking false to server (#13319)
This fixes a bug where disabling thinking on deepseek-v3.1 did not stop the model from thinking.

When thinking is not defined, it should not be sent to the server, since this causes error responses in some cases where the model does not support thinking. However, if it is defined as false, it should still be sent.
2025-12-03 15:06:55 -08:00
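The distinction in #13319 above between "not defined" and "defined as false" can be sketched in Go with a pointer field and `omitempty`. The struct and the `think` field name are assumptions for illustration, and the app code touched by the commit may be written differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chatRequest is a hypothetical request body for this sketch.
// Think is a pointer so that nil (undefined) and false stay distinct.
type chatRequest struct {
	Model string `json:"model"`
	Think *bool  `json:"think,omitempty"`
}

// encode marshals a request; a nil think value is left out of the JSON,
// while an explicit false is kept.
func encode(model string, think *bool) string {
	b, err := json.Marshal(chatRequest{Model: model, Think: think})
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	off := false
	fmt.Println(encode("example-model", nil))  // field omitted
	fmt.Println(encode("example-model", &off)) // field sent as false
}
```

A plain `bool` with `omitempty` would drop `false` as well, which is the kind of bug the commit message describes.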
Daniel Hiltgen 3f30836734
CUDA: filter devices on secondary discovery (#13317)
We now do a deeper probe of CUDA devices to verify the library version has
the correct compute capability coverage for the device.  Because ROCm also
interprets the CUDA env var to filter AMD devices, we try to avoid setting
it, since doing so leads to problems in mixed-vendor systems.  However, without
setting it for this deeper probe, each CUDA library subprocess discovers all
CUDA GPUs, and on systems with lots of GPUs this can lead to hitting timeouts.
The fix is to turn on the CUDA visibility env var just for this deeper probe use case.
2025-12-03 12:58:16 -08:00
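One way to picture the fix in #13317 above is as an environment built only for the probe subprocess. The sketch assumes the visibility variable is `CUDA_VISIBLE_DEVICES`; `probeEnv` is a hypothetical helper, not the function used in the commit.

```go
package main

import (
	"fmt"
	"strings"
)

// probeEnv returns a copy of base with the CUDA visibility variable set to
// a single device. Any existing value is dropped so the probe subprocess
// sees exactly one GPU; the parent's environment is left untouched.
func probeEnv(base []string, deviceID string) []string {
	const key = "CUDA_VISIBLE_DEVICES=" // assumed variable name
	env := make([]string, 0, len(base)+1)
	for _, kv := range base {
		if strings.HasPrefix(kv, key) {
			continue
		}
		env = append(env, kv)
	}
	return append(env, key+deviceID)
}

func main() {
	base := []string{"PATH=/usr/bin", "CUDA_VISIBLE_DEVICES=0,1,2,3"}
	fmt.Println(probeEnv(base, "2"))
}
```

The returned slice could be assigned to the `Env` field of an `exec.Cmd` for the probe, which scopes the setting to that one child process.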
Nathan Hook cc9555aff0
Update user message format for temperature query (#13256) 2025-12-02 15:08:39 -08:00
hello_world 20aee96706
Add Vulkan GPU support instructions in development.md (#13265)
Added Vulkan SDK installation instructions and environment variable setup for building with Vulkan support.
2025-12-02 13:37:32 -08:00
Daniel Hiltgen 18b5958d46
test: avoid ministral tools test on low vram (#13302)
Avoid hitting test timeouts
2025-12-02 13:18:55 -08:00
Jesse Gross 5317202c38 llm: Don't always evict models on CPU-only systems
Model eviction happens when we have at least one other model
loaded and are unable to load all layers into VRAM. However, on
CPU-only systems we can never load layers into VRAM, so this
constantly triggered eviction.

Fixes #13227
2025-12-02 10:58:08 -08:00
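The eviction condition described in the commit above can be sketched as a small predicate. The function and parameter names are hypothetical, and the sketch covers only the trigger named in the message (another model loaded, not all layers fitting in VRAM), with the CPU-only case excluded.

```go
package main

import "fmt"

// shouldEvictForVRAM reports whether another model should be evicted so the
// new one can fit more layers in VRAM. On a system with no GPUs no layer can
// ever be placed in VRAM, so this trigger is skipped there.
func shouldEvictForVRAM(otherModelsLoaded, gpuCount int, allLayersFit bool) bool {
	if gpuCount == 0 {
		return false // CPU-only: evicting would not help layers fit in VRAM
	}
	return otherModelsLoaded > 0 && !allLayersFit
}

func main() {
	fmt.Println("CPU-only, one model loaded:", shouldEvictForVRAM(1, 0, false))
	fmt.Println("one GPU, one model loaded: ", shouldEvictForVRAM(1, 1, false))
}
```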
Daniel Hiltgen d771043e88
test: add ministral-3 (#13300) 2025-12-02 09:52:16 -08:00
Daniel Hiltgen f8f1071818
CUDA: verify CC is supported by target library (#13298) 2025-12-02 09:28:41 -08:00
Patrick Devine d3e0a0dee4
model: ministral w/ llama4 scaling (#13292)
This change:

* fixes rope scaling in the mistral converter
* updates ministral to include llama4 scaling
* includes a new ministral parser for parsing reasoning and tool calling

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
2025-12-01 23:20:14 -08:00
Daniel Hiltgen 554172759c
win: warn if ggml-base detected in PATH (#13289)
If the user has somehow installed another GGML-based app which places a
ggml-base lib somewhere in their PATH, we can experience runtime problems
due to incompatibilities.  This change adds a warning message if we detect
a ggml-base outside of our install location to aid in troubleshooting.
2025-12-01 15:36:47 -08:00