Commit Graph

942 Commits

Author SHA1 Message Date
russcoss 05d53457af
refactor: use the built-in max/min to simplify the code (#12280)
Signed-off-by: russcoss <russcoss@outlook.com>
2025-09-16 17:14:21 -07:00
Devon Rifkin fa1c987a29
Merge pull request #12248 from ollama/drifkin/qwen3-coder-parsing
add qwen3-coder tool support
2025-09-16 10:21:43 -07:00
Michael Yang 3f6642f6fc
model: implement bert in ollama engine (#9080)
* fix truncate

* s/SentencePieceModel/SentencePiece/

* bert

* wordpiece

* refactor pooling

* more tokenizers

* normalize embeddings
2025-09-15 15:35:59 -07:00
Devon Rifkin 472feec2ff address comments 2025-09-15 11:46:25 -07:00
Devon Rifkin 47991940d4 add qwen3-coder tool support
The format qwen3-coder uses is relatively unique, both in rendering and
in parsing. To implement parsing, I wrote a custom parser in similar
style to harmony. For the rendering, I found that the logic would be
much more difficult to follow in a template, so I introduced the concept
of a built-in renderer that uses go code, rather than a template to
generate prompts.

I set us up for future built-in parsers and renderers by making it so
they can be specified in a Modelfile like so:

```
RENDERER "qwen3-coder"
PARSER "qwen3-coder"
```

These need to be provided explicitly because the architecture alone is
not enough to understand what format the model expects to receive, and
what format we expect it to output (e.g., qwen3-coder is `qwen3moe`,
which includes other qwen3-family models as well)

I haven't converted harmony to be one of these "built-ins" yet, since
some of it is in flux with the changes @ParthSareen has been making to
move harmony to the runner. It is likely that many other built-ins will
need to move to the runner as well, but I'm able to slightly defer that
decision since qwen3-coder doesn't have thinking (and therefore doesn't
need to be in the runner to make structured outputs work). I expect to
unify harmony with this approach very soon.

Whether a particular model supports tools or thinking was previously
inferred from templates, but without a template we now also use the
parser itself to declare what it supports. If we have future models that
re-use the same parsing format, but have different capabilities, we'll
want to parameterize them and give them different names to be specified
as a `PARSER`.

Misc changes:

- I worked on the renderer by diffing outputs from the reference
  implementation and ours. To make it easier to do this, I extended
  <https://github.com/ollama/ollama/pull/11875> to also support
  returning the prompt via the openai compat layer
2025-09-15 11:33:47 -07:00
jmorganca 92b96d54ef Revert "runner: move harmony to runner (#12052)"
This reverts commit 1a558f98e2.
2025-09-12 20:40:14 -03:00
jmorganca 9d56e63dbf Revert "runner: simplify parser entrypoints in runner (#12233)"
This reverts commit 8d6fffaead.
2025-09-12 20:40:14 -03:00
Michael Yang feb18cd710
feat: add dimensions field to embed requests (#12242)
* feat: add field to truncate embeddings

* add openai embeddings for dimensions
2025-09-11 10:36:10 -07:00
Parth Sareen 8d6fffaead
runner: simplify parser entrypoints in runner (#12233) 2025-09-10 11:24:42 -07:00
Parth Sareen 1a558f98e2
runner: move harmony to runner (#12052) 2025-09-08 15:07:59 -07:00
Michael Yang 1081532430
fix keep alive (#12041) 2025-08-27 11:51:25 -07:00
Jeffrey Morgan 4be4dc8717
server: skip parsing initial <think> if provided in the prompt (#12024) 2025-08-22 12:00:16 -07:00
Parth Sareen 7cce5aac76
harmony: move harmony parsing into a package (#12016) 2025-08-21 13:56:22 -07:00
Devon Rifkin 048bd4472a harmony: convert fn names to be valid ts identifiers
In <https://github.com/ollama/ollama/issues/11704#issuecomment-3177380197>
I noticed that hyphens in function names could possibly cause the model
to become confused. Later in that issue I found other explanations, but
at a minimum tool names with spaces in them are confusing to the model
because of the prompt format.

In this change I create a mapper that converts arbitrary tool names into
valid typescript identifiers. It's a little overly strict in that it
doesn't allow all unicode characters that might be valid in ts
identifiers, but it's still very permissive. Since mappings aren't
reversible, we must temporarily store this mapping in order to unmap it
if the model comes back with a call. We also handle the case where
multiple mappings collide into the same mapping and append a counter to
the end to make them unique
2025-08-18 14:05:16 -07:00
Devon Rifkin 8de1da4767 server: add debug option for printing out prompt instead of calling model 2025-08-15 13:52:50 -07:00
Jesse Gross d5a0d8d904 llm: New memory management
This changes the memory allocation strategy from upfront estimation to
tracking actual allocations done by the engine and reacting to that. The
goal is avoid issues caused by both under-estimation (crashing) and
over-estimation (low performance due to under-utilized GPUs).

It is currently opt-in and can be enabled for models running on the
Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
cases is unchanged and will continue to use the existing estimates.
2025-08-14 15:24:01 -07:00
Michael Yang 1a19df1f3a
update vendored llama.cpp and ggml (#11823)
* TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch

This will be redone once my branch is merged upstream in llama.cpp

* feat: Update all patches

There are a number that are no longer needed at all:

- 0003-embeddings: Embeddings entirely overhauled on master
- 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely
    overhauled on master
- 0019-metal-add-mean-kernel-14267: Merged upstream
- 0020-CUDA-add-mean-operation-14313: Merged upstream

* feat: Sync llama.cpp and ggml

* fix: Update rsync-filter for all moved/new/removed files

* fix: Add files missing from sync

* fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs

* fix: Add ggml files missing from sync

* fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files

* fix: Remove mtmd main cpp files

* fix: Add missing include in sampling_ext.cpp

* fix: Update llama.go to use mtmd instead of clip/llava

* fix: Add patch for mtmd_input_text

* chore: Ignore *.patched in the patch directory

* fix: Fix support for arch-specific ggml-cpu source files with new arrangement

In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific
implementations were split out into a nested tree structure under
ggml-cpu/arch. This conflicts with standard CGO layout where all
arch-specific source files are expected to live in the same directory as
the parent go module and use suffixes based on GOOS and GOARCH. As such,
there were really two options for getting this to work:

1. Add a patch on top of the GGML sync to rearrange the files to match the
GO layout convention
2. Use CGO directives to conditionally include the nested source files in
the compilation units

This commit does (2) in order to minimize the set of changes needed on top
of the upstream file layout. To get this to work, there are two key things
needed:

1. In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in
the preprocessor directives
2. In arch-impls.c|cpp, use an #ifdef | #elif defined | #endif chain to
explicitly include the .c|.cpp files for the given architecture from the
nested directory

* fix: Use mtmd_helper to correctly load the bitmap for the image

* fix: Apply patch for mtmd_text_input

* fix: Add missing stb to llama.cpp rsync-filter

* fix: Add sync'ed stb vendored header

* fix: Use c++17 and include vendor for go wrapper modules

* fix: Update patch 0015 for upstream implementation of uuid

* feat: Bump to the latest tip of the branch

* fix: Update patches for bump

* feat: Bump back to the cenral repo and point at the latest master

This includes granite 4 and a number of other model architectures!

* fix: Revert changes to ggml export GPU UUID patch

* fix: Add patch for GGML_VERSION and GGML_COMMIT constants

* feat: Sync all patched code

* build: Include cmake/common.cmake in ggml sync

* build: Add top-level include for GNUINstallDirs in CMakeLists.txt

This is used to populate CMAKE_INSTALL_BINDIR

* fix: Add a patch to avoid power throttling API on non-msvc windows builds

* fix: Sync patch changes for ggml-cpu.c

* feat: Bump llama.cpp to 4a4f42

This picks up support for Kimi K2 and PLaMO-2

* feat: Sync llama.cpp

* fix: Handle multi-chunk image encodings from mtmd

* fix: Re-number patches after merge with `main`

* feat: Bump to 41e78c in the makefile

* fix: Fix Solar and argsort/copy patches after bump

* fix: Remove Gemma3n CUDA Graphs patch

It was implemented upstream:
https://github.com/ggml-org/llama.cpp/pull/14741

* feat: Sync llama.cpp / ggml after latest bump

* build: Remove unnecessary CFLAGS definitions in cpu.go

* fix: Remove unnecessary additions in the rsync-filter

* fix: Remove unused vendored code for chat template parsing

* Revert "fix: Remove Gemma3n CUDA Graphs patch"

This reverts commit d724caced3.

* fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes

https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394

* fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n

* unwind mxfp4 patch

Prepare to bump ggml with their impl for mxfp4

* bump

* fix windows build error

* Convert tensors at load time

Repack the mxfp4 tensors as ggmls kernels expect them to be.

* convert mlp bf16 to f32

* buffer the conversion better

* reshape earlier

* openai swiglu

* add ids

* split qkv, gate_up

* fix nested alt tags

* fast attention

* remove debug messages

* fix lint

* remove redundant test

* remap values only if source/target are different

* add back i32->i32 copy

* refactor cpu quants

* clean up vendor

* update patch instructions

* clean up patches

* remove webgpu

* update mem

* also handle gpt-oss

* revert convert changes

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-08-14 14:42:58 -07:00
youzichuan bb71654ebe chore: fix some inconsistent function name in comment
Signed-off-by: youzichuan <youzichuan6@outlook.com>
2025-08-13 09:50:27 -07:00
Michael Yang d0cf6c8281
fix(openai): handle reasoning_effort (#11868) 2025-08-12 11:02:01 -07:00
Devon Rifkin dbfd7bd027
Merge pull request #11861 from ollama/drifkin/fix-parsing-error
server: fix error when parsing bad harmony tool calls
2025-08-11 14:59:57 -07:00
Devon Rifkin ee04dbba51 server: fix error when parsing bad harmony tool calls
Thanks @moll for reporting!

Fixes: #11781
2025-08-11 14:09:13 -07:00
Daniel Andersen ea7657b54a
sched: Add support for grouping GPUs (#10678)
This patch modifies Ollama to allow grouping GPUs to memory-fit to the requested model, instead of the former algorithm of using one GPU distributing over all available GPUs.

Benefits:
 - Lower amount of (PCIe-)bus communication between GPUs - especially when they are not very high speed
 - Allowing unallocated GPUs to get into power-saving mode.
 - Significantly reduce VRAM allocation when using more than 2 GPUs in a system
 - Due to the reduced memory allocation, you can run more models simultaneously.
2025-08-11 13:59:38 -07:00
Jesse Gross f2e9c9aff5 server: Reduce gpt-oss context length for small VRAM GPUs
gpt-oss works best with a context length of at least 8k. However,
for GPUs with limited amount of VRAM, there is a significant
performance hit to this increased context. In these cases, we
switch to the Ollama default of 4k
2025-08-07 14:23:55 -07:00
Devon Rifkin 30f8a68c4c tools: support anyOf types
afaik gpt-oss is the first model that meaningfully transforms tool
function definitions in its template. We found that relatively common
definitions that include `anyOf` were not working because the template
was assuming that types were always defined via a `type` field.

anyOf allows for fully recursive types, so I exposed a
`toTypeScriptType()` function to handle this recursive logic in go and
keep the templates cleaner. The gpt-oss templates will need to be
updated to use this.

We should keep building out our function definition support to more
fully support the parts of json schema that make sense for this use
case, but in the meantime this will unblock some users (e.g., zed's
ollama integration w/ gpt-oss). Probably the most urgent is proper array
support
2025-08-05 16:46:24 -07:00
Michael Yang fa7776fd24
gpt-oss (#11672)
* bf16

* tests

* gpt-oss

* enable gptoss for engine

* rough estimate

* convert to mxfp4

* handle safetensors U8

* clamp glu/linear

* update tokenizer

* MXFP4 support

This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.

* Unit tests for MXFP4 support

This exercises various operations and shapes on both CPU and GPU (if detected
on the system)

* cuda graph

* unit test adjustments

* cuda: optimize memory access

Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

cblas_sgemm is only supported on v13.3 and up, however bf16 is
only supported on v14+ so we were falling back to ggml-blas and
crashing on bf16 tensors.  Checking for the function being null
seems to be the simplest way to condittionally avoid registering the
backend.

* server: Minimum context length for gptoss

This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.

* ggml: Multiply by numParallel for gptoss sliding window

When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.

* gpt-oss integration

includes harmony parser and thinking levels, etc.

* fix sync

* fix tests

* fix lint

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
2025-08-05 12:21:16 -07:00
minxinyi 1e6eab5c33
server: use slices.Equal to simplify code (#11502) 2025-07-23 14:25:39 -07:00
Patrick Devine 3bac5cba60
Fix GetModelInfo (#11496)
---------

Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-07-22 13:40:47 -07:00
Daniel Hiltgen 20c3266e94
Reduce default parallelism to 1 (#11330)
The current scheduler algorithm of picking the paralellism based on available
VRAM complicates the upcoming dynamic layer memory allocation algorithm.  This
changes the default to 1, with the intent going forward that parallelism is
explicit and will no longer be dynamically determined.  Removal of the dynamic
logic will come in a follow up.
2025-07-08 12:08:37 -07:00
Daniel Hiltgen 34088dbcfb
API/CLI context enhancements (#11331)
* API: expose context size of loaded models

* CLI: add context UX

This adds a column in the ps output to show the models context size.
2025-07-08 11:59:06 -07:00
Michael Yang d0b32def60
skip quantizing per_layer_token_embd (#11207)
this tensor isn't compatible with cuda when quantized to q4_K so skip it
2025-06-26 21:49:35 -07:00
Devon Rifkin b2b270ad5d Merge branch 'main' into drifkin/array-head-count-simple 2025-06-23 10:37:31 -07:00
Michael Yang 0a066cfd91
Reapply "feat: incremental gguf parser (#10822)" (#11114) (#11119)
* Reapply "feat: incremental gguf parser (#10822)" (#11114)

This reverts commit a6e64fbdf2.

* fix older ggufs
2025-06-20 11:11:40 -07:00
Jeffrey Morgan a6e64fbdf2
Revert "feat: incremental gguf parser (#10822)" (#11114)
This reverts commit 6b04cad7e8.
2025-06-18 05:42:44 -07:00
曹家巧 60cfa2a203
cache: fix comment function name in cache.go (#11110) 2025-06-18 05:21:45 -07:00
Jeffrey Morgan 9f8a18ec05
tools: loosen tool parsing to allow for more formats (#11030) 2025-06-12 14:18:54 -07:00
Michael Yang 6b04cad7e8
feat: incremental gguf parser (#10822)
* incremental gguf parser
* gguf: update test to not rely on gguf on disc
* re-use existing create gguf
* read capabilities from gguf kv
* kv exists
* update tests
* s/doneFunc/successFunc/g
* new buffered reader

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-06-12 11:04:11 -07:00
Jeffrey Morgan 09d308d6b6
Revert "server: add model capabilities to the list endpoint (#10174)" (#11004)
This reverts commit 0943001193.
2025-06-06 23:29:14 -04:00
Devon Rifkin a3b6886b7d
move thinking logic into its own package (#10990)
move thinking logic into its own package
2025-06-06 12:02:20 -07:00
Devon Rifkin 0683efa637 export ThinkingParser 2025-06-05 10:22:32 -07:00
JasonHonKL 0943001193
server: add model capabilities to the list endpoint (#10174) 2025-06-04 11:39:48 -07:00
Devon Rifkin 5f57b0ef42
add thinking support to the api and cli (#10584)
- Both `/api/generate` and `/api/chat` now accept a `"think"`
  option that allows specifying whether thinking mode should be on or
  not
- Templates get passed this new option so, e.g., qwen3's template can
  put `/think` or `/no_think` in the system prompt depending on the
  value of the setting
- Models' thinking support is inferred by inspecting model templates.
  The prefix and suffix the parser uses to identify thinking support is
  also automatically inferred from templates
- Thinking control & parsing is opt-in via the API to prevent breaking
  existing API consumers. If the `"think"` option is not specified, the
  behavior is unchanged from previous versions of ollama
- Add parsing for thinking blocks in both streaming/non-streaming mode
  in both `/generate` and `/chat`
- Update the CLI to make use of these changes. Users can pass `--think`
  or `--think=false` to control thinking, or during an interactive
  session they can use the commands `/set think` or `/set nothink`
- A `--hidethinking` option has also been added to the CLI. This makes
  it easy to use thinking in scripting scenarios like
  `ollama run qwen3 --think --hidethinking "my question here"` where you
  just want to see the answer but still want the benefits of thinking
  models
2025-05-28 19:38:52 -07:00
Kyle Steere 9239a254e0
server: abort download on empty digest
Signed-off-by: Kyle Steere <kyle.steere@chainguard.dev>
2025-05-27 11:28:48 -07:00
frob eda472df1b
server: add hint to the error message when model path access fails (#10843) 2025-05-24 13:17:04 -07:00
Parth Sareen e8b981fa5d
tools: refactor tool call parsing and enable streaming (#10415) 2025-05-23 14:19:31 -07:00
Daniel Hiltgen d950ff12c0
sched: fix runner leak during reloading unload (#10819)
When the same model is being reloaded rapidly with client connections
being canceled before the model finishes loading, the queued unload
event could cause a leak of runners by deleting a different runner from
the loaded list.
2025-05-22 14:31:36 -07:00
Bruce MacDonald fbe6ae285a
server: improve tensor quantization fallback logic (#10806)
Fall back to alternative quantization types when a tensor's dimensions aren't divisible by the block size required for the original desired quantization type. If retried quantization types fail, the system ultimately falls back to F16 (half-precision floating point) which has a block size of 1 and can handle any tensor dimension.
2025-05-22 10:48:08 -07:00
Michael Yang 61aeaf7e81
remove support for multiple ggufs in a single file (#10722)
* remove support for multiple ggufs in a single file

this was an attempt to make it easier to import multimodal models into
ollama. this was rarely used and error prone so remove it

* fix: create fused model from blob
2025-05-21 13:55:31 -07:00
Daniel Hiltgen 1a0cfd080a
avoid kv truncation during create (#10761) 2025-05-19 13:54:54 -07:00
Jesse Gross 94ab428e3f ggml: Seperate tensor load from backend creation
Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Loading tensor data

This allows more flexibility in managing model loading.
2025-05-19 09:54:22 -07:00
Daniel Hiltgen ff80718e9c
fix crash in old clients with quantization progress (#10710)
Older clients assumed the digest was at least 19 characters long so increase the size
of the dummy digest to avoid array out of bounds crashes.
2025-05-14 14:54:18 -07:00
Michael Yang 23125648b8
chore: update mllama to use ollama engine (#10637) 2025-05-13 17:36:02 -07:00
Jeffrey Morgan c7f4ae7b9c
server: add webp image input support (#10653) 2025-05-12 20:41:42 -07:00
Daniel Hiltgen 9d6df90805
Follow up to #10363 (#10647)
The quantization PR didn't block all unsupported file types,
which this PR fixes.  It also updates the API docs to reflect
the now reduced set of supported types.
2025-05-12 15:23:31 -07:00
Bruce MacDonald ad035ad595
convert: quantize from safetensors needs kv (#10675)
When creating a quantized model from safetensors we
need the array KV values to be loaded.Changing this
value to -1 loads the KV values on the returned
layer to be used and saved during quantization.
2025-05-12 12:04:20 -07:00
Michael Yang f95a1f2bef
feat: add trace log level (#10650)
reduce prompt log to trace level
2025-05-12 11:43:00 -07:00
Michael Yang 0d6e35d3c6
fix: stream accumulator exits early (#10593)
the stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")` which isn't strictly correctly
since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.
2025-05-08 13:17:30 -07:00
Devon Rifkin 20c5fd39c8
Merge branch 'main' into drifkin/array-head-count-simple 2025-05-08 11:46:52 -07:00
Michael Yang 6e9a7a2568
lint: enable usetesting, disable tenv (#10594) 2025-05-08 11:42:14 -07:00
Daniel Hiltgen 5e380c3b42
sched: fix race leading to orphaned runners (#10599)
If a model is loading, and the request context is canceled during the load
by a client closing the connection, and another request is inbound for the
same model with a different configuration (context size, etc.) thus requiring
a reload, two unload events can be in flight.  The first shuts down the
original model load, but the second one caused the loss of the new
reloading runner reference, thus triggering the leak.

The primary fix is detecting the duplicate unload and ignoring the second
instance.  The load routine is also hardened to ensure we detect
clobbering an already present runner and unload it with a warning.
2025-05-07 09:38:17 -07:00
Jeffrey Morgan 392de84031
api: remove unused RetrieveModelResponse type (#10603) 2025-05-06 23:08:03 -07:00
Devon Rifkin 4090aca97b
server: send 405 instead of 404 for unallowed methods (#10275)
Fixes: #5483
2025-05-06 14:45:37 -07:00
Michael Yang 92ce438de0
server: remove internal cmd (#10595) 2025-05-06 13:05:01 -07:00
Daniel Hiltgen 424810450f
Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-05-06 11:20:48 -07:00
Jeffrey Morgan 1703d1472e
server: fix panic when runner.Options is nil (#10566) 2025-05-05 09:01:33 -07:00
Daniel Hiltgen 76ea735aaf
sched: logging improvements (#10550)
This enhances our logging in the scheduler.  The initial "waiting for server" log
no longer claims an initial error state (now "not responding" which better reflects
the actual state).  Runners now have slog wiring to report more details about the
runner, including PID.
2025-05-03 12:01:56 -07:00
frob e6d2d04121
image: add vision capability for projector-based models (#10509)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-05-01 16:50:20 -07:00
Devon Rifkin ad3c7c9bda
strip out thinking tags in message history for qwen3 & r1 (#10490)
* strip out thinking tags in message history for qwen3 & r1

This is in advance of "proper" support where we'll make reasoning
configurable and we'll parse out thinking/reasoning tags and provide
them to the caller. These models expect there to be no thinking tags in
the message history, so this should improve quality

* parse model names instead of hacky prefix check
2025-04-30 13:57:45 -07:00
Daniel Hiltgen 415c8fcc3d
Fix "Stopping..." scheduler hang (#10487)
* Adjust initial scheduler refCount

Ensure we only set the refCount on success

* sched: fix lock order inversion deadlock

Under certain race conditions, there was a scenario where the scheduler would
get into a deadlock while trying to update free space information while a model
was trying to unload.
2025-04-30 11:26:52 -07:00
Devon Rifkin fe5b9bb21b
lower default num parallel to 2
this is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral though, because even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger with 4k
2025-04-29 02:04:14 -07:00
Devon Rifkin dd93e1af85
Revert "increase default context length to 4096 (#10364)"
This reverts commit 424f648632.
2025-04-28 16:54:11 -07:00
Devon Rifkin d2ee599dcf load arrays with up to 1024 elements when estimating
This mirrors the old behavior before #10382
2025-04-27 13:45:13 -07:00
Michael Yang 340448d2d1 explicitly decode maxarraysize 1024 2025-04-25 16:59:01 -07:00
Michael Yang 214a7678ea fix superfluous call to WriteHeader
the first call to http.ResponseWriter.Write implicitly calls WriteHeader
with http.StatusOK if it hasn't already been called. once WriteHeader
has been called, subsequent calls has no effect. Write is called when
JSON encoding progressUpdateJSON{}. calls to
http.ResponseWriter.WriteHeader after the first encode is useless and
produces a warning:

http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)
2025-04-25 16:58:49 -07:00
Devon Rifkin 424f648632
increase default context length to 4096 (#10364)
* increase default context length to 4096

We lower the default numParallel from 4 to 2 and use these "savings" to
double the default context length from 2048 to 4096.

We're memory neutral in cases when we previously would've used
numParallel == 4, but we add the following mitigation to handle some
cases where we would have previously fallen back to 1x2048 due to low
VRAM: we decide between 2048 and 4096 using a runtime check, choosing
2048 if we're on a one GPU system with total VRAM of <= 4 GB. We
purposefully don't check the available VRAM because we don't want the
context window size to change unexpectedly based on the available VRAM.

We plan on making the default even larger, but this is a relatively
low-risk change we can make to quickly double it.

* fix tests

add an explicit context length so they don't get truncated. The code
that converts -1 from being a signal for doing a runtime check isn't
running as part of these tests.

* tweak small gpu message

* clarify context length default

also make it actually show up in `ollama serve --help`
2025-04-22 16:33:24 -07:00
Michael Yang 88738b357b create tempdir in models directory
the models directory should have plenty of storage and also ensure
there's no cross-device copy
2025-04-18 18:13:05 -07:00
Blake Mizerany 4e535e6188
server/internal/registry: make pull send errors with Error field (#10326)
Previously, the pull handler would send an error message in the Status
field, this prevented the client from using the message as a signal to
stop. In the case of the "run" command, it would follow the pull with a
"show" which would print a nearly identical "not found" message for
unresolved models.

Fixes #10307
2025-04-18 18:12:28 -07:00
Blake Mizerany 1d99451ad7
server/internal/client/ollama: handle some network errors gracefully (#10317) 2025-04-17 12:43:09 -07:00
Blake Mizerany 369de832cd
server/internal/registry: remove superfluous progress bar flush (#10303)
This removes the extra flushProgress() at the end of handlePull. It is
unnecessary because final progress updates are flushed in all cases of
the main select loop.
2025-04-16 14:43:07 -07:00
Blake Mizerany 3457a315b2
server/internal/client/ollama: cleanup use of multiple counters (#10304)
The completed and received counters must work in tandem and the code
should better reflect that. Previously, the act of updating them was 2-3
lines of code duplicated in multiple places. This consolidates them into
a single update closure for easy reading and maintenance.

This also simplifies error handling in places where we can use a return
parameter and defer to handle the error case for updates.

Also, remove the old Layer field from the trackingReader struct.
2025-04-16 14:33:40 -07:00
Daniel Hiltgen 56dc316a57
Give tests more time to run (#10306)
Fix flake failures on windows
2025-04-16 13:37:00 -07:00
Blake Mizerany 1e7f62cb42
cmd: add retry/backoff (#10069)
This commit adds retry/backoff to the registry client for pull requests.

Also, revert progress indication to match original client's until we can
"get it right."

Also, make WithTrace wrap existing traces instead of clobbering them.
This allows clients to compose traces.
2025-04-15 23:24:44 -07:00
Devon Rifkin 97fe45e36d server: add `OpenAI-Beta` header to CORS safelist
alphabetized the compat list and then added a single header

fixes: #9801
2025-04-14 15:36:10 -07:00
Tom Sheffler ef65174df2
types: include the 'items' and '$defs' fields to properly handle "array" types (#10091)
---------

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
2025-04-09 17:45:49 -07:00
Ire Gaddr 42ecb9f138
fix(scheduler): make model unload order deterministic (#10185) 2025-04-09 16:01:02 -07:00
Parth Sareen 6747099d71
types: add any type and validation for ToolFunction enum (#10166) 2025-04-08 15:05:38 -07:00
Alex Rozgo 2f723ac2d6
types: allow tool function parameters with a single type or an array of types (#9434) 2025-04-07 14:27:01 -07:00
Bruce MacDonald e53b3cbd0c
llm: set done reason at server level (#9830)
No functional change. Many different done reasons can be set at the runner
level, so rather than obsuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.
2025-04-03 10:19:24 -07:00
Bruce MacDonald 9876c9faa4
chore(all): replace instances of interface with any (#10067)
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
2025-04-02 09:44:27 -07:00
Bruce MacDonald e172f095ba
api: return model capabilities from the show endpoint (#10066)
With support for multimodal models becoming more varied and common it is important for clients to be able to easily see what capabilities a model has. Retuning these from the show endpoint will allow clients to easily see what a model can do.
2025-04-01 15:21:46 -07:00
Blake Mizerany ef27d52e79
server/internal/client/ollama: cache completed chunks (#9933)
This change adds tracking of download chunks during the pull process so
that subsequent pulls can skip downloading already completed chunks.
This works across restarts of ollama.

Currently, download state will be lost if a prune is triggered during a
pull (e.g. restart or remove). This issue should be addressed in a
follow-up PR.
2025-03-30 23:54:54 -07:00
CYJiang 0bd0454ea7
server: organize error types (#9465)
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-03-28 11:50:22 -07:00
Jesse Gross f66216e399 ggml: Support heterogeneous KV cache layer sizes in memory estimation
Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.

Llama3.2-vision is also inconsistent between self attention and cross
attention layers - at moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size
and take this account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.

Fixes #9730
Fixes #9890
2025-03-26 13:16:03 -07:00
Blake Mizerany ce929984a3
server/internal/client/ollama: fix file descriptor management in Pull (#9931)
Close chunked writers as soon as downloads complete, rather than
deferring closure until Pull exits. This prevents exhausting file
descriptors when pulling many layers.

Instead of unbounded defers, use a WaitGroup and background goroutine
to close each chunked writer as soon as its downloads finish.

Also rename 'total' to 'received' for clarity.
2025-03-21 16:16:38 -07:00
Blake Mizerany c794fef2f2
server/internal/client/ollama: persist through chunk download errors (#9923) 2025-03-21 13:03:43 -07:00
Patrick Devine f8c3dbe5b5
templates: add autotemplate for gemma3 (#9880)
This change allows the gemma3 template to be autodetected during `ollama
create`.
2025-03-20 00:15:30 -07:00
Blake Mizerany 2ddacd7516
server/internal/client/ollama: confirm all chunksums were received (#9893)
If the chunksums response is missing a chunk, the client should fail
the download. This changes the client to check that all bytes are
accounted for in the chunksums response.

It is possible there are overlaps or gaps in the chunksums response and
so the size is not the only thing left to check, but this provides
enough coverage for now. We may want to check that chunks are contiguous
later.
2025-03-19 14:59:57 -07:00
Blake Mizerany 8294676150
server/internal/client/ollama: set User-Agent for registry client (#9775)
This sets the agent header in DefaultRegistry to include the version of
the client, OS, and architecture in the previous format, with a minor
twist.

Note: The version is obtained from the build info, instead of the
version in version.Version, which should not longer be necessary, but we
can remove in a future commit. Using the build info is more accurate and
also provides extra build information if the build is not tagged, and if
it is "dirty". Previously, the version was just "0.0.0" with no other
helpful information. The ollama.com registry and others handle this
swimmingly.
2025-03-14 18:33:07 -07:00
Jesse Gross 7bf793a600 gemma3: Allow multiple image in a single input
Previously processing multiple images in a batch would trigger
segfaults so sending images together was disabled as a way to
mitigate this. The trigger was processing one image on the CPU
and one on the GPU.

This can no longer happen:
 - The vision encoder is now on the GPU so both images would be
   processed on the GPU.
 - We require images to be fully contained in a batch and each
   image including its special tokens is over half the batch size.
   As a result, we will never get two images in the same batch.

Fixes #9731
2025-03-14 15:38:54 -07:00
Blake Mizerany 4e320b8b90
server/internal/chunks: remove chunks package (#9755) 2025-03-14 08:57:59 -07:00
Blake Mizerany eb2b22b042
server/internal/client: use chunksums for concurrent blob verification (#9746)
Replace large-chunk blob downloads with parallel small-chunk
verification to solve timeout and performance issues. Registry users
experienced progressively slowing download speeds as large-chunk
transfers aged, often timing out completely.

The previous approach downloaded blobs in a few large chunks but
required a separate, single-threaded pass to read the entire blob back
from disk for verification after download completion.

This change uses the new chunksums API to fetch many smaller
chunk+digest pairs, allowing concurrent downloads and immediate
verification as each chunk arrives. Chunks are written directly to their
final positions, eliminating the entire separate verification pass.

The result is more reliable downloads that maintain speed throughout the
transfer process and significantly faster overall completion, especially
over unstable connections or with large blobs.
2025-03-13 22:18:29 -07:00