Commit Graph

4901 Commits

Parth Sareen d828517e78
docs: update readme and links (#12809) 2025-10-28 16:20:02 -07:00
Daniel Hiltgen 14977a9350
Fix vulkan PCI ID and ID handling (#12775)
* Fix vulkan PCI ID and ID handling

Intel GPUs may not report PCI IDs, which led to incorrect overlap
detection. Switch to using the existing PCI IDs; AMD GPUs claim not to
report PCI IDs but actually do, so try anyway, as this is required for ADLX to
find the GPUs on Windows. Numeric IDs lead to scheduling problems, so this also
switches Vulkan to UUID-based IDs. The GPU discovery patches have been
squashed into a single patch to simplify future rebases.

* review comments
2025-10-28 15:15:35 -07:00
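The ID-handling change above can be sketched roughly as follows. All type and function names here are hypothetical illustrations, not ollama's actual `ml.DeviceInfo` code: the idea is to prefer a stable UUID for scheduling IDs (numeric enumeration indices can change between runs) and use PCI IDs only for overlap detection when both devices report one.

```go
package main

import "fmt"

// deviceInfo is a hypothetical, simplified stand-in for a discovered GPU.
type deviceInfo struct {
	Name  string
	UUID  string // stable across enumeration order
	PCIID string // may be empty: some Intel GPUs don't report one
}

// deviceID returns a stable identifier for scheduling. Numeric enumeration
// indices are avoided because they can change between runs; a UUID does not.
func deviceID(d deviceInfo) string {
	if d.UUID != "" {
		return "GPU-" + d.UUID
	}
	// Fall back to the PCI ID, then the name, when no UUID is available.
	if d.PCIID != "" {
		return d.PCIID
	}
	return d.Name
}

// overlaps reports whether two enumerated devices are the same physical GPU,
// using PCI IDs only when both devices actually report one.
func overlaps(a, b deviceInfo) bool {
	if a.PCIID == "" || b.PCIID == "" {
		return false // can't prove overlap without PCI IDs
	}
	return a.PCIID == b.PCIID
}

func main() {
	vk := deviceInfo{Name: "Radeon", UUID: "abcd-1234", PCIID: "0000:03:00.0"}
	rocm := deviceInfo{Name: "Radeon", UUID: "abcd-1234", PCIID: "0000:03:00.0"}
	fmt.Println(deviceID(vk), overlaps(vk, rocm))
}
```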
Patrick Devine 29f63f37c8
Revert "server: Consolidate embedding truncation in runner (#12730)" (#12810)
This reverts commit 5d347f6d6f.
2025-10-28 14:49:14 -07:00
Parth Sareen 3d99d9779a
docs: add docs for docs.ollama.com (#12805) 2025-10-28 13:18:48 -07:00
Parth Sareen 6d02a43a75
docs: rename to mdx to setup docs site (#12804) 2025-10-28 13:04:31 -07:00
Parth Sareen 5483497d7a
Revert "docs: add reference to docs.ollama.com (#12800)" (#12803)
This reverts commit 934dd9e196.
2025-10-28 12:52:49 -07:00
Parth Sareen 934dd9e196
docs: add reference to docs.ollama.com (#12800) 2025-10-28 12:44:02 -07:00
Michael Yang 1188f408dd
s/From*Slice/From*s/ (#12255) 2025-10-28 12:08:49 -07:00
nicole pardal 15c7d30d9a
embedding tests: added check against exact base64 string (#12790) 2025-10-28 10:37:20 -07:00
Devon Rifkin 9862317174
Merge pull request #12793 from ollama/drifkin/12792_renderer-parser-from
create: inherit FROM model's renderer/parser
2025-10-28 00:15:46 -07:00
Michael Yang ec9eb28f4c
gemma3: make embedding non-causal (#12297) 2025-10-27 19:54:08 -07:00
Devon Rifkin 1bdd816910 create: inherit FROM model's renderer/parser
On main, the `RENDERER` and `PARSER` fields from the `Modelfile` don't
get propagated to a new model created with a `req.From` parameter. This
is easily triggered via `ollama run qwen3-coder`, then running some save
command like `/save qwen3-coder-custom`.

Added a regression test for this, and now open the config for the
"from" model in order to use its renderer/parser as a default for the
new model. This fixes both the CLI and API-based creates.

Fixes: https://github.com/ollama/ollama/issues/12792
2025-10-27 15:14:19 -07:00
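The inheritance rule described above can be sketched as a simple default-filling step. The `config` type and `inherit` function are hypothetical names for illustration, not ollama's actual create-path code:

```go
package main

import "fmt"

// config mimics the relevant fields of a model's configuration
// (hypothetical names, not the actual ollama types).
type config struct {
	Renderer string
	Parser   string
}

// inherit fills in renderer/parser from the FROM model's config when the
// new Modelfile doesn't set them explicitly, mirroring the fix above.
func inherit(req, from config) config {
	if req.Renderer == "" {
		req.Renderer = from.Renderer
	}
	if req.Parser == "" {
		req.Parser = from.Parser
	}
	return req
}

func main() {
	from := config{Renderer: "qwen3coder", Parser: "qwen3coder"}
	// A bare `/save` sets neither field, so both are inherited.
	fmt.Println(inherit(config{}, from))
}
```

Explicit `RENDERER`/`PARSER` lines in the new Modelfile still win, since only empty fields are filled in.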
nicole pardal 5d347f6d6f
server: Consolidate embedding truncation in runner (#12730)
Currently, checking the length of prompts for embeddings to ensure
they fit in the context window (and possible truncation) occurs in
two places - the Ollama server and runner. This can lead to
inconsistencies in both the checks and reported number of tokens
processed. Since we have to do this processing in the runner, this
consolidates all of the logic there.
2025-10-27 11:59:12 -07:00
Patrick Devine b97eb2b858
cloud: set the proxy content-type to the same as local models (#12759) 2025-10-25 10:57:10 -07:00
Jesse Gross ad6f6a1d29 llm: Change memory allocation backoff from exponential to incremental
If we create a memory layout that should fit based on reported free VRAM
but allocation still fails, we start applying a backoff. This reduces
free VRAM by an exponential percentage (1%, 2%, 4%...). However, the
points chosen tend to be too dense at the beginning and too sparse at
the end. Therefore, this switches to an incremental backoff (10%, 20%,
30%...).
2025-10-23 12:58:31 -07:00
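The two backoff schedules mentioned above can be compared with a small sketch (function names are illustrative, not ollama's actual code): the old scheme subtracts an exponentially growing percentage of free VRAM, the new one a fixed 10% step per attempt.

```go
package main

import "fmt"

// exponentialBackoff returns the fraction of reported free VRAM to
// subtract on attempt n (0-based) under the old scheme: 1%, 2%, 4%, ...
func exponentialBackoff(n int) float64 {
	return 0.01 * float64(int(1)<<n)
}

// incrementalBackoff is the new scheme: 10%, 20%, 30%, ...
func incrementalBackoff(n int) float64 {
	return 0.10 * float64(n+1)
}

func main() {
	// The exponential points cluster near zero early on, then jump in
	// large steps; the incremental points are evenly spaced.
	for n := 0; n < 5; n++ {
		fmt.Printf("attempt %d: exp %.0f%%, inc %.0f%%\n",
			n, 100*exponentialBackoff(n), 100*incrementalBackoff(n))
	}
}
```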
Vinh Nguyen 6723a40be6
readme: add VT Code project to terminal community integrations (#12749) 2025-10-23 12:29:50 -07:00
Daniel Hiltgen 3258a89b6e
DRY out the runner lifecycle code (#12540)
* DRY out the runner lifecycle code

Now that discovery uses the runners as well, this unifies the runner spawning code
into a single place. This also unifies GPU discovery types with the newer ml.DeviceInfo.

* win: make incremental builds better

Place build artifacts in discrete directories so incremental builds don't have to start fresh

* Adjust sort order to consider iGPUs

* handle cpu inference oom scenarios

* review comments
2025-10-23 11:20:02 -07:00
Jesse Gross 1c093e97af kvcache: Remove special case for reservation mask
We currently short circuit generation of the cache mask and just
generate an empty tensor of the correct size. However, in some
cases this can also skip a cast operation, which can result in the
worst-case graph not being fully worst case.

We don't actually need the fast path for mask generation, so it's
better to just use the normal code path.
2025-10-22 17:38:04 -07:00
Jesse Gross a8d9c2648e llamarunner: Record the time for all batches during prompt processing
Currently, we only record the time for the last batch when processing
the prompt. This results in unrealistically high numbers for the
old llama runner.

Before:
total duration:       31.273112939s
load duration:        4.97054657s
prompt eval count:    32768 token(s)
prompt eval duration: 235.137439ms
prompt eval rate:     139356.80 tokens/s
eval count:           1873 token(s)
eval duration:        18.173182374s
eval rate:            103.06 tokens/s

After:
total duration:       30.024798033s
load duration:        4.758588663s
prompt eval count:    32768 token(s)
prompt eval duration: 7.779621548s
prompt eval rate:     4212.03 tokens/s
eval count:           1769 token(s)
eval duration:        17.148014223s
eval rate:            103.16 tokens/s
2025-10-22 13:52:58 -07:00
frob 0334e67ffd
tools: parse tool calls that don't conform to {"name": name, "arguments": args} (#12738) 2025-10-22 11:34:27 -07:00
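Lenient tool-call parsing of the kind described in that commit might look like the sketch below. The specific key aliases accepted here are illustrative assumptions, not the actual set ollama handles:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseToolCall accepts tool calls that use alternate key names for the
// function name and arguments, instead of requiring exactly
// {"name": ..., "arguments": ...}. The alias lists are hypothetical.
func parseToolCall(raw []byte) (name string, args map[string]any, err error) {
	var obj map[string]json.RawMessage
	if err = json.Unmarshal(raw, &obj); err != nil {
		return "", nil, err
	}
	for _, k := range []string{"name", "function", "tool"} {
		if v, ok := obj[k]; ok && json.Unmarshal(v, &name) == nil {
			break
		}
	}
	for _, k := range []string{"arguments", "parameters", "input"} {
		if v, ok := obj[k]; ok && json.Unmarshal(v, &args) == nil {
			break
		}
	}
	return name, args, nil
}

func main() {
	// A non-conforming call still parses via the aliases.
	name, args, _ := parseToolCall([]byte(`{"tool": "get_weather", "input": {"city": "Paris"}}`))
	fmt.Println(name, args["city"])
}
```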
nicole pardal e0ead1adee
embeddings: base64 encoding fix (#12715) 2025-10-22 11:27:44 -07:00
Patrick Devine d515aed6c3
cloud: don't error sending empty messages (#12724) 2025-10-21 18:12:14 -07:00
Jeffrey Morgan 5fe7ba1b9b
runner: always truncate embeddings requests (#12714) 2025-10-20 16:47:05 -07:00
Michael Yang d2b63c19b3
fs(ggml): fill in arch prefix if necessary (#12646) 2025-10-20 16:42:18 -07:00
Jeffrey Morgan 94f110b35a
model/parsers: remove warning for missing <think> tag for qwen3-vl (#12713) 2025-10-20 16:03:43 -07:00
Daniel Hiltgen 5d22953ba7
cuda: get driver version after props (#12707)
Users on Windows without GPUs are reporting errors relating to
cudaDriverGetVersion with the device set to -1. This ensures we only query the
driver version once we're enumerating actual devices.
2025-10-20 10:57:27 -07:00
Daniel Hiltgen d245dffed8
rocm: give it more time to bootstrap (#12681)
Some users are hitting timeouts. We'd like to make this faster, but for now make sure we don't time out too aggressively.
2025-10-20 09:43:05 -07:00
Daniel Hiltgen bc1a818fdc
contiguous input per layer (#12686)
Co-authored-by: Michael Yang <git@mxy.ng>
2025-10-17 18:39:18 -07:00
Daniel Hiltgen ba2253dc30
win: more verbose load failures (#12683)
When loading the dynamic libraries, if something goes wrong, report some
details. Unfortunately this won't explain which dependencies are missing,
but this breadcrumb in the logs should help us diagnose GPU discovery
failures.
2025-10-17 17:13:16 -07:00
Daniel Hiltgen 68e04c7ff8
test: harden scheduler tests (#12662)
* test: harden scheduler tests

This removes reschedDelay, which was stale code, and adds
a new configurable timeout for waitForVRAMRecovery, so
tests can set the timeout to be very short to avoid the
scheduler getting stuck and hitting a test timeout.

* test: tune tests for partial loads

Give stress tests more time when the model is split between CPU/GPU
2025-10-17 08:56:44 -07:00
Daniel Hiltgen 270679932f
cuda: tidy up CC settings (#12668)
8.7 is Jetpack only, so no need on x86 builds
10.3 covers [G]B300
2025-10-16 16:39:30 -07:00
Jeffrey Morgan 65fb3ff49d
renderers: add global flag for setting [img] tags (#12669)
Adds a temporary global flag that causes renderers to always
render images as [img]. In a follow-up change, we will consider making this
the default, and this flag could eventually be removed
2025-10-16 16:37:32 -07:00
Grace e2a0b24435
Grace/qwen3 thinking (#12647)
* Change initial status to take prefill into consideration

* Add separate strings for content and thinking builder

* thinking tests

* remove white space from string before closing think tag
2025-10-16 15:29:41 -07:00
Daniel Hiltgen 1813ff85a0
cuda: bring back CC 5.2 (#12666)
Forward compat on the newer driver doesn't seem to be working.
This should get 5.2 working on newer drivers again.
2025-10-16 13:07:41 -07:00
Daniel Hiltgen b531777a66
test: add a few missing embedding models (#12661) 2025-10-16 09:36:25 -07:00
Daniel Hiltgen fe3ec8dbf0
Revert "Workaround broken NVIDIA iGPU free VRAM data (#12490)" (#12642)
The workaround has been moved into the underlying C++ code.

This reverts commit e4340667e3.
2025-10-16 09:09:48 -07:00
Thomas Stocker c744134287
vulkan: Get FilterID from Backend for Vulkan (#12655)
* vulkan: Get FilterID from Backend for Vulkan

* Fixing patch
2025-10-16 09:07:35 -07:00
weedge 4be41d2d45
readme: add achatbot-go to community integrations (#12629) 2025-10-15 21:54:15 -07:00
zhetaicheleba de670570c9
fs/ggml: fix function name in comment (#12630) 2025-10-15 21:53:38 -07:00
Devon Rifkin 201d93716e
Merge pull request #12651 from ollama/drifkin/oai-conversion
openai: make tool call conversion fns public
2025-10-15 21:10:30 -07:00
Devon Rifkin 160cecc8e2 openai: make tool call conversion fns public 2025-10-15 20:54:58 -07:00
Daniel Hiltgen 8b6e5baee7
CI: Set up temporary opt-out Vulkan support (#12614)
Initially Vulkan support in Ollama will require building from source.  Once it is
more thoroughly tested and we have fixed any critical bugs, we can
bundle Vulkan into the official binary releases.
2025-10-15 14:18:01 -07:00
Daniel Hiltgen 75d17fc6c2
perf: backport cuda iGPU sched spin (#12641) 2025-10-15 11:52:14 -07:00
Santosh Bhavani 8fafc8af77
ml/backend/ggml: NVML fallback for unified memory GPUs (#12619)
* Simplify NVML fallback for unified memory GPUs

Remove device-specific checks and environment variable dependency for
NVML_ERROR_NOT_SUPPORTED fallback. When NVML doesn't support memory
queries, unconditionally use /proc/meminfo instead of checking device
names or OLLAMA_UNIFIED_MEMORY environment variable.

This provides better memory reporting by using MemAvailable which
accounts for reclaimable memory, avoiding the underreporting issue
described in NVIDIA support article a_id/5728.

Tested on NVIDIA GB10 unified memory iGPU with consistent and accurate
memory reporting across multiple model load/unload cycles.

* Add NVML fallback patch for unified memory GPUs
2025-10-15 11:40:06 -07:00
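When NVML reports NVML_ERROR_NOT_SUPPORTED for memory queries, the fallback above reads MemAvailable from /proc/meminfo. A sketch of that parsing step, operating on meminfo-formatted text (the function name is hypothetical, not ollama's actual code):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// memAvailable parses the MemAvailable field from /proc/meminfo-formatted
// text and returns bytes. MemAvailable accounts for reclaimable memory,
// unlike MemFree, which underreports what can actually be allocated.
func memAvailable(meminfo string) (uint64, bool) {
	s := bufio.NewScanner(strings.NewReader(meminfo))
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, false
			}
			return kb * 1024, true // /proc/meminfo values are in kB
		}
	}
	return 0, false
}

func main() {
	sample := "MemTotal:       131072000 kB\n" +
		"MemFree:         1024000 kB\n" +
		"MemAvailable:   98304000 kB\n"
	b, ok := memAvailable(sample)
	fmt.Println(b, ok)
}
```

On a real system the input would come from reading /proc/meminfo directly; using a string here keeps the sketch deterministic.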
Jesse Gross c3c85aa06c llm: Enable flash attention by default for gemma3 2025-10-15 10:42:12 -07:00
Jeffrey Morgan 0d713051a2
envconfig: default to port 443 when connecting to ollama.com (#12617) 2025-10-14 23:38:24 -07:00
Parth Sareen c4c5a4a01e
types: send index for tool calls (#12625) 2025-10-14 19:35:15 -07:00
Jesse Gross 3dcfd5f69e llm: Perform eviction when num_gpu is set with new estimates
Currently, if you set num_gpu, the model is forced to
load with that number of layers in the current configuration.
This is done regardless of any other information, which means
that no eviction is performed even if another model is loaded.

This behavior is different from the old estimates (and still
happens for models that run on the llama engine). In those
cases, models would be evicted if needed to load at the requested
number of layers. That behavior is more useful and less surprising,
so this changes the new estimates to match.

Fixes #12580
2025-10-14 17:46:36 -07:00
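The restored eviction behavior can be sketched as a fit-with-eviction loop. The types, the oldest-first victim choice, and the function names are all illustrative assumptions, not ollama's actual scheduler API:

```go
package main

import "fmt"

type model struct {
	name string
	vram uint64
}

// placeWithEviction tries to fit a model requesting `need` bytes of VRAM.
// If it doesn't fit, already-loaded models are evicted (oldest first in
// this sketch) until it does, matching the behavior described above for
// requests that pin the layer count with num_gpu.
func placeWithEviction(loaded []model, free, need uint64) (evicted []string, ok bool) {
	for free < need && len(loaded) > 0 {
		victim := loaded[0]
		loaded = loaded[1:]
		free += victim.vram
		evicted = append(evicted, victim.name)
	}
	return evicted, free >= need
}

func main() {
	loaded := []model{{"llama3", 6 << 30}} // 6 GiB already resident
	evicted, ok := placeWithEviction(loaded, 2<<30, 7<<30)
	fmt.Println(evicted, ok) // the resident model is evicted to make room
}
```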
Devon Rifkin 53a969d509
Merge pull request #12621 from ollama/drifkin/any-of
qwen3-coder: support anyOf when parsing tool calls
2025-10-14 15:51:24 -07:00
Devon Rifkin 08fbb60bb2 qwen3-coder: support anyOf when parsing tool calls 2025-10-14 15:33:05 -07:00