Commit Graph

80 Commits

Author SHA1 Message Date
nicole pardal 3475d915cb
embeddings: modified batch size (#13429)
This PR detects embedding models and sets batch_size = context_size so the full input fits in a single batch.
Previously, if batch size was smaller than the input, tokens could be split across batches and cause a SIGTRAP crash.
This change ensures all tokens stay in one batch and prevents crashes.
Fixes: #12938 #13054

Co-authored-by: Jesse Gross <jesse@ollama.com>
2025-12-11 15:36:31 -08:00
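
A minimal Go sketch of this fix; isEmbedding, numCtx, and numBatch are illustrative names, not Ollama's actual loader fields:

  // For embedding models, size the batch to the full context window so a
  // prompt's tokens are never split across batches.
  func batchSizeFor(isEmbedding bool, numCtx, numBatch int) int {
      if isEmbedding {
          return numCtx
      }
      return numBatch
  }
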
nicole pardal e082d60a24
truncation: fixed runner truncation logic + removed server truncation (#12839)
This PR consolidates all embedding prompt-length checking, truncation, and prompt token counting into the runner to ensure a single source of truth.
2025-12-08 11:20:28 -08:00
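
A sketch of what a single source of truth means here, using a hypothetical runner helper (the real runner API differs; requires "fmt"):

  // prepareEmbeddingPrompt is the one place that counts prompt tokens
  // and truncates, instead of duplicating the check in the server.
  func prepareEmbeddingPrompt(tokens []int, numCtx int, truncate bool) ([]int, error) {
      if len(tokens) <= numCtx {
          return tokens, nil
      }
      if !truncate {
          return nil, fmt.Errorf("prompt length %d exceeds context window %d", len(tokens), numCtx)
      }
      return tokens[:numCtx], nil // keep the leading tokens that fit
  }
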
Daniel Hiltgen 18b5958d46
test: avoid ministral tools test on low vram (#13302)
Avoid hitting test timeouts
2025-12-02 13:18:55 -08:00
Daniel Hiltgen d771043e88
test: add ministral-3 (#13300) 2025-12-02 09:52:16 -08:00
Parth Sareen c114987523
logprob: add bytes to logprobs (#13068) 2025-11-13 13:49:25 -08:00
Baptiste Jamin 59241c5bee
server: add logprobs and top_logprobs support to Ollama's API (#12899)
Adds logprobs support to Ollama's API, including Ollama's
OpenAI-compatible API. By specifying the new 'logprobs' boolean parameter
in the API, Ollama will return the log probabilities for each generated token.
'top_logprobs', an integer value up to 20, can also be specified; when set,
the API also returns that many of the most likely tokens at each token
position.

Co-authored-by: Baptiste Jamin <baptiste@crisp.chat>
2025-11-11 08:49:50 -08:00
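
A hedged Go example of calling the new parameters; only the 'logprobs' and 'top_logprobs' fields come from the commit message, and the model name and prompt are placeholders:

  package main

  import (
      "bytes"
      "encoding/json"
      "fmt"
      "net/http"
  )

  func main() {
      body, err := json.Marshal(map[string]any{
          "model":        "llama3.2", // placeholder model name
          "prompt":       "Why is the sky blue?",
          "logprobs":     true, // return log probabilities per generated token
          "top_logprobs": 5,    // also return the 5 most likely tokens per position (max 20)
      })
      if err != nil {
          panic(err)
      }
      resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
      if err != nil {
          panic(err)
      }
      defer resp.Body.Close()
      fmt.Println(resp.Status)
  }
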
nicole pardal 7dd4862a89
embeddings: removed redundant TestAPIEmbeddings test (#12863)
This PR removes a redundant test from TestAPIEmbeddings.
Its contents already exist in embed_test.go and model_arch_test.go.
2025-10-30 17:12:33 -07:00
Patrick Devine 76eb7d0fff
testing: test more models with tool calling (#12867) 2025-10-30 13:19:21 -07:00
Daniel Hiltgen c88647104d
int: harden server lifecycle (#12835)
This should reduce zombie processes during integration runs.
2025-10-29 11:50:56 -07:00
Patrick Devine 05aff4a4f1
tests: fix embeddinggemma integration test (#12830) 2025-10-29 11:07:28 -07:00
Michael Yang 7d25b9e194
feat(model): add qwen3vl (#12665) 2025-10-28 17:39:47 -07:00
Patrick Devine 36d64fb531
embed: add distance correlation test for library embed models (#12796) 2025-10-28 16:57:27 -07:00
Patrick Devine 29f63f37c8
Revert "server: Consolidate embedding truncation in runner (#12730)" (#12810)
This reverts commit 5d347f6d6f.
2025-10-28 14:49:14 -07:00
nicole pardal 5d347f6d6f
server: Consolidate embedding truncation in runner (#12730)
Currently, checking the length of prompts for embeddings to ensure
they fit in the context window (and possible truncation) occurs in
two places - the Ollama server and runner. This can lead to
inconsistencies in both the checks and reported number of tokens
processed. Since we have to do this processing in the runner, this
consolidates all of the logic there.
2025-10-27 11:59:12 -07:00
Jeffrey Morgan 5fe7ba1b9b
runner: always truncate embeddings requests (#12714) 2025-10-20 16:47:05 -07:00
Daniel Hiltgen 68e04c7ff8
test: harden scheduler tests (#12662)
* test: harden scheduler tests

This removes reschedDelay, which was stale code, and adds a new
configurable timeout for waitForVRAMRecovery so tests can set a very
short timeout and avoid the scheduler getting stuck until the test
times out.

* test: tune tests for partial loads

Give stress tests more time when the model is split between CPU/GPU
2025-10-17 08:56:44 -07:00
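
A sketch of the configurable-timeout pattern the commit describes, with illustrative names (requires "time"):

  // Tests set vramRecoveryTimeout to a few milliseconds so a stuck
  // scheduler gives up quickly instead of hitting the test timeout.
  type scheduler struct {
      vramRecoveryTimeout time.Duration // production default, e.g. a few seconds
  }

  func (s *scheduler) waitForVRAMRecovery(recovered func() bool) bool {
      deadline := time.Now().Add(s.vramRecoveryTimeout)
      for time.Now().Before(deadline) {
          if recovered() {
              return true
          }
          time.Sleep(10 * time.Millisecond)
      }
      return false // give up; the caller proceeds rather than hanging
  }
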
Daniel Hiltgen b531777a66
test: add a few missing embedding models (#12661) 2025-10-16 09:36:25 -07:00
Daniel Hiltgen 4e5d862ec4
Integration test tuning (#12492)
Remove some flaky scenarios, and switch to chat for better reliability
2025-10-08 09:51:25 -07:00
Daniel Hiltgen c68f367ef6
Update GGML to b6646 (#12245)
Notable EOLs with this change:
- macOS v12 and v13 are no longer supported (v14+ required)
- AMD gfx900 and gfx906 are no longer supported
2025-10-02 14:47:10 -07:00
Daniel Hiltgen bc8909fb38
Use runners for GPU discovery (#12090)
This revamps how we discover GPUs in the system by leveraging the Ollama
runner.  This should eliminate inconsistency between our GPU discovery and the
runner's capabilities at runtime, particularly for cases where we try to filter
out unsupported GPUs.  Now the runner does that implicitly based on the actual
device list.  In some cases free VRAM reporting can be unreliable, which can
lead to scheduling mistakes, so this also includes a patch to leverage more
reliable VRAM reporting libraries if available.

Automatic workarounds have been removed as only one GPU leveraged this, which
is now documented. This GPU will soon fall off the support matrix with the next
ROCm bump.

Additional cleanup of the scheduler and discovery packages can be done in the
future once we have switched on the new memory management code, and removed
support for the llama runner.
2025-10-01 15:12:32 -07:00
Daniel Hiltgen c23e6f4cae
tests: add single threaded history test (#12295)
* tests: add single threaded history test

Also tidies up some existing tests to handle more model output variation

* test: add support for testing specific architectures
2025-09-22 11:23:14 -07:00
Michael Yang ceac416ec2
fix(integration): check truncated length (#12337) 2025-09-18 14:00:21 -07:00
Daniel Hiltgen 44a6792873
tests: tighten up a few flaky tests (#12271)
Sometimes the context test results are pure emojis.
The Thanksgiving prompt has too much variability, so swap it for a more straightforward one.
2025-09-12 13:59:34 -07:00
Parth Sareen 20b53eaa72
tests: add tool calling integration test (#12232) 2025-09-09 14:01:11 -07:00
Daniel Hiltgen 6745182885
tests: reduce stress on CPU to 2 models (#12161)
* tests: reduce stress on CPU to 2 models

This should avoid flakes due to systems getting overloaded with 3 (or more) models running concurrently

* tests: allow slow systems to pass on timeout

If a slow system is still streaming a response, and the response
will pass validation, don't fail just because the system is slow.

* test: unload embedding models more quickly
2025-09-09 09:32:15 -07:00
Jesse Gross e119783e66 llm: Clamp batch size to context size
The context must always be able to store the current batch, so
if the user requests a small context then we should also shrink
the batch to match. This also fixes the TestLongInputContext
test on the new engine. (The old engine already has this behavior.)
2025-09-08 20:40:11 -07:00
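
The clamp itself is a one-line guard; a sketch with hypothetical names (min is a Go 1.21+ builtin):

  func clampBatch(numBatch, numCtx int) int {
      return min(numBatch, numCtx) // the batch can never exceed the context
  }
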
Daniel Hiltgen 517807cdf2
perf: build graph for next batch async to keep GPU busy (#11863)
* perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to perform the main GPU
intensive tasks (Compute+Floats) in a go routine so we can prepare the next
batch in parallel to reduce the amount of time the GPU stalls waiting for the
next batch of work.

* tests: tune integration tests for ollama engine

This tunes the integration tests to focus more on models supported
by the new engine.
2025-08-29 14:20:28 -07:00
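
A sketch of the pipelining idea, assuming an illustrative Batch type: while the GPU computes batch N, the loop prepares batch N+1.

  type Batch struct{ /* inputs for one decode step */ }

  // runLoop overlaps CPU-side batch preparation with GPU compute by
  // running compute in a goroutine, gating only on the previous batch.
  func runLoop(batches <-chan *Batch, compute func(*Batch)) {
      done := make(chan struct{}, 1)
      done <- struct{}{} // let the first batch start immediately
      for b := range batches { // receiving/preparing the next batch overlaps compute
          <-done // wait for the previous batch's GPU work
          go func(b *Batch) {
              compute(b) // the GPU-intensive part (Compute+Floats in the commit)
              done <- struct{}{}
          }(b)
      }
      <-done // drain the final batch
  }
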
Daniel Hiltgen d6f7233a1c
test: improve scheduler/concurrency stress tests (#11906)
* test: improve scheduler/concurrency stress tests

The scheduler test used to use approximate memory figures and would often
over- or undershoot a system's capacity, leading to flaky test results.
This should improve the reliability of this scenario by leveraging
ps output to determine exactly how many models it takes to
trigger thrashing.

The concurrency test is also refined to target num_parallel + 1 and handle
timeouts better.

With these refinements, TestMultiModelConcurrency was redundant

* test: add parallel generate with history

TestGenerateWithHistory will help verify caching and context
are properly handled while making requests

* test: focus embed tests on embedding models

remove non-embedding models from the embedding tests
2025-08-15 14:37:54 -07:00
Daniel Hiltgen c385ca8672
test: add valid responses (#11902)
Some of the new models need a few more valid responses to pass.
2025-08-14 11:07:13 -07:00
Daniel Hiltgen a24f90604f
int: adjust a few models for integration tests (#11872) 2025-08-13 15:42:36 -07:00
Daniel Hiltgen 114c3f2265
tests: add integration coverage for oss-gpt (#11696)
Also wires up support to override the default "smol" model
2025-08-07 15:06:57 -07:00
Daniel Hiltgen f8a6e88819
Only load supported models on new engine (#11362)
* Only load supported models on new engine

Verify the model is supported before trying to load

* int: testcase for all library models
2025-07-11 12:21:54 -07:00
Daniel Hiltgen 4f473e224c
int: add performance integration tests (#11173)
usage example:
  go test --tags=integration,perf -count 1 ./integration -v -timeout 1h -run TestModelsPerf 2>&1 | tee int.log
  cat int.log | grep MODEL_PERF_HEADER | cut -f2- -d: > perf.csv
  cat int.log | grep MODEL_PERF_DATA | cut -f2- -d: >> perf.csv
2025-07-05 16:07:09 -07:00
Daniel Hiltgen f2527b08fb
int: add coverage for older models (#11137)
Verified these fail on 0.9.1 and pass on HEAD.
2025-06-19 12:10:19 -07:00
Daniel Hiltgen 2307fc2bcd
tests: drop llama3.2-vision embedding tests (#10837) 2025-05-24 13:17:53 -07:00
Daniel Hiltgen fdd4d479a3
integration: add qwen2.5-vl (#10815)
Replace the older llava model with qwen2.5 for vision tests
Skip split-batch test on small VRAM systems to avoid excessive test time
2025-05-22 09:12:32 -07:00
Daniel Hiltgen 424810450f
Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-05-06 11:20:48 -07:00
湛露先生 7e5c8eee5c
check and handle file close errors (#10554)
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-05-04 15:37:59 -07:00
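
The Go idiom behind this fix, as a sketch: Close can return an error (e.g. a deferred flush failure), so it should be checked rather than dropped (requires "os"):

  func writeFile(path string, data []byte) (err error) {
      f, err := os.Create(path)
      if err != nil {
          return err
      }
      defer func() {
          if cerr := f.Close(); cerr != nil && err == nil {
              err = cerr // surface the close error if nothing else failed
          }
      }()
      _, err = f.Write(data)
      return err
  }
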
Daniel Hiltgen 7bec2724a5
integration: fix embedding tests error handling (#10478)
The cleanup routine from InitServerConnection should run in the test case's defer to properly detect failures and report the server logs.
2025-04-29 11:57:54 -07:00
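
A sketch of the pattern, with a hypothetical startServer helper standing in for the real InitServerConnection (requires "testing", "context", "time"):

  func TestEmbeddings(t *testing.T) {
      ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
      defer cancel()

      cleanup := startServer(t, ctx) // hypothetical; returns the cleanup routine
      defer cleanup()                // runs even when the test fails, so server logs get reported

      // ... exercise the embeddings API ...
  }
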
Daniel Hiltgen ed4e139314
Integration test improvements (#9654)
Add some new test coverage for various model architectures,
and switch from orca-mini to the small llama model.
2025-04-16 14:25:55 -07:00
CYJiang e7019c9455
fix(integration): move waitgroup Add(1) outside goroutine to avoid potential issue (#10070)
Signed-off-by: googs1025 <googs1025@gmail.com>
2025-04-08 15:17:40 -07:00
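
The race in miniature, as a sketch: wg.Add must run before the goroutine starts, or wg.Wait can observe a zero counter and return early (requires "sync"):

  func runAll(jobs []func()) {
      var wg sync.WaitGroup
      for _, job := range jobs {
          wg.Add(1) // correct: counted before the goroutine is scheduled
          go func(job func()) {
              defer wg.Done()
              job() // calling wg.Add(1) in here instead would be the bug
          }(job)
      }
      wg.Wait()
  }
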
Bruce MacDonald 9876c9faa4
chore(all): replace instances of interface with any (#10067)
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
2025-04-02 09:44:27 -07:00
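
For illustration, the two spellings are interchangeable (requires "fmt"):

  func describeOld(v interface{}) string { return fmt.Sprintf("%T", v) } // pre-Go 1.18 spelling
  func describeNew(v any) string         { return fmt.Sprintf("%T", v) } // identical meaning
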
Jesse Gross 9679f40146 ml: Allow models to constrain inputs to a single batch
Models may require that a set of inputs all be processed as part
of the same batch. For example, if an image has multiple patches
with fully connected attention between them, we should not split
the batch in the middle of an image.

Fixes #9697
2025-03-14 15:38:54 -07:00
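
A sketch of the constraint with illustrative types: an input flagged as belonging with its predecessor must never start a new batch, even if the batch is already full.

  type Input struct {
      Token     int
      SameBatch bool // e.g. an image patch that must stay with the previous input
  }

  func splitBatches(inputs []Input, batchSize int) [][]Input {
      var batches [][]Input
      var cur []Input
      for _, in := range inputs {
          // Only break the batch at inputs that are free to start a new one.
          if len(cur) >= batchSize && !in.SameBatch {
              batches = append(batches, cur)
              cur = nil
          }
          cur = append(cur, in)
      }
      if len(cur) > 0 {
          batches = append(batches, cur)
      }
      return batches
  }
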
Stefan Weil abfdc4710f
all: fix typos in documentation, code, and comments (#7021) 2024-12-10 12:58:06 -08:00
Daniel Hiltgen f0a351810c
tests: fix max queue integration test (#7782)
This had fallen out of sync with the envconfig behavior, where max queue default was not zero.
2024-11-22 08:05:45 -08:00
Jesse Gross 7121dfa309 runner.go: Retry decoding after defragmentation if needed
Fragmentation of the KV cache can occur due to cache shifting or
different sequences getting processed. Decode uses a heuristic to
decide if it should defrag. However, this heuristic isn't 100%
accurate, so decoding can sometimes fail unexpectedly.

For these cases, if decode indicates that there is no KV cache space,
we should defrag and then try again.
2024-11-20 12:49:24 -08:00
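
The retry in sketch form; the decode, defrag, and error names are illustrative (requires "errors"):

  var errNoKVCacheSpace = errors.New("no kv cache space") // illustrative sentinel

  func decodeWithRetry(decode func() error, defrag func()) error {
      err := decode()
      if errors.Is(err, errNoKVCacheSpace) {
          defrag()       // reclaim fragmented KV cache slots
          err = decode() // then retry the same batch once
      }
      return err
  }
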
Daniel Hiltgen 8a9bb0d000
Add basic mllama integration tests (#7455) 2024-10-31 17:25:48 -07:00
Daniel Hiltgen 921779bb10
Give unicode test more time to run (#7437)
* Give unicode test more time to run

Some slower GPUs (or partial CPU/GPU loads) can take more than the default 30s to complete this test

* Give more time for concurrency test

CPU inference can be very slow under stress
2024-10-31 13:35:31 -07:00
Jesse Gross 078f666f73 tests: Add test for Unicode processing 2024-10-28 18:12:29 -07:00
Daniel Hiltgen dc6fe82051
integration: harden embedding test (#7306)
Use cosine similarity to make the embeddings tests more robust
2024-10-22 15:25:22 -07:00
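
The metric itself, for reference: two embeddings count as matching when their cosine similarity is near 1, which tolerates the small numeric drift that exact float comparison would not (requires "math"):

  func cosineSimilarity(a, b []float32) float64 {
      var dot, normA, normB float64
      for i := range a {
          dot += float64(a[i]) * float64(b[i])
          normA += float64(a[i]) * float64(a[i])
          normB += float64(b[i]) * float64(b[i])
      }
      return dot / (math.Sqrt(normA) * math.Sqrt(normB))
  }
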