Commit Graph

4873 Commits

nicole pardal 3475d915cb
embeddings: modified batch size (#13429)
This PR detects embedding models and sets batch_size = context_size so the full input fits in a single batch (sketched below).
Previously, if the batch size was smaller than the input, tokens could be split across batches and cause a SIGTRAP crash.
This change ensures all tokens stay in one batch and prevents those crashes.
Fixes: #12938 #13054

Co-authored-by: Jesse Gross <jesse@ollama.com>
2025-12-11 15:36:31 -08:00
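
The guard described in #13429 above amounts to a one-line adjustment. A minimal Go sketch, using hypothetical names (Options, NumCtx, NumBatch, adjustForEmbedding) rather than the actual Ollama identifiers:

```go
package main

import "fmt"

// Options mirrors the relevant runner settings; the names are hypothetical.
type Options struct {
	NumCtx   int // context window size, in tokens
	NumBatch int // maximum tokens processed per batch
}

// adjustForEmbedding grows the batch size to the context size for embedding
// models, so an entire prompt always fits in a single batch and tokens are
// never split across batches.
func adjustForEmbedding(isEmbedding bool, opts Options) Options {
	if isEmbedding && opts.NumBatch < opts.NumCtx {
		opts.NumBatch = opts.NumCtx
	}
	return opts
}

func main() {
	opts := adjustForEmbedding(true, Options{NumCtx: 8192, NumBatch: 512})
	fmt.Println(opts.NumBatch) // 8192: the full input now fits in one batch
}
```
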
Jeffrey Morgan 48e78e9be1
template: add yesterdayDate helper function (#13431) 2025-12-11 14:47:55 -08:00
Jeffrey Morgan a838421ea3
model: conversion and hyperparameter fixes for ministral and devstral (#13424) 2025-12-11 13:04:00 -08:00
EasonLin 1c4e85b4df
routes: add logprobs in tool calls (#13238) 2025-12-10 17:28:41 -08:00
Eloi Torrents dac4f17fea
cmd/bench: fix binary name in README (#13276) 2025-12-10 14:16:58 -08:00
Julia Scheaffer 56b8fb024c
cmd/bench: fix options table in cmd/bench/README.md (#13216) 2025-12-10 14:07:48 -08:00
Gabe Goodhart b95693056c
feat: llama.cpp bump (17f7f4) for SSM performance improvements (#13408)
* feat: Bump llama.cpp to the latest master (17f7f4b)

This brings in significant improvements to prefill performance for all
models using the SSM_CONV and SSM_SCAN ops (granite4, jamba, falcon-h,
nemotron-h, Qwen3 Next) on Apple Metal.

See https://github.com/ggml-org/llama.cpp/pull/17876

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Update patches 1-4

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update patches 5-12

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Update patches 13-18

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Update patch 20

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Update patches 21-31

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Sync vendored code

The two files I'm not sure about here are the swap from gemma3-iswa.cpp to
gemma3.cpp (I chose to include it because I think it's required) and
`ggml-zendnn.h`, which I chose to omit.

Branch: LlamaCPPMetalSSMImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-12-10 12:59:27 -08:00
Eva H c34fc64688
app/ui: use requestAnimationFrame to prevent bottom line cutoff in streaming thinking display (#13137) 2025-12-10 15:29:48 -05:00
Eva H 7cf6f18c1f
app/ui: refactor to use Ollama endpoints for user auth and health checks (#13081) 2025-12-10 15:24:31 -05:00
Eva H bbbb6b2a01
app/ui: fix model capabilities not updating after download completion (#13179) 2025-12-10 14:40:02 -05:00
nicole pardal 76f88caf43
nomic-embed-text:v2: model implementation (#13162) 2025-12-09 14:24:51 -08:00
Parth Sareen 2bccf8c624
renderers/parsers: olmo3 instruct (#13383) 2025-12-09 11:12:27 -08:00
Parth Sareen 0c5e5f6630
parsers/renderers: olmo3 think (#13290) 2025-12-09 10:41:47 -08:00
Michael Yang d475d1f081 fix: qwen2.5vl metal argsort 2025-12-08 17:18:24 -08:00
Jeffrey Morgan d2f334c1f7
model: add rnj-1 inference support (#13354) 2025-12-08 16:49:17 -08:00
Michael Yang 603ceefaa6 refactor rope
change to a flatter directory structure and group the options with the
function

update models to call rope in one place
2025-12-08 14:42:22 -08:00
nicole pardal e082d60a24
truncation: fixed runner truncation logic + removed server truncation (#12839)
This PR consolidates all embedding prompt-length checking, truncation, and prompt token counting into the runner to ensure a single source of truth.
2025-12-08 11:20:28 -08:00
Daniel Hiltgen 5dae738067
CI: use vendor base commit in cache keys (#13348)
Prevent CGO from accidentally reusing old object files from the cache
across vendor updates
2025-12-08 09:48:49 -08:00
JJ 0c78723174
readme: fix broken Swollama link in community integrations (#13370) 2025-12-07 21:49:52 -08:00
Jeffrey Morgan 5a41d69b2a
fs/ggml: write int32 and int64 values to gguf files (#13335) 2025-12-07 21:49:14 -08:00
Daniel Hiltgen c146a138e3
ggml: handle all streams (#13350)
Follow up from #12992

Free all streams, and keep the alloc logic aligned across streams.
2025-12-05 16:10:33 -08:00
Sos Pogosyan 31b8c6a214
fix(api): correct Content-Type header for /api/chat and /api/generate when using cloud models (#13279)
---------

Co-authored-by: Pogosyan Sos <sos_pogosyan@MacBook-Pro-Sos.local>
Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-12-04 21:33:07 -08:00
Jesse Gross 9191dfaf05 llm: Enable flash attention for mistral3 by default 2025-12-04 15:19:06 -08:00
Jesse Gross 1108d8b34e ggml: Enable flash attention for vision encoders
Although the vision component of multimodal models typically already
calls the optimized nn.Attention, that call is converted into non-fused
operations. This is because the backend-specific fused kernels may have
requirements, such as padding, that are normally handled by the cache,
and vision encoders don't use the cache.

This implements a fallback path in the backend, softening the
requirements into optimizations. In turn, this allows flash attention
to be used for vision encoders, saving a significant amount of VRAM
and improving performance.
2025-12-04 15:19:06 -08:00
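
One plausible reading of "softening the requirements into optimizations" in the commit above: the backend takes the fused flash-attention path when the kernel's preferences (such as padding) happen to be met and quietly falls back to the unfused ops otherwise, so callers like vision encoders can request flash attention without pre-padding. A Go sketch under that assumption, with made-up names:

```go
package main

import "fmt"

// Tensor is a stand-in for a backend tensor; only the sequence length matters here.
type Tensor struct{ seqLen int }

const fusedAlignment = 256 // hypothetical padding the fused kernel prefers

// attention chooses the fused flash-attention path when the inputs already
// meet the kernel's alignment preference, and otherwise falls back to the
// unfused operations instead of rejecting the call outright.
func attention(q, k, v Tensor) string {
	if k.seqLen%fusedAlignment == 0 && v.seqLen%fusedAlignment == 0 {
		return "fused flash attention"
	}
	return "unfused attention fallback"
}

func main() {
	// A vision encoder has no KV cache to pad its sequences, but still runs.
	fmt.Println(attention(Tensor{577}, Tensor{577}, Tensor{577}))
	// Cache-padded sequences take the fused fast path.
	fmt.Println(attention(Tensor{1024}, Tensor{1024}, Tensor{1024}))
}
```
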
Jesse Gross 7837a5bc7e ggml: Always set cache padding to 256
We currently use cache padding of 32 when not using flash attention
and 256 with flash attention, which is based on the historic alignment
requirements of these kernels. The restrictions have since been
loosened but there are still performance benefits, such as better
CUDA graph reuse.

Since the requirement is no longer kernel-specific, set the padding
uniformly to 256, as llama.cpp does.
2025-12-04 15:19:06 -08:00
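
For reference, padding a cache length up to a uniform multiple of 256 is a simple round-up; a short sketch, not the actual Ollama code:

```go
package main

import "fmt"

const cachePadding = 256

// padCacheSize rounds n up to the next multiple of cachePadding.
func padCacheSize(n int) int {
	return (n + cachePadding - 1) / cachePadding * cachePadding
}

func main() {
	fmt.Println(padCacheSize(1000)) // 1024
	fmt.Println(padCacheSize(1024)) // 1024
}
```
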
Patrick Devine 0a844f8e96
convert: add deepseek converter (#12980)
This change adds the ability for `ollama create` to convert models that use
the DeepSeek2 architecture (specifically DeepSeekV3 and DeepSeek-R1).
2025-12-04 13:49:30 -08:00
Eloi Torrents a03223b86f
cmd/bench: support writing benchmark output to file (#13263)
* cmd/bench: support writing benchmark output to file

This changes Ollama to allow the bench command to write benchmark
results to a user-specified output file instead of stdout when the
--output flag is provided.

---------

Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-12-04 13:22:41 -08:00
Daniel Hiltgen 0cf7794b16
ggml update to b7108 (#12992)
* Revert "vulkan: temporary cary of vulkan fixes (#12971)"

This reverts commit 3a9e8e9fd4.

* ggml update to b7087

* fix argsort on metal

* update to b7108

* fix bakllava regression

This model lacks the metadata for the projector type.

* update to b7209

* fix TopK perf

* only build arm code on arm
2025-12-03 19:43:29 -08:00
Jeffrey Morgan 854d40edc5
ci: restore previous linter rules (#13322) 2025-12-03 18:55:02 -08:00
Bruce MacDonald 84a2cedf18
app: relay thinking false to server (#13319)
This fixes a bug where disabling thinking on deepseek-v3.1 did not stop the model from thinking.

When thinking is not defined, it should not be sent to the server, since sending it can cause error responses in some cases where the model does not support thinking. However, if it is explicitly set to false, it should still be sent.
2025-12-03 15:06:55 -08:00
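
In Go, the "unset versus explicitly false" distinction described above is typically modeled with a pointer field, so nil is omitted from the request while an explicit false is still serialized. A minimal sketch with a hypothetical request type:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chatRequest is a hypothetical request body; a *bool distinguishes
// "not specified" (nil, omitted) from "explicitly disabled" (false, sent).
type chatRequest struct {
	Model string `json:"model"`
	Think *bool  `json:"think,omitempty"`
}

func main() {
	// Thinking not specified: the field is omitted entirely.
	unset, _ := json.Marshal(chatRequest{Model: "deepseek-v3.1"})
	fmt.Println(string(unset)) // {"model":"deepseek-v3.1"}

	// Thinking explicitly disabled: false is still relayed to the server.
	off := false
	disabled, _ := json.Marshal(chatRequest{Model: "deepseek-v3.1", Think: &off})
	fmt.Println(string(disabled)) // {"model":"deepseek-v3.1","think":false}
}
```
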
Daniel Hiltgen 3f30836734
CUDA: filter devices on secondary discovery (#13317)
We now do a deeper probe of CUDA devices to verify the library version has
the correct compute capability coverage for the device. Because ROCm also
interprets the CUDA visibility env var to filter AMD devices, we normally avoid
setting it, since doing so causes problems on mixed-vendor systems. However,
without setting it for this deeper probe, each CUDA library subprocess discovers
all CUDA GPUs, and on systems with lots of GPUs this can lead to hitting timeouts.
The fix is to turn on the CUDA visibility env var just for this deeper probe
use case (sketched below).
2025-12-03 12:58:16 -08:00
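
A sketch of scoping the visibility variable to the probe subprocess only, so the parent process (and ROCm) never sees it; the probe binary and its flag are hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// probeDevice runs the deeper CUDA probe in a subprocess, setting
// CUDA_VISIBLE_DEVICES only for that child so ROCm in the parent process
// never sees (and misinterprets) the variable.
func probeDevice(id string) error {
	cmd := exec.Command("./cuda-probe", "--device", id) // hypothetical helper binary
	cmd.Env = append(os.Environ(), "CUDA_VISIBLE_DEVICES="+id)
	out, err := cmd.CombinedOutput()
	fmt.Printf("probe %s: %s\n", id, out)
	return err
}

func main() {
	_ = probeDevice("0")
}
```
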
Nathan Hook cc9555aff0
Update user message format for temperature query (#13256) 2025-12-02 15:08:39 -08:00
hello_world 20aee96706
Add Vulkan GPU support instructions in development.md (#13265)
Added Vulkan SDK installation instructions and environment variable setup for building with Vulkan support.
2025-12-02 13:37:32 -08:00
Daniel Hiltgen 18b5958d46
test: avoid ministral tools test on low vram (#13302)
Avoid hitting test timeouts
2025-12-02 13:18:55 -08:00
Jesse Gross 5317202c38 llm: Don't always evict models on CPU-only systems
Model eviction happens when we have at least one other model
loaded and are unable to load all layers into VRAM. However, on
CPU-only systems we can never load layers into VRAM, so this
constantly triggered eviction.

Fixes #13227
2025-12-02 10:58:08 -08:00
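
A condensed sketch of the guard described above, with illustrative names: eviction is only considered when a GPU exists whose VRAM could actually be freed.

```go
package main

import "fmt"

// shouldEvict reports whether evicting another loaded model could help.
// On CPU-only systems no layers can ever be offloaded to VRAM, so eviction
// would never change the outcome and is skipped.
func shouldEvict(hasGPU bool, otherModelsLoaded int, layersFitInVRAM bool) bool {
	if !hasGPU {
		return false // CPU only: eviction cannot free any VRAM
	}
	return otherModelsLoaded > 0 && !layersFitInVRAM
}

func main() {
	fmt.Println(shouldEvict(false, 2, false)) // false: CPU-only system
	fmt.Println(shouldEvict(true, 1, false))  // true: VRAM pressure with a GPU
}
```
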
Daniel Hiltgen d771043e88
test: add ministral-3 (#13300) 2025-12-02 09:52:16 -08:00
Daniel Hiltgen f8f1071818
CUDA: verify CC is supported by target library (#13298) 2025-12-02 09:28:41 -08:00
Patrick Devine d3e0a0dee4
model: ministral w/ llama4 scaling (#13292)
This change:

* fixes rope scaling in the mistral converter
* updates ministral to include llama4 scaling
* includes a new ministral parser for parsing reasoning and tool calling

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
2025-12-01 23:20:14 -08:00
Daniel Hiltgen 554172759c
win: warn if ggml-base detected in PATH (#13289)
If the user has somehow installed another GGML-based app which places a
ggml-base lib somewhere in their PATH, we can experience runtime problems
due to incompatibilities. This change adds a warning message if we detect
a ggml-base outside of our install location, to aid in troubleshooting.
2025-12-01 15:36:47 -08:00
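
A sketch of what such a PATH scan could look like on Windows, warning when a ggml-base library is found outside the Ollama install directory; the file name and logging calls are illustrative, not the actual implementation:

```go
package main

import (
	"log/slog"
	"os"
	"path/filepath"
	"strings"
)

// warnForeignGGML warns if a ggml-base DLL is found on PATH outside installDir,
// since an incompatible copy loaded first can cause runtime failures.
func warnForeignGGML(installDir string) {
	for _, dir := range filepath.SplitList(os.Getenv("PATH")) {
		candidate := filepath.Join(dir, "ggml-base.dll")
		if _, err := os.Stat(candidate); err != nil {
			continue
		}
		if !strings.EqualFold(filepath.Clean(dir), filepath.Clean(installDir)) {
			slog.Warn("found another ggml-base on PATH; this may cause incompatibilities", "path", candidate)
		}
	}
}

func main() {
	warnForeignGGML(`C:\Program Files\Ollama`)
}
```
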
Bruce MacDonald 5b6a8e6001
api/client: handle non-json streaming errors (#13007)
While processing the response stream during a chat or generation, if an error occurs it is parsed and returned to the user. The issue with the existing code is that it assumed the response would be valid JSON, which is not a safe assumption and caused cryptic error messages to be displayed due to parsing failures:
`invalid character 'i' looking for beginning of value`

This change updates the stream function to return the raw error string if it can't be parsed as JSON. This should help with debugging by making sure the actual error reaches the user.
2025-12-01 15:10:16 -08:00
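
A sketch of the more forgiving error handling described above: try to decode the error body as JSON and, if that fails, surface the raw text instead of a cryptic parse error. The payload shape here is hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// apiError is a hypothetical error payload shape.
type apiError struct {
	Error string `json:"error"`
}

// streamError turns an error body into a useful error: structured when the
// body is JSON, and the raw text otherwise (e.g. an HTML or plain-text page).
func streamError(body []byte) error {
	var e apiError
	if err := json.Unmarshal(body, &e); err == nil && e.Error != "" {
		return errors.New(e.Error)
	}
	return errors.New(string(body)) // not JSON: surface the raw message
}

func main() {
	fmt.Println(streamError([]byte(`{"error":"model not found"}`)))
	fmt.Println(streamError([]byte("internal server error")))
}
```
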
Daniel Hiltgen 467bbc0dd5
jetpack: require exact match or skip cuda_jetpack* (#13288)
The cuda_jetpack libs will enumerate discrete GPUs on SBSA systems,
which leads to runtime failures from missing kernels. This fix
requires an exact match to enable jetpacks instead of relying on
enumeration to filter out supported libraries.
2025-12-01 12:48:16 -08:00
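
A sketch of the exact-match rule described above, with illustrative variant strings:

```go
package main

import (
	"fmt"
	"strings"
)

// enableJetpack reports whether a cuda_jetpack* runner library should be used.
// Rather than trusting device enumeration (the jetpack libs also enumerate
// discrete GPUs on SBSA systems), require the library variant to exactly match
// the detected Jetson variant.
func enableJetpack(libVariant, detectedVariant string) bool {
	if !strings.HasPrefix(libVariant, "cuda_jetpack") {
		return true // not a jetpack library; filtered elsewhere
	}
	return libVariant == detectedVariant
}

func main() {
	fmt.Println(enableJetpack("cuda_jetpack6", "cuda_jetpack6")) // true: exact match
	fmt.Println(enableJetpack("cuda_jetpack6", ""))              // false on an SBSA system
}
```
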
Jeffrey Morgan 6d9f9323c5
.gitattributes: add app/webview to linguist-vendored (#13274) 2025-11-29 23:46:10 -05:00
Ondrej Kokes 0c2489605d
docs: fix output formatting in faq.mdx (#13231)
There were a few Markdown typos in one FAQ answer. It now renders as a proper ASCII table.
2025-11-28 19:19:21 -05:00
EntropyYue 8b1b89a984
docs: remove deprecated parameters (#13237) 2025-11-26 11:03:09 +09:00
Eva H 47e272c35a
app/cmd: update ollama help to navigate to ollama doc instead of github page (#13174) 2025-11-20 16:30:35 -05:00
Jeffrey Morgan 417a81fda3
app: open app instead of always navigating to / on connect (#13164) 2025-11-20 12:59:17 -08:00
Daniel Hiltgen dba62ff3a5
discovery: fix cuda overlap case (#13176)
Recent refactoring introduced a regression in filtering overlapping CUDA libraries to favor the newest supported version.
2025-11-20 12:15:37 -08:00
Grace d70e935526
Parser for Cogito v2 (#13145) 2025-11-19 17:21:07 -08:00
Michael Yang 5c1063df7f
deepseek2: upgrade to run v3+ models (#13166)
the check for MLA omits v3 and r1, which should not be reported as unsupported.
instead, check the tokenizer for compatibility
2025-11-19 17:05:39 -08:00
Jesse Gross cb485b2019 kvcache: Run tests both with and without PermutedV
The causal cache can store data differently depending on what is
best for the backend. We should run tests both ways.
2025-11-19 16:45:30 -08:00
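
Running the same assertions under both V layouts is naturally expressed as a table-driven subtest; a sketch with a hypothetical cache constructor left as a comment:

```go
package kvcache_test

import (
	"fmt"
	"testing"
)

// TestCausalCache runs the same assertions with and without a permuted V
// layout, since the causal cache may store data either way depending on
// what is best for the backend.
func TestCausalCache(t *testing.T) {
	for _, permutedV := range []bool{false, true} {
		t.Run(fmt.Sprintf("permutedV=%v", permutedV), func(t *testing.T) {
			// newTestCache is a hypothetical helper that would build a cache
			// configured for the requested V layout:
			//   cache := newTestCache(t, permutedV)
			//   ... exercise the cache and assert on its contents ...
			_ = permutedV
		})
	}
}
```
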