ollama

Commit Graph

Author	SHA1	Message	Date
Michael Yang	f6a016f49d	revert granite-embedding (#13505 )	2025-12-16 15:44:52 -08:00
Michael Yang	903b1fc97f	use ollama engine for bert models (#13501 ) register bpe tokenizer which enables granite-embedding	2025-12-16 11:29:19 -08:00
Michael Yang	971d62595a	fix: qwen2.5 vl rope (#13486 ) * qwen25vl: bump max pixels * qwen25vl: mrope fix qwen2.5vl window * qwen25vl: vision rope	2025-12-15 17:30:33 -08:00
Parth Sareen	ffbe8e076d	model: add olmo3 and olmo3.1 (#13415 )	2025-12-15 15:20:04 -08:00
Jeffrey Morgan	4ff8a691bc	model: default gemma 3 rope scale to 1.0, apply corrections based on layer counts (#13453 )	2025-12-12 17:51:56 -08:00
Jeffrey Morgan	1b308e1d2a	model: fix global layer rope scale values for gemma 3 (#13452 )	2025-12-12 16:29:01 -08:00
Jeffrey Morgan	3af5d3b738	model: force rope factor 1.0 for Gemma 3 (#13445 )	2025-12-12 13:27:08 -08:00
Jeffrey Morgan	2dfb74410d	model: fix rotary embeddings for ministral 3 (#13432 )	2025-12-11 16:02:05 -08:00
Jeffrey Morgan	a838421ea3	model: conversion and hyperparameter fixes for ministral and devstral (#13424 )	2025-12-11 13:04:00 -08:00
nicole pardal	76f88caf43	nomic-embed-text:v2: model implementation (#13162 )	2025-12-09 14:24:51 -08:00
Jeffrey Morgan	d2f334c1f7	model: add rnj-1 inference support (#13354 )	2025-12-08 16:49:17 -08:00
Michael Yang	603ceefaa6	refactor rope change to a flatter directory structure and group the options with the function update models to call rope in one place	2025-12-08 14:42:22 -08:00
Patrick Devine	d3e0a0dee4	model: ministral w/ llama4 scaling (#13292 ) This change: * fixes rope scaling in the mistral converter * updates ministral to include llama4 scaling * includes a new ministral parser for parsing reasoning and tool calling --------- Co-authored-by: jmorganca <jmorganca@gmail.com>	2025-12-01 23:20:14 -08:00
Michael Yang	5c1063df7f	deepseek2: upgrade to run v3+ models (#13166 ) the check for mla omits v3 and r1 which should not return unsupported. instead check the tokenizer for compatibility	2025-11-19 17:05:39 -08:00
Patrick Devine	604e43b28d	models: enable deepseek2 (deepseek v3.1 w/ MLA) on the new engine (#13151 )	2025-11-18 22:03:50 -08:00
nicole pardal	8de30b568a	nomic-embed-text model implementation (#13071 )	2025-11-18 18:28:10 -08:00
Michael Yang	92981ae3f2	deepseekocr	2025-11-18 16:11:37 -08:00
Grace	584e2d646f	Add deepseek v3.1 (#13063 ) * Add mla for flash attention * Revert to using chunks	2025-11-17 18:03:21 -08:00
Michael Yang	333203d871	chore: update models to use slice/chunk/chunksections (#12934 ) * use slice/chunks * bert * llama4 * gemma3n * gptoss * mistral3 * qwen3vl * qwen25vl * deepseek2 * remove unused ops	2025-11-13 15:20:12 -08:00
Daniel Hiltgen	544b6739dd	ggml update to b6840 (#12791 )	2025-11-06 10:19:22 -08:00
Michael Yang	ce3eb0a315	chore(gptoss): cleanup dead code (#12932 )	2025-11-03 11:27:15 -08:00
Michael Yang	f67a6df110	interleaved mrope (#12807 ) * ml(ggml): mrope * interleave mrope	2025-10-30 11:29:00 -07:00
Michael Yang	d432ade714	fix: qwen2.5vl, qwen3vl composite image (#12841 ) this change fixes images with an alpha channel by overlaying the image onto a white background	2025-10-30 10:33:19 -07:00
Michael Yang	7d25b9e194	feat(model): add qwen3vl (#12665 )	2025-10-28 17:39:47 -07:00
Michael Yang	1188f408dd	s/FromSlice/Froms/ (#12255 )	2025-10-28 12:08:49 -07:00
Michael Yang	ec9eb28f4c	gemma3: make embedding non-causal (#12297 )	2025-10-27 19:54:08 -07:00
Daniel Hiltgen	bc1a818fdc	contiguous input per layer (#12686 ) Co-authored-by: Michael Yang <git@mxy.ng>	2025-10-17 18:39:18 -07:00
Michael Yang	6c833d5f8d	fix(qwen3): deepseek distill deepseek's qwen3 distill uses a different rope scheme so support both	2025-10-13 13:30:30 -07:00
shengxinjing	47298fce39	refactor: use builtin max and min	2025-10-09 16:17:52 -07:00
shengxinjing	4a48937ef1	refactor: use builtin max and min	2025-10-09 16:17:52 -07:00
Grace	33801c1597	Fixed Deepseek2 adding nil tensor error	2025-10-03 14:20:06 -07:00
Grace	fbd82ba5bb	Grace/deepseek v3 migration (#12385 ) * init deepseek model file * temp removal of flash attention implementation * shapes and proper, can make a pass * query, key, value have good cosine similarity, but the max diff is a bit high * Attention block is working! ** with eager for now, have not added the mask line * Attention block is working! ** with eager for now, have not added the mask line * working MoE at around 0.95 cosine sim * added cosine similarity function * Starting end to end structure * Trying (and failing) to get rope to work, going to test full thing on tater * running on tater36... just not the right outputs * we have the right values for rope... but its still not working? * chnage Extrapolation Factor to 1 * removed adding residuals twice, removed normalization from shared expert, refactored Norms (Attention, MLP) to be outside the (Attention, MLP) blocks and in the Transformer block instead, add cache setLayer * Temporary modelfiles for cpu * change kpass intermediate step to kv, two layer outputs [0,1] look fine * this calls for 16 chicken nuggets * whoops * cleaning up code * delete stuff we dont need * getting rid of debug statements for llama cpp * working with long contexts * fix long context view error * reverting some changes I made for files that are not apart of pr * Added proper tokenizer for deeepseek3 * clean up model and go test * remove Modelfile * not passing the tests * whoops * how to pass the ci tests * resolving some of the comments * rename * linted and renamed deepseek3 -> deepseek2 * remove name go * addressed changes - main change was adopting qwen3 naming scheme * I cannot with linters * clean up logs * clean up logs --------- Co-authored-by: Grace Guo <graceguo@Graces-MBP.localdomain> Co-authored-by: Grace Guo <graceguo@Graces-MacBook-Pro.local> Co-authored-by: graceguo <graceguo@tater36.localdomain>	2025-09-24 15:19:47 -07:00
Michael Yang	bf78ed6ee9	add pre:, suf: to tags (#12274 )	2025-09-23 16:08:57 -07:00
Michael Yang	a40d427bce	multi-regexp pretokenizer (#12325 )	2025-09-23 13:21:47 -07:00
Patrick Devine	dba39b2eee	gemma: fix rope scaling for qat models (#12348 ) * gemma: fix rope scaling for qat models * gofumpt yourself	2025-09-19 15:04:40 -07:00
Michael Yang	7460259eb3	feat: qwen3 embed (#12301 ) * cleanup * use pooling.TypeNone * pooling test * qwen3 embed	2025-09-18 15:50:32 -07:00
Michael Yang	564b558c92	fix(llama): other llama flavours (#12308 ) * fix(llama): rope scale * spm llama * skip moe models * cleanup	2025-09-17 12:12:21 -07:00
Michael Yang	ad95d5b30b	use split activations when possible (#12293 ) * use ggml__split activations when possible forward qkv	2025-09-16 09:51:19 -07:00
Michael Yang	c253433d68	embed: cleanup (#12299 ) * cleanup * use pooling.TypeNone * pooling test	2025-09-16 09:48:42 -07:00
Michael Yang	3f6642f6fc	model: implement bert in ollama engine (#9080 ) * fix truncate * s/SentencePieceModel/SentencePiece/ * bert * wordpiece * refactor pooling * more tokenizers * normalize embeddings	2025-09-15 15:35:59 -07:00
Michael Yang	6f7117145f	batch: use tensors for outputs (#12185 ) this cleans up the model interface slightly without too much impact in other areas	2025-09-15 14:33:06 -07:00
Michael Yang	5994e8e8fd	embedding gemma model (#12181 ) * ollama: add embeddings	2025-09-04 09:09:07 -07:00
Daniel Hiltgen	517807cdf2	perf: build graph for next batch async to keep GPU busy (#11863 ) * perf: build graph for next batch in parallel to keep GPU busy This refactors the main run loop of the ollama runner to perform the main GPU intensive tasks (Compute+Floats) in a go routine so we can prepare the next batch in parallel to reduce the amount of time the GPU stalls waiting for the next batch of work. * tests: tune integration tests for ollama engine This tunes the integration tests to focus more on models supported by the new engine.	2025-08-29 14:20:28 -07:00
Michael Yang	30fb7e19f8	remove extra field attr (#11205 )	2025-08-25 09:58:16 -07:00
Michael Yang	1a19df1f3a	update vendored llama.cpp and ggml (#11823 ) * TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch This will be redone once my branch is merged upstream in llama.cpp * feat: Update all patches There are a number that are no longer needed at all: - 0003-embeddings: Embeddings entirely overhauled on master - 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely overhauled on master - 0019-metal-add-mean-kernel-14267: Merged upstream - 0020-CUDA-add-mean-operation-14313: Merged upstream * feat: Sync llama.cpp and ggml * fix: Update rsync-filter for all moved/new/removed files * fix: Add files missing from sync * fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs * fix: Add ggml files missing from sync * fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files * fix: Remove mtmd main cpp files * fix: Add missing include in sampling_ext.cpp * fix: Update llama.go to use mtmd instead of clip/llava * fix: Add patch for mtmd_input_text * chore: Ignore .patched in the patch directory fix: Fix support for arch-specific ggml-cpu source files with new arrangement In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific implementations were split out into a nested tree structure under ggml-cpu/arch. This conflicts with standard CGO layout where all arch-specific source files are expected to live in the same directory as the parent go module and use suffixes based on GOOS and GOARCH. As such, there were really two options for getting this to work: 1. Add a patch on top of the GGML sync to rearrange the files to match the GO layout convention 2. Use CGO directives to conditionally include the nested source files in the compilation units This commit does (2) in order to minimize the set of changes needed on top of the upstream file layout. To get this to work, there are two key things needed: 1. 
In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in the preprocessor directives 2. In arch-impls.c\|cpp, use an #ifdef \| #elif defined \| #endif chain to explicitly include the .c\|.cpp files for the given architecture from the nested directory * fix: Use mtmd_helper to correctly load the bitmap for the image * fix: Apply patch for mtmd_text_input * fix: Add missing stb to llama.cpp rsync-filter * fix: Add sync'ed stb vendored header * fix: Use c++17 and include vendor for go wrapper modules * fix: Update patch 0015 for upstream implementation of uuid * feat: Bump to the latest tip of the branch * fix: Update patches for bump * feat: Bump back to the cenral repo and point at the latest master This includes granite 4 and a number of other model architectures! * fix: Revert changes to ggml export GPU UUID patch * fix: Add patch for GGML_VERSION and GGML_COMMIT constants * feat: Sync all patched code * build: Include cmake/common.cmake in ggml sync * build: Add top-level include for GNUINstallDirs in CMakeLists.txt This is used to populate CMAKE_INSTALL_BINDIR * fix: Add a patch to avoid power throttling API on non-msvc windows builds * fix: Sync patch changes for ggml-cpu.c * feat: Bump llama.cpp to 4a4f42 This picks up support for Kimi K2 and PLaMO-2 * feat: Sync llama.cpp * fix: Handle multi-chunk image encodings from mtmd * fix: Re-number patches after merge with `main` * feat: Bump to 41e78c in the makefile * fix: Fix Solar and argsort/copy patches after bump * fix: Remove Gemma3n CUDA Graphs patch It was implemented upstream: https://github.com/ggml-org/llama.cpp/pull/14741 * feat: Sync llama.cpp / ggml after latest bump * build: Remove unnecessary CFLAGS definitions in cpu.go * fix: Remove unnecessary additions in the rsync-filter * fix: Remove unused vendored code for chat template parsing * Revert "fix: Remove Gemma3n CUDA Graphs patch" This reverts commit `d724caced3`. 
* fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394 * fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n * unwind mxfp4 patch Prepare to bump ggml with their impl for mxfp4 * bump * fix windows build error * Convert tensors at load time Repack the mxfp4 tensors as ggmls kernels expect them to be. * convert mlp bf16 to f32 * buffer the conversion better * reshape earlier * openai swiglu * add ids * split qkv, gate_up * fix nested alt tags * fast attention * remove debug messages * fix lint * remove redundant test * remap values only if source/target are different * add back i32->i32 copy * refactor cpu quants * clean up vendor * update patch instructions * clean up patches * remove webgpu * update mem * also handle gpt-oss * revert convert changes --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2025-08-14 14:42:58 -07:00
Michael Yang	fa7776fd24	gpt-oss (#11672 ) * bf16 * tests * gpt-oss * enable gptoss for engine * rough estimate * convert to mxfp4 * handle safetensors U8 * clamp glu/linear * update tokenizer * MXFP4 support This implements the Open Compute Microscaling (MX) FP4 format as a tensor type with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal. * Unit tests for MXFP4 support This exercises various operations and shapes on both CPU and GPU (if detected on the system) * cuda graph * unit test adjustments * cuda: optimize memory access Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4 * mac: fix crash on old macos versions cblas_sgemm is only supported on v13.3 and up, however bf16 is only supported on v14+ so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to condittionally avoid registering the backend. * server: Minimum context length for gptoss This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms but lower values will be silently reset. * ggml: Multiply by numParallel for gptoss sliding window When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account. * gpt-oss integration includes harmony parser and thinking levels, etc. * fix sync * fix tests * fix lint --------- Co-authored-by: Daniel Hiltgen <daniel@ollama.com> Co-authored-by: Jesse Gross <jesse@ollama.com> Co-authored-by: Devon Rifkin <drifkin@drifkin.net>	2025-08-05 12:21:16 -07:00
Oliver Simons	ea85e27bbd	Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525 ) * Enable CUDA Graphs for gemma3n. Similar to https://github.com/ggml-org/llama.cpp/pull/14741, though ollama has a slightly different model graph than llama.cpp which requires different workaround checks. * Remove residual check by reshaping differently in gemma3n model This should make the heuristics more robust	2025-07-29 12:37:06 -07:00
Daniel Hiltgen	f8a6e88819	Only load supported models on new engine (#11362 ) * Only load supported models on new engine Verify the model is supported before trying to load * int: testcase for all library models	2025-07-11 12:21:54 -07:00
Michael Yang	4129af9205	chore: cleanup comments + unused vars (#11225 )	2025-06-27 11:45:33 -07:00
Michael Yang	73b642e6f3	add new gemma model (#11204 ) * update patches * cherry pick metal mean kernel * cherry pick cuda mean kernel * gemma3n	2025-06-25 21:47:09 -07:00

118 Commits