ollama

Jesse Gross 53985b3c4d	2025-11-18 20:42:28 -08:00

kvcache: Use SetRows to store cache data

We currently copy data into the KV cache in contiguous buffers using ggml_cpy(). ggml_set_rows() was introduced to allow scatter operation so that contiguous buffers are no longer required. The direct primary benefit of this is that we no longer need to perform defragmentation.

However, GGML recently removed an optimization for ggml_cpy() and we picked it up in `544b673` "ggml update to b6840 (#12791)". This caused a roughly 40% drop in token generation performance on CUDA due to CUDA graphs no longer being used. By switching to ggml_set_rows(), the original optimization is no longer necessary and CUDA performance is restored.

Fixes #13112

backend	kvcache: Use SetRows to store cache data	2025-11-18 20:42:28 -08:00
nn	Add deepseek v3.1 (#13063)	2025-11-17 18:03:21 -08:00
backend.go	kvcache: Use SetRows to store cache data	2025-11-18 20:42:28 -08:00
device.go	llm: Prefer dedicated GPUs over iGPUs when allocating memory	2025-11-11 13:11:08 -08:00
path.go	cpu: always ensure LibOllamaPath included (#12890)	2025-10-31 14:37:29 -07:00
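
The latest commit message contrasts copying into a contiguous range of cache rows with scattering rows to arbitrary indices. The sketch below illustrates that difference in plain Go. It is not the ollama or GGML implementation: `newCache`, `copyContiguous`, and `setRows` are hypothetical helpers invented for this example, and the cache is modeled as a simple slice of rows rather than a GGML tensor.

```go
package main

import "fmt"

// newCache allocates a toy cache of n rows, each of the given width.
// (Hypothetical helper; a real KV cache is a backend tensor.)
func newCache(n, width int) [][]float32 {
	c := make([][]float32, n)
	for i := range c {
		c[i] = make([]float32, width)
	}
	return c
}

// copyContiguous writes rows into cache[start : start+len(rows)].
// It needs that many adjacent free rows, which is why a fragmented
// cache would have to be defragmented before such a copy.
func copyContiguous(cache, rows [][]float32, start int) error {
	if start < 0 || start+len(rows) > len(cache) {
		return fmt.Errorf("no contiguous range of %d rows at %d", len(rows), start)
	}
	for i, r := range rows {
		copy(cache[start+i], r)
	}
	return nil
}

// setRows scatters rows[i] into cache[idx[i]]. The target rows do not
// need to be adjacent, so free slots can be used wherever they are.
func setRows(cache, rows [][]float32, idx []int) error {
	if len(rows) != len(idx) {
		return fmt.Errorf("got %d rows but %d indices", len(rows), len(idx))
	}
	// Validate every index first so a bad index leaves the cache untouched.
	for _, j := range idx {
		if j < 0 || j >= len(cache) {
			return fmt.Errorf("row index %d out of range", j)
		}
	}
	for i, r := range rows {
		copy(cache[idx[i]], r)
	}
	return nil
}

func main() {
	cache := newCache(4, 2)
	rows := [][]float32{{1, 2}, {3, 4}}

	// Suppose rows 1 and 2 are already in use, so only rows 0 and 3 are free.
	// A contiguous copy of two rows has nowhere to go without moving data,
	// but a scatter can place them directly.
	if err := setRows(cache, rows, []int{0, 3}); err != nil {
		fmt.Println("scatter failed:", err)
		return
	}
	fmt.Println(cache)
}
```

In this toy model the scatter path never needs two adjacent free rows, which mirrors the commit's point that defragmentation is no longer required once rows are addressed by index.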