ollama

ggml	kvcache: Use SetRows to store cache data	2025-11-18 20:42:28 -08:00
backend.go	next ollama runner (#7913)	2025-02-13 16:31:21 -08:00

Jesse Gross 53985b3c4d kvcache: Use SetRows to store cache data

We currently copy data into the KV cache in contiguous buffers using
ggml_cpy(). ggml_set_rows() was introduced to allow scatter operations,
so contiguous buffers are no longer required. The primary benefit of
this is that we no longer need to perform defragmentation.

However, GGML recently removed an optimization for ggml_cpy() and
we picked up that change in 544b673 "ggml update to b6840 (#12791)".
This caused a roughly 40% drop in token generation performance on CUDA
because CUDA graphs were no longer being used. By switching to
ggml_set_rows(), the original optimization is no longer necessary
and CUDA performance is restored.

Fixes #13112

2025-11-18 20:42:28 -08:00
