The CLUSTER SLOTS reply depends on whether the client is connected over
IPv6, but a fake client has no connection. When this command is called
from a module timer callback, or in any other scenario where no real
client is involved, there is no connection on which to check IPv6
support.
This fix handles the missing case by returning the reply used for
IPv4-connected clients.
Fixes #2912.
---------
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Su Ko <rhtn1128@gmail.com>
Co-authored-by: KarthikSubbarao <karthikrs2021@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In #2078, we did not report large replies when copy avoidance is
allowed. As a result, replies larger than 16384 bytes are not recorded
in the commandlog large-reply. The 16384 threshold is controlled by the
hidden config min-string-size-avoid-copy-reply.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## Problem
IO thread shutdown can deadlock during server panic when the main thread
calls `pthread_cancel()` while the IO thread holds its mutex, preventing
the thread from observing the cancellation.
## Solution
Release the IO thread mutex before cancelling to ensure clean thread
termination.
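As a rough sketch of the ordering (illustrative only, not the actual io_threads.c shutdown code):

```c
#include <pthread.h>

/* Illustrative shutdown ordering: the main thread must not hold the IO
 * thread's mutex when cancelling it. A thread blocked on
 * pthread_mutex_lock() is not at a cancellation point, so cancelling
 * first and unlocking later means pthread_join() never returns. */
static void shutdownIOThread(pthread_t tid, pthread_mutex_t *io_mutex) {
    pthread_mutex_unlock(io_mutex); /* release before cancelling */
    pthread_cancel(tid);
    pthread_join(tid, NULL);
}
```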
## Testing
Reproducer:
```bash
./src/valkey-server --io-threads 2 --enable-debug-command yes
./src/valkey-cli debug panic
```
Before: Server hangs indefinitely
After: Server terminates cleanly
Signed-off-by: Ouri Half <ourih@amazon.com>
Persist USE_FAST_FLOAT and PROG_SUFFIX so that a complete rebuild is not
triggered the next time someone types `make` or `make test` without
specifying those variables.
Fixes #2880
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The fedorarawhide CI job reports these warnings:
```
networking.c: In function 'afterErrorReply':
networking.c:821:30: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
821 | char *spaceloc = memchr(s, ' ', len < 32 ? len : 32);
```
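One straightforward way to silence this is to keep the const qualifier on the result pointer, along these lines (assuming `s` points into a const error string as in afterErrorReply()):

```c
/* 's' points into a const error string, so keep the qualifier on the
 * result of memchr() instead of discarding it. */
const char *spaceloc = memchr(s, ' ', len < 32 ? len : 32);
```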
Signed-off-by: Binbin <binloveplay1314@qq.com>
When we added the Hash Field Expiration feature in Valkey 9.0, some of
the new command docs described the complexity as O(1) even though these
commands accept multiple arguments.
(see discussion in
https://github.com/valkey-io/valkey/pull/2851#discussion_r2535684985)
This PR does:
1. align all the commands to the same description
2. fix the complexity description of some commands (e.g. HSETEX and
HGETEX)
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Only enable `HAVE_ARM_NEON` on AArch64, because only AArch64 supports
`vaddvq_*` and all the other needed compiler intrinsics.
Fixes the following error when building for machine `qemuarm` using the
Yocto Project and OpenEmbedded:
```
| bitops.c: In function 'popcountNEON':
| bitops.c:219:23: error: implicit declaration of function 'vaddvq_u16'; did you mean 'vaddq_u16'? [-Wimplicit-function-declaration]
| 219 | uint32_t t1 = vaddvq_u16(sc);
| | ^~~~~~~~~~
| | vaddq_u16
| bitops.c:225:14: error: implicit declaration of function 'vaddvq_u8'; did you mean 'vaddq_u8'? [-Wimplicit-function-declaration]
| 225 | t += vaddvq_u8(vcntq_u8(vld1q_u8(p)));
| | ^~~~~~~~~
| | vaddq_u8
```
More details are available in the following log:
https://errors.yoctoproject.org/Errors/Details/889836/
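A sketch of the kind of guard this implies (the exact location of the macro in the tree may differ):

```c
/* The horizontal-add intrinsics (vaddvq_u8/vaddvq_u16) used by the NEON
 * popcount path only exist on AArch64, so restrict HAVE_ARM_NEON to
 * 64-bit ARM; 32-bit ARM (e.g. qemuarm) falls back to the generic code. */
#if defined(__ARM_NEON) && defined(__aarch64__)
#define HAVE_ARM_NEON 1
#include <arm_neon.h>
#endif
```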
Signed-off-by: Leon Anavi <leon.anavi@konsulko.com>
Fix a small omission in the "Hash field TTL and active expiry propagates
correctly through chain replication" test in `hashexpire.tcl`.
The test did not wait for the initial sync of the chained replica, which made it flaky.
Signed-off-by: Arad Zilberstein <aradz@amazon.com>
This PR fixes the FreeBSD daily job, which has been failing consistently
for the last few days with the error "pkg: No packages available to
install matching 'lang/tclx' have been found in the repositories".
The package name is corrected from `lang/tclx` to `lang/tclX`. The
lowercase spelling worked previously but appears to have stopped working
after an update of FreeBSD's pkg tool to 2.4.x.
Example of failed job:
https://github.com/valkey-io/valkey/actions/runs/19282092345/job/55135193499
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Related test failures:
*** [err]: Replica importing key containment (slot 0 from node 0 to 2) - DBSIZE command excludes importing keys in tests/unit/cluster/cluster-migrateslots.tcl
Expected '1' to match '0' (context: type eval line 2 cmd {assert_match "0" [R $node_idx DBSIZE]} proc ::test)
The reason is that we don't wait for the primary-replica synchronization
to complete before starting the next testcase.
---------
Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
There’s an issue with the LTRIM command. When LTRIM does not actually
modify the key — for example, with `LTRIM key 0 -1` — the server.dirty
counter is not updated because both ltrim and rtrim values are 0. As a
result, the command is not propagated. However, `signalModifiedKey` is
still called regardless of whether server.dirty changes. This behavior
is unexpected and can cause a mismatch between the source and target
during propagation, since the LTRIM command is not sent.
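A sketch of how the tail of ltrimCommand() would guard these calls (illustrative, not the verbatim patch; the real function also handles deleting the key when the list becomes empty):

```c
/* Only report the key as modified, fire keyspace events and bump
 * server.dirty when LTRIM actually removed elements; LTRIM key 0 -1
 * trims nothing and should be a no-op for propagation and watchers. */
if (ltrim || rtrim) {
    notifyKeyspaceEvent(NOTIFY_LIST, "ltrim", c->argv[1], c->db->id);
    signalModifiedKey(c, c->db, c->argv[1]);
    server.dirty += ltrim + rtrim;
}
addReply(c, shared.ok);
```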
Signed-off-by: Harry Lin <harrylhl@amazon.com>
Co-authored-by: Harry Lin <harrylhl@amazon.com>
This prevents crashes on the older nodes in mixed clusters where some
nodes are running 8.0 or older. Mixed clusters often exist temporarily
during rolling upgrades.
Fixes: #2341
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
After a network failure, nodes that rejoin the cluster do not always
send and/or receive messages from the other nodes in their shard. This
fix avoids using lightweight messages on links whose bidirectional
handshake is not yet complete.
When a light message arrives before any normal message, the cluster link
is freed, because link->node is not assigned yet on the just-established
connection. It is assigned in getNodeFromLinkAndMsg right after
the condition if (is_light).
So on a cluster with heavy pubsub load, a long loop of disconnects is
possible:
1. Node A establishes a cluster link to node B.
2. Node A propagates PUBLISH to node B.
3. Node B frees the cluster link because link->node == NULL, since it has
not received any non-light messages yet.
4. Go to 1.
During this loop, subscribers on node B do not receive any messages
published to node A.
So here we want to make sure that a PING was sent (and link->node was
initialized) on the connection before using lightweight messages.
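A minimal, self-contained sketch of the check (field names are illustrative, not the real cluster_legacy.c structs):

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a cluster link. */
typedef struct clusterLinkView {
    bool connected; /* TCP connection to the peer is established */
    bool ping_sent; /* at least one regular PING went out on this link */
} clusterLinkView;

/* Only use the light-weight (header-only) message format once a regular
 * message has been sent on the link. Before that, the receiver has not
 * assigned link->node yet and would free the link on a light message,
 * restarting the disconnect loop described above. */
static bool canSendLightMsg(const clusterLinkView *link) {
    return link->connected && link->ping_sent;
}
```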
---------
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Since Valkey Sentinel 9.0, Sentinel tries to abort an ongoing failover
when changing the role of a monitored instance. Because the result of
the command is ignored, the "FAILOVER ABORT" command is sent irrespective
of the actual failover status.
However, when using the documented pre-9.0 ACLs for a dedicated sentinel
user, the FAILOVER command is not allowed and _all_ failover cases fail.
(Additionally, the necessary ACL adaptation was not communicated well.)
Address this by:
- Updating the documentation in "sentinel.conf" to reflect the need for
an adapted ACL.
- Only aborting a failover when Sentinel has detected an ongoing (probably
stuck) failover. This means that standard failover and manual failover
continue to work with unchanged pre-9.0 ACLs. Only the new "SENTINEL
FAILOVER COORDINATED" requires adapting the ACL on all Valkey nodes.
- Actually using a dedicated sentinel user and ACLs when testing standard
failover, manual failover, and manual coordinated failover.
Fixes #2779
Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
Fixes an assert crash in _writeToClient():
serverAssert(c->io_last_written.data_len == 0 ||
c->io_last_written.buf == c->buf);
The issue occurs when clientsCronResizeOutputBuffer() grows or
reallocates c->buf while io_last_written still points to the old buffer
and data_len is non-zero. On the next write, both conditions in the
assertion become false.
Reset io_last_written when resizing the output buffer to prevent stale
pointers and keep state consistent.
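A simplified, self-contained model of the fix (not the real struct client or clientsCronResizeOutputBuffer code):

```c
#include <stdlib.h>

/* Minimal model of the relevant client fields (illustrative, not the
 * real struct client from server.h). */
typedef struct lastWriteView { char *buf; size_t data_len; } lastWriteView;
typedef struct clientView {
    char *buf;
    size_t bufsize;
    lastWriteView io_last_written;
} clientView;

/* Whenever the static reply buffer is resized, clear the record of the
 * last async write so it can never point at freed or relocated memory.
 * This keeps the _writeToClient() assertion
 * (data_len == 0 || io_last_written.buf == c->buf) true on the next write. */
static void resizeOutputBuffer(clientView *c, size_t newsize) {
    char *newbuf = realloc(c->buf, newsize);
    if (newbuf == NULL) return; /* keep the old buffer on allocation failure */
    c->buf = newbuf;
    c->bufsize = newsize;
    c->io_last_written.buf = NULL;
    c->io_last_written.data_len = 0;
}
```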
fixes https://github.com/valkey-io/valkey/issues/2769
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Just setting the authenticated flag actually authenticates to the
default user in this case. The default user may be granted no permission
to use CLUSTER SYNCSLOTS.
Instead, we now authenticate to the NULL/internal user, which grants
access to all commands. This is the same as what we do for replication:
864de555ce/src/replication.c (L4717)
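For illustration, a simplified model of the difference (field names are approximate, not the exact client struct):

```c
#include <stddef.h>

/* Illustrative model of the relevant client fields. */
typedef struct clientView {
    void *user;        /* NULL means the internal user: ACL checks are skipped */
    int authenticated;
} clientView;

/* Authenticate the internal SYNCSLOTS link as the NULL user, exactly
 * like the replication link does, so the default user's ACL can never
 * block CLUSTER SYNCSLOTS. */
static void authenticateSyncSlotsLink(clientView *c) {
    c->user = NULL;
    c->authenticated = 1;
}
```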
Add a test for this case as well.
Closes #2783
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
This was introduced in #1826. It creates an `Uninitialised value was
created by a heap allocation` error in the CI.
Signed-off-by: Binbin <binloveplay1314@qq.com>
(cherry picked from commit 5d3cb3d04c)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
Safe iterators pause rehashing, but don't pause auto shrinking. This
allows stale bucket references, which then cause a use-after-free (in
this case, via compactBucketChain on a deleted bucket).
This problem is easily reproducible via atomic slot migration, where we
call delKeysInSlot, which relies on calling delete within a safe
iterator. After the fix, it no longer causes a crash.
Since all cases where rehashing is paused expect auto shrinking to also
be paused, I am making this happen automatically as part of pausing
rehashing.
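A self-contained sketch of the invariant, with hypothetical names rather than the real hashtable.c API:

```c
#include <assert.h>

/* Illustrative model of the pause bookkeeping. */
typedef struct hashtableView {
    int pause_rehash;      /* >0: incremental rehashing suspended */
    int pause_auto_shrink; /* >0: automatic shrinking suspended */
} hashtableView;

/* Every caller that pauses rehashing (e.g. a safe iterator) also expects
 * buckets to stay put, so pausing rehashing now pauses auto shrink too. */
static void pauseRehashingView(hashtableView *ht) {
    ht->pause_rehash++;
    ht->pause_auto_shrink++;
}

static void resumeRehashingView(hashtableView *ht) {
    assert(ht->pause_rehash > 0 && ht->pause_auto_shrink > 0);
    ht->pause_rehash--;
    ht->pause_auto_shrink--;
}
```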
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
(cherry picked from commit 1cf0df9fc3)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
When working on #2635 I erroneously duplicated the
setSlotImportingStateInAllDbs call for successful imports. This resulted
in the key count in the kvstore being doubled, which makes DBSIZE report
an incorrect sum and also causes bit corruption that can eventually
result in a crash.
The solution is:
1. Only call setSlotImportingStateInAllDbs once (in our
finishSlotMigrationJob function)
2. Make setSlotImportingStateInAllDbs idempotent by checking if the
delete from the kvstore importing hashtable is a no-op
This also fixes a bug where the number of importing keys is not lowered
after the migration, but this is less critical since it is only used
when resizing the dictionary on RDB load. However, it could result in
un-loadable RDBs if the importing key count gets large enough.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
(cherry picked from commit 2a914aa521)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
There are two issues, which can be demonstrated with this test:
```
test {FUNCTION - test function flush} {
for {set i 0} {$i < 10000} {incr i} {
r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
}
set before_flush_memory [s used_memory_vm_functions]
r function flush sync
set after_flush_memory [s used_memory_vm_functions]
puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"
for {set i 0} {$i < 10000} {incr i} {
r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
}
set before_flush_memory [s used_memory_vm_functions]
r function flush async
set after_flush_memory [s used_memory_vm_functions]
puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"
for {set i 0} {$i < 10000} {incr i} {
r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
}
puts "Test done"
}
```
The first one is visible in the test output: we can see that after
executing FUNCTION FLUSH, used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```
The second one is a crash when loading the functions during the async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
# valkey 255.255.255 crashed by signal: 11, si_code: 2
# Accessing address: 0xe0429b7100000a3c
# Crashed running the instruction at: 0x102e0b09c
------ STACK TRACE ------
EIP:
0 valkey-server 0x0000000102e0b09c luaH_getstr + 52
Backtrace:
0 libsystem_platform.dylib 0x000000018b066584 _sigtramp + 56
1 valkey-server 0x0000000102e01054 luaD_precall + 96
2 valkey-server 0x0000000102e01b10 luaD_call + 104
3 valkey-server 0x0000000102e00d1c luaD_rawrunprotected + 76
4 valkey-server 0x0000000102e01e3c luaD_pcall + 60
5 valkey-server 0x0000000102dfc630 lua_pcall + 300
6 valkey-server 0x0000000102f77770 luaEngineCompileCode + 708
7 valkey-server 0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8 valkey-server 0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9 valkey-server 0x0000000102f70898 functionLoadCommand + 312
10 valkey-server 0x0000000102e3978c call + 416
11 valkey-server 0x0000000102e3b5b8 processCommand + 3340
12 valkey-server 0x0000000102e563cc processInputBuffer + 520
13 valkey-server 0x0000000102e55808 readQueryFromClient + 92
14 valkey-server 0x0000000102f696e0 connSocketEventHandler + 180
15 valkey-server 0x0000000102e20480 aeProcessEvents + 372
16 valkey-server 0x0000000102e4aad0 main + 26412
17 dyld 0x000000018acab154 start + 2476
------ STACK TRACE DONE ------
```
The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH uses lua_unref to remove the scripts from the Lua VM.
lua_unref does not trigger the GC, so we are not able to effectively
reclaim memory after FUNCTION FLUSH.
The other issue is that, since we don't re-create the Lua VM in FUNCTION
FLUSH, loading functions during a FUNCTION FLUSH ASYNC results in a
crash, because the Lua engine state is not thread-safe.
The correct solution is to re-create a new Lua VM, just like SCRIPT
FLUSH does.
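A minimal sketch of the approach using the plain Lua C API (the real engine wiring is more involved, and for FLUSH ASYNC the old state would be closed from a background thread):

```c
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>

/* FUNCTION FLUSH: instead of lua_unref'ing each compiled function (which
 * never triggers a GC cycle, so the memory stays resident), drop the
 * whole interpreter state and start from a fresh one, the same way
 * SCRIPT FLUSH does. */
static lua_State *recreateFunctionsVM(lua_State *old) {
    lua_State *fresh = luaL_newstate();
    luaL_openlibs(fresh);
    lua_close(old); /* releases all memory owned by the old VM */
    return fresh;
}
```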
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Co-authored-by: Ricardo Dias <ricardo.dias@percona.com>
(cherry picked from commit b4c93cc9c2)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
#2329 introduced a bug that causes a blocked client in cluster mode to
receive two MOVED redirects instead of one. This was not seen in tests,
except by the reply schema validator.
The fix makes sure the client's pending command is cleared after sending
the MOVED redirect, to prevent it from being reprocessed.
Fixes #2676.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
(cherry picked from commit 54da8344c1)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
DEBUG LOADAOF sometimes works, but it results in -LOADING responses to
the primary, so there are lots of race conditions. It isn't something we
should be doing anyway. To test, I just disconnect the replica before
loading the AOF, then reconnect it.
Fixes #2712
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
(cherry picked from commit dbcf022480)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
This test was failing and causing the next test to throw an exception.
It was failing because we never waited for the slot migration to connect
before proceeding.
Fixes #2692
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
(cherry picked from commit 19474c867a)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
The race condition causes the client to be used and subsequently double
freed by the slot migration read pipe handler. The order of events is:
1. We kill the slot migration child process during CANCELSLOTMIGRATIONS
1. We then free the associated client to the target node
1. Although we kill the child process, it is not guaranteed that the
pipe will be empty from child to parent
1. If the pipe is not empty, we later will read that out in the
slotMigrationPipeReadHandler
1. In the pipe read handler, we attempt to write to the connection. If
writing to the connection fails, we will attempt to free the client
1. However, the client was already freed, so this is a double free
Notably, the slot migration being aborted doesn't need to be triggered
by `CANCELSLOTMIGRATIONS`, it can be any failure.
To solve this, we simply do the following (see the sketch after the list):
1. Set the slot migration pipe connection to NULL whenever it is
unlinked
2. Bail out early in slot migration pipe read handler if the connection
is NULL
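A simplified, self-contained sketch of those two steps (struct and function names are hypothetical):

```c
#include <stddef.h>

/* Hypothetical, simplified view of the job state. */
typedef struct slotMigrationJobView {
    void *target_conn; /* connection/client to the target node; NULL once unlinked */
    int pipe_read_fd;  /* read end of the child -> parent snapshot pipe */
} slotMigrationJobView;

/* (1) Clear the pointer when the target client is unlinked, and kill the
 * snapshot child from this single code path. */
static void onTargetClientUnlink(slotMigrationJobView *job) {
    job->target_conn = NULL;
    /* killSlotMigrationChild(job); */
}

/* (2) Bail out early: the child may still have data buffered in the pipe
 * after the job was aborted, and forwarding it to an already-freed
 * client is what caused the double free. */
static void slotMigrationPipeReadHandler(slotMigrationJobView *job) {
    if (job->target_conn == NULL) return;
    /* ... read from job->pipe_read_fd and forward to job->target_conn ... */
}
```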
I also consolidate the killSlotMigrationChild call to one code path,
which is executed on client unlink. Before, there were two code paths
that would do this twice (once on slot migration job finish, and once on
client unlink). Sending the signal twice is fine, but inefficient.
Also, add a test to cancel during the slot migration snapshot to make
sure this case is covered (we only caught it during the module test).
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
(cherry picked from commit 28e5dcce2c)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
When the hash object does not exist, FXX should simply fail the check
without creating the object, while FNX should trivially succeed.
Note: also fix a potential compilation warning on some compilers that
constant-fold a variable-length array when its size is a constant expression.
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
(cherry picked from commit 8182f4a0b9)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
* Add cross version compatibility test to run with Valkey 7.2 and 8.0
* Add mechanism in TCL test to skip tests dynamically - #2711
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
(cherry picked from commit 18214be490)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
With #1401, we introduced additional filters to the CLIENT LIST/KILL
subcommands. The intended behavior was to pick the last value given for a
filter. However, we introduced a memory leak for all the preceding
filter values.
Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
#0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
#1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
#2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
#3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
#4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
#5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
#6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
#7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
#8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
#9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
#10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
#11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
#12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
#13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
#14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
#15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
#16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```
Note: For the ID / NOT-ID filters we group all the options and filter on
all of them, whereas for the remaining filters we only pick the last
option.
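A sketch of the fix for the string-valued filters (the real parsing loop in parseClientFiltersOrReply handles more filter types):

```c
#include "sds.h" /* sdsnew()/sdsfree() from the tree */

/* When a string-valued filter (e.g. IP) is given more than once, only
 * the last value should win; free any previously parsed value before
 * overwriting the slot, otherwise the earlier sds string leaks. */
static void setClientFilterString(sds *slot, const char *newval) {
    if (*slot != NULL) sdsfree(*slot);
    *slot = sdsnew(newval);
}
```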
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
(cherry picked from commit 155b0bb821)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
We have relaxed `cluster-ping-interval` and `cluster-node-timeout`
so that the cluster has enough time to stabilize and propagate changes.
This fixes an occasional failure of this test when running with valgrind:
[err]: Node #10 should eventually replicate node #5 in
tests/unit/cluster/slave-selection.tcl
#10 didn't became slave of #5
Backported to the 9.0 branch in #2731.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
If we don't wait for the replica to resync, the migration may be
cancelled by the time the replica resyncs, resulting in a test failure
when we can't find the migration on the replica.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Seeing test failures due to this on the 9.0.0 branch:
```
[exception]: Executing test client: ERR Syntax error. Use: LOLWUT [columns rows] [real imaginary].
ERR Syntax error. Use: LOLWUT [columns rows] [real imaginary]
```
It turns out we are just providing the version as a plain argument,
instead of specifying which LOLWUT version to run.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
For now, introduce this and have it do nothing. Eventually, we can use
this to negotiate supported capabilities on either end. Right now, there
is nothing to send or support, so it just accepts it and doesn't reply.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
# Problem
In the current slot migration design, replicas are completely unaware of
the slot migration. Because of this, they do not know to hide importing
keys, which results in exposure of these keys to commands like KEYS,
SCAN, RANDOMKEY, and DBSIZE.
# Design
The main part of the design is that we will now listen for and process
the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS
ESTABLISH` command is received from the primary, we begin tracking a new
slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
Replicas use this state to track the import, and await a future
`SYNCSLOTS FINISH` message that tells them whether the import succeeded
or failed.
## Success Case
```
Source Target Target Replica
| | |
|------------ SYNCSLOTS ESTABLISH -------------->| |
| |----- SYNCSLOTS ESTABLISH ------>|
|<-------------------- +OK ----------------------| |
| | |
|~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>| |
| |~~~~~~ forward snapshot ~~~~~~~~>|
|----------- SYNCSLOTS SNAPSHOT-EOF ------------>| |
| | |
|<----------- SYNCSLOTS REQUEST-PAUSE -----------| |
| | |
|~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>| |
| |~~~~~~ forward changes ~~~~~~~~~>|
|--------------- SYNCSLOTS PAUSED -------------->| |
| | |
|<---------- SYNCSLOTS REQUEST-FAILOVER ---------| |
| | |
|---------- SYNCSLOTS FAILOVER-GRANTED --------->| |
| | |
| (performs takeover & |
| propagates topology) |
| | |
| |------- SYNCSLOTS FINISH ------->|
(finds out about topology | |
change & marks migration done) | |
| | |
```
## Failure Case
```
Source Target Target Replica
| | |
|------------ SYNCSLOTS ESTABLISH -------------->| |
| |----- SYNCSLOTS ESTABLISH ------>|
|<-------------------- +OK ----------------------| |
... ... ...
| | |
| <FAILURE> |
| | |
| (performs cleanup) |
| | ~~~~~~ UNLINK <key> ... ~~~~~~~>|
| | |
| | ------ SYNCSLOTS FINISH ------->|
| | |
```
## Full Sync, Partial Sync, and RDB
In order to ensure replicas that resync during the import are still
aware of the import, the slot import is serialized to a new
`cluster-slot-imports` aux field. The encoding includes the job name,
the source node name, and the slot ranges being imported. Upon loading
an RDB with the `cluster-slot-imports` aux field, replicas will add a
new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
It's important to note that a previously saved RDB file can be used as
the basis for partial sync with a primary. Because of this, whenever we
load an RDB file with the `cluster-slot-imports` aux field, even from
disk, we will still add a new migration to track the import. If after
loading the RDB, the Valkey node is a primary, it will cancel the slot
migration. Having this tracking state loaded on primaries will ensure
that replicas partial syncing to a restarted primary still get their
`SYNCSLOTS FINISH` message in the replication stream.
## AOF
Since AOF cannot be used as the basis for a partial sync, we don't
necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH`
commands to the AOF.
However, considering there is work to change this (#59, #1901), this
design doesn't make any assumptions about it.
We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and
ensure that they can be properly replayed on AOF load to get to the
right state. Similar to RDB, if there are any pending "ESTABLISH"
commands that don't have a "FINISH" afterwards upon becoming primary, we
will make sure to fail those in `verifyClusterConfigWithData`.
Additionally, there was a bug in the existing slot migration where slot
import clients were not having their commands persisted to AOF. This has
been fixed by ensuring we still propagate to AOF even for slot import
clients.
## Promotion & Demotion
Since the primary is solely responsible for cleaning up unowned slots,
primaries that are demoted will not clean up previously active slot
imports. The promoted replica will be responsible for both cleaning up
the slot (`verifyClusterConfigWithData`) and sending a `SYNCSLOTS
FINISH`.
# Other Options Considered
I also considered tracking "dirty" slots rather than using the slot
import state machine. In this setup, primaries and replicas would simply
mark each slot's hashtable in the kvstore as dirty when something is
written to it and we do not currently own that slot.
This approach is simpler, but has a problem in that modules loaded on
the replica would still not get slot migration start/end notifications.
If the modules on the replica do not get such notifications, they will
not be able to properly contain these dirty keys during slot migration
events.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
The CVE fixes had a formatting and external test issue that wasn't
caught because private branches don't run those CI steps.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
New client flags reported by CLIENT INFO and CLIENT LIST:
* `i` for atomic slot migration importing client
* `E` for atomic slot migration exporting client
New flags in the return value of `ValkeyModule_GetContextFlags`:
* `VALKEYMODULE_CTX_FLAGS_SLOT_IMPORT_CLIENT`: Indicates that the client
attached to this context is the slot import client.
* `VALKEYMODULE_CTX_FLAGS_SLOT_EXPORT_CLIENT`: Indicates that the client
attached to this context is the slot export client.
Users could use this to monitor the underlying client info of the slot
migration, and more clearly understand why they see extra clients during
the migration.
Modules can use these to detect keyspace notifications on import
clients. I am also adding export flags for symmetry, although there
should not be keyspace notifications. But they would potentially be
visible in command filters or in server events triggered by that client.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
**Issue History:**
1. The flaky test issue "defrag didn't stop" was originally detected in
February 2025: https://github.com/valkey-io/valkey/issues/1746
Solution for 1746: https://github.com/valkey-io/valkey/pull/1762
2. Similar issue occurred recently:
https://github.com/valkey-io/valkey/actions/runs/16585350083/job/46909359496#step:5:7640
**Investigation:**
1. The issue occurs specifically in the Active Defrag stream test in
cluster mode.
2. After investigating `test_stream` in `memefficiency.tcl`, I found that
the root cause is in the defrag logic rather than in the test itself -
tests still failed with the same error even when I tried different
parameters for the test.
3. I then looked at malloc-stats and identified potential defrag issues,
particularly in the 80B bin, where utilization only reaches ~75% after
defrag instead of the expected near 100%, while other bins show proper
defrag behavior - 80B is actually the size of a new stream (confirmed in
`t_stream.c`) that we add during the test.
4. For 80B, after adding 200000 streams and fragmenting, `curregs` =
100030; after a lot of defrag cycles, there are still 122 non-full slabs
out of 511 slabs, with the remaining 446 items not defragged
(average 4 per non-full slab).
**Detailed malloc-stats:**
- Total slabs: 511
- Non-full slabs: 122
- Full slabs: 511-122=389
- Theoretical maximum per slab: 256 items
- Allocated items in non-full slabs: 100030-389*256=446
- Average items per non-full slab: 446/122=3.66
**Root Cause:**
**There are some immovable items which prevent complete defrag**
**Problems in old defrag logic:**
1. With the previous condition (we don't defrag if slab utilization > 'avg
utilization' * 1.125), the 12.5% threshold doesn't work well at low
utilizations.
- Imagine we have 446 items in 122 non-full slabs (avg 3.66 items per
non-full slab); say 81 slabs with 5 items each + 41 slabs with 1 item
each.
- 12.5% threshold: 3.66*1.125=4.11
- If those 41 single items are immovable, they actually lower the
average, so the remaining 81 slabs are above the threshold (5>4.11) and
will not be defragged - hence "defrag didn't stop".
2. The distribution of immovable items across slabs was causing
inconsistent defragmentation and flaky test outcomes.
- If those 41 single items are movable, they will be moved away and the
avg becomes 5, so the 12.5% threshold is 5*1.125=5.625; the remaining 81
slabs fall below the threshold (5<5.625) and will be defragged - defrag
succeeds.
- This can explain why we got flaky defrag tests.
- This can explain why we got flaky defrag tests.
**Final solution:**
1. Add one more condition before the old logic in `makeDefragDecision`
to trigger defragmentation when a slab is less than 1/8 full (the 1/8
threshold (12.5%) was chosen to align with the existing utilization
threshold factor). This ensures no low-utilization slabs are left
un-defragged and stabilizes the defrag behavior; see the sketch below.
2. Why we have immovable items, and how to handle them, is going to be
investigated later.
3. Be sure to rebuild Valkey before testing it.
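A simplified sketch of the new decision, assuming per-slab and average utilization are available as fractions (the real `makeDefragDecision` works on jemalloc bin stats):

```c
#include <stdbool.h>

/* Always defragment slabs that are less than 1/8 (12.5%) full,
 * regardless of the average, so a few immovable low-occupancy slabs can
 * no longer drag the average down and shield everything else; otherwise
 * keep the previous rule of skipping slabs already above
 * avg utilization * 1.125. */
static bool shouldDefragSlab(double slab_utilization, double avg_utilization) {
    if (slab_utilization < 1.0 / 8.0) return true;      /* new condition */
    return slab_utilization <= avg_utilization * 1.125; /* old condition */
}
```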
**Local test result:**
- Before fix: pass rate 80.8% (63/78)
- After fix:
  - stream test only: pass rate 100% (200/200)
  - whole memefficiency.tcl: pass rate 100% (100/100)
Resolves #2398, the "defrag didn't stop" issue, with help from @JimB123
and @madolson.
---------
Signed-off-by: Alina Liu <liusalisa6363@gmail.com>
Signed-off-by: asagegeLiu <liusalisa6363@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
As requested, here is a version of lolwut for 9 that visualizes a Julia
set with ASCII art.
Example:
```
127.0.0.1:6379> lolwut version 9
.............
......................
............................
......:::--:::::::::::::::.......
.....:::=+*@@@=--------=+===--::....
....:::-+@@@@&*+=====+%@@@@@@@@=-::....
.....:::-=+@@@@@%%*++*@@@@@@@@@&*=--::....
.....::--=++#@@@@@@@@##@@@@@@@@@@@@@@=::....
......:-=@&#&@@@@@@@@@@@@@@@@@@@@@@@@@%-::...
......::-+@@@@@@@@@@@@@@@@@@&&@@@#%#&@@@-::....
.......::-=+%@@@@@@@@@@@@@@@@#%%*+++++%@+-:.....
.......::-=@&@@@@@@@@@@@@@@@@&*++=====---::.....
.......:::--*@@@@@@@@@@@@@@@@@%++===----::::.....
........::::-=+*%&@@@@@@@@@&&&%*+==----:::::......
........::::--=+@@@@@@@@@@&##%*++==---:::::.......
.......:::::---=+#@@@@@@@@&&&#%*+==---:::::.......
........:::::---=++*%%#&&@@@@@@@@@+=---::::........
.......:::::----=++*%##&@@@@@@@@@@%+=--::::.......
......::::-----==++#@@@@@@@@@@@@@&%*+=-:::........
......:::---====++*@@@@@@@@@@@@@@@@@@+-:::.......
.....::-=++==+++**%@@@@@@@@@@@@@@@@#*=--::.......
....:-%@@%****%###&@@@@@@@@@@@@@@@@&+--:.......
....:-=@@@@@&@@@@@@@@@@@@@@@@@@@@@@@@=::......
...::+@@@@@@@@@@@@@@@&&@@@@@@@@%**@+-::.....
....::-=+%#@@@@@@@@@&%%%&@@@@@@*==-:::.....
....::--+%@@@@@@@%++==++*#@@@@&=-:::....
....:::-*@**@@+==----==*%@@@@+-:::....
.....:::---::::::::--=+@=--::.....
.........::::::::::::::.......
.........................
..................
...
Ascii representation of Julia set with constant 0.41 + 0.29i
Don't forget to have fun! Valkey ver. 255.255.255
```
You can pass in arbitrary rows and columns (it looks best when one
dimension is roughly twice the other) and an arbitrary Julia constant,
so the output is repeatable. Worst case it takes about ~100us on my M2
MacBook, which should be fine to make sure it's not taking too many
system resources.
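For reference, the escape-time iteration behind such a picture is tiny; this is a self-contained illustration, not the actual lolwut version 9 implementation:

```c
#include <complex.h>
#include <stdio.h>

/* Iterate z = z*z + c for a fixed Julia constant c and pick a character
 * based on how quickly |z| escapes past 2. */
int main(void) {
    const double complex c = 0.41 + 0.29 * I; /* constant from the example above */
    const int rows = 30, cols = 60;
    const char *palette = " .:-=+*#%@"; /* fast escape (background) ... bounded (inside) */
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < cols; x++) {
            double complex z = (x - cols / 2.0) / (cols / 4.0) +
                               ((y - rows / 2.0) / (rows / 4.0)) * I;
            int i = 0;
            while (cabs(z) <= 2.0 && i < 9) {
                z = z * z + c;
                i++;
            }
            putchar(palette[i]);
        }
        putchar('\n');
    }
    return 0;
}
```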
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Make objectComputeSize account for the key size as well when the key is
a module datatype
fixes: https://github.com/valkey-io/valkey/issues/2660
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
The problem is that ACKs run on a set loop (once every second) and this
will happen every loop with hz 1.
Instead, we can do the ACK after running the main logic. We can also do
an additional optimization where we don't send an ACK from source to
target if we already sent some data this cron loop, since the target
will reset the ack timer on any data over the connection.
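A sketch of the resulting cron ordering (names and struct are hypothetical, not the actual slot migration code):

```c
#include <stdbool.h>

/* Hypothetical per-job state. */
typedef struct migrationJobView {
    bool data_sent_this_cycle; /* set whenever payload was streamed this cron run */
} migrationJobView;

/* Run the main migration work first, then decide about the keep-alive
 * ACK: if any data already went out during this cron cycle, the target
 * resets its ack timer on that traffic, so an explicit ACK is redundant. */
static void slotMigrationCronView(migrationJobView *job) {
    job->data_sent_this_cycle = false;
    /* ... main migration logic; may stream data and set the flag ... */
    if (!job->data_sent_this_cycle) {
        /* sendAckToTarget(job); -- hypothetical helper */
    }
}
```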
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
In standalone mode, after a switch over, the command that was originally
blocking on primary returns -REDIRECT instead of -UNBLOCKED when the
client has the redirect capability.
Similarly, in cluster mode, after a switch over, the blocked commands
receive a -MOVED redirect instead of -UNBLOCKED.
After this fix, the handling of blocked connections during a switch over
between standalone and cluster is nearly identical. This can be
summarized as follows:
Standalone:
1. Client that has the redirect capability blocked on the key on the
primary node will receive a -REDIRECT after the switch over completes
instead of -UNBLOCKED.
2. Readonly clients blocked on the primary or replica node will remain
blocked throughout the switch over.
Cluster:
1. Client blocked on the key served by the primary node will receive a
-MOVED instead of a probabilistic -UNBLOCKED error.
2. Readonly clients blocked on the key served by primary or replica node
will remain blocked throughout the switch over.
---------
Signed-off-by: cjx-zar <56825069+cjx-zar@users.noreply.github.com>
Co-authored-by: Simon Baatz <gmbnomis@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
This is already handled by `clusterRedirectBlockedClientIfNeeded`. With
the work we are doing in #2329, it makes sense to have an explicit test here
to prevent regression.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
There is a daily test failure in valgrind, which looks like an issue related to
slowness in valgrind mode.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>