The CLUSTER SLOTS reply depends on whether the client is connected over
IPv6, but a fake client has no connection. When this command is called
from a module timer callback, or in any other scenario where no real
client is involved, there is no connection on which to check IPv6
support.
This fix handles the missing case by returning the reply used for
IPv4-connected clients.
Fixes #2912.
---------
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Su Ko <rhtn1128@gmail.com>
Co-authored-by: KarthikSubbarao <karthikrs2021@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In #2078, we did not report large replies when copy avoidance is
allowed. As a result, replies larger than 16384 bytes are not recorded
in the commandlog large-reply. The 16384 threshold is controlled by the
hidden config min-string-size-avoid-copy-reply.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## Problem
IO thread shutdown can deadlock during server panic when the main thread
calls `pthread_cancel()` while the IO thread holds its mutex, preventing
the thread from observing the cancellation.
## Solution
Release the IO thread mutex before cancelling to ensure clean thread
termination.
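As a rough sketch of the ordering (illustrative only, not the actual io_threads.c shutdown code):

```c
#include <pthread.h>

/* Illustrative shutdown ordering: the main thread must not hold the IO
 * thread's mutex when cancelling it. A thread blocked on
 * pthread_mutex_lock() is not at a cancellation point, so cancelling
 * first and unlocking later means pthread_join() never returns. */
static void shutdownIOThread(pthread_t tid, pthread_mutex_t *io_mutex) {
    pthread_mutex_unlock(io_mutex); /* release before cancelling */
    pthread_cancel(tid);
    pthread_join(tid, NULL);
}
```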
## Testing
Reproducer:
```bash
./src/valkey-server --io-threads 2 --enable-debug-command yes
./src/valkey-cli debug panic
```
Before: Server hangs indefinitely
After: Server terminates cleanly
Signed-off-by: Ouri Half <ourih@amazon.com>
Persist USE_FAST_FLOAT and PROG_SUFFIX so that a complete rebuild is not
triggered the next time someone types `make` or `make test` without
specifying those variables.
Fixes #2880
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The fedorarawhide CI job reports these warnings:
```
networking.c: In function 'afterErrorReply':
networking.c:821:30: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
821 | char *spaceloc = memchr(s, ' ', len < 32 ? len : 32);
```
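One straightforward way to silence this is to keep the const qualifier on the result pointer, along these lines (assuming `s` points into a const error string as in afterErrorReply()):

```c
/* 's' points into a const error string, so keep the qualifier on the
 * result of memchr() instead of discarding it. */
const char *spaceloc = memchr(s, ' ', len < 32 ? len : 32);
```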
Signed-off-by: Binbin <binloveplay1314@qq.com>
When we added the Hash Field Expiration feature in Valkey 9.0, some of
the new command docs described the complexity as O(1) even though these
commands accept multiple arguments.
(see discussion in
https://github.com/valkey-io/valkey/pull/2851#discussion_r2535684985)
This PR does:
1. align all the commands to the same description
2. fix the complexity description of some commands (e.g. HSETEX and
HGETEX)
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Only enable `HAVE_ARM_NEON` on AArch64, because only AArch64 supports
`vaddvq_*` and all the other needed compiler intrinsics.
Fixes the following error when building for machine `qemuarm` using the
Yocto Project and OpenEmbedded:
```
| bitops.c: In function 'popcountNEON':
| bitops.c:219:23: error: implicit declaration of function 'vaddvq_u16'; did you mean 'vaddq_u16'? [-Wimplicit-function-declaration]
| 219 | uint32_t t1 = vaddvq_u16(sc);
| | ^~~~~~~~~~
| | vaddq_u16
| bitops.c:225:14: error: implicit declaration of function 'vaddvq_u8'; did you mean 'vaddq_u8'? [-Wimplicit-function-declaration]
| 225 | t += vaddvq_u8(vcntq_u8(vld1q_u8(p)));
| | ^~~~~~~~~
| | vaddq_u8
```
More details are available in the following log:
https://errors.yoctoproject.org/Errors/Details/889836/
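A sketch of the kind of guard this implies (the exact location of the macro in the tree may differ):

```c
/* The horizontal-add intrinsics (vaddvq_u8/vaddvq_u16) used by the NEON
 * popcount path only exist on AArch64, so restrict HAVE_ARM_NEON to
 * 64-bit ARM; 32-bit ARM (e.g. qemuarm) falls back to the generic code. */
#if defined(__ARM_NEON) && defined(__aarch64__)
#define HAVE_ARM_NEON 1
#include <arm_neon.h>
#endif
```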
Signed-off-by: Leon Anavi <leon.anavi@konsulko.com>
Fix a small omission in the "Hash field TTL and active expiry propagates
correctly through chain replication" test in `hashexpire.tcl`.
The test did not wait for the initial sync of the chained replica, which made it flaky.
Signed-off-by: Arad Zilberstein <aradz@amazon.com>
This PR fixes the FreeBSD daily job, which has been failing consistently
for the last few days with the error "pkg: No packages available to
install matching 'lang/tclx' have been found in the repositories".
The package name is corrected from `lang/tclx` to `lang/tclX`. The
lowercase spelling worked previously but appears to have stopped working
after an update of FreeBSD's pkg tool to 2.4.x.
Example of failed job:
https://github.com/valkey-io/valkey/actions/runs/19282092345/job/55135193499
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Related test failures:
*** [err]: Replica importing key containment (slot 0 from node 0 to 2) - DBSIZE command excludes importing keys in tests/unit/cluster/cluster-migrateslots.tcl
Expected '1' to match '0' (context: type eval line 2 cmd {assert_match "0" [R $node_idx DBSIZE]} proc ::test)
The reason is that we don't wait for the primary-replica synchronization
to complete before starting the next testcase.
---------
Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
There’s an issue with the LTRIM command. When LTRIM does not actually
modify the key — for example, with `LTRIM key 0 -1` — the server.dirty
counter is not updated because both ltrim and rtrim values are 0. As a
result, the command is not propagated. However, `signalModifiedKey` is
still called regardless of whether server.dirty changes. This behavior
is unexpected and can cause a mismatch between the source and target
during propagation, since the LTRIM command is not sent.
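A sketch of how the tail of ltrimCommand() would guard these calls (illustrative, not the verbatim patch; the real function also handles deleting the key when the list becomes empty):

```c
/* Only report the key as modified, fire keyspace events and bump
 * server.dirty when LTRIM actually removed elements; LTRIM key 0 -1
 * trims nothing and should be a no-op for propagation and watchers. */
if (ltrim || rtrim) {
    notifyKeyspaceEvent(NOTIFY_LIST, "ltrim", c->argv[1], c->db->id);
    signalModifiedKey(c, c->db, c->argv[1]);
    server.dirty += ltrim + rtrim;
}
addReply(c, shared.ok);
```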
Signed-off-by: Harry Lin <harrylhl@amazon.com>
Co-authored-by: Harry Lin <harrylhl@amazon.com>
This prevents crashes on the older nodes in mixed clusters where some
nodes are running 8.0 or older. Mixed clusters often exist temporarily
during rolling upgrades.
Fixes: #2341
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
After a network failure, nodes that rejoin the cluster do not always
send and/or receive messages from the other nodes in their shard. This
fix avoids using lightweight messages on links whose bidirectional
handshake is not yet complete.
When a light message arrives before any normal message, the cluster link
is freed, because link->node is not assigned yet on the just-established
connection. It is assigned in getNodeFromLinkAndMsg right after
the condition if (is_light).
So on a cluster with heavy pubsub load, a long loop of disconnects is
possible:
1. Node A establishes a cluster link to node B.
2. Node A propagates PUBLISH to node B.
3. Node B frees the cluster link because link->node == NULL, since it has
not received any non-light messages yet.
4. Go to 1.
During this loop, subscribers on node B do not receive any messages
published to node A.
So here we want to make sure that a PING was sent (and link->node was
initialized) on the connection before using lightweight messages.
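A minimal, self-contained sketch of the check (field names are illustrative, not the real cluster_legacy.c structs):

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a cluster link. */
typedef struct clusterLinkView {
    bool connected; /* TCP connection to the peer is established */
    bool ping_sent; /* at least one regular PING went out on this link */
} clusterLinkView;

/* Only use the light-weight (header-only) message format once a regular
 * message has been sent on the link. Before that, the receiver has not
 * assigned link->node yet and would free the link on a light message,
 * restarting the disconnect loop described above. */
static bool canSendLightMsg(const clusterLinkView *link) {
    return link->connected && link->ping_sent;
}
```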
---------
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Since Valkey Sentinel 9.0, Sentinel tries to abort an ongoing failover
when changing the role of a monitored instance. Because the result of
the command is ignored, the "FAILOVER ABORT" command is sent irrespective
of the actual failover status.
However, when using the documented pre-9.0 ACLs for a dedicated sentinel
user, the FAILOVER command is not allowed and _all_ failover cases fail.
(Additionally, the necessary ACL adaptation was not communicated well.)
Address this by:
- Updating the documentation in "sentinel.conf" to reflect the need for
an adapted ACL.
- Only aborting a failover when Sentinel has detected an ongoing (probably
stuck) failover. This means that standard failover and manual failover
continue to work with unchanged pre-9.0 ACLs. Only the new "SENTINEL
FAILOVER COORDINATED" requires adapting the ACL on all Valkey nodes.
- Actually using a dedicated sentinel user and ACLs when testing standard
failover, manual failover, and manual coordinated failover.
Fixes #2779
Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
Fixes an assert crash in _writeToClient():
serverAssert(c->io_last_written.data_len == 0 ||
c->io_last_written.buf == c->buf);
The issue occurs when clientsCronResizeOutputBuffer() grows or
reallocates c->buf while io_last_written still points to the old buffer
and data_len is non-zero. On the next write, both conditions in the
assertion become false.
Reset io_last_written when resizing the output buffer to prevent stale
pointers and keep state consistent.
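A simplified, self-contained model of the fix (not the real struct client or clientsCronResizeOutputBuffer code):

```c
#include <stdlib.h>

/* Minimal model of the relevant client fields (illustrative, not the
 * real struct client from server.h). */
typedef struct lastWriteView { char *buf; size_t data_len; } lastWriteView;
typedef struct clientView {
    char *buf;
    size_t bufsize;
    lastWriteView io_last_written;
} clientView;

/* Whenever the static reply buffer is resized, clear the record of the
 * last async write so it can never point at freed or relocated memory.
 * This keeps the _writeToClient() assertion
 * (data_len == 0 || io_last_written.buf == c->buf) true on the next write. */
static void resizeOutputBuffer(clientView *c, size_t newsize) {
    char *newbuf = realloc(c->buf, newsize);
    if (newbuf == NULL) return; /* keep the old buffer on allocation failure */
    c->buf = newbuf;
    c->bufsize = newsize;
    c->io_last_written.buf = NULL;
    c->io_last_written.data_len = 0;
}
```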
fixes https://github.com/valkey-io/valkey/issues/2769
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Just setting the authenticated flag actually authenticates to the
default user in this case. The default user may be granted no permission
to use CLUSTER SYNCSLOTS.
Instead, we now authenticate to the NULL/internal user, which grants
access to all commands. This is the same as what we do for replication:
864de555ce/src/replication.c (L4717)
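For illustration, a simplified model of the difference (field names are approximate, not the exact client struct):

```c
#include <stddef.h>

/* Illustrative model of the relevant client fields. */
typedef struct clientView {
    void *user;        /* NULL means the internal user: ACL checks are skipped */
    int authenticated;
} clientView;

/* Authenticate the internal SYNCSLOTS link as the NULL user, exactly
 * like the replication link does, so the default user's ACL can never
 * block CLUSTER SYNCSLOTS. */
static void authenticateSyncSlotsLink(clientView *c) {
    c->user = NULL;
    c->authenticated = 1;
}
```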
Add a test for this case as well.
Closes #2783
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
This was introduced in #1826. It creates an `Uninitialised value was
created by a heap allocation` error in the CI.
Signed-off-by: Binbin <binloveplay1314@qq.com>
(cherry picked from commit 5d3cb3d04c)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
Safe iterators pause rehashing, but don't pause auto shrinking. This
allows stale bucket references, which then cause a use-after-free (in
this case, via compactBucketChain on a deleted bucket).
This problem is easily reproducible via atomic slot migration, where we
call delKeysInSlot, which relies on calling delete within a safe
iterator. After the fix, it no longer causes a crash.
Since all cases where rehashing is paused expect auto shrinking to also
be paused, I am making this happen automatically as part of pausing
rehashing.
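A self-contained sketch of the invariant, with hypothetical names rather than the real hashtable.c API:

```c
#include <assert.h>

/* Illustrative model of the pause bookkeeping. */
typedef struct hashtableView {
    int pause_rehash;      /* >0: incremental rehashing suspended */
    int pause_auto_shrink; /* >0: automatic shrinking suspended */
} hashtableView;

/* Every caller that pauses rehashing (e.g. a safe iterator) also expects
 * buckets to stay put, so pausing rehashing now pauses auto shrink too. */
static void pauseRehashingView(hashtableView *ht) {
    ht->pause_rehash++;
    ht->pause_auto_shrink++;
}

static void resumeRehashingView(hashtableView *ht) {
    assert(ht->pause_rehash > 0 && ht->pause_auto_shrink > 0);
    ht->pause_rehash--;
    ht->pause_auto_shrink--;
}
```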
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
(cherry picked from commit 1cf0df9fc3)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
When working on #2635 I erroneously duplicated the
setSlotImportingStateInAllDbs call for successful imports. This resulted
in the key count in the kvstore being doubled, which makes DBSIZE report
an incorrect sum and also causes bit corruption that can eventually
result in a crash.
The solution is:
1. Only call setSlotImportingStateInAllDbs once (in our
finishSlotMigrationJob function)
2. Make setSlotImportingStateInAllDbs idempotent by checking if the
delete from the kvstore importing hashtable is a no-op
This also fixes a bug where the number of importing keys is not lowered
after the migration, but this is less critical since it is only used
when resizing the dictionary on RDB load. However, it could result in
un-loadable RDBs if the importing key count gets large enough.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
(cherry picked from commit 2a914aa521)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
There are two issues, which can be demonstrated with this test:
```
test {FUNCTION - test function flush} {
for {set i 0} {$i < 10000} {incr i} {
r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
}
set before_flush_memory [s used_memory_vm_functions]
r function flush sync
set after_flush_memory [s used_memory_vm_functions]
puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"
for {set i 0} {$i < 10000} {incr i} {
r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
}
set before_flush_memory [s used_memory_vm_functions]
r function flush async
set after_flush_memory [s used_memory_vm_functions]
puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"
for {set i 0} {$i < 10000} {incr i} {
r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
}
puts "Test done"
}
```
The first one is visible in the test output: we can see that after
executing FUNCTION FLUSH, used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```
The second one is a crash when loading the functions during the async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
# valkey 255.255.255 crashed by signal: 11, si_code: 2
# Accessing address: 0xe0429b7100000a3c
# Crashed running the instruction at: 0x102e0b09c
------ STACK TRACE ------
EIP:
0 valkey-server 0x0000000102e0b09c luaH_getstr + 52
Backtrace:
0 libsystem_platform.dylib 0x000000018b066584 _sigtramp + 56
1 valkey-server 0x0000000102e01054 luaD_precall + 96
2 valkey-server 0x0000000102e01b10 luaD_call + 104
3 valkey-server 0x0000000102e00d1c luaD_rawrunprotected + 76
4 valkey-server 0x0000000102e01e3c luaD_pcall + 60
5 valkey-server 0x0000000102dfc630 lua_pcall + 300
6 valkey-server 0x0000000102f77770 luaEngineCompileCode + 708
7 valkey-server 0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8 valkey-server 0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9 valkey-server 0x0000000102f70898 functionLoadCommand + 312
10 valkey-server 0x0000000102e3978c call + 416
11 valkey-server 0x0000000102e3b5b8 processCommand + 3340
12 valkey-server 0x0000000102e563cc processInputBuffer + 520
13 valkey-server 0x0000000102e55808 readQueryFromClient + 92
14 valkey-server 0x0000000102f696e0 connSocketEventHandler + 180
15 valkey-server 0x0000000102e20480 aeProcessEvents + 372
16 valkey-server 0x0000000102e4aad0 main + 26412
17 dyld 0x000000018acab154 start + 2476
------ STACK TRACE DONE ------
```
The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH uses lua_unref to remove the scripts from the Lua VM.
lua_unref does not trigger the GC, so we are not able to effectively
reclaim memory after FUNCTION FLUSH.
The other issue is that, since we don't re-create the Lua VM in FUNCTION
FLUSH, loading functions during a FUNCTION FLUSH ASYNC results in a
crash, because the Lua engine state is not thread-safe.
The correct solution is to re-create a new Lua VM, just like SCRIPT
FLUSH does.
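A minimal sketch of the approach using the plain Lua C API (the real engine wiring is more involved, and for FLUSH ASYNC the old state would be closed from a background thread):

```c
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>

/* FUNCTION FLUSH: instead of lua_unref'ing each compiled function (which
 * never triggers a GC cycle, so the memory stays resident), drop the
 * whole interpreter state and start from a fresh one, the same way
 * SCRIPT FLUSH does. */
static lua_State *recreateFunctionsVM(lua_State *old) {
    lua_State *fresh = luaL_newstate();
    luaL_openlibs(fresh);
    lua_close(old); /* releases all memory owned by the old VM */
    return fresh;
}
```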
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Co-authored-by: Ricardo Dias <ricardo.dias@percona.com>
(cherry picked from commit b4c93cc9c2)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
#2329 introduced a bug that causes a blocked client in cluster mode to
receive two MOVED redirects instead of one. This was not seen in tests,
except by the reply schema validator.
The fix makes sure the client's pending command is cleared after sending
the MOVED redirect, to prevent it from being reprocessed.
Fixes #2676.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
(cherry picked from commit 54da8344c1)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
DEBUG LOADAOF sometimes works, but it results in -LOADING responses to
the primary, so there are lots of race conditions. It isn't something we
should be doing anyway. To test, I just disconnect the replica before
loading the AOF, then reconnect it.
Fixes #2712
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
(cherry picked from commit dbcf022480)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
This test was failing and causing the next test to throw an exception.
It was failing because we never waited for the slot migration to connect
before proceeding.
Fixes #2692
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
(cherry picked from commit 19474c867a)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
The race condition causes the client to be used and subsequently double
freed by the slot migration read pipe handler. The order of events is:
1. We kill the slot migration child process during CANCELSLOTMIGRATIONS
1. We then free the associated client to the target node
1. Although we kill the child process, it is not guaranteed that the
pipe will be empty from child to parent
1. If the pipe is not empty, we later will read that out in the
slotMigrationPipeReadHandler
1. In the pipe read handler, we attempt to write to the connection. If
writing to the connection fails, we will attempt to free the client
1. However, the client was already freed, so this is a double free
Notably, the slot migration being aborted doesn't need to be triggered
by `CANCELSLOTMIGRATIONS`, it can be any failure.
To solve this, we simply do the following (see the sketch after the list):
1. Set the slot migration pipe connection to NULL whenever it is
unlinked
2. Bail out early in slot migration pipe read handler if the connection
is NULL
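A simplified, self-contained sketch of those two steps (struct and function names are hypothetical):

```c
#include <stddef.h>

/* Hypothetical, simplified view of the job state. */
typedef struct slotMigrationJobView {
    void *target_conn; /* connection/client to the target node; NULL once unlinked */
    int pipe_read_fd;  /* read end of the child -> parent snapshot pipe */
} slotMigrationJobView;

/* (1) Clear the pointer when the target client is unlinked, and kill the
 * snapshot child from this single code path. */
static void onTargetClientUnlink(slotMigrationJobView *job) {
    job->target_conn = NULL;
    /* killSlotMigrationChild(job); */
}

/* (2) Bail out early: the child may still have data buffered in the pipe
 * after the job was aborted, and forwarding it to an already-freed
 * client is what caused the double free. */
static void slotMigrationPipeReadHandler(slotMigrationJobView *job) {
    if (job->target_conn == NULL) return;
    /* ... read from job->pipe_read_fd and forward to job->target_conn ... */
}
```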
I also consolidate the killSlotMigrationChild call to one code path,
which is executed on client unlink. Before, there were two code paths
that would do this twice (once on slot migration job finish, and once on
client unlink). Sending the signal twice is fine, but inefficient.
Also, add a test to cancel during the slot migration snapshot to make
sure this case is covered (we only caught it during the module test).
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
(cherry picked from commit 28e5dcce2c)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
When the hash object does not exist, FXX should simply fail the check
without creating the object, while FNX should trivially succeed.
Note: also fix a potential compilation warning on some compilers that
constant-fold a variable-length array when its size is a constant expression.
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
(cherry picked from commit 8182f4a0b9)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
* Add cross version compatibility test to run with Valkey 7.2 and 8.0
* Add mechanism in TCL test to skip tests dynamically - #2711
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
(cherry picked from commit 18214be490)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
With #1401, we introduced additional filters to the CLIENT LIST/KILL
subcommands. The intended behavior was to pick the last value given for a
filter. However, we introduced a memory leak for all the preceding
filter values.
Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
#0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
#1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
#2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
#3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
#4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
#5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
#6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
#7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
#8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
#9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
#10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
#11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
#12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
#13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
#14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
#15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
#16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```
Note: For the ID / NOT-ID filters we group all the options and filter on
all of them, whereas for the remaining filters we only pick the last
option.
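A sketch of the fix for the string-valued filters (the real parsing loop in parseClientFiltersOrReply handles more filter types):

```c
#include "sds.h" /* sdsnew()/sdsfree() from the tree */

/* When a string-valued filter (e.g. IP) is given more than once, only
 * the last value should win; free any previously parsed value before
 * overwriting the slot, otherwise the earlier sds string leaks. */
static void setClientFilterString(sds *slot, const char *newval) {
    if (*slot != NULL) sdsfree(*slot);
    *slot = sdsnew(newval);
}
```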
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
(cherry picked from commit 155b0bb821)
Signed-off-by: cherukum-amazon <cherukum@amazon.com>
We have relaxed `cluster-ping-interval` and `cluster-node-timeout`
so that the cluster has enough time to stabilize and propagate changes.
This fixes an occasional failure of this test when running with valgrind:
[err]: Node #10 should eventually replicate node #5 in
tests/unit/cluster/slave-selection.tcl
#10 didn't became slave of #5
Backported to the 9.0 branch in #2731.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
If we don't wait for the replica to resync, the migration may be
cancelled by the time the replica resyncs, resulting in a test failure
when we can't find the migration on the replica.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Seeing test failures due to this on the 9.0.0 branch:
```
[exception]: Executing test client: ERR Syntax error. Use: LOLWUT [columns rows] [real imaginary].
ERR Syntax error. Use: LOLWUT [columns rows] [real imaginary]
```
It turns out we are just providing the version as a plain argument,
instead of specifying which LOLWUT version to run.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
For now, introduce this and have it do nothing. Eventually, we can use
this to negotiate supported capabilities on either end. Right now, there
is nothing to send or support, so it just accepts it and doesn't reply.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
# Problem
In the current slot migration design, replicas are completely unaware of
the slot migration. Because of this, they do not know to hide importing
keys, which results in exposure of these keys to commands like KEYS,
SCAN, RANDOMKEY, and DBSIZE.
# Design
The main part of the design is that we will now listen for and process
the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS
ESTABLISH` command is received from the primary, we begin tracking a new
slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
Replicas use this state to track the import, and await a future
`SYNCSLOTS FINISH` message that tells them whether the import succeeded
or failed.
## Success Case
```
Source Target Target Replica
| | |
|------------ SYNCSLOTS ESTABLISH -------------->| |
| |----- SYNCSLOTS ESTABLISH ------>|
|<-------------------- +OK ----------------------| |
| | |
|~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>| |
| |~~~~~~ forward snapshot ~~~~~~~~>|
|----------- SYNCSLOTS SNAPSHOT-EOF ------------>| |
| | |
|<----------- SYNCSLOTS REQUEST-PAUSE -----------| |
| | |
|~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>| |
| |~~~~~~ forward changes ~~~~~~~~~>|
|--------------- SYNCSLOTS PAUSED -------------->| |
| | |
|<---------- SYNCSLOTS REQUEST-FAILOVER ---------| |
| | |
|---------- SYNCSLOTS FAILOVER-GRANTED --------->| |
| | |
| (performs takeover & |
| propagates topology) |
| | |
| |------- SYNCSLOTS FINISH ------->|
(finds out about topology | |
change & marks migration done) | |
| | |
```
## Failure Case
```
Source Target Target Replica
| | |
|------------ SYNCSLOTS ESTABLISH -------------->| |
| |----- SYNCSLOTS ESTABLISH ------>|
|<-------------------- +OK ----------------------| |
... ... ...
| | |
| <FAILURE> |
| | |
| (performs cleanup) |
| | ~~~~~~ UNLINK <key> ... ~~~~~~~>|
| | |
| | ------ SYNCSLOTS FINISH ------->|
| | |
```
## Full Sync, Partial Sync, and RDB
In order to ensure replicas that resync during the import are still
aware of the import, the slot import is serialized to a new
`cluster-slot-imports` aux field. The encoding includes the job name,
the source node name, and the slot ranges being imported. Upon loading
an RDB with the `cluster-slot-imports` aux field, replicas will add a
new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
It's important to note that a previously saved RDB file can be used as
the basis for partial sync with a primary. Because of this, whenever we
load an RDB file with the `cluster-slot-imports` aux field, even from
disk, we will still add a new migration to track the import. If after
loading the RDB, the Valkey node is a primary, it will cancel the slot
migration. Having this tracking state loaded on primaries will ensure
that replicas partial syncing to a restarted primary still get their
`SYNCSLOTS FINISH` message in the replication stream.
## AOF
Since AOF cannot be used as the basis for a partial sync, we don't
necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH`
commands to the AOF.
However, considering there is work to change this (#59, #1901), this
design doesn't make any assumptions about it.
We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and
ensure that they can be properly replayed on AOF load to get to the
right state. Similar to RDB, if there are any pending "ESTABLISH"
commands that don't have a "FINISH" afterwards upon becoming primary, we
will make sure to fail those in `verifyClusterConfigWithData`.
Additionally, there was a bug in the existing slot migration where slot
import clients were not having their commands persisted to AOF. This has
been fixed by ensuring we still propagate to AOF even for slot import
clients.
## Promotion & Demotion
Since the primary is solely responsible for cleaning up unowned slots,
primaries that are demoted will not clean up previously active slot
imports. The promoted replica will be responsible for both cleaning up
the slot (`verifyClusterConfigWithData`) and sending a `SYNCSLOTS
FINISH`.
# Other Options Considered
I also considered tracking "dirty" slots rather than using the slot
import state machine. In this setup, primaries and replicas would simply
mark each slot's hashtable in the kvstore as dirty when something is
written to it and we do not currently own that slot.
This approach is simpler, but has a problem in that modules loaded on
the replica would still not get slot migration start/end notifications.
If the modules on the replica do not get such notifications, they will
not be able to properly contain these dirty keys during slot migration
events.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
The CVE fixes had a formatting and external test issue that wasn't
caught because private branches don't run those CI steps.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
New client flags reported by CLIENT INFO and CLIENT LIST:
* `i` for atomic slot migration importing client
* `E` for atomic slot migration exporting client
New flags in the return value of `ValkeyModule_GetContextFlags`:
* `VALKEYMODULE_CTX_FLAGS_SLOT_IMPORT_CLIENT`: Indicates that the client
attached to this context is the slot import client.
* `VALKEYMODULE_CTX_FLAGS_SLOT_EXPORT_CLIENT`: Indicates that the client
attached to this context is the slot export client.
Users could use this to monitor the underlying client info of the slot
migration, and more clearly understand why they see extra clients during
the migration.
Modules can use these to detect keyspace notifications on import
clients. I am also adding export flags for symmetry, although there
should not be keyspace notifications. But they would potentially be
visible in command filters or in server events triggered by that client.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
**Issue History:**
1. The flaky test issue "defrag didn't stop" was originally detected in
February 2025: https://github.com/valkey-io/valkey/issues/1746
Solution for 1746: https://github.com/valkey-io/valkey/pull/1762
2. Similar issue occurred recently:
https://github.com/valkey-io/valkey/actions/runs/16585350083/job/46909359496#step:5:7640
**Investigation:**
1. The issue occurs specifically in the Active Defrag stream test in
cluster mode.
2. After investigating `test_stream` in `memefficiency.tcl`, I found that
the root cause is in the defrag logic rather than in the test itself -
tests still failed with the same error even when I tried different
parameters for the test.
3. I then looked at malloc-stats and identified potential defrag issues,
particularly in the 80B bin, where utilization only reaches ~75% after
defrag instead of the expected near 100%, while other bins show proper
defrag behavior - 80B is actually the size of a new stream (confirmed in
`t_stream.c`) that we add during the test.
4. For 80B, after adding 200000 streams and fragmenting, `curregs` =
100030; after a lot of defrag cycles, there are still 122 non-full slabs
out of 511 slabs, with the remaining 446 items not defragged
(average 4 per non-full slab).
**Detailed malloc-stats:**
- Total slabs: 511
- Non-full slabs: 122
- Full slabs: 511-122=389
- Theoretical maximum per slab: 256 items
- Allocated items in non-full slabs: 100030-389*256=446
- Average items per non-full slab: 446/122=3.66
**Root Cause:**
**There are some immovable items which prevent complete defrag**
**Problems in old defrag logic:**
1. With the previous condition (we don't defrag if slab utilization > 'avg
utilization' * 1.125), the 12.5% threshold doesn't work well at low
utilizations.
- Imagine we have 446 items in 122 non-full slabs (avg 3.66 items per
non-full slab); say 81 slabs with 5 items each + 41 slabs with 1 item
each.
- 12.5% threshold: 3.66*1.125=4.11
- If those 41 single items are immovable, they actually lower the
average, so the remaining 81 slabs are above the threshold (5>4.11) and
will not be defragged - hence "defrag didn't stop".
2. The distribution of immovable items across slabs was causing
inconsistent defragmentation and flaky test outcomes.
- If those 41 single items are movable, they will be moved away and the
avg becomes 5, so the 12.5% threshold is 5*1.125=5.625; the remaining 81
slabs fall below the threshold (5<5.625) and will be defragged - defrag
succeeds.
- This can explain why we got flaky defrag tests.
- This can explain why we got flaky defrag tests.
**Final solution:**
1. Add one more condition before the old logic in `makeDefragDecision`
to trigger defragmentation when a slab is less than 1/8 full (the 1/8
threshold (12.5%) was chosen to align with the existing utilization
threshold factor). This ensures no low-utilization slabs are left
un-defragged and stabilizes the defrag behavior; see the sketch below.
2. Why we have immovable items, and how to handle them, is going to be
investigated later.
3. Be sure to rebuild Valkey before testing it.
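A simplified sketch of the new decision, assuming per-slab and average utilization are available as fractions (the real `makeDefragDecision` works on jemalloc bin stats):

```c
#include <stdbool.h>

/* Always defragment slabs that are less than 1/8 (12.5%) full,
 * regardless of the average, so a few immovable low-occupancy slabs can
 * no longer drag the average down and shield everything else; otherwise
 * keep the previous rule of skipping slabs already above
 * avg utilization * 1.125. */
static bool shouldDefragSlab(double slab_utilization, double avg_utilization) {
    if (slab_utilization < 1.0 / 8.0) return true;      /* new condition */
    return slab_utilization <= avg_utilization * 1.125; /* old condition */
}
```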
**Local test result:**
- Before fix: pass rate 80.8% (63/78)
- After fix:
  - stream test only: pass rate 100% (200/200)
  - whole memefficiency.tcl: pass rate 100% (100/100)
Resolves #2398, the "defrag didn't stop" issue, with help from @JimB123
and @madolson.
---------
Signed-off-by: Alina Liu <liusalisa6363@gmail.com>
Signed-off-by: asagegeLiu <liusalisa6363@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
As requested, here is a version of lolwut for 9 that visualizes a Julia
set with ASCII art.
Example:
```
127.0.0.1:6379> lolwut version 9
.............
......................
............................
......:::--:::::::::::::::.......
.....:::=+*@@@=--------=+===--::....
....:::-+@@@@&*+=====+%@@@@@@@@=-::....
.....:::-=+@@@@@%%*++*@@@@@@@@@&*=--::....
.....::--=++#@@@@@@@@##@@@@@@@@@@@@@@=::....
......:-=@&#&@@@@@@@@@@@@@@@@@@@@@@@@@%-::...
......::-+@@@@@@@@@@@@@@@@@@&&@@@#%#&@@@-::....
.......::-=+%@@@@@@@@@@@@@@@@#%%*+++++%@+-:.....
.......::-=@&@@@@@@@@@@@@@@@@&*++=====---::.....
.......:::--*@@@@@@@@@@@@@@@@@%++===----::::.....
........::::-=+*%&@@@@@@@@@&&&%*+==----:::::......
........::::--=+@@@@@@@@@@&##%*++==---:::::.......
.......:::::---=+#@@@@@@@@&&&#%*+==---:::::.......
........:::::---=++*%%#&&@@@@@@@@@+=---::::........
.......:::::----=++*%##&@@@@@@@@@@%+=--::::.......
......::::-----==++#@@@@@@@@@@@@@&%*+=-:::........
......:::---====++*@@@@@@@@@@@@@@@@@@+-:::.......
.....::-=++==+++**%@@@@@@@@@@@@@@@@#*=--::.......
....:-%@@%****%###&@@@@@@@@@@@@@@@@&+--:.......
....:-=@@@@@&@@@@@@@@@@@@@@@@@@@@@@@@=::......
...::+@@@@@@@@@@@@@@@&&@@@@@@@@%**@+-::.....
....::-=+%#@@@@@@@@@&%%%&@@@@@@*==-:::.....
....::--+%@@@@@@@%++==++*#@@@@&=-:::....
....:::-*@**@@+==----==*%@@@@+-:::....
.....:::---::::::::--=+@=--::.....
.........::::::::::::::.......
.........................
..................
...
Ascii representation of Julia set with constant 0.41 + 0.29i
Don't forget to have fun! Valkey ver. 255.255.255
```
You can pass in arbitrary rows and columns (it looks best when one
dimension is roughly twice the other) and an arbitrary Julia constant,
so the output is repeatable. Worst case it takes about ~100us on my M2
MacBook, which should be fine to make sure it's not taking too many
system resources.
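For reference, the escape-time iteration behind such a picture is tiny; this is a self-contained illustration, not the actual lolwut version 9 implementation:

```c
#include <complex.h>
#include <stdio.h>

/* Iterate z = z*z + c for a fixed Julia constant c and pick a character
 * based on how quickly |z| escapes past 2. */
int main(void) {
    const double complex c = 0.41 + 0.29 * I; /* constant from the example above */
    const int rows = 30, cols = 60;
    const char *palette = " .:-=+*#%@"; /* fast escape (background) ... bounded (inside) */
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < cols; x++) {
            double complex z = (x - cols / 2.0) / (cols / 4.0) +
                               ((y - rows / 2.0) / (rows / 4.0)) * I;
            int i = 0;
            while (cabs(z) <= 2.0 && i < 9) {
                z = z * z + c;
                i++;
            }
            putchar(palette[i]);
        }
        putchar('\n');
    }
    return 0;
}
```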
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Make objectComputeSize account for the key size as well when the key is
a module datatype
fixes: https://github.com/valkey-io/valkey/issues/2660
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
The problem is that ACKs run on a set loop (once every second) and this
will happen every loop with hz 1.
Instead, we can do the ACK after running the main logic. We can also do
an additional optimization where we don't send an ACK from source to
target if we already sent some data this cron loop, since the target
will reset the ack timer on any data over the connection.
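A sketch of the resulting cron ordering (names and struct are hypothetical, not the actual slot migration code):

```c
#include <stdbool.h>

/* Hypothetical per-job state. */
typedef struct migrationJobView {
    bool data_sent_this_cycle; /* set whenever payload was streamed this cron run */
} migrationJobView;

/* Run the main migration work first, then decide about the keep-alive
 * ACK: if any data already went out during this cron cycle, the target
 * resets its ack timer on that traffic, so an explicit ACK is redundant. */
static void slotMigrationCronView(migrationJobView *job) {
    job->data_sent_this_cycle = false;
    /* ... main migration logic; may stream data and set the flag ... */
    if (!job->data_sent_this_cycle) {
        /* sendAckToTarget(job); -- hypothetical helper */
    }
}
```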
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
In standalone mode, after a switch over, the command that was originally
blocking on primary returns -REDIRECT instead of -UNBLOCKED when the
client has the redirect capability.
Similarly, in cluster mode, after a switch over, the blocked commands
receive a -MOVED redirect instead of -UNBLOCKED.
After this fix, the handling of blocked connections during a switch over
between standalone and cluster is nearly identical. This can be
summarized as follows:
Standalone:
1. Client that has the redirect capability blocked on the key on the
primary node will receive a -REDIRECT after the switch over completes
instead of -UNBLOCKED.
2. Readonly clients blocked on the primary or replica node will remain
blocked throughout the switch over.
Cluster:
1. Client blocked on the key served by the primary node will receive a
-MOVED instead of a probabilistic -UNBLOCKED error.
2. Readonly clients blocked on the key served by primary or replica node
will remain blocked throughout the switch over.
---------
Signed-off-by: cjx-zar <56825069+cjx-zar@users.noreply.github.com>
Co-authored-by: Simon Baatz <gmbnomis@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
This is already handled by `clusterRedirectBlockedClientIfNeeded`. With
the work we are doing in #2329, it makes sense to have an explicit test here
to prevent regression.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
There is a daily test failure in valgrind, which looks like an issue related to
slowness in valgrind mode.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>