Commit Graph

13473 Commits

Author SHA1 Message Date
Binbin 8ab0152bef
Check for duplicate nodeids when loading nodes.conf (#2852)
For corrupted nodes.conf files (hand-edited or recovered after a program
error), check for duplicate nodeids when loading nodes.conf. If a duplicate
is found, a panic is triggered to prevent nodes from starting up
unexpectedly.

The node ID identifies every node across the whole cluster,
so we do not expect to find duplicate nodeids in nodes.conf.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-12-17 11:23:38 +08:00
Ping Xie 6b60e6bfc7
Update MAINTAINERS list and add committee chair section (#2939) 2025-12-16 14:06:26 -08:00
Ping Xie 50ff460007
Make 1/3 TSC organization limit explicitly inclusive (#2930) 2025-12-16 07:15:38 -08:00
Binbin 51f871ae52
Strictly check CRLF when parsing querybuf (#2872)
Currently, when parsing querybuf, we are not checking for CRLF;
instead we assume the last two characters are CRLF by default,
as shown in the following example:
```
telnet 127.0.0.1 6379
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
*3
$3
set
$3
key
$5
value12
+OK
get key
$5
value

*3
$3
set
$3
key
$5
value12345
+OK
-ERR unknown command '345', with args beginning with:
```

This should actually be considered a protocol error. When a bug
occurs in the client-side implementation, we may execute incorrect
requests (writing incorrect data is the most serious of these).

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-12-14 09:10:57 +02:00
Ping Xie cd6faaa726
Refine major decision process and update TSC composition rules (#2927)
- Require a 2/3 supermajority vote for all Governance Major Decisions.
- Update Technical Major Decision voting to prioritize simple majority, limiting the use of "+2" approval.
- Define remediation steps for when the 1/3 organization limit is exceeded.

---------

Signed-off-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-12-12 08:53:50 -08:00
Vitah Lin 79fff5dbac
Upgrade macos version in actions (#2920)
GitHub has deprecated older macOS runners, and macos-13 is no longer supported.

1. The latest version of cross-platform-actions/action allows
running on ubuntu-latest (a Linux runner) and does not strictly require macOS.
2. Previously, cross-platform-actions/action@v0.22.0 used runs-on:
macos-13. I checked the latest version of cross-platform-actions, and
the official examples now use runs-on: ubuntu. I think we can switch from macOS to Ubuntu.

---------

Signed-off-by: Vitah Lin <vitahlin@gmail.com>
2025-12-11 23:50:26 +01:00
Viktor Söderqvist 5940dbfb0b
Revert "Allow partial sync after loading AOF with preamble (#2366)" (#2925)
This reverts commit 2da21d9def.

The implementation in #2366 made it possible to perform a partial
resynchronization after loading an AOF file, by storing the replication
offset in the AOF preamble and counting the bytes in the AOF command
stream in the same way as we count byte offset in a replication stream.

However, this approach isn't safe because some commands are replicated
but not stored in the AOF file. This includes the commands REPLCONF,
PING, PUBLISH and module-implemented commands where a module can control
to propagate a command to the replication stream only, to AOF only or to
both. This oversight led to data inconsistency, where the wrong
replication offset is used for partial resynchronization as explained in
issue #2904.

The revert caused small merge conflicts with 3c3a1966ec which are
solved.

Fixes #2904.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-12-11 08:39:53 +02:00
Vadym Khoptynets 44dc58181d
Fix a typo in aof-multi-part.tcl (#2922)
A small PR to fix a small typo: `util` --> `until`

Signed-off-by: Vadym Khoptynets <1099644+poiuj@users.noreply.github.com>
2025-12-09 21:30:28 +02:00
bandalgomsu 2a9a1731ee
Fix CLUSTER SLOTS crash when called from module timer callback (#2915)
The CLUSTER SLOTS reply depends on whether the client is connected over
IPv6, but a fake client has no connection. When this command is called
from a module timer callback or another scenario where no real client is
involved, there is no connection to check IPv6 support on. This fix
handles the missing case by returning the reply for IPv4-connected
clients.

Fixes #2912.

---------

Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Su Ko <rhtn1128@gmail.com>
Co-authored-by: KarthikSubbarao <karthikrs2021@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-12-09 14:02:35 +01:00
Binbin 48e0cbbb41
Fix commandlog large-reply when using reply copy avoidance (#2652)
In #2078, we did not report large replies when copy avoidance is allowed.
This results in replies larger than 16384 bytes not being recorded in the
commandlog large-reply. This 16384 threshold is controlled by the hidden
config min-string-size-avoid-copy-reply.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-12-09 10:20:45 +08:00
Binbin 29d3244937
Make the COB soft limit also use repl-backlog-size when its value is smaller (#2866)
We have the same settings for the hard limit, and we should apply them to the soft
limit as well. When the `repl-backlog-size` value is larger, all replication buffers
can be handled by the replication backlog, so there's no need to worry about the
client output buffer soft limit here. Furthermore, when `soft_seconds` is 0, in
some ways, the soft limit behaves the same (mostly) as the hard limit.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-12-08 12:21:34 +08:00
zhaozhao.zz 04d0bba398
support whole cluster info for INFO command in cluster section (#2876)
Allow users to more easily get cluster information.

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2025-12-04 09:34:52 -08:00
Ouri Half 3d65a4aecd
Fix deadlock in IO thread shutdown during panic (#2898)
## Problem
IO thread shutdown can deadlock during server panic when the main thread
calls `pthread_cancel()` while the IO thread holds its mutex, preventing
the thread from observing the cancellation.

## Solution  
Release the IO thread mutex before cancelling to ensure clean thread
termination.

## Testing
Reproducer:
```bash
./src/valkey-server --io-threads 2 --enable-debug-command yes
./src/valkey-cli debug panic
```

Before: Server hangs indefinitely
After: Server terminates cleanly

Signed-off-by: Ouri Half <ourih@amazon.com>
2025-12-04 09:10:09 -08:00
Roshan Khatri c90e634f11
Add PR and Release benchmark with new changes in framework (#2871)
This adds the workflow improvements for PR and Release benchmark where
it runs on `c8g.metal-48xl` for `ARM64` and `c7i.metal-48xl` for `X86`.

```
Cluster mode: disabled
TLS: disabled
io-threads: 1, 9
Pipelining: 1, 10
Clients: 1600
Benchmark Threads: 90
Data size: 16, 96
Commands: SET, GET
```

c8g.metal-48xl Spec: https://aws.amazon.com/ec2/instance-types/c8g/
c7i.metal-48xl Spec: https://aws.amazon.com/ec2/instance-types/c7i/

```
vCPU: 192
NUMA nodes: 2
Memory (GiB): 384
Network Bandwidth (Gbps): 50
```

PR benchmarking will be executed on the **ARM64** machine as it has been
seen to be more consistent.
Additionally, it runs 5 iterations for each test and posts the average
and other statistical metrics like
- CI99%: 99% Confidence Interval - range where the true population mean
is likely to fall
- PI99%: 99% Prediction Interval - range where a single future
observation is likely to fall
- CV: Coefficient of Variation - relative variability (σ/μ × 100%)

_Note: Values with (n=X, σ=Y, CV=Z%, CI99%=±W%, PI99%=±V%) indicate
averages from X runs with standard deviation Y, coefficient of variation
Z%, 99% confidence interval margin of error ±W% of the mean, and 99%
prediction interval margin of error ±V% of the mean. CI bounds [A, B]
and PI bounds [C, D] show the actual interval ranges._

For comparing between versions, it adds a workflow which runs on both
**ARM64** and **X86** machine. It will also post the comparison between
the versions like this:
https://github.com/valkey-io/valkey/issues/2580#issuecomment-3399539615

---------

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <117414976+roshkhatri@users.noreply.github.com>
2025-12-04 17:33:21 +01:00
Rain Valentine 70196ee20f
Add safe iterator tracking in hashtable to prevent invalid access on hashtable delete (#2807)
This makes it safe to delete hashtable while a safe iterator is
iterating it. This currently isn't possible, but this improvement is
required for fork-less replication #1754 which is being actively
worked on.

We discussed these issues in #2611 which guards against a different but
related issue: calling hashtableNext again after it has already returned
false.

I implemented a singly linked list that hashtable uses to track its
current safe iterators. It is used to invalidate all associated safe
iterators when the hashtable is released. A singly linked list is
acceptable because the list length is always very small - typically zero
and no more than a handful.

Also, renames the internal functions:

    hashtableReinitIterator -> hashtableRetargetIterator
    hashtableResetIterator -> hashtableCleanupIterator

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-12-03 18:15:45 +01:00
Harkrishn Patro 8ec9381974
Send duplicate multi meet packet only for node which supports it (#2840)
This prevents crashes on the older nodes in mixed clusters where some
nodes are running 8.0 or older. Mixed clusters often exist temporarily
during rolling upgrades.

Fixes: #2341 

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
2025-12-03 17:17:26 +01:00
Murad Shahmammadli c8341d4165
Replace strcmp with byte-by-byte comparison in valkey-check-aof (#2809)
Fixes #2792

Replace strcmp with byte-by-byte comparison to avoid accidental
heap-buffer-overflow errors.

Signed-off-by: murad shahmammadli <shmurad@amazon.com>
Co-authored-by: murad shahmammadli <shmurad@amazon.com>
2025-12-03 17:07:04 +01:00
Roshan Khatri 2e8fba49a2
[Test Fix] flaky benchmark test for warmup (#2890)
Fixes: https://github.com/valkey-io/valkey/issues/2859
Increased the warmup to 2 sec so we can verify that it runs more
commands than the actual benchmark.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
2025-12-02 21:40:20 +01:00
Jim Brunner 1b5f245eae
Refactor of LFU/LRU code for modularity (#2857)
General cleanup on LRU/LFU code. Improve modularity and maintainability.

Specifically:
* Consolidates the mathematical logic for LRU/LFU into `lrulfu.c`, with
an API in `lrulfu.h`. Knowledge of the LRU/LFU implementation was
previously spread out across `db.c`, `evict.c`, `object.c`, `server.c`,
and `server.h`.
* Separates knowledge of the LRU from knowledge of the object containing
the LRU value. `lrulfu.c` knows about the LRU/LFU algorithms, without
knowing about the `robj`. `object.c` knows about the `robj` without
knowing about the details of the LRU/LFU algorithms.
* Eliminates `server.lruclock`, instead using `server.unixtime`. This
also eliminates the periodic need to call `mstime()` to maintain the lru
clock.
* Fixes a minor computation bug in the old `LFUTimeElapsed` function
(off by 1 after rollover).
* Eliminates specific IF checks for rollover, using the defined behavior
of unsigned rollover instead.
* Fixes a bug in `debug.c` which would perform LFU modification on an
LRU value.

---------

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
2025-12-02 10:14:33 -08:00
Zhijun Liao 853d111f47
Build: Support `make test` when PROG_SUFFIX is used (#2885)
Closes #2883

Support a new environment variable `VALKEY_PROG_SUFFIX` in the test
framework, which can be used for running tests if the binaries are
compiled with a program suffix. For example, if the binaries are
compiled using `make PROG_SUFFIX=-alt` to produce binaries named
valkey-server-alt, valkey-cli-alt, etc., run the tests against these
binaries using `VALKEY_PROG_SUFFIX=-alt ./runtest` or simply using `make
test`.

Now `make test` also works when the binaries are built with the make
variable `PROG_SUFFIX`:

```
% make PROG_SUFFIX="-alt"
		...
		...
		CC trace/trace_aof.o
    LINK valkey-server-alt
    INSTALL valkey-sentinel-alt
    CC valkey-cli.o
    CC serverassert.o
    CC cli_common.o
    CC cli_commands.o
    LINK valkey-cli-alt
    CC valkey-benchmark.o
    LINK valkey-benchmark-alt
    INSTALL valkey-check-rdb-alt
    INSTALL valkey-check-aof-alt

Hint: It's a good idea to run 'make test' ;)
% 
% make test                                                                
cd src && /Library/Developer/CommandLineTools/usr/bin/make test
    CC Makefile.dep
    CC release.o
    LINK valkey-server-alt
    INSTALL valkey-check-aof-alt
    INSTALL valkey-check-rdb-alt
    LINK valkey-cli-alt
    LINK valkey-benchmark-alt
Cleanup: may take some time... OK
Starting test server at port 21079
[ready]: 39435
Testing unit/pubsub
```

Signed-off-by: Zhijun <dszhijun@gmail.com>
2025-12-02 15:03:47 +01:00
Zhijun Liao 3fd0942279
Refactor TCL reference to support running tests with CMake (#2816)
Historically, Valkey’s TCL test suite expected all binaries
(src/valkey-server, src/valkey-cli, src/valkey-benchmark, etc.) to exist
under the src/ directory. This PR enables Valkey TCL tests to run
seamlessly after a CMake build — no manual symlinks or make build
required.

The test framework accepts a new environment variable `VALKEY_BIN_DIR`
to look for the binaries.

CMake will copy all TCL test entrypoints (runtest, runtest-cluster,
etc.) into the CMake build dir (e.g. `cmake-build-debug`) and insert
`VALKEY_BIN_DIR` into these. Now we can either run
./cmake-build-debug/runtest from the project root or ./runtest from the
CMake dir to run all tests.

A new CMake post-build target prints a friendly reminder after
successful builds, guiding developers on how to run tests with their
CMake binaries:

```
Hint: It is a good idea to run tests with your CMake-built binaries ;)
      ./cmake-build-debug/runtest

Build finished
```

A helper TCL script `tests/support/set_executable_path.tcl` is added to
support this change, which gets called by all test entrypoints:
`runtest`, `runtest-cluster`, `runtest-sentinel`.

---------

Signed-off-by: Zhijun <dszhijun@gmail.com>
2025-12-02 14:14:28 +01:00
Daniil Kashapov 825d19fb09
Make all ACL categories explicit in JSON files (#2887)
Resolves #417

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
2025-12-02 13:33:20 +01:00
Binbin 4a0e20bbc9
Handle failed psync log when there is no replication backlog (#2886)
This crash was introduced in #2877, we will crash when there is no
replication backlog.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-12-01 10:20:32 +08:00
Binbin a087cc1132
Add psync offset range information to the failed psync log (#2877)
Although we can infer this information from the replica logs,
I think it would still be useful to see this information directly
in the primary logs.

So now we can see the psync offset range when psync fails and then
we can analyze and adjust the value of repl-backlog-size.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-11-29 12:20:34 +08:00
Viktor Söderqvist 761fba2e9d
Fix persisting missing make variables (#2881)
Persist USE_FAST_FLOAT and PROG_SUFFIX to prevent a complete rebuild
next time someone types make or make test without specifying variables.

Fixes #2880

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-11-28 10:35:53 +01:00
Binbin d16788e52d
Fix discarded-qualifiers warnings reported by fedorarawhide (#2874)
fedorarawhide CI reports these warnings:
```
networking.c: In function 'afterErrorReply':
networking.c:821:30: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
  821 |             char *spaceloc = memchr(s, ' ', len < 32 ? len : 32);
```

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-11-27 10:39:42 +08:00
Zhijun Liao faac14ab9c
Cluster: Optimize slot bitmap iteration from per-bit to 64-bit word scan (#2781)
In functions `clusterSendFailoverAuthIfNeeded` and
`clusterProcessPacket`, we iterate through **every slot bit**
sequentially in the form of `for (int j = 0; j < CLUSTER_SLOTS; j++)`,
performing 16384 checks even when only a few bits were set, and thus
causing unnecessary loop overhead.

This is particularly wasteful in function
`clusterSendFailoverAuthIfNeeded` where we need to ensure the sender's
claimed slots all have up-to-date config epoch. Usually healthy senders
would meet such condition, and thus we normally need to exhaust the for
loop of 16384 checks.

The proposed new implementation loads 64 bits (an 8-byte word) at a time
and skips empty words completely, therefore performing only 256 word checks.

---------

Signed-off-by: Zhijun <dszhijun@gmail.com>
2025-11-26 15:52:17 +02:00
Roshan Khatri 56ab3c4a81
Adds HGETDEL Support to Valkey (#2851)
Fixes this: https://github.com/valkey-io/valkey/issues/2850
Adds support for HGETDEL to Valkey, aligning with the Redis 8.0 feature
and maintaining syntax compatibility.
Retrieves the values of the given fields (nil for fields that don't
exist) and deletes the fields once retrieved.
```
127.0.0.1:6379> HGETDEL key FIELDS numfields field [ field ... ]
```
```
127.0.0.1:6379> HSET foo field1 bar1 field2 bar2 field3 bar3
(integer) 3
127.0.0.1:6379> HGETDEL foo FIELDS 1 field2
1) "bar2"
127.0.0.1:6379> HGETDEL foo FIELDS 1 field2
1) (nil)
127.0.0.1:6379> HGETALL foo
1) "field1"
2) "bar1"
3) "field3"
4) "bar3"
127.0.0.1:6379>  HGETDEL foo FIELDS 2 field2 field3
1) (nil)
2) "bar3"
127.0.0.1:6379> HGETALL foo
1) "field1"
2) "bar1"
127.0.0.1:6379> HGETDEL foo FIELDS 3 field1 non-exist-field
(error) ERR syntax error
127.0.0.1:6379> HGETDEL foo FIELDS 2 field1 non-exist-field
1) "bar1"
2) (nil)
127.0.0.1:6379> HGETALL foo
(empty array)
127.0.0.1:6379> 
```

---------

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
2025-11-26 15:19:53 +02:00
Ran Shidlansik 9562bdc0ab
Align the complexity description for all multi field HFE commands docs (#2875)
When we added the Hash Field Expiration feature in Valkey 9.0, some of
the new command docs included a complexity description of O(1) even though
they accept multiple arguments.
(see discussion in
https://github.com/valkey-io/valkey/pull/2851#discussion_r2535684985)
This PR does:
1. align all the commands to the same description
2. fix the complexity description of some commands (eg HSETEX and
HGETEX)

---------

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2025-11-26 10:31:28 +02:00
Zhijun Liao da3c43d76b
Additional log information for cluster accept handler and message processing (#2815)
Enhance debugging for cluster logs

[1] Add human node names in cluster tests so that we can easily
understand which nodes we are interacting with:

```
pong packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: :0
node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) announces that it is a primary in shard c6d1152caee49a5e70cb4b77d1549386078be603
Reconfiguring node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) as primary for shard c6d1152caee49a5e70cb4b77d1549386078be603
Configuration change detected. Reconfiguring myself as a replica of node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) in shard c6d1152caee49a5e70cb4b77d1549386078be603
```



[2] Currently there are logs showing the address of incoming
connections:

```
Accepting cluster node connection from 127.0.0.1:59956
Accepting cluster node connection from 127.0.0.1:59957
Accepting cluster node connection from 127.0.0.1:59958
Accepting cluster node connection from 127.0.0.1:59959
```

but we have no idea which nodes these connections refer to. I added a
logging statement when the node is set to the inbound link connection.

```
Bound cluster node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) to connection of client 127.0.0.1:59956
```



[3] Add a debug log when processing a packet to show the packet type,
sender node name, and sender client port (this also has the benefit of
telling us whether this is an inbound or outbound link).

```
pong packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: :0
ping packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973
fail packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973
auth-req packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973
```

---------

Signed-off-by: Zhijun <dszhijun@gmail.com>
2025-11-25 17:07:21 -08:00
Leon Anavi e5de417f1e
Fix build on 32-bit ARM by only using NEON on AArch64 (#2873)
Only enable `HAVE_ARM_NEON` on AArch64 because it supports vaddvq and
all needed compiler intrinsics.

Fixes the following error when building for machine `qemuarm` using the
Yocto Project and OpenEmbedded:

```
| bitops.c: In function 'popcountNEON':
| bitops.c:219:23: error: implicit declaration of function 'vaddvq_u16'; did you mean 'vaddq_u16'? [-Wimplicit-function-declaration]
|   219 |         uint32_t t1 = vaddvq_u16(sc);
|       |                       ^~~~~~~~~~
|       |                       vaddq_u16
| bitops.c:225:14: error: implicit declaration of function 'vaddvq_u8'; did you mean 'vaddq_u8'? [-Wimplicit-function-declaration]
|   225 |         t += vaddvq_u8(vcntq_u8(vld1q_u8(p)));
|       |              ^~~~~~~~~
|       |              vaddq_u8
```

More details are available in the following log:
https://errors.yoctoproject.org/Errors/Details/889836/

Signed-off-by: Leon Anavi <leon.anavi@konsulko.com>
2025-11-25 21:23:32 +01:00
jiegang0219 dd2827a14e
Add support for asynchronous release to replicaKeysWithExpire on writable replica (#2849)
## Problem
When executing `FLUSHALL ASYNC` on a **writable replica** that has
a large number of expired keys directly written to it, the main thread
gets blocked for an extended period while synchronously releasing the
`replicaKeysWithExpire` dictionary. 

## Root Cause
`FLUSHALL ASYNC` is designed for asynchronous lazy freeing of core data
structures, but the release of `replicaKeysWithExpire` (a dictionary tracking
expired keys on replicas) still happens synchronously in the main thread.
This synchronous operation becomes a bottleneck when dealing with massive
key volumes, as it cannot be offloaded to the lazyfree background thread.

This PR addresses the issue by moving the release of `replicaKeysWithExpire`
to the lazyfree background thread, aligning it with the asynchronous design
of `FLUSHALL ASYNC` and eliminating main thread blocking.

## User scenarios
In some operations, people often need to do primary-replica switches.
One goal is to avoid noticeable impact on the business—like key loss
or reduced availability (e.g., write failures).

Here is the process: First, temporarily switch traffic to writable replicas.
Then we wait for the primary's pending replication data to be fully synced
(so primary and replicas are in sync) before finishing the switch. We don't
usually need to do the flush in this case, but it's an optimization that can
be done.

Signed-off-by: Scut-Corgis <512141203@qq.com>
Signed-off-by: jiegang0219 <512141203@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2025-11-25 19:25:39 +08:00
Binbin 8ea7f1330c
Update dual channel replication conf to mention the local buffer is limited by COB (#2824)
After introducing the dual channel replication in #60, we decided in #915
not to add a new configuration item to limit the replica's local replication
buffer, just use "client-output-buffer-limit replica hard" to limit it.

We need to document this behavior and mention that once the limit is reached,
all future data will accumulate in the primary side.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-11-23 23:27:50 +08:00
Binbin 8189fe5c42
Add rdb_transmitted to replstateToString so that we can see it in INFO (#2833)
In dual channel replication, when the rdb channel client finishes
the RDB transfer, it enters the REPLICA_STATE_RDB_TRANSMITTED
state. During this time, there is a brief window in which we are
not able to see the connection in INFO REPLICATION.

In the worst case, we might not see the connection for
DEFAULT_WAIT_BEFORE_RDB_CLIENT_FREE seconds. There is no harm in
listing this state, and showing connected_slaves without showing
the connection is confusing when troubleshooting.

Note that this also affects the `valkey-cli --rdb` and `--functions-rdb`
options. Before the client is in the `rdb_transmitted` state and is
released, we will now see it in the info (see the example later).

Before, not showing the replica info
```
role:master
connected_slaves:1
```

After, for dual channel replication:
```
role:master
connected_slaves:1
slave0:ip=xxx,port=xxx,state=rdb_transmitted,offset=0,lag=0,type=rdb-channel
```

After, for valkey-cli --rdb-only and --functions-rdb:
```
role:master
connected_slaves:1
slave0:ip=xxx,port=xxx,state=rdb_transmitted,offset=0,lag=0,type=replica
```

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-11-21 18:31:31 +08:00
Ricardo Dias 05540af405
Add script function flags in the module API (#2836)
This commit adds script function flags to the module API, which allows
function scripts to specify the function flags programmatically.

When the scripting engine compiles the script code, it can extract the
flags from the code and set them on the compiled function objects.

---------

Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
2025-11-20 10:23:00 +00:00
Hanxi Zhang ed8856bdfc
Fix cluster slot migration flaky test (#2756)
The original test code only checks:

1. wait_for_cluster_size 4, which calls cluster_size_consistent for every node.
Inside that function, for each node, cluster_size_consistent queries cluster_known_nodes,
which is calculated as (unsigned long long)dictSize(server.cluster->nodes). However, when
a new node is added to the cluster, it is first created in the HANDSHAKE state, and
clusterAddNode adds it to the nodes hash table. Therefore, it is possible for the new
node to still be in HANDSHAKE status (processed asynchronously) even though it appears
that all nodes “know” there are 4 nodes in the cluster.

2. cluster_state for every node, but when a new node is added, server.cluster->state remains FAIL.


Some handshake processes may not have completed yet, which likely causes the flakiness.
To address this, we added a --cluster check to ensure that the config state is consistent.

Fixes #2693.

Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2025-11-20 15:07:16 +08:00
aradz44 e19ceb7a6d
deflake "Hash field TTL and active expiry propagates correctly" (#2856)
Fix a little miss in "Hash field TTL and active expiry propagates
correctly through chain replication" test in `hashexpire.tcl`.
The test did not wait for the initial sync of the chained replica, which made the test flaky.

Signed-off-by: Arad Zilberstein <aradz@amazon.com>
2025-11-19 11:33:55 +02:00
Venkat Pamulapati 3c3a1966ec
Perform data cleanup during RDB load on successful version/signature validation (#2600)
Addresses: https://github.com/valkey-io/valkey/issues/2588

## Overview
Previously we call `emptyData()` during a fullSync before validating the
RDB version is compatible.

This change adds an rdb flag that allows us to flush the database from
within `rdbLoadRioWithLoadingCtx`. This provides the option to only
flush the data if the rdb has a valid version and signature. In the case
where we do have an invalid version and signature, we don't emptyData,
so if a full sync fails for that reason a replica can still serve stale
data instead of clients experiencing cache misses.

## Changes
- Added a new flag `RDBFLAGS_EMPTY_DATA` that signals to flush the
database after rdb validation
- Added logic to call `emptyData` in `rdbLoadRioWithLoadingCtx` in
`rdb.c`
- Added logic to not clear data if the RDB validation fails in
`replication.c` using new return type `RDB_INCOMPATIBLE`
- Modified the signature of `rdbLoadRioWithLoadingCtx` to return RDB
success codes and updated all calling sites.

## Testing
Added a tcl test that uses the debug command `reload nosave` to load
from an RDB that has a future version number. This triggers the same
code path that full syncs will use, and verifies that we don't flush
the data until after the validation is complete.

A test already exists that checks that the data is flushed:
https://github.com/valkey-io/valkey/blob/unstable/tests/integration/replication.tcl#L1504

---------

Signed-off-by: Venkat Pamulapati <pamuvenk@amazon.com>
Signed-off-by: Venkat Pamulapati <33398322+ChiliPaneer@users.noreply.github.com>
Co-authored-by: Venkat Pamulapati <pamuvenk@amazon.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
2025-11-18 17:08:10 -08:00
yzc-yzc 57892663be
Fix SCAN consistency test to only test what we guarantee (#2853)
Test the SCAN consistency by alternating SCAN
calls to primary and replica.
We cannot rely on the exact order of the elements and the returned
cursor number.

---------

Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-11-18 16:06:20 +01:00
chzhoo 33bfac37ba
Optimize zset memory usage by embedding element in skiplist (#2508)
By default, when the number of elements in a zset exceeds 128, the
underlying data structure adopts a skiplist. We can reduce memory usage
by embedding elements into the skiplist nodes. Change the `zskiplistNode`
memory layout as follows:

```
Before
                 +-------------+
         +-----> | element-sds |
         |       +-------------+
         |
 +------------------+-------+------------------+---------+-----+---------+
 | element--pointer | score | backward-pointer | level-0 | ... | level-N |
 +------------------+-------+------------------+---------+-----+---------+



After
 +-------+------------------+---------+-----+---------+-------------+
 + score | backward-pointer | level-0 | ... | level-N | element-sds |
 +-------+------------------+---------+-----+---------+-------------+
```

Before the embedded SDS representation, we include one byte representing
the size of the SDS header, i.e. the offset into the SDS representation
where the actual string starts.

The memory saving is therefore one pointer minus one byte = 7 bytes per
element, regardless of other factors such as element size or number of
elements.

### Benchmark step

I generated the test data using the following lua script && cli command.
And check memory usage using the `info` command.

**lua script**
```
local start_idx = tonumber(ARGV[1])
local end_idx = tonumber(ARGV[2])
local elem_count = tonumber(ARGV[3])

for i = start_idx, end_idx do
    local key = "zset:" .. string.format("%012d", i)
    local members = {}

    for j = 0, elem_count - 1 do
        table.insert(members, j)
        table.insert(members, "member:" .. j)
    end

    redis.call("ZADD", key, unpack(members))
end

return "OK: Created " .. (end_idx - start_idx + 1) .. " zsets"
```

**valkey-cli command**
`valkey-cli EVAL "$(cat create_zsets.lua)" 0 0 100000
${ZSET_ELEMENT_NUM}`

### Benchmark result
| number of elements in a zset | memory usage before optimization | memory usage after optimization | change |
|-------|-------|-------|-------|
| 129 | 1047MB | 943MB | -9.9% |
| 256 | 2010MB | 1803MB | -10.3% |
| 512 | 3904MB | 3483MB | -10.8% |

---------

Signed-off-by: chzhoo <czawyx@163.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-11-18 14:27:15 +01:00
Roshan Khatri 616fccb4c5
Fix the failing warmup and duration are cumulative (#2854)
We need to verify that the total duration was at least 2 seconds; elapsed
time is too variable to check against an upper bound.

Fixes https://github.com/valkey-io/valkey/issues/2843

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
2025-11-17 21:26:12 +01:00
Binbin aef56e52f5
Fix timing issue in dual channel replication COB test (#2847)
After #2829, valgrind reported a test failure; it seems the time was
not enough to reach the COB limit under valgrind.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-11-17 17:25:19 +08:00
Binbin a06cf15b20
Allow dual channel full sync in plain failover (#2659)
PSYNC_FULLRESYNC_DUAL_CHANNEL is also a full sync; as the comment says,
we need to allow it. We have not yet identified the exact edge case that
leads to this line, but during a failover there should be no difference
between the sync strategies.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-11-15 12:57:27 +08:00
Harkrishn Patro 86db609219
Print node name on a best effort basis if light weight message is received before link stabilization (#2825)
fixes: #2803

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2025-11-14 14:33:16 -08:00
yzc-yzc b93cfcc332
Attempt to fix flaky SCAN consistency test (#2834)
Related test failures:

https://github.com/valkey-io/valkey/actions/runs/19282092345/job/55135193394

https://github.com/valkey-io/valkey/actions/runs/19200556305/job/54887767594

> *** [err]: scan family consistency with configured hash seed in
tests/integration/scan-family-consistency.tcl
> Expected '5 {k:1 k:25 z k:11 k:18 k:27 k:45 k:7 k:12 k:19 k:29 k:40
k:41 k:43}' to be equal to '5 {k:1 k:25 k:11 z k:18 k:27 k:45 k:7 k:12
k:19 k:29 k:40 k:41 k:43}' (context: type eval line 26 cmd {assert_equal
$primary_cursor_next $replica_cursor_next} proc ::start_server)

The reason is that the RDB part of the primary-replica synchronization
affects the resize policy of the hashtable.
See
b835463a73/src/server.c (L807-L818)

Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
2025-11-14 10:55:05 +01:00
Binbin 331a852821
Change DEFAULT_WAIT_BEFORE_RDB_CLIENT_FREE from 60s to 5s (#2829)
Consider this scenario:
1. Replica starts loading the RDB using the rdb connection
2. Replica finishes loading the RDB before the replica main connection has
   initiated the PSYNC request
3. Replica stops replicating after receiving replicaof no one
4. Primary can't know that the replica main connection will never ask for
   PSYNC, so it keeps the reference to the replica's replication buffer block
5. Primary has a shutdown-timeout configured and requires to wait for the rdb
   connection to close before it can shut down.

The current 60-second wait time (DEFAULT_WAIT_BEFORE_RDB_CLIENT_FREE) is excessive
and leads to prolonged resource retention in edge cases. Reducing this timeout to
5 seconds would provide adequate time for legitimate PSYNC requests while mitigating
the issue described above.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-11-14 11:32:29 +08:00
Ricardo Dias 8e0b375da4
Fix cluster slot stats for scripts with cross-slot keys (#2835)
This commit fixes the cluster slot stats for scripts executed by
scripting engines when the scripts access cross-slot keys.

This was not a bug in the Lua scripting engine; rather, `VM_Call` was
missing a call to invalidate the script caller client's slot to prevent
the accumulation of stats.

Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
2025-11-13 12:05:52 -08:00
Rain Valentine 01a7657b83
Add --warmup and --duration parameters to valkey-benchmark (#2581)
It's handy to be able to automatically do a warmup and/or test by
duration rather than request count. 🙂

I changed the real-time output a bit - not sure if that's wanted or not.
(Like, would it break people's weird scripts? It'll break my weird
scripts, but I know the price of writing weird fragile scripts.)

```
Prepended "Warming up " when in warmup phase:
Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds
^^^^^^^^^^

Appended running request counter when based on -n:
SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests
                                                                    ^^^^^^^^^^^^^^^

Appended running second counter when in warmup or based on --duration:
SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds
                                                                    ^^^^^^^^^^^
```

To be clear, the report at the end remains unchanged.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-11-13 12:57:46 +01:00
Sarthak Aggarwal b835463a73
Fixes test-freebsd workflow in daily (package lang/tclX) (#2832)
This PR fixes the freebsd daily job that has been failing consistently
for the past few days with the error "pkg: No packages available to install
matching 'lang/tclx' have been found in the repositories".

The package name is corrected from `lang/tclx` to `lang/tclX`. The
lowercase version worked previously but appears to have stopped working
after an update of FreeBSD's pkg tool to 2.4.x.

Example of failed job:

https://github.com/valkey-io/valkey/actions/runs/19282092345/job/55135193499

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
2025-11-13 08:24:37 +01:00
Binbin 7ffe4dcec4
Remove the EXAT and PXAT from some HFE notifications tests (#2831)
As we can see, we expected to get hexpired but got hexpire instead,
which means the expiration time elapsed during execution.
```
*** [err]: HGETEX EXAT keyspace notifications for active expiry in tests/unit/hashexpire.tcl
Expected 'pmessage __keyevent@* __keyevent@9__:hexpired myhash' to match 'pmessage __keyevent@* __keyevent@*:hexpire myhash'
```

We should remove EXAT and PXAT from these fixtures. We already have
dedicated tests verifying that we get 'expired' when EX/PX are set to 0
or EXAT/PXAT are in the past.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-11-12 14:32:13 +02:00