For nodes.conf files corrupted by hand-editing or recovered from a
program error, check for duplicate node IDs when loading nodes.conf. If a
duplicate is found, a panic is triggered to prevent nodes from starting up
unexpectedly.
The node ID identifies every node across the whole cluster, so we never
expect to find duplicate node IDs in nodes.conf.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Currently, when parsing querybuf, we do not check for CRLF;
instead we assume the last two characters are CRLF by default,
as shown in the following example:
```
telnet 127.0.0.1 6379
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
*3
$3
set
$3
key
$5
value12
+OK
get key
$5
value
*3
$3
set
$3
key
$5
value12345
+OK
-ERR unknown command '345', with args beginning with:
```
This should actually be considered a protocol error. When a bug
occurs in a client-side implementation, we may execute incorrect
requests (writing incorrect data being the most serious consequence).
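A minimal sketch of the missing check (hypothetical helper name, not the actual networking.c code): after reading a bulk payload, verify that the two trailing bytes really are CRLF instead of skipping them blindly.
```c
/* Returns 1 if the two bytes following a bulk payload of `bulklen` bytes
 * are the required CRLF terminator. If it returns 0, the server should
 * reply with a protocol error instead of treating the trailing bytes as
 * part of the next command. */
static int bulkTerminatorValid(const char *payload, long long bulklen) {
    return payload[bulklen] == '\r' && payload[bulklen + 1] == '\n';
}
```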
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
- Require a 2/3 supermajority vote for all Governance Major Decisions.
- Update Technical Major Decision voting to prioritize simple majority, limiting the use of "+2" approval.
- Define remediation steps for when the 1/3 organization limit is exceeded.
---------
Signed-off-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
GitHub has deprecated older macOS runners, and macos-13 is no longer supported.
1. The latest version of cross-platform-actions/action allows
running on ubuntu-latest (a Linux runner) and does not strictly require macOS.
2. Previously, cross-platform-actions/action@v0.22.0 used runs-on:
macos-13. I checked the latest version of cross-platform-actions, and
the official examples now use runs-on: ubuntu, so I think we can switch from macOS to Ubuntu.
---------
Signed-off-by: Vitah Lin <vitahlin@gmail.com>
This reverts commit 2da21d9def.
The implementation in #2366 made it possible to perform a partial
resynchronization after loading an AOF file, by storing the replication
offset in the AOF preamble and counting the bytes in the AOF command
stream in the same way as we count byte offset in a replication stream.
However, this approach isn't safe because some commands are replicated
but not stored in the AOF file. This includes the commands REPLCONF,
PING, PUBLISH and module-implemented commands, where a module can choose
to propagate a command to the replication stream only, to the AOF only,
or to both. This oversight led to data inconsistency, where the wrong
replication offset is used for partial resynchronization as explained in
issue #2904.
The revert caused small merge conflicts with 3c3a1966ec, which have been
resolved.
Fixes #2904.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The CLUSTER SLOTS reply depends on whether the client is connected over
IPv6, but a fake client has no connection. When this command is called
from a module timer callback or another scenario where no real client is
involved, there is no connection on which to check IPv6 support.
This fix handles the missing case by returning the reply used for
IPv4-connected clients.
Fixes #2912.
---------
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Su Ko <rhtn1128@gmail.com>
Co-authored-by: KarthikSubbarao <karthikrs2021@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In #2078, we did not report large replies when copy avoidance is allowed.
This results in replies larger than 16384 bytes not being recorded in the
commandlog large-reply log. The 16384 threshold is controlled by the
hidden config min-string-size-avoid-copy-reply.
Signed-off-by: Binbin <binloveplay1314@qq.com>
We have the same settings for the hard limit, and we should apply them to the soft
limit as well. When the `repl-backlog-size` value is larger, all replication buffers
can be handled by the replication backlog, so there's no need to worry about the
client output buffer soft limit here. Furthermore, when `soft_seconds` is 0, the
soft limit behaves (mostly) the same as the hard limit.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## Problem
IO thread shutdown can deadlock during server panic when the main thread
calls `pthread_cancel()` while still holding the IO thread's mutex,
preventing the IO thread from observing the cancellation.
## Solution
Release the IO thread mutex before cancelling to ensure clean thread
termination.
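A minimal sketch of the ordering fix (variable and function names are hypothetical, not the actual Valkey code):
```c
#include <pthread.h>

/* Unlock the IO thread's mutex before cancelling, so the IO thread can
 * reach a cancellation point instead of blocking forever on the lock the
 * panicking main thread still holds. */
static void terminateIOThread(pthread_t tid, pthread_mutex_t *io_mutex) {
    pthread_mutex_unlock(io_mutex); /* was held across the cancel before */
    pthread_cancel(tid);
    pthread_join(tid, NULL); /* reap the cancelled thread */
}
```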
## Testing
Reproducer:
```bash
./src/valkey-server --io-threads 2 --enable-debug-command yes
./src/valkey-cli debug panic
```
Before: Server hangs indefinitely
After: Server terminates cleanly
Signed-off-by: Ouri Half <ourih@amazon.com>
This adds workflow improvements for the PR and Release benchmarks, which
run on `c8g.metal-48xl` for `ARM64` and `c7i.metal-48xl` for `X86`.
```
Cluster mode: disabled
TLS: disabled
io-threads: 1, 9
Pipelining: 1, 10
Clients: 1600
Benchmark Threads: 90
Data size: 16, 96
Commands: SET, GET
```
c8g.metal-48xl Spec: https://aws.amazon.com/ec2/instance-types/c8g/
c7i.metal-48xl Spec: https://aws.amazon.com/ec2/instance-types/c7i/
```
vCPU: 192
NUMA nodes: 2
Memory (GiB): 384
Network Bandwidth (Gbps): 50
```
PR benchmarking will be executed on the **ARM64** machine as it has been
seen to be more consistent.
Additionally, it runs 5 iterations for each test and posts the average
and other statistical metrics such as:
- CI99%: 99% Confidence Interval - range where the true population mean
is likely to fall
- PI99%: 99% Prediction Interval - range where a single future
observation is likely to fall
- CV: Coefficient of Variation - relative variability (σ/μ × 100%)
_Note: Values with (n=X, σ=Y, CV=Z%, CI99%=±W%, PI99%=±V%) indicate
averages from X runs with standard deviation Y, coefficient of variation
Z%, 99% confidence interval margin of error ±W% of the mean, and 99%
prediction interval margin of error ±V% of the mean. CI bounds [A, B]
and PI bounds [C, D] show the actual interval ranges._
For comparing between versions, it adds a workflow which runs on both
**ARM64** and **X86** machines. It will also post the comparison between
the versions like this:
https://github.com/valkey-io/valkey/issues/2580#issuecomment-3399539615
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <117414976+roshkhatri@users.noreply.github.com>
This makes it safe to delete a hashtable while a safe iterator is
iterating over it. This currently isn't possible, but the improvement is
required for fork-less replication (#1754), which is being actively
worked on.
We discussed these issues in #2611 which guards against a different but
related issue: calling hashtableNext again after it has already returned
false.
I implemented a singly linked list that hashtable uses to track its
current safe iterators. It is used to invalidate all associated safe
iterators when the hashtable is released. A singly linked list is
acceptable because the list length is always very small - typically zero
and no more than a handful.
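A sketch of the bookkeeping (names invented, not the actual hashtable.c code):
```c
#include <stddef.h>

/* Each hashtable keeps a singly linked list of its live safe iterators
 * and clears their table pointer when the table itself is released. */
typedef struct safeIterNode {
    void *hashtable;           /* set to NULL when the table is freed */
    struct safeIterNode *next; /* next live safe iterator on this table */
} safeIterNode;

static void invalidateSafeIterators(safeIterNode *head) {
    /* The list is typically empty and never more than a handful long,
     * so a linear walk is cheap. */
    for (safeIterNode *n = head; n != NULL; n = n->next)
        n->hashtable = NULL; /* iteration on this iterator now ends */
}
```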
Also, renames the internal functions:
hashtableReinitIterator -> hashtableRetargetIterator
hashtableResetIterator -> hashtableCleanupIterator
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This prevents crashes on the older nodes in mixed clusters where some
nodes are running 8.0 or older. Mixed clusters often exist temporarily
during rolling upgrades.
Fixes: #2341
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Fixes #2792
Replace strcmp with byte-by-byte comparison to avoid accidental
heap-buffer-overflow errors.
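A sketch of the safer comparison (hypothetical helper): compare a possibly non-NUL-terminated buffer against a C string without reading past the end of either, which strcmp cannot guarantee.
```c
#include <stddef.h>

/* Returns 1 only when buf[0..buflen) matches str exactly; never reads
 * beyond buflen bytes of buf or past str's terminator. */
static int bufEqualsCString(const char *buf, size_t buflen, const char *str) {
    size_t i;
    for (i = 0; i < buflen; i++) {
        if (str[i] == '\0' || buf[i] != str[i]) return 0;
    }
    return str[i] == '\0'; /* match only if str ends exactly here */
}
```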
Signed-off-by: murad shahmammadli <shmurad@amazon.com>
Co-authored-by: murad shahmammadli <shmurad@amazon.com>
Fixes: https://github.com/valkey-io/valkey/issues/2859
Increased the warmup to 2 sec so we can verify that it runs more
commands than the actual benchmark.
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
General cleanup on LRU/LFU code. Improve modularity and maintainability.
Specifically:
* Consolidates the mathematical logic for LRU/LFU into `lrulfu.c`, with
an API in `lrulfu.h`. Knowledge of the LRU/LFU implementation was
previously spread out across `db.c`, `evict.c`, `object.c`, `server.c`,
and `server.h`.
* Separates knowledge of the LRU from knowledge of the object containing
the LRU value. `lrulfu.c` knows about the LRU/LFU algorithms, without
knowing about the `robj`. `object.c` knows about the `robj` without
knowing about the details of the LRU/LFU algorithms.
* Eliminates `server.lruclock`, instead using `server.unixtime`. This
also eliminates the periodic need to call `mstime()` to maintain the LRU
clock.
* Fixes a minor computation bug in the old `LFUTimeElapsed` function
(off by 1 after rollover).
* Eliminates explicit if checks for rollover, relying instead on the
defined behavior of unsigned wraparound (see the sketch after this list).
* Fixes a bug in `debug.c` which would perform LFU modification on an
LRU value.
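For the rollover point above, a tiny illustration of why no explicit check is needed:
```c
/* Unsigned subtraction is defined modulo 2^N in C, so the elapsed time
 * stays correct even after the clock value wraps around past zero. */
static unsigned long lruElapsed(unsigned long now, unsigned long then) {
    return now - then; /* well-defined even when now < then after a wrap */
}
```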
---------
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Closes #2883
Support a new environment variable `VALKEY_PROG_SUFFIX` in the test
framework, which can be used for running tests if the binaries are
compiled with a program suffix. For example, if the binaries are
compiled using `make PROG_SUFFIX=-alt` to produce binaries named
valkey-server-alt, valkey-cli-alt, etc., run the tests against these
binaries using `VALKEY_PROG_SUFFIX=-alt ./runtest` or simply using `make
test`.
Now testing with the make variable `PROG_SUFFIX` works well:
```
% make PROG_SUFFIX="-alt"
...
...
CC trace/trace_aof.o
LINK valkey-server-alt
INSTALL valkey-sentinel-alt
CC valkey-cli.o
CC serverassert.o
CC cli_common.o
CC cli_commands.o
LINK valkey-cli-alt
CC valkey-benchmark.o
LINK valkey-benchmark-alt
INSTALL valkey-check-rdb-alt
INSTALL valkey-check-aof-alt
Hint: It's a good idea to run 'make test' ;)
%
% make test
cd src && /Library/Developer/CommandLineTools/usr/bin/make test
CC Makefile.dep
CC release.o
LINK valkey-server-alt
INSTALL valkey-check-aof-alt
INSTALL valkey-check-rdb-alt
LINK valkey-cli-alt
LINK valkey-benchmark-alt
Cleanup: may take some time... OK
Starting test server at port 21079
[ready]: 39435
Testing unit/pubsub
```
Signed-off-by: Zhijun <dszhijun@gmail.com>
Historically, Valkey’s TCL test suite expected all binaries
(src/valkey-server, src/valkey-cli, src/valkey-benchmark, etc.) to exist
under the src/ directory. This PR enables Valkey TCL tests to run
seamlessly after a CMake build — no manual symlinks or make build
required.
The test framework accepts a new environment variable `VALKEY_BIN_DIR`
to look for the binaries.
CMake will copy all TCL test entrypoints (runtest, runtest-cluster,
etc.) into the CMake build dir (e.g. `cmake-build-debug`) and insert
`VALKEY_BIN_DIR` into these. Now we can run all tests either via
./cmake-build-debug/runtest from the project root or via ./runtest from
the CMake build dir.
A new CMake post-build target prints a friendly reminder after
successful builds, guiding developers on how to run tests with their
CMake binaries:
```
Hint: It is a good idea to run tests with your CMake-built binaries ;)
./cmake-build-debug/runtest
Build finished
```
A helper TCL script `tests/support/set_executable_path.tcl` is added to
support this change, which gets called by all test entrypoints:
`runtest`, `runtest-cluster`, `runtest-sentinel`.
---------
Signed-off-by: Zhijun <dszhijun@gmail.com>
Although we can infer this information from the replica logs,
I think it is still useful to see it directly in the primary logs.
Now we can see the psync offset range when psync fails, and then
analyze and adjust the value of repl-backlog-size accordingly.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Persist USE_FAST_FLOAT and PROG_SUFFIX to prevent a complete rebuild
next time someone types make or make test without specifying variables.
Fixes #2880
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The fedorarawhide CI job reports these warnings:
```
networking.c: In function 'afterErrorReply':
networking.c:821:30: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
821 | char *spaceloc = memchr(s, ' ', len < 32 ? len : 32);
```
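A sketch of the likely fix (assuming the patch simply keeps the qualifier on the result, since `s` points into const data):
```c
const char *spaceloc = memchr(s, ' ', len < 32 ? len : 32);
```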
Signed-off-by: Binbin <binloveplay1314@qq.com>
In functions `clusterSendFailoverAuthIfNeeded` and
`clusterProcessPacket`, we iterate through **every slot bit**
sequentially in the form of `for (int j = 0; j < CLUSTER_SLOTS; j++)`,
performing 16384 checks even when only a few bits are set, and thus
causing unnecessary loop overhead.
This is particularly wasteful in
`clusterSendFailoverAuthIfNeeded`, where we need to ensure the sender's
claimed slots all have an up-to-date config epoch. Healthy senders usually
meet this condition, so we normally end up exhausting the full loop of
16384 checks.
The proposed new implementation loads 64 bits (8 byte word) at a time
and skips empty words completely, therefore only performing 256 checks.
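A sketch of the word-at-a-time scan (illustrative, not the exact patch):
```c
#include <stdint.h>

#define CLUSTER_SLOTS 16384

/* Walk the slot bitmap eight bytes at a time, skipping all-zero words
 * entirely and locating set bits with count-trailing-zeros, so a sparse
 * bitmap costs ~256 word checks instead of 16384 bit checks. */
static void forEachSetSlot(const uint64_t *bitmap, void (*visit)(int slot)) {
    for (int w = 0; w < CLUSTER_SLOTS / 64; w++) {
        uint64_t word = bitmap[w];
        while (word) {
            int bit = __builtin_ctzll(word); /* index of lowest set bit */
            visit(w * 64 + bit);
            word &= word - 1; /* clear the lowest set bit */
        }
    }
}
```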
---------
Signed-off-by: Zhijun <dszhijun@gmail.com>
When we added the Hash Field Expiration feature in Valkey 9.0, some of
the new command docs included a complexity description of O(1) even though
they accept multiple arguments.
(see discussion in
https://github.com/valkey-io/valkey/pull/2851#discussion_r2535684985)
This PR does:
1. align all the commands to the same description
2. fix the complexity description of some commands (e.g. HSETEX and
HGETEX)
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Enhance debugging for cluster logs
[1] Add human-readable node names in cluster tests so that we can easily
understand which nodes we are interacting with:
```
pong packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: :0
node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) announces that it is a primary in shard c6d1152caee49a5e70cb4b77d1549386078be603
Reconfiguring node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) as primary for shard c6d1152caee49a5e70cb4b77d1549386078be603
Configuration change detected. Reconfiguring myself as a replica of node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) in shard c6d1152caee49a5e70cb4b77d1549386078be603
```
[2] Currently there are logs showing the address of incoming
connections:
```
Accepting cluster node connection from 127.0.0.1:59956
Accepting cluster node connection from 127.0.0.1:59957
Accepting cluster node connection from 127.0.0.1:59958
Accepting cluster node connection from 127.0.0.1:59959
```
but we have no idea which nodes these connections refer to. I added a
logging statement when the node is bound to the inbound link connection.
```
Bound cluster node 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) to connection of client 127.0.0.1:59956
```
[3] Add a debug log when processing a packet to show the packet type,
sender node name, and sender client port (this also has the benefit of
telling us whether this is an inbound or outbound link).
```
pong packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: :0
ping packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973
fail packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973
auth-req packet received from: 92a960ffd62f1bd04efeb260b30fe9ca6b9294ed (R4) from client: 127.0.0.1:59973
```
---------
Signed-off-by: Zhijun <dszhijun@gmail.com>
Only enable `HAVE_ARM_NEON` on AArch64 because it supports vaddvq and
all needed compiler intrinsics.
Fixes the following error when building for machine `qemuarm` using the
Yocto Project and OpenEmbedded:
```
| bitops.c: In function 'popcountNEON':
| bitops.c:219:23: error: implicit declaration of function 'vaddvq_u16'; did you mean 'vaddq_u16'? [-Wimplicit-function-declaration]
| 219 | uint32_t t1 = vaddvq_u16(sc);
| | ^~~~~~~~~~
| | vaddq_u16
| bitops.c:225:14: error: implicit declaration of function 'vaddvq_u8'; did you mean 'vaddq_u8'? [-Wimplicit-function-declaration]
| 225 | t += vaddvq_u8(vcntq_u8(vld1q_u8(p)));
| | ^~~~~~~~~
| | vaddq_u8
```
More details are available in the following log:
https://errors.yoctoproject.org/Errors/Details/889836/
Signed-off-by: Leon Anavi <leon.anavi@konsulko.com>
## Problem
When executing `FLUSHALL ASYNC` on a **writable replica** that has
a large number of expired keys directly written to it, the main thread
gets blocked for an extended period while synchronously releasing the
`replicaKeysWithExpire` dictionary.
## Root Cause
`FLUSHALL ASYNC` is designed for asynchronous lazy freeing of core data
structures, but the release of `replicaKeysWithExpire` (a dictionary tracking
expired keys on replicas) still happens synchronously in the main thread.
This synchronous operation becomes a bottleneck when dealing with massive
key volumes, as it cannot be offloaded to the lazyfree background thread.
This PR addresses the issue by moving the release of `replicaKeysWithExpire`
to the lazyfree background thread, aligning it with the asynchronous design
of `FLUSHALL ASYNC` and eliminating main thread blocking.
## User scenarios
In some operations, people often need to do primary-replica switches.
One goal is to avoid noticeable impact on the business—like key loss
or reduced availability (e.g., write failures).
Here is the process: First, temporarily switch traffic to writable replicas.
Then we wait for the primary's pending replication data to be fully synced
(so the primary and replicas are in sync) before finishing the switch. We
don't usually need to do the flush in this case, but it's an optimization
that can be done.
Signed-off-by: Scut-Corgis <512141203@qq.com>
Signed-off-by: jiegang0219 <512141203@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
After introducing dual channel replication in #60, we decided in #915
not to add a new configuration item to limit the replica's local replication
buffer, but to use "client-output-buffer-limit replica hard" to limit it
instead. We need to document this behavior and mention that once the limit
is reached, all future data will accumulate on the primary side.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In dual channel replication, when the rdb channel client finishes
the RDB transfer, it enters the REPLICA_STATE_RDB_TRANSMITTED
state. During this time, there is a brief window in which we are
not able to see the connection in INFO REPLICATION.
In the worst case, we might not see the connection for
DEFAULT_WAIT_BEFORE_RDB_CLIENT_FREE seconds. I guess there is no
harm in listing this state; showing connected_slaves without showing
the connection is confusing when troubleshooting.
Note that this also affects the `valkey-cli --rdb` and `--functions-rdb`
options. While the client is in the `rdb_transmitted` state, before it is
released, we will now see it in the info (see the example later).
Before, the replica info is not shown:
```
role:master
connected_slaves:1
```
After, for dual channel replication:
```
role:master
connected_slaves:1
slave0:ip=xxx,port=xxx,state=rdb_transmitted,offset=0,lag=0,type=rdb-channel
```
After, for valkey-cli --rdb-only and --functions-rdb:
```
role:master
connected_slaves:1
slave0:ip=xxx,port=xxx,state=rdb_transmitted,offset=0,lag=0,type=replica
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
This commit adds script function flags to the module API, which allows
function scripts to specify the function flags programmatically.
When the scripting engine compiles the script, it can extract the flags
from the code and set them on the compiled function objects.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
The original test code only checks:
1. wait_for_cluster_size 4, which calls cluster_size_consistent for every node.
Inside that function, for each node, cluster_size_consistent queries cluster_known_nodes,
which is calculated as (unsigned long long)dictSize(server.cluster->nodes). However, when
a new node is added to the cluster, it is first created in the HANDSHAKE state, and
clusterAddNode adds it to the nodes hash table. Therefore, it is possible for the new
node to still be in HANDSHAKE status (processed asynchronously) even though it appears
that all nodes “know” there are 4 nodes in the cluster.
2. cluster_state for every node. But when a new node is added, server.cluster->state remains FAIL;
some handshake processes may not have completed yet, which likely causes the flakiness.
To address this, a --cluster check is added to ensure that the config state is consistent.
Fixes #2693.
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Fix a small oversight in the "Hash field TTL and active expiry propagates
correctly through chain replication" test in `hashexpire.tcl`.
The test did not wait for the initial sync of the chained replica, which made it flaky.
Signed-off-by: Arad Zilberstein <aradz@amazon.com>
Addresses: https://github.com/valkey-io/valkey/issues/2588
## Overview
Previously we called `emptyData()` during a full sync before validating
that the RDB version is compatible.
This change adds an rdb flag that allows us to flush the database from
within `rdbLoadRioWithLoadingCtx`. This provides the option to only
flush the data if the rdb has a valid version and signature. In the case
where we have an invalid version or signature, we don't call emptyData,
so if a full sync fails for that reason, a replica can still serve stale
data instead of clients experiencing cache misses.
## Changes
- Added a new flag `RDBFLAGS_EMPTY_DATA` that signals to flush the
database after rdb validation
- Added logic to call `emptyData` in `rdbLoadRioWithLoadingCtx` in
`rdb.c`
- Added logic to not clear data if the RDB validation fails in
`replication.c` using new return type `RDB_INCOMPATIBLE`
- Modified the signature of `rdbLoadRioWithLoadingCtx` to return RDB
success codes and updated all calling sites.
## Testing
Added a tcl test that uses the debug command `reload nosave` to load
from an RDB that has a future version number. This triggers the same
code path that full syncs use, and verifies that we don't flush
the data until after the validation is complete.
A test already exists that checks that the data is flushed:
https://github.com/valkey-io/valkey/blob/unstable/tests/integration/replication.tcl#L1504
---------
Signed-off-by: Venkat Pamulapati <pamuvenk@amazon.com>
Signed-off-by: Venkat Pamulapati <33398322+ChiliPaneer@users.noreply.github.com>
Co-authored-by: Venkat Pamulapati <pamuvenk@amazon.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Test SCAN consistency by alternating SCAN
calls between primary and replica.
We cannot rely on the exact order of the elements or on the returned
cursor number.
---------
Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
By default, when the number of elements in a zset exceeds 128, the
underlying data structure adopts a skiplist. We can reduce memory usage
by embedding elements into the skiplist nodes. Change the `zskiplistNode`
memory layout as follows:
```
Before
+-------------+
+-----> | element-sds |
| +-------------+
|
+------------------+-------+------------------+---------+-----+---------+
| element-pointer  | score | backward-pointer | level-0 | ... | level-N |
+------------------+-------+------------------+---------+-----+---------+
After
+-------+------------------+---------+-----+---------+-------------+
| score | backward-pointer | level-0 | ... | level-N | element-sds |
+-------+------------------+---------+-----+---------+-------------+
```
Before the embedded SDS representation, we include one byte representing
the size of the SDS header, i.e. the offset into the SDS representation
where the actual string starts.
The memory saving is therefore one pointer minus one byte = 7 bytes per
element, regardless of other factors such as element size or number of
elements.
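For illustration, a hypothetical C rendering of the new layout (not the actual `zskiplistNode` definition):
```c
/* Score and pointers come first; the element string is stored inline
 * after the level array, preceded by one byte holding the sds header
 * size so the start of the actual string can be located. */
struct zskiplistNodeEmbedded {
    double score;
    struct zskiplistNodeEmbedded *backward;
    struct zskiplistLevel {
        struct zskiplistNodeEmbedded *forward;
        unsigned long span;
    } level[]; /* N levels, then 1 hdr-size byte, then the sds bytes */
};
```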
### Benchmark step
I generated the test data using the following Lua script and CLI command,
and checked memory usage using the `info` command.
**Lua script**
```
local start_idx = tonumber(ARGV[1])
local end_idx = tonumber(ARGV[2])
local elem_count = tonumber(ARGV[3])
for i = start_idx, end_idx do
    local key = "zset:" .. string.format("%012d", i)
    local members = {}
    for j = 0, elem_count - 1 do
        table.insert(members, j)
        table.insert(members, "member:" .. j)
    end
    redis.call("ZADD", key, unpack(members))
end
return "OK: Created " .. (end_idx - start_idx + 1) .. " zsets"
```
**valkey-cli command**
`valkey-cli EVAL "$(cat create_zsets.lua)" 0 0 100000
${ZSET_ELEMENT_NUM}`
### Benchmark result
| Number of elements in a zset | Memory usage before optimization | Memory usage after optimization | Change |
|-------|-------|-------|-------|
| 129 | 1047MB | 943MB | -9.9% |
| 256 | 2010MB | 1803MB | -10.3% |
| 512 | 3904MB | 3483MB | -10.8% |
---------
Signed-off-by: chzhoo <czawyx@163.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
We need to verify that the total duration was at least 2 seconds; the
elapsed time can be quite variable, so checking an upper bound is unreliable.
Fixes https://github.com/valkey-io/valkey/issues/2843
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
After #2829, valgrind reported a test failure; it seems that the time is
not enough to generate a COB limit under valgrind.
Signed-off-by: Binbin <binloveplay1314@qq.com>
PSYNC_FULLRESYNC_DUAL_CHANNEL is also a full sync; as the comment says,
we need to allow it. We have not yet identified the exact edge case
that leads to this line, but during a failover there should be no difference
between the different sync strategies.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Related test failures:
https://github.com/valkey-io/valkey/actions/runs/19282092345/job/55135193394
https://github.com/valkey-io/valkey/actions/runs/19200556305/job/54887767594
> *** [err]: scan family consistency with configured hash seed in
tests/integration/scan-family-consistency.tcl
> Expected '5 {k:1 k:25 z k:11 k:18 k:27 k:45 k:7 k:12 k:19 k:29 k:40
k:41 k:43}' to be equal to '5 {k:1 k:25 k:11 z k:18 k:27 k:45 k:7 k:12
k:19 k:29 k:40 k:41 k:43}' (context: type eval line 26 cmd {assert_equal
$primary_cursor_next $replica_cursor_next} proc ::start_server)
The reason is that the RDB part of the primary-replica synchronization
affects the resize policy of the hashtable.
See
b835463a73/src/server.c (L807-L818)
Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
Consider this scenario:
1. Replica starts loading the RDB using the rdb connection
2. Replica finishes loading the RDB before the replica main connection has
initiated the PSYNC request
3. Replica stops replicating after receiving replicaof no one
4. Primary can't know that the replica main connection will never ask for
PSYNC, so it keeps the reference to the replica's replication buffer block
5. Primary has a shutdown-timeout configured and must wait for the rdb
connection to close before it can shut down.
The current 60-second wait time (DEFAULT_WAIT_BEFORE_RDB_CLIENT_FREE) is excessive
and leads to prolonged resource retention in edge cases. Reducing this timeout to
5 seconds would provide adequate time for legitimate PSYNC requests while mitigating
the issue described above.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This commit fixes the cluster slot stats for scripts executed by
scripting engines when the scripts access cross-slot keys.
This was not a bug in the Lua scripting engine, but `VM_Call` was missing a
call to invalidate the script caller client's slot to prevent the
accumulation of stats.
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
It's handy to be able to automatically do a warmup and/or test by
duration rather than request count. 🙂
I changed the real-time output a bit - not sure if that's wanted or not.
(Like, would it break people's weird scripts? It'll break my weird
scripts, but I know the price of writing weird fragile scripts.)
```
Prepended "Warming up " when in warmup phase:
Warming up SET: rps=69211.2 (overall: 69747.5) avg_msec=0.425 (overall: 0.393) 3.8 seconds
^^^^^^^^^^
Appended running request counter when based on -n:
SET: rps=70892.0 (overall: 69878.1) avg_msec=0.385 (overall: 0.398) 612482 requests
^^^^^^^^^^^^^^^
Appended running second counter when in warmup or based on --duration:
SET: rps=61508.0 (overall: 61764.2) avg_msec=0.430 (overall: 0.426) 4.8 seconds
^^^^^^^^^^^
```
To be clear, the report at the end remains unchanged.
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This PR fixes the freebsd daily job that has been failing consistently
for the last few days with the error "pkg: No packages available to install
matching 'lang/tclx' have been found in the repositories".
The package name is corrected from `lang/tclx` to `lang/tclX`. The
lowercase version worked previously but appears to have stopped working
in an update of freebsd's pkg tool to 2.4.x.
Example of failed job:
https://github.com/valkey-io/valkey/actions/runs/19282092345/job/55135193499
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
As we can see, we expected to get hexpire, but we got hexpired instead;
this means that the expiration time elapsed during execution.
```
*** [err]: HGETEX EXAT keyspace notifications for active expiry in tests/unit/hashexpire.tcl
Expected 'pmessage __keyevent@* __keyevent@9__:hexpired myhash' to match 'pmessage __keyevent@* __keyevent@*:hexpire myhash'
```
We should remove the EXAT and PXAT from these fixtures. We indeed have
dedicated tests that verify that we get 'expired' when EX or PX are set
to 0, or EXAT or PXAT are in the past.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## Description
This change introduces the ability for modules to check ACL permissions
against key prefix. The update adds a dedicated `prefixmatchlen` helper
and extends the core ACL selector logic to support a prefix‑matching
mode.
The new API `ValkeyModule_ACLCheckPrefixPermissions` is registered and
exposed to modules, and a corresponding implementation is added in
`module.c`. Existing internal callers that already perform prefix checks
(e.g., `VM_ACLCheckKeyPermissions`) are updated to use the new flag,
while all legacy paths remain unchanged.
The change also modifies the `aclcheck` test module that exercises the
new prefix‑checking API, ensuring that read/write operations are
correctly allowed or denied based on the ACL configuration.
Key areas touched:
* ACL logic
* Module API
* Testing
## Motivation
The search module presently makes costly calls to verify index
permissions
(see https://github.com/valkey-io/valkey-search/blob/main/src/acl.cc#L295).
This PR introduces a more efficient approach for that.
---------
Signed-off-by: Eran Ifrah <eifrah@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
After a network failure, nodes that come back to the cluster do not always
send and/or receive messages from other nodes in the shard. This fix avoids
using lightweight messages on links whose bidirectional connection is not
yet ready.
When a light message arrives before any normal message, the cluster link
is freed, because link->node is not yet assigned on the just-established
connection. It is assigned in getNodeFromLinkAndMsg, right after the
condition if (is_light).
So on a cluster with heavy pubsub load, a long loop of disconnects is
possible, and we got this:
1. node A establishes cluster link to node B
2. node A propagates PUBLISH to node B
3. node B frees cluster link because of link->node == null as it has not
received non-light messages yet
4. go to 1.
During this loop, subscribers of node B do not receive any messages
published to node A.
So here we want to make sure that a PING was sent (and link->node was
initialized) on this connection before using lightweight messages.
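A sketch of the guard (types abbreviated, helper name hypothetical):
```c
#include <stddef.h>

/* link->node is only assigned once a normal (non-light) packet such as
 * PING has been processed, so it doubles as the readiness signal for
 * lightweight messages. */
typedef struct clusterLink {
    struct clusterNode *node; /* NULL until a normal packet arrives */
    /* ... other fields elided ... */
} clusterLink;

static int linkReadyForLightMsg(const clusterLink *link) {
    return link != NULL && link->node != NULL;
}
```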
---------
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
GEOADD allocates and destroys a string object for "ZADD"
each time it is called. This change uses a shared string instead.
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
This increases the number of times we check for the logs from 20 to 40.
I found that every `wait-for` check takes about 1.75 ms of overhead, so
when we were checking 2000 times with a 1 ms wait we were actually
spending (2000 * 1) + (2000 * 1.75) = 5500 ms waiting.
This can be seen here: for 10 checks we took 35 ms more, so that's
around 1.75 ms per check.
```
Execution time: 2034 ms (failed)
[err]: 20 100 - Test dual-channel: primary tracking replica backlog refcount - start with empty backlog in tests/integration/dual-channel-replication-flaky.tcl
```
That is why increasing it to 40 100 will check for approximately 4,070 ms,
which is still less than the original 5500 ms but should pass every single
time here:
https://github.com/roshkhatri/valkey/actions/runs/19279424967/job/55126976882
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
The AOF preamble mechanism replaces the traditional AOF base file with
an RDB snapshot during rewrite operations, which reduces I/O overhead
and improves loading performance.
However, when valkey loads the RDB-formatted preamble from the base AOF
file, it does not process the replication ID (replid) information within
the RDB AUX fields. This omission has two limitations:
* On a primary, it prevents the primary from accepting PSYNC continue
requests after restarting with a preamble-enabled AOF file.
* On a replica, it prevents the replica from successfully performing
partial sync requests (avoiding full sync) after restarting with a
preamble-enabled AOF file.
To resolve this, this commit aligns the AOF preamble handling with the
logic used for standalone RDB files, by storing the replication ID and
replication offset in the AOF preamble and restoring them when loading
the AOF file.
Resolves #2677
---------
Signed-off-by: arthur.lee <liziang.arthur@bytedance.com>
Signed-off-by: Arthur Lee <arthurkiller@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This commit extends the Module API to expose the `SIMPLE_STRING` and
`ARRAY_NULL` reply types to modules, by passing the new flag `X` to
the `ValkeyModule_Call` function.
By only exposing the new reply types behind a flag, we keep
backward compatibility with existing module implementations and
allow new modules to work with these reply types, which are
required for scripts to correctly process the reply type of commands
called inside scripts.
Before this change, commands like `PING` or `SET`, which return `"OK"`
as a simple string reply, would be returned as string replies to
scripts.
To allow the support of the Lua engine as an external module, we need to
distinguish between simple string and string replies to keep backward
compatibility.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
The new module context flag `VALKEYMODULE_CTX_SCRIPT_EXECUTION` denotes
that the module API function is being called in the context of a
scripting engine execution.
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
clientReplyBlock stores the size of the actual allocation in its size
field (minus the header size). This can be used for more efficient
deallocation with zfree_with_size.
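A sketch of the resulting call (the struct shape is assumed; zfree_with_size and clientReplyBlock are the names mentioned above):
```c
#include <stddef.h>

/* Assumed shape: `size` is the usable payload size, excluding the header. */
typedef struct clientReplyBlock {
    size_t size, used;
    char buf[];
} clientReplyBlock;

/* Free a reply block, handing the allocator the exact allocation size so
 * it can skip the usable-size lookup that a plain zfree would perform.
 * zfree_with_size is declared in zmalloc.h. */
static void freeClientReplyBlock(clientReplyBlock *block) {
    zfree_with_size(block, sizeof(clientReplyBlock) + block->size);
}
```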
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Resolves: https://github.com/valkey-io/valkey/issues/2695
Increase the wait time for the periodic log check for rdb load time.
Also, increase the delay of the log check frequency.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <117414976+roshkhatri@users.noreply.github.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Related test failures:
*** [err]: Replica importing key containment (slot 0 from node 0 to 2) - DBSIZE command excludes importing keys in tests/unit/cluster/cluster-migrateslots.tcl
Expected '1' to match '0' (context: type eval line 2 cmd {assert_match "0" [R $node_idx DBSIZE]} proc ::test)
The reason is that we don't wait for the primary-replica synchronization
to complete before starting the next testcase.
---------
Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In this commit we add a new context for the ACL log entries that is used
to log ACL failures that occur during script execution. To maintain
backward compatibility we still maintain the "lua" context for failures
that happen when running Lua scripts. For other scripting engines the
context description will be just "script".
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
The `README.md` file is currently missing a section on building Valkey with
`fast_float`, which was introduced in Valkey 8.1 as an optional
dependency (#1260).
Signed-off-by: hieu2102 <hieund2102@gmail.com>
Introduce a new config `hash-seed` which can be set only at startup and
controls the hash seed for the server. This includes all hash tables.
This change makes it so that both primaries and replicas will return the
same results for SCAN/HSCAN/ZSCAN/SSCAN cursors. This is useful in order
to make sure SCAN behaves correctly after a failover.
Resolves #4
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In `entry.c`, the `entry` is a block of memory with variable contents.
The structure can be difficult to understand. A new header comment more
clearly documents the contents/layout of the `entry`.
Also, in `entry.h`, the `entry` was defined by `typedef void entry`.
This allows blind casting to the `entry` type. It defeats compiler type
checking.
Even though the `entry` has a variable layout, we can define entry
as a legitimate type which allows the compiler to perform type checking.
By performing `typedef struct _entry entry`, now the `entry` is
understood to be a pointer to some type of undefined structure. We can
pass a pointer and the compiler can typecheck the pointer. (Of course we
can't dereference it, because we haven't actually defined the struct.)
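A small illustration of the difference:
```c
/* Before: any pointer converts to and from `entry *` silently, so the
 * compiler cannot catch mistakes. */
/* typedef void entry; */

/* After: an opaque struct typedef; the struct body is never defined, so
 * the pointer cannot be dereferenced, but it is fully type-checked. */
typedef struct _entry entry;

void entrySetValue(entry *e, int v); /* hypothetical API; `e` is checked */
```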
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
### Summary
Addresses https://github.com/valkey-io/valkey/issues/2619.
This PR extends the `HSETEX` command to support optional key-level `NX`
and `XX` flags, allowing operations conditional on the existence of the
hash key.
### Changes
- Updated `hsetex.json` and regenerated `commands.def`.
- Extended argument parsing for NX/XX.
- Added key-level `NX`/`XX` support in `HSETEX`.
- Added tests covering all four NX/XX scenarios.
---------
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Since Valkey Sentinel 9.0, sentinel tries to abort an ongoing failover
when changing the role of a monitored instance. Because the result of the
command is ignored, the "FAILOVER ABORT" command is sent irrespective of
the actual failover status.
However, when using the documented pre 9.0 ACLs for a dedicated sentinel
user, the FAILOVER command is not allowed and _all_ failover cases fail.
(Additionally, the necessary ACL adaptation was not communicated well.)
Address this by:
- Updating the documentation in "sentinel.conf" to reflect the need for
an adapted ACL
- Only abort a failover when sentinel detects an ongoing (probably
stuck) failover. This means that standard failover and manual failover
continue to work with unchanged pre-9.0 ACLs. Only the new "SENTINEL
FAILOVER COORDINATED" requires adapting the ACL on all Valkey nodes.
- Actually use a dedicated sentinel user and ACLs when testing standard
failover, manual failover, and manual coordinated failover.
Fixes #2779
Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
There’s an issue with the LTRIM command. When LTRIM does not actually
modify the key — for example, with `LTRIM key 0 -1` — the server.dirty
counter is not updated because both ltrim and rtrim values are 0. As a
result, the command is not propagated. However, `signalModifiedKey` is
still called regardless of whether server.dirty changes. This behavior
is unexpected and can cause a mismatch between the source and target
during propagation, since the LTRIM command is not sent.
Signed-off-by: Harry Lin <harrylhl@amazon.com>
Co-authored-by: Harry Lin <harrylhl@amazon.com>
Just setting the authenticated flag actually authenticates to the
default user in this case. The default user may be granted no permission
to use CLUSTER SYNCSLOTS.
Instead, we now authenticate to the NULL/internal user, which grants
access to all commands. This is the same as what we do for replication:
864de555ce/src/replication.c (L4717)
Add a test for this case as well.
Closes #2783
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Note: these changes are part of the effort to run Lua engine as an
external scripting engine module.
The new function `ValkeyModule_ReplyWithCustomErrorFormat` is being
added to the module API to allow scripting engines to return errors that
originated from running commands within the script code, without
counting twice in the error stats counters.
More details on why this is needed by scripting engines can be read in
the message of an older commit, aa856b39f2.
This PR also adds a new test to ensure the correctness of the newly
added function.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Fixes an assert crash in _writeToClient():
```
serverAssert(c->io_last_written.data_len == 0 ||
             c->io_last_written.buf == c->buf);
```
The issue occurs when clientsCronResizeOutputBuffer() grows or
reallocates c->buf while io_last_written still points to the old buffer
and data_len is non-zero. On the next write, both conditions in the
assertion become false.
Reset io_last_written when resizing the output buffer to prevent stale
pointers and keep state consistent.
fixes https://github.com/valkey-io/valkey/issues/2769
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Note: these changes are another step towards being able to run Lua
engine as an external scripting engine module.
In this commit we improve the `ValkeyModule_Call` API function code to
match the validations and behavior of the `scriptCall` function,
currently used by the Lua engine to run commands via the `server.call`
Lua API.
The changes made are backward compatible. The new behavior/validations
are only enabled when calling `ValkeyModule_Call` while running a script
using `EVAL` or `FCALL`.
To test these changes, we improved the `HELLO` dummy scripting engine
module to support calling commands, and compare the behavior with
calling the same command from a Lua script.
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
This commit adds a new `INFO` section called "Scripting Engines" that
shows the information of the current scripting engines available in the
server.
Here's an output example:
```
> info scriptingengines
# Scripting Engines
engines_count:2
engines_total_used_memory:68608
engines_total_memory_overhead:56
engine_0:name=LUA,module=built-in,abi_version=4,used_memory=68608,memory_overhead=16
engine_1:name=HELLO,module=helloengine,abi_version=4,used_memory=0,memory_overhead=40
```
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Skip IPv6 tests automatically when IPv6 is not available.
This fixes the problem that tests fail when IPv6 is not available on the
system, which can worry users when they run `make test`.
IPv6 availability is detected by opening a dummy server socket and
trying to connect to it using a client socket over IPv6.
Fixes #2643
---------
Signed-off-by: diego-ciciani01 <diego.ciciani@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Currently, monotonic clock initialization relies on the model name field
from /proc/cpuinfo to retrieve the clock speed. However, this field is not
always present. When it is not present, measure the clock rate directly
and use that instead.
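A sketch of such a fallback (an assumed approach, not necessarily the exact patch): sample the TSC across a short sleep and derive ticks per microsecond.
```c
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

/* Measure the TSC rate directly when /proc/cpuinfo does not expose it. */
static uint64_t measureTicksPerUs(void) {
    struct timespec delay = {0, 10 * 1000 * 1000}; /* sleep ~10 ms */
    uint64_t start = __rdtsc();
    nanosleep(&delay, NULL);
    uint64_t ticks = __rdtsc() - start;
    return ticks / (10 * 1000); /* elapsed ticks / elapsed microseconds */
}
```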
Before fix:
```
monotonic: x86 linux, unable to determine clock rate
```
After fix:
```
21695:M 25 Oct 2025 20:16:23.168 * monotonic clock: X86 TSC @ 2649 ticks/us
```
Fixes #2774
---------
Signed-off-by: Ken Nam <otherscase@gmail.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
- Moved `server-cpulist`, `bio-cpulist`, `aof-rewrite-cpulist`,
`bgsave-cpulist` configurations to ADVANCED CONFIG.
- Moved `ignore-warnings` configuration to ADVANCED CONFIG.
- Moved `availability-zone` configuration to GENERAL.
These configs were incorrectly placed at the end of the file in the
ACTIVE DEFRAGMENTATION section.
Fixes #2736
---------
Signed-off-by: ritoban23 <ankudutt101@gmail.com>
A super tiny change to optimize the function
`sentinelAskPrimaryStateToOtherSentinels` to early return when the
sentinel does not observe the primary as subjectively down.
Signed-off-by: Zhijun <dszhijun@gmail.com>
This PR introduces the support for implementing remote debuggers in
scripting engines modules.
The module API is extended with scripting engines callbacks and new
functions that can be used by scripting engine modules to implement a
remote debugger.
Most of the code that was used to implement the Lua debugger was
refactored and moved to the `scripting_engine.c` file; only the code
specific to the Lua engine remained in the `debug_lua.c` file.
The `SCRIPT DEBUG (YES|NO|SYNC)` command was extended with an optional
parameter that can be used to specify the engine name where we want to
enable the debugger. If no engine name is specified, the Lua engine is
used, to keep backwards compatibility.
In
[src/valkeymodule.h](https://github.com/valkey-io/valkey/pull/1701/files#diff-b91520205c29a3a5a940786e509b2f13a5e73a1ac2016be773e62ea64c7efb28)
we see the module API changes. And in the `helloscripting.c` file we can
see how to implement a simple debugger for the dummy HELLO scripting
engine.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
This is fixing a minor annoyance when running single tests. With this
change, the runtest script doesn't start more runners than the total
number of tests to run. These are seen in the printouts like `[ready]:
68359`.
Screenshot before:
```
$ ./runtest --single tests/unit/limits.tcl
Cleanup: may take some time... OK
Starting test server at port 21079
[ready]: 68359
Testing unit/limits
[ready]: 68355
[ready]: 68362
[ready]: 68356
[ready]: 68361
[ready]: 68358
[ready]: 68364
[ready]: 68366
[ready]: 68367
[ready]: 68368
[ready]: 68357
[ready]: 68360
[ready]: 68363
[ready]: 68365
[ready]: 68369
[ready]: 68370
[ok]: Check if maxclients works refusing connections (906 ms)
[1/1 done]: unit/limits (1 seconds)
The End
Execution time of different units:
1 seconds - unit/limits
\o/ All tests passed without errors!
Cleanup: may take some time... OK
```
Screenshot after:
```
$ ./runtest --single tests/unit/limits.tcl
Cleanup: may take some time... OK
Starting test server at port 21079
[ready]: 68439
Testing unit/limits
[ok]: Check if maxclients works refusing connections (906 ms)
[1/1 done]: unit/limits (1 seconds)
The End
Execution time of different units:
1 seconds - unit/limits
\o/ All tests passed without errors!
Cleanup: may take some time... OK
```
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When benchmarking with a target throughput using the `--rps` flag,
display an RPS histogram in the benchmark report. This can help identify
whether there is a bottleneck.
Related to #2219.
---------
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
This change makes it so that the module API reference can be cleanly
generated from the module.c file. Most of this seems to be because of
the code formatting work we did. There are two parts:
1. Updated some of the odd corner cases in module.c so they can be
handled. For example, there was a method whose prototype had nothing on
the first line, which was unhandled. None of these are functional changes,
and I think the formatter should be OK with them.
2. Better handle multi-line function prototypes in the ruby code. Before
we just relied on the function prototype being on a single line; now we
handle it across multiple lines. This feels pretty hacked in, and I don't
really understand the rest of the code, but it does work.
Generated this PR:
https://github.com/valkey-io/valkey-doc/pull/374/files.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
The clusterLink->flags field was added in #2310; during the review, we
ultimately chose to add a new CLUSTER_LINK_XXX flag instead of sharing the
old CLUSTER_NODE_XXX flag.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Safe iterators pause rehashing, but don't pause auto shrinking. This
allows stale bucket references which then cause use after free (in this
case, via compactBucketChain on a deleted bucket).
This problem is easily reproducible via atomic slot migration, where we
call delKeysInSlot which relies on calling delete within a safe
iterator. After the fix, it no longer causes a crash.
Since all cases where rehashing is paused expect auto shrinking to also
be paused, I am making this happen automatically as part of pausing
rehashing.
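A sketch of the coupling (field names invented, not the actual hashtable.c code):
```c
typedef struct hashtable {
    int pause_rehash;      /* > 0: no incremental rehash steps */
    int pause_auto_shrink; /* > 0: no automatic shrinking */
    /* ... buckets etc. elided ... */
} hashtable;

/* Pausing rehashing now implicitly pauses auto-shrink too, since a
 * shrink also relocates buckets underneath a safe iterator. */
static void hashtablePauseRehashing(hashtable *ht) {
    ht->pause_rehash++;
    ht->pause_auto_shrink++;
}
```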
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
This was introduced in #1826. It creates an `Uninitialised value was
created by a heap allocation` error in the CI.
Signed-off-by: Binbin <binloveplay1314@qq.com>
When working on #2635 I erroneously duplicated the
setSlotImportingStateInAllDbs call for successful imports. This resulted
in us doubling the key count in the kvstore. This causes DBSIZE to
report an incorrect sum, and also causes bit corruption that can
eventually result in a crash.
The solution is:
1. Only call setSlotImportingStateInAllDbs once (in our
finishSlotMigrationJob function)
2. Make setSlotImportingStateInAllDbs idempotent by checking if the
delete from the kvstore importing hashtable is a no-op
This also fixes a bug where the number of importing keys is not lowered
after the migration, but this is less critical since it is only used
when resizing the dictionary on RDB load. However, it could result in
un-loadable RDBs if the importing key count gets large enough.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
There will be two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```
The first issue is visible in the test output: we can see that after
executing FUNCTION FLUSH, used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576
```
The second issue is a crash when loading functions during the async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
# valkey 255.255.255 crashed by signal: 11, si_code: 2
# Accessing address: 0xe0429b7100000a3c
# Crashed running the instruction at: 0x102e0b09c
------ STACK TRACE ------
EIP:
0 valkey-server 0x0000000102e0b09c luaH_getstr + 52
Backtrace:
0 libsystem_platform.dylib 0x000000018b066584 _sigtramp + 56
1 valkey-server 0x0000000102e01054 luaD_precall + 96
2 valkey-server 0x0000000102e01b10 luaD_call + 104
3 valkey-server 0x0000000102e00d1c luaD_rawrunprotected + 76
4 valkey-server 0x0000000102e01e3c luaD_pcall + 60
5 valkey-server 0x0000000102dfc630 lua_pcall + 300
6 valkey-server 0x0000000102f77770 luaEngineCompileCode + 708
7 valkey-server 0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8 valkey-server 0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9 valkey-server 0x0000000102f70898 functionLoadCommand + 312
10 valkey-server 0x0000000102e3978c call + 416
11 valkey-server 0x0000000102e3b5b8 processCommand + 3340
12 valkey-server 0x0000000102e563cc processInputBuffer + 520
13 valkey-server 0x0000000102e55808 readQueryFromClient + 92
14 valkey-server 0x0000000102f696e0 connSocketEventHandler + 180
15 valkey-server 0x0000000102e20480 aeProcessEvents + 372
16 valkey-server 0x0000000102e4aad0 main + 26412
17 dyld 0x000000018acab154 start + 2476
------ STACK TRACE DONE ------
```
The reason is that, in the old implementation (introduced in 7.0),
FUNCTION FLUSH uses lua_unref to remove the script from the Lua VM.
lua_unref does not trigger the GC, so we are not able to effectively
reclaim memory after a FUNCTION FLUSH.
The other issue is that, since we don't re-create the Lua VM in FUNCTION
FLUSH, loading functions during a FUNCTION FLUSH ASYNC will result in a
crash because the Lua engine state is not thread-safe.
The correct solution is to re-create a new Lua VM, just like SCRIPT FLUSH
does.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Co-authored-by: Ricardo Dias <ricardo.dias@percona.com>
Resolves #2696
The primary issue was that with sanitizer mode, the test needed more
time for the primary's replication buffers to grow beyond `2 × backlog_size`.
Increasing the `repl-timeout` threshold to 30s ensures that the
inactive replica is not disconnected while the full sync is proceeding.
`rdb-key-save-delay` controls or throttles the data written to the
client output buffer, and in this case, we are deterministically able to
perform the fullsync within 10s (10000 keys * 0.001s).
Increasing the `wait_for_condition` gives it enough retries to verify
that `mem_total_replication_buffers` reaches the required `2 ×
backlog_size`.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
#2329 introduced a bug that causes a blocked client in cluster mode to
receive two MOVED redirects instead of one. This was not seen in tests,
except in the reply schema validator.
The fix makes sure the client's pending command is cleared after sending
the MOVED redirect, to prevent it from being reprocessed.
Fixes #2676.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The race condition causes the client to be used and subsequently double
freed by the slot migration read pipe handler. The order of events is:
1. We kill the slot migration child process during CANCELSLOTMIGRATIONS
1. We then free the associated client to the target node
1. Although we kill the child process, it is not guaranteed that the
pipe will be empty from child to parent
1. If the pipe is not empty, we later will read that out in the
slotMigrationPipeReadHandler
1. In the pipe read handler, we attempt to write to the connection. If
writing to the connection fails, we will attempt to free the client
1. However, the client was already freed, so this is a double free
Notably, the slot migration being aborted doesn't need to be triggered
by `CANCELSLOTMIGRATIONS`, it can be any failure.
To solve this, we simply:
1. Set the slot migration pipe connection to NULL whenever it is
unlinked
2. Bail out early in the slot migration pipe read handler if the connection
is NULL (see the sketch after this list)
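A sketch of the guard (names hypothetical, not the actual code):
```c
#include <stddef.h>

typedef struct slotMigrationJob {
    void *conn; /* set to NULL when the target-node client is unlinked */
    /* ... pipe fd, buffers, etc. elided ... */
} slotMigrationJob;

/* Once the connection is gone the read handler becomes a no-op, so bytes
 * left in the pipe by the killed child are never forwarded to a freed
 * client. */
static void slotMigrationPipeReadHandler(slotMigrationJob *job) {
    if (job->conn == NULL) return;
    /* ... read child output from the pipe and write it to job->conn ... */
}
```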
I also consolidate the killSlotMigrationChild call to one code path,
which is executed on client unlink. Before, there were two code paths
that would do this twice (once on slot migration job finish, and once on
client unlink). Sending the signal twice is fine, but inefficient.
Also, add a test to cancel during the slot migration snapshot to make
sure this case is covered (we only caught it during the module test).
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
This test was failing, and causing the next test to throw an exception.
It is failing since we never waited for the slot migration to connect
before proceeding.
Fixes#2692
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
DEBUG LOADAOF sometimes works, but it results in -LOADING responses to the primary, so there are lots of race conditions. It isn't something we should be doing anyway. To test, I just disconnect the replica before loading the AOF, then reconnect it.
Fixes#2712
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Currently, zdiff() becomes a no-op in the case that the first key is
empty.
The existing comment of "Skip everything if the smallest input is empty"
is misleading, as qsort is bypassed for the case of ZDIFF. There's no
guarantee that the first key holds the smallest cardinality.
The real reason for the bypass is the following: ZDIFF computes the difference between the first input sorted set and all successive ones. If the first key is empty, there is nothing left to subtract from an already empty collection, and thus zdiff() becomes a no-op.
Signed-off-by: Kyle Kim <kimkyle@amazon.com>
When `skip` is used inside a test to skip it, running with --dump-logs causes the logs to be dumped even though nothing failed. Introduced in #2342.
The reason is that the "skipped" exception is caught in the start_server proc in tests/support/server.tcl, which is where the $::dump_logs variable is checked.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When the hash object does not exist, FXX should simply fail the check without creating the object, while FNX should trivially succeed.
Note - also fix a potential compilation warning on some compilers that constant-fold a variable-length array when its size is a const expression.
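For context, a minimal illustration of that warning (illustrative, not the patched code): in C, unlike C++, a `const` variable is not a constant expression, so the first array below is formally a VLA even when the compiler can fold the size.
```c
#include <stddef.h>

void example(void) {
    const size_t n = 16;
    char vla_buf[n];   /* formally a VLA in C; some compilers warn here */

    enum { N = 16 };   /* a true constant expression */
    char fixed_buf[N]; /* plain fixed-size array, no warning */

    (void)vla_buf;
    (void)fixed_buf;
}
```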
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
* Add cross version compatibility test to run with Valkey 7.2 and 8.0
* Add mechanism in TCL test to skip tests dynamically - #2711
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
We have relaxed the `cluster-ping-interval` and `cluster-node-timeout` so that the cluster has enough time to stabilize and propagate changes.
Fixes this test's occasional failure when running with valgrind:
[err]: Node #10 should eventually replicate node #5 in tests/unit/cluster/slave-selection.tcl
#10 didn't became slave of #5
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
With #1401, we introduced additional filters to the CLIENT LIST/KILL subcommands. The intended behavior was to pick the last value of each filter. However, we introduced a memory leak for all the preceding filter values.
Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
#0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
#1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
#2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
#3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
#4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
#5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
#6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
#7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
#8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
#9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
#10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
#11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
#12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
#13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
#14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
#15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
#16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```
Note: For the ID / NOT-ID filters we group all the options and perform filtering, whereas for the remaining filters we only pick the last option.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
If we don't wait for the replica to resync, the migration may be
cancelled by the time the replica resyncs, resulting in a test failure
when we can't find the migration on the replica.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Makes it possible to run our tests with TCL 9.
The latest Fedora now ships TCL 9.0 and it's working, including the TCL TLS package. (This wasn't working earlier due to some packaging errors for TCL packages in Fedora, which have since been fixed.)
This PR also removes the custom compilation of TCL 8 used in our Daily
jobs and uses the system default TCL version instead. The TCL version
depends on the OS. For the latest Fedora, you get 9.0, for macOS you get
8.5 and for most other OSes you get 8.6.
The checks for TCL 8.7 are removed, because 8.7 doesn't exist. It was
never released.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Seeing test failures due to this on the 9.0.0 branch:
```
[exception]: Executing test client: ERR Syntax error. Use: LOLWUT [columns rows] [real imaginary].
ERR Syntax error. Use: LOLWUT [columns rows] [real imaginary]
```
It turns out we are just providing the version as an argument, instead of specifying which version to run.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
For now, introduce this and have it do nothing. Eventually, we can use
this to negotiate supported capabilities on either end. Right now, there
is nothing to send or support, so it just accepts it and doesn't reply.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
# Problem
In the current slot migration design, replicas are completely unaware of
the slot migration. Because of this, they do not know to hide importing
keys, which results in exposure of these keys to commands like KEYS,
SCAN, RANDOMKEY, and DBSIZE.
# Design
The main part of the design is that we will now listen for and process
the `SYNCSLOTS ESTABLISH` command on the replica. When a `SYNCSLOTS
ESTABLISH` command is received from the primary, we begin tracking a new
slot import in a special `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
Replicas use this state to track the import, and await a future `SYNCSLOTS FINISH` message that tells them whether the import succeeded or failed.
## Success Case
```
Source Target Target Replica
| | |
|------------ SYNCSLOTS ESTABLISH -------------->| |
| |----- SYNCSLOTS ESTABLISH ------>|
|<-------------------- +OK ----------------------| |
| | |
|~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>| |
| |~~~~~~ forward snapshot ~~~~~~~~>|
|----------- SYNCSLOTS SNAPSHOT-EOF ------------>| |
| | |
|<----------- SYNCSLOTS REQUEST-PAUSE -----------| |
| | |
|~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>| |
| |~~~~~~ forward changes ~~~~~~~~~>|
|--------------- SYNCSLOTS PAUSED -------------->| |
| | |
|<---------- SYNCSLOTS REQUEST-FAILOVER ---------| |
| | |
|---------- SYNCSLOTS FAILOVER-GRANTED --------->| |
| | |
| (performs takeover & |
| propagates topology) |
| | |
| |------- SYNCSLOTS FINISH ------->|
(finds out about topology | |
change & marks migration done) | |
| | |
```
## Failure Case
```
Source Target Target Replica
| | |
|------------ SYNCSLOTS ESTABLISH -------------->| |
| |----- SYNCSLOTS ESTABLISH ------>|
|<-------------------- +OK ----------------------| |
... ... ...
| | |
| <FAILURE> |
| | |
| (performs cleanup) |
| | ~~~~~~ UNLINK <key> ... ~~~~~~~>|
| | |
| | ------ SYNCSLOTS FINISH ------->|
| | |
```
## Full Sync, Partial Sync, and RDB
In order to ensure replicas that resync during the import are still
aware of the import, the slot import is serialized to a new
`cluster-slot-imports` aux field. The encoding includes the job name,
the source node name, and the slot ranges being imported. Upon loading
an RDB with the `cluster-slot-imports` aux field, replicas will add a
new migration in the `SLOT_IMPORT_OCCURRING_ON_PRIMARY` state.
It's important to note that a previously saved RDB file can be used as
the basis for partial sync with a primary. Because of this, whenever we
load an RDB file with the `cluster-slot-imports` aux field, even from
disk, we will still add a new migration to track the import. If after
loading the RDB, the Valkey node is a primary, it will cancel the slot
migration. Having this tracking state loaded on primaries will ensure
that replicas partial syncing to a restarted primary still get their
`SYNCSLOTS FINISH` message in the replication stream.
## AOF
Since AOF cannot be used as the basis for a partial sync, we don't
necessarily need to persist the `SYNCSLOTS ESTABLISH` and `FINISH`
commands to the AOF.
However, considering there is work to change this (#59, #1901), this design doesn't make any assumptions about it.
We will propagate the `ESTABLISH` and `FINISH` commands to the AOF, and
ensure that they can be properly replayed on AOF load to get to the
right state. Similar to RDB, if there are any pending "ESTABLISH"
commands that don't have a "FINISH" afterwards upon becoming primary, we
will make sure to fail those in `verifyClusterConfigWithData`.
Additionally, there was a bug in the existing slot migration where slot
import clients were not having their commands persisted to AOF. This has
been fixed by ensuring we still propagate to AOF even for slot import
clients.
## Promotion & Demotion
Since the primary is solely responsible for cleaning up unowned slots,
primaries that are demoted will not clean up previously active slot
imports. The promoted replica will be responsible for both cleaning up
the slot (`verifyClusterConfigWithData`) and sending a `SYNCSLOTS
FINISH`.
# Other Options Considered
I also considered tracking "dirty" slots rather than using the slot
import state machine. In this setup, primaries and replicas would simply
mark each slot's hashtable in the kvstore as dirty when something is
written to it and we do not currently own that slot.
This approach is simpler, but has a problem in that modules loaded on
the replica would still not get slot migration start/end notifications.
If the modules on the replica do not get such notifications, they will
not be able to properly contain these dirty keys during slot migration
events.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
The CVE fixes had a formatting and external test issue that wasn't
caught because private branches don't run those CI steps.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
New client flags reported by CLIENT INFO and CLIENT LIST:
* `i` for atomic slot migration importing client
* `E` for atomic slot migration exporting client
New flags in the return value of `ValkeyModule_GetContextFlags`:
* `VALKEYMODULE_CTX_FLAGS_SLOT_IMPORT_CLIENT`: Indicates that the client attached to this context is the slot import client.
* `VALKEYMODULE_CTX_FLAGS_SLOT_EXPORT_CLIENT`: Indicates that the client attached to this context is the slot export client.
Users could use this to monitor the underlying client info of the slot
migration, and more clearly understand why they see extra clients during
the migration.
Modules can use these to detect keyspace notifications on import
clients. I am also adding export flags for symmetry, although there
should not be keyspace notifications. But they would potentially be
visible in command filters or in server events triggered by that client.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
**Issue History:**
1. The flaky test issue "defrag didn't stop" was originally detected in
February 2025: https://github.com/valkey-io/valkey/issues/1746
Solution for 1746: https://github.com/valkey-io/valkey/pull/1762
2. Similar issue occurred recently:
https://github.com/valkey-io/valkey/actions/runs/16585350083/job/46909359496#step:5:7640
**Investigation:**
1. First, the issue occurs specifically in the Active Defrag stream test in cluster mode.
2. After investigating `test_stream` in `memefficiency.tcl`, I found the root cause is in the defrag logic rather than the test itself - there were still failing tests with the same error even when I tried different parameters for the test.
3. Then I looked at malloc-stats and identified potential defrag issues, particularly in the 80B bin, where utilization only reaches ~75% after defrag instead of the expected near 100%, while other bins show proper defrag behavior - 80B is actually the size of a new stream (confirmed in `t_stream.c`) that we add during the test.
4. For 80B, after adding 200000 streams and fragmenting, `curregs` = 100030; after a lot of defrag cycles, there are still 122 non-full slabs out of 511 slabs, with the remaining 446 items not defragged (an average of ~4 per non-full slab).
**Detailed malloc-stats:**
- Total slabs: 511
- Non-full slabs: 122
- Full slabs: 511-122=389
- Theoretical maximum per slab: 256 items
- Allocated items in non-full slabs: 100030-389*256=446
- Average items per non-full slab: 446/122=3.66
**Root Cause:**
**There are some immovable items which prevent complete defrag**
**Problems in the old defrag logic:**
1. With the previous condition (we don't defrag if slab utilization > 'avg utilization' * 1.125), the 12.5% threshold doesn't work well at low utilizations.
- Imagine we have 446 items in 122 non-full slabs (avg 3.66 items per non-full slab); say we have 81 slabs with 5 items each plus 41 slabs with 1 item each.
- 12.5% threshold: 3.66*1.125=4.11
- If those 41 single items are immovable, they actually lower the average, so the remaining 81 slabs will be above the threshold (5>4.11) and will not be defragged - defrag didn't stop.
2. The distribution of immovable items across slabs was causing inconsistent defragmentation and flaky test outcomes.
- If those 41 single items are movable, they will be moved and the avg
will be 5, then 12.5% threshold: 5*1.125=5.625, so the rest 81 slabs
will fall below the threshold (5<5.625) and will be defragged - defrag
success.
- This can explain why we got flaky defrag tests.
**Final solution:**
1. Add one more condition before the old logic in `makeDefragDecision` to trigger defragmentation when a slab is less than 1/8 full (the 1/8 threshold (12.5%) was chosen to align with the existing utilization threshold factor). This ensures no low-utilization slabs are left undefragged, and stabilizes the defrag behavior (see the sketch after this list).
2. The reason why we have immovable items, and how to handle them, is going to be investigated later.
3. Be sure to rebuild Valkey before testing it.
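A rough self-contained sketch of the resulting decision (illustrative; the names and shape are assumed, not the actual `makeDefragDecision` code):
```c
#include <stdbool.h>

/* Defrag a slab if it is less than 1/8 full (new absolute rule), or if it
 * falls below the average utilization with the 12.5% tolerance (old
 * relative rule). */
static bool should_defrag_slab(double slab_items, double slab_capacity, double avg_items) {
    if (slab_items < slab_capacity / 8.0) return true; /* new low-utilization rule */
    return slab_items < avg_items * 1.125;             /* pre-existing rule */
}
```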
**Local test result:**
- Before fix:
pass rate 80.8% (63/78)
- After fix:
Test only stream: pass rate 100% (200/200)
Test the whole memefficiency.tcl: pass rate 100% (100/100)
Resolves#2398, the "defrag didn't stop" issue, with help from @JimB123 and @madolson
---------
Signed-off-by: Alina Liu <liusalisa6363@gmail.com>
Signed-off-by: asagegeLiu <liusalisa6363@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
As requested, here is a version of lolwut for 9 that visualizes a Julia
set with ASCII art.
Example:
```
127.0.0.1:6379> lolwut version 9
.............
......................
............................
......:::--:::::::::::::::.......
.....:::=+*@@@=--------=+===--::....
....:::-+@@@@&*+=====+%@@@@@@@@=-::....
.....:::-=+@@@@@%%*++*@@@@@@@@@&*=--::....
.....::--=++#@@@@@@@@##@@@@@@@@@@@@@@=::....
......:-=@&#&@@@@@@@@@@@@@@@@@@@@@@@@@%-::...
......::-+@@@@@@@@@@@@@@@@@@&&@@@#%#&@@@-::....
.......::-=+%@@@@@@@@@@@@@@@@#%%*+++++%@+-:.....
.......::-=@&@@@@@@@@@@@@@@@@&*++=====---::.....
.......:::--*@@@@@@@@@@@@@@@@@%++===----::::.....
........::::-=+*%&@@@@@@@@@&&&%*+==----:::::......
........::::--=+@@@@@@@@@@&##%*++==---:::::.......
.......:::::---=+#@@@@@@@@&&&#%*+==---:::::.......
........:::::---=++*%%#&&@@@@@@@@@+=---::::........
.......:::::----=++*%##&@@@@@@@@@@%+=--::::.......
......::::-----==++#@@@@@@@@@@@@@&%*+=-:::........
......:::---====++*@@@@@@@@@@@@@@@@@@+-:::.......
.....::-=++==+++**%@@@@@@@@@@@@@@@@#*=--::.......
....:-%@@%****%###&@@@@@@@@@@@@@@@@&+--:.......
....:-=@@@@@&@@@@@@@@@@@@@@@@@@@@@@@@=::......
...::+@@@@@@@@@@@@@@@&&@@@@@@@@%**@+-::.....
....::-=+%#@@@@@@@@@&%%%&@@@@@@*==-:::.....
....::--+%@@@@@@@%++==++*#@@@@&=-:::....
....:::-*@**@@+==----==*%@@@@+-:::....
.....:::---::::::::--=+@=--::.....
.........::::::::::::::.......
.........................
..................
...
Ascii representation of Julia set with constant 0.41 + 0.29i
Don't forget to have fun! Valkey ver. 255.255.255
```
You can pass in arbitrary rows and columns (it's best when rows is 2x the number of columns) and an arbitrary Julia constant, so it is repeatable.
In the worst case it takes about ~100us on my M2 MacBook, which should be fine to make sure it's not taking too many system resources.
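For the curious, a minimal self-contained escape-time sampler for the same constant (illustrative only, not the actual lolwut implementation):
```c
#include <complex.h>
#include <stdio.h>

/* Iterate z = z^2 + c over a small grid and map the escape time of each
 * point to a character ramp, as a classic Julia-set renderer does. */
int main(void) {
    const double complex c = 0.41 + 0.29 * I;
    const char *ramp = " .:-=+*#%@"; /* later chars = slower escape */
    for (int row = 0; row < 20; row++) {
        for (int col = 0; col < 40; col++) {
            double complex z = (col - 20) / 10.0 + ((row - 10) / 5.0) * I;
            int i = 0;
            while (i < 9 && cabs(z) < 2.0) {
                z = z * z + c;
                i++;
            }
            putchar(ramp[i]);
        }
        putchar('\n');
    }
    return 0;
}
```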
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Make objectComputeSize account for the key size as well when the key is
a module datatype
fixes: https://github.com/valkey-io/valkey/issues/2660
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
The problem is that ACKs run on a set loop (once every second) and this
will happen every loop with hz 1.
Instead, we can do the ACK after running the main logic. We can also do
an additional optimization where we don't send an ACK from source to
target if we already sent some data this cron loop, since the target
will reset the ack timer on any data over the connection.
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In standalone mode, after a switch over, the command that was originally
blocking on primary returns -REDIRECT instead of -UNBLOCKED when the
client has the redirect capability.
Similarly, in cluster mode, after a switch over, the blocked commands
receive a -MOVED redirect instead of -UNBLOCKED.
After this fix, the handling of blocked connections during a switch over
between standalone and cluster is nearly identical. This can be
summarized as follows:
Standalone:
1. Client that has the redirect capability blocked on the key on the
primary node will receive a -REDIRECT after the switch over completes
instead of -UNBLOCKED.
2. Readonly clients blocked on the primary or replica node will remain
blocked throughout the switch over.
Cluster:
1. Client blocked on the key served by the primary node will receive a
-MOVED instead of a probabilistic -UNBLOCKED error.
2. Readonly clients blocked on the key served by primary or replica node
will remain blocked throughout the switch over.
---------
Signed-off-by: cjx-zar <56825069+cjx-zar@users.noreply.github.com>
Co-authored-by: Simon Baatz <gmbnomis@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This is already handled by `clusterRedirectBlockedClientIfNeeded`. With
the work we are doing in #2329, it makes sense to have an explicit test here
to prevent regression.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
There is a daily test failure in valgrind, which looks like an issue related to
slowness in valgrind mode.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Each insertion of a skiplist node requires generating a random level
(via the `zslRandomLevel` function), and some commands (such as
`zunionstore`) call the `zslRandomLevel` function multiple times.
Therefore, optimizing `zslRandomLevel` can significantly improve the
performance of these commands.
The main optimization approach is as follows:
1. Original logic: Each iteration called the `random` function, with a
0.25 probability of continuing to call `random` again. In the worst-case
scenario, it required up to 32 calls (though the probability of this
occurring is extremely low).
2. Optimized logic: We only need to call the `genrand64_int64` function
once. Each iteration uses only 2 bits of randomness, effectively
achieving the equivalent of 32 iterations in the original algorithm.
3. Additionally, the introduction of `__builtin_clzll` significantly reduces CPU usage, since it compiles into a single, highly efficient CPU instruction (e.g., LZCNT on x86, CLZ on ARM) on supported hardware platforms.
4. Although I've explained a lot, the actual code changes are quite
minimal, so just look at the code directly.
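A minimal sketch of the idea (assuming a 64-bit PRNG such as `genrand64_int64`; the real patch may differ in details). Each leading zero 2-bit pair occurs with probability 1/4, so counting them from one random draw reproduces the original geometric distribution:
```c
#include <stdint.h>

#define ZSKIPLIST_MAXLEVEL 32

uint64_t genrand64_int64(void); /* assumed 64-bit PRNG */

static int zslRandomLevelSketch(void) {
    uint64_t r = genrand64_int64();
    if (r == 0) return ZSKIPLIST_MAXLEVEL; /* __builtin_clzll(0) is undefined */
    /* Each leading zero 2-bit pair is one "level up" with probability 1/4,
     * matching the old loop that kept going with probability 0.25. */
    int level = 1 + __builtin_clzll(r) / 2;
    return level < ZSKIPLIST_MAXLEVEL ? level : ZSKIPLIST_MAXLEVEL;
}
```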
---------
Signed-off-by: chzhoo <czawyx@163.com>
Adding comprehensive unit tests for the SHA-256 implementation.
These tests verify:
1. Basic functionality with known test vectors (e.g., "abc")
2. Handling of large input data (4KB repeated 1000 times)
3. Edge case with repeated single-byte input (1 million 'a' characters)
The tests ensure compatibility with standard SHA-256 implementations and
will help detect regressions during future code changes.
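As an example of the first category, a known-answer test for "abc" might look like this (a sketch; the `sha256_*` API names are assumed, but the expected digest is the standard SHA-256 test vector):
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#include "sha256.h" /* assumed header for the bundled implementation */

void test_sha256_abc(void) {
    uint8_t digest[32];
    SHA256_CTX ctx;
    sha256_init(&ctx);
    sha256_update(&ctx, (const uint8_t *)"abc", 3);
    sha256_final(&ctx, digest);
    /* SHA-256("abc") per the standard test vectors. */
    static const uint8_t expected[32] = {
        0xba, 0x78, 0x16, 0xbf, 0x8f, 0x01, 0xcf, 0xea,
        0x41, 0x41, 0x40, 0xde, 0x5d, 0xae, 0x22, 0x23,
        0xb0, 0x03, 0x61, 0xa3, 0x96, 0x17, 0x7a, 0x9c,
        0xb4, 0x10, 0xff, 0x61, 0xf2, 0x00, 0x15, 0xad,
    };
    assert(memcmp(digest, expected, sizeof(expected)) == 0);
}
```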
We should probably close the correct `slot_migration_pipe_read`. It should resolve the valgrind errors.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Following the decision in #2189, we need to fix this test because the
`extended-redis-compatibility` config option is not going to be
deprecated in 9.0.
This commit changes the test to postpone the deprecation of
`extended-redis-compatibility` until 10.0 release.
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
The reason is that the replication stream may not yet have reached the replica for execution. We could add a wait_for_condition. We decided
to replace those assert calls with assert_replication_stream to verify
the contents of the replication stream rather than the commandstats.
```
*** [err]: Flush slot command propagated to replica in tests/unit/cluster/cluster-flush-slot.tcl
```
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Reduce the request and warmup time to finish within 6 hours, as the GitHub workflow times out after 6 hours.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
When we added atomic slot migration in #1949, we reused a lot of the RDB save code. It was an easier way to implement ASM at first, but it comes with some side effects. For example, we use CHILD_TYPE_RDB to do the fork and we use rdb.c/rdb.h functions to save the snapshot; these mess up the logs (we print logs saying we are doing RDB stuff) and mess up the info fields (we report rdb_bgsave_in_progress when we are actually doing slot migration).
In addition, it makes the code difficult to maintain. The rdb_save method uses a lot of rdb_* variables, but we are actually doing slot migration. If we want to support one fork with multiple target nodes, we need to rewrite this code for a better cleanup.
Note that the changes to rdb.c/rdb.h are reverting previous changes from when we were reusing this code for slot migration. The slot migration snapshot logic is similar to the previous diskless replication: we use a pipe to transfer the snapshot data from the child process to the parent process.
Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in CLUSTER GETSLOTMIGRATIONS command.
- Also add the slot migration fork to the cluster class of latency tracking.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Co-authored-by: Jacob Murphy <jkmurphy@google.com>
Set a free method for the deferred_reply list to properly clean up ClientReplyValue objects when the list is destroyed.
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
With #2604 merged, the `Node #10 should eventually replicate node #5` test started passing successfully with valgrind, but I guess we are seeing a
new daily failure from a `New Master down consecutively` test that runs
shortly after.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Resolves#2545
Followed the steps to reproduce the issue, and was able to get non-zero
`total_net_repl_output_bytes`.
```
(base) ~/workspace/valkey git:[fix-bug-2545]
src/valkey-cli INFO | grep total_net_repl_output_bytes
total_net_repl_output_bytes:1788
```
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
As discussed in https://github.com/valkey-io/valkey/issues/2579
Notably, I am exposing this feature as "Atomic Slot Migration" to
modules. If we want to call it something else, we should consider that
now (e.g. "Replication-Based Slot Migration"?)
Also note: I am exposing both target and source node in the event. It is
not guaranteed that either target or source would be the node the event
fires on (e.g. replicas will fire the event after replica containment is
introduced). Even though it could be potentially inferred from CLUSTER
SLOTS, it should help modules parse it this way. Modules should be able
to infer whether it is occurring on primary/replica from `ctx` flags, so
not duplicating that here.
Closes#2579
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
When we introduced the new hash field expiration functionality, we decided to combine the active expiration job for generic keys and hash fields.
During that job we run a tight loop. In each loop iteration we scan over a maximum of 20 keys (with default expire effort) and try to "expire" them.
For the hash field expiration job, "expiring" a hash key means expiring a maximum of 80 fields (with default expire effort).
The problem is that we might do much more work per iteration of the hash field expiration job.
The current code is shared between the 2 jobs, and currently we only perform a time check every 16 iterations.
As a result, the CPU usage of field active expiration can spike and consume a much higher CPU% than the current 25% bound allows.
Example:
Before this PR
| Scenario | AVG CPU | Time to expire all fields |
|----------------------------------------------------|---------|---------------------------|
| Expiring 10M volatile fields from a single hash | 20.18% | 26 seconds |
| Expiring 10M volatile fields from 10K hash objects | 32.72% | 7 seconds |

After this PR
| Scenario | AVG CPU | Time to expire all fields |
|----------------------------------------------------|---------|---------------------------|
| Expiring 10M volatile fields from a single hash | 20.23% | 26 seconds |
| Expiring 10M volatile fields from 10K hash objects | 20.76% | 11 seconds |
*NOTE*
The change introduced here makes the field job check the time every iteration. We offer a compile-time option to use an efficient time check using TSC (x86) or VCR (ARM) on most modern CPUs, so the impact is expected to be low. Still, in order to avoid degradation for existing workloads, the code change was made so it will not impact the existing generic keys active expiration job.
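A rough self-contained sketch of the new loop shape (the names and the budget are illustrative, not the actual job):
```c
#include <stdint.h>
#include <time.h>

static uint64_t now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000ull + (uint64_t)ts.tv_nsec / 1000;
}

/* The field-expiry loop checks its time budget on every iteration, while
 * the generic-key job keeps its old check-every-16-iterations behavior. */
void active_expire_hash_fields_cycle(uint64_t budget_us) {
    uint64_t start = now_us();
    for (;;) {
        /* ... expire up to 80 fields of one hash here (default effort) ... */
        if (now_us() - start > budget_us) break; /* stay under the CPU bound */
    }
}
```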
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Interactive use of valkey-cli often involves working with long keys
(e.g. MY:INCREDIBLY:LONG:keythattakesalongtimetotype). In shells like
bash, zsh, or psql, users can quickly move the cursor word by word with
**Alt/Option+Left/Right**, **Ctrl+Left/Right** or **Alt/Option+b/f**.
This makes editing long commands much more efficient.
Until now, valkey-cli (via linenoise) only supported single-character
cursor moves, which is painful for frequent key editing.
This patch adds such support, with simple code changes in linenoise. It
now supports both the Meta (Alt/Option) style and CSI (control sequence
introducer) style:
| | Meta style | CSI style (Ctrl) | CSI style (Alt) |
| --------------- | ---------- | ---------------- | --------- |
| move word left | ESC b | ESC [1;5D | ESC [1;3D |
| move word right | ESC f | ESC [1;5C | ESC [1;3C |
Notice that I handle these two styles differently, since people have different preferences for the definition of "what is a word". Specifically, I define:
- "sub-word": just letters and digits. For example "my:namespace:key"
has 3 sub-words. This is handled by Meta style.
- "big-word": as any character that is not space. For example
"my:namespace:key" is just one single big-word. This is handled by CSI
style.
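A small self-contained sketch of the two motions (illustrative; the actual linenoise patch operates on its own line-editing state):
```c
#include <ctype.h>
#include <stddef.h>

/* Move one "sub-word" left: skip separators, then a run of letters and
 * digits. This is the Meta-style (ESC b) behavior described above. */
static size_t move_sub_word_left(const char *buf, size_t pos) {
    while (pos > 0 && !isalnum((unsigned char)buf[pos - 1])) pos--;
    while (pos > 0 && isalnum((unsigned char)buf[pos - 1])) pos--;
    return pos;
}

/* Move one "big-word" left: any run of non-space characters counts as a
 * single word. This is the CSI-style behavior described above. */
static size_t move_big_word_left(const char *buf, size_t pos) {
    while (pos > 0 && isspace((unsigned char)buf[pos - 1])) pos--;
    while (pos > 0 && !isspace((unsigned char)buf[pos - 1])) pos--;
    return pos;
}
```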
## How I verified
I'm using the macOS default terminal (`$TERM = xterm-256color`). I customized the terminal keyboard settings to map Option+Left to `\033b` and Ctrl+Left to `\033[1;5D` so that I can produce both the Meta style and the CSI style. This code change should also work for Linux/BSD/other terminal users.
Now the valkey-cli works like the following. `|` shows where the cursor
is currently at.
Press Alt + left (escape sequence `ESC b` ):
```
set cache:item itemid
|
set cache:item itemid
|
set cache:item itemid
|
set cache:item itemid
|
set cache:item itemid
|
```
Press Ctrl + left (escape sequence `ESC [1;5D` ):
```
set cache:item itemid
|
set cache:item itemid
|
set cache:item itemid
|
set cache:item itemid
|
```
Press Alt + right (escape sequence `ESC f` ):
```
set cache:item itemid
|
set cache:item itemid
|
set cache:item itemid
|
set cache:item itemid
|
set cache:item itemid
|
```
Press Ctrl + right (escape sequence `ESC [1;5C` ):
```
set cache:item itemid
|
set cache:item itemid
|
set cache:item itemid
|
set cache:item itemid
|
```
---------
Signed-off-by: Zhijun <dszhijun@gmail.com>
New config options:
* cluster-announce-client-port
* cluster-announce-client-tls-port
If enabled, clients will always get to see the configured port for a
node instead of the internally announced port(s), the same way that
`cluster-announce-client-ipv4` and `cluster-announce-client-ipv6` work.
Cluster-internal communication uses the non-client variant of these
options.
The configuration is propagated throughout the cluster using new ping
extensions.
Closes#2377
---------
Signed-off-by: Marvin Rösch <marvinroesch99@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In #1604, we attempt to read future Valkey RDB formats, but we rejected
foreign RDB formats, because of the risk that the opcodes and types
added by other projects collide with the new types and opcodes added in
recent Valkey versions.
This change accepts foreign RDB versions but limits the types and
opcodes to the ones that we can understand, to prevent misinterpretation
of types/opcodes which could lead to undefined behavior. If unsupported
RDB types or opcodes are seen, we error out.
Additional changes:
* Improve error reporting when encountering unknown RDB types in relaxed
version check mode.
* Tests for loading various RDB files.
* Improvement to valkey-check-rdb to accept future and foreign RDB
versions, including tests.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
As per #2459, this PR removes the deprecation and `deprecated_since` elements and the `DEPRECATED` doc flag from commands. Closes#2459.
---------
Signed-off-by: Kyle J. Davis <kyledvs@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
`bitops.c`: `serverPopcount()` used `popcountAVX2()`, which as the name implies requires AVX2 support, on AVX-only machines, causing an "illegal instruction" error.
Added a `__builtin_cpu_supports("avx2")` check, falling back to the platform-agnostic version if AVX2 is not supported.
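A sketch of the resulting dispatch (the fallback name is assumed; `popcountAVX2` is the function named above):
```c
#include <stddef.h>

long long popcountAVX2(const void *p, size_t len);   /* AVX2 fast path */
long long popcountScalar(const void *p, size_t len); /* assumed portable fallback */

long long serverPopcountSketch(const void *p, size_t len) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* Runtime CPU feature check: only take the AVX2 path when supported. */
    if (__builtin_cpu_supports("avx2")) return popcountAVX2(p, len);
#endif
    return popcountScalar(p, len);
}
```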
Fixes#2570
Signed-off-by: Ted Lyngmo <ted@lyncon.se>
In Valkey 9.0 we added HFE support, which currently uses a per-slot hashtable to track keys (hash objects) that contain volatile fields. In order to optimize the RDB load, we should use the same method for the expires and generic keys kvstores.
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Delete the out-of-date comment explaining why the latency was changed from 5 to 40 in #2421, which is a leftover of #2553.
This flag.deny_blocking check causes some code to become dead code. Some of the code below checks islua or ismulti, but those paths are marked with flag.deny_blocking and will return early.
In addition, clean up some documentation that was inaccurate, and restore the code of blocking_auth_cb in tests/modules/auth.c; this reply should be returned from the core. The reply used to come from the core, was changed to come from the module in #1819, and now comes from the core again.
Also clean up some dead code around #1819.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In this commit we introduce a new module API event called
`ValkeyModuleEvent_AuthenticationAttempt` to track successful/failed
authentication attempts.
This event will fill a struct, called `ValkeyModuleAuthenticationInfo`, with the client ID of the connection, the username, the module name that handles the authentication event, and the result of the authentication attempt.
Fixes: #2211
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
This commit refactors the scripting engine to support multiple cached
module contexts per engine, rather than relying on a single cached
`ValkeyModuleCtx` object.
Previously, having only one cached context object caused data races over the state stored in the context object: a script that runs for a long time may yield, and the server event loop may then call the `scriptingEngineCallGetMemoryInfo` function to get the scripting engine memory information, which re-uses the same cached context object. Another possible data race is caused by the asynchronous script flush, which calls the `scriptingEngineCallFreeFunction` function in a background thread and also re-uses the cached context object.
To address this, a cache array of module contexts was introduced in the scripting engine structure, with each slot dedicated to a specific use case, such as script execution, memory info queries, or function freeing.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Changed the defrag hit threshold from 512 to 1 in the `defragLaterStep` & `defragStageKvstoreHelper` functions to reduce defrag latency (idea from Jim).
1. Previously, the defrag loop would continue processing up to 512
successful defragmentations before checking if the time limit was
exceeded. Now it checks after every single successful allocation move.
2. The trade-off is slightly more frequent time checks, but the time
check (~19ns) is negligible compared to the actual defragmentation work
(even a single 8-byte reallocation takes ~43ns and the
allocatorShouldDefrag function call takes ~49ns per block). This
overhead is minimal compared to the latency improvement gained from
better time management during active defragmentation.
3. Also revert the change from
https://github.com/valkey-io/valkey/pull/2421/files to test.
4. Solved a compilation issue with unsigned types by changing the type of the local variable `prev_defragged` to match the type of `server.stat_active_defrag_hits`.
Closes#2444
---------
Signed-off-by: Alina Liu <alinalq@dev-dsk-alinalq-2b-2db84246.us-west-2.amazon.com>
Co-authored-by: Alina Liu <alinalq@dev-dsk-alinalq-2b-2db84246.us-west-2.amazon.com>
In scripts, the client is replaced with its caller, so the commandlog needs to use the metrics of the client that is currently executing the command.
Previously, CONFIG RESETSTATS only reset the slot statistics in the cluster part; this PR makes it reset the cluster bus message stats as well.
Additionally, we also reset stat_cluster_links_buffer_limit_exceeded.
Now we will reset:
- cluster_stats_messages_sent*
- cluster_stats_messages_received*
- total_cluster_links_buffer_limit_exceeded
Closes#2439.
Signed-off-by: Hongrui <2086160503@qq.com>
With the old SLOT_EXPORT_AUTHENTICATING state added in #1949, the source node sends the AUTH command and then reads the response synchronously. If the target node is blocked during this process, the source node will also be blocked. We should use a read handler to handle this, so we split SLOT_EXPORT_AUTHENTICATING into SLOT_EXPORT_SEND_AUTH and SLOT_EXPORT_READ_AUTH_RESPONSE to avoid this issue.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If all cluster nodes have functions, slot migration will fail, since the target will return a "function already exists" error when doing the FUNCTION LOAD.
And in addition, the target's replica could panic when it executes
the FUNCTION LOAD propagated from the primary (see
propagation-error-behavior).
Introduced in #1949.
Signed-off-by: Binbin <binloveplay1314@qq.com>
It prints an extra ": " after the message, which is a bit weird; I thought it was printing cluster-bus port information, but it was not.
```
Moving slot 5458 from 127.0.0.1:30001 to 127.0.0.1:30003:
Moving slot 5459 from 127.0.0.1:30001 to 127.0.0.1:30003:
Moving slot 5460 from 127.0.0.1:30001 to 127.0.0.1:30003:
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
Try to fix the failures seen for `test "PSYNC2 #3899 regression: verify
consistency"`.
This change resets the query buffer parser state in
`replicationCachePrimary()` which is called when the connection to the
primary is lost. Before #2092, this was done by `resetClient()`.
The solution was inspired by the discussion about the regression
mentioned (discussion from 2017) and the related commits from that time:
6bc6bd4c38,
469d6e2b37,
c180bc7d98.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Tiny change. It failed once for me (a little time passed and it returned 2 seconds instead of 3), so I figured it's probably a little flaky for others too.
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Both LMOVE and BLMOVE can return null values when the source key is empty.
PR changes include
- change the LMOVE reply_schema to include the possibility of a nil
return value
- Add comment to BLMOVE reply_schema to indicate it can return nil
because the source does not exist
This fixes#2532
---------
Signed-off-by: Adam Fowler <adamfowler71@gmail.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
This gets rid of the need to use a void* as a carrier for the worker number. Instead, a pointer to the relevant worker's data is passed to the started thread.
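A minimal sketch of the pattern (illustrative names, not the actual worker code):
```c
#include <pthread.h>
#include <stddef.h>

typedef struct worker {
    int id;
    /* ... per-worker state ... */
} worker;

static void *worker_main(void *arg) {
    worker *w = arg; /* direct pointer to this worker's data */
    (void)w;
    return NULL;
}

/* Instead of smuggling an index through the void* argument, e.g.
 * pthread_create(&tid, NULL, worker_main, (void *)(intptr_t)i),
 * pass &workers[i] so the thread receives its data directly. */
static int start_worker(pthread_t *tid, worker *w) {
    return pthread_create(tid, NULL, worker_main, w);
}
```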
Fixes#2529
---------
Signed-off-by: Ted Lyngmo <ted@lyncon.se>
In C99, we had to use `#define static_assert(expr, lit) extern char __static_assert_failure[(expr) ? 1 : -1]` for static assertions. However, we now have native support for static assertions with _Static_assert. Previously, one of the correct #defines was evaluated first, setting it to _Static_assert; however, after
65215e5378
the first include defining the symbol was "serverassert.h", which contained the old style.
This change removes all unnecessary includes and always defines static_assert as _Static_assert.
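For illustration, the two styles side by side (a sketch of the pattern, not the exact header):
```c
/* C11 native form, now used unconditionally: */
#define static_assert _Static_assert

/* Old C99 fallback: a negative array size forces a compile error when the
 * expression is false:
 * #define static_assert(expr, lit) \
 *     extern char __static_assert_failure[(expr) ? 1 : -1] */

static_assert(sizeof(long long) == 8, "long long must be 8 bytes");
```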
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This may result in a meaningless slot migration job; we should return an error to the user in advance to avoid an operational error.
Also, `by myself` is not correct English grammar and `myself` is internal code terminology, so it was changed to `by this node`.
This was introduced in #1949.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Instead of parsing only one command per client before executing it,
parse multiple commands from the query buffer and batch-prefetch the
keys accessed by the commands in the queue before executing them.
This is an optimization for pipelined commands, both with and without
I/O threads. The optimization is currently disabled for the replication
stream, due to failures (probably caused by how the replication offset
is calculated based on the query buffer offset).
* When parsing commands from the input buffer, multiple commands are
parsed and stored in a command queue per client.
* In single-threaded mode (I/O threads off) keys are batch-prefetched
before the commands in the queue are executed. Multi-key commands like
MGET, MSET and DEL benefit from this even if pipelining is not used.
* When I/O threads are used, prefetching is done for multiple clients in parallel. This code takes client command queues into account, improving prefetching when pipelining is used. The batch size is controlled by the existing config `prefetch-batch-max-size` (default 16), which so far was only used together with I/O threads. The config is moved to a different section in `valkey.conf`.
* When I/O threads are used and the maximum number of keys has been prefetched, a client's command is executed, then the next one in the queue, etc. If there are more commands in the queue for which the keys have not been prefetched (say the client sends 16 pipelined MGET with 16 keys in each), keys for the next few commands in the queue are prefetched before the command is executed, if prefetching has not been done for the next command. (This utilizes the code path used in single-threaded mode; see the sketch below.)
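A minimal illustration of the batch-prefetch idea (an assumed shape, not the actual Valkey code): issue memory prefetches for all queued keys first so the cache misses overlap, then execute the commands in order.
```c
#include <stddef.h>

typedef struct queuedCommand {
    const void *key; /* first key the command will access (simplified) */
    /* ... argv, argc ... */
} queuedCommand;

static void prefetch_command_keys(queuedCommand *queue, size_t n, size_t batch_max) {
    if (n > batch_max) n = batch_max; /* e.g. prefetch-batch-max-size */
    for (size_t i = 0; i < n; i++) {
        /* Hint the CPU to start loading the key's memory for reading. */
        __builtin_prefetch(queue[i].key, 0 /* read */, 1 /* low locality */);
    }
}
```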
Code improvements:
* Decoupling of command parser state and command execution state:
* The variables reqtype, multibulklen and bulklen refer to the current
position in the query buffer. These are no longer reset in resetClient
(which runs after each command being executed). Instead, they are
reset in the parser code after each completely parsed command.
* The command parser code is partially decoupled from the client struct.
The query buffer is still one per client, but the resulting argument
vector is stored in caller-defined variables.
Fixes#2044
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
I believe this was a debug log at the time, and its printing was quite annoying locally. The test is quite simple, so I think we can just remove it.
Signed-off-by: Binbin <binloveplay1314@qq.com>
# Adds On-demand Benchmark Workflow
This PR introduces a new GitHub Actions workflow that enables on-demand
performance benchmarking for pull requests through label-based triggers.
This uses the new framework
[valkey-perf-benchmark](https://github.com/valkey-io/valkey-perf-benchmark)
developed for standard benchmarks and PR benchmarking.
## What is being added by this PR
- **Workflow File**: `.github/workflows/benchmark_on_label.yml`
- **Trigger**: Activated when specific labels are added to PRs
- **Supported Labels**:
- `run-benchmark` - Runs standard benchmarks
## Features
### On-Demand Execution
- Benchmarks run only when explicitly requested via PR labels
- No automatic execution to avoid unnecessary resource usage
### Performance Comparison
- Compares PR performance against the `unstable` baseline
- Generates detailed comparison reports
- Posts results directly as PR comments for easy review
### Flexible Configuration
- Currently uses GitHub runners but will use dedicated performance runners (`ec2-perf-ubuntu-24`)
- Configurable benchmark suites via JSON config files
### Artifact Management
- Uploads benchmark results as workflow artifacts
- Preserves both baseline and PR metrics for analysis
- Includes comparison markdown for offline review
### Automatic Cleanup
- Removes trigger labels after execution (success or failure)
- Prevents accidental re-runs from stale labels
## Usage
To run benchmarks on a PR:
1. Add the `run-benchmark` label for standalone and cluster, non-tls
mode
2. Wait for the workflow to complete
3. Review results in the automated PR comment
This workflow enhances our CI/CD pipeline by providing easy access to
performance testing without impacting regular development workflows.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <117414976+roshkhatri@users.noreply.github.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
* Add a flag `VALKEYMODULE_CLIENTINFO_FLAG_READONLY` to
`ValkeyModuleClientInfo.flags` and set it in
`ValkeyModule_GetClientInfoById()`.
* Add an optimization for accessing the current client by ID, to avoid
looking it up in a radix tree.
Closes#2487
---------
Signed-off-by: Allen Samuels <allenss@amazon.com>
## Problem
Test `Cluster module receive message API -
VM_RegisterClusterMessageReceiver` fails sporadically in CI with
"Expected '2' to be equal to '1'". The test failed in Daily CI 4 days
ago but hasn't failed since, indicating flaky behavior that I cannot
reproduce locally.
## Hypothesis
The failing line `assert_equal 2 [count_log_message 0 "* <cluster> DONG
(type 2) RECEIVED*"]` counts DONG message entries in node 0's log file
and expects exactly 2. The failure suggests a possible race condition
where there's a timing gap between when cluster statistics are updated
and when the corresponding log entries become visible in the log file.
## Solution
Add `wait_for_condition` to ensure log messages are written before
checking count:
```tcl
wait_for_condition 50 100 {
[count_log_message 0 "* <cluster> DONG (type 2) RECEIVED*"] eq 2
} else {
fail "node 1 didn't log 2 DONG messages"
}
```
---------
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
In the current implementation, the second list was never rewound once it had been iterated. So for the first element of `ranges1`, `ranges2` was iterated fully, but when the second element of `ranges1` was processed, `ranges2` was not rewound again.
With this change, for every element of `ranges1`, we start from the beginning of `ranges2`.
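A fragment sketching the fixed nested iteration, using the adlist-style iterator API (illustrative; it assumes `ranges1`/`ranges2` are lists of range objects):
```c
listIter iter2;
listNode *node2;

for (size_t i = 0; i < num_ranges1; i++) {
    listRewind(ranges2, &iter2); /* the fix: rewind the inner iterator every time */
    while ((node2 = listNext(&iter2)) != NULL) {
        /* ... compare the i-th range of ranges1 with node2's range ... */
    }
}
```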
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Command: `./runtest --single unit/bitops --loops 3`
Unstable
```
[ignore]: large memory flag not provided
[-1/1 done]: unit/bitops (4 seconds)
[ignore]: large memory flag not provided
[0/1 done]: unit/bitops (4 seconds)
[ignore]: large memory flag not provided
[1/1 done]: unit/bitops (4 seconds)
The End
Execution time of different units:
4 seconds - unit/bitops
4 seconds - unit/bitops
4 seconds - unit/bitops
```
After fix
```
[1/3 done]: unit/bitops (4 seconds)
[ignore]: large memory flag not provided
[2/3 done]: unit/bitops (4 seconds)
[ignore]: large memory flag not provided
[3/3 done]: unit/bitops (4 seconds)
The End
Execution time of different units:
4 seconds - unit/bitops
4 seconds - unit/bitops
4 seconds - unit/bitops
```
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Currently, HSETEX verifies that the number of fields matches the declared field count by using division. Instead, it can compare against the multiplication, in order to prevent the verified value from being rounded down.
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Fixes: #2460
With this change we avoid divergence between the cluster and replication layer views. I've observed that a node can be marked as a primary in the cluster view while being marked as a replica in the replication layer view, with an active replication link. Without this change, we used to end up with an invalid replica chain in the replication layer.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
This change avoids additional failure report creation if the node is
already marked as failed. The failure report count has never been used
after a node has been marked as failed. So, there is no value addition
in maintaining it further. This reduces operation of both add and delete
failure report. Hence, the performance benefit.
We can observe an avg. 10% reduction in p99 CPU utilization with this change in a 2000-node cluster (1000 primaries / 1000 replicas) with 330 nodes in a failed state.
---------
Signed-off-by: Seungmin Lee <sungming@amazon.com>
Fixes#2372
## Description
This PR adds an automated workflow that assigns PR authors to their own
pull requests when opened or reopened.
## Changes
- Added `.github/workflows/auto-author-assign.yml` workflow
- Uses `toshimaru/auto-author-assign@v2.1.1` action
- Triggers on `pull_request_target` events (opened, reopened)
- Requires `pull-requests: write` permission
## Benefits
- Improves PR tracking and organization
- Automatically assigns responsibility to PR authors
- Reduces manual assignment overhead for maintainers
- Helps contributors track their own PRs more easily
## Testing
✅ **Tested on my fork before submission:**
1. Merged the workflow into my fork's unstable branch
2. Created a test PR within my fork
3. Verified that I was automatically assigned as the assignee
4. Screenshot showing successful auto-assignment:
<img width="1278" height="684" alt="Screenshot 2025-08-01 at 3 39 05 PM"
src="https://github.com/user-attachments/assets/9ad643be-5eac-4ad6-bec7-184cf22e9cbd"
/>
The workflow executed successfully and assigned me to the test PR as
expected.
---------
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
For kvstoreHashtableExpand to work, we need to make sure the hashtable exists. Currently, when processing RDB slot-info, our expand has no effect because the hashtable does not exist yet (we initialize it only when we need it).
We also update kvstoreExpand to use kvstoreHashtableExpand, to make sure there is only one code path. Also see #1199 for more details.
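A fragment sketching the create-on-demand guard (illustrative; the helper names are assumed, not the actual kvstore API):
```c
/* Ensure the per-slot hashtable exists before expanding it; otherwise the
 * expand silently does nothing. */
int kvstoreHashtableExpandSketch(kvstore *kvs, int didx, size_t size) {
    hashtable *ht = kvstoreGetHashtable(kvs, didx);         /* assumed accessor */
    if (ht == NULL) ht = kvstoreCreateHashtable(kvs, didx); /* assumed creator */
    return hashtableExpand(ht, size);
}
```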
Signed-off-by: Binbin <binloveplay1314@qq.com>
Several fixes:
1. Fix not using a bool input type for hashTypeIgnoreTTL - probably lost during the 3 HFE PR merges.
2. Remove the vset change of hashtable encoding to a single entry - the current code is a bug. The entry is probably expired (or about to expire soon), so we can leave it as a hashtable until it does.
3. Enable incremental rehashing for the volatile item keys kvstore - this is the central issue of this PR. Without it, active expiry might not scan the kvstore, which is very fragmented with lots of empty buckets.
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
In clusters with a very short node timeout such as 2-3 seconds, the
extra failover delay of 500-1000 milliseconds (500 + random value 0-500;
total 750 on average) before initiating a failover is a significant
extra downtime to the cluster. This PR makes this delay relative to node
timeout, using a shorter failover delay for a smaller configured node
timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`.
| Node timeout | Fixed failover delay |
|---------------|----------------------|
| 15000 or more | 500 (same as before) |
| 7500 | 250 |
| 3000 | 100 |
| 1500 | 50 |
Additional change: Add an extra 500ms delay for new replicas that may not yet know about the other replicas. This avoids the scenario where a new replica with no data wins the failover. This change turned out to be needed for the stability of some test cases.
The purposes of the failover delay are
1. Allow FAIL to propagate to the voting primaries in the cluster
2. Allow replicas to exchange their offsets, so they will have a correct
view of their own rank.
A third (undocumented) purpose of this delay is to allow newly added replicas to discover other replicas in the cluster via gossip and to compute their rank, to realize they are not the best replica. This case is mitigated by adding another 500ms delay to new replicas, i.e. if a replica has replication offset 0.
A low node timeout only makes sense in fast networks, so we can assume
that the above needs less time than in a cluster with a higher node
timeout.
These delays don't affect the correctness of the algorithm. They are just there to increase the probability that a failover will succeed, by making sure that the FAIL message has enough time to propagate in the cluster; the random part is there to reduce the probability that two replicas initiate the failover at the same time.
The typical use case is when data consistency matters and writes can't
be skipped. For example, in some application, we buffer writes in the
application during node failures to be able to apply them when the
failover is completed. The application can't buffer them for a very long
time, so we need the cluster to be up again within e.g. 5 seconds from
the time a node starts to fail.
I hope this PR can be considered safer than #2227, although the two
changes are orthogonal.
Part of issue #2023.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this problem.
Introduced in #1949.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Previously, the config names and values were stored in a temporary dict.
This ensures that no duplicates are returned, but it also makes the
order random.
In this PR, the config names and values are still stored in the temporary dict, but they are then copied to an array, which is sorted before the reply is sent.
Resolves#2042
---------
Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
We recently introduced a template for creating `test failure` issues. This change makes the template visible in the `CONTRIBUTING.md` file. Also added a tip to paste the stack trace, since the output behind CI links can expire.
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Fix https://github.com/valkey-io/valkey/issues/2438
Modified `DingReceiver` function in `tests/modules/cluster.c` by adding
null-termination logic for cross-version compatibility
---------
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
aefed3d363/src/networking.c (L2279-L2293)
From the above code, we can see that `c->repl_data->ref_block_pos` could be equal to `o->used`.
When `o->used == o->size`, we may call SSL_write() with num=0, which does not comply with the OpenSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)
What's worse is that it's still the case after the reconnection. See
aefed3d363/src/replication.c (L756-L769).
So in this case the replica will keep reconnecting again and again until it no longer meets the requirements for partial synchronization.
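A minimal guard illustrating the fix idea (a sketch, not the actual patch):
```c
#include <openssl/ssl.h>

/* Per the OpenSSL docs, calling SSL_write() with num=0 is warned against,
 * so treat an empty write as a no-op instead of passing it through. */
static int safe_ssl_write(SSL *ssl, const void *buf, int num) {
    if (num <= 0) return 0;
    return SSL_write(ssl, buf, num);
}
```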
Resolves#2119
---------
Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
* Use pipelines of length 1000 instead of up to 200000.
* Use CLIENT REPLY OFF instead of reading and discarding the replies.
Fixes#2205
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Similar to dicts, we disallow resizing while the hashtable is rehashing. In the previous code, if a resize was triggered during rehashing (for example, if the rehashing wasn't fast enough), we would spin in a while loop until the rehashing was complete, which could be a potential issue when resizing.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The change will ensure that the slot is present on the node before the slot is populated. This avoids errors while populating the slot.
Resolves#2480
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Previously, each slot migration was logged individually, which could
lead to log spam in scenarios where many slots are migrated at once.
This commit enhances the logging mechanism to group consecutive slot
migrations into a single log entry, improving log readability and
reducing noise.
Log snippets
```
1661951:S 13 Aug 2025 15:47:10.132 * Slot range [16383, 16383] is migrated from node c3926da75f7c3a0a1bcd07e088b0bde09d48024c () in shard 7746b693330c0814178b90b757e2711ebb8c6609 to node 2465c29c8afb9231525e281e5825684d0bb79f7b () in shard 39342c039d2a6c7ef0ff96314b230dfd7737d646.
1661951:S 13 Aug 2025 15:47:10.289 * Slot range [10924, 16383] is migrated from node 2465c29c8afb9231525e281e5825684d0bb79f7b () in shard 39342c039d2a6c7ef0ff96314b230dfd7737d646 to node c3926da75f7c3a0a1bcd07e088b0bde09d48024c () in shard 7746b693330c0814178b90b757e2711ebb8c6609.
1661951:S 13 Aug 2025 15:47:10.524 * Slot range [10924, 16383] is migrated from node c3926da75f7c3a0a1bcd07e088b0bde09d48024c () in shard 7746b693330c0814178b90b757e2711ebb8c6609 to node 2465c29c8afb9231525e281e5825684d0bb79f7b () in shard 39342c039d2a6c7ef0ff96314b230dfd7737d646.
```
---------
Signed-off-by: Ping Xie <pingxie@google.com>
In #2431 we changed the assert to an if condition, and the test caused some trouble. Now we just remove the assert (if condition) and disable the test for now, due to #2441.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Currently, HSETEX always generates an `hset` notification. In order to
align with the generic `set` command, it should only generate `hset` if
the provided time-to-live is a valid future time.
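A hedged sketch of the intended check; the helper name and its parameters are assumptions for illustration, not the actual implementation:
```
/* Only emit the "hset" keyspace event when the requested expiration
 * lies in the future relative to the current time; an already-expired
 * TTL produces no "hset" notification. */
static void maybeNotifyHset(long long expire_at_ms, long long now_ms) {
    if (expire_at_ms > now_ms) {
        /* TTL is a valid future time, e.g.
         * notifyKeyspaceEvent(NOTIFY_HASH, "hset", key, dbid); */
    }
    /* Otherwise skip the "hset" event entirely. */
}
```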
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
(1) The old logic may result in the RDMA event being acknowledged
unexpectedly in the following two scenarios (see the sketch below):
* ibv_get_cq_event gets an EAGAIN error.
* ibv_get_cq_event gets one event, but it may be acked multiple times in
the poll-CQ loop.
(2) In the benchmark results of Valkey over RDMA, the tail latency is as
high as 177 milliseconds (almost 80x that of TCP). This results from an
incorrect benchmark client setup that includes the connection setup time
in the benchmark latency recording. This patch fixes the tail latency
issue by modifying valkey-benchmark.c. The change only affects benchmarks
over RDMA, as the updates are guarded by the USE_RDMA macro.
There are further updates to Valkey over RDMA coming, but I am willing to
create separate pull requests for them.
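A sketch of the corrected acknowledgment pattern for scenario (1), using the standard libibverbs calls with a non-blocking completion channel; error handling is reduced for brevity and this is not the actual Valkey RDMA code:
```
#include <errno.h>
#include <infiniband/verbs.h>

/* Ack only the events that ibv_get_cq_event() actually returned, and
 * ack each one exactly once, regardless of how many completions the
 * poll-CQ loop drains. */
static int drainCompletionEvents(struct ibv_comp_channel *ch) {
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    unsigned int nevents = 0;

    if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx) != 0) {
        if (errno == EAGAIN) return 0; /* No event arrived: nothing to ack. */
        return -1;
    }
    nevents++;
    ibv_req_notify_cq(ev_cq, 0); /* Re-arm notifications before polling. */

    struct ibv_wc wc;
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
        /* ...handle the completion; do NOT ack per completion... */
    }
    ibv_ack_cq_events(ev_cq, nevents); /* Ack exactly the events we got. */
    return 0;
}
```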
---------
Signed-off-by: Ruihong Wang <ruihong@google.com>
@@ -16,6 +16,7 @@ The Valkey project is led by a Technical Steering Committee, whose responsibilit
* Found a bug? [Report it here](https://github.com/valkey-io/valkey/issues/new?template=bug_report.md&title=%5BBUG%5D)
* Valkey crashed? [Submit a crash report here](https://github.com/valkey-io/valkey/issues/new?template=crash_report.md&title=%5BCRASH%5D+%3Cshort+description%3E)
* Suggest a new feature? [Post your detailed feature request here](https://github.com/valkey-io/valkey/issues/new?template=feature_request.md&title=%5BNEW%5D)
* Report a test failure? [Report it here](https://github.com/valkey-io/valkey/issues/new?template=test-failure.md)
* Want to help with documentation? [Move on to valkey-doc](https://github.com/valkey-io/valkey-doc)
* Report a vulnerability? See [SECURITY.md](SECURITY.md)
@@ -4,7 +4,7 @@ The Valkey project is managed by a Technical Steering Committee (TSC) composed o
The Valkey project includes all of the current and future repositories under the Valkey-io organization.
Committers are defined as individuals with write access to the code within a repository.
Maintainers are defined as individuals with full access to a repository and own its governance.
-Both maintainers and committers should be clearly listed in the MAINTAINERS.md file in a given projects repository.
+Both maintainers and committers shall be clearly listed in the MAINTAINERS.md file in a given project's repository.
Maintainers of other repositories within the Valkey project are not members of the TSC unless explicitly added.
## Technical Steering Committee
@@ -12,10 +12,15 @@ Maintainers of other repositories within the Valkey project are not members of t
The TSC is responsible for oversight of all technical, project, approval, and policy matters for Valkey.
The TSC members are listed in the [MAINTAINERS.md](MAINTAINERS.md) file in the Valkey repository.
Maintainers (and accordingly, TSC members) may be added or removed by no less than 2/3 affirmative vote of the current TSC.
At any time, no more than one third (1/3) of the TSC members may be employees, contractors, or representatives of the same organization or affiliated organizations.
For the purposes of this document, “organization” includes companies, corporations, universities, research institutes, non-profits, governmental institutions, and any of their subsidiaries or affiliates.
If, at any time, the 1/3 organization limit is exceeded (for example, due to changes in employment, company acquisitions, or organizational affiliations), the TSC shall be notified as soon as possible.
The TSC must promptly take action to restore compliance, which may include removing or reassigning members in accordance with the procedures outlined in the [Termination of Membership](#termination-of-membership) section.
The TSC shall strive to resolve the situation within 30 days of notification, and document the steps taken to restore compliance.
The TSC shall appoint a Chair responsible for organizing TSC meetings.
If the TSC Chair is removed from the TSC (or the Chair steps down from that role), it is the responsibility of the TSC to appoint a new Chair.
The TSC can amend this governance document by no less than a 2/3 affirmative vote.
The TSC may, at its discretion, add or remove members who are not maintainers of the main Valkey repository.
The TSC may, at its discretion, add or remove maintainers from other repositories within the Valkey project.
@@ -29,16 +34,33 @@ The TSC shall document evidence of consensus in accordance with these requiremen
If consensus cannot be reached, the TSC shall make the decision by a vote.
A vote shall also be called when an issue or pull request is marked as a major decision, which are decisions that have a significant impact on the Valkey architecture or design.
Examples of major decisions:
* Fundamental changes to the Valkey core datastructures
* Adding a new data structure or API
* Changes that affect backward compatibility
* New user visible fields that need to be maintained
* Modifications to the TSC or other governance documents
* Adding members to other roles within the Valkey project
* Delegation of maintainership for projects to other groups or individuals
* Adding or removing a new external library such as a client
or module to the project.
### Technical Major Decisions
Technical major decisions include:
* Fundamental changes to the Valkey core datastructures
* Adding a new data structure or API
* Changes that affect backward compatibility
* New user visible fields that need to be maintained
* Adding or removing a new external library such as a client or module to the project when it affects runtime behavior
Technical major decisions shall be approved by a simple majority vote whenever one can be obtained.
If a simple majority cannot be reached within a two-week voting period, and no TSC member has voted against, the decision may instead be approved through explicit “+2” support from at least two TSC members, recorded on the relevant issue or pull request.
If the pull request author or issue proposer is a TSC member, their +1 counts toward the +2.
If any TSC member casts a negative vote, the decision must follow the simple majority voting process and cannot be approved through +2.
Once a technical major decision has been approved through the +2 mechanism, any subsequent concerns shall be raised through a new major decision process; +2 approvals are not retracted directly.
### Governance Major Decisions
Governance major decisions include:
* Adding TSC members or involuntary removal of TSC members
* Modifying this governance document
* Delegation of maintainership for projects or governance authority
* Creating, modifying, or removing roles within the Valkey project
* Any change that alters voting rules, TSC responsibilities, or project oversight
* Structural changes to the TSC, including composition limits
Governance major decisions shall require approval by a super-majority vote of at least two thirds (2/3) of the entire TSC.
Any member of the TSC can call a vote with reasonable notice to the TSC, setting out a discussion period and a separate voting period.
Any discussion may be conducted in person or electronically by text, voice, or video.
@@ -46,23 +68,24 @@ The discussion shall be open to the public, with the notable exception of discus
In any vote, each voting TSC member will have one vote.
The TSC shall give at least two weeks for all members to submit their vote.
Except as specifically noted elsewhere in this document, decisions by vote require a simple majority vote of all voting members.
-It is the responsibility of the TSC chair to help facilitate the voting process as needed to make sure it completes within the voting period.
+If a vote results in a tie, the status quo is preserved.
+It is the responsibility of the TSC Chair to help facilitate the voting process as needed to make sure it completes within the voting period.
## Termination of Membership
A maintainer's access (and accordingly, their position on the TSC) will be removed if any of the following occur:
* Involuntary Removal: Removal via the [Governance Major Decision](#governance-major-decisions) voting process.
* Resignation: Written notice of resignation to the TSC.
* TSC Vote: 2/3 affirmative vote of the TSC to remove a member
* Unreachable Member: If a member is unresponsive for more than six months, the remaining active members of the TSC may vote to remove the unreachable member by a simple majority.
## Technical direction for other Valkey projects
The TSC may delegate decision making for other projects within the Valkey organization to the maintainers responsible for those projects.
-Delegation of decision making for a project is considered a major decision, and shall happen with an explicit vote.
+Delegation of decision making for a project is considered a [Governance Major Decision](#governance-major-decisions).
Projects within the Valkey organization must indicate the individuals with commit permissions by updating the MAINTAINERS.md within their repositories.
-The TSC may, at its discretion, overrule the decisions made by other projects within the Valkey organization, although they should show restraint in doing so.
+The TSC may, at its discretion, overrule the decisions made by other projects within the Valkey organization, although they shall show restraint in doing so.