Makes it possible to run our tests with TCL 9.
The latest Fedora now ships TCL 9.0, and it works, including the
TCL TLS package. (This wasn't working earlier due to packaging
errors for TCL packages in Fedora, which have since been fixed.)
This PR also removes the custom compilation of TCL 8 used in our Daily
jobs and uses the system default TCL version instead. The TCL version
depends on the OS. For the latest Fedora, you get 9.0, for macOS you get
8.5 and for most other OSes you get 8.6.
The checks for TCL 8.7 are removed, because 8.7 doesn't exist. It was
never released.
Backport of #1673.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The CVE fixes had a formatting and external test issue that wasn't
caught because private branches don't run those CI steps.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
daea05b1e2/src/networking.c (L886-L886)
To fix the issue, we need to ensure that the subtraction `prev->size -
prev->used` does not underflow. This can be achieved by explicitly
checking that `prev->used` is less than `prev->size` before performing
the subtraction. This approach avoids relying on unsigned arithmetic and
keeps the logic clear and robust.
The specific changes (illustrated by the sketch after this list) are:
1. Replace the condition `prev->size - prev->used > 0` with `prev->used
< prev->size`.
2. This change ensures that the logic checks whether there is remaining
space in the buffer without risking underflow.
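For illustration, a minimal self-contained example (the struct below is a made-up stand-in, not the actual definition in networking.c) showing why the rewritten check is safer if `used` ever exceeds `size`:
```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical reply-buffer block; field names mirror the text above,
 * not the real networking.c struct. */
struct reply_block {
    size_t size; /* total allocated bytes */
    size_t used; /* bytes already written */
};

int main(void) {
    struct reply_block prev = {.size = 16, .used = 32}; /* corrupted/impossible state */

    /* Old form: the unsigned subtraction wraps around instead of going
     * negative, so the check claims there is free space. */
    printf("size - used > 0  -> %d\n", prev.size - prev.used > 0); /* prints 1 */

    /* New form: no subtraction, no wrap-around; correctly reports no space. */
    printf("used < size      -> %d\n", prev.used < prev.size);     /* prints 0 */
    return 0;
}
```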
**References**
[INT02-C. Understand integer conversion
rules](https://wiki.sei.cmu.edu/confluence/display/c/INT02-C.+Understand+integer+conversion+rules)
[CWE-191](https://cwe.mitre.org/data/definitions/191.html)
---
Signed-off-by: Zeroday BYTE <github@zerodaysec.org>
When reading RDB files with information about the number of keys per cluster slot,
we need to create the dicts if they don't exist.
Currently, when processing RDB slot-info, our expand has no effect because the dict
does not exist (we initialize it only when we need it).
We also update kvstoreExpand to use kvstoreDictExpand, to make
sure there is only one code path. Also see #1199 for more details.
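For illustration, a minimal self-contained sketch of the intended flow; the names and signatures below are made up and do not match the real kvstore.c API:
```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kvstore/dict types; this only
 * illustrates the create-before-expand flow described above. */
typedef struct dict { unsigned long buckets; } dict;
typedef struct kvstore { dict **dicts; int num_dicts; } kvstore;

static dict *dictCreateStub(void) { return calloc(1, sizeof(dict)); }
static void dictExpandStub(dict *d, unsigned long size) { if (d->buckets < size) d->buckets = size; }

/* Expand one per-slot dict, creating it first if it does not exist yet.
 * Before the fix, the expand was silently a no-op for missing dicts. */
static void kvstoreDictExpandStub(kvstore *kvs, int didx, unsigned long size) {
    if (kvs->dicts[didx] == NULL) kvs->dicts[didx] = dictCreateStub();
    dictExpandStub(kvs->dicts[didx], size);
}

/* The kvstore-wide expand funnels through the same helper, keeping a single code path. */
static void kvstoreExpandStub(kvstore *kvs, unsigned long size) {
    for (int didx = 0; didx < kvs->num_dicts; didx++) kvstoreDictExpandStub(kvs, didx, size);
}
```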
Signed-off-by: Binbin <binloveplay1314@qq.com>
Fixes #2171
Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This should be + instead of *, otherwise it does not make any sense.
Otherwise we would be counting 20 more bytes for each prefix rax
node in a 64-bit build.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Based on @enjoy-binbin's suggestion on #1611, I made the change to find
an available port. The test has been passing in the daily tests in my
local repo.
Resolves #1611
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
After #1545 disabled some tests for reply schema validation, we now have
another issue: ECHO is not covered.
```
WARNING! The following commands were not hit at all:
echo
ERROR! at least one command was not hit by the tests
```
This patch adds a test case for ECHO in the unit/other test suite. I
haven't checked if there are more commands that aren't covered.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Since we paused the primary node earlier, the replica may enter a
cluster-down state due to the primary being marked as pfail. Allow
reads here to prevent subsequent read errors.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
ASan now supports making sure you are passing in the correct pointer
type, which seems useful, but we can't support it since we pass in an
incorrect pointer in several places. This is most commonly done with
generic free functions, where we simply cast it to the correct type.
It's not a lot of code to clean up, so it seems appropriate to clean it up
instead of disabling the check.
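For illustration, a self-contained sketch of the general pattern being cleaned up (names are hypothetical and the real call sites differ): calling a typed destructor through a generic `void (*)(void *)` pointer is what the pointer-type check flags, while a thin wrapper that casts inside is fine:
```c
#include <stdlib.h>

/* Illustrative container callback type; names are hypothetical. */
typedef void (*free_fn)(void *);

typedef struct entry {
    char *payload;
} entry;

/* Typed destructor. */
static void entryFree(entry *e) {
    free(e->payload);
    free(e);
}

/* Flagged pattern: the typed destructor is cast to the generic callback type
 * and later invoked through a mismatched function pointer type, which is what
 * the pointer-type check described above reports. */
static free_fn flagged_callback = (free_fn)entryFree;

/* Clean pattern: a thin wrapper takes void * and casts the data pointer
 * inside, so every call goes through a correctly typed function pointer. */
static void entryFreeWrapper(void *ptr) {
    entryFree((entry *)ptr);
}
static free_fn clean_callback = entryFreeWrapper;

int main(void) {
    entry *e = calloc(1, sizeof(*e));
    clean_callback(e);      /* fine under the check */
    (void)flagged_callback; /* kept only to show the pattern; not called */
    return 0;
}
```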
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The commands used in the valkey-cli tests are not important for reply schema
validation. Skip them to avoid the problem of tests hanging. This has
failed lately in the daily job:
```
[TIMEOUT]: clients state report follows.
sock55fedcc19be0 => (IN PROGRESS) valkey-cli pubsub mode with single standard channel subscription
Killing still running Valkey server 33357
```
These test cases use a special valkey-cli command, `:get pubsub`,
which is internal to valkey-cli rather than a Valkey server
command. This command hangs when compiled with logreqres enabled.
The easy solution is to skip the tests in this setup.
The test cases were introduced in #1432.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
After #1009, we reset the election when we receive a claim
with an equal or higher epoch, since a node can have won an
election in the past.
But we need to consider the time before the node actually
obtains the failover_auth_epoch. The failover_auth_epoch
defaults to 0, so before the node actually gets the failover
epoch, we might wrongly reset the election.
This is probably harmless, but it produces misleading log
output and may delay the election by a cron cycle or a beforesleep.
Now we only reset the election once a node has actually
obtained the failover epoch.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
If we become an empty primary for some reason, we still need to
check whether we need to delete dirty slots, because we may have dirty
slot data left over from a bad migration. For example, the target node
may forcibly execute CLUSTER SETSLOT NODE to take over the slot without
performing key migration.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
If a manual failover times out, for example because the election did not
get enough votes, then, since we have an auth_timeout and an auth_retry_time,
a new manual failover will not be able to proceed on the replica side.
For example, if we initiate a new manual failover after an election timed out,
we pause the primary, but on the replica side, due to retry_time, the
replica does not trigger a new election and the manual failover
eventually times out.
In this case, if we initiate a manual failover again and there is an
ongoing election, we reset it so that the replica can initiate
a new election at the manual failover's request.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In the CLUSTER FAILOVER FORCE case, we set mf_can_start to
1 and wait for a cron to trigger the election. We can also set a
CLUSTER_TODO_HANDLE_MANUALFAILOVER flag so that we
can start the election as soon as possible instead of waiting for
the cron, avoiding a delay of up to 100ms (clusterCron).
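A rough sketch of the idea, using the field and flag names mentioned above but otherwise illustrative only (not the actual cluster_legacy.c code):
```c
/* Rough sketch only: the flag name comes from the text above, everything else
 * (bit value, struct layout) is illustrative. */
#define CLUSTER_TODO_HANDLE_MANUALFAILOVER (1 << 4) /* illustrative bit value */

struct cluster_state_sketch {
    int mf_can_start;      /* manual failover is allowed to start */
    int todo_before_sleep; /* work scheduled for beforeSleep */
};

/* Instead of waiting for the next clusterCron() tick (up to ~100ms), mark the
 * manual failover as ready and schedule it to be handled in beforeSleep. */
static void startForcedManualFailoverSketch(struct cluster_state_sketch *cs) {
    cs->mf_can_start = 1;
    cs->todo_before_sleep |= CLUSTER_TODO_HANDLE_MANUALFAILOVER;
}
```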
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The old reqEpoch mainly refers to requestCurrentEpoch, see:
```
if (requestCurrentEpoch < server.cluster->currentEpoch) {
    serverLog(LL_WARNING, "Failover auth denied to %.40s (%s): reqEpoch (%llu) < curEpoch(%llu)", node->name,
              node->human_nodename, (unsigned long long)requestCurrentEpoch,
              (unsigned long long)server.cluster->currentEpoch);
    return;
}
```
But here we refer to requestConfigEpoch, which is a bit misleading,
so change it to reqConfigEpoch to make it clear.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
A minor debugging change that helped in the investigation of
https://github.com/valkey-io/valkey/issues/1251. Basically there are
some edge cases where we want to fully isolate a node from receiving
packets, but we can't suspend the process because we need it to continue
sending outbound traffic. So, added a filter for that.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The new test was added in #2178. Obviously there may be
pending reads on the connection, so there may be a race
in the DROP-CLUSTER-PACKET-FILTER part, causing the test
to fail. Add CLOSE-CLUSTER-LINK-ON-PACKET-DROP to ensure
that the replica does not process the packet.
Signed-off-by: Binbin <binloveplay1314@qq.com>
(cherry picked from commit 2019337e74)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When modifying port or tls-port through config set, we need to call
clusterUpdateMyselfAnnouncedPorts to update myself's ports, otherwise
CLUSTER SLOTS/NODES will return stale information from myself's perspective.
In addition, in some places, such as clusterUpdateMyselfAnnouncedPorts
and clusterUpdateMyselfIp, a beforeSleep save is added so that the
new IP info can be written to nodes.conf.
Remove clearCachedClusterSlotsResponse from updateClusterAnnouncedPort,
since we now add the beforeSleep save in clusterUpdateMyselfAnnouncedPorts,
and it will call clearCachedClusterSlotsResponse.
Signed-off-by: Binbin <binloveplay1314@qq.com>
(cherry picked from commit 5cc2b25753)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When the primary changes the config epoch and then goes down immediately,
the replica may not update the config epoch in time. Although we
broadcast the change in the cluster (see #1813), there may be a race in
the network or in the code. In this case, the replica will never finish
the failover, since the other primaries refuse to vote because the
replica's slot config epoch is old.
We need a way to allow the replica to finish the failover in this case.
When a primary refuses to vote because the replica's config epoch is
less than the dead primary's config epoch, it can send an UPDATE packet
to the replica to inform it about the dead primary. The UPDATE
message contains the dead primary's config epoch and owned slots.
The failover will time out, but the replica can later try again with
the updated config epoch and succeed.
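A rough sketch of the voting-side idea described above; all names below are illustrative stand-ins, not the actual cluster code:
```c
/* Rough sketch only: types and names are made up. */
typedef struct node_stub {
    unsigned long long configEpoch; /* config epoch of the dead primary */
} node_stub;

/* The UPDATE packet is assumed to carry the config epoch and owned slots. */
typedef void (*send_update_fn)(node_stub *dead_primary);

/* When denying the vote because the replica's claimed config epoch is older
 * than the dead primary's, also push an UPDATE about that primary so the
 * replica can retry with a fresh config epoch and succeed next time. */
static int denyVoteButInformReplica(unsigned long long requestConfigEpoch,
                                    node_stub *dead_primary,
                                    send_update_fn sendUpdateToReplica) {
    if (requestConfigEpoch < dead_primary->configEpoch) {
        sendUpdateToReplica(dead_primary); /* new behavior: inform before refusing */
        return 0;                          /* vote still denied this round */
    }
    return 1; /* otherwise continue with normal vote handling */
}
```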
Fixes #2169.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
(cherry picked from commit 476671be19)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Resolves https://github.com/valkey-io/valkey/issues/2145
Incorporate the CVE patch that was sent to us by Redis Ltd.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
(cherry picked from commit 73696bf6e2)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Correctly use a 32-bit integer for accumulating the length of ping
extensions.
The current code may accidentally truncate the length of an
extension that is greater than 64kb and fail the validation check. We
don't currently emit any extensions of this length, but if we were
to do so in the future, older nodes (without this fix) would silently
drop packets from newer nodes. We should backport this to all versions.
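For illustration, a self-contained example of the truncation (not the actual packet-validation code):
```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* An extension slightly larger than 64kb. */
    uint32_t extension_len = 70 * 1024;

    uint16_t narrow_total = (uint16_t)extension_len; /* wraps: 71680 - 65536 = 6144 */
    uint32_t wide_total = extension_len;             /* keeps the full value */

    printf("16-bit accumulator: %u\n", (unsigned)narrow_total); /* 6144: validation is fooled */
    printf("32-bit accumulator: %u\n", (unsigned)wide_total);   /* 71680: correct */
    return 0;
}
```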
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
(cherry picked from commit 30d7f08a4e)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When creating an outgoing TLS connection, we don't check if `SSL_new()`
returned NULL.
Without this patch, the check was done only for incoming connections in
`connCreateAcceptedTLS()`. This patch moves the check to
`createTLSConnection()` which is used both for incoming and outgoing
connections.
This check makes sure we fail the connection before going any further.
E.g. when `connCreate()` is followed by `connConnect()`, the latter
returns `C_ERR`, which is commonly detected where outgoing connections
are established, such as where a replica connects to a primary.
```c
int connectWithPrimary(void) {
    server.repl_transfer_s = connCreate(connTypeOfReplication());
    if (connConnect(server.repl_transfer_s, server.primary_host, server.primary_port, server.bind_source_addr,
                    server.repl_mptcp, syncWithPrimary) == C_ERR) {
        serverLog(LL_WARNING, "Unable to connect to PRIMARY: %s", connGetLastError(server.repl_transfer_s));
        connClose(server.repl_transfer_s);
        server.repl_transfer_s = NULL;
        return C_ERR;
    }
```
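For illustration, a minimal sketch of the added check, using a hypothetical connection wrapper rather than the actual tls.c code:
```c
#include <openssl/ssl.h>
#include <stdlib.h>

/* Minimal sketch, not the actual tls.c code: a hypothetical connection wrapper
 * that records an error state when SSL_new() fails, so a later connect attempt
 * can return an error instead of dereferencing a NULL SSL handle. */
typedef struct tls_conn {
    SSL *ssl;
    int broken; /* set when creation failed; connect/read/write then bail out */
} tls_conn;

static tls_conn *tlsConnCreateSketch(SSL_CTX *ctx) {
    tls_conn *conn = calloc(1, sizeof(*conn));
    if (conn == NULL) return NULL;
    conn->ssl = SSL_new(ctx); /* may return NULL, e.g. on allocation failure */
    if (conn->ssl == NULL) conn->broken = 1; /* fail early, before any I/O is attempted */
    return conn;
}
```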
For a more thorough explanation, see
https://github.com/valkey-io/valkey/issues/1939#issuecomment-2912177877.
Might fix #1939.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
(cherry picked from commit 17e66863a5)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When a client is blocked by something like `CLIENT PAUSE`, we should not
allow `CLIENT UNBLOCK timeout` to unblock it, since some blocking types
do not have a timeout callback and it would trigger a panic in the core;
people should use `CLIENT UNPAUSE` to unblock it.
Using `CLIENT UNBLOCK error` is not right either: it returns an UNBLOCKED
error to the blocked command, and people don't expect a `SET` command to
get an error.
So in this commit, in these cases, we return 0 to `CLIENT UNBLOCK`
to indicate that the unblock failed. The reasoning is that if
a command doesn't expect to be timed out, it also doesn't expect to be
unblocked by `CLIENT UNBLOCK`.
With the old behavior, the following commands would trigger a panic for
`timeout` and return an UNBLOCKED error for `error`. Under the new
behavior, CLIENT UNBLOCK returns 0.
```
client 1> client pause 100000 write
client 2> set x x
client 1> client unblock 2 timeout
or
client 1> client unblock 2 error
```
Potentially breaking change: `CLIENT UNBLOCK error` was previously allowed.
Fixes #2111.
Signed-off-by: Binbin <binloveplay1314@qq.com>
(cherry picked from commit 3bc40be6cd)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This commit fixes two issues in `pubsubUnsubscribeChannel` that could
lead to memory corruption:
1. When calculating the slot for a channel, we were using getKeySlot()
which might use the current_client's slot if available. This is
problematic when a client kills another client (e.g., via CLIENT KILL
command) as the slot won't match the channel's actual slot.
2. The `found` variable was not initialized to `NULL`, causing the
serverAssert to potentially pass incorrectly when the hashtable lookup
failed, leading to memory corruption in subsequent operations.
The fix (sketched below):
- Calculate the slot directly from the channel name using keyHashSlot()
instead of relying on the current client's slot
- Initialize 'found' to NULL
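A rough sketch of the two changes, using made-up stand-ins for keyHashSlot() and the hashtable API (not the actual pubsub.c code):
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; they only exist to make the two
 * changes above concrete. */
#define SLOT_COUNT 16384

static unsigned int slotForChannel(const char *name, size_t len) {
    /* Placeholder hash, NOT the real CRC16 hash-tag slot computation. */
    unsigned int h = 5381;
    for (size_t i = 0; i < len; i++) h = h * 33 + (unsigned char)name[i];
    return h % SLOT_COUNT;
}

typedef struct hashtable hashtable;
static int hashtableFindStub(hashtable *ht, const char *key, void **found) {
    (void)ht; (void)key; (void)found;
    return 0; /* stub: pretend the lookup failed */
}

static void unsubscribeShardChannelSketch(hashtable **slot_channels, const char *channel) {
    /* 1. Derive the slot from the channel name itself, never from the current
     *    client's cached slot (during CLIENT KILL that would be the killer's
     *    slot, not the victim's). */
    unsigned int slot = slotForChannel(channel, strlen(channel));

    /* 2. Initialize 'found' so a failed lookup cannot leave garbage behind
     *    that lets an assertion pass and corrupts memory later. */
    void *found = NULL;
    if (hashtableFindStub(slot_channels[slot], channel, &found) && found != NULL) {
        /* ... remove the client from the channel's subscriber set ... */
    }
}
```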
Added a test case that reproduces the issue by having one client kill
another client that is subscribed to a sharded pubsub channel during a
transaction.
Crash log (after initializing the variable 'found' to NULL; without the
initialization, memory corruption could occur):
```
VALKEY BUG REPORT START: Cut & paste starting from here ===
59707:M 24 May 2025 23:10:40.429 # === ASSERTION FAILED CLIENT CONTEXT ===
59707:M 24 May 2025 23:10:40.429 # client->flags = 108086391057154048
59707:M 24 May 2025 23:10:40.429 # client->conn = fd=11
59707:M 24 May 2025 23:10:40.429 # client->argc = 0
59707:M 24 May 2025 23:10:40.429 # === RECURSIVE ASSERTION FAILED ===
59707:M 24 May 2025 23:10:40.429 # ==> pubsub.c:348 'found' is not true
------ STACK TRACE ------
Backtrace:
0 valkey-server 0x0000000104974054 _serverAssertWithInfo + 112
1 valkey-server 0x000000010496c7fc pubsubUnsubscribeChannel + 268
2 valkey-server 0x000000010496cea0 pubsubUnsubscribeAllChannelsInternal + 216
3 valkey-server 0x000000010496c2e0 pubsubUnsubscribeShardAllChannels + 76
4 valkey-server 0x000000010496c1d4 freeClientPubSubData + 60
5 valkey-server 0x00000001048f3cbc freeClient + 792
6 valkey-server 0x0000000104900870 clientKillCommand + 356
7 valkey-server 0x00000001048d1790 call + 428
8 valkey-server 0x000000010496ef4c execCommand + 872
9 valkey-server 0x00000001048d1790 call + 428
10 valkey-server 0x00000001048d3a44 processCommand + 5056
11 valkey-server 0x00000001048fdc20 processCommandAndResetClient + 64
12 valkey-server 0x00000001048fdeac processInputBuffer + 276
13 valkey-server 0x00000001048f2ff0 readQueryFromClient + 148
14 valkey-server 0x0000000104a182e8 callHandler + 60
15 valkey-server 0x0000000104a1731c connSocketEventHandler + 488
16 valkey-server 0x00000001048b5e80 aeProcessEvents + 820
17 valkey-server 0x00000001048b6598 aeMain + 64
18 valkey-server 0x00000001048dcecc main + 4084
19 dyld 0x0000000186b34274 start + 2840
```
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
(cherry picked from commit bd5dcb2819)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When we refactored the blocking framework we introduced the client
reprocessing infrastructure. In case the client was blocked on keys, it
will attempt to reprocess the command. One challenge was to keep track
of the command timeout, since we are reprocessing and do not want to
re-register the client with a fresh timeout each time. The solution was
to consider the client reprocessing flag when the client is
blockedOnKeys:
```
if (!c->flag.reprocessing_command) {
    /* If the client is re-processing the command, we do not set the timeout
     * because we need to retain the client's original timeout. */
    c->bstate->timeout = timeout;
}
```
However, this introduced a new issue. There are cases where the client
goes through consecutive blocks of different types, for example:
```
CLIENT PAUSE 10000 ALL
BZPOPMAX zset 1
```
would leave the client blocked on the zset endlessly if nothing is
written to it.
**Credits to @uriyage for locating this with his fuzzer testing**
The suggested solution is to only flag the client when it is
specifically unblocked on keys.
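A rough sketch of that direction, with made-up types and field names (not the actual blocked.c code):
```c
/* Rough sketch only: enum and field names are made up. */
typedef enum { BLOCKED_NONE, BLOCKED_KEYS, BLOCKED_PAUSE, BLOCKED_WAIT } block_type;

typedef struct blocked_client_sketch {
    block_type btype;
    int reprocessing_command;
} blocked_client_sketch;

/* Only a client that was blocked on keys is re-queued with the reprocessing
 * flag (so its original timeout is retained). A client coming out of a pause
 * or any other block type is not flagged, so a later key-based block gets a
 * fresh timeout instead of none at all. */
static void unblockClientSketch(blocked_client_sketch *c) {
    c->reprocessing_command = (c->btype == BLOCKED_KEYS);
    c->btype = BLOCKED_NONE;
}
```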
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
(cherry picked from commit d00fb8e713)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This commit introduces a mechanism to track client authentication state
with a new `ever_authenticated` flag. It refactors client authentication
handling by adding a `clientSetUser` function that properly sets both
the authenticated and `ever_authenticated` flags.
The implementation limits output buffer size for clients that have never
been authenticated.
Added tests to verify the output buffer limiting behavior for
unauthenticated clients.
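A rough sketch of the described refactor, with made-up struct layouts (not the actual server.h/acl.c code):
```c
/* Rough sketch only: struct layouts are made up. */
typedef struct user user; /* opaque ACL user */

typedef struct client_auth_sketch {
    user *user;
    int authenticated;      /* currently authenticated */
    int ever_authenticated; /* sticky: set once, never cleared */
} client_auth_sketch;

/* One place to attach a user to a client so both flags stay in sync. Clients
 * with ever_authenticated == 0 get their output buffer size capped. */
static void clientSetUserSketch(client_auth_sketch *c, user *u, int authenticated) {
    c->user = u;
    c->authenticated = authenticated;
    if (authenticated) c->ever_authenticated = 1;
}
```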
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
As part of #1463, I made a small refactor between the PR and the daily
test I submitted to try to improve readability by adding a function to
abstract the extraction of the message types. However, that change
apparently caused GCC to throw another warning, so reverting the
abstraction on just one line.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Fix two problems in fedora CI jobs:
1. Install awk where missing. It's required for building jemalloc.
2. Fix problems with TCL, required for running the tests. Fedora comes
with TCL 9 by default, but the TLS package 'tcltls' isn't built for TCL 9.
Install 'tcl8' and build 'tcltls' from source.
---------
Signed-off-by: vitah <vitahlin@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>