Commit Graph

823 Commits

Author SHA1 Message Date
Binbin 51f871ae52
Strictly check CRLF when parsing querybuf (#2872)
Currently, when parsing querybuf, we do not check for CRLF;
instead, we assume the last two characters are CRLF by default,
as shown in the following example:
```
telnet 127.0.0.1 6379
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
*3
$3
set
$3
key
$5
value12
+OK
get key
$5
value

*3
$3
set
$3
key
$5
value12345
+OK
-ERR unknown command '345', with args beginning with:
```

This should actually be considered a protocol error. When a bug
occurs in the client-side implementation, we may execute incorrect
requests (writing incorrect data is the most serious of these).
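
As a minimal sketch of what a strict check could look like (the helper name `check_bulk_terminator` and its result codes are hypothetical, not taken from `networking.c`), a bulk payload of the declared length has to be followed by exactly `\r\n`:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum bulk_status { BULK_NEED_MORE = 0, BULK_OK = 1, BULK_PROTO_ERR = -1 };

/* Check that a bulk payload of `bulklen` bytes at the start of `buf`
 * is followed by exactly "\r\n". `avail` is the number of readable bytes. */
enum bulk_status check_bulk_terminator(const char *buf, size_t avail, size_t bulklen) {
    assert(buf != NULL);
    /* Payload plus CRLF not fully read yet (written to avoid overflow). */
    if (avail < 2 || avail - 2 < bulklen) return BULK_NEED_MORE;
    /* Strict check: the two bytes after the payload must be CR and LF. */
    if (buf[bulklen] != '\r' || buf[bulklen + 1] != '\n') return BULK_PROTO_ERR;
    return BULK_OK;
}
```

With this check, the `value12` payload from the session above would be rejected as a protocol error instead of being stored as `value`.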

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-12-14 09:10:57 +02:00
bandalgomsu 2a9a1731ee
Fix CLUSTER SLOTS crash when called from module timer callback (#2915)
The CLUSTER SLOTS reply depends on whether the client is connected over
IPv6, but for a fake client there is no connection and when this command
is called from a module timer callback or other scenario where no real
client is involved, there is no connection to check IPv6 support on.
This fix handles the missing case by returning the reply for IPv4
connected clients.
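
The fallback can be sketched roughly as follows. The struct layouts and the function name `reply_uses_ipv6` are simplified assumptions for illustration and do not mirror Valkey's real client or connection types:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types for this sketch; not Valkey's real structs. */
struct connection { int is_ipv6; };
struct client { struct connection *conn; /* NULL for a fake client */ };

/* Returns 1 if the reply should use IPv6 addresses, 0 for IPv4.
 * A client without a connection falls back to the IPv4 reply. */
int reply_uses_ipv6(const struct client *c) {
    assert(c != NULL);
    if (c->conn == NULL) return 0; /* fake client, e.g. module timer callback */
    return c->conn->is_ipv6;
}
```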

Fixes #2912.

---------

Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Su Ko <rhtn1128@gmail.com>
Co-authored-by: KarthikSubbarao <karthikrs2021@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-12-09 14:02:35 +01:00
Binbin 48e0cbbb41
Fix commandlog large-reply when using reply copy avoidance (#2652)
In #2078, we did not report large reply when copy avoidance is allowed.
This results in replies larger than 16384 not being recorded in the
commandlog large-reply. This 16384 is controlled by the hidden config
min-string-size-avoid-copy-reply.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-12-09 10:20:45 +08:00
Binbin 29d3244937
Make the COB soft limit also use repl-backlog-size when its value is smaller (#2866)
We have the same settings for the hard limit, and we should apply them to the soft
limit as well. When the `repl-backlog-size` value is larger, all replication buffers
can be handled by the replication backlog, so there's no need to worry about the
client output buffer soft limit in here. Furthermore, when `soft_seconds` is 0, in
some ways, the soft limit behaves the same (mostly) as the hard limit.
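
A rough sketch of the rule, assuming a hypothetical helper `effective_replica_limit` that is applied to both the hard and the soft byte limit (a limit of 0 is treated as "disabled" here and left untouched):

```c
#include <assert.h>
#include <stddef.h>

/* If a non-zero limit is smaller than the replication backlog size,
 * use the backlog size instead. A limit of 0 stays 0 (disabled). */
size_t effective_replica_limit(size_t configured_limit, size_t repl_backlog_size) {
    if (configured_limit != 0 && configured_limit < repl_backlog_size) {
        return repl_backlog_size;
    }
    return configured_limit;
}
```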

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-12-08 12:21:34 +08:00
Binbin d16788e52d
Fix discarded-qualifiers warnings reported by fedorarawhide (#2874)
fedorarawhide CI reports these warnings:
```
networking.c: In function 'afterErrorReply':
networking.c:821:30: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
  821 |             char *spaceloc = memchr(s, ' ', len < 32 ? len : 32);
```
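
One plausible way to address this kind of warning is to keep the `const` qualifier on the result pointer. The sketch below illustrates that idea with a hypothetical helper `error_code_len`; it is not necessarily the change made in this commit:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the text before the first space in `s`, searching at most
 * the first 32 bytes. Returns 0 if no space is found in that range. */
size_t error_code_len(const char *s, size_t len) {
    assert(s != NULL);
    size_t limit = len < 32 ? len : 32;
    /* `s` is const, so the pointer into it is declared const as well. */
    const char *spaceloc = memchr(s, ' ', limit);
    return spaceloc ? (size_t)(spaceloc - s) : 0;
}
```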

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-11-27 10:39:42 +08:00
Vadym Khoptynets 65ab07dde7
Leverage zfree_with_size for client reply blocks (#2624)
clientReplyBlock stores the size of the actual allocation in its size
field (minus the header size). This can be used for more effective
deallocation with zfree_with_size.

Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
2025-11-09 20:46:27 +02:00
Jacob Murphy 28e5dcce2c
Fix crash that occurs sometimes when aborting a slot migration while child snapshot is active (#2721)
The race condition causes the client to be used and subsequently double
freed by the slot migration read pipe handler. The order of events is:

1. We kill the slot migration child process during CANCELSLOTMIGRATIONS
1. We then free the associated client to the target node
1. Although we kill the child process, it is not guaranteed that the
pipe will be empty from child to parent
1. If the pipe is not empty, we later will read that out in the
slotMigrationPipeReadHandler
1. In the pipe read handler, we attempt to write to the connection. If
writing to the connection fails, we will attempt to free the client
1. However, the client was already freed, so this is a double free

Notably, the slot migration being aborted doesn't need to be triggered
by `CANCELSLOTMIGRATIONS`, it can be any failure.

To solve this, we simply:

1. Set the slot migration pipe connection to NULL whenever it is
unlinked
2. Bail out early in slot migration pipe read handler if the connection
is NULL
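
The two steps can be sketched like this. The types and function names are hypothetical stand-ins chosen for illustration, not the real slot migration code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified job state; not Valkey's real structs. */
struct connection { int fd; };
struct migration_job {
    struct connection *pipe_conn; /* NULL once the client is unlinked */
    int writes;                   /* counts forwarded writes, for illustration */
};

/* Step 1: clear the pointer when the client is unlinked. */
void job_unlink_client(struct migration_job *job) {
    assert(job != NULL);
    job->pipe_conn = NULL;
}

/* Step 2: the read handler bails out early when there is no connection.
 * Returns 1 if data was forwarded, 0 if the handler bailed out. */
int job_pipe_read_handler(struct migration_job *job) {
    assert(job != NULL);
    if (job->pipe_conn == NULL) return 0; /* client already freed: do nothing */
    job->writes++;                        /* stand-in for writing to the connection */
    return 1;
}
```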

I also consolidate the killSlotMigrationChild call to one code path,
which is executed on client unlink. Before, there were two code paths
that would do this twice (once on slot migration job finish, and once on
client unlink). Sending the signal twice is fine, but inefficient.

Also, add a test to cancel during the slot migration snapshot to make
sure this case is covered (we only caught it during the module test).

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
2025-10-13 08:18:13 -07:00
Harkrishn Patro 155b0bb821
Fix memory leak with CLIENT LIST/KILL duplicate filters (#2362)
With #1401, we introduced additional filters to the CLIENT LIST/KILL
subcommands. The intended behavior was to pick the last value of the
filter. However, we introduced a memory leak for all the preceding
filters.

Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
    #0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
    #1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
    #2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
    #3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
    #4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
    #5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
    #6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
    #7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
    #8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
    #9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
    #10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
    #11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
    #12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
    #13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
    #14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
    #15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
    #16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```

Note: For the ID / NOT-ID filters we group all the options and perform
filtering, whereas for the remaining filters we only pick the last filter
option.
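
A minimal sketch of the "last value wins without leaking" pattern, assuming a hypothetical `struct client_filter` with one owned string field (the real parser uses sds strings rather than `malloc`/`free`):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical filter struct with a single owned string field. */
struct client_filter { char *ip; };

/* Store a copy of `value`, releasing any value set by an earlier
 * occurrence of the same option so that the last one wins without leaking.
 * Returns 0 on success, -1 on allocation failure. */
int filter_set_ip(struct client_filter *f, const char *value) {
    assert(f != NULL && value != NULL);
    size_t n = strlen(value) + 1;
    char *copy = malloc(n);
    if (copy == NULL) return -1;
    memcpy(copy, value, n);
    free(f->ip); /* free(NULL) is a no-op, so the first call is fine */
    f->ip = copy;
    return 0;
}
```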

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
2025-10-08 16:04:47 -07:00
Jacob Murphy d5bb986fd5
Add slot migration client flags and module context flags (#2639)
New client flags reported by CLIENT INFO and CLIENT LIST:

* `i` for atomic slot migration importing client
* `E` for atomic slot migration exporting client

New flags in return value of `ValkeyModule_GetContextFlags`:

* `VALKEYMODULE_CTX_FLAGS_SLOT_IMPORT_CLIENT`: Indicates that the client
attached to this context is the slot import client.
* `VALKEYMODULE_CTX_FLAGS_SLOT_EXPORT_CLIENT`: Indicates that the client
attached to this context is the slot export client.

Users could use this to monitor the underlying client info of the slot
migration, and more clearly understand why they see extra clients during
the migration.

Modules can use these to detect keyspace notifications on import
clients. I am also adding export flags for symmetry, although there
should not be keyspace notifications. But they would potentially be
visible in command filters or in server events triggered by that client.

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
2025-10-02 21:12:34 +02:00
uriyage 80bbbcf6fe
Fix memory leak in deferred reply buffer (#2615)
Set free method for deferred_reply list to properly clean up
ClientReplyValue objects when the list is destroyed.
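
To illustrate why a free method matters, here is a sketch with a hypothetical minimal list type (not Valkey's list implementation). Without the per-value destructor, releasing the list would free the nodes but leak the values they point to:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical minimal list for this sketch. */
struct node { void *value; struct node *next; };
struct list {
    struct node *head;
    void (*free_value)(void *value); /* optional per-value destructor */
};

/* Push a value at the head. Returns 0 on success, -1 on allocation failure. */
int list_push(struct list *l, void *value) {
    assert(l != NULL);
    struct node *n = malloc(sizeof(*n));
    if (n == NULL) return -1;
    n->value = value;
    n->next = l->head;
    l->head = n;
    return 0;
}

/* Release all nodes, calling the free method on each value if one is set. */
void list_release(struct list *l) {
    assert(l != NULL);
    while (l->head != NULL) {
        struct node *n = l->head;
        l->head = n->next;
        if (l->free_value) l->free_value(n->value);
        free(n);
    }
}

/* Example destructor that counts how many values it released. */
int released_values = 0;
void counting_free(void *value) {
    free(value);
    released_values++;
}
```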

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
2025-09-17 11:14:44 +03:00
Sarthak Aggarwal 9b11d3d9ed
Evict client only when limit is breached (#2596)
I believe we should evict the clients when the client eviction limit is
breached instead of _at_ the breach. I came across this function in the
failed [daily
test](https://github.com/valkey-io/valkey/actions/runs/17521272806/job/49765359298#step:6:7770),
which could possibly be related.
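
The difference comes down to a strict comparison. A sketch with a hypothetical helper `should_evict_clients`, assuming a limit of 0 means client eviction is disabled:

```c
#include <assert.h>
#include <stddef.h>

/* Evict only when usage exceeds the limit, not when it merely equals it.
 * Assumption for this sketch: a limit of 0 means eviction is disabled. */
int should_evict_clients(size_t used, size_t limit) {
    if (limit == 0) return 0;
    return used > limit; /* strictly greater: exactly at the limit is allowed */
}
```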

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
2025-09-11 17:19:05 +03:00
Viktor Söderqvist 06cefc181a
Attempt to fix sub-replica getting out of sync (#2548)
Try to fix the failures seen for `test "PSYNC2 #3899 regression: verify
consistency"`.

This change resets the query buffer parser state in
`replicationCachePrimary()` which is called when the connection to the
primary is lost. Before #2092, this was done by `resetClient()`.

The solution was inspired by the discussion about the regression
mentioned (discussion from 2017) and the related commits from that time:
6bc6bd4c38,
469d6e2b37,
c180bc7d98.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-08-27 07:24:00 +02:00
Ted Lyngmo 9d682bad0b
bio.c: Organize all worker data in a struct (#2530)
This gets rid of the need to use a void* as a carrier for the worker
number. Instead a pointer to the relevant worker data is passed to the
started thread.

Fixes #2529

---------

Signed-off-by: Ted Lyngmo <ted@lyncon.se>
2025-08-25 12:24:34 +02:00
Viktor Söderqvist 09fb436cf0
Optimize pipelining by parsing and prefetching multiple commands (#2092)
Instead of parsing only one command per client before executing it,
parse multiple commands from the query buffer and batch-prefetch the
keys accessed by the commands in the queue before executing them.

This is an optimization for pipelined commands, both with and without
I/O threads. The optimization is currently disabled for the replication
stream, due to failures (probably caused by how the replication offset
is calculated based on the query buffer offset).

* When parsing commands from the input buffer, multiple commands are
parsed and stored in a command queue per client.
* In single-threaded mode (I/O threads off) keys are batch-prefetched
before the commands in the queue are executed. Multi-key commands like
MGET, MSET and DEL benefit from this even if pipelining is not used.
* Prefetching when I/O threads are used does prefetching for multiple
clients in parallel. This code takes client command queues into account,
improving prefetching when pipelining is used. The batch size is
controlled by the existing config `prefetch-batch-max-size` (default
16), which so far only was used together with I/O threads. The config is
moved to a different section in `valkey.conf`.
* When I/O threads are used and the maximum number of keys are
prefetched, a client's command is executed, then the next one in the
queue, etc. If there are more commands in the queue for which the keys
have not been prefetched (say the client sends 16 pipelined MGET with 16
keys in each), keys for the next few commands in the queue are prefetched
before the next command is executed, provided prefetching has not already
been done for it. (This utilizes the code path used in single-threaded
mode.)

Code improvements:

* Decoupling of command parser state and command execution state:
  * The variables reqtype, multibulklen and bulklen refer to the current
    position in the query buffer. These are no longer reset in resetClient
    (which runs after each command being executed). Instead, they are
    reset in the parser code after each completely parsed command.
  * The command parser code is partially decoupled from the client struct.
    The query buffer is still one per client, but the resulting argument
    vector is stored in caller-defined variables.
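
The idea of parsing several commands into a queue before executing any of them can be sketched as below. This is a heavily simplified illustration: it handles only CRLF-terminated inline commands, and `struct parsed_cmd` and `parse_pipeline` are hypothetical names, not the real parser in `networking.c`. Note that the results go into caller-provided storage rather than into the client struct:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One parsed command: a slice of the query buffer without its CRLF. */
struct parsed_cmd { const char *start; size_t len; };

/* Parse as many complete CRLF-terminated lines as fit in `queue`.
 * Returns the number of commands queued and stores the number of
 * bytes parsed in `*consumed`. An unterminated tail is left unparsed. */
size_t parse_pipeline(const char *buf, size_t len, struct parsed_cmd *queue,
                      size_t max, size_t *consumed) {
    assert(buf != NULL && queue != NULL && consumed != NULL);
    size_t n = 0, pos = 0;
    while (n < max && pos < len) {
        const char *cr = memchr(buf + pos, '\r', len - pos);
        /* Stop at an incomplete command (no CR, or CR without a following LF). */
        if (cr == NULL || (size_t)(cr - buf) + 1 >= len || cr[1] != '\n') break;
        queue[n].start = buf + pos;
        queue[n].len = (size_t)(cr - (buf + pos));
        n++;
        pos = (size_t)(cr - buf) + 2; /* skip past CRLF */
    }
    *consumed = pos;
    return n;
}
```

A caller could then prefetch the keys of all queued commands in one batch before executing them in order.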

Fixes #2044

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2025-08-22 16:02:27 +02:00
yzc-yzc e7bb2354da
Don't call SSL_write() with num=0 (#2490)
aefed3d363/src/networking.c (L2279-L2293)
From above code, we can see that `c->repl_data->ref_block_pos` could be
equal to `o->used`.
When `o->used == o->size`, we may call SSL_write() with num=0 which does
not comply with the OpenSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)

What's worse is that it's still the case after the reconnection. See
aefed3d363/src/replication.c (L756-L769).
So in this case the replica will keep reconnecting again and again until
it doesn't meet the requirements for partial synchronization.
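
A guard of this kind might look like the sketch below. To keep it self-contained, the TLS write is represented by a hypothetical callback type `write_fn` instead of a real `SSL_write()` call, and `counting_writer` is only a stand-in used for demonstration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical writer callback standing in for a TLS write. */
typedef int (*write_fn)(void *ctx, const void *buf, size_t len);

/* Skip the underlying write entirely when there is nothing to send. */
int write_nonempty(write_fn fn, void *ctx, const void *buf, size_t len) {
    assert(fn != NULL);
    if (len == 0) return 0; /* never hand a zero-length buffer to the TLS layer */
    return fn(ctx, buf, len);
}

/* Stand-in writer: counts calls through `ctx` and reports all bytes written. */
int counting_writer(void *ctx, const void *buf, size_t len) {
    (void)buf;
    (*(int *)ctx)++;
    return (int)len;
}
```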

Resolves #2119

---------

Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
2025-08-15 20:37:15 +02:00
Jacob Murphy d7993b78d8
Introduce atomic slot migration (#1949)
Introduces a new family of commands for migrating slots via replication.
The procedure is driven by the source node which pushes an AOF formatted
snapshot of the slots to the target, followed by a replication stream of
changes on that slot (a la manual failover).

This solution is an adaptation of the solution provided by
@enjoy-binbin, combined with the solution I previously posted at #1591,
modified to meet the designs we had outlined in #23.

## New commands

* `CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE
node-id`: Begin sending the slot via replication to the target. Multiple
targets can be specified by repeating `SLOTSRANGE ... NODE ...`
*  `CLUSTER CANCELMIGRATION ALL`: Cancel all slot migrations
* `CLUSTER GETSLOTMIGRATIONS`: See a recent log of migrations

This PR only implements "one shot" semantics with an asynchronous model.
Later, "two phase" (e.g. slot level replicate/failover commands) can be
added with the same core.

## Slot migration jobs

Introduces the concept of a slot migration job. While active, a job
tracks a connection created by the source to the target over which the
contents of the slots are sent. This connection is used for control
messages as well as replicated slot data. Each job is given a 40
character random name to help uniquely identify it.

All jobs, including those that finished recently, can be observed using
the `CLUSTER GETSLOTMIGRATIONS` command.

## Replication

* Since the snapshot uses AOF, the snapshot can be replayed verbatim to
any replicas of the target node.
* We use the same proxying mechanism used for chaining replication to
copy the content sent by the source node directly to the replica nodes.

## `CLUSTER SYNCSLOTS`

To coordinate the state machine transitions across the two nodes, a new
command is added, `CLUSTER SYNCSLOTS`, that performs this control flow.

Each end of the slot migration connection is expected to install a read
handler in order to handle `CLUSTER SYNCSLOTS` commands:

* `ESTABLISH`: Begins a slot migration. Provides slot migration
information to the target and authorizes the connection to write to
unowned slots.
* `SNAPSHOT-EOF`: appended to the end of the snapshot to signal that the
snapshot is done being written to the target.
* `PAUSE`: informs the source node to pause whenever it gets the
opportunity
* `PAUSED`: added to the end of the client output buffer when the pause
is performed. The pause is only performed after the buffer shrinks below
a configurable size
* `REQUEST-FAILOVER`: request the source to either grant or deny a
failover for the slot migration. The request is only granted if the target
is still paused. Once a failover is granted, the pause is refreshed for
a short duration
* `FAILOVER-GRANTED`: sent to the target to inform that REQUEST-FAILOVER
is granted
* `ACK`: heartbeat command used to ensure liveness

## Interactions with other commands

* FLUSHDB on the source node (which flushes the migrating slot) will
result in the source dropping the connection, which will flush the slot
on the target and reset the state machine back to the beginning. The
subsequent retry should very quickly succeed (it is now empty)
* FLUSHDB on the target will fail the slot migration. We can iterate
with better handling, but for now it is expected that the operator would
retry.
* Generally, FLUSHDB is expected to be executed cluster wide, so
preserving partially migrated slots doesn't make much sense
* SCAN and KEYS are filtered to avoid exposing importing slot data

## Error handling

* For any transient connection drops, the migration will be failed and
require the user to retry.
* If there is an OOM while reading from the import connection, we will
fail the import, which will drop the importing slot data
* If there is a client output buffer limit reached on the source node,
it will drop the connection, which will cause the migration to fail
* If at any point the export loses ownership or either node is failed
over, a callback will be triggered on both ends of the migration to fail
the import. The import will not reattempt with a new owner
* The two ends of the migration are routinely pinging each other with
SYNCSLOTS ACK messages. If at any point there is no interaction on the
connection for longer than `repl-timeout`, the connection will be
dropped, resulting in migration failure
* If a failover happens, we will drop keys in all unowned slots. The
migration does not persist through failovers and would need to be
retried on the new source/target.

## State machine

```
                                                                            
                Target/Importing Node State Machine                         
   ─────────────────────────────────────────────────────────────            
                                                                            
             ┌────────────────────┐
             │SLOT_IMPORT_WAIT_ACK┼──────┐
             └──────────┬─────────┘      │
                     ACK│                │
         ┌──────────────▼─────────────┐  │
         │SLOT_IMPORT_RECEIVE_SNAPSHOT┼──┤
         └──────────────┬─────────────┘  │
            SNAPSHOT-EOF│                │                                  
        ┌───────────────▼──────────────┐ │                                  
        │SLOT_IMPORT_WAITING_FOR_PAUSED┼─┤                                  
        └───────────────┬──────────────┘ │                                  
                  PAUSED│                │                                  
        ┌───────────────▼──────────────┐ │ Error Conditions:                
        │SLOT_IMPORT_FAILOVER_REQUESTED┼─┤  1. OOM                          
        └───────────────┬──────────────┘ │  2. Slot Ownership Change        
        FAILOVER-GRANTED│                │  3. Demotion to replica          
         ┌──────────────▼─────────────┐  │  4. FLUSHDB                      
         │SLOT_IMPORT_FAILOVER_GRANTED┼──┤  5. Connection Lost              
         └──────────────┬─────────────┘  │  6. No ACK from source (timeout) 
      Takeover Performed│                │                                  
         ┌──────────────▼───────────┐    │                                  
         │SLOT_MIGRATION_JOB_SUCCESS┼────┤                                  
         └──────────────────────────┘    │                                  
                                         │                                  
   ┌─────────────────────────────────────▼─┐                                
   │SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP│                                
   └────────────────────┬──────────────────┘                                
Unowned Slots Cleaned Up│                                                   
          ┌─────────────▼───────────┐                                      
          │SLOT_MIGRATION_JOB_FAILED│                                      
          └─────────────────────────┘                                      

                                                                                           
                                                                                           
                      Source/Exporting Node State Machine                                  
         ─────────────────────────────────────────────────────────────                     
                                                                                           
               ┌──────────────────────┐                                                    
               │SLOT_EXPORT_CONNECTING├─────────┐                                          
               └───────────┬──────────┘         │                                          
                  Connected│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_AUTHENTICATING┼───────┤                                          
             └─────────────┬────────────┘       │                                          
              Authenticated│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_SEND_ESTABLISH┼───────┤                                          
             └─────────────┬────────────┘       │                                          
  ESTABLISH command written│                    │                                          
     ┌─────────────────────▼─────────────┐      │                                          
     │SLOT_EXPORT_READ_ESTABLISH_RESPONSE┼──────┤                                          
     └─────────────────────┬─────────────┘      │                                          
   Full response read (+OK)│                    │                                          
          ┌────────────────▼──────────────┐     │ Error Conditions:                        
          │SLOT_EXPORT_WAITING_TO_SNAPSHOT┼─────┤  1. User sends CANCELMIGRATION           
          └────────────────┬──────────────┘     │  2. Slot ownership change                
     No other child process│                    │  3. Demotion to replica                  
              ┌────────────▼───────────┐        │  4. FLUSHDB                              
              │SLOT_EXPORT_SNAPSHOTTING┼────────┤  5. Connection Lost                      
              └────────────┬───────────┘        │  6. AUTH failed                          
              Snapshot done│                    │  7. ERR from ESTABLISH command           
               ┌───────────▼─────────┐          │  8. Unpaused before failover completed   
               │SLOT_EXPORT_STREAMING┼──────────┤  9. Snapshot failed (e.g. Child OOM)     
               └───────────┬─────────┘          │  10. No ack from target (timeout)        
                      PAUSE│                    │  11. Client output buffer overrun        
            ┌──────────────▼─────────────┐      │                                          
            │SLOT_EXPORT_WAITING_TO_PAUSE┼──────┤                                          
            └──────────────┬─────────────┘      │                                          
             Buffer drained│                    │                                          
            ┌──────────────▼────────────┐       │                                          
            │SLOT_EXPORT_FAILOVER_PAUSED┼───────┤                                          
            └──────────────┬────────────┘       │                                          
   Failover request granted│                    │                                          
           ┌───────────────▼────────────┐       │                                          
           │SLOT_EXPORT_FAILOVER_GRANTED┼───────┤                                          
           └───────────────┬────────────┘       │                                          
      New topology received│                    │                                          
            ┌──────────────▼───────────┐        │                                          
            │SLOT_MIGRATION_JOB_SUCCESS│        │                                          
            └──────────────────────────┘        │                                          
                                                │                                          
            ┌─────────────────────────┐         │                                          
            │SLOT_MIGRATION_JOB_FAILED│◄────────┤                                          
            └─────────────────────────┘         │                                          
                                                │                                          
           ┌────────────────────────────┐       │                                          
           │SLOT_MIGRATION_JOB_CANCELLED│◄──────┘                                          
           └────────────────────────────┘                                                 
```
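
The message-driven part of the target-side diagram can be sketched as a transition function. The state names come from the diagram; everything else is an assumption made for illustration: the enum values are arbitrary, `import_next` is a hypothetical name, heartbeat `ACK` messages are assumed to leave later states unchanged, and transitions not triggered by a message (such as "Takeover Performed") are left out.

```c
#include <assert.h>
#include <string.h>

/* State names follow the target-side diagram; numeric values are arbitrary. */
enum import_state {
    SLOT_IMPORT_WAIT_ACK,
    SLOT_IMPORT_RECEIVE_SNAPSHOT,
    SLOT_IMPORT_WAITING_FOR_PAUSED,
    SLOT_IMPORT_FAILOVER_REQUESTED,
    SLOT_IMPORT_FAILOVER_GRANTED,
    SLOT_MIGRATION_JOB_SUCCESS,
    SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP,
    SLOT_MIGRATION_JOB_FAILED
};

/* Advance the importing side on a control message from the source. */
enum import_state import_next(enum import_state s, const char *msg) {
    assert(msg != NULL);
    const char *expected = NULL;
    enum import_state next = s;
    switch (s) {
    case SLOT_IMPORT_WAIT_ACK:
        expected = "ACK"; next = SLOT_IMPORT_RECEIVE_SNAPSHOT; break;
    case SLOT_IMPORT_RECEIVE_SNAPSHOT:
        expected = "SNAPSHOT-EOF"; next = SLOT_IMPORT_WAITING_FOR_PAUSED; break;
    case SLOT_IMPORT_WAITING_FOR_PAUSED:
        expected = "PAUSED"; next = SLOT_IMPORT_FAILOVER_REQUESTED; break;
    case SLOT_IMPORT_FAILOVER_REQUESTED:
        expected = "FAILOVER-GRANTED"; next = SLOT_IMPORT_FAILOVER_GRANTED; break;
    default:
        return s; /* remaining states do not advance on messages in this sketch */
    }
    if (strcmp(msg, expected) == 0) return next;
    if (strcmp(msg, "ACK") == 0) return s; /* assumption: heartbeat keeps the state */
    return SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP; /* unexpected message: fail */
}
```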

Co-authored-by: Binbin <binloveplay1314@qq.com>

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2025-08-11 18:02:37 -07:00
zhaozhao.zz 1fbf5fb1fe
support negative filtering for client command filters (#2378)
introduce negative filters for `CLIENT LIST` and `CLIENT KILL` commands:

1. `NOT-ID`: Excludes clients in the IDs set
2. `NOT-TYPE`: Excludes clients of the specified type
3. `NOT-ADDR`: Excludes clients of the specified address and port
4. `NOT-LADDR`: Excludes clients connected to the specified local
address and port
5. `NOT-USER`: Excludes clients of the specified user
6. `NOT-FLAGS`: Excludes clients with the specified flag string
7. `NOT-NAME`: Excludes clients with the specified name
8. `NOT-LIB-NAME`: Excludes clients using the specified library name
9. `NOT-LIB-VER`: Excludes clients with the specified library version
10. `NOT-DB`: Excludes clients with the specified database ID
11. `NOT-CAPA`: Excludes clients with the specified capabilities
12. `NOT-IP`: Excludes clients with the specified IP address

close #1936

and fix the matching algorithm for flag 'N'.
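
Conceptually, a negative filter inverts the result of the corresponding positive match. A sketch for a `USER` / `NOT-USER` pair, using a hypothetical `struct user_filter` that is not the real filter representation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical filter: a user name plus a flag telling whether it came
 * from USER (negate = 0) or NOT-USER (negate = 1). */
struct user_filter { const char *user; int negate; };

/* Returns 1 if the client passes the filter, 0 otherwise. */
int user_filter_matches(const struct user_filter *f, const char *client_user) {
    assert(f != NULL && client_user != NULL);
    if (f->user == NULL) return 1; /* filter not set: everything matches */
    int equal = strcmp(f->user, client_user) == 0;
    return f->negate ? !equal : equal;
}
```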

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2025-08-01 10:31:24 +08:00
Omkar Mestry a739531106
Fixing open comments on #1920 PR (#2364)
2025-07-22 22:44:01 -07:00
Binbin e5956d0c23
Fix client tracking memory overhead calculation (#2360)
This should be + instead of *, otherwise it does not make any sense.
Otherwise we would have to calculate 20 more bytes for each prefix rax
node in 64-bit builds.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-07-22 17:19:02 +08:00
Omkar Mestry 2cd62fa809
Add support for automatic client authentication via TLS certificate fields (#1920)
This PR implements support for automatic client authentication based on
a field in the client's TLS certificate.

API Changes:

* New configuration directive `tls-auth-clients-user`, values `CN` |
`off`, default `off`. CN means take username from the CommonName field
in the client's certificate.
* New INFO field `acl_access_denied_tls_cert` under the `Stats` section,
indicating the number of failed authentications using this feature, i.e.
client certificates for which no matching username was found.
* New reason "tls-cert" in the ACL log, logged when a client
certificate's CommonName fails to match any existing username.

Closes #1866

---------

Signed-off-by: Omkar Mestry <om.m.mestry@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Omkar Mestry <omanges@google.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-07-12 10:58:25 +02:00
Binbin 3a01c7f7e3
Fix unixsocket too long cause addr / laddr being truncated in CLIENT INFO / LIST and logging (#2216)
When displaying client addr / laddr using catClientInfoString for example,
if the client is connected via unix domain socket and the unixsocket exceeds
NET_ADDR_STR_LEN, the output will be truncated because NET_ADDR_STR_LEN is
not long enough.

Currently NET_ADDR_STR_LEN is 46+32, that is 78, while the maximum length of a
UNIX domain socket path is typically 108 bytes on Linux and 104 bytes on macOS,
both numbers including the null terminator. In this fix, we use CONN_ADDR_STR_LEN
instead, which is 128 and should be long enough for most cases.

It affects the output of CLIENT INFO / CLIENT LIST commands, as well as the
printing of some logs.

Other cleanup:
1. In acceptCommonHandler, when calling connFormatAddr on a unix socket
client, the connFormatAddr function will always return /unixsocket since
in anetFdToString we hardcoded the string. We changed it to the same display.

2. We changed the return value of connSocketAddr from returning C_OK/C_ERR to
returning 0 and -1, which is consistent with the return values of other types.

One change worth mentioning is that in connSocketAddr, we used to call
anetFdToString which would return `/unixsocket` hardcoded in this case.
Now we will use server.unixsocket, the format is `path:0`.
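
The truncation itself is easy to reproduce. The sketch below uses the two buffer sizes mentioned above and a hypothetical helper `format_unix_addr` that formats `path:0` and reports whether the result fit:

```c
#include <assert.h>
#include <stdio.h>

#define NET_ADDR_STR_LEN 78   /* 46 + 32, as stated in this commit message */
#define CONN_ADDR_STR_LEN 128

/* Format "path:0" into `dst`. Returns 1 if it fit, 0 if it was truncated. */
int format_unix_addr(char *dst, size_t dstlen, const char *path) {
    assert(dst != NULL && path != NULL && dstlen > 0);
    int n = snprintf(dst, dstlen, "%s:0", path);
    /* snprintf returns the length it wanted to write, excluding the NUL. */
    return n >= 0 && (size_t)n < dstlen;
}
```

A 100-character socket path is truncated in the 78-byte buffer but fits in the 128-byte one.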

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-06-20 10:33:38 +08:00
charsyam b5e012a108
Fix invalid functionname processMultibulkBuffer typo in comments. (#2097)
Fixed processMultiBulkBuffer to processMultibulkBuffer.

processMultibulkBuffer is the real function name, but some comments
misspelled it as processMultiBulkBuffer; the B should be lowercase.

Signed-off-by: charsyam <charsyam@naver.com>
2025-06-10 17:07:37 +08:00
Binbin 3bc40be6cd
CLIENT UNBLOCK shouldn't be able to unpause paused clients (#2117)
When a client is blocked by something like `CLIENT PAUSE`, we should not
allow `CLIENT UNBLOCK timeout` to unblock it. Some blocking types do not
have a timeout callback, so it will trigger a panic in the core. People
should use `CLIENT UNPAUSE` to unblock it.

Using `CLIENT UNBLOCK error` is not right either: it returns an UNBLOCKED
error to the command, and people don't expect a `SET` command to get an error.

So in this commit, in these cases, we return 0 to `CLIENT UNBLOCK`
to indicate that the unblock failed. The reason is that we assume that if
a command doesn't expect to be timed out, it also doesn't expect to be
unblocked by `CLIENT UNBLOCK`.

With the old behavior, the following commands trigger a panic in the
`timeout` case and return an UNBLOCKED error in the `error` case. With the
new behavior, `CLIENT UNBLOCK` returns 0.
```
client 1> client pause 100000 write
client 2> set x x

client 1> client unblock 2 timeout
or
client 1> client unblock 2 error
```
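
The new rule can be sketched as follows. The blocking types and function names here are hypothetical and only illustrate the decision, namely that a client whose blocking type cannot time out is left blocked and `CLIENT UNBLOCK` replies 0:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical blocking types for this sketch. */
enum block_type { BLOCKED_NONE, BLOCKED_LIST, BLOCKED_PAUSE };

/* Assumption: only some blocking types support being timed out. */
int block_type_has_timeout(enum block_type t) {
    return t == BLOCKED_LIST;
}

/* Returns 1 if the client was unblocked, 0 otherwise. This is the
 * value that would be replied to CLIENT UNBLOCK. */
int client_unblock(enum block_type *t) {
    assert(t != NULL);
    if (*t == BLOCKED_NONE) return 0;
    if (!block_type_has_timeout(*t)) return 0; /* e.g. paused clients stay blocked */
    *t = BLOCKED_NONE;
    return 1;
}
```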

Potentially breaking change: `CLIENT UNBLOCK error` was previously allowed.
Fixes #2111.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-06-10 10:29:25 +08:00
skyfirelee 1941d28acd
[NEW] Introduce lttng based tracing (#2070)
## Introduction

In a production environment, it's quite challenging to figure out why a
Valkey instance is under high load. Right now, tools like INFO or slowlog can
offer some clues. But if Valkey can't respond, we might not get any
information at all.
Usually, we have to rely on tools like `strace` or `perf` to find the
root cause. If we set up trace points in advance during the project
development, we can quickly pinpoint performance issues.

This PR adds support for all latency sampling points, as well as
information reporting for command execution. The reporting can be turned
on or off dynamically as required. The trace feature is implemented
based on LTTng, a capability that is also supported in projects like QEMU
and Ceph.

## How to use

Building Valkey with LTTng support:

```
USE_LTTNG=yes make
```

Enable event reporting:
```
config set trace-events "sys server db cluster aof commands"
```

Events are classified as follows:
- sys (System-level operations)
- server (Server core logic)
- db (Database core operations)
- cluster (Cluster configuration operations)
- aof (AOF persistence operations)
- commands (Command execution information)

## How to trace

Enable lttng trace events dynamically:
```
~# lttng destroy valkey
~# lttng create valkey
~# lttng enable-event -u valkey:*
~# lttng track -u -p `pidof valkey-server`
~# lttng start
~# lttng stop
~# lttng view
```

Examples (one client runs 'SET', another runs 'keys'):

```
[15:30:19.334463706] (+0.000001243) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.334465183] (+0.000001477) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 1 }
[15:30:19.334466516] (+0.000001333) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.334467738] (+0.000001222) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.334469105] (+0.000001367) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 1 }
[15:30:19.334470327] (+0.000001222) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.369348485] (+0.034878158) libai valkey:command_call: { cpu_id = 15 }, { name = "keys", duration = 34874 }
[15:30:19.369698322] (+0.000349837) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 4 }
[15:30:19.369702327] (+0.000004005) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 2 }

```

Then we can use another script to analyze the top-N slow commands and
other system-level events.

Performance overhead (valkey-benchmark -t get -n 1000000 --threads 4):
1> no lttng builtin: 285632.69 requests per second
2> lttng builtin, no trace: 285551.09 requests per second (almost 0
overhead)
3> lttng builtin, trace commands: 266595.59 requests per second (about
6.6% overhead)

Generally valkey-server does not run at full utilization, so the overhead
is acceptable.
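
The overhead figures follow from the relative drop in requests per second. A minimal sketch of that arithmetic (the function name is ours, not part of the benchmark tool):

```c
#include <assert.h>

/* Relative throughput loss of `measured_rps` against `baseline_rps`,
 * expressed in percent. */
double overhead_percent(double baseline_rps, double measured_rps) {
    assert(baseline_rps > 0.0);
    return (baseline_rps - measured_rps) / baseline_rps * 100.0;
}
```

With the numbers above, `overhead_percent(285632.69, 266595.59)` lands between 6.6 and 6.7, and the untraced build differs from the baseline by well under 0.1 percent.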

## Problem analysis

Add `prot` and `conn` fields to the command trace event.

Run benchmark tool:
```
GET: rps=227428.0 (overall: 222756.2) avg_msec=0.114 (overall: 0.117)
GET: rps=225248.0 (overall: 223005.2) avg_msec=0.115 (overall: 0.117)
GET: rps=167474.1 (overall: 217942.2) avg_msec=0.193 (overall: 0.122) --> performance drop
GET: rps=220192.0 (overall: 218129.5) avg_msec=0.118 (overall: 0.122)
GET: rps=222868.0 (overall: 218493.7) avg_msec=0.117 (overall: 0.121)

```
Running a 'keys *' command on another connection causes the benchmark
performance to drop.

At the same time, lttng records these events:
```
[21:16:30.420997167] (+0.000004064) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54668", name = "get", duration = 1 }
[21:16:30.421001262] (+0.000004095) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54782", name = "get", duration = 1 }
[21:16:30.485562459] (+0.064561197) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54386", name = "keys", duration = 64551 } --> root cause
[21:16:30.485583101] (+0.000020642) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54522", name = "get", duration = 1 }
[21:16:30.485763891] (+0.000180790) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54542", name = "get", duration = 1 }
[21:16:30.485766451] (+0.000002560) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54438", name = "get", duration = 1 }
```

From this trace, we can see that connection
127.0.0.1:6379-127.0.0.1:54386
affects other connections.

---------

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: artikell <739609084@qq.com>
Signed-off-by: skyfirelee <739609084@qq.com>
Co-authored-by: zhenwei pi <pizhenwei@bytedance.com>
2025-06-08 14:39:57 -07:00
xbasel 838ba44cd6
Reply Copy Avoidance (#2078)
### Overview
This PR introduces the ability to avoid copying the content of string
objects into replies (i.e. bulk string replies) and allows I/O threads
to refer directly to obj->ptr in the writev iov.
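
As a rough illustration of the idea (not the PR's actual code), a bulk string reply can be described by three `iovec` entries in which the payload entry points at the object's bytes instead of a copy. The helper name and its parameters below are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/uio.h>

/* Describes a RESP bulk string reply as three iovec entries:
 * "$<len>\r\n", the payload, and the trailing "\r\n".
 * The payload is referenced, not copied. `hdr` is caller-provided
 * scratch space for the header. Returns the number of entries. */
int build_bulk_reply_iov(struct iovec iov[3], char *hdr, size_t hdr_cap,
                         const char *payload, size_t payload_len) {
    static const char crlf[] = "\r\n";
    assert(hdr != NULL && payload != NULL);
    int n = snprintf(hdr, hdr_cap, "$%zu\r\n", payload_len);
    assert(n > 0 && (size_t)n < hdr_cap);
    iov[0].iov_base = hdr;
    iov[0].iov_len = (size_t)n;
    iov[1].iov_base = (void *)payload; /* points at the object's own bytes */
    iov[1].iov_len = payload_len;
    iov[2].iov_base = (void *)crlf;
    iov[2].iov_len = 2;
    return 3;
}
```

The resulting array can be passed to `writev(fd, iov, 3)`. The catch is that the referenced object must stay alive until the write completes, which is presumably why the write path needs to track these entries separately from plain reply bytes.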

### Key Changes
* Added a capability to reply construction that allows interleaving
regular replies with copy avoid replies in client reply buffers
* Extended write-to-client handlers to support copy avoid replies
* Added copy avoidance of string bulk replies when copy avoidance is
indicated by I/O threads
* Minor changes in cluster slots stats in order to support
`network-bytes-out` for copy avoid replies
* Regardless of object size, copy avoidance benefits performance only
from a certain number of threads upward, so it is enabled only from
that number of threads. The internal configuration
``min-io-threads-avoid-copy-reply`` was introduced to manage this
number of threads

**Note**: When copy avoidance is disabled, the content and handling of
client reply buffers remain as before this PR

### Implementation Details
####  ``client`` and  ``clientReplyBlock`` structs:
1. A ``buf_encoded`` flag has been added to the ``clientReplyBlock``
struct and to the ``client`` struct for static ``c->buf`` to indicate
whether the reply buffer is in copy avoidance mode (i.e. includes
headers and payloads) or not (i.e. plain replies only).
2. ``io_last_written_buf``, ``io_last_written_bufpos``,
``io_last_written_data_len`` fields have been added to the ``client``
struct to keep track of write state between ``writevToClient`` invocations
####  Reply construction:
1. The original ```_addReplyToBuffer``` and ```_addReplyProtoToList``` have
been renamed to ```_addReplyPayloadToBuffer``` and ```_addReplyPayloadToList```
and extended to support different types of payloads: regular replies and
copy avoid replies.
2. The new ```_addReplyToBuffer``` and ```_addReplyProtoToList``` now call
the renamed ```_addReplyPayloadToBuffer``` and ```_addReplyPayloadToList``` and
are used for adding **regular** replies to client reply buffers.
3. The newly introduced ```_addBulkOffloadToBuffer``` and
the matching ```_addBulkOffloadToList``` are used for adding **copy avoid**
replies to client reply buffers.
 
####  Write-to-client infrastructure:
The ```writevToClient``` and ```_postWriteToClient``` functions have been
significantly changed to support the copy avoidance capability.

####  Debug configuration:
1. ``min-io-threads-avoid-copy-reply`` - Minimum number of IO threads
for copy avoidance
2. ``min-string-size-avoid-copy-reply`` - Minimum bulk string size for
copy avoidance when IO threads are disabled
3. ``min-string-size-avoid-copy-reply-threaded`` - Minimum bulk string
size for copy avoidance when IO threads are enabled

### Testing
1. Existing unit and integration tests passed. Copy avoidance is enabled
in tests run with the ``--io-threads`` flag
2. Added unit tests for copy avoidance functionality

### Performance Tests

Note: an `io-threads 1` config means only the main thread with no
additional I/O threads, `io-threads 2` means the main thread plus 1 I/O
thread, and `io-threads 9` means the main thread plus 8 I/O threads.

#### 512 byte object size
Tests are conducted on memory optimized instances using:
* 3,000,000 keys
* 512 bytes object size 
* 1000 clients

|io-threads (including main thread)	|Plain Reply	|Copy Avoidance	|
|---	|---	|---	|
|7	|1,160,000	|1,160,000	|
|8	|1,150,000	|1,280,000	|
|9	|1,150,000	|1,330,000	|
|10	|N/A	|1,380,000	|
|11	|N/A	|1,420,000	|

#### Various object size, small number of threads

|iothreads |Data size |Keys |Clients |Instance type |Unstable branch |Copy Avoidance On |
|---	|---	|---	|---	|---	|---	|---	|
|1	|512 byte	|3,000,000	|1,000	|memory optimized	|195,000	|195,000	|
|2	|512 byte	|3,000,000	|1,000	|memory optimized	|245,000	|245,000	|
|3	|512 byte	|3,000,000	|1,000	|memory optimized	|455,000	|459,000	|
|4	|512 byte	|3,000,000	|1,000	|memory optimized	|685,000	|685,000	|
|	|	|	|	|	|	|	|
|1	|1K	|3,000,000	|1,000	|memory optimized	|185,000	|185,000	|
|2	|1K	|3,000,000	|1,000	|memory optimized	|235,000	|235,000	|
|3	|1K	|3,000,000	|1,000	|memory optimized	|450,000	|450,000	|
|	|	|	|	|	|	|	|
|1	|4K	|1,000,000	|1,000	|network optimized	|182,000	|187,000	|
|2	|4K	|1,000,000	|1,000	|network optimized	|240,000	|238,000	|
|	|	|	|	|	|	|	|
|1	|16K	|1,000,000	|500	|network optimized	|100,000	|120,000	|
|2	|16K	|1,000,000	|500	|network optimized	|140,000	|140,000	|
|3	|16K	|1,000,000	|500	|network optimized	|275,000	|260,000	|
|	|	|	|	|	|	|	|
|1	|32K	|500,000	|500	|network optimized	|57,000	|90,000	|
|2	|32K	|500,000	|500	|network optimized	|110,000	|110,000	|
|3	|32K	|500,000	|500	|network optimized	|215,000	|215,000	|
|	|	|	|	|	|	|	|
|1	|64K	|100,000	|500	|network optimized	|30,000	|57,000	|
|2	|64K	|100,000	|500	|network optimized	|69,000	|61,000	|
|3	|64K	|100,000	|500	|network optimized	|120,000	|120,000	|
|4	|64K	|100,000	|500	|network optimized	|115,000 - 175,000	|175,000	|
|5	|64K	|100,000	|500	|network optimized	|115,000 - 165,000	|230,000	|

---------

Signed-off-by: Alexander Shabanov <alexander.shabanov@gmail.com>
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Alexander Shabanov <alexander.shabanov@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2025-06-07 20:49:51 -07:00
Viktor Söderqvist 3789b29e92
Offload slot calculation and cross-slot detection to I/O threads (#2165)
Refactors `getNodeByQuery()` so that the CRC16 slot calculations can be
moved into I/O threads and many checks can be skipped if slot migrations
are not ongoing.

The slot calculation for a command is moved out of `getNodeByQuery()`
into a new function `clusterSlotByCommand()` which is safe to call from
I/O threads.
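
For readers unfamiliar with the slot calculation, here is a self-contained sketch of it: CRC16 (XMODEM variant) of the key, or of its hashtag if one is present, reduced to 16384 slots. This is not the body of `clusterSlotByCommand()`; the function names are ours, and the bitwise CRC stands in for whatever table-driven version the server uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0, no reflection. */
uint16_t crc16_xmodem(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Slot of a key. If the key contains "{...}" with a non-empty body,
 * only the text between the first '{' and the next '}' is hashed. */
unsigned key_hash_slot(const char *key, size_t len) {
    assert(key != NULL);
    const char *lb = memchr(key, '{', len);
    if (lb != NULL) {
        size_t rest = len - (size_t)(lb + 1 - key);
        const char *rb = memchr(lb + 1, '}', rest);
        if (rb != NULL && rb > lb + 1) {
            return crc16_xmodem(lb + 1, (size_t)(rb - lb - 1)) & 16383;
        }
    }
    return crc16_xmodem(key, len) & 16383;
}

/* Cross-slot detection: returns 1 if all keys map to one slot, else 0. */
int same_slot(const char *const *keys, size_t nkeys) {
    if (nkeys == 0) return 1;
    unsigned first = key_hash_slot(keys[0], strlen(keys[0]));
    for (size_t i = 1; i < nkeys; i++) {
        if (key_hash_slot(keys[i], strlen(keys[i])) != first) return 0;
    }
    return 1;
}
```

The computation depends only on the key bytes, so it needs no shared server state, which is what makes it a natural candidate for running off the main thread.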

For MULTI-EXEC transactions, the slot is stored per command in the
multi-state, to be able to detect cross-slot transactions on EXEC
without computing the slots again.

Additionally, cross-slot detection and the arity check are offloaded to
I/O threads. To unify the code paths for commands parsed by I/O threads
and commands parsed by the main thread, the command lookup, arity check
and slot lookup are moved out of `processCommand()` to a new
`prepareCommand()` function that needs to be called before
`processCommand()`.

The client's read flags are used for passing information about bad arity
and cross-slot. These flags are already used to convey information per
command between I/O thread and main thread.

Fixes #2077
Related to #632

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2025-06-04 09:50:12 +02:00
Ayush Sharma 4aaaf88337
Allow dynamic modification of io-threads num (#2033)
Item from #761 

This PR has the following changes

1. Bug fix where calling `pthread_join()` from the main thread for an IO
thread would hang indefinitely. This is because `IOThreadMain()` doesn't
have a cancellation point, so `pthread_cancel()` from the main thread is
not honored.
It can be reproduced by calling `shutdownIOThread()` from the main thread
for any active thread with an empty job queue.
It is fixed by adding a cancellation point in `IOThreadMain()`.
2. Makes the `io-threads` config modifiable at runtime.
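
The hang in item 1 can be reproduced in miniature. In the sketch below, a worker spins without calling any function that acts as a cancellation point, so a deferred `pthread_cancel()` only takes effect because of the explicit `pthread_testcancel()`. The function names are hypothetical, and the actual fix in `IOThreadMain()` may use a different cancellation point.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static atomic_int worker_started;

/* Stand-in for an I/O thread polling an empty job queue. */
void *worker_main(void *arg) {
    (void)arg;
    atomic_store(&worker_started, 1);
    for (;;) {
        /* Explicit cancellation point: without it, this loop would
         * never act on a pending pthread_cancel(). */
        pthread_testcancel();
        sched_yield();
    }
}

/* Starts a worker, cancels it and joins it.
 * Returns 1 if the join reports that the thread was cancelled. */
int start_and_cancel_worker(void) {
    pthread_t tid;
    void *result = NULL;
    atomic_store(&worker_started, 0);
    if (pthread_create(&tid, NULL, worker_main, NULL) != 0) return 0;
    while (!atomic_load(&worker_started)) sched_yield();
    int rc = pthread_cancel(tid);
    assert(rc == 0);
    if (pthread_join(tid, &result) != 0) return 0;
    return result == PTHREAD_CANCELED;
}
```

Removing the `pthread_testcancel()` call would leave `pthread_join()` waiting on a thread that has no opportunity to notice the request, which matches the hang described above.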

Signed-off-by: Ayush Sharma <mrayushs933@gmail.com>
2025-05-28 11:25:02 -07:00
xbasel f5b92f526a
Prioritize replication traffic in the replica (#1838)
## Overview
In high-load scenarios, a replica might not consume replication data
fast enough, leading to backpressure on the primary. When the primary’s
buffer overflows, replication lag increases and the primary can drop the
replica connection, triggering a full sync, a costly operation that
impacts system performance.

The solution is to keep reading from replication clients until there is no longer pending data, up to 25 iterations.
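
A bounded drain loop of this kind might look as follows. The `repl_source` type and `read_once()` are hypothetical stand-ins for the replication connection and a single read; only the cap of 25 comes from the text.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REPL_READS 25 /* iteration cap named in the description */

/* Hypothetical stand-in for a replication link with unread bytes. */
typedef struct {
    size_t pending; /* bytes still waiting to be read */
    size_t chunk;   /* bytes consumed by one read */
} repl_source;

/* Performs one read and returns the number of bytes consumed. */
size_t read_once(repl_source *s) {
    size_t n = s->pending < s->chunk ? s->pending : s->chunk;
    s->pending -= n;
    return n;
}

/* Keeps reading while data is pending, but never more than
 * MAX_REPL_READS times. Returns the number of reads performed. */
int drain_replication(repl_source *s) {
    assert(s != NULL && s->chunk > 0);
    int reads = 0;
    while (s->pending > 0 && reads < MAX_REPL_READS) {
        read_once(s);
        reads++;
    }
    return reads;
}
```

The cap keeps a busy replication stream from starving other clients indefinitely, while still letting the replica catch up faster than one read per event loop cycle.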

## Performance Impact ##

Test setup:
1. Bombard the replica with expensive commands, leading to high CPU
utilization
2. Write to the main database to trigger replication traffic

Metric | Before (repl-flow-control Disabled) | After (repl-flow-control Enabled)
-- | -- | --
Throughput (requests/sec) | 941.71 | 760.98
Avg Latency (ms) | 52.865 | 65.534
p50 Latency (ms) | 59.743 | 68.543
p95 Latency (ms) | 79.231 | 106.687
p99 Latency (ms) | 90.303 | 126.527
Max Latency (ms) | 188.031 | 385.535

- Replication stability improves, no full syncs were observed after
enabling flow control.
- Higher latency for normal clients due to increased resource allocation
for replication.
- CPU and memory usage remain stable, with no major overhead.
- Replica throughput slightly decreases as replication takes priority.
    
Implements https://github.com/valkey-io/valkey/issues/1596

---------

Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-05-28 10:50:21 -07:00
Zeroday BYTE 374718b2a3
Fix unsigned difference expression compared to zero (#2101)
daea05b1e2/src/networking.c (L886-L886)

Fixing the issue requires ensuring that the subtraction `prev->size -
prev->used` does not underflow. This can be achieved by explicitly
checking that `prev->used` is less than `prev->size` before performing
the subtraction. This approach avoids relying on unsigned arithmetic and
ensures the logic is clear and robust.

The specific changes are:
1. Replace the condition `prev->size - prev->used > 0` with `prev->used
< prev->size`.
2. This change ensures that the logic checks whether there is remaining
space in the buffer without risking underflow.
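
The difference between the two conditions shows up only when `used` exceeds `size`. A small sketch, assuming both fields are `size_t` (the function names are ours):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Old form: with unsigned operands the difference wraps around,
 * so this is true whenever size != used. */
bool has_free_space_wrapping(size_t size, size_t used) {
    return size - used > 0;
}

/* New form: true only when there really is room left. */
bool has_free_space(size_t size, size_t used) {
    return used < size;
}

/* Shows the case where the two forms disagree. */
void demo_wraparound(void) {
    assert(has_free_space_wrapping(8, 16) == true);
    assert(has_free_space(8, 16) == false);
}
```

For consistent buffers (`used <= size`) both functions agree, so the change matters only as a guard against an inconsistent state.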

**References**
[INT02-C. Understand integer conversion
rules](https://wiki.sei.cmu.edu/confluence/display/c/INT02-C.+Understand+integer+conversion+rules)
[CWE-191](https://cwe.mitre.org/data/definitions/191.html)


---

Signed-off-by: Zeroday BYTE <github@zerodaysec.org>
2025-05-26 14:57:00 +03:00
Viktor Söderqvist 249495a3ff
Convert pubsub dicts to hashtables (#2007)
Two dicts are converted to hashtables:

1. On each client, the set of channels/patterns/shard-channels the
   client is subscribed to
2. On each channel or pattern, the set of clients subscribed to it.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Rain Valentine <rsg000@gmail.com>
2025-04-27 09:38:59 +02:00
Ran Shidlansik 30e9109d19
Add output buffer limiting for unauthenticated clients (#2006)
This commit introduces a mechanism to track client authentication state
with a new `ever_authenticated` flag. It refactors client authentication
handling by adding a `clientSetUser` function that properly sets both
the `authenticated` and `ever_authenticated` flags.

The implementation limits output buffer size for clients that have never
been authenticated.
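
In outline, the check could look like the sketch below. The struct, the function name and the `limit_bytes` parameter are hypothetical; the commit does not state the actual limit or what happens once it is exceeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the relevant client state. */
typedef struct {
    bool ever_authenticated;    /* set once, never cleared */
    size_t pending_reply_bytes; /* bytes queued in the output buffer */
} client_state;

/* True if a never-authenticated client has queued more output
 * than the given limit allows. */
bool exceeds_unauthenticated_limit(const client_state *c, size_t limit_bytes) {
    assert(c != NULL);
    if (c->ever_authenticated) return false;
    return c->pending_reply_bytes > limit_bytes;
}
```

Keying the check on `ever_authenticated` rather than `authenticated` means a client that authenticated once is not subject to the stricter limit later.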

Added tests to verify the output buffer limiting behavior for
unauthenticated clients.

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: uriyage <78144248+uriyage@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-04-25 09:29:32 +03:00
Yair Gottdenker 38846504dc
blocking client followup fix related to log_req_res (#1989)
See
https://github.com/valkey-io/valkey/pull/1819#issuecomment-2817273924

Verified the fix by:
1. Build: `make SERVER_CFLAGS='-Werror -DLOG_REQ_RES' -j10`
2. Running module api tests: `CFLAGS='-Werror' ./runtest-moduleapi
--log-req-res --no-latency --dont-clean --force-resp3 --dont-pre-clean
--verbose --dump-logs`

---------

Signed-off-by: yairgott <yairgott@gmail.com>
2025-04-23 00:41:17 +02:00
Yair Gottdenker 57118a4754
Fix engine crash on module client blocking during keyspace events (#1819)
This change enhances user experience and consistency by allowing a
module to block a client on keyspace event notifications. Consistency is
improved by ensuring that reads after writes on the same connection
yield the expected results. For example, in ValkeySearch, mutations
processed earlier on the same connection will be available for search.

The implementation extends `VM_BlockClient` to support blocking clients
on keyspace event notifications. Internal clients, Lua clients, clients
issuing MULTI/EXEC and those with the `deny_blocking` flag set are not
blocked. Once blocked, a client’s reply is withheld until it is
explicitly unblocked.

---------

Signed-off-by: yairgott <yairgott@gmail.com>
2025-04-17 18:13:21 -07:00
Sarthak Aggarwal be60586f17
[Client Introspection] Client Commands Extended Filtering (#1466)
In this PR, we introduce support for new filters for `CLIENT
LIST` and `CLIENT KILL` commands. The new filters are:

1. FLAGS `Client must include this flag. This can be a string with a
bunch of flags, one after the other.`
2. NAME `client name`
3. IDLE `minimum idle time of the client`
4. LIB-NAME `clients with the specified lib name.`
5. LIB-VER `clients with the specified lib version.`
6. DB `clients currently operating on the specified database ID`
7. IP `client ip address`
8. CAPA `client capabilities` 

Partly Addresses: #668

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2025-04-14 16:08:41 -07:00
Sarthak Aggarwal e2b28c6a07
Rebranding in security warning log (#1945)
Replacing Redis with Valkey in security warning log via the
`SERVER_TITLE` macro.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
2025-04-13 05:41:07 +02:00
VanessaTang 09f9630171
Redact protocol error log when hide-user-data-from-log enabled (#1889)
In this code path:
https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L2767-L2773,
`c->querybuf + c->qb_pos` may also include user data.
This change redacts the log message when the `hide-user-data-from-log` config is enabled.

---------

Signed-off-by: VanessaTang <yuetan@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
2025-03-31 10:46:20 +03:00
uriyage cab500cf6a
Fix incorrect assertion in client list operations (#1800)
The current assertion introduced in #11220:
```c
serverAssert(&c->clients_pending_write_node.next != NULL || &c->clients_pending_write_node.prev != NULL);
```

is incorrect for two reasons:
1. `&pointer.next` is always non-NULL, since it is the address
of the field.
2. The check is incorrect even without the & because in a single-node
list, both pointers can be NULL.

Fix:
1. Remove the always-true assertion
2. Add proper assertions in listUnlinkNode that verify the node's
membership in the list, covering all list cases.
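
One way to express such membership assertions is shown below. This is a sketch with a hypothetical minimal list type; the exact assertions added to `listUnlinkNode` may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    struct node *prev, *next;
} node;

typedef struct {
    node *head, *tail;
    size_t len;
} list;

/* Appends `n` at the tail of `l`. */
void list_append(list *l, node *n) {
    n->prev = l->tail;
    n->next = NULL;
    if (l->tail) l->tail->next = n; else l->head = n;
    l->tail = n;
    l->len++;
}

/* Unlinks `n` from `l`, asserting that it looks like a member first.
 * A node without a predecessor must be the head and a node without a
 * successor must be the tail; this also holds in a single-node list,
 * where both pointers are NULL. */
void list_unlink(list *l, node *n) {
    assert(l->len > 0);
    assert(n->prev != NULL || l->head == n);
    assert(n->next != NULL || l->tail == n);
    if (n->prev) n->prev->next = n->next; else l->head = n->next;
    if (n->next) n->next->prev = n->prev; else l->tail = n->prev;
    n->prev = n->next = NULL;
    l->len--;
}
```

Unlike the removed check, these assertions compare the pointer values against the list's head and tail, so they stay meaningful for one-element lists.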

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2025-03-04 15:24:28 +08:00
zhaozhao.zz 0cc0bf7222
make net_input_bytes_curr_cmd more readable (#1756)
The metric `net_input_bytes_curr_cmd` is now computed by aggregating its components separately.

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2025-03-03 09:40:24 +08:00
zhaozhao.zz 49663575c0
cmd's out bytes need count deferred reply (#1760)
The special deferred reply was ignored when counting the current
command's `net_output_bytes_curr_cmd`.

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2025-02-27 15:59:48 +08:00
Seungmin Lee b7116d4a72
Move TCP/TLS specific options from generic client to connection type (#1706)
Fixes https://github.com/valkey-io/valkey/issues/1702

Signed-off-by: Seungmin Lee <sungming@amazon.com>
Co-authored-by: Seungmin Lee <sungming@amazon.com>
2025-02-20 14:33:37 -08:00
Ray Cao 462cba5801 Modify parameter of clientMatchesFilter function. (#1733)
Small change to [#1401](https://github.com/valkey-io/valkey/pull/1401/).
Pass `clientFilter *` to `clientMatchesFilter` as suggested in
[comment](6bc64ca537 (r1875844273)).

Signed-off-by: Ray Cao <zisong.cw@alibaba-inc.com>
2025-02-15 10:49:30 +01:00
zhaozhao.zz 96857bfeb1
show capa in CLIENT LIST (#1698)
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2025-02-11 00:06:41 +08:00
bodong.ybd 61a854dbbd
Fix client trackinginfo crash when tracking off by default (#1684)
After #1405, `client trackinginfo` will crash when tracking is off
```
/lib64/libpthread.so.0(+0xf630)[0x7fab74931630]
./src/valkey-server *:6379(clientTrackingInfoCommand+0x12b)[0x57f8db]
./src/valkey-server *:6379(call+0x5ba)[0x5a791a]
./src/valkey-server *:6379(processCommand+0x968)[0x5a8938]
./src/valkey-server *:6379(processInputBuffer+0x18d)[0x58381d]
./src/valkey-server *:6379(readQueryFromClient+0x59)[0x585ea9]
./src/valkey-server *:6379[0x46fa4d]
./src/valkey-server *:6379(aeMain+0x89)[0x5bf3e9]
./src/valkey-server *:6379(main+0x4e1)[0x455821]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fab74576555]
./src/valkey-server *:6379[0x4564f2]
```

The reason is that `pubsub_data` is not initialized by default; it is
only initialized when tracking is on.

Fixes #1683.

Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
2025-02-10 10:56:23 +08:00
uriyage de3672a7ff
Offload replication writes to IO threads (#1485)
This PR offloads the write to replica clients to IO threads.

## Main Changes

* Replica writes will be offloaded, but only after the replica is in
online mode.
* Replica reads will still be done in the main thread to reduce
complexity and because read traffic from replicas is negligible.

### Implementation Details

In order to offload the writes, `writeToReplica` has been split into 2
parts:
1. The write itself, made by the IO thread or by the main thread
2. The post-write step, where we update the replication buffers' refcount.
It is done in the main thread after the write job is done in the IO thread
(similar to what we do with a regular client)

### Additional Changes

* In `writeToReplica` we now use `writev` in case more than 1 buffer
exists.
* Changed client `nwritten` field to `ssize_t` since with a replica the
`nwritten` can theoretically exceed `int` size (not subject to
`NET_MAX_WRITES_PER_EVENT` limit).
* Changed parsing code to use `memchr` instead of `strchr`:
  * While parsing a command, ASAN got stuck for an unknown reason when
    `strchr` was called to look for the next `\r`
  * Adding an assert for a null-terminated querybuf didn't resolve the issue.
  * Switched to `memchr`, as it is bounded by an explicit length and
    resolves the issue
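
The practical difference is that `memchr` takes an explicit length, so it never reads past the bytes received so far even if the buffer is not NUL-terminated. A minimal sketch (the helper name is ours):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first '\r' within the first `len` bytes
 * of `buf`, or -1 if there is none. No terminator is required. */
long find_cr(const char *buf, size_t len) {
    assert(buf != NULL);
    const char *p = memchr(buf, '\r', len);
    return p ? (long)(p - buf) : -1;
}
```

`strchr` on the same buffer would keep scanning until it meets a NUL byte, wherever that happens to be.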

### Testing
* Added integration tests
* Added unit tests

**Related issue:** #761

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
2025-02-09 13:34:13 +02:00
Binbin da3f1c6144
Add paused_reason to INFO CLIENTS (#1564)
In #1519, we added paused_actions and paused_timeout_milliseconds.
It would be helpful to also add paused_reason, since users also
want to know the reason for the pause.

Currently available options:
- client_pause: triggered by the CLIENT PAUSE command.
- shutdown_in_progress: during shutdown, the primary waits for the
replicas to catch up with the offset.
- failover_in_progress: during failover, the primary waits for the
replica to catch up with the offset.
- none

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-02-06 16:56:32 +01:00
zhaozhao.zz 3f21705a6c
Feature COMMANDLOG to record slow execution and large request/reply (#1294)
As discussed in PR #336.

We have different types of resources like CPU, memory, network, etc. The
`slowlog` can only record commands that consume a lot of CPU during the
processing phase (not including network read/write time), but cannot
record commands that consume too much memory or network bandwidth. For
example:

1. Running a "SET key value(10 megabytes)" command would not be recorded
in slowlog, since when processing it the SET command only inserts the
value's pointer into the db dict. But that command consumes a large
amount of memory in the query buffer and bandwidth from the network. In
this case, just 1000 tps can cause 10GB/s of network traffic.
2. Running a "GET key" command where the key's value length is 10
megabytes. The GET command can consume a large amount of memory in the
output buffer and bandwidth to the network.

This PR introduces a new command `COMMANDLOG`, to log commands that
consume significant network bandwidth, including both input and output.
Users can retrieve the results using `COMMANDLOG get <count>
large-request` and `COMMANDLOG get <count> large-reply`. All subcommands
for `COMMANDLOG` are:

* `COMMANDLOG HELP`
* `COMMANDLOG GET <count> <slow|large-request|large-reply>`
* `COMMANDLOG LEN <slow|large-request|large-reply>`
* `COMMANDLOG RESET <slow|large-request|large-reply>`

The slowlog is also incorporated into the commandlog.

For each of these three types, additional configs have been added for
control:

* `commandlog-request-larger-than` and
`commandlog-large-request-max-len` represent the threshold for large
requests (in bytes) and the maximum number of commands that can
be recorded.
* `commandlog-reply-larger-than` and `commandlog-large-reply-max-len`
represent the threshold for large replies (in bytes) and the
maximum number of commands that can be recorded.
* `commandlog-execution-slower-than` and
`commandlog-slow-execution-max-len` represent the threshold for slow
executions (in microseconds) and the maximum number of commands
that can be recorded.
* Additionally, `slowlog-log-slower-than` and `slowlog-max-len` are now
aliases for these two slow-execution configs.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
2025-01-24 11:41:40 +08:00
Sarthak Aggarwal 6a8f068e36
Adding Missing filters to CLIENT LIST and Dedup Parsing (#1401)
Adds filter options to CLIENT LIST:

    * USER <username>
      Return clients authenticated by <username>.
    * ADDR <ip:port>
      Return clients connected from the specified address.
    * LADDR <ip:port>
      Return clients connected to the specified local address.
    * SKIPME (YES|NO)
      Exclude the current client from the list (default: no).
    * MAXAGE <maxage>
      Only list connections older than the specified age.

Modifies the ID filter of CLIENT KILL to allow multiple IDs:

    * ID <client-id> [<client-id>...]
      Kill connections by client ids.


This makes CLIENT LIST and CLIENT KILL accept the same options.

For backward compatibility, the default value for SKIPME is NO for
CLIENT LIST and YES for CLIENT KILL.

The MAXAGE filter comes from CLIENT KILL, where it *keeps* clients with
the given max age and kills the older ones. This logic is awkward for
CLIENT LIST, but is kept for similarity with CLIENT KILL, for the use
case of first testing manually using CLIENT LIST, and then running
CLIENT KILL with the same filters.

The `ID client-id [client-id ...]` no longer needs to be the last
filter. The parsing logic determines if an argument is an ID or not
based on whether it can be parsed as an integer or not.
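
The "is it an integer?" rule can be sketched as follows. Both helper names are hypothetical and the real parser works on the server's own argument objects, so treat this only as an outline of the logic.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Returns 1 and stores the value if `s` consists entirely of a
 * base-10 integer (optional leading '-'), otherwise returns 0. */
int parse_integer(const char *s, long long *out) {
    char *end;
    assert(s != NULL && out != NULL);
    if (!(isdigit((unsigned char)s[0]) ||
          (s[0] == '-' && isdigit((unsigned char)s[1])))) {
        return 0;
    }
    errno = 0;
    long long v = strtoll(s, &end, 10);
    if (errno != 0 || *end != '\0') return 0;
    *out = v;
    return 1;
}

/* Consumes consecutive integer arguments starting at argv[start] as
 * client IDs and stops at the first non-integer (the next filter name).
 * Returns how many IDs were collected. */
int collect_ids(int argc, const char *const *argv, int start,
                long long *ids, int max_ids) {
    int n = 0;
    for (int i = start; i < argc && n < max_ids; i++) {
        if (!parse_integer(argv[i], &ids[n])) break;
        n++;
    }
    return n;
}
```

For the arguments `ID 12 34 SKIPME NO`, collecting from index 1 yields the two IDs and leaves `SKIPME` to be parsed as the next filter.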

Partly addresses: #668

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
2025-01-15 20:44:13 +01:00
zhaozhao.zz c5a1585547
add paused_actions for INFO Clients (#1519)
Add `paused_actions` and `paused_timeout_milliseconds` to INFO Clients
to inform users whether clients are paused.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2025-01-14 19:01:00 +08:00
uriyage 6c09eea2bc
client struct: lazy init components and optimize struct layout (#1405)
# Refactor client structure to use modular data components

## Current State
The client structure allocates memory for replication / pubsub /
multi-keys / module / blocked data for every client, despite these
features being used by only a small subset of clients. In addition the
current field layout in the client struct is suboptimal, with poor
alignment and unnecessary padding between fields, leading to a larger
than necessary memory footprint of 896 bytes per client. Furthermore,
fields that are frequently accessed together during operations are
scattered throughout the struct, resulting in poor cache locality.

## This PR's Change

1.  Lazy Initialization 
- **Components are only allocated when first used:**
  - PubSubData: Created on first SUBSCRIBE/PUBLISH operation
  - ReplicationData: Initialized only for replica connections
  - ModuleData: Allocated when module interaction begins
  - BlockingState: Created when first blocking command is issued
  - MultiState: Initialized on MULTI command

2. Memory Layout Optimization:
   - Grouped related fields for better locality
   - Moved rarely accessed fields (e.g., client->name) to struct end
   - Optimized field alignment to eliminate padding

3. Additional changes:
   - Moved watched_keys to be static allocated in the `mstate` struct
   - Relocated replication init logic to replication.c
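
Both ideas can be shown on a toy struct. The sketch below uses hypothetical types (`pubsub_data`, `client`, and the two layout structs); it is not the server's client struct, and exact sizes depend on the platform ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Field order forces padding around the 8-byte member on typical
 * 64-bit ABIs. */
struct scattered {
    char flag_a;
    uint64_t id;
    char flag_b;
};

/* Same fields, widest first, so the small fields share one slot. */
struct grouped {
    uint64_t id;
    char flag_a;
    char flag_b;
};

/* Hypothetical optional component. */
typedef struct {
    int subscriptions;
} pubsub_data;

typedef struct {
    uint64_t id;
    pubsub_data *pubsub; /* NULL until the feature is first used */
} client;

/* Returns the client's pubsub component, allocating it on first use.
 * May return NULL if the allocation fails. */
pubsub_data *client_pubsub(client *c) {
    assert(c != NULL);
    if (c->pubsub == NULL) c->pubsub = calloc(1, sizeof(*c->pubsub));
    return c->pubsub;
}

/* Releases whatever optional components were allocated. */
void client_free_components(client *c) {
    free(c->pubsub);
    c->pubsub = NULL;
}
```

A client that never subscribes pays only for the pointer. The flip side is that every access has to go through an accessor like `client_pubsub()` or check for NULL first, since the component may not exist yet.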
  

### Key Benefits
- **Efficient Memory Usage:**
- About 41% smaller base client structure: basic clients now use 528
bytes (down from 896).
- Better memory locality for related operations
- Performance improvement in high throughput scenarios. No performance
regressions in other cases.


### Performance Impact

Tested with 650 clients and 512 bytes values.

#### Single Thread Performance
| Operation   | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|------------|---------|---------------|---------------|-----------|
| SET        | 1 key   | 261,799      | 258,261      | +1.37%    |
| SET        | 3M keys | 209,134      | ~209,000     | ~0%       |
| GET        | 1 key   | 281,564      | 277,965      | +1.29%    |
| GET        | 3M keys | 231,158      | 228,410      | +1.20%    |

#### 8 IO Threads Performance
| Operation   | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|------------|---------|---------------|---------------|-----------|
| SET        | 1 key   | 1,331,578    | 1,331,626    | -0.00%    |
| SET        | 3M keys | 1,254,441    | 1,152,645    | +8.83%    |
| GET        | 1 key   | 1,293,149    | 1,289,503    | +0.28%    |
| GET        | 3M keys | 1,152,898    | 1,101,791    | +4.64%    |

#### Pipeline Performance (3M keys)
| Operation | Pipeline Size | New (ops/sec) | Old (ops/sec) | Change % |
|-----------|--------------|---------------|---------------|-----------|
| SET       | 10          | 548,964      | 538,498      | +1.94%    |
| SET       | 20          | 606,148      | 594,872      | +1.89%    |
| SET       | 30          | 631,122      | 616,606      | +2.35%    |
| GET       | 10          | 628,482      | 624,166      | +0.69%    |
| GET       | 20          | 687,371      | 681,659      | +0.84%    |
| GET       | 30          | 725,855      | 721,102      | +0.66%    |

### Observations:
1. Single-threaded operations show consistent improvements (1-1.4%)
2. Multi-threaded performance shows significant gains for large
datasets:
   - SET with 3M keys: +8.83% improvement
   - GET with 3M keys: +4.64% improvement
3. Pipeline operations show consistent improvements:
   - SET operations: +1.89% to +2.35%
   - GET operations: +0.66% to +0.84%
4. No performance regressions observed in any test scenario


Related issue: https://github.com/valkey-io/valkey/issues/761

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-08 10:28:54 +02:00
Rueian 3b52186b6a
Add `availability_zone` to the HELLO response (#1487)
It's inconvenient for client implementations to extract the
`availability_zone` information from the `INFO` response. The `INFO`
response contains a lot of information that a client implementation
typically doesn't need.

This PR adds the availability zone to the `HELLO` response. Clients
usually already use the `HELLO` command for protocol negotiation and
also get the server `version` and `role` from its response. To keep the
`HELLO` response small, the field is only added if availability zone is
configured.

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>
2025-01-07 22:54:55 +01:00