Currently, when parsing querybuf, we do not check for CRLF;
instead, we assume by default that the last two characters are CRLF,
as shown in the following example:
```
telnet 127.0.0.1 6379
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
*3
$3
set
$3
key
$5
value12
+OK
get key
$5
value
*3
$3
set
$3
key
$5
value12345
+OK
-ERR unknown command '345', with args beginning with:
```
This should actually be considered a protocol error. When a bug
occurs in a client-side implementation, we may execute incorrect
requests (writing incorrect data being the most serious of these).
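As a minimal sketch (not the actual Valkey code), the parser can verify the terminator instead of assuming it:
```c
/* Validate that a bulk argument of length `bulklen` at `p` is terminated
 * by CRLF before accepting it. `remaining` is the number of unparsed
 * bytes available at `p`. Returns 1 if valid, 0 otherwise. */
static int bulkHasValidCRLF(const char *p, long bulklen, long remaining) {
    if (remaining < bulklen + 2) return 0; /* need more data first */
    return p[bulklen] == '\r' && p[bulklen + 1] == '\n';
}
```
If the check fails, the connection should be rejected with a protocol error rather than silently consuming the malformed input.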
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
The CLUSTER SLOTS reply depends on whether the client is connected over
IPv6, but a fake client has no connection. When this command is called
from a module timer callback, or in another scenario where no real
client is involved, there is no connection on which to check IPv6
support. This fix handles the missing case by returning the reply for
IPv4-connected clients.
Fixes #2912.
---------
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Su Ko <rhtn1128@gmail.com>
Co-authored-by: KarthikSubbarao <karthikrs2021@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In #2078, we did not report large replies when copy avoidance is
allowed. This results in replies larger than 16384 bytes not being
recorded in the commandlog large-reply. This 16384 threshold is
controlled by the hidden config min-string-size-avoid-copy-reply.
Signed-off-by: Binbin <binloveplay1314@qq.com>
We have the same settings for the hard limit, and we should apply them to the soft
limit as well. When the `repl-backlog-size` value is larger, all replication buffers
can be handled by the replication backlog, so there's no need to worry about the
client output buffer soft limit here. Furthermore, when `soft_seconds` is 0, the
soft limit behaves (mostly) the same as the hard limit.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The Fedora Rawhide CI reports these warnings:
```
networking.c: In function 'afterErrorReply':
networking.c:821:30: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
821 | char *spaceloc = memchr(s, ' ', len < 32 ? len : 32);
```
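The fix, sketched below, stores the result in a const-qualified pointer so it matches the const buffer being searched (assuming that is all the warning requires):
```c
/* memchr() is applied to the const buffer `s`, so the result must not
 * drop the const qualifier. */
const char *spaceloc = memchr(s, ' ', len < 32 ? len : 32);
```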
Signed-off-by: Binbin <binloveplay1314@qq.com>
clientReplyBlock stores the size of the actual allocation in its `size`
field (minus the header size). This can be used for more effective
deallocation with zfree_with_size.
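A sketch of how such a free might look, assuming the block's `size` field holds the usable size excluding the header:
```c
/* Free a clientReplyBlock using the stored size, sparing zfree() an
 * allocation-size lookup. */
void freeClientReplyValue(void *o) {
    clientReplyBlock *b = o;
    zfree_with_size(b, b->size + sizeof(clientReplyBlock));
}
```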
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
The race condition causes the client to be used and subsequently double
freed by the slot migration read pipe handler. The order of events is:
1. We kill the slot migration child process during CANCELSLOTMIGRATIONS
1. We then free the associated client to the target node
1. Although we kill the child process, it is not guaranteed that the
pipe will be empty from child to parent
1. If the pipe is not empty, we later will read that out in the
slotMigrationPipeReadHandler
1. In the pipe read handler, we attempt to write to the connection. If
writing to the connection fails, we attempt to free the client
1. However, the client was already freed, so this is a double free
Notably, the slot migration being aborted doesn't need to be triggered
by `CANCELSLOTMIGRATIONS`, it can be any failure.
To solve this, we simply:
1. Set the slot migration pipe connection to NULL whenever it is
unlinked
2. Bail out early in slot migration pipe read handler if the connection
is NULL
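A sketch of the guard, using illustrative names rather than the actual Valkey symbols:
```c
/* Bail out if the connection was already unlinked; the client it
 * belonged to may have been freed. */
void slotMigrationPipeReadHandler(struct aeEventLoop *el, int fd,
                                  void *privdata, int mask) {
    slotMigrationJob *job = privdata; /* hypothetical job struct */
    if (job->conn == NULL) return; /* client freed; nothing to write to */
    /* ... read from the child pipe and forward to job->conn ... */
}
```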
I also consolidate the killSlotMigrationChild call into one code path,
which is executed on client unlink. Before, there were two code paths
that would do this twice (once on slot migration job finish, and once on
client unlink). Sending the signal twice is fine, but inefficient.
Also, add a test to cancel during the slot migration snapshot to make
sure this case is covered (we only caught it during the module test).
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
With #1401, we introduced additional filters to the CLIENT LIST/KILL
subcommands. The intended behavior was to pick the last value of each
filter. However, we introduced a memory leak for all the preceding
filter values.
Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
#0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
#1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
#2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
#3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
#4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
#5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
#6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
#7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
#8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
#9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
#10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
#11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
#12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
#13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
#14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
#15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
#16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```
Note: For the ID / NOT-ID filters we group all the options and perform
filtering, whereas for the remaining filters we only pick the last
filter option.
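A sketch of the fix, with a hypothetical `filter` struct: release the previous value before overwriting it with the last occurrence.
```c
/* Free any value stored by an earlier occurrence of the same filter
 * before keeping the last one, so earlier sds strings don't leak.
 * `addrArg` is an illustrative name for the current argument. */
if (filter->addr) sdsfree(filter->addr);
filter->addr = sdsnew(addrArg);
```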
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
New client flags reported by CLIENT INFO and CLIENT LIST:
* `i` for atomic slot migration importing client
* `E` for atomic slot migration exporting client
New flags in return value of `ValkeyModule_GetContextFlags`:
* `VALKEYMODULE_CTX_FLAGS_SLOT_IMPORT_CLIENT`: Indicates that the client
attached to this context is the slot import client.
* `VALKEYMODULE_CTX_FLAGS_SLOT_EXPORT_CLIENT`: Indicates that the client
attached to this context is the slot export client.
Users could use this to monitor the underlying client info of the slot
migration, and more clearly understand why they see extra clients during
the migration.
Modules can use these to detect keyspace notifications on import
clients. I am also adding export flags for symmetry, although there
should not be keyspace notifications on the export client. They would,
however, potentially be visible in command filters or in server events
triggered by that client.
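As an illustrative sketch (the callback shape is an assumption; the flag names come from this change), a module could skip events generated by the slot import client:
```c
int onKeyEvent(ValkeyModuleCtx *ctx, int type, const char *event,
               ValkeyModuleString *key) {
    int flags = ValkeyModule_GetContextFlags(ctx);
    /* Mutations replayed by the slot migration import client can be
     * told apart from regular client traffic. */
    if (flags & VALKEYMODULE_CTX_FLAGS_SLOT_IMPORT_CLIENT)
        return VALKEYMODULE_OK;
    /* ... handle ordinary keyspace notifications ... */
    return VALKEYMODULE_OK;
}
```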
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Set a free method for the deferred_reply list to properly clean up
ClientReplyValue objects when the list is destroyed.
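A minimal sketch of the change, assuming the adlist API and a destructor for ClientReplyValue objects:
```c
/* Ensure list nodes' values are released when the list is destroyed. */
c->deferred_reply = listCreate();
listSetFreeMethod(c->deferred_reply, freeClientReplyValue);
```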
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Try to fix the failures seen for `test "PSYNC2 #3899 regression: verify
consistency"`.
This change resets the query buffer parser state in
`replicationCachePrimary()` which is called when the connection to the
primary is lost. Before #2092, this was done by `resetClient()`.
The solution was inspired by the discussion about the regression
mentioned (discussion from 2017) and the related commits from that time:
6bc6bd4c38,
469d6e2b37,
c180bc7d98.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This gets rid of the need to use a void* as a carrier for the worker
number. Instead a pointer to the relevant worker data is passed to the
started thread.
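A sketch of the pattern, with illustrative names:
```c
#include <pthread.h>

typedef struct worker {
    pthread_t tid;
    int id;
    /* ... per-worker state ... */
} worker;

static void *workerMain(void *arg) {
    worker *w = arg; /* no integer-to-void* smuggling needed */
    /* ... use w->id and the rest of the worker's data ... */
    (void)w;
    return NULL;
}

static void spawnWorker(worker *w) {
    /* Pass a pointer to the worker's own data to the started thread. */
    pthread_create(&w->tid, NULL, workerMain, w);
}
```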
Fixes #2529
---------
Signed-off-by: Ted Lyngmo <ted@lyncon.se>
Instead of parsing only one command per client before executing it,
parse multiple commands from the query buffer and batch-prefetch the
keys accessed by the commands in the queue before executing them.
This is an optimization for pipelined commands, both with and without
I/O threads. The optimization is currently disabled for the replication
stream, due to failures (probably caused by how the replication offset
is calculated based on the query buffer offset).
* When parsing commands from the input buffer, multiple commands are
parsed and stored in a command queue per client.
* In single-threaded mode (I/O threads off) keys are batch-prefetched
before the commands in the queue are executed. Multi-key commands like
MGET, MSET and DEL benefit from this even if pipelining is not used.
* Prefetching when I/O threads are used does prefetching for multiple
clients in parallel. This code takes client command queues into account,
improving prefetching when pipelining is used. The batch size is
controlled by the existing config `prefetch-batch-max-size` (default
16), which so far only was used together with I/O threads. The config is
moved to a different section in `valkey.conf`.
* When I/O threads are used and the maximum number of keys are
prefetched, a client's command is executed, then the next one in the
queue, etc. If there are more commands in the queue for which the keys
have not been prefetched (say the client sends 16 pipelined MGET with 16
keys in each), keys for the next few commands in the queue are prefetched
before those commands are executed, if prefetching has not been done for
the next command. (This utilizes the code path used in single-threaded
mode.)
Code improvements:
* Decoupling of command parser state and command execution state:
* The variables reqtype, multibulklen and bulklen refer to the current
position in the query buffer. These are no longer reset in resetClient
(which runs after each command being executed). Instead, they are
reset in the parser code after each completely parsed command.
* The command parser code is partially decoupled from the client struct.
The query buffer is still one per client, but the resulting argument
vector is stored in caller-defined variables.
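An illustrative sketch of the decoupling (struct names are hypothetical; the field names come from the description above):
```c
/* Parser position state lives with the parser and is reset after each
 * fully parsed command, rather than in resetClient(). */
typedef struct commandParser {
    int reqtype;       /* inline or multibulk */
    long multibulklen; /* remaining bulk args of the current command */
    long long bulklen; /* length of the bulk currently being read */
} commandParser;

/* Each completely parsed command is appended to a per-client queue. */
typedef struct queuedCommand {
    robj **argv;
    int argc;
} queuedCommand;
```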
Fixes #2044
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
aefed3d363/src/networking.c (L2279-L2293)
From the above code, we can see that `c->repl_data->ref_block_pos` could
be equal to `o->used`.
When `o->used == o->size`, we may call SSL_write() with num=0, which does
not comply with the OpenSSL specification.
(ref: https://docs.openssl.org/master/man3/SSL_write/#warnings)
What's worse is that it's still the case after the reconnection. See
aefed3d363/src/replication.c (L756-L769).
So in this case the replica will keep reconnecting again and again until
it no longer meets the requirements for partial synchronization.
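A sketch of the guard (illustrative; `o->buf` and the surrounding function context are assumptions):
```c
/* Never hand a zero-length write to the TLS layer: SSL_write() with
 * num=0 is explicitly warned against in the OpenSSL documentation. */
size_t len = o->used - c->repl_data->ref_block_pos;
if (len == 0) return C_OK; /* nothing pending in this block */
ssize_t nwritten = connWrite(c->conn,
                             o->buf + c->repl_data->ref_block_pos, len);
```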
Resolves #2119
---------
Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
Introduces a new family of commands for migrating slots via replication.
The procedure is driven by the source node which pushes an AOF formatted
snapshot of the slots to the target, followed by a replication stream of
changes on that slot (a la manual failover).
This solution is an adaptation of the solution provided by
@enjoy-binbin, combined with the solution I previously posted at #1591,
modified to meet the designs we had outlined in #23.
## New commands
* `CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE
node-id`: Begin sending the slot via replication to the target. Multiple
targets can be specified by repeating `SLOTSRANGE ... NODE ...`
* `CLUSTER CANCELMIGRATION ALL`: Cancel all slot migrations
* `CLUSTER GETSLOTMIGRATIONS`: See a recent log of migrations
This PR only implements "one shot" semantics with an asynchronous model.
Later, "two phase" (e.g. slot level replicate/failover commands) can be
added with the same core.
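For example (illustrative; `<target-node-id>` stands for the 40-character node ID):
```
127.0.0.1:6379> CLUSTER MIGRATESLOTS SLOTSRANGE 0 99 NODE <target-node-id>
OK
127.0.0.1:6379> CLUSTER GETSLOTMIGRATIONS
... recent migration log ...
127.0.0.1:6379> CLUSTER CANCELMIGRATION ALL
OK
```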
## Slot migration jobs
Introduces the concept of a slot migration job. While active, a job
tracks a connection created by the source to the target over which the
contents of the slots are sent. This connection is used for control
messages as well as replicated slot data. Each job is given a 40
character random name to help uniquely identify it.
All jobs, including those that finished recently, can be observed using
the `CLUSTER GETSLOTMIGRATIONS` command.
## Replication
* Since the snapshot uses AOF, the snapshot can be replayed verbatim to
any replicas of the target node.
* We use the same proxying mechanism used for chaining replication to
copy the content sent by the source node directly to the replica nodes.
## `CLUSTER SYNCSLOTS`
To coordinate the state machine transitions across the two nodes, a new
command is added, `CLUSTER SYNCSLOTS`, that performs this control flow.
Each end of the slot migration connection is expected to install a read
handler in order to handle `CLUSTER SYNCSLOTS` commands:
* `ESTABLISH`: Begins a slot migration. Provides slot migration
information to the target and authorizes the connection to write to
unowned slots.
* `SNAPSHOT-EOF`: appended to the end of the snapshot to signal that the
snapshot is done being written to the target.
* `PAUSE`: informs the source node to pause whenever it gets the
opportunity
* `PAUSED`: added to the end of the client output buffer when the pause
is performed. The pause is only performed after the buffer shrinks below
a configurable size
* `REQUEST-FAILOVER`: request the source to either grant or deny a
failover for the slot migration. The request is only granted if the
target is still paused. Once a failover is granted, the pause is
refreshed for a short duration
* `FAILOVER-GRANTED`: sent to the target to inform that REQUEST-FAILOVER
is granted
* `ACK`: heartbeat command used to ensure liveness
## Interactions with other commands
* FLUSHDB on the source node (which flushes the migrating slot) will
result in the source dropping the connection, which will flush the slot
on the target and reset the state machine back to the beginning. The
subsequent retry should very quickly succeed (it is now empty)
* FLUSHDB on the target will fail the slot migration. We can iterate
with better handling, but for now it is expected that the operator would
retry.
* Generally, FLUSHDB is expected to be executed cluster wide, so
preserving partially migrated slots doesn't make much sense
* SCAN and KEYS are filtered to avoid exposing importing slot data
## Error handling
* For any transient connection drops, the migration will be failed and
require the user to retry.
* If there is an OOM while reading from the import connection, we will
fail the import, which will drop the importing slot data
* If there is a client output buffer limit reached on the source node,
it will drop the connection, which will cause the migration to fail
* If at any point the export loses ownership or either node is failed
over, a callback will be triggered on both ends of the migration to fail
the import. The import will not reattempt with a new owner
* The two ends of the migration are routinely pinging each other with
SYNCSLOTS ACK messages. If at any point there is no interaction on the
connection for longer than `repl-timeout`, the connection will be
dropped, resulting in migration failure
* If a failover happens, we will drop keys in all unowned slots. The
migration does not persist through failovers and would need to be
retried on the new source/target.
## State machine
```
Target/Importing Node State Machine
─────────────────────────────────────────────────────────────
┌────────────────────┐
│SLOT_IMPORT_WAIT_ACK┼──────┐
└──────────┬─────────┘ │
ACK│ │
┌──────────────▼─────────────┐ │
│SLOT_IMPORT_RECEIVE_SNAPSHOT┼──┤
└──────────────┬─────────────┘ │
SNAPSHOT-EOF│ │
┌───────────────▼──────────────┐ │
│SLOT_IMPORT_WAITING_FOR_PAUSED┼─┤
└───────────────┬──────────────┘ │
PAUSED│ │
┌───────────────▼──────────────┐ │ Error Conditions:
│SLOT_IMPORT_FAILOVER_REQUESTED┼─┤ 1. OOM
└───────────────┬──────────────┘ │ 2. Slot Ownership Change
FAILOVER-GRANTED│ │ 3. Demotion to replica
┌──────────────▼─────────────┐ │ 4. FLUSHDB
│SLOT_IMPORT_FAILOVER_GRANTED┼──┤ 5. Connection Lost
└──────────────┬─────────────┘ │ 6. No ACK from source (timeout)
Takeover Performed│ │
┌──────────────▼───────────┐ │
│SLOT_MIGRATION_JOB_SUCCESS┼────┤
└──────────────────────────┘ │
│
┌─────────────────────────────────────▼─┐
│SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP│
└────────────────────┬──────────────────┘
Unowned Slots Cleaned Up│
┌─────────────▼───────────┐
│SLOT_MIGRATION_JOB_FAILED│
└─────────────────────────┘
Source/Exporting Node State Machine
─────────────────────────────────────────────────────────────
┌──────────────────────┐
│SLOT_EXPORT_CONNECTING├─────────┐
└───────────┬──────────┘ │
Connected│ │
┌─────────────▼────────────┐ │
│SLOT_EXPORT_AUTHENTICATING┼───────┤
└─────────────┬────────────┘ │
Authenticated│ │
┌─────────────▼────────────┐ │
│SLOT_EXPORT_SEND_ESTABLISH┼───────┤
└─────────────┬────────────┘ │
ESTABLISH command written│ │
┌─────────────────────▼─────────────┐ │
│SLOT_EXPORT_READ_ESTABLISH_RESPONSE┼──────┤
└─────────────────────┬─────────────┘ │
Full response read (+OK)│ │
┌────────────────▼──────────────┐ │ Error Conditions:
│SLOT_EXPORT_WAITING_TO_SNAPSHOT┼─────┤ 1. User sends CANCELMIGRATION
└────────────────┬──────────────┘ │ 2. Slot ownership change
No other child process│ │ 3. Demotion to replica
┌────────────▼───────────┐ │ 4. FLUSHDB
│SLOT_EXPORT_SNAPSHOTTING┼────────┤ 5. Connection Lost
└────────────┬───────────┘ │ 6. AUTH failed
Snapshot done│ │ 7. ERR from ESTABLISH command
┌───────────▼─────────┐ │ 8. Unpaused before failover completed
│SLOT_EXPORT_STREAMING┼──────────┤ 9. Snapshot failed (e.g. Child OOM)
└───────────┬─────────┘ │ 10. No ack from target (timeout)
PAUSE│ │ 11. Client output buffer overrun
┌──────────────▼─────────────┐ │
│SLOT_EXPORT_WAITING_TO_PAUSE┼──────┤
└──────────────┬─────────────┘ │
Buffer drained│ │
┌──────────────▼────────────┐ │
│SLOT_EXPORT_FAILOVER_PAUSED┼───────┤
└──────────────┬────────────┘ │
Failover request granted│ │
┌───────────────▼────────────┐ │
│SLOT_EXPORT_FAILOVER_GRANTED┼───────┤
└───────────────┬────────────┘ │
New topology received│ │
┌──────────────▼───────────┐ │
│SLOT_MIGRATION_JOB_SUCCESS│ │
└──────────────────────────┘ │
│
┌─────────────────────────┐ │
│SLOT_MIGRATION_JOB_FAILED│◄────────┤
└─────────────────────────┘ │
│
┌────────────────────────────┐ │
│SLOT_MIGRATION_JOB_CANCELLED│◄──────┘
└────────────────────────────┘
```
Co-authored-by: Binbin <binloveplay1314@qq.com>
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Introduce negative filters for the `CLIENT LIST` and `CLIENT KILL` commands:
1. `NOT-ID`: Excludes clients in the IDs set
2. `NOT-TYPE`: Excludes clients of the specified type
3. `NOT-ADDR`: Excludes clients of the specified address and port
4. `NOT-LADDR`: Excludes clients connected to the specified local
address and port
5. `NOT-USER`: Excludes clients of the specified user
6. `NOT-FLAGS`: Excludes clients with the specified flag string
7. `NOT-NAME`: Excludes clients with the specified name
8. `NOT-LIB-NAME`: Excludes clients using the specified library name
9. `NOT-LIB-VER`: Excludes clients with the specified library version
10. `NOT-DB`: Excludes clients with the specified database ID
11. `NOT-CAPA`: Excludes clients with the specified capabilities
12. `NOT-IP`: Excludes clients with the specified IP address
Closes #1936 and fixes the matching algorithm for flag 'N'.
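For example (illustrative):
```
CLIENT LIST NOT-TYPE replica NOT-ID 7 42
CLIENT KILL NOT-ADDR 127.0.0.1:50000 NOT-USER default
```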
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
This should be + instead of *, otherwise it does not make any sense;
we would otherwise have to account for 20 more bytes for each prefix rax
node in a 64-bit build.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This PR implements support for automatic client authentication based on
a field in the client's TLS certificate.
API Changes:
* New configuration directive `tls-auth-clients-user`, values `CN` |
`off`, default `off`. CN means take username from the CommonName field
in the client's certificate.
* New INFO field `acl_access_denied_tls_cert` under the `Stats` section,
indicating the number of failed authentications using this feature, i.e.
client certificates for which no matching username was found.
* New reason "tcl-cert" in the ACL log, logged when a client
certificate's CommonName fails to match any existing username.
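An example configuration snippet enabling the feature:
```
# valkey.conf: derive the ACL username from the client certificate's
# CommonName field instead of requiring AUTH.
tls-auth-clients-user CN
```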
Closes #1866
---------
Signed-off-by: Omkar Mestry <om.m.mestry@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Omkar Mestry <omanges@google.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When displaying a client's addr / laddr using catClientInfoString, for
example, if the client is connected via a unix domain socket and the unix
socket path exceeds NET_ADDR_STR_LEN, the output will be truncated because
NET_ADDR_STR_LEN is not long enough.
Currently NET_ADDR_STR_LEN is 46+32, that is 78, while the maximum length
of a unix domain socket path is typically 108 bytes on Linux and 104 bytes
on macOS, both numbers including the null terminator. In this fix, we use
CONN_ADDR_STR_LEN instead, which is 128 and should be long enough for most
cases.
It affects the output of CLIENT INFO / CLIENT LIST commands, as well as the
printing of some logs.
Other cleanup:
1. In acceptCommonHandler, when calling connFormatAddr on a unix socket
client, the connFormatAddr function would always return /unixsocket, since
the string is hardcoded in anetFdToString. We changed it to use the same display.
2. We changed the return value of connSocketAddr from C_OK/C_ERR to
0 and -1, which is consistent with the return values of the other types.
One change worth mentioning is that connSocketAddr used to call
anetFdToString, which would return the hardcoded `/unixsocket` in this case.
Now we use server.unixsocket, and the format is `path:0`.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Renamed processMultiBulkBuffer to processMultibulkBuffer.
processMultibulkBuffer is the real function name, but it was typoed
as processMultiBulkBuffer; the B should be lowercase.
Signed-off-by: charsyam <charsyam@naver.com>
When a client is blocked by something like `CLIENT PAUSE`, we should not
allow `CLIENT UNBLOCK timeout` to unblock it: some blocking types do not
have a timeout callback, so this would trigger a panic in the core.
People should use `CLIENT UNPAUSE` to unblock it instead.
Using `CLIENT UNBLOCK error` is not right either; it would return an
UNBLOCKED error to the command, and people don't expect a `SET` command
to get an error.
So in this commit, in these cases, we return 0 to `CLIENT UNBLOCK`
to indicate that the unblock failed. The reasoning is that if a command
doesn't expect to be timed out, it also doesn't expect to be unblocked
by `CLIENT UNBLOCK`.
The old behavior of the following commands was to trigger a panic for
timeout and to return an UNBLOCKED error for error. Under the new
behavior, CLIENT UNBLOCK returns 0.
```
client 1> client pause 100000 write
client 2> set x x
client 1> client unblock 2 timeout
or
client 1> client unblock 2 error
```
This is a potentially breaking change; `CLIENT UNBLOCK error` was
previously allowed.
Fixes #2111.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## Introduction
In a production environment, it's quite challenging to figure out why a
Valkey server is under high load. Right now, tools like INFO or slowlog
can offer some clues, but if the server can't respond, we might not get
any information at all.
Usually, we have to rely on tools like `strace` or `perf` to find the
root cause. If we set up trace points in advance during development, we
can quickly pinpoint performance issues.
This PR adds support for all latency sampling points, as well as
information reporting for command execution. It also supports
dynamically turning the information reporting on or off as required.
The trace feature is implemented based on LTTng, a capability also used
in projects like QEMU and Ceph.
## How to use
Building Valkey with LTTng support:
```
USE_LTTNG=yes make
```
Enable event reporting:
```
config set trace-events "sys server db cluster aof commands"
```
Events are classified as follows:
- sys (System-level operations)
- server (Server core logic)
- db (Database core operations)
- cluster (Cluster configuration operations)
- aof (AOF persistence operations)
- commands (Command execution information)
## How to trace
Enable lttng trace events dynamically:
```
~# lttng destroy valkey
~# lttng create valkey
~# lttng enable-event -u valkey:*
~# lttng track -u -p `pidof valkey-server`
~# lttng start
~# lttng stop
~# lttng view
```
Examples (one client runs 'SET', another runs 'keys'):
```
[15:30:19.334463706] (+0.000001243) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.334465183] (+0.000001477) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 1 }
[15:30:19.334466516] (+0.000001333) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.334467738] (+0.000001222) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.334469105] (+0.000001367) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 1 }
[15:30:19.334470327] (+0.000001222) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.369348485] (+0.034878158) libai valkey:command_call: { cpu_id = 15 }, { name = "keys", duration = 34874 }
[15:30:19.369698322] (+0.000349837) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 4 }
[15:30:19.369702327] (+0.000004005) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 2 }
```
Then we can use another script to analyze the top N slow commands and
other system-level events.
About performance overhead (valkey-benchmark -t get -n 1000000 --threads 4):
1> no lttng built in: 285632.69 requests per second
2> lttng built in, no trace: 285551.09 requests per second (almost 0
overhead)
3> lttng built in, trace commands: 266595.59 requests per second (about
~6.6% overhead)
Generally valkey-server does not run at full utilization, so the
overhead is acceptable.
## Problem analysis
Added prot and conn fields to the command trace events.
Run benchmark tool:
```
GET: rps=227428.0 (overall: 222756.2) avg_msec=0.114 (overall: 0.117)
GET: rps=225248.0 (overall: 223005.2) avg_msec=0.115 (overall: 0.117)
GET: rps=167474.1 (overall: 217942.2) avg_msec=0.193 (overall: 0.122) --> performance drop
GET: rps=220192.0 (overall: 218129.5) avg_msec=0.118 (overall: 0.122)
GET: rps=222868.0 (overall: 218493.7) avg_msec=0.117 (overall: 0.121)
```
Running another 'keys *' command on a separate connection leads to a
drop in benchmark performance.
At the same time, lttng traces events:
```
[21:16:30.420997167] (+0.000004064) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54668", name = "get", duration = 1 }
[21:16:30.421001262] (+0.000004095) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54782", name = "get", duration = 1 }
[21:16:30.485562459] (+0.064561197) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54386", name = "keys", duration = 64551 } --> root cause
[21:16:30.485583101] (+0.000020642) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54522", name = "get", duration = 1 }
[21:16:30.485763891] (+0.000180790) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54542", name = "get", duration = 1 }
[21:16:30.485766451] (+0.000002560) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54438", name = "get", duration = 1 }
```
From this trace, we can see that connection
127.0.0.1:6379-127.0.0.1:54386 affects the other connections.
---------
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: artikell <739609084@qq.com>
Signed-off-by: skyfirelee <739609084@qq.com>
Co-authored-by: zhenwei pi <pizhenwei@bytedance.com>
Refactors `getNodeByQuery()` to be able to move the CRC16 slot
calculations in I/O threads and skip many checks if slot migrations are
not ongoing.
The slot calculation for a command is moved out of `getNodeByQuery()`
into a new function `clusterSlotByCommand()` which is safe to call from
I/O threads.
For MULTI-EXEC transactions, the slot is stored per command in the
multi-state, to be able to detect cross-slot transactions on EXEC
without computing the slots again.
Additionally, cross-slot detection and the arity check are offloaded to
I/O threads. To unify the code paths for commands parsed by I/O threads
and commands parsed by the main thread, the command lookup, arity check,
and slot lookup are moved out of `processCommand()` into a new
`prepareCommand()` function that needs to be called before
`processCommand()`.
The client's read flags are used for passing information about bad arity
and cross-slot. These flags are already used to convey information per
command between I/O thread and main thread.
Fixes #2077
Related to #632
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Item from #761
This PR has the following changes
1. Bug fix where calling `pthread_join()` from the main thread for an IO
thread would hang indefinitely. This is because `IOThreadMain()` doesn't
have a cancellation point, so `pthread_cancel()` from the main thread is
not honored.
Can be reproduced by calling `shutdownIOThread()` from the main thread
for any active thread with an empty job queue.
Fixed by adding a cancellation point in `IOThreadMain()` (see the sketch
after this list).
2. Makes `io-threads` config runtime modifiable.
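A sketch of such a cancellation point (illustrative; the actual `IOThreadMain()` loop differs):
```c
#include <pthread.h>

static void *IOThreadMain(void *arg) {
    while (1) {
        /* pthread_testcancel() is a cancellation point, so a pending
         * pthread_cancel() takes effect even when the job queue is
         * empty and no other cancellation point is reached. */
        pthread_testcancel();
        /* ... poll the job queue and process I/O jobs ... */
    }
    return NULL;
}
```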
Signed-off-by: Ayush Sharma <mrayushs933@gmail.com>
## Overview
In high-load scenarios, a replica might not consume replication data
fast enough, leading to backpressure on the primary. When the primary’s
buffer overflows, replication lag increases and it can drop the replica
connection, triggering a full sync, a costly operation that impacts
system performance.
The solution is to read from replication clients until there is no
longer pending data, up to 25 iterations.
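A minimal sketch of the bounded drain loop, with illustrative names (not the actual Valkey implementation):
```c
#define REPL_DRAIN_MAX_ITERATIONS 25

/* Keep reading from the replication client while data is pending, but
 * bound the iterations so one client cannot starve the event loop. */
char buf[16 * 1024];
for (int i = 0; i < REPL_DRAIN_MAX_ITERATIONS; i++) {
    ssize_t nread = connRead(conn, buf, sizeof(buf));
    if (nread <= 0) break; /* no more pending data, or an error */
    processReplicationData(buf, nread); /* hypothetical handler */
}
```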
## Performance Impact ##
Test setup:
1. Bombard the replica with expensive commands, leading to high CPU
utilization
2. Write to the main database to trigger replication traffic
| Metric | Before (repl-flow-control Disabled) | After (repl-flow-control Enabled) |
| -- | -- | -- |
| Throughput (requests/sec) | 941.71 | 760.98 |
| Avg Latency (ms) | 52.865 | 65.534 |
| p50 Latency (ms) | 59.743 | 68.543 |
| p95 Latency (ms) | 79.231 | 106.687 |
| p99 Latency (ms) | 90.303 | 126.527 |
| Max Latency (ms) | 188.031 | 385.535 |
- Replication stability improves, no full syncs were observed after
enabling flow control.
- Higher latency for normal clients due to increased resource allocation
for replication.
- CPU and memory usage remain stable, with no major overhead.
- Replica throughput slightly decreases as replication takes priority.
Implements https://github.com/valkey-io/valkey/issues/1596
---------
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
daea05b1e2/src/networking.c (L886-L886)
To fix the issue, we need to ensure that the subtraction `prev->size -
prev->used` does not underflow. This can be achieved by explicitly
checking that `prev->used` is less than `prev->size` before performing
the subtraction. This approach avoids relying on unsigned arithmetic and
keeps the logic clear and robust.
The specific changes are:
1. Replace the condition `prev->size - prev->used > 0` with `prev->used
< prev->size`.
2. This change ensures that the logic checks whether there is remaining
space in the buffer without risking underflow.
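The change in a nutshell (a sketch; the fields are assumed to be unsigned, per the CWE-191 reference below):
```c
/* Before: relies on unsigned subtraction never going negative. */
if (prev->size - prev->used > 0) { /* ... append into prev ... */ }

/* After: a plain comparison that cannot underflow. */
if (prev->used < prev->size) { /* ... append into prev ... */ }
```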
**References**
[INT02-C. Understand integer conversion
rules](https://wiki.sei.cmu.edu/confluence/display/c/INT02-C.+Understand+integer+conversion+rules)
[CWE-191](https://cwe.mitre.org/data/definitions/191.html)
---
Signed-off-by: Zeroday BYTE <github@zerodaysec.org>
Two dicts are converted to hashtables:
1. On each client, the set of channels/patterns/shard-channels the
client is subscribed to
2. On each channel or pattern, the set of clients subscribed to it.
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Rain Valentine <rsg000@gmail.com>
This commit introduces a mechanism to track client authentication state
with a new `ever_authenticated` flag. It refactors client authentication
handling by adding a `clientSetUser` function that properly sets both
the `authenticated` and `ever_authenticated` flags.
The implementation limits output buffer size for clients that have never
been authenticated.
Added tests to verify the output buffer limiting behavior for
unauthenticated clients.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: uriyage <78144248+uriyage@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This change enhances user experience and consistency by allowing a
module to block a client on keyspace event notifications. Consistency is
improved by allowing that reads after writes on the same connection
yield expected results. For example, in ValkeySearch, mutations
processed earlier on the same connection will be available for search.
The implementation extends `VM_BlockClient` to support blocking clients
on keyspace event notifications. Internal clients, LUA clients, clients
issueing multi exec and those with the `deny_blocking` flag set are not
blocked. Once blocked, a client’s reply is withheld until it is
explicitly unblocked.
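An illustrative sketch (the callback shape and helper names are assumptions) of a module blocking the notifying client until its mutation has been processed:
```c
int onKeyspaceEvent(ValkeyModuleCtx *ctx, int type, const char *event,
                    ValkeyModuleString *key) {
    /* Withhold the client's reply until the mutation is indexed. */
    ValkeyModuleBlockedClient *bc = ValkeyModule_BlockClient(
        ctx, onReply, onTimeout, NULL, 1000 /* timeout ms */);
    scheduleAsyncIndexing(key, bc); /* hypothetical; must eventually call
                                     * ValkeyModule_UnblockClient(bc, ...) */
    return VALKEYMODULE_OK;
}
```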
---------
Signed-off-by: yairgott <yairgott@gmail.com>
In this PR, we introduce support for new filters for `CLIENT
LIST` and `CLIENT KILL` commands. The new filters are:
1. FLAGS `Client must include this flag. This can be a string with a
bunch of flags present one after the other.`
2. NAME `client name`
3. IDLE `minimum idle time of the client`
4. LIB-NAME `clients with the specified lib name.`
5. LIB-VER `clients with the specified lib version.`
6. DB `clients currently operating on the specified database ID`
7. IP `client ip address`
8. CAPA `client capabilities`
Partly Addresses: #668
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
In this code logic:
https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L2767-L2773,
`c->querybuf + c->qb_pos` may also include user data.
Update the log message when config `hide-user-data-from-log` is enabled.
---------
Signed-off-by: VanessaTang <yuetan@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
The current assertion introduced in #11220:
```c
serverAssert(&c->clients_pending_write_node.next != NULL || &c->clients_pending_write_node.prev != NULL);
```
is incorrect for two reasons:
1. Using &pointer.next would always be non-NULL since it's the address
of the field.
2. The check is incorrect even without the & because in a single-node
list, both pointers can be NULL.
Fix:
1. Remove the always-true assertion
2. Add proper assertions in listUnlinkNode to ensure the node's
membership in the list, covering all list cases.
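A sketch of what such membership assertions could look like (illustrative; the actual checks may differ):
```c
void listUnlinkNode(list *list, listNode *node) {
    /* A detached prev/next pointer is only legal for the head/tail. */
    serverAssert(node->prev != NULL || list->head == node);
    serverAssert(node->next != NULL || list->tail == node);
    /* ... unlink the node ... */
}
```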
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Binbin <binloveplay1314@qq.com>
The metric `net_input_bytes_curr_cmd` is now computed by aggregating its components separately.
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
The special deferred reply is ignored in the current command's
`net_output_bytes_curr_cmd` counting.
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
After #1405, `client trackinginfo` crashes when tracking is off:
```
/lib64/libpthread.so.0(+0xf630)[0x7fab74931630]
./src/valkey-server *:6379(clientTrackingInfoCommand+0x12b)[0x57f8db]
./src/valkey-server *:6379(call+0x5ba)[0x5a791a]
./src/valkey-server *:6379(processCommand+0x968)[0x5a8938]
./src/valkey-server *:6379(processInputBuffer+0x18d)[0x58381d]
./src/valkey-server *:6379(readQueryFromClient+0x59)[0x585ea9]
./src/valkey-server *:6379[0x46fa4d]
./src/valkey-server *:6379(aeMain+0x89)[0x5bf3e9]
./src/valkey-server *:6379(main+0x4e1)[0x455821]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fab74576555]
./src/valkey-server *:6379[0x4564f2]
```
The reason is that we do not initialize pubsub_data by default; we only
initialize it when tracking is turned on.
Fixes #1683.
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
This PR offloads the write to replica clients to IO threads.
## Main Changes
* Replica writes will be offloaded, but only after the replica is in
online mode.
* Replica reads will still be done in the main thread to reduce
complexity and because read traffic from replicas is negligible.
### Implementation Details
In order to offload the writes, `writeToReplica` has been split into 2
parts:
1. The write itself made by the IO thread or by the main thread
2. The post write where we update the replication buffers refcount will
be done in the main-thread after the write-job is done in the IO thread
(similar to what we do with a regular client)
### Additional Changes
* In `writeToReplica` we now use `writev` in case more than one buffer
exists (see the sketch after this list).
* Changed client `nwritten` field to `ssize_t` since with a replica the
`nwritten` can theoretically exceed `int` size (not subject to
`NET_MAX_WRITES_PER_EVENT` limit).
* Changed parsing code to use `memchr` instead of `strchr`:
* While parsing a command, ASAN got stuck for an unknown reason when
`strchr` was called to look for the next `\r`
* Adding an assert for a null-terminated querybuf didn't resolve the issue
* Switched to `memchr`, as it's more secure and resolves the issue
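A sketch of the gathered write (illustrative, not the actual `writeToReplica` code; buffer names are placeholders):
```c
#include <sys/uio.h>

/* Send two replication buffers with one syscall instead of two. */
struct iovec iov[2];
iov[0].iov_base = firstbuf;  iov[0].iov_len = firstlen;
iov[1].iov_base = secondbuf; iov[1].iov_len = secondlen;
ssize_t nwritten = writev(fd, iov, 2);
```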
### Testing
* Added integration tests
* Added unit tests
**Related issue:** #761
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
In #1519, we added paused_actions and paused_timeout_milliseconds;
it would be helpful to also add paused_purpose, since users
want to know the purpose of the pause.
Currently available options:
- client_pause: triggered by the CLIENT PAUSE command.
- shutdown_in_progress: during shutdown, the primary waits for the
replicas to catch up on the offset.
- failover_in_progress: during failover, the primary waits for the
replica to catch up on the offset.
- none
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
As discussed in PR #336.
We have different types of resources like CPU, memory, network, etc. The
`slowlog` can only record commands that consume lots of CPU during the
processing phase (it doesn't include read/write network time), but it
cannot record commands that consume too much memory or network
bandwidth. For example:
1. Running a "SET key value(10 megabytes)" command would not be recorded
in the slowlog, since when processing it the SET command only inserts
the value's pointer into the db dict. But that command consumes huge
memory in the query buffer and bandwidth from the network. In this case,
just 1000 TPS can cause 10GB/s of network flow.
2. Running a "GET key" command where the key's value length is 10
megabytes consumes huge memory in the output buffer and bandwidth to the
network.
This PR introduces a new command `COMMANDLOG`, to log commands that
consume significant network bandwidth, including both input and output.
Users can retrieve the results using `COMMANDLOG get <count>
large-request` and `COMMANDLOG get <count> large-reply`, all subcommands
for `COMMANDLOG` are:
* `COMMANDLOG HELP`
* `COMMANDLOG GET <count> <slow|large-request|large-reply>`
* `COMMANDLOG LEN <slow|large-request|large-reply>`
* `COMMANDLOG RESET <slow|large-request|large-reply>`
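For example (illustrative):
```
127.0.0.1:6379> COMMANDLOG GET 10 large-reply
127.0.0.1:6379> COMMANDLOG LEN slow
127.0.0.1:6379> COMMANDLOG RESET large-request
```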
And the slowlog is also incorporated into the commandlog.
For each of these three types, additional configs have been added for
control:
* `commandlog-request-larger-than` and
`commandlog-large-request-max-len` represent the threshold for large
requests (the unit is bytes) and the maximum number of commands that can
be recorded.
* `commandlog-reply-larger-than` and `commandlog-large-reply-max-len`
represent the threshold for large replies (the unit is bytes) and the
maximum number of commands that can be recorded.
* `commandlog-execution-slower-than` and
`commandlog-slow-execution-max-len` represent the threshold for slow
executions (the unit is microseconds) and the maximum number of commands
that can be recorded.
* Additionally, `slowlog-log-slower-than` and `slowlog-max-len` are now
set as aliases for these two new configs.
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Adds filter options to CLIENT LIST:
* USER <username>
Return clients authenticated by <username>.
* ADDR <ip:port>
Return clients connected from the specified address.
* LADDR <ip:port>
Return clients connected to the specified local address.
* SKIPME (YES|NO)
Exclude the current client from the list (default: no).
* MAXAGE <maxage>
Only list connections older than the specified age.
Modifies the ID filter to CLIENT KILL to allow multiple IDs
* ID <client-id> [<client-id>...]
Kill connections by client ids.
This makes CLIENT LIST and CLIENT KILL accept the same options.
For backward compatibility, the default value for SKIPME is NO for
CLIENT LIST and YES for CLIENT KILL.
The MAXAGE option comes from CLIENT KILL, where it *keeps* clients
younger than the given max age and kills the older ones. This logic
becomes weird for CLIENT LIST, but is kept for similarity with CLIENT
KILL, for the use case of first testing manually using CLIENT LIST, and
then running CLIENT KILL with the same filters.
The `ID client-id [client-id ...]` no longer needs to be the last
filter. The parsing logic determines if an argument is an ID or not
based on whether it can be parsed as an integer or not.
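For example (illustrative):
```
CLIENT LIST USER default MAXAGE 60 SKIPME yes
CLIENT KILL ID 3 5 8 LADDR 127.0.0.1:6379
```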
Partly addresses: #668
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Add `paused_actions` and `paused_timeout_milliseconds` to INFO Clients
to inform users whether clients are paused.
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
# Refactor client structure to use modular data components
## Current State
The client structure allocates memory for replication / pubsub /
multi-keys / module / blocked data for every client, despite these
features being used by only a small subset of clients. In addition the
current field layout in the client struct is suboptimal, with poor
alignment and unnecessary padding between fields, leading to a larger
than necessary memory footprint of 896 bytes per client. Furthermore,
fields that are frequently accessed together during operations are
scattered throughout the struct, resulting in poor cache locality.
## This PR's Change
1. Lazy Initialization
- **Components are only allocated when first used:**
- PubSubData: Created on first SUBSCRIBE/PUBLISH operation
- ReplicationData: Initialized only for replica connections
- ModuleData: Allocated when module interaction begins
- BlockingState: Created when first blocking command is issued
- MultiState: Initialized on MULTI command
2. Memory Layout Optimization:
- Grouped related fields for better locality
- Moved rarely accessed fields (e.g., client->name) to struct end
- Optimized field alignment to eliminate padding
3. Additional changes:
- Moved watched_keys to be statically allocated in the `mstate` struct
- Relocated replication init logic to replication.c
### Key Benefits
- **Efficient Memory Usage:**
- 45% smaller base client structure - Basic clients now use 528 bytes
(down from 896).
- Better memory locality for related operations
- Performance improvement in high throughput scenarios. No performance
regressions in other cases.
### Performance Impact
Tested with 650 clients and 512 bytes values.
#### Single Thread Performance
| Operation | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|------------|---------|---------------|---------------|-----------|
| SET | 1 key | 261,799 | 258,261 | +1.37% |
| SET | 3M keys | 209,134 | ~209,000 | ~0% |
| GET | 1 key | 281,564 | 277,965 | +1.29% |
| GET | 3M keys | 231,158 | 228,410 | +1.20% |
#### 8 IO Threads Performance
| Operation | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|------------|---------|---------------|---------------|-----------|
| SET | 1 key | 1,331,578 | 1,331,626 | -0.00% |
| SET | 3M keys | 1,254,441 | 1,152,645 | +8.83% |
| GET | 1 key | 1,293,149 | 1,289,503 | +0.28% |
| GET | 3M keys | 1,152,898 | 1,101,791 | +4.64% |
#### Pipeline Performance (3M keys)
| Operation | Pipeline Size | New (ops/sec) | Old (ops/sec) | Change % |
|-----------|--------------|---------------|---------------|-----------|
| SET | 10 | 548,964 | 538,498 | +1.94% |
| SET | 20 | 606,148 | 594,872 | +1.89% |
| SET | 30 | 631,122 | 616,606 | +2.35% |
| GET | 10 | 628,482 | 624,166 | +0.69% |
| GET | 20 | 687,371 | 681,659 | +0.84% |
| GET | 30 | 725,855 | 721,102 | +0.66% |
### Observations:
1. Single-threaded operations show consistent improvements (1-1.4%)
2. Multi-threaded performance shows significant gains for large
datasets:
- SET with 3M keys: +8.83% improvement
- GET with 3M keys: +4.64% improvement
3. Pipeline operations show consistent improvements:
- SET operations: +1.89% to +2.35%
- GET operations: +0.66% to +0.84%
4. No performance regressions observed in any test scenario
Related issue: https://github.com/valkey-io/valkey/issues/761
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
It's inconvenient for client implementations to extract the
`availability_zone` information from the `INFO` response. The `INFO`
response contains a lot of information that a client implementation
typically doesn't need.
This PR adds the availability zone to the `HELLO` response. Clients
usually already use the `HELLO` command for protocol negotiation and
also get the server `version` and `role` from its response. To keep the
`HELLO` response small, the field is only added if availability zone is
configured.
---------
Signed-off-by: Rueian <rueiancsie@gmail.com>