After introducing dual channel replication in #60, we decided in #915
not to add a new configuration item to limit the replica's local replication
buffer, but to just use "client-output-buffer-limit replica hard" to limit it.
We need to document this behavior and mention that once the limit is reached,
all future data will accumulate on the primary side.
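For reference, this is the existing directive that now also bounds the
replica's local replication buffer (the values shown are the usual defaults
and are only illustrative):
```
# class hard-limit soft-limit soft-seconds
client-output-buffer-limit replica 256mb 64mb 60
```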
Signed-off-by: Binbin <binloveplay1314@qq.com>
Introduce a new config `hash-seed` which can be set only at startup and
controls the hash seed for the server. This includes all hash tables.
This change makes it so that both primaries and replicas will return the
same results for SCAN/HSCAN/ZSCAN/SSCAN cursors. This is useful in order
to make sure SCAN behaves correctly after a failover.
Resolves #4
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
- Moved `server-cpulist`, `bio-cpulist`, `aof-rewrite-cpulist`,
`bgsave-cpulist` configurations to ADVANCED CONFIG.
- Moved `ignore-warnings` configuration to ADVANCED CONFIG.
- Moved `availability-zone` configuration to GENERAL.
These configs were incorrectly placed at the end of the file in the
ACTIVE DEFRAGMENTATION section.
Fixes #2736
---------
Signed-off-by: ritoban23 <ankudutt101@gmail.com>
When we added atomic slot migration in #1949, we reused a lot of the rdb save
code. That was an easier way to implement ASM the first time around, but it
comes with some side effects. For example, we use CHILD_TYPE_RDB to do the
fork and rdb.c/rdb.h functions to save the snapshot; these mess up the logs
(we print logs saying we are doing RDB work) and the info fields (we report
rdb_bgsave_in_progress while we are actually doing slot migration).
In addition, it makes the code difficult to maintain. The rdb_save method uses
a lot of rdb_* variables, but we are actually doing slot migration. If we want
to support one fork with multiple target nodes, we need to rewrite this code
for a cleaner design.
Note that the changes to rdb.c/rdb.h revert previous changes from when
we were reusing this code for slot migration. The slot migration snapshot logic
is similar to the previous diskless replication. We use a pipe to transfer the
snapshot data from the child process to the parent process.
Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in CLUSTER GETSLOTMIGRATIONS command.
- Also add the slot migration fork to the cluster-class latency tracking.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Co-authored-by: Jacob Murphy <jkmurphy@google.com>
New config options:
* cluster-announce-client-port
* cluster-announce-client-tls-port
If enabled, clients will always get to see the configured port for a
node instead of the internally announced port(s), the same way that
`cluster-announce-client-ipv4` and `cluster-announce-client-ipv6` work.
Cluster-internal communication uses the non-client variant of these
options.
The configuration is propagated throughout the cluster using new ping
extensions.
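A hypothetical valkey.conf sketch (the port numbers are illustrative):
```
# port(s) reported to clients; cluster-internal communication keeps using
# the non-client announce ports
cluster-announce-client-port 7000
cluster-announce-client-tls-port 7443
```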
Closes #2377
---------
Signed-off-by: Marvin Rösch <marvinroesch99@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Instead of parsing only one command per client before executing it,
parse multiple commands from the query buffer and batch-prefetch the
keys accessed by the commands in the queue before executing them.
This is an optimization for pipelined commands, both with and without
I/O threads. The optimization is currently disabled for the replication
stream, due to failures (probably caused by how the replication offset
is calculated based on the query buffer offset).
* When parsing commands from the input buffer, multiple commands are
parsed and stored in a command queue per client.
* In single-threaded mode (I/O threads off) keys are batch-prefetched
before the commands in the queue are executed. Multi-key commands like
MGET, MSET and DEL benefit from this even if pipelining is not used.
* Prefetching when I/O threads are used does prefetching for multiple
clients in parallel. This code takes client command queues into account,
improving prefetching when pipelining is used. The batch size is
controlled by the existing config `prefetch-batch-max-size` (default
16), which so far only was used together with I/O threads. The config is
moved to a different section in `valkey.conf`.
* When I/O threads are used and the maximum number of keys has been
prefetched, a client's command is executed, then the next one in the
queue, etc. If there are more commands in the queue for which the keys
have not been prefetched (say the client sends 16 pipelined MGETs with 16
keys each), keys for the next few commands in the queue are prefetched
before each command is executed, if prefetching has not yet been done for
the next command. (This utilizes the code path used in single-threaded
mode.)
Code improvements:
* Decoupling of command parser state and command execution state:
* The variables reqtype, multibulklen and bulklen refer to the current
position in the query buffer. These are no longer reset in resetClient
(which runs after each command is executed). Instead, they are
reset in the parser code after each completely parsed command.
* The command parser code is partially decoupled from the client struct.
The query buffer is still one per client, but the resulting argument
vector is stored in caller-defined variables.
Fixes #2044
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Introduces a new family of commands for migrating slots via replication.
The procedure is driven by the source node which pushes an AOF formatted
snapshot of the slots to the target, followed by a replication stream of
changes on that slot (a la manual failover).
This solution is an adaptation of the solution provided by
@enjoy-binbin, combined with the solution I previously posted at #1591,
modified to meet the designs we had outlined in #23.
## New commands
* `CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE
node-id`: Begin sending the slot via replication to the target. Multiple
targets can be specified by repeating `SLOTSRANGE ... NODE ...`
* `CLUSTER CANCELMIGRATION ALL`: Cancel all slot migrations
* `CLUSTER GETSLOTMIGRATIONS`: See a recent log of migrations
This PR only implements "one shot" semantics with an asynchronous model.
Later, "two phase" (e.g. slot level replicate/failover commands) can be
added with the same core.
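A hypothetical usage sketch of the new commands (node id and slot ranges are
illustrative):
```
# start migrating slots 0-100 and 200-300 to the given target node
CLUSTER MIGRATESLOTS SLOTSRANGE 0 100 200 300 NODE 07c37dfeb235213a872192d90877d0cd55635b91
# inspect recent and active migration jobs
CLUSTER GETSLOTMIGRATIONS
# abort everything
CLUSTER CANCELMIGRATION ALL
```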
## Slot migration jobs
Introduces the concept of a slot migration job. While active, a job
tracks a connection created by the source to the target over which the
contents of the slots are sent. This connection is used for control
messages as well as replicated slot data. Each job is given a 40
character random name to help uniquely identify it.
All jobs, including those that finished recently, can be observed using
the `CLUSTER GETSLOTMIGRATIONS` command.
## Replication
* Since the snapshot uses AOF, the snapshot can be replayed verbatim to
any replicas of the target node.
* We use the same proxying mechanism used for chaining replication to
copy the content sent by the source node directly to the replica nodes.
## `CLUSTER SYNCSLOTS`
To coordinate the state machine transitions across the two nodes, a new
command is added, `CLUSTER SYNCSLOTS`, that performs this control flow.
Each end of the slot migration connection is expected to install a read
handler in order to handle `CLUSTER SYNCSLOTS` commands:
* `ESTABLISH`: Begins a slot migration. Provides slot migration
information to the target and authorizes the connection to write to
unowned slots.
* `SNAPSHOT-EOF`: appended to the end of the snapshot to signal that the
snapshot is done being written to the target.
* `PAUSE`: informs the source node to pause whenever it gets the
opportunity
* `PAUSED`: added to the end of the client output buffer when the pause
is performed. The pause is only performed after the buffer shrinks below
a configurable size
* `REQUEST-FAILOVER`: request the source to either grant or deny a
failover for the slot migration. The request is only granted if the target
is still paused. Once a failover is granted, the pause is refreshed for
a short duration
* `FAILOVER-GRANTED`: sent to the target to inform that REQUEST-FAILOVER
is granted
* `ACK`: heartbeat command used to ensure liveness
## Interactions with other commands
* FLUSHDB on the source node (which flushes the migrating slot) will
result in the source dropping the connection, which will flush the slot
on the target and reset the state machine back to the beginning. The
subsequent retry should very quickly succeed (it is now empty)
* FLUSHDB on the target will fail the slot migration. We can iterate
with better handling, but for now it is expected that the operator would
retry.
* Generally, FLUSHDB is expected to be executed cluster-wide, so
preserving partially migrated slots doesn't make much sense
* SCAN and KEYS are filtered to avoid exposing importing slot data
## Error handling
* For any transient connection drops, the migration will be failed and
require the user to retry.
* If there is an OOM while reading from the import connection, we will
fail the import, which will drop the importing slot data
* If there is a client output buffer limit reached on the source node,
it will drop the connection, which will cause the migration to fail
* If at any point the export loses ownership or either node is failed
over, a callback will be triggered on both ends of the migration to fail
the import. The import will not reattempt with a new owner
* The two ends of the migration are routinely pinging each other with
SYNCSLOTS ACK messages. If at any point there is no interaction on the
connection for longer than `repl-timeout`, the connection will be
dropped, resulting in migration failure
* If a failover happens, we will drop keys in all unowned slots. The
migration does not persist through failovers and would need to be
retried on the new source/target.
## State machine
```
Target/Importing Node State Machine
─────────────────────────────────────────────────────────────
┌────────────────────┐
│SLOT_IMPORT_WAIT_ACK┼──────┐
└──────────┬─────────┘ │
ACK│ │
┌──────────────▼─────────────┐ │
│SLOT_IMPORT_RECEIVE_SNAPSHOT┼──┤
└──────────────┬─────────────┘ │
SNAPSHOT-EOF│ │
┌───────────────▼──────────────┐ │
│SLOT_IMPORT_WAITING_FOR_PAUSED┼─┤
└───────────────┬──────────────┘ │
PAUSED│ │
┌───────────────▼──────────────┐ │ Error Conditions:
│SLOT_IMPORT_FAILOVER_REQUESTED┼─┤ 1. OOM
└───────────────┬──────────────┘ │ 2. Slot Ownership Change
FAILOVER-GRANTED│ │ 3. Demotion to replica
┌──────────────▼─────────────┐ │ 4. FLUSHDB
│SLOT_IMPORT_FAILOVER_GRANTED┼──┤ 5. Connection Lost
└──────────────┬─────────────┘ │ 6. No ACK from source (timeout)
Takeover Performed│ │
┌──────────────▼───────────┐ │
│SLOT_MIGRATION_JOB_SUCCESS┼────┤
└──────────────────────────┘ │
│
┌─────────────────────────────────────▼─┐
│SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP│
└────────────────────┬──────────────────┘
Unowned Slots Cleaned Up│
┌─────────────▼───────────┐
│SLOT_MIGRATION_JOB_FAILED│
└─────────────────────────┘
Source/Exporting Node State Machine
─────────────────────────────────────────────────────────────
┌──────────────────────┐
│SLOT_EXPORT_CONNECTING├─────────┐
└───────────┬──────────┘ │
Connected│ │
┌─────────────▼────────────┐ │
│SLOT_EXPORT_AUTHENTICATING┼───────┤
└─────────────┬────────────┘ │
Authenticated│ │
┌─────────────▼────────────┐ │
│SLOT_EXPORT_SEND_ESTABLISH┼───────┤
└─────────────┬────────────┘ │
ESTABLISH command written│ │
┌─────────────────────▼─────────────┐ │
│SLOT_EXPORT_READ_ESTABLISH_RESPONSE┼──────┤
└─────────────────────┬─────────────┘ │
Full response read (+OK)│ │
┌────────────────▼──────────────┐ │ Error Conditions:
│SLOT_EXPORT_WAITING_TO_SNAPSHOT┼─────┤ 1. User sends CANCELMIGRATION
└────────────────┬──────────────┘ │ 2. Slot ownership change
No other child process│ │ 3. Demotion to replica
┌────────────▼───────────┐ │ 4. FLUSHDB
│SLOT_EXPORT_SNAPSHOTTING┼────────┤ 5. Connection Lost
└────────────┬───────────┘ │ 6. AUTH failed
Snapshot done│ │ 7. ERR from ESTABLISH command
┌───────────▼─────────┐ │ 8. Unpaused before failover completed
│SLOT_EXPORT_STREAMING┼──────────┤ 9. Snapshot failed (e.g. Child OOM)
└───────────┬─────────┘ │ 10. No ack from target (timeout)
PAUSE│ │ 11. Client output buffer overrun
┌──────────────▼─────────────┐ │
│SLOT_EXPORT_WAITING_TO_PAUSE┼──────┤
└──────────────┬─────────────┘ │
Buffer drained│ │
┌──────────────▼────────────┐ │
│SLOT_EXPORT_FAILOVER_PAUSED┼───────┤
└──────────────┬────────────┘ │
Failover request granted│ │
┌───────────────▼────────────┐ │
│SLOT_EXPORT_FAILOVER_GRANTED┼───────┤
└───────────────┬────────────┘ │
New topology received│ │
┌──────────────▼───────────┐ │
│SLOT_MIGRATION_JOB_SUCCESS│ │
└──────────────────────────┘ │
│
┌─────────────────────────┐ │
│SLOT_MIGRATION_JOB_FAILED│◄────────┤
└─────────────────────────┘ │
│
┌────────────────────────────┐ │
│SLOT_MIGRATION_JOB_CANCELLED│◄──────┘
└────────────────────────────┘
```
Co-authored-by: Binbin <binloveplay1314@qq.com>
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
In #1091, a new config `auto-failover-on-shutdown` was added. This PR
changes the config to make it unified with other shutdown related
options. This feature has not yet been released, so it's not a breaking
change.
The auto-failover-on-shutdown config is replaced by
* A new "failover" option to the existing configs `shutdown-on-sigterm`
and `shutdown-on-sigint`.
* A new FAILOVER option to the SHUTDOWN command.
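A hedged sketch of what this looks like in valkey.conf:
```
# also trigger a failover to a replica as part of a graceful shutdown
shutdown-on-sigterm failover
shutdown-on-sigint failover
```
The same behavior can be requested ad hoc with `SHUTDOWN FAILOVER`.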
Additionally, a history entry is added to the SHUTDOWN command which was
missing in #2195.
Follow-up of #1091.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This PR implements support for automatic client authentication based on
a field in the client's TLS certificate.
API Changes:
* New configuration directive `tls-auth-clients-user`, values `CN` |
`off`, default `off`. CN means take username from the CommonName field
in the client's certificate.
* New INFO field `acl_access_denied_tls_cert` under the `Stats` section,
indicating the number of failed authentications using this feature, i.e.
client certificates for which no matching username was found.
* New reason "tls-cert" in the ACL log, logged when a client
certificate's CommonName fails to match any existing username.
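For illustration, enabling the feature in valkey.conf would look roughly like
this:
```
# take the username from the CommonName field of the client certificate
tls-auth-clients-user CN
```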
Closes #1866
---------
Signed-off-by: Omkar Mestry <om.m.mestry@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Omkar Mestry <omanges@google.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Add a SAFE option to SHUTDOWN. If SAFE is passed, SHUTDOWN will refuse
to shut down when it is not safe to do so, for example if the node is a
voting primary. This avoids the situation where a replica suddenly
becomes the primary when we shut down the replica, or where we shut down
a primary node by mistake and then cause the cluster to go down.
Add SAFE option to the SHUTDOWN command.
Add safe option to shutdown-on-sigint and shutdown-on-sigterm.
Note that SAFE cannot prevent FORCE; in the case of FORCE, SAFE
will print the relevant logs and the forced shutdown proceeds. We allow
this combination.
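A hedged sketch of the resulting usage:
```
# refuse unsafe shutdowns triggered by signals
shutdown-on-sigterm safe
shutdown-on-sigint safe
```
At the command level, `SHUTDOWN SAFE` behaves the same way, while combining it
with FORCE still forces the shutdown as described above.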
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Earlier we described the `slowlog-log-slower-than` configuration
option like this, explicitly mentioning the special meaning of
negative numbers (actually -1) and 0:
```
The following time is expressed in microseconds, so 1000000 is equivalent
to one second. Note that a negative number disables the slow log, while
a value of zero forces the logging of every command.
slowlog-log-slower-than 10000
```
After #1294 we lost this text. We need to mention these values again,
and we can mention the special values for all of the command log configs;
it seems we should also mention `slowlog-max-len`.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## cluster: add multi-database support in cluster mode
Add multi-database support in cluster mode to align with standalone mode
and facilitate migration. Previously, cluster mode was restricted to a
single database (DB0). This change allows multiple databases while
preserving the existing slot-based key distribution.
### Key Features:
- Database-Agnostic Hashing. The hashing algorithm is unchanged.
Identical keys always map to the same slot across all databases,
ensuring consistent key distribution and compatibility with
existing single-database setups.
- Multi-DB commands support. SELECT, MOVE, and COPY are now supported in
cluster mode.
- Fully backward compatible with no API changes.
- SWAPDB is not supported in cluster mode. It is unsafe due to
inconsistency risks.
### Command-Level Changes:
- SELECT / MOVE / COPY are now supported in cluster mode.
- MOVE / COPY (with db) are rejected (TRYAGAIN error) during slot
migration to prevent multi-DB inconsistencies.
- SWAPDB will return an error if used when cluster mode is enabled.
- GETKEYSINSLOT, COUNTKEYSINSLOT and MIGRATE will operate in the context
of the selected database.
This means, for example, that migrating keys in a slot will require
iterating over all databases and repeating the operation for each.
### Slot Migration Process:
- Multi-DB support in cluster mode affects slot migration. Operators
should now iterate over all the configured databases.
### Transaction Handling (MULTI/EXEC):
- getNodeByQuery key lookup behavior changed:
- No key lookups when queuing commands in MULTI, only cross-slot
validation.
- Key lookups happen at EXEC time.
- SELECT inside MULTI/EXEC is now checked, ensuring key validation
uses the selected DB at lookup.
### Valkey-cli:
- valkey-cli has been updated to support resharding across all
databases.
### Configuration:
- Introduce new configuration `cluster-databases`.
The new configuration controls the maximal number of databases in
cluster mode.
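For illustration (the value here is hypothetical; only the directive name
comes from this change):
```
# allow up to 4 logical databases in cluster mode
cluster-databases 4
```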
Implements https://github.com/valkey-io/valkey/issues/1319
---------
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Allow replicas to use MPTCP in the outgoing replication connection.
A new yes/no config is introduced `repl-mptcp`, default `no`.
For MPTCP to be used in replication, the primary needs to be configured
with `mptcp yes` and the replica with `repl-mptcp yes`. Otherwise, the
connection falls back to regular TCP.
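Putting the two directives together, as described above:
```
# on the primary (listening side)
mptcp yes
# on the replica (outgoing replication connection)
repl-mptcp yes
```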
Follow-up of #1811.
---------
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Multipath TCP (MPTCP) is an extension of the standard TCP protocol that
allows a single transport connection to use multiple network interfaces
or paths. MPTCP is useful for applications like bandwidth aggregation,
failover, and more resilient connections.
The Linux kernel has supported MPTCP since v5.6, so it's time for us to
support it.
The test report shows that MPTCP reduces latency by ~25% in a 1%
networking packet drop environment.
Thanks to Matthieu Baerts <matttbe@kernel.org> for lots of review
suggestions.
Proposed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Cc Linux kernel MPTCP maintainer @matttbe
When a primary disappears, its slots are not served until an automatic
failover happens. That takes about n seconds (node timeout plus some
seconds), which is too long to not accept writes.
If the host machine is about to shut down for any reason, the processes
typically get a SIGTERM and have some time to shut down gracefully. In
Kubernetes, this is 30 seconds by default.
When a primary receives a SIGTERM or a SHUTDOWN, let it trigger a
failover to one of the replicas as part of the graceful shutdown. This
can reduce some unavailability time. For example, the replica normally
needs to sense the primary failure within the node-timeout before
initiating an election; now it can initiate an election quickly, win it,
and gossip the result.
The primary does this by sending a CLUSTER FAILOVER command to the
replica.
We added a REPLICAID arg to CLUSTER FAILOVER. After receiving the
command, the replica checks whether the node-id is its own; if not, the
command is ignored. The node-id is set by the replica through client
setname during the replication handshake.
### New argument for CLUSTER FAILOVER
The format now becomes CLUSTER FAILOVER [FORCE|TAKEOVER] [REPLICAID
node-id]. This arg is not intended for user use, so it will not be added
to the JSON file.
### Replica sends REPLCONF SET-CLUSTER-NODE-ID to inform its node-id
During the replication handshake, the replica now uses REPLCONF
SET-CLUSTER-NODE-ID to inform the primary of its node-id.
### Primary issues CLUSTER FAILOVER
The primary sends CLUSTER FAILOVER FORCE REPLICAID node-id to all
replicas, because it is a shared replication buffer, but only the
replica with the matching id will execute it.
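A sketch of the propagated command (the node-id is purely illustrative):
```
CLUSTER FAILOVER FORCE REPLICAID 07c37dfeb235213a872192d90877d0cd55635b91
```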
### Add a new auto-failover-on-shutdown config
People can disable this feature if they don't like it; the default is 0.
This closes #939.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Allows cluster admins to configure the cluster manual failover timeout
as needed. Admins can configure how long a primary would be paused in the
worst-case scenario, such as a failover that timed out due to insufficient
votes.
The configuration name is cluster-manual-failover-timeout, the unit is
milliseconds, and the range is [1, INT_MAX] ms.
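For illustration (the value is just an example within the documented range):
```
# pause the primary for at most 5 seconds during a manual failover
cluster-manual-failover-timeout 5000
```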
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
The search for expired keys happens without user action. Therefore, the
word "interactively" in the description of the active-expire-key
parameter is confusing and is changed to "incrementally."
modified: valkey.conf
---------
Signed-off-by: Anastasia Alexadrova <anastasia.alexandrova@percona.com>
We added it to the HELLO response in #1487.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The `serverCron()` function contains a variety of maintenance functions
and is set up as a timer job, configured to run at a certain rate (hz).
The default rate is 10hz (every 100ms).
One of the things that `serverCron()` does is to perform maintenance
functions on connected clients. Since the number of clients is variable,
and can be very large, this could cause latency spikes when the 100ms
`serverCron()` task gets invoked. To combat those latency spikes, a
feature called "dynamic-hz" was introduced. This feature will run
`serverCron()` more often, if there are more clients. The clients get
processed up to 200 at a time. The delay for `serverCron()` is shortened
with the goal of processing all of the clients every second.
The result of this is that some of the other (non-client) maintenance
functions also get (unnecessarily) run more often. Like
`cronUpdateMemoryStats()` and `databasesCron()`. Logically, it doesn't
make sense to run these functions more often, just because we happen to
have more clients attached.
This PR separates client activities onto a separate, variable, timer.
The "dynamic-hz" feature is eliminated. Now, `serverCron` will run at a
standard configured rate. The separate clients cron will automatically
adjust based on the number of clients. This has the added benefit that
often, the 2 crons will fire during separate event loop invocations and
will usually avoid the combined latency impact of doing both maintenance
activities together.
The new timer follows the same rules which were established with the
dynamic HZ feature.
* The goal is to process all of the clients once per second
* We never want to process more than 200 clients in a single invocation
(`MAX_CLIENTS_PER_CLOCK_TICK`)
* We always process at least 5 clients at a time
(`CLIENTS_CRON_MIN_ITERATIONS`)
* The minimum rate is determined by HZ
The delay (ms) for the new timer is also more precise, computing the
number of milliseconds needed to achieve the goal of reaching all of the
clients every second. The old dynamic-hz feature just performs a
doubling of the HZ until the clients processing rate is achieved (i.e.
delays of 100ms, 50ms, 25ms, 12ms...)
---------
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
New config `rdb-version-check` with values:
* `strict`: Reject future RDB versions.
* `relaxed`: Try parsing future RDB versions and fail only when an
unknown RDB opcode or type is encountered.
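For example, to opt into the best-effort behavior described below:
```
rdb-version-check relaxed
```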
This can make it possible for Valkey 8.1 to try to read a dump from, for
example, Valkey 9.0 or later on a best-effort basis. The conditions for
when this is expected to work can be defined when the future Valkey
versions are released. Loading is expected to fail in the following
cases:
* If the data set contains any new key types or other data elements not
supported by the current version.
* If the RDB contains new representations or encodings of existing key
types or other data elements.
This change also prepares for the next RDB version bump. A range of RDB
versions (12-79) is reserved, since it's expected to be used by foreign
software RDB versions, so Valkey will not accept versions in this range
even with the `relaxed` version check. The DUMP/RESTORE format has no
magic string; only the RDB version number.
This change also prepares for the magic string to change from REDIS to
VALKEY next time we bump the RDB version.
Related to #1108.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
As discussed in PR #336.
We have different types of resources like CPU, memory, network, etc. The
`slowlog` can only record commands that eat lots of CPU during the
processing phase (it doesn't include read/write network time), but it
cannot record commands that consume too much memory or network. For
example:
1. Running a "SET key value(10 megabytes)" command would not be recorded
in the slowlog, since when processing it the SET command only inserts the
value's pointer into the db dict. But that command eats huge memory in
the query buffer and bandwidth from the network. In this case, just 1000
tps can cause 10GB/s of network flow.
2. Running a "GET key" command where the key's value is 10 megabytes long
can eat huge memory in the output buffer and bandwidth to the network.
This PR introduces a new command `COMMANDLOG`, to log commands that
consume significant network bandwidth, including both input and output.
Users can retrieve the results using `COMMANDLOG get <count>
large-request` and `COMMANDLOG get <count> large-reply`. All subcommands
for `COMMANDLOG` are:
* `COMMANDLOG HELP`
* `COMMANDLOG GET <count> <slow|large-request|large-reply>`
* `COMMANDLOG LEN <slow|large-request|large-reply>`
* `COMMANDLOG RESET <slow|large-request|large-reply>`
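For example, to fetch the ten most recent entries of each kind:
```
COMMANDLOG GET 10 large-request
COMMANDLOG GET 10 large-reply
COMMANDLOG GET 10 slow
```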
And the slowlog is also incorporated into the commandlog.
For each of these three types, additional configs have been added for
control:
* `commandlog-request-larger-than` and
`commandlog-large-request-max-len` represent the threshold for large
requests (the unit is bytes) and the maximum number of commands that can
be recorded.
* `commandlog-reply-larger-than` and `commandlog-large-reply-max-len`
represent the threshold for large replies (the unit is bytes) and the
maximum number of commands that can be recorded.
* `commandlog-execution-slower-than` and
`commandlog-slow-execution-max-len` represent the threshold for slow
executions (the unit is microseconds) and the maximum number of commands
that can be recorded.
* Additionally, `slowlog-log-slower-than` and `slowlog-max-len` are now
set as aliases for the two slow-execution configs.
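A hypothetical valkey.conf sketch using these directives (all values are
illustrative):
```
commandlog-request-larger-than 16384
commandlog-large-request-max-len 128
commandlog-reply-larger-than 16384
commandlog-large-reply-max-len 128
commandlog-execution-slower-than 10000
commandlog-slow-execution-max-len 128
```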
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
We've had security issues in the past with the dir config, which is why
we marked it as PROTECTED. But modifying it at runtime
is also a dangerous action. For example, when child processes
are running, persistent temp files and log files may have
unexpected effects.
A scenario for modifying dir at runtime is migrating away from a disk
failure, such as using disk-based replication to migrate a node, or
writing nodes.conf to save the cluster configuration.
We decided to leave it as is and add a note in the conf
about the dangers of modifying dir at runtime.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Refer to: https://github.com/valkey-io/valkey/issues/1141
This update refactors the defrag code to:
* Make the overall code more readable and maintainable
* Reduce latencies incurred during defrag processing
With this update, the defrag cycle time is reduced to 500us, with more
frequent cycles. This results in much more predictable latencies, with a
dramatic reduction in tail latencies.
(See https://github.com/valkey-io/valkey/issues/1141 for more complete
details.)
This update is focused mostly on the high-level processing, and does NOT
address lower level functions which aren't currently timebound (e.g.
`activeDefragSdsDict()`, and `moduleDefragGlobals()`). These are out of
scope for this update and left for a future update.
I fixed `kvstoreDictLUTDefrag` because it was using up to 7ms on a CME
single shard. See original github issue for performance details.
---------
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
There are several patches in this PR:
* Abstract the set/rewrite logic for the bind config option: `bind` is a
special config, and `socket` and `tls` use the same one. However, RDMA
uses a similar style but a different option. Add a bit of abstraction to
make it flexible for both `socket` and `RDMA`. (Even for QUIC in the
future.)
* Introduce closeListener for connection types: closing a socket is a
simple syscall, but RDMA has more complex logic. Introduce a
connection-type-specific close listener method.
* RDMA: Use valkey.conf style instead of module parameters: use the
`--rdma-bind` and `--rdma-port` style instead of module parameters (see
the snippet after this list). The module-style configs `rdma.bind` and
`rdma.port` are removed.
* RDMA: Support builtin builds: support `make BUILD_RDMA=yes`. Module
style is still kept for now.
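A hypothetical valkey.conf sketch of the new style (address and port are
illustrative):
```
rdma-bind 192.168.122.100
rdma-port 6379
```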
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
New config: `import-mode (yes|no)`
New command: `CLIENT IMPORT-SOURCE (ON|OFF)`
The config, when set to `yes`, disables eviction and deletion of expired
keys, except for commands coming from a client that has marked itself as
an import source (the data source when importing data from another node)
using the CLIENT IMPORT-SOURCE command.
When we sync data from the source Valkey to the destination Valkey using
some sync tools like
[redis-shake](https://github.com/tair-opensource/RedisShake), the
destination Valkey can perform expiration and eviction, which may cause
data corruption. This problem has been discussed in
https://github.com/redis/redis/discussions/9760#discussioncomment-1681041
and Redis already has a solution, but in Valkey we hadn't fixed it until
now.
E.g. we call `set key 1 ex 1` on the source server and transfer this
command to the destination server. Then we call `incr key` on the source
server before the key expires, so we have a key on the source server
with a value of 2. But when the command arrives at the destination
server, the key may have expired and been deleted. So we end up with a
key on the destination server with a value of 1, which is inconsistent
with the source server.
In standalone mode, we can use a writable replica to simplify the sync
process. However, in cluster mode, we still need a sync tool to help us
transfer the source data to the destination. The sync tool usually works
as a normal client, and the destination works as a primary which keeps
expiration and eviction enabled.
In this PR, we add a new mode named 'import-mode'. In this mode, the
server stops expiration and eviction just like a replica. Notice that
this mode exists only during the sync state, to avoid data inconsistency
caused by expiration and eviction. Import mode only takes effect on the
primary.
Sync tools can mark their clients as an import source with `CLIENT
IMPORT-SOURCE`, which works like a client from the primary and can visit
expired keys in `lookupKey`.
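Putting the two pieces together, a hedged sketch of the intended flow:
```
# valkey.conf on the destination primary, enabled only for the duration of the sync
import-mode yes
```
The sync tool then issues `CLIENT IMPORT-SOURCE ON` on its own connection
before replaying data, and `CLIENT IMPORT-SOURCE OFF` once the migration is
done.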
**Notice: during the migration, other clients, apart from the import
source, should not access the data imported by import source.**
---------
Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Add config options for log format and timestamp format introduced by
#1022
Related to #1225
This change adds two new configs into valkey.conf:
log-format
log-timestamp-format
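A hypothetical sketch of how they might be set (the values are illustrative
and not taken from this change):
```
log-format logfmt
log-timestamp-format iso8601
```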
---------
Signed-off-by: azuredream <zhaozixuan67@gmail.com>
Currently, if the replica has a lot of data, CLUSTER RESET
will block for a while and show up in the slowlog, and it seems
that there is no harm in making it async, so external components
have an easier time monitoring it.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Currently in conf we describe activerehashing as: Active rehashing
uses 1 millisecond every 100 milliseconds of CPU time. This is the
case for hz = 10.
If we change hz, the description in conf will be inaccurate. Users
may notice that the server spends some CPU (used in activerehashing)
at high hz but don't know why, since our cron calls are fixed to 1ms.
This PR takes hz into account and fixes the CPU usage at 1% (this may
not be accurate in some cases because we do 100-step rehashing in
dictRehashMicroseconds, but it can avoid CPU spikes in this case).
This PR also improves the description of the activerehashing
configuration item to explain this change.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
this fixes: https://github.com/valkey-io/valkey/issues/1116
_Issue details from #1116 by @zuiderkwast_
> This config is undocumented since #758. The default was changed to
"yes" and it is quite useless to set it to "no". Yet, it can happen that
some user has an old config file where it is explicitly set to "no". The
result will be bad performace, since I/O threads will not do all the
I/O.
>
> It's indeed confusing.
>
> 1. Either remove the whole option from the code. And thus no need for
documentation. _OR:_
> 2. Introduce the option back in the configuration, just as a comment
is fine. And showing the default value "yes": `# io-threads-do-reads
yes` with additional text.
>
> _Originally posted by @melroy89 in [#1019 (reply in
thread)](https://github.com/orgs/valkey-io/discussions/1019#discussioncomment-10824778)_
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
A new option for diskless replication on the replica side.
After a network failure, the replica may need to perform a full sync.
The other option for diskless full sync is `swapdb`, but it uses twice
as much memory, temporarily. In situations where this is not acceptable,
and where losing data is acceptable, the `flush-before-load` option can
be useful. If the full sync fails, the old data is lost, though. Therefore,
the new option is marked as "dangerous".
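Assuming this is exposed as a value of the existing `repl-diskless-load`
directive (an assumption; only the option name itself comes from this change),
the configuration might look like:
```
# DANGEROUS: flush the old data set before loading the incoming RDB;
# the old data is lost if the full sync then fails
repl-diskless-load flush-before-load
```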
---------
Signed-off-by: kronwerk <ca11e5e22g@gmail.com>
Signed-off-by: kronwerk <kronwerk@users.noreply.github.com>
Co-authored-by: kronwerk <ca11e5e22g@gmail.com>
Before this doc update, the comments in valkey.conf said that DEL is a
blocking command, and even referred to other synchronous freeing as "in a
blocking way, like if DEL was called". This has now become confusing and
incorrect, since DEL is now non-blocking by default.
The comments also mentioned too much about the "old default" and only
later explain that the "new default" is non-blocking.
This doc update focuses on the current default and expresses it like
"Starting from Valkey 8.0, lazy freeing is enabled by default", rather
than using words like old and new.
This is a follow-up to #913.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Implement data masking for user data in server logs and diagnostic output. This change prevents potential exposure of confidential information, such as PII, and enhances privacy protection. It masks all command arguments, client names, and client usernames.
Added a new hide-user-data-from-log configuration item, default yes.
---------
Signed-off-by: Amit Nagler <anagler123@gmail.com>
## Set replica-lazy-flush and lazyfree-lazy-user-flush to yes by
default.
There are many problems with running flush synchronously. Even in
single-CPU environments, the server should balance between
freeing memory and serving incoming requests.
## Set lazy eviction, expire, server-del, user-del to yes by default
We now have a del and a lazyfree del, and we also have these configuration
items to control them: lazyfree-lazy-eviction, lazyfree-lazy-expire,
lazyfree-lazy-server-del, lazyfree-lazy-user-del. In most cases lazyfree
is better since it reduces the risk of blocking the main thread, and
because we have lazyfreeGetFreeEffort, objects with a high effort
(currently 64) will use lazyfree.
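The resulting defaults, spelled out as valkey.conf directives:
```
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
lazyfree-lazy-user-del yes
lazyfree-lazy-user-flush yes
replica-lazy-flush yes
```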
Part of #653.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Currently, the `dual-channel-replication` feature flag is immutable if
`enable-protected-configs` is enabled, which is the default behavior.
This PR proposes to make the `dual-channel-replication` flag mutable,
allowing it to be changed dynamically without restarting the cluster.
**Motivation:**
The ability to change the `dual-channel-replication` flag dynamically is
essential for testing and validating the feature on real clusters
running in production environments. By making the flag mutable, we can
enable or disable the feature without disrupting the cluster's
operations, facilitating easier testing and experimentation.
Additionally, this change would provide more flexibility for users to
enable or disable the feature based on their specific requirements or
operational needs without requiring a cluster restart.
---------
Signed-off-by: naglera <anagler123@gmail.com>
This PR utilizes the IO threads to execute commands in batches, allowing
us to prefetch the dictionary data in advance.
After making the IO threads asynchronous and offloading more work to
them in the first 2 PRs, the `lookupKey` function becomes a main
bottle-neck and it takes about 50% of the main-thread time (Tested with
SET command). This is because the Valkey dictionary is a straightforward
but inefficient chained hash implementation. While traversing the hash
linked lists, every access to either a dictEntry structure, pointer to
key, or a value object requires, with high probability, an expensive
external memory access.
### Memory Access Amortization
Memory Access Amortization (MAA) is a technique designed to optimize the
performance of dynamic data structures by reducing the impact of memory
access latency. It is applicable when multiple operations need to be
executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.
Rather than executing operations sequentially, this approach interleaves
the execution of all operations. This is done in such a way that
whenever a memory access is required during an operation, the program
prefetches the necessary memory and transitions to another operation.
This ensures that when one operation is blocked awaiting memory access,
other memory accesses are executed in parallel, thereby reducing the
average access latency.
We applied this method in the development of `dictPrefetch`, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, `dictPrefetch` is an interleaved execution of dictFind for
all the keys.
**Implementation details**
When the main thread iterates over the `clients-pending-io-read`, for
clients with ready-to-execute commands (i.e., clients for which the IO
thread has parsed the commands), a batch of up to 16 commands is
created. Initially, the command's argv, which were allocated by the IO
thread, is prefetched to the main thread's L1 cache. Subsequently, all
the dict entries and values required for the commands are prefetched
from the dictionary before the command execution. Only then will the
commands be executed.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Add new optional, immutable string config called `unixsocketgroup`.
Change the group of the unix socket to `unixsocketgroup` after `bind()`
if specified.
Adds tests to validate the behavior.
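A hypothetical sketch (the socket path and group name are illustrative;
`unixsocket` is the existing directive for the socket path):
```
unixsocket /run/valkey/valkey.sock
unixsocketgroup valkey
```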
Fixes #873.
Signed-off-by: Ayush Sharma <mrayushs933@gmail.com>
The repl-backlog-size default of 1mb is too small in most cases; network
transmission and bandwidth performance have improved rapidly over more
than ten years.
The bigger the replication backlog, the longer the replica can endure a
disconnect and still be able to perform a partial resynchronization later.
Part of #653.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
I think it is a good idea to mention this.
The cluster config file is written relative to this directory, if the
'cluster-config-file' configuration directive is a relative path.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The metric tracks CPU time in microseconds, sharing the same value as
`INFO COMMANDSTATS`, aggregated under a per-slot context.
---------
Signed-off-by: Kyle Kim <kimkyle@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
In this PR we introduce the main benefit of dual channel replication by
continuously streaming the COB (client output buffer) in parallel to the
RDB, thus keeping the primary-side COB small AND accelerating the
overall sync process. By streaming the replication data to the replica
during the full sync, we reduce
1. Memory load on the primary node.
2. CPU load on the primary's main process. [Latest performance
tests](#data)
## Motivation
* Reduce primary memory load. We do that by moving the COB tracking to
the replica side. This also decreases the chance of COB overruns. Note
that the primary's input buffer limits on the replica side are less
restricted than the primary's COB, as the replica plays a less critical
part in the replication group. While increasing the primary's COB may
end up with the primary reaching swap and clients suffering, on the
replica side we're more at ease with it. A larger COB means a better
chance to sync successfully.
* Reduce primary main process CPU load. By opening a new, dedicated
connection for the RDB transfer, child processes can have direct access
to the new connection. Due to TLS connection restrictions, this was not
possible using one main connection. We eliminate the need for the child
process to use the primary's child-proc -> main-proc pipeline, thus
freeing up the main process to process clients queries.
## Dual Channel Replication high level interface design
- Dual channel replication begins when the replica sends a `REPLCONF
CAPA DUALCHANNEL` to the primary during initial
handshake. This is used to state that the replica is capable of dual
channel sync and that this is the replica's main channel, which is not
used for snapshot transfer.
- When replica lacks sufficient data for PSYNC, the primary will send
`-FULLSYNCNEEDED` response instead
of RDB data. As a next step, the replica creates a new connection
(rdb-channel) and configures it against
the primary with the appropriate capabilities and requirements. The
replica then requests a sync
using the RDB channel.
- Prior to forking, the primary sends the replica the snapshot's end
repl-offset, and attaches the replica
to the replication backlog to keep repl data until the replica requests
psync. The replica uses the main
channel to request a PSYNC starting at the snapshot end offset.
- The primary main thread sends incremental changes via the main
channel, while the bgsave process
sends the RDB directly to the replica via the rdb-channel. As for the
replica, the incremental
changes are stored in a local buffer, while the RDB is loaded into
memory.
- Once the replica completes loading the rdb, it drops the
rdb-connection and streams the accumulated incremental
changes into memory. Repl steady state continues normally.
## New replica state machine

## Data <a name="data"></a>



## Explanation
These graphs demonstrate performance improvements during full sync
sessions using rdb-channel + streaming rdb directly from the background
process to the replica.
First graph: with at most 50 clients and lightweight commands, we saw a
5%-7.5% improvement in write latency during the sync session.
Two graphs below: full sync was tested during heavy read commands on
the primary (such as sdiff, sunion on large sets). In that case, the
child process writes to the replica without sharing CPU with the loaded
main process. As a result, this not only improves client response time,
but may also shorten sync time by about 50%. The shorter sync time
results in less memory being used to store replication diffs (>60% in
some of the tested cases).
## Test setup
Both primary and replica in the performance tests ran on the same
machine. RDB size in all tests is 3.7gb. I generated write load using
valkey-benchmark ` ./valkey-benchmark -r 100000 -n 6000000 lpush my_list
__rand_int__`.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Allows cluster admins to configure the blacklist TTL as needed to allow
sufficient time for `CLUSTER FORGET` to be executed on every node in the
cluster.
Config name `cluster-blacklist-ttl`; unit seconds; default 60.
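For example, to give operators more time to run `CLUSTER FORGET` on every
node:
```
# keep forgotten nodes blacklisted for 5 minutes instead of the default 60 seconds
cluster-blacklist-ttl 300
```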
---------
Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
New configs:
* `cluster-announce-client-ipv4`
* `cluster-announce-client-ipv6`
New module API function:
* `ValkeyModule_GetClusterNodeInfoForClient`, takes a client id and is
otherwise just like its non-ForClient cousin.
If configured, one of these IP addresses is reported to each client in
CLUSTER SLOTS, CLUSTER SHARDS, CLUSTER NODES and redirects, replacing
the IP (`cluster-announce-ip` or the auto-detected IP) of each node.
Which one is reported to the client depends on whether the client is
connected over IPv4 or IPv6.
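A hedged sketch using documentation-reserved example addresses:
```
cluster-announce-client-ipv4 203.0.113.5
cluster-announce-client-ipv6 2001:db8::5
```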
Benefits:
* This allows clients using IPv4 to get the IPv4 addresses of all
cluster nodes and IPv6 clients to get the IPv6 addresses.
* This allows the IPs visible to clients to be different from the IPs used
between the cluster nodes due to NAT'ing.
The information is propagated in the cluster bus using new Ping
extensions. (Old nodes without this feature ignore unknown Ping
extensions.)
This adds another dimension to CLUSTER SLOTS reply. It now depends on
the client's use of TLS, the IP address family and RESP version.
Refactoring: The cached connection type definition is moved from
connection.h (it actually has nothing to do with the connection
abstraction) to server.h and is changed to a bitmap, with one bit for
each of TLS, IPv6 and RESP3.
Fixes #337
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This PR is 1 of 3 PRs intended to achieve the goal of 1 million requests
per second, as detailed by [dan touitou](https://github.com/touitou-dan)
in https://github.com/valkey-io/valkey/issues/22. This PR modifies the
IO threads to be fully asynchronous, which is a first and necessary step
to allow more work offloading and better utilization of the IO threads.
### Current IO threads state:
Valkey IO threads were introduced in Redis 6.0 to allow better
utilization of multi-core machines. Before this, Redis was
single-threaded and could only use one CPU core for network and command
processing. The introduction of IO threads helps in offloading the IO
operations to multiple threads.
**Current IO Threads flow:**
1. Initialization: When Redis starts, it initializes a specified number
of IO threads. These threads are in addition to the main thread, each
thread starts with an empty list, the main thread will populate that
list in each event-loop with pending-read-clients or
pending-write-clients.
2. Read Phase: The main thread accepts incoming connections and reads
requests from clients. The reading of requests are offloaded to IO
threads. The main thread puts the clients ready-to-read in a list and
set the global io_threads_op to IO_THREADS_OP_READ, the IO threads pick
the clients up, perform the read operation and parse the first incoming
command.
3. Command Processing: After reading the requests, command processing is
still single-threaded and handled by the main thread.
4. Write Phase: Similar to the read phase, the write phase is also
offloaded to IO threads. The main thread prepares the response in the
clients’ output buffer, then puts the client in the list
and sets the global io_threads_op to IO_THREADS_OP_WRITE. The IO
threads then pick the clients up and perform the write operation to send
the responses back to clients.
5. Synchronization: The main thread communicates with the threads about
how many jobs are left per thread using an atomic counter. The main
thread doesn’t access clients while they are being handled by the IO
threads.
**Issues with current implementation:**
* Underutilized Cores: The current implementation of IO-threads leads to
the underutilization of CPU cores.
* The main thread remains responsible for a significant portion of
IO-related tasks that could be offloaded to IO-threads.
* When the main-thread is processing client’s commands, the IO threads
are idle for a considerable amount of time.
* Notably, the main thread's performance during the IO-related tasks is
constrained by the speed of the slowest IO-thread.
* Limited Offloading: Currently, since the main thread waits
synchronously for the IO threads, the threads perform only read-parse
and write operations, with parsing done only for the first command. If
the threads can do work asynchronously, we may offload more work to the
threads, reducing the load on the main thread.
* TLS: Currently, we don't support IO threads with TLS (where offloading
IO would be more beneficial) since TLS read/write operations are not
thread-safe with the current implementation.
### Suggested change
Non-blocking main thread - The main thread and IO threads will operate
in parallel to maximize efficiency. The main thread will not be blocked
by IO operations. It will continue to process commands independently of
the IO thread's activities.
**Implementation details**
**Inter-thread communication.**
* We use a static, lock-free ring buffer of fixed size (2048 jobs) for
the main thread to send jobs and for the IO to receive them. If the ring
buffer fills up, the main thread will handle the task itself, acting as
back pressure (in case IO operations are more expensive than command
processing). A static ring buffer is a better candidate than a dynamic
job queue as it eliminates the need for allocation/freeing per job.
* An IO job will be in the format: ` [void* function-call-back | void
*data] `where data is either a client to read/write from and the
function-ptr is the function to be called with the data for example
readQueryFromClient using this format we can use it later to offload
other types of works to the IO threads.
* The ring buffer is one-way, from the main thread to the IO thread. Upon
a read/write event, the main thread will send a read/write job; then,
before sleep, it will iterate over the pending read/write clients,
checking for each client whether the IO threads have already finished
handling it. The IO thread signals it has finished handling a client
read/write by toggling an atomic flag read_state / write_state on the
client struct.
**Thread Safety**
As suggested in this solution, the IO threads are reading from and
writing to the clients' buffers while the main thread may access those
clients.
We must ensure no race conditions or unsafe access occurs while keeping
the Valkey code simple and lock free.
Minimal Action in the IO Threads
The main change is to limit the IO thread operations to the bare
minimum. The IO thread will access only the client's struct and only the
necessary fields in this struct.
The IO threads will be responsible for the following:
* Read Operation: The IO thread will only read and parse a single
command. It will not update the server stats, handle read errors, or
parsing errors. These tasks will be taken care of by the main thread.
* Write Operation: The IO thread will only write the available data. It
will not free the client's replies, handle write errors, or update the
server statistics.
To achieve this without code duplication, the read/write code has been
refactored into smaller, independent components:
* Functions that perform only the read/parse/write calls.
* Functions that handle the read/parse/write results.
This refactor accounts for the majority of the modifications in this PR.
**Client Struct Safe Access**
As we ensure that the IO threads access memory only within the client
struct, we need to ensure thread safety only for the client's struct's
shared fields.
* Query Buffer
* Command parsing - The main thread will not try to parse a command from
the query buffer when a client is offloaded to the IO thread.
* Client's memory checks in client-cron - The main thread will not
access the client query buffer if it is offloaded and will handle the
querybuf grow/shrink when the client is back.
* CLIENT LIST command - The main thread will busy-wait for the IO thread
to finish handling the client, falling back to the current behavior
where the main thread waits for the IO thread to finish their
processing.
* Output Buffer
* The IO thread will not change the client's bufpos and won't free the
client's reply lists. These actions will be done by the main thread on
the client's return from the IO thread.
* bufpos / block→used: As the main thread may change the bufpos, the
reply-block→used, or add/delete blocks to the reply list while the IO
thread writes, we add two fields to the client struct: io_last_bufpos
and io_last_reply_block. The IO thread will write until the
io_last_bufpos, which was set by the main-thread before sending the
client to the IO thread. If more data has been added to the cob in
between, it will be written in the next write-job. In addition, the main
thread will not trim or merge reply blocks while the client is
offloaded.
* Parsing Fields
* Client's cmd, argc, argv, reqtype, etc., are set during parsing.
* The main thread will indicate to the IO thread not to parse a cmd if
the client is not reset. In this case, the IO thread will only read from
the network and won't attempt to parse a new command.
* The main thread won't access the c→cmd/c→argv in the CLIENT LIST
command as stated before it will busy wait for the IO threads.
* Client Flags
* c→flags, which may be changed by the main thread in multiple places,
won't be accessed by the IO thread. Instead, the main thread will set
the c→io_flags with the information necessary for the IO thread to know
the client's state.
* Client Close
* On freeClient, the main thread will busy wait for the IO thread to
finish processing the client's read/write before proceeding to free the
client.
* Client's Memory Limits
* The IO thread won't handle the qb/cob limits. In case a client crosses
the qb limit, the IO thread will stop reading for it, letting the main
thread know that the client crossed the limit.
**TLS**
TLS is currently not supported with IO threads for the following
reasons:
1. Pending reads - If SSL has pending data that has already been read
from the socket, there is a risk of not calling the read handler again.
To handle this, a list is used to hold the pending clients. With IO
threads, multiple threads can access the list concurrently.
2. Event loop modification - Currently, the TLS code
registers/unregisters the file descriptor from the event loop depending
on the read/write results. With IO threads, multiple threads can modify
the event loop struct simultaneously.
3. The same client can be sent to 2 different threads concurrently
(https://github.com/redis/redis/issues/12540).
Those issues were handled in the current PR:
1. The IO thread only performs the read operation. The main thread will
check for pending reads after the client returns from the IO thread and
will be the only one to access the pending list.
2. The registering/unregistering of events will be similarly postponed
and handled by the main thread only.
3. Each client is being sent to the same dedicated thread (c→id %
num_of_threads).
**Sending Replies Immediately with IO threads.**
Currently, after processing a command, we add the client to the
pending_writes_list. Only after processing all the clients do we send
all the replies. Since the IO threads are now working asynchronously, we
can send the reply immediately after processing the client’s requests,
reducing the command latency. However, if we are using AOF=always, we
must wait for the AOF buffer to be written, in which case we revert to
the current behavior.
**IO threads dynamic adjustment**
Currently, we use an all-or-nothing approach when activating the IO
threads. The current logic is as follows: if the number of pending write
clients is greater than twice the number of threads (including the main
thread), we enable all threads; otherwise, we enable none. For example,
if 8 IO threads are defined, we enable all 8 threads if there are 16
pending clients; else, we enable none.
It makes more sense to enable partial activation of the IO threads. If
we have 10 pending clients, we will enable 5 threads, and so on. This
approach allows for a more granular and efficient allocation of
resources based on the current workload.
In addition, the user will now be able to change the number of I/O
threads at runtime. For example, when decreasing the number of threads
from 4 to 2, threads 3 and 4 will be closed after flushing their job
queues.
**Tests**
Currently, we run the io-threads tests with 4 IO threads
(443d80f168/.github/workflows/daily.yml (L353)).
This means that we will not activate the IO threads unless there are 8
(threads * 2) pending write clients in a single loop, which is unlikely
to happen in most tests, meaning the IO threads are not currently
being tested.
To enforce the main thread to always offload work to the IO threads,
regardless of the number of pending events, we add an
events-per-io-thread configuration with a default value of 2. When set
to 0, this configuration will force the main thread to always offload
work to the IO threads.
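A sketch of the CI-style configuration this enables (values follow the
description above):
```
io-threads 4
# 0 forces the main thread to always offload work to the IO threads
events-per-io-thread 0
```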
When we offload every single read/write operation to the IO threads, the
IO threads run at 100% CPU. When running multiple tests
concurrently, some tests fail as a result of larger-than-expected command
latencies. To address this issue, we had to add some after or wait_for
calls to some of the tests to ensure they pass with IO threads as well.
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
When Redis/Valkey/KeyDB is run in a cloud environment across multiple
AZs, it is preferable to keep traffic local to an AZ, both for cost
reasons and for latency. This is typically done when you are enabling
reads on replicas with the READONLY command.
For this change we are creating a setting that is echoed back in the
INFO command. We do not want to add the cloud SDKs as dependencies, and
this is the easiest way around that. It is fairly trivial to grab the AZ
from the cloud and push that into your settings file.
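A hedged sketch, assuming the setting is the `availability-zone` directive
mentioned elsewhere in this log (the zone name is illustrative):
```
availability-zone us-east-1a
```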
Currently at Snapchat we have a custom client that after connecting
reads this from the server and will preferentially use that server if
the AZ string matches its internally configured AZ.
In the future it would be ideal if we used this information when
performing failover or even exposed it in cluster nodes.
Signed-off-by: John Sully <john@csquare.ca>