After introducing dual channel replication in #60, we decided in #915
not to add a new configuration item to limit the replica's local replication
buffer, but to just use "client-output-buffer-limit replica hard" to limit it.
We need to document this behavior and mention that once the limit is reached,
all future data will accumulate on the primary side.
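For reference, this is the existing directive that now also bounds the
replica's local replication buffer (the values shown are the usual defaults
and are only illustrative):
```
# class hard-limit soft-limit soft-seconds
client-output-buffer-limit replica 256mb 64mb 60
```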
Signed-off-by: Binbin <binloveplay1314@qq.com>
Introduce a new config `hash-seed` which can be set only at startup and
controls the hash seed for the server. This includes all hash tables.
This change makes it so that both primaries and replicas will return the
same results for SCAN/HSCAN/ZSCAN/SSCAN cursors. This is useful in order
to make sure SCAN behaves correctly after a failover.
Resolves #4
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
- Moved `server-cpulist`, `bio-cpulist`, `aof-rewrite-cpulist`,
`bgsave-cpulist` configurations to ADVANCED CONFIG.
- Moved `ignore-warnings` configuration to ADVANCED CONFIG.
- Moved `availability-zone` configuration to GENERAL.
These configs were incorrectly placed at the end of the file in the
ACTIVE DEFRAGMENTATION section.
Fixes #2736
---------
Signed-off-by: ritoban23 <ankudutt101@gmail.com>
When we added atomic slot migration in #1949, we reused a lot of the rdb save
code. That was an easier way to implement ASM the first time around, but it
comes with some side effects. For example, we use CHILD_TYPE_RDB to do the
fork and rdb.c/rdb.h functions to save the snapshot; these mess up the logs
(we print logs saying we are doing RDB work) and the info fields (we report
rdb_bgsave_in_progress while we are actually doing slot migration).
In addition, it makes the code difficult to maintain. The rdb_save method uses
a lot of rdb_* variables, but we are actually doing slot migration. If we want
to support one fork with multiple target nodes, we need to rewrite this code
for a cleaner design.
Note that the changes to rdb.c/rdb.h revert previous changes from when
we were reusing this code for slot migration. The slot migration snapshot logic
is similar to the previous diskless replication. We use a pipe to transfer the
snapshot data from the child process to the parent process.
Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in CLUSTER GETSLOTMIGRATIONS command.
- Also add the slot migration fork to the cluster-class latency tracking.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Co-authored-by: Jacob Murphy <jkmurphy@google.com>
New config options:
* cluster-announce-client-port
* cluster-announce-client-tls-port
If enabled, clients will always get to see the configured port for a
node instead of the internally announced port(s), the same way that
`cluster-announce-client-ipv4` and `cluster-announce-client-ipv6` work.
Cluster-internal communication uses the non-client variant of these
options.
The configuration is propagated throughout the cluster using new ping
extensions.
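A hypothetical valkey.conf sketch (the port numbers are illustrative):
```
# port(s) reported to clients; cluster-internal communication keeps using
# the non-client announce ports
cluster-announce-client-port 7000
cluster-announce-client-tls-port 7443
```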
Closes #2377
---------
Signed-off-by: Marvin Rösch <marvinroesch99@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Instead of parsing only one command per client before executing it,
parse multiple commands from the query buffer and batch-prefetch the
keys accessed by the commands in the queue before executing them.
This is an optimization for pipelined commands, both with and without
I/O threads. The optimization is currently disabled for the replication
stream, due to failures (probably caused by how the replication offset
is calculated based on the query buffer offset).
* When parsing commands from the input buffer, multiple commands are
parsed and stored in a command queue per client.
* In single-threaded mode (I/O threads off) keys are batch-prefetched
before the commands in the queue are executed. Multi-key commands like
MGET, MSET and DEL benefit from this even if pipelining is not used.
* Prefetching when I/O threads are used does prefetching for multiple
clients in parallel. This code takes client command queues into account,
improving prefetching when pipelining is used. The batch size is
controlled by the existing config `prefetch-batch-max-size` (default
16), which so far only was used together with I/O threads. The config is
moved to a different section in `valkey.conf`.
* When I/O threads are used and the maximum number of keys has been
prefetched, a client's command is executed, then the next one in the
queue, etc. If there are more commands in the queue for which the keys
have not been prefetched (say the client sends 16 pipelined MGETs with 16
keys each), keys for the next few commands in the queue are prefetched
before each command is executed, if prefetching has not yet been done for
the next command. (This utilizes the code path used in single-threaded
mode.)
Code improvements:
* Decoupling of command parser state and command execution state:
* The variables reqtype, multibulklen and bulklen refer to the current
position in the query buffer. These are no longer reset in resetClient
(which runs after each command is executed). Instead, they are
reset in the parser code after each completely parsed command.
* The command parser code is partially decoupled from the client struct.
The query buffer is still one per client, but the resulting argument
vector is stored in caller-defined variables.
Fixes #2044
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Introduces a new family of commands for migrating slots via replication.
The procedure is driven by the source node which pushes an AOF formatted
snapshot of the slots to the target, followed by a replication stream of
changes on that slot (a la manual failover).
This solution is an adaptation of the solution provided by
@enjoy-binbin, combined with the solution I previously posted at #1591,
modified to meet the designs we had outlined in #23.
## New commands
* `CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE
node-id`: Begin sending the slot via replication to the target. Multiple
targets can be specified by repeating `SLOTSRANGE ... NODE ...`
* `CLUSTER CANCELMIGRATION ALL`: Cancel all slot migrations
* `CLUSTER GETSLOTMIGRATIONS`: See a recent log of migrations
This PR only implements "one shot" semantics with an asynchronous model.
Later, "two phase" (e.g. slot level replicate/failover commands) can be
added with the same core.
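A hypothetical usage sketch of the new commands (node id and slot ranges are
illustrative):
```
# start migrating slots 0-100 and 200-300 to the given target node
CLUSTER MIGRATESLOTS SLOTSRANGE 0 100 200 300 NODE 07c37dfeb235213a872192d90877d0cd55635b91
# inspect recent and active migration jobs
CLUSTER GETSLOTMIGRATIONS
# abort everything
CLUSTER CANCELMIGRATION ALL
```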
## Slot migration jobs
Introduces the concept of a slot migration job. While active, a job
tracks a connection created by the source to the target over which the
contents of the slots are sent. This connection is used for control
messages as well as replicated slot data. Each job is given a 40
character random name to help uniquely identify it.
All jobs, including those that finished recently, can be observed using
the `CLUSTER GETSLOTMIGRATIONS` command.
## Replication
* Since the snapshot uses AOF, the snapshot can be replayed verbatim to
any replicas of the target node.
* We use the same proxying mechanism used for chaining replication to
copy the content sent by the source node directly to the replica nodes.
## `CLUSTER SYNCSLOTS`
To coordinate the state machine transitions across the two nodes, a new
command is added, `CLUSTER SYNCSLOTS`, that performs this control flow.
Each end of the slot migration connection is expected to install a read
handler in order to handle `CLUSTER SYNCSLOTS` commands:
* `ESTABLISH`: Begins a slot migration. Provides slot migration
information to the target and authorizes the connection to write to
unowned slots.
* `SNAPSHOT-EOF`: appended to the end of the snapshot to signal that the
snapshot is done being written to the target.
* `PAUSE`: informs the source node to pause whenever it gets the
opportunity
* `PAUSED`: added to the end of the client output buffer when the pause
is performed. The pause is only performed after the buffer shrinks below
a configurable size
* `REQUEST-FAILOVER`: request the source to either grant or deny a
failover for the slot migration. The request is only granted if the target
is still paused. Once a failover is granted, the pause is refreshed for
a short duration
* `FAILOVER-GRANTED`: sent to the target to inform that REQUEST-FAILOVER
is granted
* `ACK`: heartbeat command used to ensure liveness
## Interactions with other commands
* FLUSHDB on the source node (which flushes the migrating slot) will
result in the source dropping the connection, which will flush the slot
on the target and reset the state machine back to the beginning. The
subsequent retry should very quickly succeed (it is now empty)
* FLUSHDB on the target will fail the slot migration. We can iterate
with better handling, but for now it is expected that the operator would
retry.
* Generally, FLUSHDB is expected to be executed cluster-wide, so
preserving partially migrated slots doesn't make much sense
* SCAN and KEYS are filtered to avoid exposing importing slot data
## Error handling
* For any transient connection drops, the migration will be failed and
require the user to retry.
* If there is an OOM while reading from the import connection, we will
fail the import, which will drop the importing slot data
* If there is a client output buffer limit reached on the source node,
it will drop the connection, which will cause the migration to fail
* If at any point the export loses ownership or either node is failed
over, a callback will be triggered on both ends of the migration to fail
the import. The import will not reattempt with a new owner
* The two ends of the migration are routinely pinging each other with
SYNCSLOTS ACK messages. If at any point there is no interaction on the
connection for longer than `repl-timeout`, the connection will be
dropped, resulting in migration failure
* If a failover happens, we will drop keys in all unowned slots. The
migration does not persist through failovers and would need to be
retried on the new source/target.
## State machine
```
Target/Importing Node State Machine
─────────────────────────────────────────────────────────────
┌────────────────────┐
│SLOT_IMPORT_WAIT_ACK┼──────┐
└──────────┬─────────┘ │
ACK│ │
┌──────────────▼─────────────┐ │
│SLOT_IMPORT_RECEIVE_SNAPSHOT┼──┤
└──────────────┬─────────────┘ │
SNAPSHOT-EOF│ │
┌───────────────▼──────────────┐ │
│SLOT_IMPORT_WAITING_FOR_PAUSED┼─┤
└───────────────┬──────────────┘ │
PAUSED│ │
┌───────────────▼──────────────┐ │ Error Conditions:
│SLOT_IMPORT_FAILOVER_REQUESTED┼─┤ 1. OOM
└───────────────┬──────────────┘ │ 2. Slot Ownership Change
FAILOVER-GRANTED│ │ 3. Demotion to replica
┌──────────────▼─────────────┐ │ 4. FLUSHDB
│SLOT_IMPORT_FAILOVER_GRANTED┼──┤ 5. Connection Lost
└──────────────┬─────────────┘ │ 6. No ACK from source (timeout)
Takeover Performed│ │
┌──────────────▼───────────┐ │
│SLOT_MIGRATION_JOB_SUCCESS┼────┤
└──────────────────────────┘ │
│
┌─────────────────────────────────────▼─┐
│SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP│
└────────────────────┬──────────────────┘
Unowned Slots Cleaned Up│
┌─────────────▼───────────┐
│SLOT_MIGRATION_JOB_FAILED│
└─────────────────────────┘
Source/Exporting Node State Machine
─────────────────────────────────────────────────────────────
┌──────────────────────┐
│SLOT_EXPORT_CONNECTING├─────────┐
└───────────┬──────────┘ │
Connected│ │
┌─────────────▼────────────┐ │
│SLOT_EXPORT_AUTHENTICATING┼───────┤
└─────────────┬────────────┘ │
Authenticated│ │
┌─────────────▼────────────┐ │
│SLOT_EXPORT_SEND_ESTABLISH┼───────┤
└─────────────┬────────────┘ │
ESTABLISH command written│ │
┌─────────────────────▼─────────────┐ │
│SLOT_EXPORT_READ_ESTABLISH_RESPONSE┼──────┤
└─────────────────────┬─────────────┘ │
Full response read (+OK)│ │
┌────────────────▼──────────────┐ │ Error Conditions:
│SLOT_EXPORT_WAITING_TO_SNAPSHOT┼─────┤ 1. User sends CANCELMIGRATION
└────────────────┬──────────────┘ │ 2. Slot ownership change
No other child process│ │ 3. Demotion to replica
┌────────────▼───────────┐ │ 4. FLUSHDB
│SLOT_EXPORT_SNAPSHOTTING┼────────┤ 5. Connection Lost
└────────────┬───────────┘ │ 6. AUTH failed
Snapshot done│ │ 7. ERR from ESTABLISH command
┌───────────▼─────────┐ │ 8. Unpaused before failover completed
│SLOT_EXPORT_STREAMING┼──────────┤ 9. Snapshot failed (e.g. Child OOM)
└───────────┬─────────┘ │ 10. No ack from target (timeout)
PAUSE│ │ 11. Client output buffer overrun
┌──────────────▼─────────────┐ │
│SLOT_EXPORT_WAITING_TO_PAUSE┼──────┤
└──────────────┬─────────────┘ │
Buffer drained│ │
┌──────────────▼────────────┐ │
│SLOT_EXPORT_FAILOVER_PAUSED┼───────┤
└──────────────┬────────────┘ │
Failover request granted│ │
┌───────────────▼────────────┐ │
│SLOT_EXPORT_FAILOVER_GRANTED┼───────┤
└───────────────┬────────────┘ │
New topology received│ │
┌──────────────▼───────────┐ │
│SLOT_MIGRATION_JOB_SUCCESS│ │
└──────────────────────────┘ │
│
┌─────────────────────────┐ │
│SLOT_MIGRATION_JOB_FAILED│◄────────┤
└─────────────────────────┘ │
│
┌────────────────────────────┐ │
│SLOT_MIGRATION_JOB_CANCELLED│◄──────┘
└────────────────────────────┘
```
Co-authored-by: Binbin <binloveplay1314@qq.com>
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
In #1091, a new config `auto-failover-on-shutdown` was added. This PR
changes the config to make it unified with other shutdown related
options. This feature has not yet been released, so it's not a breaking
change.
The auto-failover-on-shutdown config is replaced by
* A new "failover" option to the existing configs `shutdown-on-sigterm`
and `shutdown-on-sigint`.
* A new FAILOVER option to the SHUTDOWN command.
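A hedged sketch of what this looks like in valkey.conf:
```
# also trigger a failover to a replica as part of a graceful shutdown
shutdown-on-sigterm failover
shutdown-on-sigint failover
```
The same behavior can be requested ad hoc with `SHUTDOWN FAILOVER`.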
Additionally, a history entry is added to the SHUTDOWN command which was
missing in #2195.
Follow-up of #1091.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This PR implements support for automatic client authentication based on
a field in the client's TLS certificate.
API Changes:
* New configuration directive `tls-auth-clients-user`, values `CN` |
`off`, default `off`. CN means take username from the CommonName field
in the client's certificate.
* New INFO field `acl_access_denied_tls_cert` under the `Stats` section,
indicating the number of failed authentications using this feature, i.e.
client certificates for which no matching username was found.
* New reason "tls-cert" in the ACL log, logged when a client
certificate's CommonName fails to match any existing username.
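For illustration, enabling the feature in valkey.conf would look roughly like
this:
```
# take the username from the CommonName field of the client certificate
tls-auth-clients-user CN
```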
Closes #1866
---------
Signed-off-by: Omkar Mestry <om.m.mestry@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Omkar Mestry <omanges@google.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Add a SAFE option to SHUTDOWN. If SAFE is passed, SHUTDOWN will refuse
to shut down when it is not safe to do so, for example if the node is a
voting primary. This avoids the situation where a replica suddenly
becomes the primary when we shut down the replica, or where we shut down
a primary node by mistake and then cause the cluster to go down.
Add SAFE option to the SHUTDOWN command.
Add safe option to shutdown-on-sigint and shutdown-on-sigterm.
Note that SAFE cannot prevent FORCE; in the case of FORCE, SAFE
will print the relevant logs and the forced shutdown proceeds. We allow
this combination.
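A hedged sketch of the resulting usage:
```
# refuse unsafe shutdowns triggered by signals
shutdown-on-sigterm safe
shutdown-on-sigint safe
```
At the command level, `SHUTDOWN SAFE` behaves the same way, while combining it
with FORCE still forces the shutdown as described above.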
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Earlier we described the `slowlog-log-slower-than` configuration
option like this, explicitly mentioning the special meaning of
negative numbers (actually -1) and 0:
```
The following time is expressed in microseconds, so 1000000 is equivalent
to one second. Note that a negative number disables the slow log, while
a value of zero forces the logging of every command.
slowlog-log-slower-than 10000
```
After #1294 we lost this text. We need to mention these values again,
and we can mention the special values for all of the command log configs;
it seems we should also mention `slowlog-max-len`.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## cluster: add multi-database support in cluster mode
Add multi-database support in cluster mode to align with standalone mode
and facilitate migration. Previously, cluster mode was restricted to a
single database (DB0). This change allows multiple databases while
preserving the existing slot-based key distribution.
### Key Features:
- Database-Agnostic Hashing. The hashing algorithm is unchanged.
Identical keys always map to the same slot across all databases,
ensuring consistent key distribution and compatibility with
existing single-database setups.
- Multi-DB commands support. SELECT, MOVE, and COPY are now supported in
cluster mode.
- Fully backward compatible with no API changes.
- SWAPDB is not supported in cluster mode. It is unsafe due to
inconsistency risks.
### Command-Level Changes:
- SELECT / MOVE / COPY are now supported in cluster mode.
- MOVE / COPY (with db) are rejected (TRYAGAIN error) during slot
migration to prevent multi-DB inconsistencies.
- SWAPDB will return an error if used when cluster mode is enabled.
- GETKEYSINSLOT, COUNTKEYSINSLOT and MIGRATE will operate in the context
of the selected database.
This means, for example, that migrating keys in a slot will require
iterating over all databases and repeating the operation for each.
### Slot Migration Process:
- Multi-DB support in cluster mode affects slot migration. Operators
should now iterate over all the configured databases.
### Transaction Handling (MULTI/EXEC):
- getNodeByQuery key lookup behavior changed:
- No key lookups when queuing commands in MULTI, only cross-slot
validation.
- Key lookups happen at EXEC time.
- SELECT inside MULTI/EXEC is now checked, ensuring key validation
uses the selected DB at lookup.
### Valkey-cli:
- valkey-cli has been updated to support resharding across all
databases.
### Configuration:
- Introduce new configuration `cluster-databases`.
The new configuration controls the maximal number of databases in
cluster mode.
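For illustration (the value here is hypothetical; only the directive name
comes from this change):
```
# allow up to 4 logical databases in cluster mode
cluster-databases 4
```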
Implements https://github.com/valkey-io/valkey/issues/1319
---------
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Allow replicas to use MPTCP in the outgoing replication connection.
A new yes/no config is introduced `repl-mptcp`, default `no`.
For MPTCP to be used in replication, the primary needs to be configured
with `mptcp yes` and the replica with `repl-mptcp yes`. Otherwise, the
connection falls back to regular TCP.
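Putting the two directives together, as described above:
```
# on the primary (listening side)
mptcp yes
# on the replica (outgoing replication connection)
repl-mptcp yes
```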
Follow-up of #1811.
---------
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Multipath TCP (MPTCP) is an extension of the standard TCP protocol that
allows a single transport connection to use multiple network interfaces
or paths. MPTCP is useful for applications like bandwidth aggregation,
failover, and more resilient connections.
The Linux kernel has supported MPTCP since v5.6, so it's time for us to
support it.
The test report shows that MPTCP reduces latency by ~25% in a 1%
networking packet drop environment.
Thanks to Matthieu Baerts <matttbe@kernel.org> for lots of review
suggestions.
Proposed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Cc Linux kernel MPTCP maintainer @matttbe
When a primary disappears, its slots are not served until an automatic
failover happens. That takes about n seconds (node timeout plus some
seconds), which is too long to not accept writes.
If the host machine is about to shut down for any reason, the processes
typically get a SIGTERM and have some time to shut down gracefully. In
Kubernetes, this is 30 seconds by default.
When a primary receives a SIGTERM or a SHUTDOWN, let it trigger a
failover to one of the replicas as part of the graceful shutdown. This
can reduce some unavailability time. For example, the replica normally
needs to sense the primary failure within the node-timeout before
initiating an election; now it can initiate an election quickly, win it,
and gossip the result.
The primary does this by sending a CLUSTER FAILOVER command to the
replica.
We added a REPLICAID arg to CLUSTER FAILOVER. After receiving the
command, the replica checks whether the node-id is its own; if not, the
command is ignored. The node-id is set by the replica through client
setname during the replication handshake.
### New argument for CLUSTER FAILOVER
The format now becomes CLUSTER FAILOVER [FORCE|TAKEOVER] [REPLICAID
node-id]. This arg is not intended for user use, so it will not be added
to the JSON file.
### Replica sends REPLCONF SET-CLUSTER-NODE-ID to inform its node-id
During the replication handshake, the replica now uses REPLCONF
SET-CLUSTER-NODE-ID to inform the primary of its node-id.
### Primary issues CLUSTER FAILOVER
The primary sends CLUSTER FAILOVER FORCE REPLICAID node-id to all
replicas, because it is a shared replication buffer, but only the
replica with the matching id will execute it.
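A sketch of the propagated command (the node-id is purely illustrative):
```
CLUSTER FAILOVER FORCE REPLICAID 07c37dfeb235213a872192d90877d0cd55635b91
```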
### Add a new auto-failover-on-shutdown config
People can disable this feature if they don't like it; the default is 0.
This closes #939.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Allows cluster admins to configure the cluster manual failover timeout
as needed. Admins can configure how long a primary would be paused in the
worst-case scenario, such as a failover that timed out due to insufficient
votes.
The configuration name is cluster-manual-failover-timeout, the unit is
milliseconds, and the range is [1, INT_MAX] ms.
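For illustration (the value is just an example within the documented range):
```
# pause the primary for at most 5 seconds during a manual failover
cluster-manual-failover-timeout 5000
```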
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
The search for expired keys happens without user action. Therefore, the
word "interactively" in the description of the active-expire-key
parameter is confusing and is changed to "incrementally."
modified: valkey.conf
---------
Signed-off-by: Anastasia Alexadrova <anastasia.alexandrova@percona.com>
We added it to the HELLO response in #1487.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The `serverCron()` function contains a variety of maintenance functions
and is set up as a timer job, configured to run at a certain rate (hz).
The default rate is 10hz (every 100ms).
One of the things that `serverCron()` does is to perform maintenance
functions on connected clients. Since the number of clients is variable,
and can be very large, this could cause latency spikes when the 100ms
`serverCron()` task gets invoked. To combat those latency spikes, a
feature called "dynamic-hz" was introduced. This feature will run
`serverCron()` more often, if there are more clients. The clients get
processed up to 200 at a time. The delay for `serverCron()` is shortened
with the goal of processing all of the clients every second.
The result of this is that some of the other (non-client) maintenance
functions also get (unnecessarily) run more often. Like
`cronUpdateMemoryStats()` and `databasesCron()`. Logically, it doesn't
make sense to run these functions more often, just because we happen to
have more clients attached.
This PR separates client activities onto a separate, variable, timer.
The "dynamic-hz" feature is eliminated. Now, `serverCron` will run at a
standard configured rate. The separate clients cron will automatically
adjust based on the number of clients. This has the added benefit that
often, the 2 crons will fire during separate event loop invocations and
will usually avoid the combined latency impact of doing both maintenance
activities together.
The new timer follows the same rules which were established with the
dynamic HZ feature.
* The goal is to process all of the clients once per second
* We never want to process more than 200 clients in a single invocation
(`MAX_CLIENTS_PER_CLOCK_TICK`)
* We always process at least 5 clients at a time
(`CLIENTS_CRON_MIN_ITERATIONS`)
* The minimum rate is determined by HZ
The delay (ms) for the new timer is also more precise, computing the
number of milliseconds needed to achieve the goal of reaching all of the
clients every second. The old dynamic-hz feature just performs a
doubling of the HZ until the clients processing rate is achieved (i.e.
delays of 100ms, 50ms, 25ms, 12ms...)
---------
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
New config `rdb-version-check` with values:
* `strict`: Reject future RDB versions.
* `relaxed`: Try parsing future RDB versions and fail only when an
unknown RDB opcode or type is encountered.
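For example, to opt into the best-effort behavior described below:
```
rdb-version-check relaxed
```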
This can make it possible for Valkey 8.1 to try to read a dump from, for
example, Valkey 9.0 or later on a best-effort basis. The conditions for
when this is expected to work can be defined when the future Valkey
versions are released. Loading is expected to fail in the following
cases:
* If the data set contains any new key types or other data elements not
supported by the current version.
* If the RDB contains new representations or encodings of existing key
types or other data elements.
This change also prepares for the next RDB version bump. A range of RDB
versions (12-79) is reserved, since it's expected to be used by foreign
software RDB versions, so Valkey will not accept versions in this range
even with the `relaxed` version check. The DUMP/RESTORE format has no
magic string; only the RDB version number.
This change also prepares for the magic string to change from REDIS to
VALKEY next time we bump the RDB version.
Related to #1108.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
As discussed in PR #336.
We have different types of resources like CPU, memory, network, etc. The
`slowlog` can only record commands that eat lots of CPU during the
processing phase (it doesn't include read/write network time), but it
cannot record commands that consume too much memory or network. For
example:
1. Running a "SET key value(10 megabytes)" command would not be recorded
in the slowlog, since when processing it the SET command only inserts the
value's pointer into the db dict. But that command eats huge memory in
the query buffer and bandwidth from the network. In this case, just 1000
tps can cause 10GB/s of network flow.
2. Running a "GET key" command where the key's value is 10 megabytes long
can eat huge memory in the output buffer and bandwidth to the network.
This PR introduces a new command `COMMANDLOG`, to log commands that
consume significant network bandwidth, including both input and output.
Users can retrieve the results using `COMMANDLOG get <count>
large-request` and `COMMANDLOG get <count> large-reply`. All subcommands
for `COMMANDLOG` are:
* `COMMANDLOG HELP`
* `COMMANDLOG GET <count> <slow|large-request|large-reply>`
* `COMMANDLOG LEN <slow|large-request|large-reply>`
* `COMMANDLOG RESET <slow|large-request|large-reply>`
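For example, to fetch the ten most recent entries of each kind:
```
COMMANDLOG GET 10 large-request
COMMANDLOG GET 10 large-reply
COMMANDLOG GET 10 slow
```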
And the slowlog is also incorporated into the commandlog.
For each of these three types, additional configs have been added for
control:
* `commandlog-request-larger-than` and
`commandlog-large-request-max-len` represent the threshold for large
requests (the unit is bytes) and the maximum number of commands that can
be recorded.
* `commandlog-reply-larger-than` and `commandlog-large-reply-max-len`
represent the threshold for large replies (the unit is bytes) and the
maximum number of commands that can be recorded.
* `commandlog-execution-slower-than` and
`commandlog-slow-execution-max-len` represent the threshold for slow
executions (the unit is microseconds) and the maximum number of commands
that can be recorded.
* Additionally, `slowlog-log-slower-than` and `slowlog-max-len` are now
set as aliases for the two slow-execution configs.
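A hypothetical valkey.conf sketch using these directives (all values are
illustrative):
```
commandlog-request-larger-than 16384
commandlog-large-request-max-len 128
commandlog-reply-larger-than 16384
commandlog-large-reply-max-len 128
commandlog-execution-slower-than 10000
commandlog-slow-execution-max-len 128
```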
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
We've had security issues in the past with the dir config, which is why
we marked it as PROTECTED. But modifying it at runtime
is also a dangerous action. For example, when child processes
are running, persistent temp files and log files may have
unexpected effects.
A scenario for modifying dir at runtime is migrating away from a disk
failure, such as using disk-based replication to migrate a node, or
writing nodes.conf to save the cluster configuration.
We decided to leave it as is and add a note in the conf
about the dangers of modifying dir at runtime.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Refer to: https://github.com/valkey-io/valkey/issues/1141
This update refactors the defrag code to:
* Make the overall code more readable and maintainable
* Reduce latencies incurred during defrag processing
With this update, the defrag cycle time is reduced to 500us, with more
frequent cycles. This results in much more predictable latencies, with a
dramatic reduction in tail latencies.
(See https://github.com/valkey-io/valkey/issues/1141 for more complete
details.)
This update is focused mostly on the high-level processing, and does NOT
address lower level functions which aren't currently timebound (e.g.
`activeDefragSdsDict()`, and `moduleDefragGlobals()`). These are out of
scope for this update and left for a future update.
I fixed `kvstoreDictLUTDefrag` because it was using up to 7ms on a CME
single shard. See original github issue for performance details.
---------
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
There are several patches in this PR:
* Abstract the set/rewrite logic for the bind config option: `bind` is a
special config, and `socket` and `tls` use the same one. However, RDMA
uses a similar style but a different option. Add a bit of abstraction to
make it flexible for both `socket` and `RDMA`. (Even for QUIC in the
future.)
* Introduce closeListener for connection types: closing a socket is a
simple syscall, but RDMA has more complex logic. Introduce a
connection-type-specific close listener method.
* RDMA: Use valkey.conf style instead of module parameters: use the
`--rdma-bind` and `--rdma-port` style instead of module parameters (see
the snippet after this list). The module-style configs `rdma.bind` and
`rdma.port` are removed.
* RDMA: Support builtin builds: support `make BUILD_RDMA=yes`. Module
style is still kept for now.
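A hypothetical valkey.conf sketch of the new style (address and port are
illustrative):
```
rdma-bind 192.168.122.100
rdma-port 6379
```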
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
New config: `import-mode (yes|no)`
New command: `CLIENT IMPORT-SOURCE (ON|OFF)`
The config, when set to `yes`, disables eviction and deletion of expired
keys, except for commands coming from a client that has marked itself as
an import source (the data source when importing data from another node)
using the CLIENT IMPORT-SOURCE command.
When we sync data from the source Valkey to the destination Valkey using
some sync tools like
[redis-shake](https://github.com/tair-opensource/RedisShake), the
destination Valkey can perform expiration and eviction, which may cause
data corruption. This problem has been discussed in
https://github.com/redis/redis/discussions/9760#discussioncomment-1681041
and Redis already has a solution, but in Valkey we hadn't fixed it until
now.
E.g. we call `set key 1 ex 1` on the source server and transfer this
command to the destination server. Then we call `incr key` on the source
server before the key expires, so we have a key on the source server
with a value of 2. But when the command arrives at the destination
server, the key may have expired and been deleted. So we end up with a
key on the destination server with a value of 1, which is inconsistent
with the source server.
In standalone mode, we can use a writable replica to simplify the sync
process. However, in cluster mode, we still need a sync tool to help us
transfer the source data to the destination. The sync tool usually works
as a normal client, and the destination works as a primary which keeps
expiration and eviction enabled.
In this PR, we add a new mode named 'import-mode'. In this mode, the
server stops expiration and eviction just like a replica. Notice that
this mode exists only during the sync state, to avoid data inconsistency
caused by expiration and eviction. Import mode only takes effect on the
primary.
Sync tools can mark their clients as an import source with `CLIENT
IMPORT-SOURCE`, which works like a client from the primary and can visit
expired keys in `lookupKey`.
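Putting the two pieces together, a hedged sketch of the intended flow:
```
# valkey.conf on the destination primary, enabled only for the duration of the sync
import-mode yes
```
The sync tool then issues `CLIENT IMPORT-SOURCE ON` on its own connection
before replaying data, and `CLIENT IMPORT-SOURCE OFF` once the migration is
done.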
**Notice: during the migration, other clients, apart from the import
source, should not access the data imported by import source.**
---------
Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Add config options for log format and timestamp format introduced by
#1022
Related to #1225
This change adds two new configs into valkey.conf:
log-format
log-timestamp-format
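A hypothetical sketch of how they might be set (the values are illustrative
and not taken from this change):
```
log-format logfmt
log-timestamp-format iso8601
```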
---------
Signed-off-by: azuredream <zhaozixuan67@gmail.com>
Currently, if the replica has a lot of data, CLUSTER RESET
will block for a while and show up in the slowlog, and it seems
that there is no harm in making it async, so external components
have an easier time monitoring it.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Currently in conf we describe activerehashing as: Active rehashing
uses 1 millisecond every 100 milliseconds of CPU time. This is the
case for hz = 10.
If we change hz, the description in conf will be inaccurate. Users
may notice that the server spends some CPU (used in activerehashing)
at high hz but don't know why, since our cron calls are fixed to 1ms.
This PR takes hz into account and fixes the CPU usage at 1% (this may
not be accurate in some cases because we do 100-step rehashing in
dictRehashMicroseconds, but it can avoid CPU spikes in this case).
This PR also improves the description of the activerehashing
configuration item to explain this change.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
this fixes: https://github.com/valkey-io/valkey/issues/1116
_Issue details from #1116 by @zuiderkwast_
> This config is undocumented since #758. The default was changed to
"yes" and it is quite useless to set it to "no". Yet, it can happen that
some user has an old config file where it is explicitly set to "no". The
result will be bad performace, since I/O threads will not do all the
I/O.
>
> It's indeed confusing.
>
> 1. Either remove the whole option from the code. And thus no need for
documentation. _OR:_
> 2. Introduce the option back in the configuration, just as a comment
is fine. And showing the default value "yes": `# io-threads-do-reads
yes` with additional text.
>
> _Originally posted by @melroy89 in [#1019 (reply in
thread)](https://github.com/orgs/valkey-io/discussions/1019#discussioncomment-10824778)_
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
A new option for diskless replication on the replica side.
After a network failure, the replica may need to perform a full sync.
The other option for diskless full sync is `swapdb`, but it uses twice
as much memory, temporarily. In situations where this is not acceptable,
and where losing data is acceptable, the `flush-before-load` option can
be useful. If the full sync fails, the old data is lost, though. Therefore,
the new option is marked as "dangerous".
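Assuming this is exposed as a value of the existing `repl-diskless-load`
directive (an assumption; only the option name itself comes from this change),
the configuration might look like:
```
# DANGEROUS: flush the old data set before loading the incoming RDB;
# the old data is lost if the full sync then fails
repl-diskless-load flush-before-load
```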
---------
Signed-off-by: kronwerk <ca11e5e22g@gmail.com>
Signed-off-by: kronwerk <kronwerk@users.noreply.github.com>
Co-authored-by: kronwerk <ca11e5e22g@gmail.com>
Before this doc update, the comments in valkey.conf said that DEL is a
blocking command, and even referred to other synchronous freeing as "in a
blocking way, like if DEL was called". This has now become confusing and
incorrect, since DEL is now non-blocking by default.
The comments also mentioned too much about the "old default" and only
later explain that the "new default" is non-blocking.
This doc update focuses on the current default and expresses it like
"Starting from Valkey 8.0, lazy freeing is enabled by default", rather
than using words like old and new.
This is a follow-up to #913.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Implement data masking for user data in server logs and diagnostic output. This change prevents potential exposure of confidential information, such as PII, and enhances privacy protection. It masks all command arguments, client names, and client usernames.
Added a new hide-user-data-from-log configuration item, default yes.
---------
Signed-off-by: Amit Nagler <anagler123@gmail.com>
## Set replica-lazy-flush and lazyfree-lazy-user-flush to yes by
default.
There are many problems with running flush synchronously. Even in
single-CPU environments, the server should balance between
freeing memory and serving incoming requests.
## Set lazy eviction, expire, server-del, user-del to yes by default
We now have a del and a lazyfree del, and we also have these configuration
items to control them: lazyfree-lazy-eviction, lazyfree-lazy-expire,
lazyfree-lazy-server-del, lazyfree-lazy-user-del. In most cases lazyfree
is better since it reduces the risk of blocking the main thread, and
because we have lazyfreeGetFreeEffort, objects with a high effort
(currently 64) will use lazyfree.
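The resulting defaults, spelled out as valkey.conf directives:
```
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
lazyfree-lazy-user-del yes
lazyfree-lazy-user-flush yes
replica-lazy-flush yes
```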
Part of #653.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Currently, the `dual-channel-replication` feature flag is immutable if
`enable-protected-configs` is enabled, which is the default behavior.
This PR proposes to make the `dual-channel-replication` flag mutable,
allowing it to be changed dynamically without restarting the cluster.
**Motivation:**
The ability to change the `dual-channel-replication` flag dynamically is
essential for testing and validating the feature on real clusters
running in production environments. By making the flag mutable, we can
enable or disable the feature without disrupting the cluster's
operations, facilitating easier testing and experimentation.
Additionally, this change would provide more flexibility for users to
enable or disable the feature based on their specific requirements or
operational needs without requiring a cluster restart.
---------
Signed-off-by: naglera <anagler123@gmail.com>
This PR utilizes the IO threads to execute commands in batches, allowing
us to prefetch the dictionary data in advance.
After making the IO threads asynchronous and offloading more work to
them in the first 2 PRs, the `lookupKey` function becomes a main
bottle-neck and it takes about 50% of the main-thread time (Tested with
SET command). This is because the Valkey dictionary is a straightforward
but inefficient chained hash implementation. While traversing the hash
linked lists, every access to either a dictEntry structure, pointer to
key, or a value object requires, with high probability, an expensive
external memory access.
### Memory Access Amortization
Memory Access Amortization (MAA) is a technique designed to optimize the
performance of dynamic data structures by reducing the impact of memory
access latency. It is applicable when multiple operations need to be
executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.
Rather than executing operations sequentially, this approach interleaves
the execution of all operations. This is done in such a way that
whenever a memory access is required during an operation, the program
prefetches the necessary memory and transitions to another operation.
This ensures that when one operation is blocked awaiting memory access,
other memory accesses are executed in parallel, thereby reducing the
average access latency.
We applied this method in the development of `dictPrefetch`, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, `dictPrefetch` is an interleaved execution of dictFind for
all the keys.
**Implementation details**
When the main thread iterates over the `clients-pending-io-read`, for
clients with ready-to-execute commands (i.e., clients for which the IO
thread has parsed the commands), a batch of up to 16 commands is
created. Initially, the command's argv, which were allocated by the IO
thread, is prefetched to the main thread's L1 cache. Subsequently, all
the dict entries and values required for the commands are prefetched
from the dictionary before the command execution. Only then will the
commands be executed.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Add new optional, immutable string config called `unixsocketgroup`.
Change the group of the unix socket to `unixsocketgroup` after `bind()`
if specified.
Adds tests to validate the behavior.
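A hypothetical sketch (the socket path and group name are illustrative;
`unixsocket` is the existing directive for the socket path):
```
unixsocket /run/valkey/valkey.sock
unixsocketgroup valkey
```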
Fixes #873.
Signed-off-by: Ayush Sharma <mrayushs933@gmail.com>
The repl-backlog-size default of 1mb is too small in most cases; network
transmission and bandwidth performance have improved rapidly over more
than ten years.
The bigger the replication backlog, the longer the replica can endure a
disconnect and still be able to perform a partial resynchronization later.
Part of #653.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
I think it is a good idea to mention this.
The cluster config file is written relative to this directory, if the
'cluster-config-file' configuration directive is a relative path.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The metric tracks CPU time in microseconds, sharing the same value as
`INFO COMMANDSTATS`, aggregated under a per-slot context.
---------
Signed-off-by: Kyle Kim <kimkyle@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
In this PR we introduce the main benefit of dual channel replication by
continuously streaming the COB (client output buffer) in parallel to the
RDB, thus keeping the primary-side COB small AND accelerating the
overall sync process. By streaming the replication data to the replica
during the full sync, we reduce
1. Memory load on the primary node.
2. CPU load on the primary's main process. [Latest performance
tests](#data)
## Motivation
* Reduce primary memory load. We do that by moving the COB tracking to
the replica side. This also decreases the chance of COB overruns. Note
that the primary's input buffer limits on the replica side are less
restricted than the primary's COB, as the replica plays a less critical
part in the replication group. While increasing the primary's COB may
end up with the primary reaching swap and clients suffering, on the
replica side we're more at ease with it. A larger COB means a better
chance to sync successfully.
* Reduce primary main process CPU load. By opening a new, dedicated
connection for the RDB transfer, child processes can have direct access
to the new connection. Due to TLS connection restrictions, this was not
possible using one main connection. We eliminate the need for the child
process to use the primary's child-proc -> main-proc pipeline, thus
freeing up the main process to process clients queries.
## Dual Channel Replication high level interface design
- Dual channel replication begins when the replica sends a `REPLCONF
CAPA DUALCHANNEL` to the primary during initial
handshake. This is used to state that the replica is capable of dual
channel sync and that this is the replica's main channel, which is not
used for snapshot transfer.
- When replica lacks sufficient data for PSYNC, the primary will send
`-FULLSYNCNEEDED` response instead
of RDB data. As a next step, the replica creates a new connection
(rdb-channel) and configures it against
the primary with the appropriate capabilities and requirements. The
replica then requests a sync
using the RDB channel.
- Prior to forking, the primary sends the replica the snapshot's end
repl-offset, and attaches the replica
to the replication backlog to keep repl data until the replica requests
psync. The replica uses the main
channel to request a PSYNC starting at the snapshot end offset.
- The primary main thread sends incremental changes via the main
channel, while the bgsave process
sends the RDB directly to the replica via the rdb-channel. As for the
replica, the incremental
changes are stored in a local buffer, while the RDB is loaded into
memory.
- Once the replica completes loading the rdb, it drops the
rdb-connection and streams the accumulated incremental
changes into memory. Repl steady state continues normally.
## New replica state machine

## Data <a name="data"></a>



## Explanation
These graphs demonstrate performance improvements during full sync
sessions using rdb-channel + streaming rdb directly from the background
process to the replica.
First graph: with at most 50 clients and lightweight commands, we saw a
5%-7.5% improvement in write latency during the sync session.
Two graphs below: full sync was tested during heavy read commands on
the primary (such as sdiff, sunion on large sets). In that case, the
child process writes to the replica without sharing CPU with the loaded
main process. As a result, this not only improves client response time,
but may also shorten sync time by about 50%. The shorter sync time
results in less memory being used to store replication diffs (>60% in
some of the tested cases).
## Test setup
Both primary and replica in the performance tests ran on the same
machine. RDB size in all tests is 3.7gb. I generated write load using
valkey-benchmark ` ./valkey-benchmark -r 100000 -n 6000000 lpush my_list
__rand_int__`.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Allows cluster admins to configure the blacklist TTL as needed to allow
sufficient time for `CLUSTER FORGET` to be executed on every node in the
cluster.
Config name `cluster-blacklist-ttl`; unit seconds; default 60.
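For example, to give operators more time to run `CLUSTER FORGET` on every
node:
```
# keep forgotten nodes blacklisted for 5 minutes instead of the default 60 seconds
cluster-blacklist-ttl 300
```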
---------
Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
New configs:
* `cluster-announce-client-ipv4`
* `cluster-announce-client-ipv6`
New module API function:
* `ValkeyModule_GetClusterNodeInfoForClient`, takes a client id and is
otherwise just like its non-ForClient cousin.
If configured, one of these IP addresses is reported to each client in
CLUSTER SLOTS, CLUSTER SHARDS, CLUSTER NODES and redirects, replacing
the IP (`cluster-announce-ip` or the auto-detected IP) of each node.
Which one is reported to the client depends on whether the client is
connected over IPv4 or IPv6.
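A hedged sketch using documentation-reserved example addresses:
```
cluster-announce-client-ipv4 203.0.113.5
cluster-announce-client-ipv6 2001:db8::5
```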
Benefits:
* This allows clients using IPv4 to get the IPv4 addresses of all
cluster nodes and IPv6 clients to get the IPv6 addresses.
* This allows the IPs visible to clients to be different from the IPs used
between the cluster nodes due to NAT'ing.
The information is propagated in the cluster bus using new Ping
extensions. (Old nodes without this feature ignore unknown Ping
extensions.)
This adds another dimension to CLUSTER SLOTS reply. It now depends on
the client's use of TLS, the IP address family and RESP version.
Refactoring: The cached connection type definition is moved from
connection.h (it actually has nothing to do with the connection
abstraction) to server.h and is changed to a bitmap, with one bit for
each of TLS, IPv6 and RESP3.
Fixes #337
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This PR is 1 of 3 PRs intended to achieve the goal of 1 million requests
per second, as detailed by [dan touitou](https://github.com/touitou-dan)
in https://github.com/valkey-io/valkey/issues/22. This PR modifies the
IO threads to be fully asynchronous, which is a first and necessary step
to allow more work offloading and better utilization of the IO threads.
### Current IO threads state:
Valkey IO threads were introduced in Redis 6.0 to allow better
utilization of multi-core machines. Before this, Redis was
single-threaded and could only use one CPU core for network and command
processing. The introduction of IO threads helps in offloading the IO
operations to multiple threads.
**Current IO Threads flow:**
1. Initialization: When Redis starts, it initializes a specified number
of IO threads. These threads are in addition to the main thread, each
thread starts with an empty list, the main thread will populate that
list in each event-loop with pending-read-clients or
pending-write-clients.
2. Read Phase: The main thread accepts incoming connections and reads
requests from clients. The reading of requests are offloaded to IO
threads. The main thread puts the clients ready-to-read in a list and
set the global io_threads_op to IO_THREADS_OP_READ, the IO threads pick
the clients up, perform the read operation and parse the first incoming
command.
3. Command Processing: After reading the requests, command processing is
still single-threaded and handled by the main thread.
4. Write Phase: Similar to the read phase, the write phase is also
offloaded to IO threads. The main thread prepares the response in the
clients’ output buffer, then puts the client in the list
and sets the global io_threads_op to IO_THREADS_OP_WRITE. The IO
threads then pick the clients up and perform the write operation to send
the responses back to clients.
5. Synchronization: The main thread communicates with the threads about
how many jobs are left per thread using an atomic counter. The main
thread doesn’t access clients while they are being handled by the IO
threads.
**Issues with current implementation:**
* Underutilized Cores: The current implementation of IO-threads leads to
the underutilization of CPU cores.
* The main thread remains responsible for a significant portion of
IO-related tasks that could be offloaded to IO-threads.
* When the main-thread is processing client’s commands, the IO threads
are idle for a considerable amount of time.
* Notably, the main thread's performance during the IO-related tasks is
constrained by the speed of the slowest IO-thread.
* Limited Offloading: Currently, since the main thread waits
synchronously for the IO threads, the threads perform only read-parse
and write operations, with parsing done only for the first command. If
the threads can do work asynchronously, we may offload more work to the
threads, reducing the load on the main thread.
* TLS: Currently, we don't support IO threads with TLS (where offloading
IO would be more beneficial) since TLS read/write operations are not
thread-safe with the current implementation.
### Suggested change
Non-blocking main thread - The main thread and IO threads will operate
in parallel to maximize efficiency. The main thread will not be blocked
by IO operations. It will continue to process commands independently of
the IO thread's activities.
**Implementation details**
**Inter-thread communication.**
* We use a static, lock-free ring buffer of fixed size (2048 jobs) for
the main thread to send jobs and for the IO to receive them. If the ring
buffer fills up, the main thread will handle the task itself, acting as
back pressure (in case IO operations are more expensive than command
processing). A static ring buffer is a better candidate than a dynamic
job queue as it eliminates the need for allocation/freeing per job.
* An IO job will be in the format: ` [void* function-call-back | void
*data] `where data is either a client to read/write from and the
function-ptr is the function to be called with the data for example
readQueryFromClient using this format we can use it later to offload
other types of works to the IO threads.
* The ring buffer is one-way, from the main thread to the IO thread. Upon
a read/write event, the main thread will send a read/write job; then,
before sleep, it will iterate over the pending read/write clients,
checking for each client whether the IO threads have already finished
handling it. The IO thread signals it has finished handling a client
read/write by toggling an atomic flag read_state / write_state on the
client struct.
**Thread Safety**
As suggested in this solution, the IO threads are reading from and
writing to the clients' buffers while the main thread may access those
clients.
We must ensure no race conditions or unsafe access occurs while keeping
the Valkey code simple and lock free.
Minimal Action in the IO Threads
The main change is to limit the IO thread operations to the bare
minimum. The IO thread will access only the client's struct and only the
necessary fields in this struct.
The IO threads will be responsible for the following:
* Read Operation: The IO thread will only read and parse a single
command. It will not update the server stats, handle read errors, or
parsing errors. These tasks will be taken care of by the main thread.
* Write Operation: The IO thread will only write the available data. It
will not free the client's replies, handle write errors, or update the
server statistics.
To achieve this without code duplication, the read/write code has been
refactored into smaller, independent components:
* Functions that perform only the read/parse/write calls.
* Functions that handle the read/parse/write results.
This refactor accounts for the majority of the modifications in this PR.
**Client Struct Safe Access**
As we ensure that the IO threads access memory only within the client
struct, we need to ensure thread safety only for the client's struct's
shared fields.
* Query Buffer
* Command parsing - The main thread will not try to parse a command from
the query buffer when a client is offloaded to the IO thread.
* Client's memory checks in client-cron - The main thread will not
access the client query buffer if it is offloaded and will handle the
querybuf grow/shrink when the client is back.
* CLIENT LIST command - The main thread will busy-wait for the IO thread
to finish handling the client, falling back to the current behavior
where the main thread waits for the IO thread to finish their
processing.
* Output Buffer
* The IO thread will not change the client's bufpos and won't free the
client's reply lists. These actions will be done by the main thread on
the client's return from the IO thread.
* bufpos / block→used: As the main thread may change the bufpos, the
reply-block→used, or add/delete blocks to the reply list while the IO
thread writes, we add two fields to the client struct: io_last_bufpos
and io_last_reply_block. The IO thread will write until the
io_last_bufpos, which was set by the main-thread before sending the
client to the IO thread. If more data has been added to the cob in
between, it will be written in the next write-job. In addition, the main
thread will not trim or merge reply blocks while the client is
offloaded.
* Parsing Fields
* Client's cmd, argc, argv, reqtype, etc., are set during parsing.
* The main thread will indicate to the IO thread not to parse a cmd if
the client is not reset. In this case, the IO thread will only read from
the network and won't attempt to parse a new command.
* The main thread won't access the c→cmd/c→argv in the CLIENT LIST
command as stated before it will busy wait for the IO threads.
* Client Flags
* c→flags, which may be changed by the main thread in multiple places,
won't be accessed by the IO thread. Instead, the main thread will set
the c→io_flags with the information necessary for the IO thread to know
the client's state.
* Client Close
* On freeClient, the main thread will busy wait for the IO thread to
finish processing the client's read/write before proceeding to free the
client.
* Client's Memory Limits
* The IO thread won't handle the qb/cob limits. In case a client crosses
the qb limit, the IO thread will stop reading for it, letting the main
thread know that the client crossed the limit.
**TLS**
TLS is currently not supported with IO threads for the following
reasons:
1. Pending reads - If SSL has pending data that has already been read
from the socket, there is a risk of not calling the read handler again.
To handle this, a list is used to hold the pending clients. With IO
threads, multiple threads can access the list concurrently.
2. Event loop modification - Currently, the TLS code
registers/unregisters the file descriptor from the event loop depending
on the read/write results. With IO threads, multiple threads can modify
the event loop struct simultaneously.
3. The same client can be sent to 2 different threads concurrently
(https://github.com/redis/redis/issues/12540).
Those issues were handled in the current PR:
1. The IO thread only performs the read operation. The main thread will
check for pending reads after the client returns from the IO thread and
will be the only one to access the pending list.
2. The registering/unregistering of events will be similarly postponed
and handled by the main thread only.
3. Each client is being sent to the same dedicated thread (c→id %
num_of_threads).
**Sending Replies Immediately with IO threads.**
Currently, after processing a command, we add the client to the
pending_writes_list. Only after processing all the clients do we send
all the replies. Since the IO threads are now working asynchronously, we
can send the reply immediately after processing the client’s requests,
reducing the command latency. However, if we are using AOF=always, we
must wait for the AOF buffer to be written, in which case we revert to
the current behavior.
**IO threads dynamic adjustment**
Currently, we use an all-or-nothing approach when activating the IO
threads. The current logic is as follows: if the number of pending write
clients is greater than twice the number of threads (including the main
thread), we enable all threads; otherwise, we enable none. For example,
if 8 IO threads are defined, we enable all 8 threads if there are 16
pending clients; else, we enable none.
It makes more sense to enable partial activation of the IO threads. If
we have 10 pending clients, we will enable 5 threads, and so on. This
approach allows for a more granular and efficient allocation of
resources based on the current workload.
In addition, the user will now be able to change the number of I/O
threads at runtime. For example, when decreasing the number of threads
from 4 to 2, threads 3 and 4 will be closed after flushing their job
queues.
**Tests**
Currently, we run the io-threads tests with 4 IO threads
(443d80f168/.github/workflows/daily.yml (L353)).
This means that we will not activate the IO threads unless there are 8
(threads * 2) pending write clients in a single loop, which is unlikely
to happen in most tests, meaning the IO threads are not currently
being tested.
To enforce the main thread to always offload work to the IO threads,
regardless of the number of pending events, we add an
events-per-io-thread configuration with a default value of 2. When set
to 0, this configuration will force the main thread to always offload
work to the IO threads.
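A sketch of the CI-style configuration this enables (values follow the
description above):
```
io-threads 4
# 0 forces the main thread to always offload work to the IO threads
events-per-io-thread 0
```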
When we offload every single read/write operation to the IO threads, the
IO threads run at 100% CPU. When running multiple tests
concurrently, some tests fail as a result of larger-than-expected command
latencies. To address this issue, we had to add some after or wait_for
calls to some of the tests to ensure they pass with IO threads as well.
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
When Redis/Valkey/KeyDB is run in a cloud environment across multiple
AZs, it is preferable to keep traffic local to an AZ, both for cost
reasons and for latency. This is typically done when you are enabling
reads on replicas with the READONLY command.
For this change we are creating a setting that is echoed back in the
INFO command. We do not want to add the cloud SDKs as dependencies, and
this is the easiest way around that. It is fairly trivial to grab the AZ
from the cloud and push that into your settings file.
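A hedged sketch, assuming the setting is the `availability-zone` directive
mentioned elsewhere in this log (the zone name is illustrative):
```
availability-zone us-east-1a
```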
Currently at Snapchat we have a custom client that after connecting
reads this from the server and will preferentially use that server if
the AZ string matches its internally configured AZ.
In the future it would be ideal if we used this information when
performing failover or even exposed it in cluster nodes.
Signed-off-by: John Sully <john@csquare.ca>