Introduce atomic slot migration (#1949)

Introduces a new family of commands for migrating slots via replication.
The procedure is driven by the source node, which pushes an AOF-formatted
snapshot of the slots to the target, followed by a replication stream of
changes to those slots (a la manual failover).

This is an adaptation of the approach provided by @enjoy-binbin,
combined with the solution I previously posted in #1591, modified to
meet the designs we outlined in #23.

## New commands

* `CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE
node-id`: Begin sending the slots via replication to the target. Multiple
targets can be specified by repeating `SLOTSRANGE ... NODE ...` (see the
example below).
* `CLUSTER CANCELSLOTMIGRATIONS ALL`: Cancel all slot migrations.
* `CLUSTER GETSLOTMIGRATIONS`: See a log of recent migrations.
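
For example, migrating two slot ranges to one node and a third range to
another in a single command (the node IDs below are placeholders):

```
127.0.0.1:6379> CLUSTER MIGRATESLOTS SLOTSRANGE 0 100 200 300 NODE <node-id-1> SLOTSRANGE 301 400 NODE <node-id-2>
OK
```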

This PR only implements "one shot" semantics with an asynchronous model.
Later, "two phase" semantics (e.g. slot-level replicate/failover
commands) can be added on top of the same core.

## Slot migration jobs

Introduces the concept of a slot migration job. While active, a job
tracks a connection created by the source to the target, over which the
contents of the slots are sent. This connection is used for control
messages as well as replicated slot data. Each job is given a
40-character random name to uniquely identify it.

All jobs, including those that finished recently, can be observed using
the `CLUSTER GETSLOTMIGRATIONS` command.
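
For illustration, the reply is shaped like the reply schema added in
this PR (the values and exact rendering below are placeholders):

```
127.0.0.1:6379> CLUSTER GETSLOTMIGRATIONS
1)  1) "link_name"
    2) "<40-character hex job name>"
    3) "operation"
    4) "EXPORT"
    5) "slot_ranges"
    6) "0-100 200-300"
    7) "node"
    8) "<target node-id>"
    9) "state"
   10) "<state name>"
```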

## Replication

* Since the snapshot uses the AOF format, it can be replayed verbatim to
any replicas of the target node.
* We use the same proxying mechanism used for chained replication to
copy the content sent by the source node directly to the replica nodes
(sketched below).
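
Schematically (a sketch of the data path, not a wire trace):

```
source primary ──AOF snapshot + change stream──▶ target primary
                                                       │ verbatim copy (proxy)
                                                       ▼
                                                 target replicas
```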

## `CLUSTER SYNCSLOTS`

To coordinate the state machine transitions across the two nodes, a new
command, `CLUSTER SYNCSLOTS`, is added to carry this control flow.

Each end of the slot migration connection is expected to install a read
handler in order to handle `CLUSTER SYNCSLOTS` subcommands (a sketch of
the full exchange follows the list):

* `ESTABLISH`: Begins a slot migration. Provides slot migration
information to the target and authorizes the connection to write to
unowned slots.
* `SNAPSHOT-EOF`: Appended to the end of the snapshot to signal that the
snapshot has been fully written to the target.
* `PAUSE`: Informs the source node to pause whenever it gets the
opportunity.
* `PAUSED`: Appended to the end of the client output buffer when the
pause is performed. The pause is only performed after the buffer shrinks
below a configurable size.
* `REQUEST-FAILOVER`: Requests that the source either grant or deny a
failover for the slot migration. The failover is only granted if the
pause is still in effect. Once a failover is granted, the pause is
refreshed for a short duration.
* `FAILOVER-GRANTED`: Sent to the target to inform it that
`REQUEST-FAILOVER` was granted.
* `ACK`: Heartbeat command used to ensure liveness.
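
Putting these together, a successful migration exchanges roughly the
following messages over the migration connection (a sketch assembled
from the subcommands above and the state machines below, not a verbatim
trace):

```
source -> target: CLUSTER SYNCSLOTS ESTABLISH ...       (target replies +OK)
source -> target: CLUSTER SYNCSLOTS ACK                 (import leaves WAIT_ACK)
source -> target: <AOF-formatted snapshot of the slots>
source -> target: CLUSTER SYNCSLOTS SNAPSHOT-EOF
source -> target: <replication stream of slot changes>
target -> source: CLUSTER SYNCSLOTS PAUSE
source -> target: CLUSTER SYNCSLOTS PAUSED              (after the buffer drains)
target -> source: CLUSTER SYNCSLOTS REQUEST-FAILOVER
source -> target: CLUSTER SYNCSLOTS FAILOVER-GRANTED
target: performs the takeover and broadcasts the new topology
```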

## Interactions with other commands

* FLUSHDB on the source node (which flushes the migrating slots) will
result in the source dropping the connection, which will flush the slot
on the target and reset the state machine back to the beginning. The
subsequent retry should succeed very quickly (the slot is now empty).
* FLUSHDB on the target will fail the slot migration. We can iterate
with better handling, but for now it is expected that the operator would
retry.
* Generally, FLUSHDB is expected to be executed cluster wide, so
preserving partially migrated slots doesn't make much sense.
* SCAN and KEYS are filtered to avoid exposing importing slot data
(illustrated below).
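
For instance, in a hypothetical session against the importing node while
a migration is in flight, keys already copied into the importing slot
stay hidden (the pattern below is assumed to hash to an importing slot):

```
target:6379> KEYS {user1000}:*
(empty array)
```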

## Error handling

* For any transient connection drop, the migration will be failed and
require the user to retry.
* If there is an OOM while reading from the import connection, we will
fail the import, which will drop the importing slot data.
* If the client output buffer limit is reached on the source node, it
will drop the connection, which will cause the migration to fail
(related config knobs are shown after this list).
* If at any point the exporting node loses ownership or either node is
failed over, a callback will be triggered on both ends of the migration
to fail the import. The import will not be reattempted with a new owner.
* The two ends of the migration routinely ping each other with
`SYNCSLOTS ACK` messages. If at any point there is no interaction on the
connection for longer than `repl-timeout`, the connection will be
dropped, resulting in migration failure.
* If a failover happens, we will drop keys in all unowned slots. The
migration does not persist through failovers and would need to be
retried on the new source/target.
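
Two tunables from this PR's config.c changes are relevant here; the
defaults are as registered there, but the semantics of the second are my
reading of its name, so treat that comment as an assumption:

```
# Bound on the CLUSTER GETSLOTMIGRATIONS history (default 1000 entries).
cluster-slot-migration-log-max-len 1000

# Presumed: output buffer threshold that must be drained before the
# failover pause is performed (memory value; default 0, -1 allowed).
slot-migration-max-failover-repl-bytes 0
```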

## State machine

```
                                                                            
                Target/Importing Node State Machine                         
   ─────────────────────────────────────────────────────────────            
                                                                            
             ┌────────────────────┐
             │SLOT_IMPORT_WAIT_ACK┼──────┐
             └──────────┬─────────┘      │
                     ACK│                │
         ┌──────────────▼─────────────┐  │
         │SLOT_IMPORT_RECEIVE_SNAPSHOT┼──┤
         └──────────────┬─────────────┘  │
            SNAPSHOT-EOF│                │                                  
        ┌───────────────▼──────────────┐ │                                  
        │SLOT_IMPORT_WAITING_FOR_PAUSED┼─┤                                  
        └───────────────┬──────────────┘ │                                  
                  PAUSED│                │                                  
        ┌───────────────▼──────────────┐ │ Error Conditions:                
        │SLOT_IMPORT_FAILOVER_REQUESTED┼─┤  1. OOM                          
        └───────────────┬──────────────┘ │  2. Slot Ownership Change        
        FAILOVER-GRANTED│                │  3. Demotion to replica          
         ┌──────────────▼─────────────┐  │  4. FLUSHDB                      
         │SLOT_IMPORT_FAILOVER_GRANTED┼──┤  5. Connection Lost              
         └──────────────┬─────────────┘  │  6. No ACK from source (timeout) 
      Takeover Performed│                │                                  
         ┌──────────────▼───────────┐    │                                  
         │SLOT_MIGRATION_JOB_SUCCESS┼────┤                                  
         └──────────────────────────┘    │                                  
                                         │                                  
   ┌─────────────────────────────────────▼─┐                                
   │SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP│                                
   └────────────────────┬──────────────────┘                                
Unowned Slots Cleaned Up│                                                   
          ┌─────────────▼───────────┐                                      
          │SLOT_MIGRATION_JOB_FAILED│                                      
          └─────────────────────────┘                                      

                                                                                           
                                                                                           
                      Source/Exporting Node State Machine                                  
         ─────────────────────────────────────────────────────────────                     
                                                                                           
               ┌──────────────────────┐                                                    
               │SLOT_EXPORT_CONNECTING├─────────┐                                          
               └───────────┬──────────┘         │                                          
                  Connected│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_AUTHENTICATING┼───────┤                                          
             └─────────────┬────────────┘       │                                          
              Authenticated│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_SEND_ESTABLISH┼───────┤                                          
             └─────────────┬────────────┘       │                                          
  ESTABLISH command written│                    │                                          
     ┌─────────────────────▼─────────────┐      │                                          
     │SLOT_EXPORT_READ_ESTABLISH_RESPONSE┼──────┤                                          
     └─────────────────────┬─────────────┘      │                                          
   Full response read (+OK)│                    │                                          
          ┌────────────────▼──────────────┐     │ Error Conditions:                        
          │SLOT_EXPORT_WAITING_TO_SNAPSHOT┼─────┤  1. User sends CANCELMIGRATION           
          └────────────────┬──────────────┘     │  2. Slot ownership change                
     No other child process│                    │  3. Demotion to replica                  
              ┌────────────▼───────────┐        │  4. FLUSHDB                              
              │SLOT_EXPORT_SNAPSHOTTING┼────────┤  5. Connection Lost                      
              └────────────┬───────────┘        │  6. AUTH failed                          
              Snapshot done│                    │  7. ERR from ESTABLISH command           
               ┌───────────▼─────────┐          │  8. Unpaused before failover completed   
               │SLOT_EXPORT_STREAMING┼──────────┤  9. Snapshot failed (e.g. Child OOM)     
               └───────────┬─────────┘          │  10. No ack from target (timeout)        
                      PAUSE│                    │  11. Client output buffer overrun        
            ┌──────────────▼─────────────┐      │                                          
            │SLOT_EXPORT_WAITING_TO_PAUSE┼──────┤                                          
            └──────────────┬─────────────┘      │                                          
             Buffer drained│                    │                                          
            ┌──────────────▼────────────┐       │                                          
            │SLOT_EXPORT_FAILOVER_PAUSED┼───────┤                                          
            └──────────────┬────────────┘       │                                          
   Failover request granted│                    │                                          
           ┌───────────────▼────────────┐       │                                          
           │SLOT_EXPORT_FAILOVER_GRANTED┼───────┤                                          
           └───────────────┬────────────┘       │                                          
      New topology received│                    │                                          
            ┌──────────────▼───────────┐        │                                          
            │SLOT_MIGRATION_JOB_SUCCESS│        │                                          
            └──────────────────────────┘        │                                          
                                                │                                          
            ┌─────────────────────────┐         │                                          
            │SLOT_MIGRATION_JOB_FAILED│◄────────┤                                          
            └─────────────────────────┘         │                                          
                                                │                                          
           ┌────────────────────────────┐       │                                          
           │SLOT_MIGRATION_JOB_CANCELLED│◄──────┘                                          
           └────────────────────────────┘                                                 
```

Co-authored-by: Binbin <binloveplay1314@qq.com>

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
commit d7993b78d8 (parent 725c3608f0)
Jacob Murphy, 2025-08-11 18:02:37 -07:00, committed by GitHub
40 changed files with 5179 additions and 408 deletions

CMakeLists.txt
@ -44,6 +44,7 @@ set(VALKEY_SERVER_SRCS
${CMAKE_SOURCE_DIR}/src/intset.c
${CMAKE_SOURCE_DIR}/src/syncio.c
${CMAKE_SOURCE_DIR}/src/cluster.c
${CMAKE_SOURCE_DIR}/src/cluster_migrateslots.c
${CMAKE_SOURCE_DIR}/src/cluster_legacy.c
${CMAKE_SOURCE_DIR}/src/cluster_slot_stats.c
${CMAKE_SOURCE_DIR}/src/crc16.c

src/Makefile

@ -423,7 +423,7 @@ ENGINE_NAME=valkey
SERVER_NAME=$(ENGINE_NAME)-server$(PROG_SUFFIX)
ENGINE_SENTINEL_NAME=$(ENGINE_NAME)-sentinel$(PROG_SUFFIX)
ENGINE_TRACE_OBJ=trace/trace.o trace/trace_commands.o trace/trace_db.o trace/trace_cluster.o trace/trace_server.o trace/trace_rdb.o trace/trace_aof.o
ENGINE_SERVER_OBJ=threads_mngr.o adlist.o vector.o quicklist.o ae.o anet.o dict.o hashtable.o kvstore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o memory_prefetch.o io_threads.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_legacy.o cluster_slot_stats.o crc16.o endianconv.o commandlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crccombine.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o valkey-check-rdb.o valkey-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o allocator_defrag.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script.o functions.o commands.o strl.o connection.o unix.o logreqres.o rdma.o scripting_engine.o entry.o vset.o lua/script_lua.o lua/function_lua.o lua/engine_lua.o lua/debug_lua.o
ENGINE_SERVER_OBJ=threads_mngr.o adlist.o vector.o quicklist.o ae.o anet.o dict.o hashtable.o kvstore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o memory_prefetch.o io_threads.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_legacy.o cluster_slot_stats.o crc16.o cluster_migrateslots.o endianconv.o commandlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crccombine.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o valkey-check-rdb.o valkey-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o allocator_defrag.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script.o functions.o commands.o strl.o connection.o unix.o logreqres.o rdma.o scripting_engine.o entry.o vset.o lua/script_lua.o lua/function_lua.o lua/engine_lua.o lua/debug_lua.o
ENGINE_SERVER_OBJ+=$(ENGINE_TRACE_OBJ)
ENGINE_CLI_NAME=$(ENGINE_NAME)-cli$(PROG_SUFFIX)
ENGINE_CLI_OBJ=anet.o adlist.o dict.o valkey-cli.o zmalloc.o release.o ae.o serverassert.o crcspeed.o crccombine.o crc64.o siphash.o crc16.o monotonic.o cli_common.o mt19937-64.o strl.o cli_commands.o sds.o util.o sha256.o

src/aof.c

@ -2218,6 +2218,103 @@ werr:
return 0;
}
int rewriteSelectDbRio(rio *aof, int db_num) {
char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";
if (rioWrite(aof, selectcmd, sizeof(selectcmd) - 1) == 0) return C_ERR;
if (rioWriteBulkLongLong(aof, db_num) == 0) return C_ERR;
return C_OK;
}
int rewriteObjectRio(rio *aof, robj *o, int db_num) {
size_t aof_bytes_before_key = aof->processed_bytes;
sds keystr;
robj key;
long long expiretime;
keystr = objectGetKey(o);
initStaticStringObject(key, keystr);
expiretime = objectGetExpire(o);
/* Save the key and associated value */
if (o->type == OBJ_STRING) {
/* Emit a SET command */
char cmd[] = "*3\r\n$3\r\nSET\r\n";
if (rioWrite(aof, cmd, sizeof(cmd) - 1) == 0) return C_ERR;
/* Key and value */
if (rioWriteBulkObject(aof, &key) == 0) return C_ERR;
if (rioWriteBulkObject(aof, o) == 0) return C_ERR;
} else if (o->type == OBJ_LIST) {
if (rewriteListObject(aof, &key, o) == 0) return C_ERR;
} else if (o->type == OBJ_SET) {
if (rewriteSetObject(aof, &key, o) == 0) return C_ERR;
} else if (o->type == OBJ_ZSET) {
if (rewriteSortedSetObject(aof, &key, o) == 0) return C_ERR;
} else if (o->type == OBJ_HASH) {
if (rewriteHashObject(aof, &key, o) == 0) return C_ERR;
} else if (o->type == OBJ_STREAM) {
if (rewriteStreamObject(aof, &key, o) == 0) return C_ERR;
} else if (o->type == OBJ_MODULE) {
if (rewriteModuleObject(aof, &key, o, db_num) == 0) return C_ERR;
} else {
serverPanic("Unknown object type");
}
/* In fork child process, we can try to release memory back to the
* OS and possibly avoid or decrease COW. We give the dismiss
* mechanism a hint about an estimated size of the object we stored. */
size_t dump_size = aof->processed_bytes - aof_bytes_before_key;
if (server.in_fork_child) dismissObject(o, dump_size);
/* Save the expire time */
if (expiretime != -1) {
char cmd[] = "*3\r\n$9\r\nPEXPIREAT\r\n";
if (rioWrite(aof, cmd, sizeof(cmd) - 1) == 0) return C_ERR;
if (rioWriteBulkObject(aof, &key) == 0) return C_ERR;
if (rioWriteBulkLongLong(aof, expiretime) == 0) return C_ERR;
}
/* Delay before next key if required (for testing) */
if (server.rdb_key_save_delay) debugDelay(server.rdb_key_save_delay);
return C_OK;
}
int rewriteSlotToAppendOnlyFileRio(rio *aof, int db_num, int hashslot, size_t *key_count) {
long long updated_time = 0;
if (rewriteFunctions(aof) == 0) return C_ERR;
if (dbHasNoKeys(db_num)) return C_OK;
serverDb *db = server.db[db_num];
if (kvstoreHashtableSize(db->keys, hashslot) == 0) return C_OK;
/* SELECT the DB */
if (rewriteSelectDbRio(aof, db_num) == C_ERR) return C_ERR;
kvstoreHashtableIterator *iter = kvstoreGetHashtableIterator(db->keys, hashslot, HASHTABLE_ITER_SAFE | HASHTABLE_ITER_PREFETCH_VALUES);
void *next;
while (kvstoreHashtableIteratorNext(iter, &next)) {
robj *o = next;
/* Update info every 1 second (approximately).
* in order to avoid calling mstime() on each iteration, we will
* check the diff every 1024 keys */
if (key_count && ((*key_count)++ & 1023) == 0) {
long long now = mstime();
if (now - updated_time >= 1000) {
sendChildInfo(CHILD_INFO_TYPE_CURRENT_INFO, *key_count, "AOF rewrite");
updated_time = now;
}
}
if (rewriteObjectRio(aof, o, db_num) == C_ERR) return C_ERR;
}
kvstoreReleaseHashtableIterator(iter);
return C_OK;
}
int rewriteAppendOnlyFileRio(rio *aof) {
int j;
long key_count = 0;
@ -2234,69 +2331,20 @@ int rewriteAppendOnlyFileRio(rio *aof) {
sdsfree(ts);
}
if (rewriteFunctions(aof) == 0) goto werr;
if (rewriteFunctions(aof) == C_ERR) goto werr;
for (j = 0; j < server.dbnum; j++) {
char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";
if (dbHasNoKeys(j)) continue;
serverDb *db = server.db[j];
/* SELECT the new DB */
if (rioWrite(aof, selectcmd, sizeof(selectcmd) - 1) == 0) goto werr;
if (rioWriteBulkLongLong(aof, j) == 0) goto werr;
if (rewriteSelectDbRio(aof, j) == C_ERR) goto werr;
kvs_it = kvstoreIteratorInit(db->keys, HASHTABLE_ITER_SAFE | HASHTABLE_ITER_PREFETCH_VALUES);
/* Iterate this DB writing every entry */
void *next;
while (kvstoreIteratorNext(kvs_it, &next)) {
robj *o = next;
sds keystr;
robj key;
long long expiretime;
size_t aof_bytes_before_key = aof->processed_bytes;
keystr = objectGetKey(o);
initStaticStringObject(key, keystr);
expiretime = objectGetExpire(o);
/* Save the key and associated value */
if (o->type == OBJ_STRING) {
/* Emit a SET command */
char cmd[] = "*3\r\n$3\r\nSET\r\n";
if (rioWrite(aof, cmd, sizeof(cmd) - 1) == 0) goto werr;
/* Key and value */
if (rioWriteBulkObject(aof, &key) == 0) goto werr;
if (rioWriteBulkObject(aof, o) == 0) goto werr;
} else if (o->type == OBJ_LIST) {
if (rewriteListObject(aof, &key, o) == 0) goto werr;
} else if (o->type == OBJ_SET) {
if (rewriteSetObject(aof, &key, o) == 0) goto werr;
} else if (o->type == OBJ_ZSET) {
if (rewriteSortedSetObject(aof, &key, o) == 0) goto werr;
} else if (o->type == OBJ_HASH) {
if (rewriteHashObject(aof, &key, o) == 0) goto werr;
} else if (o->type == OBJ_STREAM) {
if (rewriteStreamObject(aof, &key, o) == 0) goto werr;
} else if (o->type == OBJ_MODULE) {
if (rewriteModuleObject(aof, &key, o, j) == 0) goto werr;
} else {
serverPanic("Unknown object type");
}
/* In fork child process, we can try to release memory back to the
* OS and possibly avoid or decrease COW. We give the dismiss
* mechanism a hint about an estimated size of the object we stored. */
size_t dump_size = aof->processed_bytes - aof_bytes_before_key;
if (server.in_fork_child) dismissObject(o, dump_size);
/* Save the expire time */
if (expiretime != -1) {
char cmd[] = "*3\r\n$9\r\nPEXPIREAT\r\n";
if (rioWrite(aof, cmd, sizeof(cmd) - 1) == 0) goto werr;
if (rioWriteBulkObject(aof, &key) == 0) goto werr;
if (rioWriteBulkLongLong(aof, expiretime) == 0) goto werr;
}
/* Update info every 1 second (approximately).
* in order to avoid calling mstime() on each iteration, we will
@ -2309,8 +2357,7 @@ int rewriteAppendOnlyFileRio(rio *aof) {
}
}
/* Delay before next key if required (for testing) */
if (server.rdb_key_save_delay) debugDelay(server.rdb_key_save_delay);
if (rewriteObjectRio(aof, o, j) == C_ERR) goto werr;
}
kvstoreIteratorRelease(kvs_it);
}

src/blocked.c

@ -104,8 +104,8 @@ void freeClientBlockingState(client *c) {
* flag is set the client query buffer is no longer processed, but accumulated,
* and will be processed when the client is unblocked. */
void blockClient(client *c, int btype) {
/* Primary client should never be blocked unless pause or module */
serverAssert(!(c->flag.primary && btype != BLOCKED_MODULE && btype != BLOCKED_POSTPONE));
/* Replicated clients should never be blocked unless pause or module */
serverAssert(!(isReplicatedClient(c) && btype != BLOCKED_MODULE && btype != BLOCKED_POSTPONE));
initClientBlockingState(c);

src/cluster.c

@ -1596,7 +1596,6 @@ void resetClusterStats(void) {
clusterSlotStatResetAll();
}
void clusterCommandFlushslot(client *c) {
int slot;
int lazy = server.lazyfree_lazy_user_flush;

src/cluster.h

@ -28,6 +28,9 @@
#define CLUSTER_REDIR_DOWN_UNBOUND 6 /* -CLUSTERDOWN, unbound slot. */
#define CLUSTER_REDIR_DOWN_RO_STATE 7 /* -CLUSTERDOWN, allow reads. */
/* Fixed timeout value for cluster operations (milliseconds) */
#define CLUSTER_OPERATION_TIMEOUT 2000
typedef struct _clusterNode clusterNode;
struct clusterState;
@ -38,6 +41,10 @@ struct clusterState;
#define CLUSTER_MODULE_FLAG_NO_FAILOVER (1 << 1)
#define CLUSTER_MODULE_FLAG_NO_REDIRECTION (1 << 2)
/* For clusterBroadcastPong */
#define CLUSTER_BROADCAST_ALL 0 /* All known instances. */
#define CLUSTER_BROADCAST_LOCAL_REPLICAS 1 /* All replicas in my primary-replicas ring. */
/* ---------------------- API exported outside cluster.c -------------------- */
/* functions requiring mechanism specific implementations */
void clusterInit(void);
@ -62,6 +69,7 @@ void clusterUpdateMyselfAnnouncedPorts(void);
void clusterUpdateMyselfHumanNodename(void);
void clusterPropagatePublish(robj *channel, robj *message, int sharded);
void clusterBroadcastPong(int target);
unsigned long getClusterConnectionsCount(void);
int isClusterHealthy(void);
@ -118,6 +126,9 @@ void clearCachedClusterSlotsResponse(void);
unsigned int countKeysInSlotForDb(unsigned int hashslot, serverDb *db);
unsigned int countKeysInSlot(unsigned int hashslot);
int getSlotOrReply(client *c, robj *o);
int getNodeDefaultReplicationPort(clusterNode *node);
bool isAnySlotInManualImportingState(void);
bool isAnySlotInManualMigratingState(void);
/* functions with shared implementations */
int clusterNodeIsMyself(clusterNode *n);
@ -137,4 +148,13 @@ long long getNodeReplicationOffset(clusterNode *node);
sds aggregateClientOutputBuffer(client *c);
void resetClusterStats(void);
unsigned int delKeysInSlot(unsigned int hashslot, int lazy, bool propagate_del, bool send_del_event);
unsigned int propagateSlotDeletionByKeys(unsigned int hashslot);
void clusterUpdateState(void);
void clusterSaveConfigOrDie(int do_fsync);
int clusterDelSlot(int slot);
int clusterAddSlot(clusterNode *n, int slot);
int clusterBumpConfigEpochWithoutConsensus(void);
void clusterDoBeforeSleep(int flags);
#endif /* __CLUSTER_H */

src/cluster_legacy.c

@ -40,6 +40,7 @@
#include "cluster.h"
#include "cluster_legacy.h"
#include "cluster_slot_stats.h"
#include "cluster_migrateslots.h"
#include "endianconv.h"
#include "connection.h"
#include "module.h"
@ -139,7 +140,7 @@ int getNodeDefaultClientPort(clusterNode *n) {
return server.tls_cluster ? n->tls_port : n->tcp_port;
}
static inline int getNodeDefaultReplicationPort(clusterNode *n) {
int getNodeDefaultReplicationPort(clusterNode *n) {
return server.tls_replication ? n->tls_port : n->tcp_port;
}
@ -161,9 +162,6 @@ static_assert(offsetof(clusterMsg, type) + sizeof(uint16_t) == RCVBUF_MIN_READ_L
#define RCVBUF_MAX_PREALLOC (1 << 20) /* 1MB */
/* Fixed timeout value for cluster operations (milliseconds) */
#define CLUSTER_OPERATION_TIMEOUT 2000
/* Cluster nodes hash table, mapping nodes addresses 1.2.3.4:6379 to
* clusterNode structures. */
dictType clusterNodesDictType = {
@ -1308,6 +1306,9 @@ void clusterInit(void) {
serverAssert(rdbRegisterAuxField("cluster-slot-states", clusterEncodeOpenSlotsAuxField,
clusterDecodeOpenSlotsAuxField) == C_OK);
/* Initialize list for slot migration jobs. */
initClusterSlotMigrationJobList();
/* Set myself->port/cport/pport to my listening ports, we'll just need to
* discover the IP address via MEET messages. */
deriveAnnouncedPorts(&myself->tcp_port, &myself->tls_port, &myself->cport);
@ -1479,6 +1480,10 @@ void clusterReset(int hard) {
/* Empty the nodes blacklist. */
dictEmpty(server.cluster->nodes_black_list, NULL);
/* Drop all incoming and outgoing links for slot import. */
clusterUpdateSlotExportsOnOwnershipChange();
clusterUpdateSlotImportsOnOwnershipChange();
/* Hard reset only: set epochs to 0, change node ID. */
if (hard) {
sds oldname;
@ -2763,6 +2768,10 @@ void clusterUpdateSlotsConfigWith(clusterNode *sender, uint64_t senderConfigEpoc
int dirty_slots_count = 0;
int delete_dirty_slots = 0;
/* Handle importing/exporting slots which have topology updates. */
int exporting_slots_count = 0;
int importing_slots_count = 0;
/* We should detect if sender is new primary of our shard.
* We will know it if all our slots were migrated to sender, and sender
* has no slots except ours */
@ -2792,10 +2801,19 @@ void clusterUpdateSlotsConfigWith(clusterNode *sender, uint64_t senderConfigEpoc
continue;
}
/* We rebind the slot to the new node claiming it if
* the slot was unassigned or the new node claims it with a
* greater configEpoch. */
if (isSlotUnclaimed(j) || server.cluster->slots[j]->configEpoch < senderConfigEpoch) {
/* We rebind the slot to the new node claiming it if the slot was
* unassigned or the new node claims it with a greater configEpoch.
*
* Additionally, note that during slot migration, if we have bumped
* our epoch recently (e.g. due to our own slot import) then it is
* possible the epoch on the target after bumping is <= our epoch.
* This would normally cause our node to prevent the topology change
* from being accepted. To counter this, if our node is aware of the
* migration, we will accept the topology update regardless of the
* epoch. */
if (isSlotUnclaimed(j) ||
server.cluster->slots[j]->configEpoch < senderConfigEpoch ||
clusterSlotFailoverGranted(j)) {
if (!isSlotUnclaimed(j) && !areInSameShard(server.cluster->slots[j], sender)) {
serverLog(LL_NOTICE,
"Slot %d is migrated from node %.40s (%s) in shard %.40s"
@ -2812,6 +2830,10 @@ void clusterUpdateSlotsConfigWith(clusterNode *sender, uint64_t senderConfigEpoc
dirty_slots_count++;
}
if (clusterIsSlotExporting(j)) exporting_slots_count++;
if (clusterIsSlotImporting(j)) importing_slots_count++;
if (server.cluster->slots[j] == cur_primary) {
new_primary = sender;
migrated_our_slots++;
@ -3006,6 +3028,15 @@ void clusterUpdateSlotsConfigWith(clusterNode *sender, uint64_t senderConfigEpoc
delete_dirty_slots = 1;
}
if (exporting_slots_count) {
/* At least one slot we were exporting has a topology change. */
clusterUpdateSlotExportsOnOwnershipChange();
}
if (importing_slots_count) {
/* At least one slot we were importing has a topology change. */
clusterUpdateSlotImportsOnOwnershipChange();
}
if (delete_dirty_slots) {
for (int j = 0; j < dirty_slots_count; j++) {
serverLog(LL_NOTICE, "Deleting keys in dirty slot %d on node %.40s (%s) in shard %.40s", dirty_slots[j],
@ -5069,6 +5100,9 @@ void clusterFailoverReplaceYourPrimary(void) {
/* 5) If there was a manual failover in progress, clear the state. */
resetManualFailover();
/* 6) Upon becoming primary, we need to ensure that data is deleted in unowned slots. */
verifyClusterConfigWithData();
/* Since we have became a new primary node, we may rely on auth_time to
* determine whether a failover is in progress, so it is best to reset it. */
server.cluster->failover_auth_time = 0;
@ -5605,6 +5639,9 @@ void clusterCron(void) {
clusterUpdateMyselfHostname();
/* Drive in progress slot import/export links. */
clusterSlotMigrationCron();
/* Clear so clusterNodeCronHandleReconnect can count the number of nodes in PFAIL. */
server.cluster->stats_pfail_nodes = 0;
/* Run through some of the operations we want to do on each cluster node. */
@ -5803,6 +5840,10 @@ void clusterBeforeSleep(void) {
}
}
if (flags & CLUSTER_TODO_HANDLE_SLOT_MIGRATION) {
clusterSlotMigrationCron();
}
/* Save the config, possibly using fsync. */
if (flags & CLUSTER_TODO_SAVE_CONFIG) {
int fsync = flags & CLUSTER_TODO_FSYNC_CONFIG;
@ -6247,6 +6288,10 @@ static void clusterSetPrimary(clusterNode *n, int closeSlots, int full_sync_requ
removeAllNotOwnedShardChannelSubscriptions();
resetManualFailover();
/* Becoming a replica cancels all in progress imports and exports */
clusterUpdateSlotExportsOnOwnershipChange();
clusterUpdateSlotImportsOnOwnershipChange();
if (server.cluster->failover_auth_time) {
/* Since we have changed to a new primary node, the previously set
* failover_auth_time should no longer be used, whether it is in
@ -6849,7 +6894,7 @@ unsigned int delKeysInSlot(unsigned int hashslot, int lazy, bool propagate_del,
dbSyncDelete(db, key);
}
// If invoked from a command, skip propagating the deletion
if (propagate_del) propagateDeletion(db, key, lazy);
if (propagate_del) propagateDeletion(db, key, lazy, hashslot);
signalModifiedKey(NULL, db, key);
if (send_del_event) {
/* In the `cluster flushslot` scenario, the keys are actually deleted so notify everyone. */
@ -6868,6 +6913,7 @@ unsigned int delKeysInSlot(unsigned int hashslot, int lazy, bool propagate_del,
}
kvstoreReleaseHashtableIterator(kvs_di);
}
server.server_del_keys_in_slot = 0;
serverAssert(server.execution_nesting == before_execution_nesting);
return j;
@ -7040,6 +7086,15 @@ int clusterParseSetSlotCommand(client *c, int *slot_out, clusterNode **node_out,
return 0;
}
if (clusterIsAnySlotImporting()) {
addReplyError(c, "Slot import in progress.");
return 0;
}
if (clusterIsAnySlotExporting()) {
addReplyError(c, "Slot export in progress.");
return 0;
}
/* If 'myself' is a replica, 'c' must be the primary client. */
serverAssert(!nodeIsReplica(myself) || c == server.primary);
@ -7470,8 +7525,13 @@ int clusterCommandSpecial(client *c) {
serverLog(LL_NOTICE, "Stop replication and turning myself into empty primary (request from '%s').", client);
sdsfree(client);
clusterSetNodeAsPrimary(myself);
clusterPromoteSelfToPrimary();
/* Flush the data before promoting myself, since promotion will try
* to delete data in unowned slots, and we know all data will be
* removed anyways. */
flushAllDataAndResetRDB(server.repl_replica_lazy_flush ? EMPTYDB_ASYNC : EMPTYDB_NO_FLAGS);
clusterPromoteSelfToPrimary();
clusterCloseAllSlots();
resetManualFailover();
@ -7675,6 +7735,18 @@ int clusterCommandSpecial(client *c) {
} else if (!strcasecmp(c->argv[1]->ptr, "flushslot") && (c->argc == 3 || c->argc == 4)) {
/* CLUSTER FLUSHSLOT <slot> [ASYNC|SYNC] */
clusterCommandFlushslot(c);
} else if (!strcasecmp(c->argv[1]->ptr, "migrateslots") && c->argc > 3) {
/* CLUSTER MIGRATESLOTS SLOTSRANGE <start slot> <end slot> [<start slot> <end slot> ...] NODE <node> [SLOTSRANGE ... NODE ...] */
clusterCommandMigrateSlots(c);
} else if (!strcasecmp(c->argv[1]->ptr, "getslotmigrations") && c->argc == 2) {
/* CLUSTER GETSLOTMIGRATIONS */
clusterCommandGetSlotMigrations(c);
} else if (!strcasecmp(c->argv[1]->ptr, "cancelslotmigrations") && c->argc == 2) {
/* CLUSTER CANCELSLOTMIGRATIONS */
clusterCommandCancelSlotMigrations(c);
} else if (!strcasecmp(c->argv[1]->ptr, "syncslots") && c->argc > 2) {
/* CLUSTER SYNCSLOTS <subcommand>*/
clusterCommandSyncSlots(c);
} else {
return 0;
}
@ -7717,6 +7789,12 @@ const char **clusterCommandExtendedHelp(void) {
"LINKS",
" Return information about all network links between this node and its peers.",
" Output format is an array where each array element is a map containing attributes of a link",
"MIGRATESLOTS SLOTSRANGE start-slot end-slot [start-slot end-slot ...] NODE node-id [SLOTSRANGE start-slot end-slot [start-slot end-slot ...] NODE node-id ...]",
" Migrate the specified slot ranges from this node to the specified node.",
"CANCELSLOTMIGRATIONS ALL",
" Cancel all migrations.",
"GETSLOTMIGRATIONS",
" Get information about ongoing and recently finished slot imports and exports.",
NULL};
return help;
@ -7763,6 +7841,8 @@ int clusterAllowFailoverCmd(client *c) {
void clusterPromoteSelfToPrimary(void) {
replicationUnsetPrimary();
/* Upon becoming primary, we need to ensure that data is deleted in unowned slots. */
verifyClusterConfigWithData();
}
int detectAndUpdateCachedNodeHealth(void) {
@ -7859,3 +7939,14 @@ int clusterDecodeOpenSlotsAuxField(int rdbflags, sds s) {
}
return C_OK;
}
/* Returns if any slot has been put in IMPORTING state via SETSLOT command. */
bool isAnySlotInManualImportingState(void) {
return dictSize(server.cluster->importing_slots_from) > 0;
}
/* Returns if any slot has been put in MIGRATING state via SETSLOT command. */
bool isAnySlotInManualMigratingState(void) {
return dictSize(server.cluster->migrating_slots_to) > 0;
}

src/cluster_legacy.h

@ -25,6 +25,7 @@
#define CLUSTER_TODO_FSYNC_CONFIG (1 << 3)
#define CLUSTER_TODO_HANDLE_MANUALFAILOVER (1 << 4)
#define CLUSTER_TODO_BROADCAST_ALL (1 << 5)
#define CLUSTER_TODO_HANDLE_SLOT_MIGRATION (1 << 6)
/* clusterLink encapsulates everything needed to talk with a remote node. */
typedef struct clusterLink {
@ -377,6 +378,11 @@ typedef struct slotStat {
uint64_t network_bytes_out;
} slotStat;
typedef struct slotRange {
int start_slot;
int end_slot;
} slotRange;
struct clusterState {
clusterNode *myself; /* This node */
uint64_t currentEpoch;
@ -389,6 +395,9 @@ struct clusterState {
dict *migrating_slots_to;
dict *importing_slots_from;
clusterNode *slots[CLUSTER_SLOTS];
list *slot_migration_jobs; /* List storing all slot migration jobs. Stored
* in order from most recent to least recently
* created. */
/* The following fields are used to take the replica state on elections. */
mstime_t failover_auth_time; /* Time of previous or next election. */
int failover_auth_count; /* Number of votes received so far. */

src/cluster_migrateslots.c (new file, 2125 lines; diff too large to display)

src/cluster_migrateslots.h (new file)

@ -0,0 +1,37 @@
#ifndef __CLUSTER_MIGRATESLOTS_H
#define __CLUSTER_MIGRATESLOTS_H
#include "server.h"
#include "cluster.h"
#include "cluster_legacy.h"
/* Forward declaration to allow use as an argument below */
typedef struct slotMigrationJob slotMigrationJob;
bool isImportSlotMigrationJob(slotMigrationJob *job);
void clusterHandleSlotMigrationClientClose(slotMigrationJob *job);
void clusterHandleSlotMigrationClientOOM(slotMigrationJob *job);
void clusterFeedSlotExportJobs(int dbid, robj **argv, int argc, int slot);
bool clusterIsSlotImporting(int slot);
bool clusterIsSlotExporting(int slot);
bool clusterIsAnySlotImporting(void);
bool clusterIsAnySlotExporting(void);
void clusterMarkImportingSlotsInDb(serverDb *db);
bool clusterSlotMigrationShouldInstallWriteHandler(client *c);
void initClusterSlotMigrationJobList(void);
void clusterSlotMigrationCron(void);
void clusterCommandMigrateSlots(client *c);
void clusterCommandSyncSlots(client *c);
void clusterCommandGetSlotMigrations(client *c);
void clusterCommandCancelSlotMigrations(client *c);
void clusterHandleSlotExportBackgroundSaveDone(int bgsaveerr);
void clusterUpdateSlotExportsOnOwnershipChange(void);
void clusterUpdateSlotImportsOnOwnershipChange(void);
void clusterCleanupSlotMigrationLog(void);
void clusterHandleFlushDuringSlotMigration(void);
size_t clusterGetTotalSlotExportBufferMemory(void);
bool clusterSlotFailoverGranted(int slot);
void clusterFailAllSlotExportsWithMessage(char *message);
void clusterHandleSlotMigrationErrorResponse(slotMigrationJob *job);
#endif /* __CLUSTER_MIGRATESLOTS_H */

src/commands.def

@ -394,6 +394,23 @@ const char *CLUSTER_BUMPEPOCH_Tips[] = {
#define CLUSTER_BUMPEPOCH_Keyspecs NULL
#endif
/********** CLUSTER CANCELSLOTMIGRATIONS ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
/* CLUSTER CANCELSLOTMIGRATIONS history */
#define CLUSTER_CANCELSLOTMIGRATIONS_History NULL
#endif
#ifndef SKIP_CMD_TIPS_TABLE
/* CLUSTER CANCELSLOTMIGRATIONS tips */
#define CLUSTER_CANCELSLOTMIGRATIONS_Tips NULL
#endif
#ifndef SKIP_CMD_KEY_SPECS_TABLE
/* CLUSTER CANCELSLOTMIGRATIONS key specs */
#define CLUSTER_CANCELSLOTMIGRATIONS_Keyspecs NULL
#endif
/********** CLUSTER COUNT_FAILURE_REPORTS ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
@ -611,6 +628,23 @@ struct COMMAND_ARG CLUSTER_GETKEYSINSLOT_Args[] = {
{MAKE_ARG("count",ARG_TYPE_INTEGER,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
};
/********** CLUSTER GETSLOTMIGRATIONS ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
/* CLUSTER GETSLOTMIGRATIONS history */
#define CLUSTER_GETSLOTMIGRATIONS_History NULL
#endif
#ifndef SKIP_CMD_TIPS_TABLE
/* CLUSTER GETSLOTMIGRATIONS tips */
#define CLUSTER_GETSLOTMIGRATIONS_Tips NULL
#endif
#ifndef SKIP_CMD_KEY_SPECS_TABLE
/* CLUSTER GETSLOTMIGRATIONS key specs */
#define CLUSTER_GETSLOTMIGRATIONS_Keyspecs NULL
#endif
/********** CLUSTER HELP ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
@ -714,6 +748,42 @@ struct COMMAND_ARG CLUSTER_MEET_Args[] = {
{MAKE_ARG("cluster-bus-port",ARG_TYPE_INTEGER,-1,NULL,NULL,"4.0.0",CMD_ARG_OPTIONAL,0,NULL)},
};
/********** CLUSTER MIGRATESLOTS ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
/* CLUSTER MIGRATESLOTS history */
#define CLUSTER_MIGRATESLOTS_History NULL
#endif
#ifndef SKIP_CMD_TIPS_TABLE
/* CLUSTER MIGRATESLOTS tips */
#define CLUSTER_MIGRATESLOTS_Tips NULL
#endif
#ifndef SKIP_CMD_KEY_SPECS_TABLE
/* CLUSTER MIGRATESLOTS key specs */
#define CLUSTER_MIGRATESLOTS_Keyspecs NULL
#endif
/* CLUSTER MIGRATESLOTS migration_group range argument table */
struct COMMAND_ARG CLUSTER_MIGRATESLOTS_migration_group_range_Subargs[] = {
{MAKE_ARG("start-slot",ARG_TYPE_INTEGER,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("end-slot",ARG_TYPE_INTEGER,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
};
/* CLUSTER MIGRATESLOTS migration_group argument table */
struct COMMAND_ARG CLUSTER_MIGRATESLOTS_migration_group_Subargs[] = {
{MAKE_ARG("slotsrange-token",ARG_TYPE_PURE_TOKEN,-1,"SLOTSRANGE",NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("range",ARG_TYPE_BLOCK,-1,NULL,NULL,NULL,CMD_ARG_MULTIPLE,2,NULL),.subargs=CLUSTER_MIGRATESLOTS_migration_group_range_Subargs},
{MAKE_ARG("node-token",ARG_TYPE_PURE_TOKEN,-1,"NODE",NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("node-id",ARG_TYPE_STRING,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
};
/* CLUSTER MIGRATESLOTS argument table */
struct COMMAND_ARG CLUSTER_MIGRATESLOTS_Args[] = {
{MAKE_ARG("migration-group",ARG_TYPE_BLOCK,-1,NULL,NULL,NULL,CMD_ARG_MULTIPLE,4,NULL),.subargs=CLUSTER_MIGRATESLOTS_migration_group_Subargs},
};
/********** CLUSTER MYID ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
@ -1045,11 +1115,29 @@ const char *CLUSTER_SLOTS_Tips[] = {
#define CLUSTER_SLOTS_Keyspecs NULL
#endif
/********** CLUSTER SYNCSLOTS ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
/* CLUSTER SYNCSLOTS history */
#define CLUSTER_SYNCSLOTS_History NULL
#endif
#ifndef SKIP_CMD_TIPS_TABLE
/* CLUSTER SYNCSLOTS tips */
#define CLUSTER_SYNCSLOTS_Tips NULL
#endif
#ifndef SKIP_CMD_KEY_SPECS_TABLE
/* CLUSTER SYNCSLOTS key specs */
#define CLUSTER_SYNCSLOTS_Keyspecs NULL
#endif
/* CLUSTER command table */
struct COMMAND_STRUCT CLUSTER_Subcommands[] = {
{MAKE_CMD("addslots","Assigns new hash slots to a node.","O(N) where N is the total number of hash slot arguments","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_ADDSLOTS_History,0,CLUSTER_ADDSLOTS_Tips,0,clusterCommand,-3,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_ADDSLOTS_Keyspecs,0,NULL,1),.args=CLUSTER_ADDSLOTS_Args},
{MAKE_CMD("addslotsrange","Assigns new hash slot ranges to a node.","O(N) where N is the total number of the slots between the start slot and end slot arguments.","7.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_ADDSLOTSRANGE_History,0,CLUSTER_ADDSLOTSRANGE_Tips,0,clusterCommand,-4,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_ADDSLOTSRANGE_Keyspecs,0,NULL,1),.args=CLUSTER_ADDSLOTSRANGE_Args},
{MAKE_CMD("bumpepoch","Advances the cluster config epoch.","O(1)","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_BUMPEPOCH_History,0,CLUSTER_BUMPEPOCH_Tips,1,clusterCommand,2,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_BUMPEPOCH_Keyspecs,0,NULL,0)},
{MAKE_CMD("cancelslotmigrations","Cancel all current ongoing slot migration operations.","O(N), where N is the number of slot migration operations being cancelled.","9.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_CANCELSLOTMIGRATIONS_History,0,CLUSTER_CANCELSLOTMIGRATIONS_Tips,0,clusterCommand,2,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_CANCELSLOTMIGRATIONS_Keyspecs,0,NULL,0)},
{MAKE_CMD("count-failure-reports","Returns the number of active failure reports active for a node.","O(N) where N is the number of failure reports","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_COUNT_FAILURE_REPORTS_History,0,CLUSTER_COUNT_FAILURE_REPORTS_Tips,1,clusterCommand,3,CMD_ADMIN|CMD_STALE,0,CLUSTER_COUNT_FAILURE_REPORTS_Keyspecs,0,NULL,1),.args=CLUSTER_COUNT_FAILURE_REPORTS_Args},
{MAKE_CMD("countkeysinslot","Returns the number of keys in a hash slot.","O(1)","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_COUNTKEYSINSLOT_History,0,CLUSTER_COUNTKEYSINSLOT_Tips,0,clusterCommand,3,CMD_STALE,0,CLUSTER_COUNTKEYSINSLOT_Keyspecs,0,NULL,1),.args=CLUSTER_COUNTKEYSINSLOT_Args},
{MAKE_CMD("delslots","Sets hash slots as unbound for a node.","O(N) where N is the total number of hash slot arguments","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_DELSLOTS_History,0,CLUSTER_DELSLOTS_Tips,0,clusterCommand,-3,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_DELSLOTS_Keyspecs,0,NULL,1),.args=CLUSTER_DELSLOTS_Args},
@ -1059,11 +1147,13 @@ struct COMMAND_STRUCT CLUSTER_Subcommands[] = {
{MAKE_CMD("flushslots","Deletes all slots information from a node.","O(1)","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_FLUSHSLOTS_History,0,CLUSTER_FLUSHSLOTS_Tips,0,clusterCommand,2,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_FLUSHSLOTS_Keyspecs,0,NULL,0)},
{MAKE_CMD("forget","Removes a node from the nodes table.","O(1)","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_FORGET_History,0,CLUSTER_FORGET_Tips,0,clusterCommand,3,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_FORGET_Keyspecs,0,NULL,1),.args=CLUSTER_FORGET_Args},
{MAKE_CMD("getkeysinslot","Returns the key names in a hash slot.","O(N) where N is the number of requested keys","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_GETKEYSINSLOT_History,0,CLUSTER_GETKEYSINSLOT_Tips,1,clusterCommand,4,CMD_STALE,0,CLUSTER_GETKEYSINSLOT_Keyspecs,0,NULL,2),.args=CLUSTER_GETKEYSINSLOT_Args},
{MAKE_CMD("getslotmigrations","Get the status of ongoing and recently finished slot import and export operations.","O(N), where N is the number of active slot import and export jobs.","9.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_GETSLOTMIGRATIONS_History,0,CLUSTER_GETSLOTMIGRATIONS_Tips,0,clusterCommand,2,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_GETSLOTMIGRATIONS_Keyspecs,0,NULL,0)},
{MAKE_CMD("help","Returns helpful text about the different subcommands.","O(1)","5.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_HELP_History,0,CLUSTER_HELP_Tips,0,clusterCommand,2,CMD_LOADING|CMD_STALE,0,CLUSTER_HELP_Keyspecs,0,NULL,0)},
{MAKE_CMD("info","Returns information about the state of a node.","O(1)","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_INFO_History,0,CLUSTER_INFO_Tips,1,clusterCommand,2,CMD_LOADING|CMD_STALE,0,CLUSTER_INFO_Keyspecs,0,NULL,0)},
{MAKE_CMD("keyslot","Returns the hash slot for a key.","O(N) where N is the number of bytes in the key","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_KEYSLOT_History,0,CLUSTER_KEYSLOT_Tips,0,clusterCommand,3,CMD_STALE,0,CLUSTER_KEYSLOT_Keyspecs,0,NULL,1),.args=CLUSTER_KEYSLOT_Args},
{MAKE_CMD("links","Returns a list of all TCP links to and from peer nodes.","O(N) where N is the total number of Cluster nodes","7.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_LINKS_History,0,CLUSTER_LINKS_Tips,1,clusterCommand,2,CMD_STALE,0,CLUSTER_LINKS_Keyspecs,0,NULL,0)},
{MAKE_CMD("meet","Forces a node to handshake with another node.","O(1)","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_MEET_History,1,CLUSTER_MEET_Tips,0,clusterCommand,-4,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_MEET_Keyspecs,0,NULL,3),.args=CLUSTER_MEET_Args},
{MAKE_CMD("migrateslots","Migrate the given slots from this node to the specified nodes.","O(N) where N is the total number of the slots between all start slot and end slot arguments.","9.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_MIGRATESLOTS_History,0,CLUSTER_MIGRATESLOTS_Tips,0,clusterCommand,-4,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_MIGRATESLOTS_Keyspecs,0,NULL,1),.args=CLUSTER_MIGRATESLOTS_Args},
{MAKE_CMD("myid","Returns the ID of a node.","O(1)","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_MYID_History,0,CLUSTER_MYID_Tips,0,clusterCommand,2,CMD_LOADING|CMD_STALE,0,CLUSTER_MYID_Keyspecs,0,NULL,0)},
{MAKE_CMD("myshardid","Returns the shard ID of a node.","O(1)","7.2.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_MYSHARDID_History,0,CLUSTER_MYSHARDID_Tips,1,clusterCommand,2,CMD_LOADING|CMD_STALE,0,CLUSTER_MYSHARDID_Keyspecs,0,NULL,0)},
{MAKE_CMD("nodes","Returns the cluster configuration for a node.","O(N) where N is the total number of Cluster nodes","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_NODES_History,0,CLUSTER_NODES_Tips,1,clusterCommand,2,CMD_LOADING|CMD_STALE,0,CLUSTER_NODES_Keyspecs,0,NULL,0)},
@ -1077,6 +1167,7 @@ struct COMMAND_STRUCT CLUSTER_Subcommands[] = {
{MAKE_CMD("slaves","Lists the replica nodes of a primary node.","O(N) where N is the number of replicas.","3.0.0",CMD_DOC_DEPRECATED,"`CLUSTER REPLICAS`","5.0.0","cluster",COMMAND_GROUP_CLUSTER,CLUSTER_SLAVES_History,0,CLUSTER_SLAVES_Tips,1,clusterCommand,3,CMD_ADMIN|CMD_STALE,0,CLUSTER_SLAVES_Keyspecs,0,NULL,1),.args=CLUSTER_SLAVES_Args},
{MAKE_CMD("slot-stats","Return an array of slot usage statistics for slots assigned to the current node.","O(N) where N is the total number of slots based on arguments. O(N*log(N)) with ORDERBY subcommand.","8.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_SLOT_STATS_History,0,CLUSTER_SLOT_STATS_Tips,2,clusterSlotStatsCommand,-4,CMD_STALE|CMD_LOADING,0,CLUSTER_SLOT_STATS_Keyspecs,0,NULL,1),.args=CLUSTER_SLOT_STATS_Args},
{MAKE_CMD("slots","Returns the mapping of cluster slots to nodes.","O(N) where N is the total number of Cluster nodes","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_SLOTS_History,2,CLUSTER_SLOTS_Tips,1,clusterCommand,2,CMD_LOADING|CMD_STALE,0,CLUSTER_SLOTS_Keyspecs,0,NULL,0)},
{MAKE_CMD("syncslots","A container for internal slot migration commands.","Depends on subcommand.","9.0.0",CMD_DOC_SYSCMD,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_SYNCSLOTS_History,0,CLUSTER_SYNCSLOTS_Tips,0,clusterCommand,-3,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_SYNCSLOTS_Keyspecs,0,NULL,0)},
{0}
};

src/commands/cluster-cancelslotmigrations.json (new file)

@ -0,0 +1,19 @@
{
"CANCELSLOTMIGRATIONS": {
"summary": "Cancel all current ongoing slot migration operations.",
"complexity": "O(N), where N is the number of slot migration operations being cancelled.",
"group": "cluster",
"since": "9.0.0",
"container": "CLUSTER",
"function": "clusterCommand",
"arity": 2,
"command_flags": [
"NO_ASYNC_LOADING",
"ADMIN",
"STALE"
],
"reply_schema": {
"const": "OK"
}
}
}

src/commands/cluster-getslotmigrations.json (new file)

@ -0,0 +1,67 @@
{
"GETSLOTMIGRATIONS": {
"summary": "Get the status of ongoing and recently finished slot import and export operations.",
"complexity": "O(N), where N is the number of active slot import and export jobs.",
"group": "cluster",
"since": "9.0.0",
"container": "CLUSTER",
"function": "clusterCommand",
"arity": 2,
"command_flags": [
"NO_ASYNC_LOADING",
"ADMIN",
"STALE"
],
"reply_schema": {
"description": "A nested list of maps, one for each migration, with keys and values representing migration fields.",
"type": "array",
"items": {
"type": "object",
"additionalProperties": false,
"properties": {
"link_name": {
"type": "string",
"description": "A 40-byte randomly generated hex string representing the migration.",
"pattern": "^[0-9a-fA-F]{40}$"
},
"operation": {
"oneOf": [
{
"const": "IMPORT"
},
{
"const": "EXPORT"
}
]
},
"slot_ranges": {
"description": "Slot ranges in the format <start>-<end> inclusive, each range separated by a space.",
"type": "string",
"pattern": "^([0-9]+-[0-9]+)( [0-9]+-[0-9]+)*$"
},
"node": {
"type": "string"
},
"create_time": {
"description": "Creation time, in seconds since the unix epoch.",
"type": "integer"
},
"last_update_time": {
"description": "Last update time, in seconds since the unix epoch.",
"type": "integer"
},
"last_ack_time": {
"description": "Last received ack time, in seconds since the unix epoch.",
"type": "integer"
},
"state": {
"type": "string"
},
"message": {
"type": "string"
}
}
}
}
}
}

src/commands/cluster-migrateslots.json (new file)

@ -0,0 +1,55 @@
{
"MIGRATESLOTS": {
"summary": "Migrate the given slots from this node to the specified nodes.",
"complexity": "O(N) where N is the total number of the slots between all start slot and end slot arguments.",
"group": "cluster",
"since": "9.0.0",
"arity": -4,
"container": "CLUSTER",
"function": "clusterCommand",
"command_flags": [
"NO_ASYNC_LOADING",
"ADMIN",
"STALE"
],
"arguments": [
{
"name": "migration-group",
"type": "block",
"multiple": true,
"arguments": [
{
"name": "slotsrange-token",
"type": "pure-token",
"token": "SLOTSRANGE",
"description": "One or more ranges of slots to be migrated."
},
{
"name": "range",
"type": "block",
"multiple": true,
"arguments": [
{ "name": "start-slot", "type": "integer" },
{ "name": "end-slot", "type": "integer" }
],
"description": "Start and end slot for single range"
},
{
"name": "node-token",
"type": "pure-token",
"token": "NODE"
},
{
"name": "node-id",
"type": "string",
"pattern": "^[0-9a-fA-F]{40}$",
"description": "40 character node name of the node to migrate to"
}
]
}
],
"reply_schema": {
"const": "OK"
}
}
}

src/commands/cluster-syncslots.json (new file)

@ -0,0 +1,19 @@
{
"SYNCSLOTS": {
"summary": "A container for internal slot migration commands.",
"complexity": "Depends on subcommand.",
"group": "cluster",
"since": "9.0.0",
"container": "CLUSTER",
"function": "clusterCommand",
"doc_flags": [
"SYSCMD"
],
"arity": -3,
"command_flags": [
"NO_ASYNC_LOADING",
"ADMIN",
"STALE"
]
}
}

src/commands/memory-stats.json

@ -38,6 +38,12 @@
"cluster.links": {
"type": "integer"
},
"cluster.slot_import": {
"type": "integer"
},
"cluster.slot_export": {
"type": "integer"
},
"aof.buffer": {
"type": "integer"
},

src/config.c

@ -34,6 +34,7 @@
#include "connection.h"
#include "bio.h"
#include "module.h"
#include "cluster_migrateslots.h"
#include <fcntl.h>
#include <sys/stat.h>
@ -3331,6 +3332,7 @@ standardConfig static_configs[] = {
createULongConfig("commandlog-large-reply-max-len", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.commandlog[COMMANDLOG_TYPE_LARGE_REPLY].max_len, 128, INTEGER_CONFIG, NULL, NULL),
createULongConfig("acllog-max-len", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.acllog_max_len, 128, INTEGER_CONFIG, NULL, NULL),
createULongConfig("cluster-blacklist-ttl", NULL, MODIFIABLE_CONFIG, 0, ULONG_MAX, server.cluster_blacklist_ttl, 60, INTEGER_CONFIG, NULL, NULL),
createULongConfig("cluster-slot-migration-log-max-len", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.cluster_slot_migration_log_max_len, 1000, INTEGER_CONFIG, NULL, NULL),
/* Long Long configs */
createLongLongConfig("busy-reply-threshold", "lua-time-limit", MODIFIABLE_CONFIG, 0, LONG_MAX, server.busy_reply_threshold, 5000, INTEGER_CONFIG, NULL, NULL), /* milliseconds */
@ -3363,6 +3365,7 @@ standardConfig static_configs[] = {
createSizeTConfig("tracking-table-max-keys", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.tracking_table_max_keys, 1000000, INTEGER_CONFIG, NULL, NULL), /* Default: 1 million keys max. */
createSizeTConfig("client-query-buffer-limit", NULL, DEBUG_CONFIG | MODIFIABLE_CONFIG, 1024 * 1024, LONG_MAX, server.client_max_querybuf_len, 1024 * 1024 * 1024, MEMORY_CONFIG, NULL, NULL), /* Default: 1GB max query buffer. */
createSSizeTConfig("maxmemory-clients", NULL, MODIFIABLE_CONFIG, -100, SSIZE_MAX, server.maxmemory_clients, 0, MEMORY_CONFIG | PERCENT_CONFIG, NULL, applyClientMaxMemoryUsage),
createSSizeTConfig("slot-migration-max-failover-repl-bytes", NULL, MODIFIABLE_CONFIG, -1, SSIZE_MAX, server.slot_migration_max_failover_repl_bytes, 0, MEMORY_CONFIG, NULL, NULL),
/* Other configs */
createTimeTConfig("repl-backlog-ttl", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.repl_backlog_time_limit, 60 * 60, INTEGER_CONFIG, NULL, NULL), /* Default: 1 hour */

src/db.c

@ -29,6 +29,7 @@
#include "server.h"
#include "cluster.h"
#include "cluster_migrateslots.h"
#include "latency.h"
#include "script.h"
#include "functions.h"
@ -46,9 +47,7 @@ static keyStatus expireIfNeeded(serverDb *db, robj *key, robj *val, int flags);
static int keyIsExpiredWithDictIndex(serverDb *db, robj *key, int dict_index);
static int objectIsExpired(robj *val);
static void dbSetValue(serverDb *db, robj *key, robj **valref, int overwrite, void **oldref);
static int getKVStoreIndexForKey(sds key);
static robj *dbFindWithDictIndex(serverDb *db, sds key, int dict_index);
static robj *dbFindExpiresWithDictIndex(serverDb *db, sds key, int dict_index);
/* Update LFU when an object is accessed.
* Firstly, decrement the counter if the decrement time is reached.
@ -242,7 +241,7 @@ void dbAdd(serverDb *db, robj *key, robj **valref) {
}
/* Returns which dict index should be used with kvstore for a given key. */
static int getKVStoreIndexForKey(sds key) {
int getKVStoreIndexForKey(sds key) {
return server.cluster_enabled ? getKeySlot(key) : 0;
}
@ -262,16 +261,16 @@ int getKeySlot(sds key) {
* so we must always recompute the slot for commands coming from the primary.
*/
if (server.current_client && server.current_client->slot >= 0 && server.current_client->flag.executing_command &&
!server.current_client->flag.primary) {
!isReplicatedClient(server.current_client)) {
debugServerAssertWithInfo(server.current_client, NULL,
(int)keyHashSlot(key, (int)sdslen(key)) == server.current_client->slot);
return server.current_client->slot;
}
int slot = keyHashSlot(key, (int)sdslen(key));
/* For the case of replicated commands from primary, getNodeByQuery() never gets called,
* and thus c->slot never gets populated. That said, if this command ends up accessing a key,
* we are able to backfill c->slot here, where the key's hash calculation is made. */
if (server.current_client && server.current_client->flag.primary) {
/* For the case of commands from clients we must obey, getNodeByQuery() never gets called,
* and thus c->slot never gets populated. That said, if this command ends up accessing
* a key, we are able to backfill c->slot here, where the key's hash calculation is made. */
if (server.current_client && mustObeyClient(server.current_client)) {
server.current_client->slot = slot;
}
return slot;
@ -447,6 +446,7 @@ robj *dbRandomKey(serverDb *db) {
while (1) {
void *entry;
int randomDictIndex = kvstoreGetFairRandomHashtableIndex(db->keys);
if (randomDictIndex == KVSTORE_INDEX_NOT_FOUND) return NULL;
if (!kvstoreHashtableFairRandomEntry(db->keys, randomDictIndex, &entry)) return NULL;
robj *valkey = entry;
sds key = objectGetKey(valkey);
@ -823,6 +823,13 @@ void flushdbCommand(client *c) {
int flags;
if (getFlushCommandFlags(c, &flags) == C_ERR) return;
if (clusterIsAnySlotImporting() || clusterIsAnySlotExporting()) {
/* In-progress migrations will be cancelled and should be retried by
* operators. */
clusterHandleFlushDuringSlotMigration();
}
/* flushdb should not flush the functions */
server.dirty += emptyData(c->db->id, flags | EMPTYDB_NOFUNCTIONS, NULL);
@ -846,6 +853,13 @@ void flushdbCommand(client *c) {
void flushallCommand(client *c) {
int flags;
if (getFlushCommandFlags(c, &flags) == C_ERR) return;
if (clusterIsAnySlotImporting() || clusterIsAnySlotExporting()) {
/* In-progress migrations will be cancelled and should be retried by
* operators. */
clusterHandleFlushDuringSlotMigration();
}
/* flushall should not flush the functions */
flushAllDataAndResetRDB(flags | EMPTYDB_NOFUNCTIONS);
@ -925,6 +939,11 @@ void keysCommand(client *c) {
allkeys = (pattern[0] == '*' && plen == 1);
if (server.cluster_enabled && !allkeys) {
pslot = patternHashSlot(pattern, plen);
if (pslot != -1 && clusterIsSlotImporting(pslot)) {
/* Short-circuit if the requested slot is being imported. */
setDeferredArrayLen(c, replylen, 0);
return;
}
}
kvstoreHashtableIterator *kvs_di = NULL;
kvstoreIterator *kvs_it = NULL;
@ -986,7 +1005,7 @@ int objectTypeCompare(robj *o, long long target) {
}
/* Hashtable scan callback used by scanCallback when scanning the keyspace. */
void keysScanCallback(void *privdata, void *entry) {
void keysScanCallback(void *privdata, void *entry, int didx) {
scanData *data = (scanData *)privdata;
robj *obj = entry;
data->sampled++;
@ -1009,7 +1028,7 @@ void keysScanCallback(void *privdata, void *entry) {
if (objectIsExpired(obj)) {
robj kobj;
initStaticStringObject(kobj, key);
if (expireIfNeeded(data->db, &kobj, obj, 0) != KEY_VALID) {
if (expireIfNeededWithDictIndex(data->db, &kobj, obj, 0, didx) != KEY_VALID) {
return;
}
}
@ -1898,7 +1917,7 @@ void deleteExpiredKeyAndPropagateWithDictIndex(serverDb *db, robj *keyobj, int d
latencyTraceIfNeeded(db, expire_del, expire_latency);
notifyKeyspaceEvent(NOTIFY_EXPIRED, "expired", keyobj, db->id);
signalModifiedKey(NULL, db, keyobj);
propagateDeletion(db, keyobj, server.lazyfree_lazy_expire);
propagateDeletion(db, keyobj, server.lazyfree_lazy_expire, dict_index);
server.stat_expiredkeys++;
}
@ -1941,7 +1960,7 @@ void deleteExpiredKeyFromOverwriteAndPropagate(client *c, robj *keyobj) {
* postExecutionUnitOperations, preferably just after a
* single deletion batch, so that DEL/UNLINK will NOT be wrapped
* in MULTI/EXEC */
void propagateDeletion(serverDb *db, robj *key, int lazy) {
void propagateDeletion(serverDb *db, robj *key, int lazy, int slot) {
robj *argv[2];
argv[0] = lazy ? shared.unlink : shared.del;
@ -1951,7 +1970,7 @@ void propagateDeletion(serverDb *db, robj *key, int lazy) {
* Even if module executed a command without asking for propagation. */
int prev_replication_allowed = server.replication_allowed;
server.replication_allowed = 1;
alsoPropagate(db->id, argv, 2, PROPAGATE_AOF | PROPAGATE_REPL);
alsoPropagate(db->id, argv, 2, PROPAGATE_AOF | PROPAGATE_REPL, slot);
server.replication_allowed = prev_replication_allowed;
}
@ -1962,7 +1981,7 @@ static const size_t EXPIRE_BULK_LIMIT = 1024; /* Maximum number of fields to act
* This function builds and propagates a single HDEL command with multiple fields
* for the given hash object `o`. It temporarily enables replication (if needed),
* constructs the command using the field names, and sends it via alsoPropagate(). */
static void propagateFieldsDeletion(serverDb *db, robj *o, size_t n_fields, robj *fields[]) {
static void propagateFieldsDeletion(serverDb *db, robj *o, size_t n_fields, robj *fields[], int didx) {
int prev_replication_allowed = server.replication_allowed;
server.replication_allowed = 1;
@ -1976,7 +1995,7 @@ static void propagateFieldsDeletion(serverDb *db, robj *o, size_t n_fields, robj
argv[argc++] = fields[i];
}
alsoPropagate(db->id, argv, argc, PROPAGATE_AOF | PROPAGATE_REPL);
alsoPropagate(db->id, argv, argc, PROPAGATE_AOF | PROPAGATE_REPL, didx);
server.replication_allowed = prev_replication_allowed;
for (int i = 0; i < argc; i++) {
decrRefCount(argv[i]);
@ -1993,7 +2012,7 @@ static void propagateFieldsDeletion(serverDb *db, robj *o, size_t n_fields, robj
*
* Batching avoids large stack allocations while allowing max_entries to be arbitrarily large.
* Returns the total number of expired fields removed. */
size_t dbReclaimExpiredFields(robj *o, serverDb *db, mstime_t now, unsigned long max_entries) {
size_t dbReclaimExpiredFields(robj *o, serverDb *db, mstime_t now, unsigned long max_entries, int didx) {
size_t total_expired = 0;
bool deleteKey = false;
@ -2016,11 +2035,11 @@ size_t dbReclaimExpiredFields(robj *o, serverDb *db, mstime_t now, unsigned long
robj *keyobj = createStringObjectFromSds(objectGetKey(o));
/* Note that even though it might have been more efficient to only propagate del in case the key has no more items left,
* we must keep consistency in order to allow the replica to report hdel notifications before del. */
propagateFieldsDeletion(db, o, expired, entries);
propagateFieldsDeletion(db, o, expired, entries, didx);
notifyKeyspaceEvent(NOTIFY_EXPIRED, "hexpired", keyobj, db->id);
if (deleteKey) {
dbDelete(db, keyobj);
propagateDeletion(db, keyobj, server.lazyfree_lazy_expire);
propagateDeletion(db, keyobj, server.lazyfree_lazy_expire, didx);
notifyKeyspaceEvent(NOTIFY_GENERIC, "del", keyobj, db->id);
} else {
if (!hashTypeHasVolatileFields(o)) dbUntrackKeyWithVolatileItems(db, o);
@ -2193,7 +2212,7 @@ robj *dbFind(serverDb *db, sds key) {
return dbFindWithDictIndex(db, key, dict_index);
}
static robj *dbFindExpiresWithDictIndex(serverDb *db, sds key, int dict_index) {
robj *dbFindExpiresWithDictIndex(serverDb *db, sds key, int dict_index) {
void *existing = NULL;
kvstoreHashtableFind(db->expires, dict_index, key, &existing);
return existing;
@ -2208,7 +2227,7 @@ unsigned long long dbSize(serverDb *db) {
return kvstoreSize(db->keys);
}
unsigned long long dbScan(serverDb *db, unsigned long long cursor, hashtableScanFunction scan_cb, void *privdata) {
unsigned long long dbScan(serverDb *db, unsigned long long cursor, kvstoreScanFunction scan_cb, void *privdata) {
return kvstoreScan(db->keys, cursor, -1, scan_cb, NULL, privdata);
}
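
The common thread in this file: kvstore scan callbacks now receive the hashtable index (`didx`, the slot in cluster mode), so slot-aware work such as expiry no longer needs to recompute the key's hash slot. A minimal sketch of the new callback shape (names are illustrative):

```c
/* Hypothetical kvstoreScanFunction: the extra didx argument is the slot the
 * entry lives in, letting slot-aware code (e.g. expireIfNeededWithDictIndex
 * above) avoid rehashing the key. */
static void exampleScanCallback(void *privdata, void *entry, int didx) {
    (void)privdata;
    robj *val = entry;           /* db->keys stores value objects */
    sds key = objectGetKey(val); /* the key is embedded in the object */
    serverLog(LL_DEBUG, "key %s lives in slot %d", key, didx);
}
```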

View File

@ -513,6 +513,10 @@ void debugCommand(client *c) {
"CLIENT-ENFORCE-REPLY-LIST <0|1>",
"When set to 1, it enforces the use of the client reply list directly",
" and avoids using the client's static buffer.",
"SLOTMIGRATION PREVENT-PAUSE <0|1>",
" When set to 1, slot migrations will be prevented from pausing on the source node.",
"SLOTMIGRATION PREVENT-FAILOVER <0|1>",
" When set to 1, slot migrations will be prevented from performing the slot-level failover on the target node.",
NULL};
addExtendedReplyHelp(c, help, clusterDebugCommandExtendedHelp());
} else if (!strcasecmp(c->argv[1]->ptr, "segfault")) {
@ -622,6 +626,16 @@ void debugCommand(client *c) {
} else if (!strcasecmp(c->argv[1]->ptr, "disable-cluster-reconnection") && c->argc == 3) {
server.debug_cluster_disable_reconnection = atoi(c->argv[2]->ptr);
addReply(c, shared.ok);
} else if (!strcasecmp(c->argv[1]->ptr, "slotmigration")) {
if (!strcasecmp(c->argv[2]->ptr, "prevent-pause")) {
server.debug_slot_migration_prevent_pause = atoi(c->argv[3]->ptr);
} else if (!strcasecmp(c->argv[2]->ptr, "prevent-failover")) {
server.debug_slot_migration_prevent_failover = atoi(c->argv[3]->ptr);
} else {
addReplySubcommandSyntaxError(c);
return;
}
addReply(c, shared.ok);
} else if (!strcasecmp(c->argv[1]->ptr, "object") && (c->argc == 3 || c->argc == 4)) {
robj *val;
char *strenc;
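
Both toggles are test hooks; presumably they are consulted at the corresponding state-machine transitions, roughly like the sketch below (hypothetical helpers; the real checks live in the slot migration state machine, which is not part of this hunk):

```c
/* Sketch: tests can hold a migration just before the source pauses, or just
 * before the target performs the slot-level failover. */
static int slotMigrationMayPause(void) {
    return !server.debug_slot_migration_prevent_pause;
}
static int slotMigrationMayFailover(void) {
    return !server.debug_slot_migration_prevent_failover;
}
```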

View File

@ -33,6 +33,7 @@
#include "server.h"
#include "bio.h"
#include "script.h"
#include "cluster_migrateslots.h"
#include <math.h>
/* ----------------------------------------------------------------------------
@ -146,6 +147,8 @@ int evictionPoolPopulate(serverDb *db, kvstore *samplekvs, struct evictionPoolEn
void *samples[server.maxmemory_samples];
int slot = kvstoreGetFairRandomHashtableIndex(samplekvs);
/* We may get KVSTORE_INDEX_NOT_FOUND if there are no keys. */
if (slot == KVSTORE_INDEX_NOT_FOUND) return 0;
count = kvstoreHashtableSampleEntries(samplekvs, slot, &samples[0], server.maxmemory_samples);
for (j = 0; j < count; j++) {
unsigned long long idle;
@ -350,6 +353,11 @@ size_t freeMemoryGetNotCountedMemory(void) {
if (server.aof_state != AOF_OFF) {
overhead += sdsAllocSize(server.aof_buf);
}
if (clusterIsAnySlotExporting()) {
overhead += clusterGetTotalSlotExportBufferMemory();
}
return overhead;
}
@ -555,6 +563,7 @@ int performEvictions(void) {
static unsigned int next_db = 0;
sds bestkey = NULL;
int bestdbid;
int bestslot;
serverDb *db;
robj *valkey;
@ -607,6 +616,7 @@ int performEvictions(void) {
kvs = server.db[bestdbid]->expires;
}
void *entry = NULL;
bool found = kvstoreHashtableFind(kvs, pool[k].slot, pool[k].key, &entry);
/* Remove the entry from the pool. */
@ -619,6 +629,7 @@ int performEvictions(void) {
if (found) {
valkey = entry;
bestkey = objectGetKey(valkey);
bestslot = pool[k].slot;
break;
} else {
/* Ghost... Iterate again. */
@ -644,10 +655,12 @@ int performEvictions(void) {
kvs = db->expires;
}
int slot = kvstoreGetFairRandomHashtableIndex(kvs);
if (slot == KVSTORE_INDEX_NOT_FOUND) continue; /* No keys in this DB. */
void *entry;
if (kvstoreHashtableRandomEntry(kvs, slot, &entry)) {
bestkey = objectGetKey((robj *)entry);
bestdbid = j;
bestslot = slot;
break;
}
}
@ -679,7 +692,7 @@ int performEvictions(void) {
server.stat_evictedkeys++;
signalModifiedKey(NULL, db, keyobj);
notifyKeyspaceEvent(NOTIFY_EVICTED, "evicted", keyobj, db->id);
propagateDeletion(db, keyobj, server.lazyfree_lazy_eviction);
propagateDeletion(db, keyobj, server.lazyfree_lazy_eviction, bestslot);
exitExecutionUnit();
postExecutionUnitOperations();
decrRefCount(keyobj);

View File

@ -36,6 +36,8 @@
*/
#include "server.h"
#include "cluster.h"
#include "cluster_migrateslots.h"
/*-----------------------------------------------------------------------------
* Incremental collection of expired keys.
@ -60,14 +62,14 @@ static double avg_ttl_factor[16] = {0.98, 0.9604, 0.941192, 0.922368, 0.903921,
*
* The parameter 'now' is the current time in milliseconds as is passed
* to the function to avoid too many gettimeofday() syscalls. */
int activeExpireCycleTryExpire(serverDb *db, robj *val, long long now) {
int activeExpireCycleTryExpire(serverDb *db, robj *val, long long now, int didx) {
long long t = objectGetExpire(val);
serverAssert(t >= 0);
if (now > t) {
enterExecutionUnit(1, 0);
sds key = objectGetKey(val);
robj *keyobj = createStringObject(key, sdslen(key));
deleteExpiredKeyAndPropagate(db, keyobj);
deleteExpiredKeyAndPropagateWithDictIndex(db, keyobj, didx);
decrRefCount(keyobj);
exitExecutionUnit();
return 1;
@ -140,11 +142,11 @@ typedef struct activeExpireFieldIterator {
unsigned long cursor; /* Cursor for keys with volatile items (field-level TTL) */
} activeExpireFieldIterator;
void expireScanCallback(void *privdata, void *entry) {
void expireScanCallback(void *privdata, void *entry, int didx) {
robj *val = entry;
expireScanData *data = privdata;
long long ttl = objectGetExpire(val) - data->now;
if (activeExpireCycleTryExpire(data->db, val, data->now)) {
if (activeExpireCycleTryExpire(data->db, val, data->now, didx)) {
data->expired++;
/* Propagate the DEL command */
postExecutionUnitOperations();
@ -159,13 +161,13 @@ void expireScanCallback(void *privdata, void *entry) {
/* Expires up to `max_entries` fields from a hash with volatile fields.
* Sets `has_more_expired_entries` if more remain. Updates stats. */
void fieldExpireScanCallback(void *privdata, void *volaKey) {
void fieldExpireScanCallback(void *privdata, void *volaKey, int didx) {
expireScanData *data = privdata;
robj *o = volaKey;
serverAssert(o);
serverAssert(hashTypeHasVolatileFields(o));
mstime_t now = server.mstime;
size_t expired_fields = dbReclaimExpiredFields(o, data->db, now, data->max_entries);
size_t expired_fields = dbReclaimExpiredFields(o, data->db, now, data->max_entries, didx);
if (expired_fields) {
data->has_more_expired_entries = (expired_fields == data->max_entries);
data->expired++;
@ -265,7 +267,7 @@ static long long activeExpireCycleJob(enum activeExpiryType jobType, int cycleTy
int db_done = 0; /* The scan of the current DB is done? */
int update_avg_ttl_times = 0, repeat = 0;
hashtableScanFunction scan_cb;
kvstoreScanFunction scan_cb;
kvstore *kvs = NULL;
if (db) {
@ -548,10 +550,11 @@ void expireReplicaKeys(void) {
while (dbids && dbid < server.dbnum) {
if ((dbids & 1) != 0) {
serverDb *db = server.db[dbid];
robj *expire = db == NULL ? NULL : dbFindExpires(db, keyname);
int didx = getKVStoreIndexForKey(keyname);
robj *expire = db == NULL ? NULL : dbFindExpiresWithDictIndex(db, keyname, didx);
int expired = 0;
if (expire && activeExpireCycleTryExpire(db, expire, start)) {
if (expire && activeExpireCycleTryExpire(db, expire, start, didx)) {
expired = 1;
/* Propagate the DEL (writable replicas do not propagate anything to other replicas,
* but they might propagate to AOF) and trigger module hooks. */
@ -649,7 +652,11 @@ int checkAlreadyExpired(long long when) {
* (possibly in the past) and wait for an explicit DEL from the primary.
*
* If the server is a primary and in the import mode, we also add the already
* expired key and wait for an explicit DEL from the import source. */
* expired key and wait for an explicit DEL from the import source.
*
* If the server is receiving the key from a slot migration, we will accept
* expired keys and wait for the source to propagate deletion. */
if (server.current_client && server.current_client->slot_migration_job) return 0;
return (when <= commandTimeSnapshot() && !server.loading && !server.primary_host && !server.import_mode);
}
@ -966,6 +973,9 @@ expirationPolicy getExpirationPolicyWithFlags(int flags) {
if (server.primary_host != NULL) {
if (server.current_client && (server.current_client->flag.primary)) return POLICY_IGNORE_EXPIRE;
if (!(flags & EXPIRE_FORCE_DELETE_EXPIRED)) return POLICY_KEEP_EXPIRED;
} else if (server.current_client && server.current_client->slot_migration_job) {
/* Slot migration client should be treated like a primary */
return POLICY_IGNORE_EXPIRE;
} else if (server.import_mode) {
/* If we are running in the import mode on a primary, instead of
* evicting the expired key from the database, we return ASAP:

View File

@ -101,6 +101,7 @@ typedef void (*hashtableScanFunction)(void *privdata, void *entry);
#define HASHTABLE_ITER_SAFE (1 << 0)
#define HASHTABLE_ITER_PREFETCH_VALUES (1 << 1)
#define HASHTABLE_ITER_SKIP_VALIDATION (1 << 2)
#define HASHTABLE_ITER_INCLUDE_IMPORTING (1 << 3)
/* --- Prototypes --- */

View File

@ -388,7 +388,7 @@ int trySendReadToIOThreads(client *c) {
c->cur_tid = tid;
c->read_flags = canParseCommand(c) ? 0 : READ_FLAGS_DONT_PARSE;
c->read_flags |= authRequired(c) ? READ_FLAGS_AUTH_REQUIRED : 0;
c->read_flags |= c->flag.primary ? READ_FLAGS_PRIMARY : 0;
c->read_flags |= isReplicatedClient(c) ? READ_FLAGS_REPLICATED : 0;
c->io_read_state = CLIENT_PENDING_IO;
connSetPostponeUpdateState(c->conn, 1);
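
`isReplicatedClient()` itself is not shown in this diff; judging by its call sites, it presumably reduces to something like the sketch below: a client whose input is a replication stream that must be applied verbatim, i.e. our primary or an inbound slot-import connection.

```c
/* Sketch of the assumed semantics (the real definition lives elsewhere in
 * this PR): */
static inline int isReplicatedClientSketch(client *c) {
    return c->flag.primary ||
           (c->slot_migration_job && isImportSlotMigrationJob(c->slot_migration_job));
}
```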

View File

@ -45,6 +45,7 @@
#include "zmalloc.h"
#include "kvstore.h"
#include "serverassert.h"
#include "dict.h"
#include "monotonic.h"
#define UNUSED(V) ((void)V)
@ -67,6 +68,8 @@ struct _kvstore {
* given hashtable-index. */
size_t overhead_hashtable_lut; /* Overhead of all hashtables in bytes. */
size_t overhead_hashtable_rehashing; /* Overhead of hash tables rehashing in bytes. */
hashtable *importing; /* The set of hashtable indexes that are being imported */
unsigned long long importing_key_count; /* Total number of importing keys in this kvstore. */
};
/* Structure for kvstore iterator that allows iterating across multiple hashtables. */
@ -75,6 +78,8 @@ struct _kvstoreIterator {
long long didx;
long long next_didx;
hashtableIterator di;
uint8_t flags;
hashtableIterator *importing_iter;
};
/* Structure for kvstore hashtable iterator that allows iterating the corresponding hashtable. */
@ -90,6 +95,8 @@ typedef struct {
kvstore *kvs;
} kvstoreHashtableMetadata;
hashtableType intHashtableType = {.instant_rehashing = 1};
/**********************************/
/*** Helpers **********************/
/**********************************/
@ -126,8 +133,8 @@ static unsigned long long cumulativeKeyCountRead(kvstore *kvs, int didx) {
static void addHashtableIndexToCursor(kvstore *kvs, int didx, unsigned long long *cursor) {
if (kvs->num_hashtables == 1) return;
/* didx can be -1 when iteration is over and there are no more hashtables to visit. */
if (didx < 0) return;
/* didx can be KVSTORE_INDEX_NOT_FOUND when iteration is over and there are no more hashtables to visit. */
if (didx == KVSTORE_INDEX_NOT_FOUND) return;
*cursor = (*cursor << kvs->num_hashtables_bits) | didx;
}
@ -138,10 +145,22 @@ static int getAndClearHashtableIndexFromCursor(kvstore *kvs, unsigned long long
return didx;
}
int kvstoreIsImporting(kvstore *kvs, int didx) {
assert(didx < kvs->num_hashtables);
return hashtableFind(kvs->importing, (void *)(intptr_t)didx, NULL);
}
/* Updates binary index tree (also known as Fenwick tree), increasing key count for a given hashtable.
* You can read more about this data structure here https://en.wikipedia.org/wiki/Fenwick_tree
* Time complexity is O(log(kvs->num_hashtables)). */
static void cumulativeKeyCountAdd(kvstore *kvs, int didx, long delta) {
/* Fast return for importing hashtables, whose key counts are accumulated
* into the kvstore metrics once importing completes. */
if (kvstoreIsImporting(kvs, didx)) {
kvs->importing_key_count += delta;
return;
}
kvs->key_count += delta;
hashtable *ht = kvstoreGetHashtable(kvs, didx);
@ -282,6 +301,7 @@ kvstore *kvstoreCreate(hashtableType *type, int num_hashtables_bits, int flags)
kvs->num_hashtables_bits = num_hashtables_bits;
kvs->num_hashtables = 1 << kvs->num_hashtables_bits;
kvs->hashtables = zcalloc(sizeof(hashtable *) * kvs->num_hashtables);
kvs->importing = hashtableCreate(&intHashtableType);
kvs->rehashing = listCreate();
kvs->hashtable_size_index = kvs->num_hashtables > 1 ? zcalloc(sizeof(unsigned long long) * (kvs->num_hashtables + 1)) : NULL;
if (!(kvs->flags & KVSTORE_ALLOCATE_HASHTABLES_ON_DEMAND)) {
@ -301,6 +321,9 @@ void kvstoreEmpty(kvstore *kvs, void(callback)(hashtable *)) {
freeHashtableIfNeeded(kvs, didx);
}
hashtableEmpty(kvs->importing, NULL);
kvs->importing_key_count = 0;
listEmpty(kvs->rehashing);
kvs->key_count = 0;
@ -321,6 +344,7 @@ void kvstoreRelease(kvstore *kvs) {
}
assert(kvs->overhead_hashtable_lut == 0);
zfree(kvs->hashtables);
hashtableRelease(kvs->importing);
listRelease(kvs->rehashing);
if (kvs->hashtable_size_index) zfree(kvs->hashtable_size_index);
@ -336,6 +360,10 @@ unsigned long long int kvstoreSize(kvstore *kvs) {
}
}
unsigned long long int kvstoreImportingSize(kvstore *kvs) {
return kvs->importing_key_count;
}
/* This method provides the cumulative sum of all the hash table buckets
* across hash tables in a database. */
unsigned long kvstoreBuckets(kvstore *kvs) {
@ -358,6 +386,17 @@ size_t kvstoreMemUsage(kvstore *kvs) {
return mem;
}
typedef struct kvstoreScanCallbackData {
kvstoreScanFunction scan_cb;
void *privdata;
int didx;
} kvstoreScanCallbackData;
void hashtableScanToKvstoreScanCallback(void *privdata, void *entry) {
kvstoreScanCallbackData *cb_data = privdata;
cb_data->scan_cb(cb_data->privdata, entry, cb_data->didx);
}
/*
* This method is used to iterate over the elements of the entire kvstore specifically across hashtables.
* It's a three pronged approach.
@ -374,7 +413,7 @@ size_t kvstoreMemUsage(kvstore *kvs) {
unsigned long long kvstoreScan(kvstore *kvs,
unsigned long long cursor,
int onlydidx,
hashtableScanFunction scan_cb,
kvstoreScanFunction scan_cb,
kvstoreScanShouldSkipHashtable *skip_cb,
void *privdata) {
unsigned long long next_cursor = 0;
@ -396,10 +435,11 @@ unsigned long long kvstoreScan(kvstore *kvs,
}
hashtable *ht = kvstoreGetHashtable(kvs, didx);
kvstoreScanCallbackData cb_data = {.scan_cb = scan_cb, .privdata = privdata, .didx = didx};
int skip = !ht || (skip_cb && skip_cb(ht));
int skip = !ht || (skip_cb && skip_cb(ht)) || kvstoreIsImporting(kvs, didx);
if (!skip) {
next_cursor = hashtableScan(ht, cursor, scan_cb, privdata);
next_cursor = hashtableScan(ht, cursor, hashtableScanToKvstoreScanCallback, &cb_data);
/* In hashtableScan, scan_cb may delete entries (e.g., in active expire case). */
freeHashtableIfNeeded(kvs, didx);
}
@ -408,7 +448,7 @@ unsigned long long kvstoreScan(kvstore *kvs,
if (onlydidx >= 0) return 0;
didx = kvstoreGetNextNonEmptyHashtableIndex(kvs, didx);
}
if (didx == -1) {
if (didx == KVSTORE_INDEX_NOT_FOUND) {
return 0;
}
addHashtableIndexToCursor(kvs, didx, &next_cursor);
@ -417,7 +457,7 @@ unsigned long long kvstoreScan(kvstore *kvs,
/*
* This functions increases size of kvstore to match desired number.
* It resizes all individual hash tables, unless skip_cb indicates otherwise.
* It resizes all individual hash tables, unless the predicate indicates otherwise.
*
* Based on the parameter `try_expand`, appropriate hashtable expand API is invoked.
* if try_expand is set to 1, `hashtableTryExpand` is used else `hashtableExpand`.
@ -445,7 +485,10 @@ bool kvstoreExpand(kvstore *kvs, uint64_t newsize, int try_expand, kvstoreExpand
* returned is proportional to the number of elements that hash table holds.
* This function guarantees that it returns a hashtable-index of a non-empty
* hashtable, unless the entire kvstore is empty. Time complexity of this
* function is O(log(kvs->num_hashtables)). */
* function is O(log(kvs->num_hashtables)).
*
* Note that importing hashtables are excluded from random hashtable lookups. If
* there is no viable hashtable, KVSTORE_INDEX_NOT_FOUND is returned. */
int kvstoreGetFairRandomHashtableIndex(kvstore *kvs) {
unsigned long target = kvstoreSize(kvs) ? (random() % kvstoreSize(kvs)) + 1 : 0;
return kvstoreFindHashtableIndexByKeyIndex(kvs, target);
@ -509,6 +552,8 @@ void kvstoreGetStats(kvstore *kvs, char *buf, size_t bufsize, int full) {
*
* The return value is 0 based hashtable-index, and the range of the target is [1..kvstoreSize], kvstoreSize inclusive.
*
* If the target is 0, or the kvstore is empty, returns KVSTORE_INDEX_NOT_FOUND, indicating no such hashtable.
*
* To find the hashtable, we start with the root node of the binary index tree and search through its children
* from the highest index (2^num_hashtables_bits in our case) to the lowest index. At each node, we check if the target
* value is greater than the node's value. If it is, we remove the node's value from the target and recursively
@ -516,7 +561,8 @@ void kvstoreGetStats(kvstore *kvs, char *buf, size_t bufsize, int full) {
* Time complexity of this function is O(log(kvs->num_hashtables))
*/
int kvstoreFindHashtableIndexByKeyIndex(kvstore *kvs, unsigned long target) {
if (kvs->num_hashtables == 1 || kvstoreSize(kvs) == 0) return 0;
if (kvs->num_hashtables == 1) return 0;
if (kvstoreSize(kvs) == 0 || target == 0) return KVSTORE_INDEX_NOT_FOUND;
assert(target <= kvstoreSize(kvs));
int result = 0, bit_mask = 1 << kvs->num_hashtables_bits;
@ -544,14 +590,14 @@ int kvstoreGetFirstNonEmptyHashtableIndex(kvstore *kvs) {
return kvstoreFindHashtableIndexByKeyIndex(kvs, 1);
}
/* Returns next non-empty hashtable index strictly after given one, or -1 if provided didx is the last one. */
/* Returns next non-empty hashtable index strictly after given one, or KVSTORE_INDEX_NOT_FOUND if provided didx is the last one. */
int kvstoreGetNextNonEmptyHashtableIndex(kvstore *kvs, int didx) {
if (kvs->num_hashtables == 1) {
assert(didx == 0);
return -1;
return KVSTORE_INDEX_NOT_FOUND;
}
unsigned long long next_key = cumulativeKeyCountRead(kvs, didx) + 1;
return next_key <= kvstoreSize(kvs) ? kvstoreFindHashtableIndexByKeyIndex(kvs, next_key) : -1;
return next_key <= kvstoreSize(kvs) ? kvstoreFindHashtableIndexByKeyIndex(kvs, next_key) : KVSTORE_INDEX_NOT_FOUND;
}
int kvstoreNumNonEmptyHashtables(kvstore *kvs) {
@ -572,8 +618,10 @@ int kvstoreNumHashtables(kvstore *kvs) {
kvstoreIterator *kvstoreIteratorInit(kvstore *kvs, uint8_t flags) {
kvstoreIterator *kvs_it = zmalloc(sizeof(*kvs_it));
kvs_it->kvs = kvs;
kvs_it->didx = -1;
kvs_it->didx = KVSTORE_INDEX_NOT_FOUND;
kvs_it->next_didx = kvstoreGetFirstNonEmptyHashtableIndex(kvs_it->kvs); /* Finds first non-empty hashtable index. */
kvs_it->flags = flags;
kvs_it->importing_iter = NULL;
hashtableInitIterator(&kvs_it->di, NULL, flags);
return kvs_it;
}
@ -583,16 +631,46 @@ void kvstoreIteratorRelease(kvstoreIterator *kvs_it) {
hashtableIterator *iter = &kvs_it->di;
hashtableResetIterator(iter);
/* In the safe iterator context, we may delete entries. */
freeHashtableIfNeeded(kvs_it->kvs, kvs_it->didx);
if (kvs_it->didx != KVSTORE_INDEX_NOT_FOUND) {
freeHashtableIfNeeded(kvs_it->kvs, kvs_it->didx);
}
if (kvs_it->importing_iter) {
hashtableReleaseIterator(kvs_it->importing_iter);
}
zfree(kvs_it);
}
static int kvstoreIteratorNextImportingHashtableIndex(kvstoreIterator *kvs_it) {
if (kvs_it->importing_iter == NULL) {
kvs_it->importing_iter = hashtableCreateIterator(kvs_it->kvs->importing, 0);
}
intptr_t didx;
while (hashtableNext(kvs_it->importing_iter, (void **)&didx)) {
if (kvstoreHashtableSize(kvs_it->kvs, didx)) {
return didx;
}
}
return KVSTORE_INDEX_NOT_FOUND;
}
/* Returns next hash table from the iterator, or NULL if iteration is complete. */
static hashtable *kvstoreIteratorNextHashtable(kvstoreIterator *kvs_it) {
if (kvs_it->next_didx == -1) return NULL;
int next_hashtable_index = kvs_it->next_didx;
/* Since importing hashtables are removed from the binary index tree,
* we will not iterate over them during normal iteration. However, if the
* iterator requested iteration over importing keys, we do those after we
* have exhausted all other hashtables. */
if (next_hashtable_index == KVSTORE_INDEX_NOT_FOUND && kvs_it->flags & HASHTABLE_ITER_INCLUDE_IMPORTING) {
next_hashtable_index = kvstoreIteratorNextImportingHashtableIndex(kvs_it);
}
if (next_hashtable_index == KVSTORE_INDEX_NOT_FOUND) {
return NULL;
}
/* The hashtable may be deleted during the iteration process, so here need to check for NULL. */
if (kvs_it->didx != -1 && kvstoreGetHashtable(kvs_it->kvs, kvs_it->didx)) {
if (kvs_it->didx != KVSTORE_INDEX_NOT_FOUND && kvstoreGetHashtable(kvs_it->kvs, kvs_it->didx)) {
/* Before we move to the next hashtable, reset the iter of the previous hashtable. */
hashtableIterator *iter = &kvs_it->di;
hashtableResetIterator(iter);
@ -600,8 +678,10 @@ static hashtable *kvstoreIteratorNextHashtable(kvstoreIterator *kvs_it) {
freeHashtableIfNeeded(kvs_it->kvs, kvs_it->didx);
}
kvs_it->didx = kvs_it->next_didx;
kvs_it->next_didx = kvstoreGetNextNonEmptyHashtableIndex(kvs_it->kvs, kvs_it->didx);
kvs_it->didx = next_hashtable_index;
if (kvs_it->next_didx != KVSTORE_INDEX_NOT_FOUND) {
kvs_it->next_didx = kvstoreGetNextNonEmptyHashtableIndex(kvs_it->kvs, kvs_it->didx);
}
return kvs_it->kvs->hashtables[kvs_it->didx];
}
@ -612,7 +692,7 @@ int kvstoreIteratorGetCurrentHashtableIndex(kvstoreIterator *kvs_it) {
/* Fetches the next element and returns true. Returns false if there are no more elements. */
bool kvstoreIteratorNext(kvstoreIterator *kvs_it, void **next) {
if (kvs_it->didx != -1 && hashtableNext(&kvs_it->di, next)) {
if (kvs_it->didx != KVSTORE_INDEX_NOT_FOUND && hashtableNext(&kvs_it->di, next)) {
return true;
} else {
/* No current hashtable or reached the end of the hash table. */
@ -842,3 +922,26 @@ bool kvstoreHashtableDelete(kvstore *kvs, int didx, const void *key) {
}
return ret;
}
/* kvstoreSetIsImporting sets a hashtable as importing. Importing hashtables
* are not included in hashtable metrics and are excluded from scanning and
* random key lookup. */
void kvstoreSetIsImporting(kvstore *kvs, int didx, int is_importing) {
assert(didx < kvs->num_hashtables);
hashtable *ht = kvstoreGetHashtable(kvs, didx);
if (is_importing) {
/* Importing should only be marked on empty hashtables */
assert(!ht || hashtableSize(ht) == 0);
hashtableAdd(kvs->importing, (void *)(intptr_t)didx);
return;
}
hashtableDelete(kvs->importing, (void *)(intptr_t)didx);
/* Once we mark a hashtable as not importing, we need to begin tracking it
* in the kvstore metadata. */
if (ht && hashtableSize(ht) != 0) {
cumulativeKeyCountAdd(kvs, didx, hashtableSize(ht));
}
}
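
Put together, the kvstore changes give slot import a simple lifecycle; a minimal usage sketch (slot number and flow are illustrative):

```c
/* While marked importing, slot 42 is excluded from kvstoreSize(), scans and
 * random lookups, and its keys are counted in importing_key_count instead.
 * Clearing the flag folds the slot's key count back into the kvstore
 * metrics, making the slot visible in one step. */
kvstoreSetIsImporting(db->keys, 42, 1);
kvstoreSetIsImporting(db->expires, 42, 1);
/* ... replay the AOF-formatted snapshot and the change stream for slot 42 ... */
kvstoreSetIsImporting(db->keys, 42, 0);
kvstoreSetIsImporting(db->expires, 42, 0);
```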

View File

@ -12,19 +12,24 @@ typedef struct _kvstoreHashtableIterator kvstoreHashtableIterator;
typedef int(kvstoreScanShouldSkipHashtable)(hashtable *d);
/* Return 1 if we should skip the hashtable in expand. */
typedef int(kvstoreExpandShouldSkipHashtableIndex)(int didx);
typedef void (*kvstoreScanFunction)(void *privdata, void *entry, int didx);
#define KVSTORE_ALLOCATE_HASHTABLES_ON_DEMAND (1 << 0)
#define KVSTORE_FREE_EMPTY_HASHTABLES (1 << 1)
#define KVSTORE_INDEX_NOT_FOUND (-1)
kvstore *kvstoreCreate(hashtableType *type, int num_hashtables_bits, int flags);
void kvstoreEmpty(kvstore *kvs, void(callback)(hashtable *));
void kvstoreRelease(kvstore *kvs);
unsigned long long kvstoreSize(kvstore *kvs);
unsigned long long kvstoreImportingSize(kvstore *kvs);
unsigned long kvstoreBuckets(kvstore *kvs);
size_t kvstoreMemUsage(kvstore *kvs);
unsigned long long kvstoreScan(kvstore *kvs,
unsigned long long cursor,
int onlydidx,
hashtableScanFunction scan_cb,
kvstoreScanFunction scan_cb,
kvstoreScanShouldSkipHashtable *skip_cb,
void *privdata);
bool kvstoreExpand(kvstore *kvs, uint64_t newsize, int try_expand, kvstoreExpandShouldSkipHashtableIndex *skip_cb);
@ -66,6 +71,7 @@ bool kvstoreHashtableRandomEntry(kvstore *kvs, int didx, void **found);
bool kvstoreHashtableFairRandomEntry(kvstore *kvs, int didx, void **found);
unsigned int kvstoreHashtableSampleEntries(kvstore *kvs, int didx, void **dst, unsigned int count);
bool kvstoreHashtableExpand(kvstore *kvs, int didx, unsigned long size);
void kvstoreSetIsImporting(kvstore *kvs, int didx, int is_importing);
unsigned long kvstoreHashtableScanDefrag(kvstore *kvs,
int didx,
unsigned long v,
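
Since plain iteration now skips importing slots, code that must see everything (as `rdbSaveDb` does for persistence) opts in explicitly. A short sketch:

```c
/* Iterate every entry including keys in importing slots; without
 * HASHTABLE_ITER_INCLUDE_IMPORTING those slots are silently skipped. */
kvstoreIterator *it = kvstoreIteratorInit(db->keys,
                                          HASHTABLE_ITER_SAFE | HASHTABLE_ITER_INCLUDE_IMPORTING);
void *entry;
while (kvstoreIteratorNext(it, &entry)) {
    robj *o = entry;
    (void)o; /* ... visit the entry ... */
}
kvstoreIteratorRelease(it);
```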

View File

@ -69,6 +69,7 @@
#include "module.h"
#include "io_threads.h"
#include "scripting_engine.h"
#include "cluster_migrateslots.h"
#include <dlfcn.h>
#include <sys/stat.h>
#include <sys/wait.h>
@ -3644,14 +3645,23 @@ int VM_ReplyWithLongDouble(ValkeyModuleCtx *ctx, long double ld) {
* The command returns VALKEYMODULE_ERR if the format specifiers are invalid
* or the command name does not belong to a known command. */
int VM_Replicate(ValkeyModuleCtx *ctx, const char *cmdname, const char *fmt, ...) {
struct serverCommand *cmd;
struct serverCommand *cmd = NULL;
robj **argv = NULL;
int argc = 0, flags = 0, j;
va_list ap;
int slot = -1;
if (!ctx->module || !(ctx->module->options & VALKEYMODULE_OPTIONS_SKIP_COMMAND_VALIDATION)) {
bool skip_validation = ctx->module &&
(ctx->module->options & VALKEYMODULE_OPTIONS_SKIP_COMMAND_VALIDATION);
bool slot_export_in_progress = clusterIsAnySlotExporting();
if (!skip_validation || slot_export_in_progress) {
cmd = lookupCommandByCString((char *)cmdname);
if (!cmd) return VALKEYMODULE_ERR;
if (!cmd) {
if (!skip_validation) return VALKEYMODULE_ERR;
/* For modules that skip validation, rather than making their replicate
* calls fail only while a slot migration is active, we fail the
* migration instead. */
clusterFailAllSlotExportsWithMessage("A module replicated an unknown command");
}
}
/* Create the client and dispatch the command. */
@ -3660,6 +3670,14 @@ int VM_Replicate(ValkeyModuleCtx *ctx, const char *cmdname, const char *fmt, ...
va_end(ap);
if (argv == NULL) return VALKEYMODULE_ERR;
if (cmd && slot_export_in_progress) {
int read_flags;
slot = clusterSlotByCommand(cmd, argv, argc, &read_flags);
if (slot == -1 && read_flags & READ_FLAGS_CROSSSLOT) {
clusterFailAllSlotExportsWithMessage("A module replicated a cross-slot command");
}
}
/* Select the propagation target. Usually is AOF + replicas, however
* the caller can exclude one or the other using the "A" or "R"
* modifiers. */
@ -3667,7 +3685,7 @@ int VM_Replicate(ValkeyModuleCtx *ctx, const char *cmdname, const char *fmt, ...
if (!(flags & VALKEYMODULE_ARGV_NO_AOF)) target |= PROPAGATE_AOF;
if (!(flags & VALKEYMODULE_ARGV_NO_REPLICAS)) target |= PROPAGATE_REPL;
alsoPropagate(ctx->client->db->id, argv, argc, target);
alsoPropagate(ctx->client->db->id, argv, argc, target, slot);
/* Release the argv. */
for (j = 0; j < argc; j++) decrRefCount(argv[j]);
@ -3688,7 +3706,7 @@ int VM_Replicate(ValkeyModuleCtx *ctx, const char *cmdname, const char *fmt, ...
*
* The function always returns VALKEYMODULE_OK. */
int VM_ReplicateVerbatim(ValkeyModuleCtx *ctx) {
alsoPropagate(ctx->client->db->id, ctx->client->argv, ctx->client->argc, PROPAGATE_AOF | PROPAGATE_REPL);
alsoPropagate(ctx->client->db->id, ctx->client->argv, ctx->client->argc, PROPAGATE_AOF | PROPAGATE_REPL, ctx->client->slot);
server.dirty++;
return VALKEYMODULE_OK;
}
@ -3981,8 +3999,8 @@ int VM_GetContextFlags(ValkeyModuleCtx *ctx) {
if (ctx) {
if (ctx->client) {
if (ctx->client->flag.deny_blocking) flags |= VALKEYMODULE_CTX_FLAGS_DENY_BLOCKING;
/* Module command received from PRIMARY, is replicated. */
if (ctx->client->flag.primary) flags |= VALKEYMODULE_CTX_FLAGS_REPLICATED;
/* A module command received from the PRIMARY or a slot import is replicated. */
if (isReplicatedClient(ctx->client)) flags |= VALKEYMODULE_CTX_FLAGS_REPLICATED;
if (ctx->client->resp == 3) {
flags |= VALKEYMODULE_CTX_FLAGS_RESP3;
}
@ -11094,7 +11112,8 @@ typedef struct ValkeyModuleScanCursor {
int done;
} ValkeyModuleScanCursor;
static void moduleScanCallback(void *privdata, void *element) {
static void moduleScanCallback(void *privdata, void *element, int didx) {
UNUSED(didx);
ScanCBData *data = privdata;
robj *val = element;
sds key = objectGetKey(val);
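
From a module author's point of view, replication calls are now checked against active slot exports. An illustrative call that would now fail the export rather than the module (key names are made up; the 'c' format characters denote C strings):

```c
/* With a slot export active, replicating a command whose keys hash to
 * different slots fails the migration with a descriptive message instead of
 * silently corrupting the export stream. */
ValkeyModule_Replicate(ctx, "DEL", "cc", "key-in-slot-a", "key-in-slot-b");
```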

View File

@ -30,6 +30,7 @@
#include "server.h"
#include "cluster.h"
#include "cluster_slot_stats.h"
#include "cluster_migrateslots.h"
#include "script.h"
#include "intset.h"
#include "sds.h"
@ -310,6 +311,7 @@ client *createClient(connection *conn) {
c->ctime = c->last_interaction = server.unixtime;
c->duration = 0;
clientSetDefaultAuth(c);
c->slot_migration_job = NULL;
c->reply = listCreate();
c->deferred_reply = NULL;
c->deferred_reply_errors = NULL;
@ -379,7 +381,8 @@ void putClientInPendingWriteQueue(client *c) {
if (!c->flag.pending_write &&
(!c->repl_data ||
c->repl_data->repl_state == REPL_STATE_NONE ||
(isReplicaReadyForReplData(c) && !c->repl_data->repl_start_cmd_stream_on_ack))) {
(isReplicaReadyForReplData(c) && !c->repl_data->repl_start_cmd_stream_on_ack)) &&
clusterSlotMigrationShouldInstallWriteHandler(c)) {
/* Here instead of installing the write handler, we just flag the
* client and put it into a list of clients that have something
* to write to the socket. This way before re-entering the event
@ -829,7 +832,7 @@ void afterErrorReply(client *c, const char *s, size_t len, int flags) {
* the commands sent by the primary. However it is useful to log such events since
* they are rare and may hint at errors in a script or a bug in the server. */
int ctype = getClientType(c);
if (ctype == CLIENT_TYPE_PRIMARY || ctype == CLIENT_TYPE_REPLICA || c->id == CLIENT_ID_AOF) {
if (ctype == CLIENT_TYPE_PRIMARY || ctype == CLIENT_TYPE_REPLICA || c->id == CLIENT_ID_AOF || ctype == CLIENT_TYPE_SLOT_IMPORT || ctype == CLIENT_TYPE_SLOT_EXPORT) {
char *to, *from;
if (c->id == CLIENT_ID_AOF) {
@ -838,9 +841,17 @@ void afterErrorReply(client *c, const char *s, size_t len, int flags) {
} else if (ctype == CLIENT_TYPE_PRIMARY) {
to = "primary";
from = "replica";
} else {
} else if (ctype == CLIENT_TYPE_REPLICA) {
to = "replica";
from = "primary";
} else if (ctype == CLIENT_TYPE_SLOT_IMPORT) {
to = "slot-import-source";
from = "slot-import-target";
} else if (ctype == CLIENT_TYPE_SLOT_EXPORT) {
to = "slot-export-target";
from = "slot-export-source";
} else {
serverAssert(0);
}
if (len > 4096) len = 4096;
@ -869,6 +880,9 @@ void afterErrorReply(client *c, const char *s, size_t len, int flags) {
" after processing the command '%s'",
from, to, cmdname ? cmdname : "<unknown>");
}
if (ctype == CLIENT_TYPE_SLOT_IMPORT || ctype == CLIENT_TYPE_SLOT_EXPORT) {
clusterHandleSlotMigrationErrorResponse(c->slot_migration_job);
}
}
}
@ -1848,8 +1862,15 @@ void unlinkClient(client *c) {
removeClientFromPendingCommandsBatch(c);
/* Check if this is a replica waiting for diskless replication (rdb pipe),
* in which case it needs to be cleaned from that list */
if (c->repl_data && c->flag.replica && c->repl_data->repl_state == REPLICA_STATE_WAIT_BGSAVE_END && server.rdb_pipe_conns) {
* in which case it needs to be cleaned from that list.
*
* Alternatively, if this is a slot migration job for an export operation, we always
* need to check whether this was the target. The state of the migration isn't relevant since the
* snapshot child may take some time to die, during which the migration will continue past
* the snapshot state. */
if (c->repl_data && server.rdb_pipe_conns &&
((c->flag.replica && c->repl_data->repl_state == REPLICA_STATE_WAIT_BGSAVE_END) ||
(c->slot_migration_job && !isImportSlotMigrationJob(c->slot_migration_job)))) {
int i;
int still_alive = 0;
for (i = 0; i < server.rdb_pipe_numconns; i++) {
@ -1860,7 +1881,11 @@ void unlinkClient(client *c) {
if (server.rdb_pipe_conns[i]) still_alive++;
}
if (still_alive == 0) {
serverLog(LL_NOTICE, "Diskless rdb transfer, last replica dropped, killing fork child.");
if (c->slot_migration_job && !isImportSlotMigrationJob(c->slot_migration_job)) {
serverLog(LL_NOTICE, "Slot migration snapshot, migration target dropped, killing fork child.");
} else {
serverLog(LL_NOTICE, "Diskless rdb transfer, last replica dropped, killing fork child.");
}
killRDBChild();
}
}
@ -1921,7 +1946,7 @@ void clearClientConnectionState(client *c) {
c->flag.replica = 0;
}
serverAssert(!(c->flag.replica || c->flag.primary));
serverAssert(!(c->flag.replica || c->flag.primary || c->slot_migration_job));
if (c->flag.tracking) disableTracking(c);
selectDb(c, 0);
@ -2008,6 +2033,11 @@ void freeClient(client *c) {
serverLog(LL_NOTICE, "Connection with replica %s lost.", replicationGetReplicaName(c));
}
/* Handle slot migration connection closed. */
if (c->slot_migration_job) {
clusterHandleSlotMigrationClientClose(c->slot_migration_job);
}
/* Free the query buffer */
if (c->querybuf && c->querybuf == thread_shared_qb) {
sdsclear(c->querybuf);
@ -2142,8 +2172,8 @@ void beforeNextClient(client *c) {
* blocked client as well */
/* Trim the query buffer to the current position. */
if (c->flag.primary) {
/* If the client is a primary, trim the querybuf to repl_applied,
if (isReplicatedClient(c)) {
/* If the client is replicated, trim the querybuf to repl_applied,
* since such a client is special: its querybuf is not only
* used to parse commands, but is also proxied to sub-replicas.
*
@ -2728,7 +2758,11 @@ void releaseReplyReferences(client *c) {
static void _postWriteToClient(client *c) {
if (c->nwritten <= 0) return;
server.stat_net_output_bytes += c->nwritten;
if (getClientType(c) == CLIENT_TYPE_SLOT_EXPORT) {
server.stat_net_cluster_slot_export_bytes += c->nwritten;
} else {
server.stat_net_output_bytes += c->nwritten;
}
int last_written = 0;
if (c->bufpos > 0) {
@ -2791,11 +2825,11 @@ int postWriteToClient(client *c) {
}
if (c->nwritten > 0) {
c->net_output_bytes += c->nwritten;
/* For clients representing primaries we don't count sending data
* as an interaction, since we always send REPLCONF ACK commands
/* For replicated clients we don't count sending data
* as an interaction, since we always send ACK commands
* that take some time to just fill the socket output buffer.
* We just rely on data / pings received for timeout detection. */
if (!c->flag.primary) c->last_interaction = server.unixtime;
if (!isReplicatedClient(c)) c->last_interaction = server.unixtime;
}
if (!clientHasPendingReplies(c)) {
resetLastWrittenBuf(c);
@ -2883,9 +2917,13 @@ int handleReadResult(client *c) {
c->last_interaction = server.unixtime;
c->net_input_bytes += c->nread;
if (c->flag.primary) {
if (isReplicatedClient(c)) {
c->repl_data->read_reploff += c->nread;
server.stat_net_repl_input_bytes += c->nread;
if (getClientType(c) == CLIENT_TYPE_PRIMARY) {
server.stat_net_repl_input_bytes += c->nread;
} else {
server.stat_net_cluster_slot_import_bytes += c->nread;
}
} else {
server.stat_net_input_bytes += c->nread;
}
@ -2928,10 +2966,16 @@ void handleParseError(client *c) {
} else if (flags & READ_FLAGS_ERROR_UNBALANCED_QUOTES) {
addReplyError(c, "Protocol error: unbalanced quotes in request");
setProtocolError("unbalanced quotes in inline request", c);
} else if (flags & READ_FLAGS_ERROR_UNEXPECTED_INLINE_FROM_PRIMARY) {
serverLog(LL_WARNING, "WARNING: Receiving inline protocol from primary, primary stream corruption? Closing the "
"primary connection and discarding the cached primary.");
setProtocolError("Master using the inline protocol. Desync?", c);
} else if (flags & READ_FLAGS_ERROR_UNEXPECTED_INLINE_FROM_REPLICATED_CLIENT) {
if (getClientType(c) == CLIENT_TYPE_SLOT_IMPORT) {
serverLog(LL_WARNING, "WARNING: Receiving inline protocol from slot import, import stream corruption? Closing the "
"slot import connection.");
setProtocolError("Import using the inline protocol. Desync?", c);
} else {
serverLog(LL_WARNING, "WARNING: Receiving inline protocol from primary, primary stream corruption? Closing the "
"primary connection and discarding the cached primary.");
setProtocolError("Master using the inline protocol. Desync?", c);
}
} else {
serverAssertWithInfo(c, NULL, "Unknown parsing error");
}
@ -2942,7 +2986,7 @@ int isParsingError(client *c) {
READ_FLAGS_ERROR_INVALID_MULTIBULK_LEN | READ_FLAGS_ERROR_UNAUTHENTICATED_MULTIBULK_LEN |
READ_FLAGS_ERROR_UNAUTHENTICATED_BULK_LEN | READ_FLAGS_ERROR_MBULK_INVALID_BULK_LEN |
READ_FLAGS_ERROR_BIG_BULK_COUNT | READ_FLAGS_ERROR_MBULK_UNEXPECTED_CHARACTER |
READ_FLAGS_ERROR_UNEXPECTED_INLINE_FROM_PRIMARY | READ_FLAGS_ERROR_UNBALANCED_QUOTES);
READ_FLAGS_ERROR_UNEXPECTED_INLINE_FROM_REPLICATED_CLIENT | READ_FLAGS_ERROR_UNBALANCED_QUOTES);
}
/* This function is called after the query-buffer was parsed.
@ -3208,7 +3252,7 @@ void processInlineBuffer(client *c) {
int argc, j, linefeed_chars = 1;
sds *argv, aux;
size_t querylen;
int is_primary = c->read_flags & READ_FLAGS_PRIMARY;
int is_replicated = c->read_flags & READ_FLAGS_REPLICATED;
/* Search for end of line */
newline = strchr(c->querybuf + c->qb_pos, '\n');
@ -3245,9 +3289,9 @@ void processInlineBuffer(client *c) {
*
* However there is an exception: replication sources may send us just a newline
* to keep the connection active. */
if (querylen != 0 && is_primary) {
if (querylen != 0 && is_replicated) {
sdsfreesplitres(argv, argc);
c->read_flags |= READ_FLAGS_ERROR_UNEXPECTED_INLINE_FROM_PRIMARY;
c->read_flags |= READ_FLAGS_ERROR_UNEXPECTED_INLINE_FROM_REPLICATED_CLIENT;
return;
}
@ -3294,7 +3338,7 @@ void processInlineBuffer(client *c) {
* CLIENT_PROTOCOL_ERROR. */
#define PROTO_DUMP_LEN 128
static void setProtocolError(const char *errstr, client *c) {
if (server.verbosity <= LL_VERBOSE || c->flag.primary) {
if (server.verbosity <= LL_VERBOSE || isReplicatedClient(c)) {
sds client = catClientInfoString(sdsempty(), c, server.hide_user_data_from_log);
/* Sample some protocol to given an idea about what was inside. */
@ -3319,7 +3363,7 @@ static void setProtocolError(const char *errstr, client *c) {
}
}
/* Log all the client and protocol info. */
int loglevel = (c->flag.primary) ? LL_WARNING : LL_VERBOSE;
int loglevel = (isReplicatedClient(c)) ? LL_WARNING : LL_VERBOSE;
serverLog(loglevel, "Protocol error (%s) from client: %s. Query buffer: %s", errstr, client, buf);
sdsfree(client);
}
@ -3338,7 +3382,7 @@ void processMultibulkBuffer(client *c) {
char *newline = NULL;
int ok;
long long ll;
int is_primary = c->read_flags & READ_FLAGS_PRIMARY;
int is_replicated = c->read_flags & READ_FLAGS_REPLICATED;
int auth_required = c->read_flags & READ_FLAGS_AUTH_REQUIRED;
if (c->multibulklen == 0) {
@ -3442,7 +3486,7 @@ void processMultibulkBuffer(client *c) {
size_t bulklen_slen = newline - (c->querybuf + c->qb_pos + 1);
ok = string2ll(c->querybuf + c->qb_pos + 1, bulklen_slen, &ll);
if (!ok || ll < 0 || (!(is_primary) && ll > server.proto_max_bulk_len)) {
if (!ok || ll < 0 || (!(is_replicated) && ll > server.proto_max_bulk_len)) {
c->read_flags |= READ_FLAGS_ERROR_MBULK_INVALID_BULK_LEN;
return;
} else if (ll > 16384 && auth_required) {
@ -3451,8 +3495,8 @@ void processMultibulkBuffer(client *c) {
}
c->qb_pos = newline - c->querybuf + 2;
if (!(is_primary) && ll >= PROTO_MBULK_BIG_ARG) {
/* When the client is not a primary client (because primary
if (!(is_replicated) && ll >= PROTO_MBULK_BIG_ARG) {
/* When the client is not a replicated client (because replicated
* client's querybuf can only be trimmed after the data is applied
* and sent to replicas).
*
@ -3496,10 +3540,10 @@ void processMultibulkBuffer(client *c) {
c->argv = zrealloc(c->argv, sizeof(robj *) * c->argv_len);
}
/* Optimization: if a non-primary client's buffer contains JUST our bulk element
/* Optimization: if a non-replicated client's buffer contains JUST our bulk element
* instead of creating a new object by *copying* the sds we
* just use the current sds string. */
if (!is_primary && c->qb_pos == 0 && c->bulklen >= PROTO_MBULK_BIG_ARG &&
if (!is_replicated && c->qb_pos == 0 && c->bulklen >= PROTO_MBULK_BIG_ARG &&
sdslen(c->querybuf) == (size_t)(c->bulklen + 2)) {
c->argv[c->argc++] = createObject(OBJ_STRING, c->querybuf);
c->argv_len_sum += c->bulklen;
@ -3548,18 +3592,18 @@ void commandProcessed(client *c) {
if (!c->repl_data) return;
long long prev_offset = c->repl_data->reploff;
if (c->flag.primary && !c->flag.multi) {
if (isReplicatedClient(c) && !c->flag.multi) {
/* Update the applied replication offset of our primary. */
c->repl_data->reploff = c->repl_data->read_reploff - sdslen(c->querybuf) + c->qb_pos;
}
/* If the client is a primary we need to compute the difference
/* If the client is replicated we need to compute the difference
* between the applied offset before and after processing the buffer,
* to understand how much of the replication stream was actually
* applied to the primary state: this quantity, and its corresponding
* applied to the state: this quantity, and its corresponding
* part of the replication stream, will be propagated to the
* sub-replicas and to the replication backlog. */
if (c->flag.primary) {
if (isReplicatedClient(c)) {
long long applied = c->repl_data->reploff - prev_offset;
if (applied) {
replicationFeedStreamFromPrimaryStream(c->querybuf + c->repl_data->repl_applied, applied);
@ -3663,11 +3707,11 @@ int canParseCommand(client *c) {
* commands to execute in c->argv. */
if (c->flag.pending_command) return 0;
/* Don't process input from the primary while there is a busy script
* condition on the replica. We want just to accumulate the replication
/* Don't process input from replicated clients while there is a busy script
* condition on this node. We want just to accumulate the replication
* stream (instead of replying -BUSY like we do with other clients) and
* later resume the processing. */
if (isInsideYieldingLongCommand() && c->flag.primary) return 0;
if (isInsideYieldingLongCommand() && isReplicatedClient(c)) return 0;
/* CLIENT_CLOSE_AFTER_REPLY closes the connection once the reply is
* written to the client. Make sure to not let the reply grow after
@ -3686,7 +3730,7 @@ int processInputBuffer(client *c) {
break;
}
c->read_flags = c->flag.primary ? READ_FLAGS_PRIMARY : 0;
c->read_flags = isReplicatedClient(c) ? READ_FLAGS_REPLICATED : 0;
c->read_flags |= authRequired(c) ? READ_FLAGS_AUTH_REQUIRED : 0;
parseCommand(c);
@ -3733,7 +3777,7 @@ static bool readToQueryBuf(client *c) {
/* If the replica RDB client is marked as closed ASAP, do not try to read from it */
if (c->flag.close_asap) return false;
int is_primary = c->read_flags & READ_FLAGS_PRIMARY;
int is_replicated = c->read_flags & READ_FLAGS_REPLICATED;
readlen = PROTO_IOBUF_LEN;
qblen = c->querybuf ? sdslen(c->querybuf) : 0;
@ -3752,9 +3796,9 @@ static bool readToQueryBuf(client *c) {
* for example once we resume a blocked client after CLIENT PAUSE. */
if (remaining > 0) readlen = remaining;
/* Primary client needs expand the readlen when meet BIG_ARG(see #9100),
/* A replicated client needs to expand the readlen when it meets a BIG_ARG (see #9100),
* but doesn't need to align to the next arg, so we can read more data. */
if (c->flag.primary && readlen < PROTO_IOBUF_LEN) readlen = PROTO_IOBUF_LEN;
if (isReplicatedClient(c) && readlen < PROTO_IOBUF_LEN) readlen = PROTO_IOBUF_LEN;
}
if (c->querybuf == NULL) {
@ -3767,7 +3811,7 @@ static bool readToQueryBuf(client *c) {
* Although we have ensured that c->querybuf will not be expanded in the current
* thread_shared_qb, we still add this check for code robustness. */
int use_thread_shared_qb = (c->querybuf == thread_shared_qb) ? 1 : 0;
if (!is_primary && // primary client's querybuf can grow greedy.
if (!is_replicated && // replicated clients' querybuf can grow greedy.
(big_arg || sdsalloc(c->querybuf) < PROTO_IOBUF_LEN)) {
/* When reading a BIG_ARG we won't be reading more than that one arg
* into the query buffer, so we don't need to pre-allocate more than we
@ -3794,7 +3838,7 @@ static bool readToQueryBuf(client *c) {
sdsIncrLen(c->querybuf, c->nread);
qblen = sdslen(c->querybuf);
if (c->querybuf_peak < qblen) c->querybuf_peak = qblen;
if (!is_primary) {
if (!is_replicated) {
/* The commands cached in the MULTI/EXEC queue have not been executed yet,
* so they are also considered a part of the query buffer in a broader sense.
*
@ -5554,6 +5598,7 @@ int getClientType(client *c) {
* want to expose them as normal clients. */
if (c->flag.replica && !c->flag.monitor) return CLIENT_TYPE_REPLICA;
if (c->flag.pubsub) return CLIENT_TYPE_PUBSUB;
if (c->slot_migration_job) return isImportSlotMigrationJob(c->slot_migration_job) ? CLIENT_TYPE_SLOT_IMPORT : CLIENT_TYPE_SLOT_EXPORT;
return CLIENT_TYPE_NORMAL;
}
@ -5578,6 +5623,8 @@ char *getClientTypeName(int class) {
case CLIENT_TYPE_REPLICA: return "slave";
case CLIENT_TYPE_PUBSUB: return "pubsub";
case CLIENT_TYPE_PRIMARY: return "master";
case CLIENT_TYPE_SLOT_IMPORT: return "slot-import";
case CLIENT_TYPE_SLOT_EXPORT: return "slot-export";
default: return NULL;
}
}
@ -5602,6 +5649,12 @@ int checkClientOutputBufferLimits(client *c) {
* like normal clients. */
if (class == CLIENT_TYPE_PRIMARY) class = CLIENT_TYPE_NORMAL;
/* Slot import clients are treated as normal as well */
if (class == CLIENT_TYPE_SLOT_IMPORT) class = CLIENT_TYPE_NORMAL;
/* Slot export clients are treated as replicas */
if (class == CLIENT_TYPE_SLOT_EXPORT) class = CLIENT_TYPE_REPLICA;
/* Note that it doesn't make sense to set the replica clients output buffer
* limit lower than the repl-backlog-size config (partial sync will succeed
* and then replica will get disconnected).
@ -5719,6 +5772,8 @@ char *getPausedReason(pause_purpose purpose) {
return "shutdown_in_progress";
case PAUSE_DURING_FAILOVER:
return "failover_in_progress";
case PAUSE_DURING_SLOT_MIGRATION:
return "slot_migration_in_progress";
case NUM_PAUSE_PURPOSES:
return "none";
default:
@ -5843,6 +5898,10 @@ uint32_t isPausedActionsWithUpdate(uint32_t actions_bitmask) {
return (server.paused_actions & actions_bitmask);
}
uint32_t getPausedActionsWithPurpose(pause_purpose purpose) {
return server.client_pause_per_purpose[purpose].paused_actions;
}
/* This function is called by the server in order to process a few events from
* time to time while blocked into some not interruptible operation.
* This allows to reply to clients with the -LOADING error while loading the
@ -6084,7 +6143,7 @@ void ioThreadReadQueryFromClient(void *data) {
done:
/* Only trim the query buffer for non-replicated clients.
* A replicated client's buffer is handled by the main thread using the repl_applied position. */
if (!(c->read_flags & READ_FLAGS_PRIMARY)) {
if (!(c->read_flags & READ_FLAGS_REPLICATED)) {
trimClientQueryBuffer(c);
}
atomic_thread_fence(memory_order_release);

View File

@ -1361,6 +1361,10 @@ struct serverMemOverhead *getMemoryOverheadData(void) {
server.stat_clients_type_memory[CLIENT_TYPE_NORMAL];
mem_total += mh->clients_normal;
mh->cluster_slot_import = server.stat_clients_type_memory[CLIENT_TYPE_SLOT_IMPORT];
mh->cluster_slot_export = server.stat_clients_type_memory[CLIENT_TYPE_SLOT_EXPORT];
mem_total += mh->cluster_slot_import + mh->cluster_slot_export;
mh->cluster_links = server.stat_cluster_links_memory;
mem_total += mh->cluster_links;
@ -1725,7 +1729,7 @@ void memoryCommand(client *c) {
} else if (!strcasecmp(c->argv[1]->ptr, "stats") && c->argc == 2) {
struct serverMemOverhead *mh = getMemoryOverheadData();
addReplyMapLen(c, 31 + mh->num_dbs);
addReplyMapLen(c, 33 + mh->num_dbs);
addReplyBulkCString(c, "peak.allocated");
addReplyLongLong(c, mh->peak_allocated);
@ -1748,6 +1752,12 @@ void memoryCommand(client *c) {
addReplyBulkCString(c, "cluster.links");
addReplyLongLong(c, mh->cluster_links);
addReplyBulkCString(c, "cluster.slot_import");
addReplyLongLong(c, mh->cluster_slot_import);
addReplyBulkCString(c, "cluster.slot_export");
addReplyLongLong(c, mh->cluster_slot_export);
addReplyBulkCString(c, "aof.buffer");
addReplyLongLong(c, mh->aof_buffer);

src/rdb.c
View File

@ -44,6 +44,8 @@
#include "bio.h"
#include "zmalloc.h"
#include "module.h"
#include "cluster.h"
#include "cluster_migrateslots.h"
#include <math.h>
#include <fcntl.h>
@ -1355,7 +1357,7 @@ ssize_t rdbSaveDb(rio *rdb, int dbid, int rdbflags, long *key_counter) {
serverDb *db = server.db[dbid];
if (db == NULL) return 0;
unsigned long long int db_size = kvstoreSize(db->keys);
unsigned long long int db_size = kvstoreSize(db->keys) + kvstoreImportingSize(db->keys);
if (db_size == 0) return 0;
/* Write the SELECT DB opcode */
@ -1365,7 +1367,7 @@ ssize_t rdbSaveDb(rio *rdb, int dbid, int rdbflags, long *key_counter) {
written += res;
/* Write the RESIZE DB opcode. */
unsigned long long expires_size = kvstoreSize(db->expires);
unsigned long long expires_size = kvstoreSize(db->expires) + kvstoreImportingSize(db->expires);
if ((res = rdbSaveType(rdb, RDB_OPCODE_RESIZEDB)) < 0) goto werr;
written += res;
if ((res = rdbSaveLen(rdb, db_size)) < 0) goto werr;
@ -1373,7 +1375,7 @@ ssize_t rdbSaveDb(rio *rdb, int dbid, int rdbflags, long *key_counter) {
if ((res = rdbSaveLen(rdb, expires_size)) < 0) goto werr;
written += res;
kvs_it = kvstoreIteratorInit(db->keys, HASHTABLE_ITER_SAFE | HASHTABLE_ITER_PREFETCH_VALUES);
kvs_it = kvstoreIteratorInit(db->keys, HASHTABLE_ITER_SAFE | HASHTABLE_ITER_PREFETCH_VALUES | HASHTABLE_ITER_INCLUDE_IMPORTING);
int last_slot = -1;
/* Iterate this DB writing every entry */
void *next;
@ -3562,6 +3564,10 @@ void backgroundSaveDoneHandler(int exitcode, int bysignal) {
/* Possibly there are replicas waiting for a BGSAVE in order to be served
* (the first stage of SYNC is a bulk transfer of dump.rdb) */
updateReplicasWaitingBgsave((!bysignal && exitcode == 0) ? C_OK : C_ERR, type);
/* Slot export should also be notified, in case this was an export-related
* snapshot. */
clusterHandleSlotExportBackgroundSaveDone((!bysignal && exitcode == 0) ? C_OK : C_ERR);
}
/* Kill the RDB saving child using SIGUSR1 (so that the parent will know
@ -3577,15 +3583,14 @@ void killRDBChild(void) {
* - rdbRemoveTempFile */
}
/* Spawn an RDB child that writes the RDB to the sockets of the replicas
* that are currently in REPLICA_STATE_WAIT_BGSAVE_START state. */
int rdbSaveToReplicasSockets(int req, rdbSaveInfo *rsi) {
listNode *ln;
listIter li;
/* Save snapshot to the provided connections, spawning a child process and
* running the provided function.
*
* The provided connections array is freed after the save completes and
* must not be freed by the caller. */
int saveSnapshotToConnectionSockets(rdbSnapshotOptions options) {
pid_t childpid;
int pipefds[2], rdb_pipe_write = -1, safe_to_exit_pipe = -1;
int dual_channel = (req & REPLICA_REQ_RDB_CHANNEL);
if (hasActiveChildProcess()) return C_ERR;
serverAssert(server.rdb_pipe_read == -1 && server.rdb_child_exit_pipe == -1);
@ -3593,7 +3598,7 @@ int rdbSaveToReplicasSockets(int req, rdbSaveInfo *rsi) {
* drained the pipe. */
if (server.rdb_pipe_conns) return C_ERR;
if (!dual_channel) {
if (options.use_pipe) {
/* Before to fork, create a pipe that is used to transfer the rdb bytes to
* the parent, we can't let it write directly to the sockets, since in case
* of TLS we must let the parent handle a continuous TLS state when the
@ -3612,6 +3617,109 @@ int rdbSaveToReplicasSockets(int req, rdbSaveInfo *rsi) {
safe_to_exit_pipe = pipefds[0]; /* read end */
server.rdb_child_exit_pipe = pipefds[1]; /* write end */
}
server.rdb_pipe_conns = NULL;
if (options.use_pipe) {
server.rdb_pipe_conns = options.conns;
server.rdb_pipe_numconns = options.connsnum;
server.rdb_pipe_numconns_writing = 0;
}
/* Create the child process. */
if ((childpid = serverFork(CHILD_TYPE_RDB)) == 0) {
/* Child */
int retval, dummy;
rio rdb;
if (!options.use_pipe) {
rioInitWithConnset(&rdb, options.conns, options.connsnum);
} else {
rioInitWithFd(&rdb, rdb_pipe_write);
}
/* Close the reading part, so that if the parent crashes, the child will
* get a write error and exit. */
if (options.use_pipe) close(server.rdb_pipe_read);
if (strstr(server.exec_argv[0], "redis-server") != NULL) {
serverSetProcTitle("redis-rdb-to-slaves");
} else {
serverSetProcTitle("valkey-rdb-to-replicas");
}
serverSetCpuAffinity(server.bgsave_cpulist);
if (options.skip_checksum) rdb.flags |= RIO_FLAG_SKIP_RDB_CHECKSUM;
retval = options.snapshot_func(options.req, &rdb, options.privdata);
if (retval == C_OK && rioFlush(&rdb) == 0) retval = C_ERR;
if (retval == C_OK) {
sendChildCowInfo(CHILD_INFO_TYPE_RDB_COW_SIZE, "RDB");
}
if (!options.use_pipe) {
rioFreeConnset(&rdb);
} else {
rioFreeFd(&rdb);
/* wake up the reader, tell it we're done. */
close(rdb_pipe_write);
close(server.rdb_child_exit_pipe); /* close write end so that we can detect the close on the parent. */
}
zfree(options.conns);
/* hold exit until the parent tells us it's safe. we're not expecting
* to read anything, just get the error when the pipe is closed. */
if (options.use_pipe) dummy = read(safe_to_exit_pipe, pipefds, 1);
UNUSED(dummy);
exitFromChild((retval == C_OK) ? 0 : 1);
} else {
/* Parent */
if (childpid == -1) {
serverLog(LL_WARNING, "Can't save in background: fork: %s", strerror(errno));
if (options.use_pipe) {
close(rdb_pipe_write);
close(server.rdb_pipe_read);
close(server.rdb_child_exit_pipe);
}
zfree(options.conns);
if (!options.use_pipe) {
closeChildInfoPipe();
} else {
server.rdb_pipe_conns = NULL;
server.rdb_pipe_numconns = 0;
server.rdb_pipe_numconns_writing = 0;
}
} else {
serverLog(LL_NOTICE, "Background RDB transfer started by pid %ld to %s%s", (long)childpid,
!options.use_pipe ? "direct socket to replica" : "pipe through parent process",
options.skip_checksum ? " while skipping RDB checksum for this transfer" : "");
server.rdb_save_time_start = time(NULL);
server.rdb_child_type = RDB_CHILD_TYPE_SOCKET;
if (!options.use_pipe) {
/* For dual channel sync, the main process no longer requires these RDB connections. */
zfree(options.conns);
} else {
close(rdb_pipe_write); /* close write in parent so that it can detect the close on the child. */
if (aeCreateFileEvent(server.el, server.rdb_pipe_read, AE_READABLE, rdbPipeReadHandler, NULL) ==
AE_ERR) {
serverPanic("Unrecoverable error creating server.rdb_pipe_read file event.");
}
}
}
if (options.use_pipe) close(safe_to_exit_pipe);
return (childpid == -1) ? C_ERR : C_OK;
}
return C_OK; /* Unreached. */
}
int childSnapshotUsingRDB(int req, rio *rdb, void *privdata) {
return rdbSaveRioWithEOFMark(req, rdb, NULL, (rdbSaveInfo *)privdata);
}
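childSnapshotUsingRDB() adapts the existing RDB writer to the new ChildSnapshotFunc interface. For orientation, here is a minimal sketch of what a slot-export counterpart might look like; slotExportCtx, its fields, and startSlotExportSnapshot() are hypothetical names for illustration (the PR's real logic lives in cluster_migrateslots.c and may differ):

/* Hedged sketch, not the PR's actual code: a ChildSnapshotFunc that writes
 * one slot as an AOF-formatted stream instead of RDB. */
typedef struct slotExportCtx { /* hypothetical context type */
    int db_num;
    int slot;
} slotExportCtx;

static int childSnapshotUsingAof(int req, rio *rdb, void *privdata) {
    UNUSED(req);
    slotExportCtx *ctx = privdata;
    size_t key_count = 0;
    /* rewriteSlotToAppendOnlyFileRio() is declared in server.h by this PR. */
    return rewriteSlotToAppendOnlyFileRio(rdb, ctx->db_num, ctx->slot, &key_count);
}

static int startSlotExportSnapshot(connection *conn, slotExportCtx *ctx) {
    connection **conns = zmalloc(sizeof(connection *));
    conns[0] = conn;
    rdbSnapshotOptions options = {
        .conns = conns, /* freed by saveSnapshotToConnectionSockets() */
        .connsnum = 1,
        .use_pipe = 1,  /* let the parent proxy the bytes (e.g. for TLS) */
        .req = 0,
        .skip_checksum = 1, /* an AOF stream carries no RDB checksum */
        .snapshot_func = childSnapshotUsingAof,
        .privdata = ctx,
    };
    return saveSnapshotToConnectionSockets(options);
}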
/* Spawn an RDB child that writes the RDB to the sockets of the replicas
* that are currently in REPLICA_STATE_WAIT_BGSAVE_START state. */
int rdbSaveToReplicasSockets(int req, rdbSaveInfo *rsi) {
listNode *ln;
listIter li;
int dual_channel = (req & REPLICA_REQ_RDB_CHANNEL);
/*
* For replicas with repl_state == REPLICA_STATE_WAIT_BGSAVE_END and replica_req == req:
* Check replica capabilities: if every replica supports skipping the RDB checksum, the primary should also skip it.
@ -3619,15 +3727,10 @@ int rdbSaveToReplicasSockets(int req, rdbSaveInfo *rsi) {
*/
int skip_rdb_checksum = 1;
/* Collect the connections of the replicas we want to transfer
* the RDB to, which are in WAIT_BGSAVE_START state. */
* the RDB to, which are in WAIT_BGSAVE_START state. */
int connsnum = 0;
connection **conns = zmalloc(sizeof(connection *) * listLength(server.replicas));
server.rdb_pipe_conns = NULL;
if (!dual_channel) {
server.rdb_pipe_conns = conns;
server.rdb_pipe_numconns = 0;
server.rdb_pipe_numconns_writing = 0;
}
/* Filter replica connections pending full sync (i.e. in WAIT_BGSAVE_START state). */
listRewind(server.replicas, &li);
while ((ln = listNext(&li))) {
@ -3646,110 +3749,37 @@ int rdbSaveToReplicasSockets(int req, rdbSaveInfo *rsi) {
addRdbReplicaToPsyncWait(replica);
/* Put the socket in blocking mode to simplify RDB transfer. */
connBlock(replica->conn);
} else {
server.rdb_pipe_numconns++;
}
replicationSetupReplicaForFullResync(replica, getPsyncInitialOffset());
}
// do not skip RDB checksum on the primary if connection doesn't have integrity check or if the replica doesn't support it
/* do not skip RDB checksum on the primary if connection doesn't have integrity check or if the replica doesn't support it */
if (!connIsIntegrityChecked(replica->conn) || !(replica->repl_data->replica_capa & REPLICA_CAPA_SKIP_RDB_CHECKSUM))
skip_rdb_checksum = 0;
}
/* Create the child process. */
if ((childpid = serverFork(CHILD_TYPE_RDB)) == 0) {
/* Child */
int retval, dummy;
rio rdb;
if (dual_channel) {
rioInitWithConnset(&rdb, conns, connsnum);
} else {
rioInitWithFd(&rdb, rdb_pipe_write);
}
/* Close the reading part, so that if the parent crashes, the child will
* get a write error and exit. */
if (!dual_channel) close(server.rdb_pipe_read);
if (strstr(server.exec_argv[0], "redis-server") != NULL) {
serverSetProcTitle("redis-rdb-to-slaves");
} else {
serverSetProcTitle("valkey-rdb-to-replicas");
}
serverSetCpuAffinity(server.bgsave_cpulist);
if (skip_rdb_checksum) rdb.flags |= RIO_FLAG_SKIP_RDB_CHECKSUM;
retval = rdbSaveRioWithEOFMark(req, &rdb, NULL, rsi);
if (retval == C_OK && rioFlush(&rdb) == 0) retval = C_ERR;
if (retval == C_OK) {
sendChildCowInfo(CHILD_INFO_TYPE_RDB_COW_SIZE, "RDB");
}
if (dual_channel) {
rioFreeConnset(&rdb);
} else {
rioFreeFd(&rdb);
/* wake up the reader, tell it we're done. */
close(rdb_pipe_write);
close(server.rdb_child_exit_pipe); /* close write end so that we can detect the close on the parent. */
}
zfree(conns);
/* hold exit until the parent tells us it's safe. we're not expecting
* to read anything, just get the error when the pipe is closed. */
if (!dual_channel) dummy = read(safe_to_exit_pipe, pipefds, 1);
UNUSED(dummy);
exitFromChild((retval == C_OK) ? 0 : 1);
} else {
/* Parent */
if (childpid == -1) {
serverLog(LL_WARNING, "Can't save in background: fork: %s", strerror(errno));
/* Undo the state change. The caller will perform cleanup on
* all the replicas in BGSAVE_START state, but an early call to
* replicationSetupReplicaForFullResync() turned it into BGSAVE_END */
listRewind(server.replicas, &li);
while ((ln = listNext(&li))) {
client *replica = ln->value;
if (replica->repl_data->repl_state == REPLICA_STATE_WAIT_BGSAVE_END) {
replica->repl_data->repl_state = REPLICA_STATE_WAIT_BGSAVE_START;
}
}
if (!dual_channel) {
close(rdb_pipe_write);
close(server.rdb_pipe_read);
close(server.rdb_child_exit_pipe);
}
zfree(conns);
if (dual_channel) {
closeChildInfoPipe();
} else {
server.rdb_pipe_conns = NULL;
server.rdb_pipe_numconns = 0;
server.rdb_pipe_numconns_writing = 0;
}
} else {
serverLog(LL_NOTICE, "Background RDB transfer started by pid %ld to %s%s", (long)childpid,
dual_channel ? "direct socket to replica" : "pipe through parent process",
skip_rdb_checksum ? " while skipping RDB checksum for this transfer" : "");
server.rdb_save_time_start = time(NULL);
server.rdb_child_type = RDB_CHILD_TYPE_SOCKET;
if (dual_channel) {
/* For dual channel sync, the main process no longer requires these RDB connections. */
zfree(conns);
} else {
close(rdb_pipe_write); /* close write in parent so that it can detect the close on the child. */
if (aeCreateFileEvent(server.el, server.rdb_pipe_read, AE_READABLE, rdbPipeReadHandler, NULL) ==
AE_ERR) {
serverPanic("Unrecoverable error creating server.rdb_pipe_read file event.");
}
rdbSnapshotOptions options = {
.conns = conns,
.connsnum = connsnum,
.use_pipe = !dual_channel,
.req = req,
.skip_checksum = skip_rdb_checksum,
.privdata = rsi,
.snapshot_func = childSnapshotUsingRDB};
if (saveSnapshotToConnectionSockets(options) != C_OK) {
/* Undo the state change. The caller will perform cleanup on
* all the replicas in BGSAVE_START state, but an early call to
* replicationSetupReplicaForFullResync() turned it into BGSAVE_END */
listRewind(server.replicas, &li);
while ((ln = listNext(&li))) {
client *replica = ln->value;
if (replica->repl_data->repl_state == REPLICA_STATE_WAIT_BGSAVE_END) {
replica->repl_data->repl_state = REPLICA_STATE_WAIT_BGSAVE_START;
}
}
if (!dual_channel) close(safe_to_exit_pipe);
return (childpid == -1) ? C_ERR : C_OK;
return C_ERR;
}
return C_OK; /* Unreached. */
return C_OK;
}
void saveCommand(client *c) {

src/rdb.h

@ -117,6 +117,17 @@ enum RdbType {
};
/* NOTE: WHEN ADDING NEW RDB TYPE, UPDATE rdb_type_string[] */
typedef int (*ChildSnapshotFunc)(int req, rio *rdb, void *privdata);
typedef struct rdbSnapshotOptions {
int connsnum; /* Number of connections. */
connection **conns; /* Connections to send the snapshot to. */
int use_pipe; /* Use pipe to send the snapshot. */
int req; /* See REPLICA_REQ_* in server.h. */
int skip_checksum; /* Skip checksum when sending the snapshot. */
ChildSnapshotFunc snapshot_func; /* Function to call to take the snapshot. */
void *privdata; /* Private data to pass to snapshot_func. */
} rdbSnapshotOptions;
/* Test if a type is an object type. */
#define rdbIsObjectType(t) (((t) >= 0 && (t) <= 7) || ((t) >= 9 && (t) < RDB_TYPE_LAST))
@ -198,5 +209,6 @@ int rdbFunctionLoad(rio *rdb, int ver, functionsLibCtx *lib_ctx, int rdbflags, s
int rdbSaveRio(int req, rio *rdb, int *error, int rdbflags, rdbSaveInfo *rsi);
ssize_t rdbSaveFunctions(rio *rdb);
rdbSaveInfo *rdbPopulateSaveInfo(rdbSaveInfo *rsi);
int saveSnapshotToConnectionSockets(rdbSnapshotOptions options);
#endif

src/replication.c

@ -71,7 +71,7 @@ void syncWithPrimary(connection *conn);
int RDBGeneratedByReplication = 0;
/* --------------------------- Utility functions ---------------------------- */
static ConnectionType *connTypeOfReplication(void) {
ConnectionType *connTypeOfReplication(void) {
if (server.tls_replication) {
return connectionTypeTls();
}
@ -533,7 +533,6 @@ void feedReplicationBuffer(char *s, size_t len) {
* replicationFeedStreamFromPrimaryStream() */
void replicationFeedReplicas(int dictid, robj **argv, int argc) {
int j, len;
char llstr[LONG_STR_SIZE];
/* In case we propagate a command that doesn't touch keys (PING, REPLCONF) we
* pass dbid=-1 that indicate there is no need to replicate `select` command. */
@ -565,19 +564,7 @@ void replicationFeedReplicas(int dictid, robj **argv, int argc) {
/* Send SELECT command to every replica if needed. */
if (dictid != -1 && server.replicas_eldb != dictid) {
robj *selectcmd;
/* For a few DBs we have pre-computed SELECT command. */
if (dictid >= 0 && dictid < PROTO_SHARED_SELECT_CMDS) {
selectcmd = shared.select[dictid];
} else {
int dictid_len;
dictid_len = ll2string(llstr, sizeof(llstr), dictid);
selectcmd = createObject(
OBJ_STRING, sdscatprintf(sdsempty(), "*2\r\n$6\r\nSELECT\r\n$%d\r\n%s\r\n", dictid_len, llstr));
}
robj *selectcmd = generateSelectCommand(dictid);
feedReplicationBufferWithObject(selectcmd);
/* Although the SELECT command is not associated with any slot,
@ -585,7 +572,7 @@ void replicationFeedReplicas(int dictid, robj **argv, int argc) {
* To cancel-out this accumulation, below adjustment is made. */
clusterSlotStatsDecrNetworkBytesOutForReplication(sdslen(selectcmd->ptr));
if (dictid < 0 || dictid >= PROTO_SHARED_SELECT_CMDS) decrRefCount(selectcmd);
decrRefCount(selectcmd);
server.replicas_eldb = dictid;
}
@ -1732,7 +1719,11 @@ void rdbPipeWriteHandler(struct connection *conn) {
return;
} else {
replica->repl_data->repldboff += nwritten;
server.stat_net_repl_output_bytes += nwritten;
if (getClientType(replica) == CLIENT_TYPE_SLOT_EXPORT) {
server.stat_net_cluster_slot_export_bytes += nwritten;
} else {
server.stat_net_repl_output_bytes += nwritten;
}
if (replica->repl_data->repldboff < server.rdb_pipe_bufflen) {
replica->repl_data->repl_last_partial_write = server.unixtime;
return; /* more data to write.. */
@ -1806,7 +1797,11 @@ void rdbPipeReadHandler(struct aeEventLoop *eventLoop, int fd, void *clientData,
/* Note: when use diskless replication, 'repldboff' is the offset
* of 'rdb_pipe_buff' sent rather than the offset of entire RDB. */
replica->repl_data->repldboff = nwritten;
server.stat_net_repl_output_bytes += nwritten;
if (getClientType(replica) == CLIENT_TYPE_SLOT_EXPORT) {
server.stat_net_cluster_slot_export_bytes += nwritten;
} else {
server.stat_net_repl_output_bytes += nwritten;
}
}
/* If we were unable to write all the data to one of the replicas,
* setup write handler (and disable pipe read handler, below) */
@ -2670,7 +2665,7 @@ done:
}
}
char *receiveSynchronousResponse(connection *conn) {
sds receiveSynchronousResponse(connection *conn) {
char buf[256];
/* Read the reply from the server. */
if (connSyncReadLine(conn, buf, sizeof(buf), server.repl_syncio_timeout * 1000) == -1) {
@ -2682,7 +2677,7 @@ char *receiveSynchronousResponse(connection *conn) {
}
/* Send a pre-formatted multi-bulk command to the connection. */
char *sendCommandRaw(connection *conn, sds cmd) {
sds sendCommandRaw(connection *conn, sds cmd) {
if (connSyncWrite(conn, cmd, sdslen(cmd), server.repl_syncio_timeout * 1000) == -1) {
return sdscatprintf(sdsempty(), "-Writing to master: %s", connGetLastError(conn));
}
@ -2698,7 +2693,7 @@ char *sendCommandRaw(connection *conn, sds cmd) {
* The command returns an sds string representing the result of the
* operation. On error the first byte is a "-".
*/
char *sendCommand(connection *conn, ...) {
sds sendCommand(connection *conn, ...) {
va_list ap;
sds cmd = sdsempty();
sds cmdargs = sdsempty();
@ -2721,7 +2716,7 @@ char *sendCommand(connection *conn, ...) {
sdsfree(cmdargs);
va_end(ap);
char *err = sendCommandRaw(conn, cmd);
sds err = sendCommandRaw(conn, cmd);
sdsfree(cmd);
if (err) return err;
return NULL;
@ -2736,7 +2731,7 @@ char *sendCommand(connection *conn, ...) {
* The command returns an sds string representing the result of the
* operation. On error the first byte is a "-".
*/
char *sendCommandArgv(connection *conn, int argc, char **argv, size_t *argv_lens) {
sds sendCommandArgv(connection *conn, int argc, char **argv, size_t *argv_lens) {
sds cmd = sdsempty();
char *arg;
int i;
@ -2751,7 +2746,7 @@ char *sendCommandArgv(connection *conn, int argc, char **argv, size_t *argv_lens
cmd = sdscatlen(cmd, arg, len);
cmd = sdscatlen(cmd, "\r\n", 2);
}
char *err = sendCommandRaw(conn, cmd);
sds err = sendCommandRaw(conn, cmd);
sdsfree(cmd);
if (err) return err;
return NULL;
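The char * to sds return-type changes above are more than cosmetic: callers now own an sds buffer and must release it with sdsfree(). A hedged usage sketch (sending a SYNCSLOTS ACK here is illustrative, not a quote of the PR's code):

/* Illustrative: write a control command on the migration link. */
sds err = sendCommand(conn, "CLUSTER", "SYNCSLOTS", "ACK", NULL);
if (err) {
    serverLog(LL_WARNING, "Error writing to slot migration link: %s", err);
    sdsfree(err); /* the caller owns the returned sds buffer */
}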
@ -2845,22 +2840,38 @@ int sendCurrentOffsetToReplica(client *replica) {
return C_OK;
}
sds replicationSendAuth(connection *conn) {
char *args[] = {"AUTH", NULL, NULL};
size_t lens[] = {4, 0, 0};
int argc = 1;
if (server.primary_user) {
args[argc] = server.primary_user;
lens[argc] = strlen(server.primary_user);
argc++;
}
args[argc] = server.primary_auth;
lens[argc] = sdslen(server.primary_auth);
argc++;
return sendCommandArgv(conn, argc, args, lens);
}
robj *generateSelectCommand(int dictid) {
/* For a few DBs we have pre-computed SELECT command. */
if (dictid >= 0 && dictid < PROTO_SHARED_SELECT_CMDS) {
return shared.select[dictid];
}
char llstr[LONG_STR_SIZE];
int dictid_len;
dictid_len = ll2string(llstr, sizeof(llstr), dictid);
return createObject(
OBJ_STRING, sdscatfmt(sdsempty(), "*2\r\n$6\r\nSELECT\r\n$%i\r\n%s\r\n", dictid_len, llstr));
}
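generateSelectCommand() returns one of the shared objects for small db ids and a newly allocated object otherwise; because decrRefCount() is a no-op on shared objects, callers can now release the result unconditionally, as replicationFeedReplicas() does above. A hedged sketch of how another replication-like feed, such as a slot export job, might reuse it (jobAppendObject() and the last_dictid tracking are hypothetical):

/* Illustrative only: emit SELECT into a slot export stream when the
 * last-selected db differs. */
static void slotExportFeedSelect(slotMigrationJob *job, int dictid, int *last_dictid) {
    if (*last_dictid == dictid) return;
    robj *selectcmd = generateSelectCommand(dictid);
    jobAppendObject(job, selectcmd); /* hypothetical append helper */
    decrRefCount(selectcmd);         /* safe even for the shared SELECT objects */
    *last_dictid = dictid;
}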
static int dualChannelReplHandleHandshake(connection *conn, sds *err) {
dualChannelServerLog(LL_DEBUG, "Received first reply from primary using rdb connection.");
/* AUTH with the primary if required. */
if (server.primary_auth) {
char *args[] = {"AUTH", NULL, NULL};
size_t lens[] = {4, 0, 0};
int argc = 1;
if (server.primary_user) {
args[argc] = server.primary_user;
lens[argc] = strlen(server.primary_user);
argc++;
}
args[argc] = server.primary_auth;
lens[argc] = sdslen(server.primary_auth);
argc++;
*err = sendCommandArgv(conn, argc, args, lens);
*err = replicationSendAuth(conn);
if (*err) {
dualChannelServerLog(LL_WARNING, "Sending command to primary in dual channel replication handshake: %s", *err);
return C_ERR;
@ -3604,18 +3615,7 @@ int syncWithPrimaryHandleSendHandshakeState(connection *conn) {
sds err;
/* AUTH with the primary if required. */
if (server.primary_auth) {
char *args[3] = {"AUTH", NULL, NULL};
size_t lens[3] = {4, 0, 0};
int argc = 1;
if (server.primary_user) {
args[argc] = server.primary_user;
lens[argc] = strlen(server.primary_user);
argc++;
}
args[argc] = server.primary_auth;
lens[argc] = sdslen(server.primary_auth);
argc++;
err = sendCommandArgv(conn, argc, args, lens);
err = replicationSendAuth(conn);
if (err) goto err;
}

src/server.c

@ -36,6 +36,7 @@
#include "monotonic.h"
#include "cluster.h"
#include "cluster_slot_stats.h"
#include "cluster_migrateslots.h"
#include "commandlog.h"
#include "bio.h"
#include "latency.h"
@ -911,8 +912,8 @@ int clientsCronResizeQueryBuffer(client *c) {
if (idletime > 2) {
/* 1) Query is idle for a long time. */
size_t remaining = sdslen(c->querybuf) - c->qb_pos;
if (!c->flag.primary && !remaining) {
/* If the client is not a primary and no data is pending,
if (!isReplicatedClient(c) && !remaining) {
/* If the client is not replicated and no data is pending,
* The client can safely use the shared query buffer in the next read - free the client's querybuf. */
sdsfree(c->querybuf);
/* By setting the querybuf to NULL, the client will use the shared query buffer in the next read.
@ -1489,10 +1490,10 @@ long long serverCron(struct aeEventLoop *eventLoop, long long id, void *clientDa
monotime current_time = getMonotonicUs();
long long factor = 1000000; // us
trackInstantaneousMetric(STATS_METRIC_COMMAND, server.stat_numcommands, current_time, factor);
trackInstantaneousMetric(STATS_METRIC_NET_INPUT, server.stat_net_input_bytes + server.stat_net_repl_input_bytes + server.bio_stat_net_repl_input_bytes,
trackInstantaneousMetric(STATS_METRIC_NET_INPUT, server.stat_net_input_bytes + server.stat_net_repl_input_bytes + server.bio_stat_net_repl_input_bytes + server.stat_net_cluster_slot_import_bytes,
current_time, factor);
trackInstantaneousMetric(STATS_METRIC_NET_OUTPUT,
server.stat_net_output_bytes + server.stat_net_repl_output_bytes, current_time,
server.stat_net_output_bytes + server.stat_net_repl_output_bytes + server.stat_net_cluster_slot_export_bytes, current_time,
factor);
trackInstantaneousMetric(STATS_METRIC_NET_INPUT_REPLICATION, server.stat_net_repl_input_bytes + server.bio_stat_net_repl_input_bytes, current_time,
factor);
@ -2743,6 +2744,8 @@ void resetServerStats(void) {
server.stat_net_repl_input_bytes = 0;
server.bio_stat_net_repl_input_bytes = 0;
server.stat_net_repl_output_bytes = 0;
server.stat_net_cluster_slot_export_bytes = 0;
server.stat_net_cluster_slot_import_bytes = 0;
server.stat_unexpected_error_replies = 0;
server.stat_total_error_replies = 0;
server.stat_dump_payload_sanitizations = 0;
@ -2788,6 +2791,9 @@ serverDb *createDatabase(int id) {
db->keys = kvstoreCreate(&kvstoreKeysHashtableType, slot_count_bits, flags);
db->expires = kvstoreCreate(&kvstoreExpiresHashtableType, slot_count_bits, flags);
db->keys_with_volatile_items = kvstoreCreate(&kvstoreExpiresHashtableType, slot_count_bits, flags);
if (clusterIsAnySlotImporting()) {
clusterMarkImportingSlotsInDb(db);
}
db->blocking_keys = dictCreate(&keylistDictType);
db->blocking_keys_unblock_on_nokey = dictCreate(&objectKeyPointerValueDictType);
db->ready_keys = dictCreate(&objectKeyPointerValueDictType);
@ -3330,7 +3336,7 @@ void resetErrorTableStats(void) {
/* ========================== OP Array API ============================ */
int serverOpArrayAppend(serverOpArray *oa, int dbid, robj **argv, int argc, int target) {
int serverOpArrayAppend(serverOpArray *oa, int dbid, robj **argv, int argc, int target, int slot) {
serverOp *op;
int prev_capacity = oa->capacity;
@ -3346,6 +3352,7 @@ int serverOpArrayAppend(serverOpArray *oa, int dbid, robj **argv, int argc, int
op->argv = argv;
op->argc = argc;
op->target = target;
op->slot = slot;
oa->numops++;
return oa->numops;
}
@ -3462,9 +3469,14 @@ struct serverCommand *lookupCommandOrOriginal(robj **argv, int argc) {
return cmd;
}
/* Determines if commands on this client are replicated from some source */
int isReplicatedClient(client *c) {
return c->flag.primary || (c->slot_migration_job && isImportSlotMigrationJob(c->slot_migration_job));
}
/* Commands arriving from the primary client or AOF client should never be rejected. */
int mustObeyClient(client *c) {
return c->id == CLIENT_ID_AOF || c->flag.primary;
return c->id == CLIENT_ID_AOF || isReplicatedClient(c);
}
static int shouldPropagate(int target) {
@ -3474,6 +3486,7 @@ static int shouldPropagate(int target) {
if (server.aof_state != AOF_OFF) return 1;
}
if (target & PROPAGATE_REPL) {
if (clusterIsAnySlotExporting()) return 1;
if (server.primary_host == NULL && (server.repl_backlog || listLength(server.replicas) != 0)) return 1;
}
@ -3486,7 +3499,7 @@ static int shouldPropagate(int target) {
* flags are an xor between:
* + PROPAGATE_NONE (no propagation of command at all)
* + PROPAGATE_AOF (propagate into the AOF file if is enabled)
* + PROPAGATE_REPL (propagate into the replication link)
* + PROPAGATE_REPL (propagate into replication links, including slot migration jobs)
*
* This is an internal low-level function and should not be called!
*
@ -3495,7 +3508,7 @@ static int shouldPropagate(int target) {
dbid value of -1 is saved to indicate that the caller does not want
* to replicate SELECT for this command (used for database neutral commands).
*/
static void propagateNow(int dbid, robj **argv, int argc, int target) {
static void propagateNow(int dbid, robj **argv, int argc, int target, int slot) {
if (!shouldPropagate(target)) return;
/* This needs to be unreachable since the dataset should be fixed during
@ -3522,8 +3535,20 @@ static void propagateNow(int dbid, robj **argv, int argc, int target) {
serverAssert(!isPausedActions(PAUSE_ACTION_REPLICA) || server.client_pause_in_transaction ||
server.server_del_keys_in_slot);
if (server.aof_state != AOF_OFF && target & PROPAGATE_AOF) feedAppendOnlyFile(dbid, argv, argc);
if (target & PROPAGATE_REPL) replicationFeedReplicas(dbid, argv, argc);
int propagate_to_aof = server.aof_state != AOF_OFF && target & PROPAGATE_AOF;
/* When AOF is on, we propagate to the replication stream even when there
 * are no replicas, to support the WAITAOF implementation. Otherwise we only
 * propagate when replicas (or a backlog) are attached. */
int propagate_to_repl = target & PROPAGATE_REPL;
if (propagate_to_repl && !propagate_to_aof) {
propagate_to_repl = server.primary_host == NULL && (server.repl_backlog || listLength(server.replicas) != 0);
}
int propagate_to_slot_migration = target & PROPAGATE_REPL && clusterIsAnySlotExporting();
if (propagate_to_aof) feedAppendOnlyFile(dbid, argv, argc);
if (propagate_to_repl) replicationFeedReplicas(dbid, argv, argc);
if (propagate_to_slot_migration) clusterFeedSlotExportJobs(dbid, argv, argc, slot);
}
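As a concrete reading of the logic above: with AOF off, one attached replica, and a slot currently being exported, a write to a key in that slot yields propagate_to_aof == 0, propagate_to_repl == 1 and propagate_to_slot_migration == 1, so the command is fed both to the ordinary replication stream and, via clusterFeedSlotExportJobs(), to the export job that owns the slot.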
/* Used inside commands to schedule the propagation of additional commands
@ -3537,18 +3562,22 @@ static void propagateNow(int dbid, robj **argv, int argc, int target) {
* so it is up to the caller to release the passed argv (but it is usually
* stack allocated). The function automatically increments ref count of
* passed objects, so the caller does not need to. */
void alsoPropagate(int dbid, robj **argv, int argc, int target) {
void alsoPropagate(int dbid, robj **argv, int argc, int target, int slot) {
robj **argvcopy;
int j;
if (!shouldPropagate(target)) return;
/* Don't propagate commands from slot migration clients; these are proxied
 * in replicationFeedStreamFromPrimaryStream() */
if (server.current_client != NULL && server.current_client->slot_migration_job) return;
argvcopy = zmalloc(sizeof(robj *) * argc);
for (j = 0; j < argc; j++) {
argvcopy[j] = argv[j];
incrRefCount(argv[j]);
}
serverOpArrayAppend(&server.also_propagate, dbid, argvcopy, argc, target);
serverOpArrayAppend(&server.also_propagate, dbid, argvcopy, argc, target, slot);
}
/* It is possible to call the function forceCommandPropagation() inside a
@ -3614,18 +3643,18 @@ static void propagatePendingCommands(void) {
if (transaction) {
/* We use dbid=-1 to indicate we do not want to replicate SELECT.
* It'll be inserted together with the next command (inside the MULTI) */
propagateNow(-1, &shared.multi, 1, PROPAGATE_AOF | PROPAGATE_REPL);
propagateNow(-1, &shared.multi, 1, PROPAGATE_AOF | PROPAGATE_REPL, -1);
}
for (j = 0; j < server.also_propagate.numops; j++) {
rop = &server.also_propagate.ops[j];
serverAssert(rop->target);
propagateNow(rop->dbid, rop->argv, rop->argc, rop->target);
propagateNow(rop->dbid, rop->argv, rop->argc, rop->target, rop->slot);
}
if (transaction) {
/* We use dbid=-1 to indicate we do not want to replicate select */
propagateNow(-1, &shared.exec, 1, PROPAGATE_AOF | PROPAGATE_REPL);
propagateNow(-1, &shared.exec, 1, PROPAGATE_AOF | PROPAGATE_REPL, -1);
}
serverOpArrayFree(&server.also_propagate);
@ -3885,7 +3914,7 @@ void call(client *c, int flags) {
/* Call alsoPropagate() only if at least one of AOF / replication
* propagation is needed. */
if (propagate_flags != PROPAGATE_NONE) alsoPropagate(c->db->id, c->argv, c->argc, propagate_flags);
if (propagate_flags != PROPAGATE_NONE) alsoPropagate(c->db->id, c->argv, c->argc, propagate_flags, c->slot);
}
/* Restore the old replication flags, since call() can be executed
@ -4299,6 +4328,11 @@ int processCommand(client *c) {
if (server.current_client == NULL) return C_ERR;
if (out_of_memory && is_denyoom_command) {
if (c->slot_migration_job != NULL) {
clusterHandleSlotMigrationClientOOM(c->slot_migration_job);
return C_ERR;
}
rejectCommand(c, shared.oomerr);
return C_OK;
}
@ -5938,6 +5972,8 @@ sds genValkeyInfoString(dict *section_dict, int all_sections, int everything) {
"mem_clients_slaves:%zu\r\n", mh->clients_replicas,
"mem_clients_normal:%zu\r\n", mh->clients_normal,
"mem_cluster_links:%zu\r\n", mh->cluster_links,
"mem_cluster_slot_import:%zu\r\n", mh->cluster_slot_import,
"mem_cluster_slot_export:%zu\r\n", mh->cluster_slot_export,
"mem_aof_buffer:%zu\r\n", mh->aof_buffer,
"mem_allocator:%s\r\n", ZMALLOC_LIB,
"mem_overhead_db_hashtable_rehashing:%zu\r\n", mh->overhead_db_hashtable_rehashing,
@ -6054,10 +6090,12 @@ sds genValkeyInfoString(dict *section_dict, int all_sections, int everything) {
"total_connections_received:%lld\r\n", server.stat_numconnections,
"total_commands_processed:%lld\r\n", server.stat_numcommands,
"instantaneous_ops_per_sec:%lld\r\n", getInstantaneousMetric(STATS_METRIC_COMMAND),
"total_net_input_bytes:%lld\r\n", server.stat_net_input_bytes + server.stat_net_repl_input_bytes + server.bio_stat_net_repl_input_bytes,
"total_net_output_bytes:%lld\r\n", server.stat_net_output_bytes + server.stat_net_repl_output_bytes,
"total_net_input_bytes:%lld\r\n", server.stat_net_input_bytes + server.stat_net_repl_input_bytes + server.bio_stat_net_repl_input_bytes + server.stat_net_cluster_slot_import_bytes,
"total_net_output_bytes:%lld\r\n", server.stat_net_output_bytes + server.stat_net_repl_output_bytes + server.stat_net_cluster_slot_export_bytes,
"total_net_repl_input_bytes:%lld\r\n", server.stat_net_repl_input_bytes + server.bio_stat_net_repl_input_bytes,
"total_net_repl_output_bytes:%lld\r\n", server.stat_net_repl_output_bytes,
"total_net_cluster_slot_import_bytes:%lld\r\n", server.stat_net_cluster_slot_import_bytes,
"total_net_cluster_slot_export_bytes:%lld\r\n", server.stat_net_cluster_slot_export_bytes,
"instantaneous_input_kbps:%.2f\r\n", (float)getInstantaneousMetric(STATS_METRIC_NET_INPUT) / 1024,
"instantaneous_output_kbps:%.2f\r\n", (float)getInstantaneousMetric(STATS_METRIC_NET_OUTPUT) / 1024,
"instantaneous_input_repl_kbps:%.2f\r\n", (float)getInstantaneousMetric(STATS_METRIC_NET_INPUT_REPLICATION) / 1024,

src/server.h

@ -180,15 +180,17 @@ struct hdr_histogram;
#define RIO_CONNSET_WRITE_MAX_CHUNK_SIZE 16384
/* Instantaneous metrics tracking. */
#define STATS_METRIC_SAMPLES 16 /* Number of samples per metric. */
#define STATS_METRIC_COMMAND 0 /* Number of commands executed. */
#define STATS_METRIC_NET_INPUT 1 /* Bytes read to network. */
#define STATS_METRIC_NET_OUTPUT 2 /* Bytes written to network. */
#define STATS_METRIC_NET_INPUT_REPLICATION 3 /* Bytes read to network during replication. */
#define STATS_METRIC_NET_OUTPUT_REPLICATION 4 /* Bytes written to network during replication. */
#define STATS_METRIC_EL_CYCLE 5 /* Number of eventloop cycled. */
#define STATS_METRIC_EL_DURATION 6 /* Eventloop duration. */
#define STATS_METRIC_COUNT 7
#define STATS_METRIC_SAMPLES 16 /* Number of samples per metric. */
typedef enum {
STATS_METRIC_COMMAND = 0, /* Number of commands executed. */
STATS_METRIC_NET_INPUT, /* Bytes read to network. */
STATS_METRIC_NET_OUTPUT, /* Bytes written to network. */
STATS_METRIC_NET_INPUT_REPLICATION, /* Bytes read to network during replication. */
STATS_METRIC_NET_OUTPUT_REPLICATION, /* Bytes written to network during replication. */
STATS_METRIC_EL_CYCLE, /* Number of eventloop cycles. */
STATS_METRIC_EL_DURATION, /* Eventloop duration. */
STATS_METRIC_COUNT /* Total count */
} instantaneous_metric_type;
/* Protocol and I/O related defines */
#define PROTO_IOBUF_LEN (1024 * 16) /* Generic I/O buffer size */
@ -377,14 +379,16 @@ typedef enum blocking_type {
/* Client classes for client limits, currently used only for
* the max-client-output-buffer limit implementation. */
#define CLIENT_TYPE_NORMAL 0 /* Normal req-reply clients + MONITORs */
#define CLIENT_TYPE_REPLICA 1 /* Replicas. */
#define CLIENT_TYPE_PUBSUB 2 /* Clients subscribed to PubSub channels. */
#define CLIENT_TYPE_PRIMARY 3 /* Primary. */
#define CLIENT_TYPE_COUNT 4 /* Total number of client types. */
#define CLIENT_TYPE_OBUF_COUNT 3 /* Number of clients to expose to output \
buffer configuration. Just the first \
three: normal, replica, pubsub. */
#define CLIENT_TYPE_NORMAL 0 /* Normal req-reply clients + MONITORs */
#define CLIENT_TYPE_REPLICA 1 /* Replicas. */
#define CLIENT_TYPE_PUBSUB 2 /* Clients subscribed to PubSub channels. */
#define CLIENT_TYPE_PRIMARY 3 /* Primary. */
#define CLIENT_TYPE_SLOT_IMPORT 4 /* Slot import. */
#define CLIENT_TYPE_SLOT_EXPORT 5 /* Slot export. */
#define CLIENT_TYPE_COUNT 6 /* Total number of client types. */
#define CLIENT_TYPE_OBUF_COUNT 3 /* Number of clients to expose to output \
buffer configuration. Just the first \
three: normal, replica, pubsub. */
/* Type of commandlog */
typedef enum {
@ -632,6 +636,7 @@ typedef enum {
PAUSE_BY_CLIENT_COMMAND = 0,
PAUSE_DURING_SHUTDOWN,
PAUSE_DURING_FAILOVER,
PAUSE_DURING_SLOT_MIGRATION,
NUM_PAUSE_PURPOSES /* This value is the number of purposes above. */
} pause_purpose;
@ -1237,6 +1242,9 @@ typedef struct LastWrittenBuf {
* This length differs from bufpos in case of copy avoidance */
} LastWrittenBuf;
/* Forward declaration of slotMigrationJob */
typedef struct slotMigrationJob slotMigrationJob;
typedef struct client {
/* Basic client information and connection. */
uint64_t id; /* Client incremental unique ID. */
@ -1260,11 +1268,12 @@ typedef struct client {
time_t last_interaction; /* Time of the last interaction, used for timeout */
serverDb *db; /* Pointer to currently SELECTed DB. */
/* Client state structs. */
ClientPubSubData *pubsub_data; /* Required for: pubsub commands and tracking. lazily initialized when first needed */
ClientReplicationData *repl_data; /* Required for Replication operations. lazily initialized when first needed */
ClientModuleData *module_data; /* Required for Module operations. lazily initialized when first needed */
multiState *mstate; /* MULTI/EXEC state, lazily initialized when first needed */
blockingState *bstate; /* Blocking state, lazily initialized when first needed */
ClientPubSubData *pubsub_data; /* Required for: pubsub commands and tracking. lazily initialized when first needed */
ClientReplicationData *repl_data; /* Required for Replication operations. lazily initialized when first needed */
ClientModuleData *module_data; /* Required for Module operations. lazily initialized when first needed */
multiState *mstate; /* MULTI/EXEC state, lazily initialized when first needed */
blockingState *bstate; /* Blocking state, lazily initialized when first needed */
slotMigrationJob *slot_migration_job; /* Pointer to the slot migration job, or NULL. */
/* Output buffer and reply handling */
long duration; /* Current command duration. Used for measuring latency of blocking/non-blocking cmds */
char *buf; /* Output buffer */
@ -1387,7 +1396,7 @@ struct sharedObjectsStruct {
*mbulkhdr[OBJ_SHARED_BULKHDR_LEN], /* "*<value>\r\n" */
*bulkhdr[OBJ_SHARED_BULKHDR_LEN], /* "$<value>\r\n" */
*maphdr[OBJ_SHARED_BULKHDR_LEN], /* "%<value>\r\n" */
*sethdr[OBJ_SHARED_BULKHDR_LEN]; /* "~<value>\r\n" */
*sethdr[OBJ_SHARED_BULKHDR_LEN];
sds minstring, maxstring;
};
@ -1433,7 +1442,7 @@ extern clientBufferLimitsConfig clientBufferLimitsDefaults[CLIENT_TYPE_OBUF_COUN
* after the propagation of the executed command. */
typedef struct serverOp {
robj **argv;
int argc, dbid, target;
int argc, dbid, target, slot;
} serverOp;
/* Defines an array of Operations. There is an API to add to this
@ -1458,6 +1467,8 @@ struct serverMemOverhead {
size_t clients_replicas;
size_t clients_normal;
size_t cluster_links;
size_t cluster_slot_import;
size_t cluster_slot_export;
size_t aof_buffer;
size_t lua_caches;
size_t functions_caches;
@ -1791,6 +1802,8 @@ struct valkeyServer {
long long stat_net_repl_input_bytes; /* Bytes read during replication, added to stat_net_input_bytes in 'info'. */
/* Bytes written during replication, added to stat_net_output_bytes in 'info'. */
long long stat_net_repl_output_bytes;
long long stat_net_cluster_slot_import_bytes; /* Bytes read from slot import sources. */
long long stat_net_cluster_slot_export_bytes; /* Bytes written to slot export targets. */
size_t stat_current_cow_peak; /* Peak size of copy on write bytes. */
size_t stat_current_cow_bytes; /* Copy on write bytes while child is active. */
monotime stat_current_cow_updated; /* Last update time of stat_current_cow_bytes */
@ -2176,12 +2189,20 @@ struct valkeyServer {
* the cluster after it is forgotten with CLUSTER FORGET. */
int cluster_slot_stats_enabled; /* Cluster slot usage statistics tracking enabled. */
mstime_t cluster_mf_timeout; /* Milliseconds to do a manual failover. */
unsigned long cluster_slot_migration_log_max_len; /* Maximum number of migrations to keep in the
                                                   * migration log; once it is exceeded, finished
                                                   * migrations are cleared. */
ssize_t slot_migration_max_failover_repl_bytes; /* Maximum amount of in flight bytes for a slot migration
* failover to be attempted. */
/* Debug config that goes along with cluster_drop_packet_filter. When set, the link is closed on packet drop. */
uint32_t debug_cluster_close_link_on_packet_drop : 1;
/* Debug config to control the random ping. When set, we will disable the random ping in clusterCron. */
uint32_t debug_cluster_disable_random_ping : 1;
/* Debug config to control the reconnection. When set, we will disable the reconnection in clusterCron. */
uint32_t debug_cluster_disable_reconnection : 1;
/* Debug config to expose intermediary slot migration states. */
uint32_t debug_slot_migration_prevent_pause : 1;
uint32_t debug_slot_migration_prevent_failover : 1;
sds cached_cluster_slot_info[CACHE_CONN_TYPE_MAX]; /* Index in array is a bitwise or of CACHE_CONN_TYPE_* */
/* Scripting */
mstime_t busy_reply_threshold; /* Script / module timeout in milliseconds */
@ -2712,12 +2733,12 @@ void dictVanillaFree(void *val);
#define READ_FLAGS_ERROR_BIG_BULK_COUNT (1 << 6)
#define READ_FLAGS_ERROR_MBULK_UNEXPECTED_CHARACTER (1 << 7)
#define READ_FLAGS_ERROR_MBULK_INVALID_BULK_LEN (1 << 8)
#define READ_FLAGS_ERROR_UNEXPECTED_INLINE_FROM_PRIMARY (1 << 9)
#define READ_FLAGS_ERROR_UNEXPECTED_INLINE_FROM_REPLICATED_CLIENT (1 << 9)
#define READ_FLAGS_ERROR_UNBALANCED_QUOTES (1 << 10)
#define READ_FLAGS_INLINE_ZERO_QUERY_LEN (1 << 11)
#define READ_FLAGS_PARSING_NEGATIVE_MBULK_LEN (1 << 12)
#define READ_FLAGS_PARSING_COMPLETED (1 << 13)
#define READ_FLAGS_PRIMARY (1 << 14)
#define READ_FLAGS_REPLICATED (1 << 14)
#define READ_FLAGS_DONT_PARSE (1 << 15)
#define READ_FLAGS_AUTH_REQUIRED (1 << 16)
#define READ_FLAGS_COMMAND_NOT_FOUND (1 << 17)
@ -2831,6 +2852,7 @@ void unpauseActions(pause_purpose purpose);
uint32_t isPausedActions(uint32_t action_bitmask);
uint32_t isPausedActionsWithUpdate(uint32_t action_bitmask);
char *getPausedReason(pause_purpose purpose);
uint32_t getPausedActionsWithPurpose(pause_purpose purpose);
mstime_t getPausedActionTimeout(uint32_t action, pause_purpose *purpose);
void updatePausedActions(void);
void unblockPostponedClients(void);
@ -3055,11 +3077,16 @@ void clearFailoverState(void);
void updateFailoverStatus(void);
void abortFailover(const char *err);
const char *getFailoverStateString(void);
sds getReplicaPortString(void);
int sendCurrentOffsetToReplica(client *replica);
void addRdbReplicaToPsyncWait(client *replica);
void initClientReplicationData(client *c);
void freeClientReplicationData(client *c);
void replicaReceiveRDBFromPrimaryToDisk(connection *conn, int is_dual_channel);
sds replicationSendAuth(connection *conn);
sds receiveSynchronousResponse(connection *conn);
ConnectionType *connTypeOfReplication(void);
robj *generateSelectCommand(int dictid);
/* Generic persistence functions */
void startLoadingFile(size_t size, char *filename, int rdbflags);
@ -3099,6 +3126,7 @@ void aofOpenIfNeededOnServerStart(void);
void aofManifestFree(aofManifest *am);
int aofDelHistoryFiles(void);
int aofRewriteLimited(void);
int rewriteSlotToAppendOnlyFileRio(rio *aof, int db_num, int hashslot, size_t *key_count);
/* Child info */
void openChildInfoPipe(void);
@ -3270,7 +3298,7 @@ int commandCheckArity(struct serverCommand *cmd, int argc, sds *err);
void startCommandExecution(void);
int incrCommandStatsOnError(struct serverCommand *cmd, int flags);
void call(client *c, int flags);
void alsoPropagate(int dbid, robj **argv, int argc, int target);
void alsoPropagate(int dbid, robj **argv, int argc, int target, int slot);
void postExecutionUnitOperations(void);
void serverOpArrayFree(serverOpArray *oa);
void forceCommandPropagation(client *c, int flags);
@ -3283,6 +3311,7 @@ int prepareForShutdown(client *c, int flags);
void replyToClientsBlockedOnShutdown(void);
int abortShutdown(void);
void afterCommand(client *c);
int isReplicatedClient(client *c);
int mustObeyClient(client *c);
#ifdef __GNUC__
void _serverLog(int level, const char *fmt, ...) __attribute__((format(printf, 2, 3)));
@ -3330,12 +3359,14 @@ int getKeySlot(sds key);
int calculateKeySlot(sds key);
/* kvstore wrappers */
int getKVStoreIndexForKey(sds key);
int dbExpand(serverDb *db, uint64_t db_size, int try_expand);
int dbExpandExpires(serverDb *db, uint64_t db_size, int try_expand);
robj *dbFind(serverDb *db, sds key);
robj *dbFindExpires(serverDb *db, sds key);
robj *dbFindExpiresWithDictIndex(serverDb *db, sds key, int dict_index);
unsigned long long dbSize(serverDb *db);
unsigned long long dbScan(serverDb *db, unsigned long long cursor, hashtableScanFunction scan_cb, void *privdata);
unsigned long long dbScan(serverDb *db, unsigned long long cursor, kvstoreScanFunction scan_cb, void *privdata);
/* Set data type */
robj *setTypeCreate(sds value, size_t size_hint);
@ -3500,9 +3531,10 @@ int setModuleNumericConfig(ModuleConfig *config, long long val, const char **err
/* db.c -- Keyspace access API */
int removeExpire(serverDb *db, robj *key);
void deleteExpiredKeyAndPropagate(serverDb *db, robj *keyobj);
void deleteExpiredKeyAndPropagateWithDictIndex(serverDb *db, robj *keyobj, int dict_index);
void deleteExpiredKeyFromOverwriteAndPropagate(client *c, robj *keyobj);
void propagateDeletion(serverDb *db, robj *key, int lazy);
size_t dbReclaimExpiredFields(robj *o, serverDb *db, mstime_t now, unsigned long max_entries);
void propagateDeletion(serverDb *db, robj *key, int lazy, int slot);
size_t dbReclaimExpiredFields(robj *o, serverDb *db, mstime_t now, unsigned long max_entries, int didx);
int keyIsExpired(serverDb *db, robj *key);
long long getExpire(serverDb *db, robj *key);
robj *setExpire(client *c, serverDb *db, robj *key, long long when);
@ -3544,6 +3576,7 @@ robj *dbUnshareStringValue(serverDb *db, robj *key, robj *o);
#define EMPTYDB_NO_FLAGS 0 /* No flags. */
#define EMPTYDB_ASYNC (1 << 0) /* Reclaim memory in another thread. */
#define EMPTYDB_NOFUNCTIONS (1 << 1) /* Indicate not to flush the functions. */
typedef int(emptyDataHashtableFilter)(int didx);
long long emptyData(int dbnum, int flags, void(callback)(hashtable *));
long long emptyDbStructure(serverDb **dbarray, int dbnum, int async, void(callback)(hashtable *));
void resetDbExpiryState(serverDb *db);
@ -3898,6 +3931,7 @@ void sunsubscribeCommand(client *c);
void watchCommand(client *c);
void unwatchCommand(client *c);
void clusterCommand(client *c);
void clusterFlushslotCommand(client *c);
void clusterSlotStatsCommand(client *c);
void restoreCommand(client *c);
void migrateCommand(client *c);

src/t_set.c

@ -833,7 +833,7 @@ void spopWithCountCommand(client *c) {
}
/* Replicate/AOF this command as an SREM operation */
if (propindex == 2 + batchsize) {
alsoPropagate(c->db->id, propargv, propindex, PROPAGATE_AOF | PROPAGATE_REPL);
alsoPropagate(c->db->id, propargv, propindex, PROPAGATE_AOF | PROPAGATE_REPL, c->slot);
for (unsigned long j = 2; j < propindex; j++) {
decrRefCount(propargv[j]);
}
@ -855,7 +855,7 @@ void spopWithCountCommand(client *c) {
propindex++;
/* Replicate/AOF this command as an SREM operation */
if (propindex == 2 + batchsize) {
alsoPropagate(c->db->id, propargv, propindex, PROPAGATE_AOF | PROPAGATE_REPL);
alsoPropagate(c->db->id, propargv, propindex, PROPAGATE_AOF | PROPAGATE_REPL, c->slot);
for (unsigned long j = 2; j < propindex; j++) {
decrRefCount(propargv[j]);
}
@ -916,7 +916,7 @@ void spopWithCountCommand(client *c) {
}
/* Replicate/AOF this command as an SREM operation */
if (propindex == 2 + batchsize) {
alsoPropagate(c->db->id, propargv, propindex, PROPAGATE_AOF | PROPAGATE_REPL);
alsoPropagate(c->db->id, propargv, propindex, PROPAGATE_AOF | PROPAGATE_REPL, c->slot);
for (unsigned long i = 2; i < propindex; i++) {
decrRefCount(propargv[i]);
}
@ -931,7 +931,7 @@ void spopWithCountCommand(client *c) {
/* Replicate/AOF the remaining elements as an SREM operation */
if (propindex != 2) {
alsoPropagate(c->db->id, propargv, propindex, PROPAGATE_AOF | PROPAGATE_REPL);
alsoPropagate(c->db->id, propargv, propindex, PROPAGATE_AOF | PROPAGATE_REPL, c->slot);
for (unsigned long i = 2; i < propindex; i++) {
decrRefCount(propargv[i]);
}

src/t_stream.c

@ -1544,7 +1544,7 @@ void streamPropagateXCLAIM(client *c, robj *key, streamCG *group, robj *groupnam
argv[12] = shared.lastid;
argv[13] = createObjectFromStreamID(&group->last_id);
alsoPropagate(c->db->id, argv, 14, PROPAGATE_AOF | PROPAGATE_REPL);
alsoPropagate(c->db->id, argv, 14, PROPAGATE_AOF | PROPAGATE_REPL, c->slot);
decrRefCount(argv[3]);
decrRefCount(argv[7]);
@ -1568,7 +1568,7 @@ void streamPropagateGroupID(client *c, robj *key, streamCG *group, robj *groupna
argv[5] = shared.entriesread;
argv[6] = createStringObjectFromLongLong(group->entries_read);
alsoPropagate(c->db->id, argv, 7, PROPAGATE_AOF | PROPAGATE_REPL);
alsoPropagate(c->db->id, argv, 7, PROPAGATE_AOF | PROPAGATE_REPL, c->slot);
decrRefCount(argv[4]);
decrRefCount(argv[6]);
@ -1588,7 +1588,7 @@ void streamPropagateConsumerCreation(client *c, robj *key, robj *groupname, sds
argv[3] = groupname;
argv[4] = createObject(OBJ_STRING, sdsdup(consumername));
alsoPropagate(c->db->id, argv, 5, PROPAGATE_AOF | PROPAGATE_REPL);
alsoPropagate(c->db->id, argv, 5, PROPAGATE_AOF | PROPAGATE_REPL, c->slot);
decrRefCount(argv[4]);
}


@ -5,24 +5,6 @@ proc log_file_matches {log pattern} {
string match $pattern $content
}
proc get_client_id_by_last_cmd {r cmd} {
set client_list [$r client list]
set client_id ""
set lines [split $client_list "\n"]
foreach line $lines {
if {[string match *cmd=$cmd* $line]} {
set parts [split $line " "]
foreach part $parts {
if {[string match id=* $part]} {
set client_id [lindex [split $part "="] 1]
return $client_id
}
}
}
}
return $client_id
}
# Wait until the process enters a paused state.
proc wait_process_paused idx {
set pid [srv $idx pid]

tests/support/util.tcl

@ -652,7 +652,12 @@ proc populate {num {prefix key:} {size 3} {idx 0} {prints false} {expires 0}} {
set val [string repeat A $size]
for {set j 0} {$j < $pipeline} {incr j} {
if {$expires > 0} {
r $idx set $prefix$j $val ex $expires
if {$expires < 1} {
set pexpires [expr int($expires * 1000)]
r $idx set $prefix$j $val px $pexpires
} else {
r $idx set $prefix$j $val ex $expires
}
} else {
r $idx set $prefix$j $val
}
@ -660,7 +665,12 @@ proc populate {num {prefix key:} {size 3} {idx 0} {prints false} {expires 0}} {
}
for {} {$j < $num} {incr j} {
if {$expires > 0} {
r $idx set $prefix$j $val ex $expires
if {$expires < 1} {
set pexpires [expr int($expires * 1000)]
r $idx set $prefix$j $val px $pexpires
} else {
r $idx set $prefix$j $val ex $expires
}
} else {
r $idx set $prefix$j $val
}
@ -1231,6 +1241,24 @@ proc generate_largevalue_test_array {} {
return [array get largevalue]
}
proc get_client_id_by_last_cmd {r cmd} {
set client_list [$r client list]
set client_id ""
set lines [split $client_list "\n"]
foreach line $lines {
if {[string match *cmd=$cmd* $line]} {
set parts [split $line " "]
foreach part $parts {
if {[string match id=* $part]} {
set client_id [lindex [split $part "="] 1]
return $client_id
}
}
}
}
return $client_id
}
# Breakpoint function, which invokes a minimal debugger.
# This function can be placed within the desired Tcl tests for debugging purposes.
#

(Diff for one file suppressed because it is too large.)

utils/req-res-log-validator.py

@ -58,6 +58,11 @@ IGNORED_COMMANDS = {
# Commands to which we decided not write a reply schema
"pfdebug",
"lolwut",
# Slot migration commands are not tested for RC1
"cluster|syncslots",
"cluster|cancelslotmigrations",
"cluster|getslotmigrations",
"cluster|migrateslots",
}
class Request(object):

valkey.conf

@ -1891,7 +1891,8 @@ aof-timestamp-enabled no
# The timeout in milliseconds for cluster manual failover. If a manual failover
# does not complete within the specified time, both the replica and the primary
# will abort it.
# will abort it. Note that this timeout is also used for the finalization of
# migrations initiated with the CLUSTER MIGRATESLOTS command.
#
# A manual failover is a special kind of failover that is usually executed when
# there are no actual failures, and we wish to swap the current primary with one
@ -1994,6 +1995,23 @@ aof-timestamp-enabled no
#
# cluster-slot-stats-enabled no
# Slot migrations using the CLUSTER MIGRATESLOTS command will generate an in-memory migration log on both
# the source and target nodes of the migration. These can be observed with CLUSTER GETSLOTMIGRATIONS.
# 'cluster-slot-migration-log-max-len' sets the maximum length of this log. Only completed
# migrations are considered for removal.
#
# cluster-slot-migration-log-max-len 1000
# During the CLUSTER MIGRATESLOTS command execution, the source node needs to pause itself and allow all
# writes to be fully processed by the target node. The amount of data remaining in the buffer on the
# source node when this pause happens will affect how long this pause takes.
# 'slot-migration-max-failover-repl-bytes' allows the pause to wait until there are at most this
# many bytes in the output buffer. Setting this to -1 disables the limit, and 0 requires that
# no data remain in the source output buffer (although this does not guarantee the data has
# been fully received by the target).
#
# slot-migration-max-failover-repl-bytes 0
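
# For example, to allow the final pause to begin while up to 1 MB is still
# queued for the target (an illustrative value, to be tuned per workload):
#
# slot-migration-max-failover-repl-bytes 1048576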
# In order to set up your cluster make sure to read the documentation
# available at the https://valkey.io web site.