valkey/utils
Jacob Murphy d7993b78d8
Introduce atomic slot migration (#1949)
Introduces a new family of commands for migrating slots via replication.
The procedure is driven by the source node which pushes an AOF formatted
snapshot of the slots to the target, followed by a replication stream of
changes on that slot (a la manual failover).

This solution is an adaptation of the solution provided by
@enjoy-binbin, combined with the solution I previously posted at #1591,
modified to meet the designs we had outlined in #23.

## New commands

* `CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE
node-id`: Begin sending the given slot ranges to the target node via
replication. Multiple targets can be specified by repeating
`SLOTSRANGE ... NODE ...`
* `CLUSTER CANCELMIGRATION ALL`: Cancel all slot migrations
* `CLUSTER GETSLOTMIGRATIONS`: See a recent log of migrations

This PR only implements "one shot" semantics with an asynchronous model.
Later, "two phase" (e.g. slot level replicate/failover commands) can be
added with the same core.
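
As a rough illustration of the `CLUSTER MIGRATESLOTS` syntax above, the
sketch below assembles the argument vector for one or more targets. The
helper name `build_migrateslots_args` is hypothetical, and the 0..16383
bound assumes the standard 16384 cluster hash slots; the server-side
argument validation is not shown in this PR description.

```python
from typing import Iterable, List, Sequence, Tuple

SlotRange = Tuple[int, int]  # inclusive (start, end)


def build_migrateslots_args(
    targets: Iterable[Tuple[Sequence[SlotRange], str]],
) -> List[str]:
    """Build argv for CLUSTER MIGRATESLOTS (hypothetical helper).

    Each target is (slot_ranges, node_id) and becomes one
    `SLOTSRANGE start end [start end]... NODE node-id` group.
    """
    args = ["CLUSTER", "MIGRATESLOTS"]
    for ranges, node_id in targets:
        if not ranges:
            raise ValueError("each target needs at least one slot range")
        args.append("SLOTSRANGE")
        for start, end in ranges:
            # Assumption: slots are numbered 0..16383.
            if not (0 <= start <= end <= 16383):
                raise ValueError(f"invalid slot range {start}-{end}")
            args += [str(start), str(end)]
        args += ["NODE", node_id]
    return args
```

The resulting list can be handed to any client that accepts a raw
command as a sequence of strings.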

## Slot migration jobs

Introduces the concept of a slot migration job. While active, a job
tracks a connection created by the source to the target over which the
contents of the slots are sent. This connection is used for control
messages as well as replicated slot data. Each job is given a 40
character random name to help uniquely identify it.
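
For illustration only, a 40 character random name can be produced as
below. The hex alphabet and the function name `new_job_name` are
assumptions; the description above only states that the name is random
and 40 characters long.

```python
import secrets


def new_job_name() -> str:
    """Return a random 40 character name (hypothetical helper).

    20 random bytes rendered as hexadecimal give exactly 40 characters.
    The hex alphabet is an assumption, not taken from the PR.
    """
    return secrets.token_hex(20)
```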

All jobs, including those that finished recently, can be observed using
the `CLUSTER GETSLOTMIGRATIONS` command.

## Replication

* Since the snapshot uses AOF, the snapshot can be replayed verbatim to
any replicas of the target node.
* We use the same proxying mechanism used for chaining replication to
copy the content sent by the source node directly to the replica nodes.

## `CLUSTER SYNCSLOTS`

To coordinate the state machine transitions across the two nodes, a new
command is added, `CLUSTER SYNCSLOTS`, that performs this control flow.

Each end of the slot migration connection is expected to install a read
handler in order to handle `CLUSTER SYNCSLOTS` commands:

* `ESTABLISH`: Begins a slot migration. Provides slot migration
information to the target and authorizes the connection to write to
unowned slots.
* `SNAPSHOT-EOF`: appended to the end of the snapshot to signal that the
snapshot is done being written to the target.
* `PAUSE`: informs the source node to pause whenever it gets the
opportunity
* `PAUSED`: added to the end of the client output buffer when the pause
is performed. The pause is only performed after the buffer shrinks below
a configurable size
* `REQUEST-FAILOVER`: request the source to either grant or deny a
failover for the slot migration. The failover is only granted if the
target is still paused. Once a failover is granted, the pause is
refreshed for a short duration
* `FAILOVER-GRANTED`: sent to the target to inform that REQUEST-FAILOVER
is granted
* `ACK`: heartbeat command used to ensure liveness
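
Since both ends handle these as `CLUSTER SYNCSLOTS` commands, a control
message can be pictured as an ordinary RESP array of bulk strings. The
sketch below encodes the heartbeat that way. It assumes the messages use
the regular command encoding on the migration connection, and it only
shows `ACK` because the argument lists of the other subcommands are not
given here. `encode_command` is a hypothetical helper.

```python
def encode_command(*parts: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(parts)]
    for part in parts:
        data = part.encode()
        # Each argument is a bulk string: $<length>\r\n<bytes>\r\n
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


# Heartbeat message as it would appear on the wire under this assumption.
ack = encode_command("CLUSTER", "SYNCSLOTS", "ACK")
```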

## Interactions with other commands

* FLUSHDB on the source node (which flushes the migrating slot) will
result in the source dropping the connection, which will flush the slot
on the target and reset the state machine back to the beginning. The
subsequent retry should succeed very quickly, since the slot is now empty
* FLUSHDB on the target will fail the slot migration. We can iterate
with better handling, but for now it is expected that the operator would
retry.
* Generally, FLUSHDB is expected to be executed cluster wide, so
preserving partially migrated slots doesn't make much sense
* SCAN and KEYS are filtered to avoid exposing importing slot data
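
The effect of the SCAN/KEYS filtering can be modelled as dropping every
key whose hash slot is currently being imported. The sketch below is a
stand-alone model of that effect, not the server code: `key_slot` follows
the usual cluster key hashing (CRC16/XMODEM modulo 16384, with hash
tags), and `visible_keys` is a hypothetical helper.

```python
import binascii
from typing import Iterable, List, Set


def key_slot(key: bytes) -> int:
    """Map a key to its cluster hash slot, honouring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the braces is used for hashing.
        if end != -1 and end > start + 1:
            key = key[start + 1:end]
    # crc_hqx with an initial value of 0 is CRC16/XMODEM.
    return binascii.crc_hqx(key, 0) % 16384


def visible_keys(keys: Iterable[bytes], importing_slots: Set[int]) -> List[bytes]:
    """Hide keys that live in slots still being imported (model only)."""
    return [k for k in keys if key_slot(k) not in importing_slots]
```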

## Error handling

* For any transient connection drop, the migration fails and the user
must retry.
* If there is an OOM while reading from the import connection, we will
fail the import, which will drop the importing slot data
* If there is a client output buffer limit reached on the source node,
it will drop the connection, which will cause the migration to fail
* If at any point the exporting node loses ownership of the slot or
either node is failed over, a callback will be triggered on both ends of
the migration to fail the import. The import will not reattempt with a
new owner
* The two ends of the migration are routinely pinging each other with
SYNCSLOTS ACK messages. If at any point there is no interaction on the
connection for longer than `repl-timeout`, the connection will be
dropped, resulting in migration failure
* If a failover happens, we will drop keys in all unowned slots. The
migration does not persist through failovers and would need to be
retried on the new source/target.
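
The liveness rule above can be sketched as a timestamp that is refreshed
on every message and compared against `repl-timeout`. The class and
method names are hypothetical, and the clock is passed in explicitly so
the logic is easy to follow; this is a model of the rule, not the
server's implementation.

```python
from dataclasses import dataclass


@dataclass
class MigrationLink:
    """Tracks the last interaction on a slot migration connection."""

    last_interaction: float  # seconds, from any monotonic clock
    repl_timeout: float      # seconds, mirrors the `repl-timeout` config

    def on_message(self, now: float) -> None:
        # Any traffic, including SYNCSLOTS ACK, counts as an interaction.
        self.last_interaction = now

    def timed_out(self, now: float) -> bool:
        # Drop the connection once the silence exceeds repl-timeout.
        return now - self.last_interaction > self.repl_timeout
```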

## State machine

```
                                                                            
                Target/Importing Node State Machine                         
   ─────────────────────────────────────────────────────────────            
                                                                            
             ┌────────────────────┐
             │SLOT_IMPORT_WAIT_ACK┼──────┐
             └──────────┬─────────┘      │
                     ACK│                │
         ┌──────────────▼─────────────┐  │
         │SLOT_IMPORT_RECEIVE_SNAPSHOT┼──┤
         └──────────────┬─────────────┘  │
            SNAPSHOT-EOF│                │                                  
        ┌───────────────▼──────────────┐ │                                  
        │SLOT_IMPORT_WAITING_FOR_PAUSED┼─┤                                  
        └───────────────┬──────────────┘ │                                  
                  PAUSED│                │                                  
        ┌───────────────▼──────────────┐ │ Error Conditions:                
        │SLOT_IMPORT_FAILOVER_REQUESTED┼─┤  1. OOM                          
        └───────────────┬──────────────┘ │  2. Slot Ownership Change        
        FAILOVER-GRANTED│                │  3. Demotion to replica          
         ┌──────────────▼─────────────┐  │  4. FLUSHDB                      
         │SLOT_IMPORT_FAILOVER_GRANTED┼──┤  5. Connection Lost              
         └──────────────┬─────────────┘  │  6. No ACK from source (timeout) 
      Takeover Performed│                │                                  
         ┌──────────────▼───────────┐    │                                  
         │SLOT_MIGRATION_JOB_SUCCESS┼────┤                                  
         └──────────────────────────┘    │                                  
                                         │                                  
   ┌─────────────────────────────────────▼─┐                                
   │SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP│                                
   └────────────────────┬──────────────────┘                                
Unowned Slots Cleaned Up│                                                   
          ┌─────────────▼───────────┐                                      
          │SLOT_MIGRATION_JOB_FAILED│                                      
          └─────────────────────────┘                                      

                                                                                           
                                                                                           
                      Source/Exporting Node State Machine                                  
         ─────────────────────────────────────────────────────────────                     
                                                                                           
               ┌──────────────────────┐                                                    
               │SLOT_EXPORT_CONNECTING├─────────┐                                          
               └───────────┬──────────┘         │                                          
                  Connected│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_AUTHENTICATING┼───────┤                                          
             └─────────────┬────────────┘       │                                          
              Authenticated│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_SEND_ESTABLISH┼───────┤                                          
             └─────────────┬────────────┘       │                                          
  ESTABLISH command written│                    │                                          
     ┌─────────────────────▼─────────────┐      │                                          
     │SLOT_EXPORT_READ_ESTABLISH_RESPONSE┼──────┤                                          
     └─────────────────────┬─────────────┘      │                                          
   Full response read (+OK)│                    │                                          
          ┌────────────────▼──────────────┐     │ Error Conditions:                        
          │SLOT_EXPORT_WAITING_TO_SNAPSHOT┼─────┤  1. User sends CANCELMIGRATION           
          └────────────────┬──────────────┘     │  2. Slot ownership change                
     No other child process│                    │  3. Demotion to replica                  
              ┌────────────▼───────────┐        │  4. FLUSHDB                              
              │SLOT_EXPORT_SNAPSHOTTING┼────────┤  5. Connection Lost                      
              └────────────┬───────────┘        │  6. AUTH failed                          
              Snapshot done│                    │  7. ERR from ESTABLISH command           
               ┌───────────▼─────────┐          │  8. Unpaused before failover completed   
               │SLOT_EXPORT_STREAMING┼──────────┤  9. Snapshot failed (e.g. Child OOM)     
               └───────────┬─────────┘          │  10. No ack from target (timeout)        
                      PAUSE│                    │  11. Client output buffer overrun        
            ┌──────────────▼─────────────┐      │                                          
            │SLOT_EXPORT_WAITING_TO_PAUSE┼──────┤                                          
            └──────────────┬─────────────┘      │                                          
             Buffer drained│                    │                                          
            ┌──────────────▼────────────┐       │                                          
            │SLOT_EXPORT_FAILOVER_PAUSED┼───────┤                                          
            └──────────────┬────────────┘       │                                          
   Failover request granted│                    │                                          
           ┌───────────────▼────────────┐       │                                          
           │SLOT_EXPORT_FAILOVER_GRANTED┼───────┤                                          
           └───────────────┬────────────┘       │                                          
      New topology received│                    │                                          
            ┌──────────────▼───────────┐        │                                          
            │SLOT_MIGRATION_JOB_SUCCESS│        │                                          
            └──────────────────────────┘        │                                          
                                                │                                          
            ┌─────────────────────────┐         │                                          
            │SLOT_MIGRATION_JOB_FAILED│◄────────┤                                          
            └─────────────────────────┘         │                                          
                                                │                                          
           ┌────────────────────────────┐       │                                          
           │SLOT_MIGRATION_JOB_CANCELLED│◄──────┘                                          
           └────────────────────────────┘                                                 
```
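
The target-side diagram can also be read as a transition table. The
sketch below encodes only the labelled edges. The event names
`TAKEOVER-PERFORMED`, `CLEANUP-DONE` and `ERROR` are made-up labels for
the unnamed transitions, and treating `SLOT_MIGRATION_JOB_SUCCESS` as
terminal is a simplification of the diagram.

```python
CLEANUP = "SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP"
SUCCESS = "SLOT_MIGRATION_JOB_SUCCESS"
FAILED = "SLOT_MIGRATION_JOB_FAILED"

# (current state, event) -> next state, following the target diagram.
TARGET_TRANSITIONS = {
    ("SLOT_IMPORT_WAIT_ACK", "ACK"): "SLOT_IMPORT_RECEIVE_SNAPSHOT",
    ("SLOT_IMPORT_RECEIVE_SNAPSHOT", "SNAPSHOT-EOF"): "SLOT_IMPORT_WAITING_FOR_PAUSED",
    ("SLOT_IMPORT_WAITING_FOR_PAUSED", "PAUSED"): "SLOT_IMPORT_FAILOVER_REQUESTED",
    ("SLOT_IMPORT_FAILOVER_REQUESTED", "FAILOVER-GRANTED"): "SLOT_IMPORT_FAILOVER_GRANTED",
    ("SLOT_IMPORT_FAILOVER_GRANTED", "TAKEOVER-PERFORMED"): SUCCESS,
    (CLEANUP, "CLEANUP-DONE"): FAILED,
}


def step(state: str, event: str) -> str:
    """Advance the (modelled) import state machine by one event."""
    if state in (SUCCESS, FAILED):
        return state  # terminal in this model
    if event == "ERROR":
        # Any listed error condition routes to cleanup of unowned slots.
        return CLEANUP
    nxt = TARGET_TRANSITIONS.get((state, event))
    if nxt is None:
        raise ValueError(f"unexpected {event!r} in state {state}")
    return nxt
```

Feeding `ACK`, `SNAPSHOT-EOF`, `PAUSED`, `FAILOVER-GRANTED` and
`TAKEOVER-PERFORMED` in order walks the happy path; an `ERROR` at any
earlier point ends in `SLOT_MIGRATION_JOB_FAILED` after cleanup.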

Co-authored-by: Binbin <binloveplay1314@qq.com>

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2025-08-11 18:02:37 -07:00
| Name | Last commit | Date |
| --- | --- | --- |
| create-cluster | Enable protected-configs, debug and module commands in create-cluster script (#701) | 2024-06-27 12:27:09 -07:00 |
| graphs/commits-over-time | Added Tcl program to show commits graphicaly. | 2015-11-20 15:45:25 +01:00 |
| hyperloglog | Copyright update to reflect IP transfer from salvatore to Redis (#740) | 2024-08-14 09:20:36 -07:00 |
| lru | Fix typo and rename Redis to Valkey in the utils/lru/README (#314) | 2024-04-14 11:40:01 -07:00 |
| req-res-validator | Add reply_schema to command json files (internal for now) (#10273) | 2023-03-11 10:14:16 +02:00 |
| srandmember | Fixed some typos, add a spell check ci and others minor fix (#8890) | 2021-06-10 15:39:33 +03:00 |
| build-static-symbols.tcl | Copyright update to reflect IP transfer from salvatore to Redis (#740) | 2024-08-14 09:20:36 -07:00 |
| corrupt_rdb.c | Copyright update to reflect IP transfer from salvatore to Redis (#740) | 2024-08-14 09:20:36 -07:00 |
| gen-test-certs.sh | Update redis legacy word when run TLS cert file (#572) | 2024-05-30 13:09:29 -04:00 |
| generate-command-code.py | Remove trademarked language in code comments (#223) | 2024-04-09 10:24:03 +02:00 |
| generate-commands-json.py | Update redis to valkey in generate-commands-json.py (#238) | 2024-04-18 21:31:49 +02:00 |
| generate-fmtargs.py | Improve multithreaded performance with memory prefetching (#861) | 2024-08-26 21:10:44 -07:00 |
| generate-module-api-doc.rb | Minor fix in module API doc script (#494) | 2024-05-14 01:12:30 +02:00 |
| generate-unit-test-header.py | clang-format: set ColumnLimit to 0 and reformat (#1045) | 2024-09-25 01:22:54 +02:00 |
| install_server.sh | Spelling 9 (#2245) | 2025-07-04 16:15:04 -04:00 |
| module-api-since.rb | Update script to generate Valkey Module API docs (#406) | 2024-04-30 18:24:18 +02:00 |
| redis-copy.rb | Copyright update to reflect IP transfer from salvatore to Redis (#740) | 2024-08-14 09:20:36 -07:00 |
| redis-sha1.rb | Copyright update to reflect IP transfer from salvatore to Redis (#740) | 2024-08-14 09:20:36 -07:00 |
| reply_schema_linter.js | Add reply_schema to command json files (internal for now) (#10273) | 2023-03-11 10:14:16 +02:00 |
| req-res-log-validator.py | Introduce atomic slot migration (#1949) | 2025-08-11 18:02:37 -07:00 |
| speed-regression.tcl | Copyright update to reflect IP transfer from salvatore to Redis (#740) | 2024-08-14 09:20:36 -07:00 |
| systemd-valkey_multiple_servers@.service | Rename systemd files and content to valkey from redis (#234) | 2024-04-08 14:37:56 -04:00 |
| systemd-valkey_server.service | Rename systemd files and content to valkey from redis (#234) | 2024-04-08 14:37:56 -04:00 |
| tracking_collisions.c | Copyright update to reflect IP transfer from salvatore to Redis (#740) | 2024-08-14 09:20:36 -07:00 |
| valkey_init_script | Rename redis_init_script file and its content (#357) | 2024-04-24 10:05:11 +02:00 |
| valkey_init_script.tpl | Rename redis_init_script file and its content (#357) | 2024-04-24 10:05:11 +02:00 |
| whatisdoing.sh | Update Redis legacy keyword and link in utils/whatisdoing.sh (#1495) | 2025-01-03 09:37:55 +08:00 |