Fix cluster slot migration flaky test (#2756)

The original test code only checks:

The original test code only checks:

1. wait_for_cluster_size 4, which calls cluster_size_consistent for every node.
Inside that function, for each node, cluster_size_consistent queries cluster_known_nodes,
which is calculated as (unsigned long long)dictSize(server.cluster->nodes). However, when
a new node is added to the cluster, it is first created in the HANDSHAKE state, and
clusterAddNode adds it to the nodes hash table. Therefore, it is possible for the new
node to still be in HANDSHAKE status (processed asynchronously) even though it appears
that all nodes “know” there are 4 nodes in the cluster.

2. cluster_state for every node, but when a new node is added, server.cluster->state remains FAIL.


Some handshake processes may not have completed yet, which likely causes the flakiness.
To address this, added a --cluster check to ensure that the config state is consistent.

Fixes #2693.

Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This commit is contained in:
Hanxi Zhang 2025-11-19 23:07:16 -08:00 committed by GitHub
parent e19ceb7a6d
commit ed8856bdfc
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
1 changed files with 8 additions and 1 deletions

View File

@ -275,7 +275,14 @@ test {Migrate the last slot away from a node using valkey-cli} {
# First we wait for new node to be recognized by entire cluster
wait_for_cluster_size 4
# Cluster check just verifies the config state is self-consistent,
# waiting for cluster_state to be okay is an independent check that all the
# nodes actually believe each other are healthy, prevent cluster down error.
wait_for_condition 1000 50 {
[catch {exec src/valkey-cli --cluster check 127.0.0.1:[srv 0 port]}] == 0 &&
[catch {exec src/valkey-cli --cluster check 127.0.0.1:[srv -1 port]}] == 0 &&
[catch {exec src/valkey-cli --cluster check 127.0.0.1:[srv -2 port]}] == 0 &&
[catch {exec src/valkey-cli --cluster check 127.0.0.1:[srv -3 port]}] == 0 &&
[CI 0 cluster_state] eq {ok} &&
[CI 1 cluster_state] eq {ok} &&
[CI 2 cluster_state] eq {ok} &&