6.3 KiB
Network Fault Injection Framework (mongobridge)
Overview
Mongobridge is a network fault injection testing tool that allows test authors to intentionally simulate network issues such as connection failures, message delays, or packet loss during communication to any node in a cluster. It acts as a transparent proxy between MongoDB processes and their clients, enabling controlled network fault injection for testing distributed system behavior.
How It Works
When ReplSetTest or ShardingTest are instructed to use mongobridge, they will set up a mongobridge process for each node that creates a ProxiedConnection between the node and any clients (including other nodes in the cluster) attempting to communicate with it. When test authors send a command to a node, mongobridge intercepts the command and applies any configured actions onto the commands before forwarding the command along to the node itself. This allows simple fault injection from the test author's perspective.
Quick Start
To use mongobridge in your tests:
-
Enable mongobridge in your test setup:
let st = new ShardingTest({ shards: {rs0: {nodes: 2}}, mongos: 1, config: 1, useBridge: true, // Enable mongobridge });- Test commands must be enabled: Mongobridge's
*Fromcommands requireenableTestCommands: true(which is the default in test environments)
- Test commands must be enabled: Mongobridge's
-
Inject network faults using bridge commands:
// Delay messages by 5 seconds st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 5000); // Reject all connections st.rs0.getPrimary().rejectConnectionsFrom(st.rs0.getSecondary()); // Restore normal behavior st.rs0.getPrimary().acceptConnectionsFrom(st.rs0.getSecondary()); -
Operations that depend on communication between the affected nodes will fail or timeout as expected.
What to keep in mind
Be aware that there are consequences to injecting network faults between nodes that can cause downstream impact in (for example) heartbeats, sync source selection, and SDAM, and so after a fault has been injected the test may not be in the state you expect it to be in for future commands. It is best to keep mongobridge tests relatively short and targeted to ensure that flakiness due to these faults doesn't impact the rest of your testing.
Command Reference
Mongobridge supports four commands for network fault injection:
acceptConnectionsFrom(bridges)
Purpose: Allows normal communication from specified sources
Usage:
node.acceptConnectionsFrom(otherNode);
node.acceptConnectionsFrom([node1, node2, node3]); // Multiple nodes
Effect: Restores normal message forwarding (default state)
rejectConnectionsFrom(bridges)
Purpose: Immediately closes connections from specified sources
Usage:
node.rejectConnectionsFrom(otherNode);
Effect: New connections are rejected, existing connections are closed when a new request is sent over them
Use case: Simulating complete network partitions
delayMessagesFrom(bridges, delayMs)
Purpose: Delays message forwarding by specified milliseconds
Usage:
node.delayMessagesFrom(otherNode, 5000); // 5 second delay
node.delayMessagesFrom(otherNode, 0); // Remove delay
Parameters:
delayMs: Delay in milliseconds (0 to disable)
Use case: Simulating slow networks or testing timeout behavior
discardMessagesFrom(bridges, lossProbability)
Purpose: Randomly discards messages with specified probability
Usage:
node.discardMessagesFrom(otherNode, 0.5); // Drop 50% of messages
node.discardMessagesFrom(otherNode, 1.0); // Drop all messages
node.discardMessagesFrom(otherNode, 0.0); // Drop no messages
Parameters:
lossProbability: Number between 0.0 (no loss) and 1.0 (total loss)
Use case: Simulating unreliable networks or packet loss
Examples
Basic Network Partition Test
assert.eq(jsTest.options().enableTestCommands, true);
// Set up a replica set with mongobridge
let rst = new ReplSetTest({
nodes: 3,
useBridge: true,
settings: {electionTimeoutMillis: 2000, heartbeatIntervalMillis: 400},
});
rst.startSet();
rst.initiate();
// Partition the primary from secondaries
let primary = rst.getPrimary();
let secondaries = rst.getSecondaries();
primary.rejectConnectionsFrom(secondaries);
// Verify primary steps down due to lost majority
assert.soon(() => {
return rst.getPrimary() !== primary;
});
// Restore network
primary.acceptConnectionsFrom(secondaries);
rst.stopSet();
Write Concern Timeout Test
assert.eq(jsTest.options().enableTestCommands, true);
let st = new ShardingTest({
shards: {rs0: {nodes: 2}},
useBridge: true,
});
// Delay replication to cause write concern timeout
st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 10000);
// This write should fail due to timeout
assert.commandFailed(
st.s0.getCollection("test.coll").insert(
{x: 1},
{
writeConcern: {w: 2, wtimeout: 5000},
},
),
);
// Restore normal replication
st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 0);
st.stop();
Simulating Packet Loss
// Set up unreliable network with 30% packet loss
primary.discardMessagesFrom(secondary, 0.3);
// Operations may succeed or fail unpredictably
// Useful for testing retry logic and resilience
Limitations
- OP_QUERY exhaust: Not supported for legacy exhaust queries (OP_MSG exhaust cursors are supported)
- Direct connections: Only works when connections go through the bridge proxy
- TLS support: Mongobridge is not supported if the cluster is using TLS.