SERVER-109671 Document mongobridge (#43402)

GitOrigin-RevId: 1e4f97c95e8fb8418dc8921f5e9302f9001f1fe0
Erin McNulty 2025-10-31 15:40:11 -04:00 committed by MongoDB Bot
parent ee19cd5693
commit f0ffa7be9a
1 changed files with 192 additions and 0 deletions

@ -0,0 +1,192 @@
# Network Fault Injection Framework (mongobridge)
## Overview
[Mongobridge](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/src/mongo/tools/mongobridge_tool/bridge.cpp#L1) is a network fault injection testing tool that allows test authors to intentionally simulate network issues such as connection failures, message delays, or packet loss during communication to any node in a cluster. It acts as a transparent proxy between MongoDB processes and their clients, enabling controlled network fault injection for testing distributed system behavior.
## How It Works
When `ReplSetTest` or `ShardingTest` is instructed to use `mongobridge`, it will [set up a mongobridge process](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/jstests/libs/replsettest.js#L2962) for each node. Each bridge [creates a ProxiedConnection](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/src/mongo/tools/mongobridge_tool/bridge.cpp#L323-L324) between the node and any clients (including other nodes in the cluster) attempting to communicate with it. When a command is sent to a node, mongobridge [intercepts the command and applies any configured actions](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/src/mongo/tools/mongobridge_tool/bridge.cpp#L395-L430) before forwarding it to the node itself. This keeps fault injection simple from the test author's perspective.
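The intercept-then-forward flow described above can be pictured with a small toy model. This is only a sketch for intuition, not the logic in `bridge.cpp`: the `BridgeModel` class, its `decide` method, and the assumption that each command replaces the previous setting for a source are all hypothetical.

```javascript
// Toy model of per-source fault settings. NOT the real bridge implementation;
// BridgeModel and decide() are hypothetical names used for illustration only.
class BridgeModel {
    constructor(random = Math.random) {
        this.random = random; // injectable so tests can be deterministic
        this.settings = new Map(); // source host -> {state, delayMs, loss}
    }

    // Assumption: each command replaces the previous setting for that source.
    acceptConnectionsFrom(host) {
        this.settings.set(host, {state: "forward", delayMs: 0, loss: 0});
    }
    rejectConnectionsFrom(host) {
        this.settings.set(host, {state: "hangUp", delayMs: 0, loss: 0});
    }
    delayMessagesFrom(host, delayMs) {
        this.settings.set(host, {state: "forward", delayMs, loss: 0});
    }
    discardMessagesFrom(host, loss) {
        this.settings.set(host, {state: "discard", delayMs: 0, loss});
    }

    // Decide what happens to one message arriving from `host`.
    decide(host) {
        const s = this.settings.get(host) ?? {state: "forward", delayMs: 0, loss: 0};
        if (s.state === "hangUp") {
            return {action: "close"};
        }
        if (s.state === "discard" && this.random() < s.loss) {
            return {action: "discard"};
        }
        return {action: "forward", delayMs: s.delayMs};
    }
}

// Usage: a source with no configured fault is simply forwarded.
const model = new BridgeModel();
model.delayMessagesFrom("node1:20001", 5000);
const delayed = model.decide("node1:20001"); // forwarded after a 5000 ms delay
const untouched = model.decide("node2:20002"); // forwarded with no delay
```

Injecting the random source makes the probabilistic branch easy to exercise in isolation, which is the main reason the model takes it as a constructor argument.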
## Quick Start
To use mongobridge in your tests:
1. **Enable mongobridge** in your test setup:
```javascript
let st = new ShardingTest({
    shards: {rs0: {nodes: 2}},
    mongos: 1,
    config: 1,
    useBridge: true, // Enable mongobridge
});
```
- **Test commands must be enabled**: Mongobridge's `*From` commands require `enableTestCommands: true` (which is the default in test environments)
2. **Inject network faults** using bridge commands:
```javascript
// Delay messages by 5 seconds
st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 5000);
// Reject all connections
st.rs0.getPrimary().rejectConnectionsFrom(st.rs0.getSecondary());
// Restore normal behavior
st.rs0.getPrimary().acceptConnectionsFrom(st.rs0.getSecondary());
```
3. Operations that depend on communication between the affected nodes will fail or time out as expected.
## What to Keep in Mind
Injecting network faults between nodes has downstream consequences for, among other things, heartbeats, sync source selection, and SDAM. After a fault has been injected, the test may therefore not be in the state you expect for subsequent commands. Keep mongobridge tests relatively short and targeted so that flakiness caused by these faults doesn't affect the rest of your testing.
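One way to keep a fault narrowly scoped is to wrap it in a helper that always undoes it, even when an assertion throws. The sketch below is a suggestion rather than an existing test-library function: `withFault` is a hypothetical name, and it takes plain callbacks so it makes no assumptions about the bridge API beyond the commands documented here.

```javascript
// Hypothetical helper (not part of the jstests libraries): run `body` while a
// fault is active and always restore the network afterwards.
function withFault(inject, restore, body) {
    inject();
    try {
        return body();
    } finally {
        // Runs whether `body` returned normally or threw.
        restore();
    }
}

// Intended use inside a test that was started with useBridge: true:
//
// withFault(
//     () => primary.rejectConnectionsFrom(secondary),
//     () => primary.acceptConnectionsFrom(secondary),
//     () => {
//         // Assertions that need the partition go here.
//     },
// );
```

Restoring in `finally` limits how long the cluster spends in the faulted state, although side effects that have already happened, such as an election, are not undone by restoring the connection.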
## Command Reference
Mongobridge supports four commands for network fault injection:
### `acceptConnectionsFrom(bridges)`
**Purpose**: Allows normal communication from specified sources
**Usage**:
```javascript
node.acceptConnectionsFrom(otherNode);
node.acceptConnectionsFrom([node1, node2, node3]); // Multiple nodes
```
**Effect**: Restores normal message forwarding (default state)
### `rejectConnectionsFrom(bridges)`
**Purpose**: Immediately closes connections from specified sources
**Usage**:
```javascript
node.rejectConnectionsFrom(otherNode);
```
**Effect**: New connections are rejected; existing connections are closed when the next request is sent over them
**Use case**: Simulating complete network partitions
### `delayMessagesFrom(bridges, delayMs)`
**Purpose**: Delays message forwarding by specified milliseconds
**Usage**:
```javascript
node.delayMessagesFrom(otherNode, 5000); // 5 second delay
node.delayMessagesFrom(otherNode, 0); // Remove delay
```
**Parameters**:
- `delayMs`: Delay in milliseconds (0 to disable)
**Use case**: Simulating slow networks or testing timeout behavior
### `discardMessagesFrom(bridges, lossProbability)`
**Purpose**: Randomly discards messages with specified probability
**Usage**:
```javascript
node.discardMessagesFrom(otherNode, 0.5); // Drop 50% of messages
node.discardMessagesFrom(otherNode, 1.0); // Drop all messages
node.discardMessagesFrom(otherNode, 0.0); // Drop no messages
```
**Parameters**:
- `lossProbability`: Number between 0.0 (no loss) and 1.0 (total loss)
**Use case**: Simulating unreliable networks or packet loss
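When picking a `lossProbability`, it can help to estimate how likely a retried operation is to fail outright. The sketch below assumes each message is discarded independently with the configured probability, which is a simplifying assumption rather than a documented guarantee; `probAllDiscarded` is a hypothetical helper name.

```javascript
// Probability that every one of `attempts` messages is discarded, assuming
// (as a simplification) that each message is dropped independently.
function probAllDiscarded(lossProbability, attempts) {
    if (!(lossProbability >= 0 && lossProbability <= 1)) {
        throw new RangeError("lossProbability must be between 0.0 and 1.0");
    }
    if (!Number.isInteger(attempts) || attempts < 1) {
        throw new RangeError("attempts must be a positive integer");
    }
    return Math.pow(lossProbability, attempts);
}

// With 50% loss, three attempts are all dropped one time in eight.
const allDropped = probAllDiscarded(0.5, 3); // 0.125
```

Under this model, only `0.0` and `1.0` give fully deterministic outcomes, so intermediate values suit tests that assert on eventual behavior rather than on a single operation succeeding or failing.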
## Examples
### Basic Network Partition Test
```javascript
assert.eq(jsTest.options().enableTestCommands, true);
// Set up a replica set with mongobridge
let rst = new ReplSetTest({
    nodes: 3,
    useBridge: true,
    settings: {electionTimeoutMillis: 2000, heartbeatIntervalMillis: 400},
});
rst.startSet();
rst.initiate();
// Partition the primary from secondaries
let primary = rst.getPrimary();
let secondaries = rst.getSecondaries();
primary.rejectConnectionsFrom(secondaries);
// Verify primary steps down due to lost majority
assert.soon(() => {
    return rst.getPrimary() !== primary;
});
// Restore network
primary.acceptConnectionsFrom(secondaries);
rst.stopSet();
```
### Write Concern Timeout Test
```javascript
assert.eq(jsTest.options().enableTestCommands, true);
let st = new ShardingTest({
    shards: {rs0: {nodes: 2}},
    useBridge: true,
});
// Delay replication to cause write concern timeout
st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 10000);
// This write should fail due to timeout
assert.commandFailed(
    st.s0.getCollection("test.coll").insert(
        {x: 1},
        {
            writeConcern: {w: 2, wtimeout: 5000},
        },
    ),
);
// Restore normal replication
st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 0);
st.stop();
```
### Simulating Packet Loss
```javascript
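// Assumes `primary` and `secondary` are nodes from a ReplSetTest or
// ShardingTest that was started with useBridge: true.
// Set up unreliable network with 30% packet loss
primary.discardMessagesFrom(secondary, 0.3);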
// Operations may succeed or fail unpredictably
// Useful for testing retry logic and resilience
```
## Limitations
- **OP_QUERY exhaust**: Not supported for legacy exhaust queries (OP_MSG exhaust cursors are supported)
- **Direct connections**: Only works when connections go through the bridge proxy
- **TLS support**: Not supported if the cluster is using TLS
## See Also
- [mongobridge.js test example](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/jstests/noPassthrough/mongobridge/mongobridge.js#L1)