mirror of https://github.com/mongodb/mongo
SERVER-87034 Initial markdown format (#19276)
GitOrigin-RevId: 64e388007ec1ac3744537253540995af628bcc00
This commit is contained in: parent fad7d1a9a2, commit 71f830ce49

README.md | 95
@@ -4,92 +4,89 @@ Welcome to MongoDB!

## Components

- `mongod` - The database server.
- `mongos` - Sharding router.
- `mongo` - The database shell (uses interactive javascript).

## Download MongoDB

- https://www.mongodb.com/try/download/community
- Using homebrew `brew tap mongodb/brew`
- Using docker image `docker pull mongo`

## Building

See [Building MongoDB](docs/building.md).

## Running

For command line options invoke:

```bash
$ ./mongod --help
```

To run a single server database:

```bash
$ sudo mkdir -p /data/db
$ ./mongod
$
$ # The mongo javascript shell connects to localhost and test database by default:
$ ./mongo
> help
```

## Installing Compass

You can install compass using the `install_compass` script packaged with MongoDB:

```bash
$ ./install_compass
```

This will download the appropriate MongoDB Compass package for your platform
and install it.

## Drivers

Client drivers for most programming languages are available at
https://docs.mongodb.com/manual/applications/drivers/. Use the shell
(`mongo`) for administrative tasks.

## Bug Reports

See https://github.com/mongodb/mongo/wiki/Submit-Bug-Reports.

## Packaging

Packages are created dynamically by the [buildscripts/packager.py](buildscripts/packager.py) script.
This will generate RPM and Debian packages.

## Learn MongoDB

- Documentation - https://docs.mongodb.com/manual/
- Developer Center - https://www.mongodb.com/developer/
- MongoDB University - https://learn.mongodb.com

## Cloud Hosted MongoDB

https://www.mongodb.com/cloud/atlas

## Forums

- https://mongodb.com/community/forums/

  Technical questions about using MongoDB.

- https://mongodb.com/community/forums/c/server-dev

  Technical questions about building and developing MongoDB.

## LICENSE

MongoDB is free and the source is available. Versions released prior to
October 16, 2018 are published under the AGPL. All versions released after
October 16, 2018, including patch fixes for prior versions, are published
under the [Server Side Public License (SSPL) v1](LICENSE-Community.txt).
See individual files for details.
@@ -19,54 +19,54 @@ not authored by MongoDB, and has a license which requires reproduction,
a notice will be included in
`THIRD-PARTY-NOTICES`.

| Name | License | Vendored Version | Emits persisted data | Distributed in Release Binaries |
| --------------------------- | ----------------- | ------------------ | :------------------: | :-----------------------------: |
| [abseil-cpp] | Apache-2.0 | 20230802.1 | | ✗ |
| [Aladdin MD5] | Zlib | Unknown | ✗ | ✗ |
| [ASIO] | BSL-1.0 | 1.12.2 | | ✗ |
| [benchmark] | Apache-2.0 | 1.5.2 | | |
| [Boost] | BSL-1.0 | 1.79.0 | | ✗ |
| [c-ares] | MIT | 1.19.1 | | ✗ |
| [double-conversion] | ??? | ??? | | ??? |
| [fmt] | BSD-2-Clause | 7.1.3 | | ✗ |
| [GPerfTools] | BSD-3-Clause | 2.9.1 | | ✗ |
| [gRPC] | Apache-2.0 | 1.59.2 | | ✗ |
| [ICU4] | ICU | 57.1 | ✗ | ✗ |
| [immer] | BSL-1.0 | d98a68c + changes | | ✗ |
| [Intel Decimal FP Library] | BSD-3-Clause | 2.0 Update 1 | | ✗ |
| [JSON-Schema-Test-Suite] | MIT | 728066f9c5 | | |
| [libstemmer] | BSD-3-Clause | Unknown | ✗ | ✗ |
| [librdkafka] | BSD-2-Clause | 2.0.2 | | |
| [libmongocrypt] | Apache-2.0 | 1.8.4 | ✗ | ✗ |
| [linenoise] | BSD-3-Clause | 6cdc775 + changes | | ✗ |
| [mongo-c-driver] | Apache-2.0 | 1.23.0 | ✗ | ✗ |
| [MozJS] | MPL-2.0 | ESR 91.3.0 | | ✗ |
| [MurmurHash3] | Public Domain | a6bd3ce + changes | ✗ | ✗ |
| [ocspbuilder] | MIT | 0.10.2 | | |
| [ocspresponder] | Apache-2.0 | 0.5.0 | | |
| [pcre2] | BSD-3-Clause | 10.40 | | ✗ |
| [protobuf] | BSD-3-Clause | 4.25.0 | | ✗ |
| [re2] | BSD-3-Clause | 2021-09-01 | | ✗ |
| [S2] | Apache-2.0 | Unknown | ✗ | ✗ |
| [SafeInt] | MIT | 3.0.26 | | |
| [schemastore.org] | Apache-2.0 | 6847cfc3a1 | | |
| [scons] | MIT | 3.1.2 | | |
| [Snappy] | BSD-3-Clause | 1.1.10 | ✗ | ✗ |
| [timelib] | MIT | 2022.04 | | ✗ |
| [TomCrypt] | Public Domain | 1.18.2 | ✗ | ✗ |
| [Unicode] | Unicode-DFS-2015 | 8.0.0 | ✗ | ✗ |
| [libunwind] | MIT | 1.6.2 + changes | | ✗ |
| [Valgrind] | BSD-4-Clause<sup>\[<a href="#note_vg" id="ref_vg">1</a>]</sup> | 3.17.0 | | ✗ |
| [wiredtiger] | | <sup>\[<a href="#note_wt" id="ref_wt">2</a>]</sup> | ✗ | ✗ |
| [yaml-cpp] | MIT | 0.6.3 | | ✗ |
| [Zlib] | Zlib | 1.3 | ✗ | ✗ |
| [Zstandard] | BSD-3-Clause | 1.5.5 | ✗ | ✗ |

[abseil-cpp]: https://github.com/abseil/abseil-cpp
[ASIO]: https://github.com/chriskohlhoff/asio
[benchmark]: https://github.com/google/benchmark
[Boost]: http://www.boost.org/
[double-conversion]: https://github.com/google/double-conversion "transitive dependency of MozJS"
[fmt]: http://fmtlib.net/
[GPerfTools]: https://github.com/gperftools/gperftools
[gRPC]: https://github.com/grpc/grpc
@@ -132,26 +132,25 @@ these libraries. Releases prepared in this fashion will include a copy
of these libraries' license in a file named
`THIRD-PARTY-NOTICES.windows`.

| Name       | Enterprise Only | Has Windows DLLs                                        |
| :--------- | :-------------: | :-----------------------------------------------------: |
| Cyrus SASL | Yes             | Yes                                                     |
| libldap    | Yes             | No                                                      |
| net-snmp   | Yes             | Yes                                                     |
| OpenSSL    | No              | Yes<sup>\[<a href="#note_ssl" id="ref_ssl">3</a>]</sup> |
| libcurl    | No              | No                                                      |

## Notes:

1. <a id="note_vg" href="#ref_vg">^</a>
   The majority of Valgrind is licensed under the GPL, with the exception of a single
   header file which is licensed under a BSD license. This BSD licensed header is the only
   file from Valgrind which is vendored and consumed by MongoDB.

2. <a id="note_wt" href="#ref_wt">^</a>
   WiredTiger is maintained by MongoDB in a separate repository. As a part of our
   development process, we periodically ingest the latest snapshot of that repository.

3. <a id="note_ssl" href="#ref_ssl">^</a>
   OpenSSL is only shipped as a dependency of the MongoDB tools written in Go. The MongoDB
   shell and server binaries use Windows' cryptography APIs.
@@ -1,3 +1,4 @@
# MongoDB Bazel Documentation

- [Developer Workflow](docs/developer_workflow.md)
- [Best Practices](docs/best_practices.md)
@@ -11,10 +11,10 @@ The Bazel equivalent of SConscript files are BUILD.bazel files.
src/mongo/BUILD.bazel would contain:

    mongo_cc_binary(
        name = "hello_world",
        srcs = [
            "hello_world.cpp"
        ],
    )

Once you've obtained bazelisk by running **evergreen/get_bazelisk.sh**, you can then build this target via "bazelisk build":

@@ -25,7 +25,7 @@ Or run this target via "bazelisk run":

    ./bazelisk run //src/mongo:hello_world

The full target name is a combination between the directory of the BUILD.bazel file and the target name:

    //{BUILD.bazel dir}:{targetname}
@@ -36,32 +36,31 @@ Bazel makes use of static analysis wherever possible to improve execution and qu
The divergence from SCons is that now source files have to be declared in addition to header files.

    mongo_cc_binary(
        name = "hello_world",
        srcs = [
            "hello_world.cpp",
            "new_source.cpp"  # If adding a source file
        ],
        hdrs = [
            "new_header.h"  # If adding a header file
        ],
    )

## Adding a New Library

The DevProd Build Team created MongoDB-specific macros for the different types of build targets you may want to specify. These include:

- mongo_cc_binary
- mongo_cc_library
- idl_generator

Creating a new library is similar to the steps above for creating a new binary. A new **mongo_cc_library** definition would be created in the BUILD.bazel file.

    mongo_cc_library(
        name = "new_library",
        srcs = [
            "new_library_source_file.cpp"
        ],
    )

## Declaring Dependencies
@@ -69,20 +68,20 @@ Creating a new library is similar to the steps above for creating a new binary.
If a library or binary depends on another library, this must be declared in the **deps** section of the target. The syntax for referring to the library is the same syntax used in the bazelisk build/run command.

    mongo_cc_library(
        name = "new_library",
        # ...
    )

    mongo_cc_binary(
        name = "hello_world",
        srcs = [
            "hello_world.cpp",
        ],
        deps = [
            ":new_library",  # if referring to the library declared in the same directory as this build file
            # "//src/mongo:new_library"  # absolute path
            # "sub_directory:new_library"  # relative path of a subdirectory
        ],
    )

## Depending on a Bazel Library in a SCons Build Target

@@ -97,7 +96,6 @@ This allows SCons build targets to depend on Bazel build targets directly. The B

        'fsync_locked.cpp',
        ],
        LIBDEPS=[
            'new_library',  # depend on the bazel "new_library" target defined above
        ],
    )
@@ -5,18 +5,20 @@ MongoDB uses EngFlow to enable remote execution with Bazel. This dramatically sp
To install the necessary credentials to enable remote execution, run scons.py with any build command, then follow the setup instructions it prints out. Or:

(Only if not in the Engineering org)

- Request access to the MANA group https://mana.corp.mongodbgov.com/resources/659ec4b9bccf3819e5608712

(For everyone)

- Go to https://sodalite.cluster.engflow.com/gettingstarted
- Login with OKTA, then click the "GENERATE AND DOWNLOAD MTLS CERTIFICATE" button
- (If logging in with OKTA doesn't work) Login with Google using your MongoDB email, then click the "GENERATE AND DOWNLOAD MTLS CERTIFICATE" button
- On your local system (usually your MacBook), open a shell terminal and, after setting the variables on the first three lines, run:

      REMOTE_USER=<SSH User from https://spruce.mongodb.com/spawn/host>
      REMOTE_HOST=<DNS Name from https://spruce.mongodb.com/spawn/host>
      ZIP_FILE=~/Downloads/engflow-mTLS.zip

      curl https://raw.githubusercontent.com/mongodb/mongo/master/buildscripts/setup_engflow_creds.sh -o setup_engflow_creds.sh
      chmod +x ./setup_engflow_creds.sh
      ./setup_engflow_creds.sh $REMOTE_USER $REMOTE_HOST $ZIP_FILE
@@ -1 +1 @@
This directory exists to manage a Buildfarm; see docs/bazel.md for more details.
@@ -1,52 +1,60 @@
# How to Use Antithesis

## Context

Antithesis is a third party vendor with an environment that can perform network fuzzing. We can
upload images containing `docker-compose.yml` files, which represent various MongoDB topologies, to
the Antithesis Docker registry. Antithesis runs `docker-compose up` from these images to spin up
the corresponding multi-container application in their environment and run a test suite. Network
fuzzing is performed on the topology while the test suite runs & a report is generated by
Antithesis identifying bugs. Check out
https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis to see an example of how we
use Antithesis today.

## Base Images

The `base_images` directory consists of the building blocks for creating a MongoDB test topology.
These images are uploaded to the Antithesis Docker registry weekly during the
`antithesis_image_push` task. For more visibility into how these images are built and uploaded to
the Antithesis Docker registry, please see that task.

### mongo_binaries

This image contains the latest `mongo`, `mongos` and `mongod` binaries. It can be used to
start a `mongod` instance, `mongos` instance or execute `mongo` commands. This is the main building
block for creating the System Under Test topology.

### workload

This image contains the latest `mongo` binary as well as the `resmoke` test runner. The `workload`
container is not part of the actual topology. The purpose of a `workload` container is to execute
`mongo` commands to complete the topology setup, and to run a test suite on an existing topology
like so:

```shell
buildscripts/resmoke.py run --suite antithesis_concurrency_sharded_with_stepdowns_and_balancer
```

**Every topology must have 1 workload container.**

Note: During `workload` image build, `buildscripts/antithesis_suite.py` runs, which generates
"antithesis compatible" test suites and prepends them with `antithesis_`. These are the test suites
that can run in antithesis and are available from within the `workload` container.

## Topologies

The `topologies` directory consists of subdirectories representing various mongo test topologies.
Each topology has a `Dockerfile`, a `docker-compose.yml` file and a `scripts` directory.

### Dockerfile

This assembles an image with the necessary files for spinning up the corresponding topology. It
consists of a `docker-compose.yml`, a `logs` directory, a `scripts` directory and a `data`
directory. If this is structured properly, you should be able to copy the files & directories
from this image and run `docker-compose up` to set up the desired topology.

Example from `buildscripts/antithesis/topologies/sharded_cluster/Dockerfile`:

```Dockerfile
FROM scratch
COPY docker-compose.yml /
```
@@ -56,18 +64,20 @@ ADD data /data

```Dockerfile
ADD debug /debug
```

All topology images are built and uploaded to the Antithesis Docker registry during the
`antithesis_image_push` task. Some of these directories are created during the
`evergreen/antithesis_image_build.sh` script such as `/data` and `/logs`.

Note: These images serve solely as a filesystem containing all necessary files for a topology,
therefore use `FROM scratch`.

### docker-compose.yml

This describes how to construct the corresponding topology using the
`mongo-binaries` and `workload` images.

Example from `buildscripts/antithesis/topologies/sharded_cluster/docker-compose.yml`:

```yml
version: '3.0'
```
@@ -156,62 +166,71 @@ networks:

```yml
      - subnet: 10.20.20.0/24
```

Each container must have a `command` in `docker-compose.yml` that runs an init script. The init
script belongs in the `scripts` directory, which is included as a volume. The `command` should be
set like so: `/bin/bash /scripts/[script_name].sh` or `python3 /scripts/[script_name].py`. This is
a requirement for the topology to start up properly in Antithesis.

When creating `mongod` or `mongos` instances, route the logs like so:
`--logpath /var/log/mongodb/mongodb.log` and utilize `volumes` -- as in `database1`.
This enables us to easily retrieve logs if a bug is detected by Antithesis.

The `ipv4_address` should be set to `10.20.20.130` or higher if you do not want that container to
be affected by network fuzzing. For instance, you would likely not want the `workload` container
to be affected by network fuzzing -- as shown in the example above.

Use the `evergreen-latest-master` tag for all images. This is updated automatically in
`evergreen/antithesis_image_build.sh` during the `antithesis_image_build` task -- if needed.

### scripts

Take a look at `buildscripts/antithesis/topologies/sharded_cluster/scripts/mongos_init.py` to see
how to use util methods from `buildscripts/antithesis/topologies/sharded_cluster/scripts/utils.py`
to set up the desired topology. You can also use simple shell scripts as in the case of
`buildscripts/antithesis/topologies/sharded_cluster/scripts/database_init.py`. These init scripts
must not end in order to keep the underlying container alive. You can use an infinite while
loop for `python` scripts or you can use `tail -f /dev/null` for shell scripts.
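Below is a minimal, hypothetical Python init script sketch illustrating that pattern; the file name and setup step are placeholders, not an existing script in the repository:

```python
# Hypothetical /scripts/example_init.py: do one-time setup, then block forever
# so the container stays alive (Antithesis requires init scripts not to exit).
import time


def setup_topology():
    # One-time setup work would go here (e.g. issuing `mongo` commands).
    pass


if __name__ == "__main__":
    setup_topology()
    while True:  # never exit; keep the underlying container running
        time.sleep(60)
```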
## How do I create a new topology for Antithesis testing?

Creating a new topology for Antithesis testing is easy & requires a few simple steps.

1. Add a new directory in `buildscripts/antithesis/topologies` to represent your desired topology.
   You can use existing topologies as an example.
2. Make sure that your workload test suite runs against your topology without any failures. This
   may require tagging some tests as `antithesis-incompatible`.
3. Update the `antithesis_image_push` task so that your new topology image is
   uploaded to the Antithesis Docker registry.
4. Reach out to #server-testing on Slack & provide the new topology image name as well as the
   desired test suite to run.
5. Include the SDP team on the code review.

These are the required updates to `evergreen/antithesis_image_build.sh`:

- Add the following command for each of your `mongos` and `mongod` containers in your topology to
  create your log directories.

  ```shell
  mkdir -p antithesis/topologies/[topology_name]/{logs,data}/[container_name]
  ```

- Build an image for your new topology ending in `-config`

  ```shell
  cd [your_topology_dir]
  sed -i s/evergreen-latest-master/$tag/ docker-compose.yml
  sudo docker build . -t [your-topology-name]-config:$tag
  ```

These are the required updates to `evergreen/antithesis_image_push.sh`:

- Push your new image to the Antithesis Docker registry

  ```shell
  sudo docker tag "[your-topology-name]-config:$tag" "us-central1-docker.pkg.dev/molten-verve-216720/mongodb-repository/[your-topology-name]-config:$tag"
  sudo docker push "us-central1-docker.pkg.dev/molten-verve-216720/mongodb-repository/[your-topology-name]-config:$tag"
  ```

## Additional Resources

If you are interested in leveraging Antithesis feel free to reach out to #server-testing on Slack.
@@ -14,10 +14,10 @@ The following assumes you are using python from the MongoDB toolchain.

```sh
(mongo-python3) deactivate # only if you have another python env activated
sh> /opt/mongodbtoolchain/v4/bin/python3 -m venv cm # create new env
sh> source cm/bin/activate # activate new env
(cm) python -m pip install -r requirements.txt # install required packages
(cm) python start.py # run the calibrator
(cm) deactivate # back to bash
sh>
```

### Install new packages

@@ -25,3 +25,4 @@ sh>

```sh
(cm) python -m pip install <package_name> # install <package_name>
(cm) python -m pip freeze > requirements.txt # do not forget to update requirements.txt
```
@@ -1,163 +1,184 @@
# Resmoke Test Suites

Resmoke stores test suites represented as `.yml` files in the `buildscripts/resmokeconfig/suites`
directory. These `.yml` files allow users to spin up a variety of configurations to run tests
against.

# Suite Fields

## test_kind - [Root Level]

This represents the type of tests that are running in this suite. Some examples include: _js_test,
cpp_unit_test, cpp_integration_tests, benchmark_test, fsm_workload_test, etc._ You can see all
available options in the `_SELECTOR_REGISTRY` at `mongo/buildscripts/resmokelib/selector.py`.

Ex:

```yaml
test_kind: js_test
```

## selector - [Root Level]

The selector determines test files to include/exclude in the suite.

Ex:

```yaml
selector:
  roots:
    - jstests/aggregation/**/*.js
  exclude_files:
    - jstests/aggregation/extras/*.js
    - jstests/aggregation/data/*.js
  exclude_with_any_tags:
    - requires_pipeline_optimization
```

### selector.roots

File path(s) of test files to include. If a path without a glob is provided, it must exist.

### selector.exclude_files

File path(s) of test files to exclude. If a path without a glob is provided, it must exist.

### selector.exclude_with_any_tags

Exclude test files by tag name(s). To see all available tags, run
`./buildscripts/resmoke.py list-tags`.

## executor - [Root Level]

Configuration for the test execution framework.

Ex:

```yaml
executor:
  archive:
    ...
  config:
    ...
  hooks:
    ...
  fixture:
    ...
```
### executor.archive

Upon failure, data files can be uploaded to s3. A failure is when a `hook` or `test` throws an
exception. Data files will be archived in the following situations:

1. Any `hook` included in this section throws an exception.
2. If `tests: true` and any `test` in the suite throws an exception.

Ex:

```yaml
archive:
  hooks:
    - Hook1
    - Hook2
    ...
  tests: true
```
### executor.config

This section contains additional configuration for each test. The structure of this can vary
significantly based on the `test_kind`. For specific information, you can look at the
implementation of the `test_kind` of concern in the `buildscripts/resmokelib/testing/testcases`
directory.

Ex:

```yaml
config:
  shell_options:
    global_vars:
      TestData:
        defaultReadConcernLevel: null
        enableMajorityReadConcern: ''
    nodb: ''
    gssapiServiceName: "mockservice"
    eval: >-
      var testingReplication = true;
      load('jstests/libs/override_methods/set_read_and_write_concerns.js');
      load('jstests/libs/override_methods/enable_causal_consistency_without_read_pref.js');
```

Above is an example of the most common `test_kind` -- `js_test`. `js_test` uses `shell_options` to
customize the mongo shell when running tests.

`global_vars` allows for setting global variables. A `TestData` object is a special global variable
that is used to hold testing data. Parts of `TestData` can be updated via `resmoke` command-line
invocation, via `.yml` (as shown above), and during runtime. The global `TestData` object is merged
intelligently and made available to the `js_test` running. Behavior can vary on key collision, but
in general this is the order of precedence: (1) resmoke command-line (2) [suite].yml (3)
runtime/default.
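As a rough illustration of that precedence (a sketch of the behavior described above, not resmoke's actual merge code; the key value is only an example):

```python
# Sketch only: command-line TestData values win over [suite].yml values, which
# win over runtime defaults.
def merge_test_data(runtime_defaults, suite_yml, command_line):
    merged = dict(runtime_defaults)
    merged.update(suite_yml)     # [suite].yml overrides runtime/defaults
    merged.update(command_line)  # resmoke command line overrides everything
    return merged


print(merge_test_data(
    {"defaultReadConcernLevel": None},
    {"defaultReadConcernLevel": "majority"},
    {"defaultReadConcernLevel": "local"},
))  # -> {'defaultReadConcernLevel': 'local'}
```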
The mongo shell can also be invoked with flags &
named arguments. Flags must have the `''` value, such as in the case for `nodb` above.

`eval` can also be used to run generic javascript code in the shell. You can directly include
javascript code, or you can put it in a separate script & `load` it.
### executor.hooks

All hooks inherit from the `buildscripts.resmokelib.testing.hooks.interface.Hook` parent class and
can override any subset of the following empty base methods: `before_suite`, `after_suite`,
`before_test`, `after_test`. At least 1 base method must be overridden, otherwise the hook will
not do anything at all. During test suite execution, each hook runs its custom logic in the
respective scenarios. Some customizable tasks that hooks can perform include: _validating data,
deleting data, performing cleanup_, etc. You can see all existing hooks in the
`buildscripts/resmokelib/testing/hooks` directory.

Ex:

```yaml
hooks:
  - class: CheckReplOplogs
  - class: CheckReplDBHash
  - class: ValidateCollections
  - class: CleanEveryN
    n: 20
  - class: MyHook
    param1: something
    param2: somethingelse
```

The hook name in the `.yml` must match its Python class name in the
`buildscripts/resmokelib/testing/hooks` directory. Parameters can also be included in the `.yml`
and will be passed to the hook's constructor (the `hook_logger` & `fixture` parameters are
automatically included, so those should not be included in the `.yml`).
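A minimal sketch of what the hypothetical `MyHook` from the example above could look like (the exact base-class signatures may differ slightly; check `buildscripts/resmokelib/testing/hooks/interface.py` before copying this):

```python
# Hypothetical hook matching the `MyHook` entry in the YAML example above.
from buildscripts.resmokelib.testing.hooks import interface


class MyHook(interface.Hook):
    """Example hook that logs around each test."""

    def __init__(self, hook_logger, fixture, param1=None, param2=None):
        # hook_logger and fixture are supplied by resmoke automatically;
        # param1/param2 come from the suite .yml.
        interface.Hook.__init__(self, hook_logger, fixture, "Example hook")
        self._param1 = param1
        self._param2 = param2

    def before_test(self, test, test_report):
        self.logger.info("About to run a test (param1=%s)", self._param1)

    def after_test(self, test, test_report):
        self.logger.info("Finished a test (param2=%s)", self._param2)
```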
### executor.fixture

This represents the test fixture to run tests against. The `class` sub-field corresponds to the
Python class name of a fixture in the `buildscripts/resmokelib/testing/fixtures` directory. All
other sub-fields are passed into the constructor of the fixture. These sub-fields will vary based
on the fixture used.

Ex:

```yaml
fixture:
  class: ShardedClusterFixture
  num_shards: 2
  mongos_options:
    bind_ip_all: ''
    set_parameters:
      enableTestCommands: 1
  mongod_options:
    bind_ip_all: ''
    set_parameters:
      enableTestCommands: 1
      periodicNoopIntervalSecs: 1
      writePeriodicNoops: true
```
|
||||
For inspiration on creating a new test suite, you can check out a variety of examples in the
|
||||
`buildscripts/resmokeconfig/suites` directory.
|
||||
|
||||
For inspiration on creating a new test suite, you can check out a variety of examples in the
|
||||
`buildscripts/resmokeconfig/suites` directory.
|
||||
|
|
|
|||
|
|
@@ -4,60 +4,68 @@

1. Install the latest [Node.js](https://nodejs.org/en/download/) if you don't have it.
2. Install [pkg](https://www.npmjs.com/package/pkg) with npm.
   ```
   npm install -g pkg
   ```
3. Get [ESLint](https://github.com/eslint/eslint) source code.
   ```
   git clone git@github.com:eslint/eslint.git
   ```
4. Checkout the latest version using git tag.
   ```
   cd eslint
   git checkout v${version}
   ```
5. Add pkg options to `package.json` file.
   ```
   "pkg": {
     "scripts": [ "conf/**/*", "lib/**/*", "messages/**/*" ],
     "targets": [ "linux-x64", "macos-x64" ]
     # "targets": [ "linux-arm" ]
   },
   ```
6. Run pkg command to make ESLint executables.
   ```
   npm install
   pkg .
   ```
7. Check that executables are working.
   Copy files to somewhere in your PATH and try to run it.

   Depending on your system

   ```
   eslint-linux --help
   ```

   or

   ```
   eslint-macos --help
   ```

   or (if you are on arm)

   ```
   eslint --help
   ```

   (\*) If executable fails to find some .js files there are [extra steps](#extra-steps)
   required to be done before step 6.

### Prepare archives

Rename produced files.

```
mv eslint-linux eslint-Linux-x86_64
mv eslint-macos eslint-Darwin-x86_64
# arm
# mv eslint eslint-Linux-arm64
```

Archive files. (No leading v in version e.g. 8.28.0 NOT v8.28.0)

```
tar -czvf eslint-${version}-linux-x86_64.tar.gz eslint-Linux-x86_64
tar -czvf eslint-${version}-darwin.tar.gz eslint-Darwin-x86_64
```
@@ -68,17 +76,20 @@ tar -czvf eslint-${version}-darwin.tar.gz eslint-Darwin-x86_64

### Upload archives to `boxes.10gen.com`

Archives should be available at the following links:

```
https://s3.amazonaws.com/boxes.10gen.com/build/eslint-${version}-linux-x86_64.tar.gz
https://s3.amazonaws.com/boxes.10gen.com/build/eslint-${version}-darwin.tar.gz
# arm
# https://s3.amazonaws.com/boxes.10gen.com/build/eslint-${version}-linux-arm64.tar.gz
```

The Build team has access to do that.
You can create a build ticket in Jira for them to do it
(e.g. https://jira.mongodb.org/browse/BUILD-12984)

### Update ESLint version in `buildscripts/eslint.py`

```
# Expected version of ESLint.
ESLINT_VERSION = "${version}"
```

@@ -91,6 +102,7 @@ and force include files using `assets` or `scripts` options might not help.

For the ESLint version 7.22.0 and 8.28.0 the following change was applied to the
source code to make everything work:

```
diff --git a/lib/cli-engine/cli-engine.js b/lib/cli-engine/cli-engine.js
index b1befaa04..e02230f83 100644
```
@@ -99,7 +111,7 @@ index b1befaa04..e02230f83 100644

```
@@ -987,43 +987,35 @@ class CLIEngine {
     */
    getFormatter(format) {

-        // default is stylish
-        const resolvedFormatName = format || "stylish";
-
```
@@ -1,19 +1,19 @@
# IWYU Analysis tool

This tool will run
[include-what-you-use](https://github.com/include-what-you-use/include-what-you-use)
(IWYU) analysis across the codebase via `compile_commands.json`.

The `iwyu_config.yml` file consists of the current options and automatic
pragma marking. You can exclude files from the analysis here.

The tool has two main modes of operation, `fix` and `check` modes. `fix`
mode will attempt to make changes to the source files based off IWYU's
suggestions. The `check` mode will simply check if there are any suggestions
at all.

`fix` mode will take a long time to run, as the tool needs to rerun any
source in which an underlying header was changed to ensure things are not
broken, and therefore ends up recompiling the codebase several times over.

For more information please refer to the script's `--help` option.
@@ -31,29 +31,30 @@ Next you can run the analysis:

```
python3 buildscripts/iwyu/run_iwyu_analysis.py
```

The default mode is fix mode, and it will start making changes to the code
if any changes are found.

# Debugging failures

Occasionally the IWYU tool will run into problems where it is unable to suggest
valid changes and the changes will cause things to break (not compile). When
it hits a failure it will copy the source and all the headers that were used
at the time of the compilation into a directory where the same command can be
run to reproduce the error.

You can examine the suggested changes in the source and headers and compare
them to the working source tree. Then you can make corrective changes to allow
IWYU to get past the failure.

IWYU is not perfect and it makes several mistakes that a human can understand
and fix appropriately.

# Running the tests

This tool includes its own end to end testing. The test directory includes
sub directories which contain source and iwyu configs to run the tool against.
The tests will then compare the results to built in expected results and fail
if the tests are not producing the expected results.

To run the tests use the command:
@@ -1,6 +1,7 @@
# Matrix Resmoke.py Suites

## Summary

Matrix Suites are defined as a combination of explicit
suite files (in `buildscripts/resmokeconfig/suites` by default)
and a set of "overrides" for specific keys. The intention is

@@ -10,10 +11,12 @@ fully composed of reusable sections, similar to how Genny's
workloads are defined as a set of parameterized `PhaseConfig`s.

## Usage

Matrix suites behave like regular suites for all functionality in resmoke.py,
including `list-suites`, `find-suites` and `run --suites=[SUITE]`.

## Writing a matrix suite mapping file.

Matrix suites consist of a mapping, and a set of overrides in
their eponymous directories. When you are done writing the mapping file, you must
[generate the matrix suite file.](#generating-matrix-suites)

@@ -24,6 +27,7 @@ modifiers. There is also an optional `description` field that will get output
with the local resmoke invocation.

The fields of modifiers are the following:

1. overrides
2. excludes
3. eval
@@ -35,6 +39,7 @@ For example `encryption.mongodfixture_ese` would reference the `mongodfixture_es
inside of the `encryption.yml` file inside of the `overrides` directory.

### overrides

All fields referenced in the `overrides` section of the mappings file will overwrite the specified
fields in the base_suite.
The `overrides` modifier takes precedence over the `excludes` and `eval` modifiers.

@@ -42,22 +47,26 @@ The `overrides` list will be processed in order so order can matter if multiple
try to overwrite the same field in the base_suite.

### excludes

All fields referenced in the `excludes` section of the mappings file will append to the specified
`exclude` fields in the base suite.
The only two valid options in the referenced modifier field are `exclude_with_any_tags` and
`exclude_files`. They are appended in the order they are specified in the mappings file.

### eval

All fields referenced in the `eval` section of the mappings file will append to the specified
`config.shell_options.eval` field in the base suite.
They are appended in the order they are specified in the mappings file.

### extends

All fields referenced in the `extends` section of the mappings file must be lists, and will be
appended to the corresponding keys on the same path. When extends is applied (after the other
modifiers), the key being extended must already exist and also be a list.
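As a rough illustration of that append behavior (a sketch of the semantics described above, not the actual resmoke implementation; the tag names are hypothetical):

```python
# Sketch only: an `extends` entry appends its list to the list that already
# exists at the same path in the base suite.
base_suite = {"selector": {"exclude_with_any_tags": ["requires_profiling"]}}
extends = {"selector": {"exclude_with_any_tags": ["requires_fcv_70"]}}

base_suite["selector"]["exclude_with_any_tags"].extend(
    extends["selector"]["exclude_with_any_tags"]
)
print(base_suite)
# {'selector': {'exclude_with_any_tags': ['requires_profiling', 'requires_fcv_70']}}
```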
## Generating matrix suites

The generated matrix suites live in the `buildscripts/resmokeconfig/matrix_suites/generated_suites`
directory. These files may be edited for local testing but must remain consistent with the mapping
files. There is a task in the commit queue that enforces this. To generate a new version of these

@@ -66,10 +75,12 @@ will overwrite the current generated matrix suites on disk so make sure you do n
changes to these files.

## Validating matrix suites

All matrix suites are validated whenever they are run to ensure that the mapping file and the
generated suite file are in sync. The `resmoke_validation_tests` task in the commit queue also
ensures that the files are validated.

## FAQ

For questions about the user or authorship experience,
please reach out in #server-testing.
@@ -1,26 +1,30 @@
## Running the core analyzer

There are two main ways of running the core analyzer.

1. Running the core analyzer with local core dumps and binaries.
2. Running the core analyzer with core dumps and binaries from an evergreen task. Note that some analysis might fail if you are not on the same AMI (Amazon Machine Image) that the task was run on.

To run the core analyzer with local core dumps and binaries:

```
python3 buildscripts/resmoke.py core-analyzer
```

This will look for binaries in the build/install directory, and it will look for core dumps in the current directory. If your local environment is different you can include `--install-dir` and `--core-dir` in your invocation to specify other locations.

To run the core analyzer with core dumps and binaries from an evergreen task:

```
python3 buildscripts/resmoke.py core-analyzer --task-id={task_id}
```

This will download all of the core dumps and binaries from the task and put them into the configured `--working-dir`, which defaults to the `core-analyzer` directory.

All of the task analysis will be added to the `analysis` directory inside the configured `--working-dir`.

Note: Currently the core analyzer only runs on Linux. Windows uses the legacy hang analyzer but will be switched over when we run into issues or have time to do the transition. We have not tackled the problem of getting core dumps on macOS so we have no core dump analysis on that operating system.

### Getting core dumps
```mermaid
|
||||
|
|
@@ -31,17 +35,21 @@ sequenceDiagram
    Hang Analyzer ->> Core Dumps: Attach to pid and generate core dumps
```

When a task times out, it hits the [timeout](https://github.com/10gen/mongo/blob/a6e56a8e136fe554dc90565bf6acf5bf86f7a46e/etc/evergreen_yml_components/definitions.yml#L2694) section in the defined evergreen config.
In this timeout section, we run [this](https://github.com/10gen/mongo/blob/a6e56a8e136fe554dc90565bf6acf5bf86f7a46e/etc/evergreen_yml_components/definitions.yml#L2302) task, which runs the hang-analyzer with the following invocation:

```
python3 buildscripts/resmoke.py hang-analyzer -o file -o stdout -m exact -p python
```

This tells the hang-analyzer to look for all of the python processes (we are specifically looking for resmoke) on the machine and to signal them.
When resmoke is [signaled](https://github.com/08a99b15eea7ae0952b2098710d565dd7f709ff6/buildscripts/resmokelib/sighandler.py#L25), it again invokes the hang analyzer with the specific pids of its child processes.
It will look similar to this most of the time:

```
python3 buildscripts/resmoke.py hang-analyzer -o file -o stdout -k -c -d pid1,pid2,pid3
```

The things to note here are `-k`, which kills the processes, and `-c`, which takes core dumps.
The resulting core dumps are put into the current running directory.
When a task fails normally, core dumps may also be generated by the Linux kernel and put into the working directory.
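To make that second invocation concrete: all the parent process needs is the pids of its children. A minimal sketch of that step, using `psutil` rather than resmoke's real signal handler, might look like:

```python
import subprocess

import psutil  # assumption: psutil is available; resmoke's real handler tracks pids itself


def core_dump_children() -> None:
    """Ask the hang analyzer to kill (-k) and core-dump (-c) our child processes."""
    child_pids = [child.pid for child in psutil.Process().children(recursive=True)]
    if not child_pids:
        return
    subprocess.run(
        [
            "python3", "buildscripts/resmoke.py", "hang-analyzer",
            "-o", "file", "-o", "stdout",
            "-k", "-c",
            "-d", ",".join(str(pid) for pid in child_pids),
        ],
        check=False,
    )
```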

@@ -56,7 +64,6 @@ After investigation of the above issue, we found that compressing and uploading
We made a [script](https://github.com/10gen/mongo/blob/master/buildscripts/fast_archive.py) that gzips all of the core dumps in parallel and uploads them to S3 individually and asynchronously.
This solved all of the problems listed above.
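The shape of that solution is roughly the following sketch. This is not the actual `fast_archive.py`; the `*.core` glob, the bucket name, and the direct use of `boto3` are assumptions made for illustration:

```python
import gzip
import shutil
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from pathlib import Path

import boto3  # assumption: AWS credentials are already configured in the environment


def _gzip_file(path: Path) -> Path:
    """Compress one core dump next to the original and return the compressed path."""
    gz_path = path.with_suffix(path.suffix + ".gz")
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def archive_core_dumps(core_dir: str, bucket: str = "example-core-dump-bucket") -> None:
    """Gzip core dumps in parallel and upload each one as soon as it is ready."""
    s3 = boto3.client("s3")
    dumps = list(Path(core_dir).glob("*.core"))  # dump naming convention is an assumption
    with ProcessPoolExecutor() as compressors, ThreadPoolExecutor() as uploaders:
        for gz_path in compressors.map(_gzip_file, dumps):
            uploaders.submit(s3.upload_file, str(gz_path), bucket, gz_path.name)
```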

### Generating the core analyzer task

```mermaid
@@ -12,6 +12,7 @@ Powercycle test is the part of resmoke. Python 3.10+ with python venv is require
run the resmoke (python3 from [mongodbtoolchain](http://mongodbtoolchain.build.10gen.cc/)
is highly recommended). Python venv can be set up by running in the root mongo repo
directory:

```
python3 -m venv python3-venv
source python3-venv/bin/activate

@@ -19,36 +20,40 @@ pip install -r buildscripts/requirements.txt
```

If the python venv is already set up, activate it before running resmoke:

```
source python3-venv/bin/activate
```

There are several commands that can be run by calling the resmoke powercycle subcommand:

```
python buildscripts/resmoke.py powercycle --help
```

The main entry point of the resmoke powercycle subcommand is located in this file:

```
buildscripts/resmokelib/powercycle/__init__.py
```
## Powercycle main steps

- [Set up EC2 instance](#set-up-ec2-instance)
- [Run powercycle test](#run-powercycle-test)
- [Resmoke powercycle run arguments](#resmoke-powercycle-run-arguments)
- [Powercycle test implementation](#powercycle-test-implementation)
- [Save diagnostics](#save-diagnostics)
- [Remote hang analyzer (optional)](#remote-hang-analyzer-optional)

### Set up EC2 instance

1. `Evergreen host.create command` - in Evergreen the remote host is created with
   the same distro as the localhost runs, and some initial connections are made to ensure
   it's up before further steps.
2. `Resmoke powercycle setup-host command` - prepares the remote host via ssh to run
   the powercycle test:

```
python buildscripts/resmoke.py powercycle setup-host
```

@@ -59,25 +64,28 @@ Powercycle setup-host operations are located in
created by `expansions.write` command in Evergreen.

It runs several operations via ssh:

- create directory on the remote host
- copy `buildscripts` and `mongoDB executables` from localhost to the remote host
- set up python venv on the remote host
- set up curator to collect system & process stats on the remote host
- install [NotMyFault](https://docs.microsoft.com/en-us/sysinternals/downloads/notmyfault)
  to crash Windows (only on Windows)

Remote operation via ssh implementation is located in
`buildscripts/resmokelib/powercycle/lib/remote_operations.py`.
The following operations are supported (see the usage sketch after this list):

- `copy_to` - copy files from the localhost to the remote host
- `copy_from` - copy files from the remote host to the localhost
- `shell` - runs a shell command on the remote host
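For illustration, a caller might drive those three operations roughly as follows. The constructor arguments, method signatures, and host values here are simplified assumptions; treat it as a sketch of the module's shape rather than its exact API:

```python
from buildscripts.resmokelib.powercycle.lib.remote_operations import RemoteOperations

# Hypothetical host and key values; in Evergreen these come from the written expansions.
remote_op = RemoteOperations(
    user_host="ec2-user@10.0.0.5",
    ssh_connection_options="-i powercycle.pem",
)

remote_op.shell("mkdir -p /data/db")                       # run a command remotely
remote_op.copy_to("buildscripts", "/home/ec2-user/")       # push files to the remote host
remote_op.copy_from("/home/ec2-user/mongod.log", "logs/")  # pull diagnostics back
```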

### Run powercycle test

`Resmoke powercycle run command` - runs the powercycle test on the localhost,
which runs remote operations on the remote host via ssh and local validation
checks:

```
python buildscripts/resmoke.py powercycle run \
    --sshUserHost=${user_name}@${host_ip} \

@@ -85,7 +93,7 @@ python buildscripts/resmoke.py powercycle run \
    --taskName=${task_name}
```

###### Resmoke powercycle run arguments

The arguments for the resmoke powercycle run command are defined in the `add_subcommand()`
function in `buildscripts/resmokelib/powercycle/__init__.py`. When powercycle test
@@ -107,41 +115,43 @@ The powercycle test main implementation is located in `main()` function in
The value of the `--remoteOperation` argument is used to distinguish if we are running the script
on the localhost or on the remote host.
The `remote_handler()` function performs the following remote operations:

- `noop` - do nothing
- `crash_server` - internally crash the server
- `kill_mongod` - kill the mongod process
- `install_mongod` - install mongod
- `start_mongod` - start the mongod process
- `stop_mongod` - stop the mongod process
- `shutdown_mongod` - run the shutdown command using the mongo client
- `rsync_data` - back up the mongod data
- `seed_docs` - seed a collection with random document values
- `set_fcv` - run the set FCV command using the mongo client
- `check_disk` - run the `chkdsk` command on Windows

When running on the localhost, the powercycle test loop performs the following steps:

- Rsync the database post-crash (starting from the 2nd loop), pre-recovery on the remote host
  - makes a backup before recovery
- Start mongod on the secret port on the remote host and wait for it to recover
  - also sets FCV and seeds documents on the 1st loop
- Validate canary from the localhost (starting from the 2nd loop)
  - uses the mongo client to connect to the remote mongod
- Validate collections from the localhost
  - calls resmoke to perform the validation on the remote mongod
- Shut down mongod on the remote host
- Rsync the database post-recovery on the remote host
  - makes a backup after recovery
- Start mongod on the standard port on the remote host
- Start CRUD and FSM clients on the localhost
  - calls resmoke to run the CRUD and FSM clients
- Generate a canary document from the localhost
  - uses the mongo client to connect to the remote mongod
- Crash the remote server or kill mongod on the remote host
  - most of the powercycle tasks do crashes
- Run check disk on the remote host (on Windows)
- Exit the loop if one of these occurs:
  - loop number exceeded
  - any step fails

The `exit_handler()` function writes a report and does cleanups any time after the test run exits.

@@ -149,6 +159,7 @@ When running on localhost the powercycle test loops do the following steps:

`Resmoke powercycle save-diagnostics command` - copies powercycle diagnostics
files from the remote host to the localhost (mainly used by Evergreen):

```
python buildscripts/resmoke.py powercycle save-diagnostics
```

@@ -159,25 +170,27 @@ Powercycle save-diagnostics operations are located in
created by `expansions.write` command in Evergreen.

It runs several operations via ssh:

- `gatherRemoteEventLogs`
  - runs on Windows
- `tarEC2Artifacts`
  - on success, archives `mongod.log`
  - on failure, additionally archives data files and all before-recovery and after-recovery backups
  - on failure on Windows, additionally archives event logs
- `copyEC2Artifacts`
  - from the remote host to the localhost
- `copyEC2MonitorFiles`
  - from the remote host to the localhost
- `gatherRemoteMongoCoredumps`
  - copies all mongo core dumps to a single directory
- `copyRemoteMongoCoredumps`
  - from the remote host to the localhost

### Remote hang analyzer (optional)

`Resmoke powercycle remote-hang-analyzer command` - runs the hang analyzer on the
remote host (mainly used by Evergreen):

```
python buildscripts/resmoke.py powercycle remote-hang-analyzer
```

@@ -1,9 +1,11 @@
- All end-to-end resmoke tests can be run via a resmoke suite itself:

```
mongodb_repo_root$ /opt/mongodbtoolchain/v4/bin/python3 buildscripts/resmoke.py run --suites resmoke_end2end_tests
```

- Finer-grained control is also possible by invoking python's unittest main by hand, e.g.:

```
mongodb_repo_root$ /opt/mongodbtoolchain/v4/bin/python3 -m unittest -v buildscripts.tests.resmoke_end2end.test_resmoke.TestTestSelection.test_at_sign_as_replay_file
```

@@ -1,26 +1,25 @@
# MongoDB Server Documentation

This is just some internal documentation.

For the full MongoDB docs, please see [mongodb.org](http://www.mongodb.org/)

- [Batons](baton.md)
- [Build System](build_system.md)
- [Build System Reference](build_system_reference.md)
- [Building MongoDB](building.md)
- [Command Dispatch](command_dispatch.md)
- [Contextual Singletons](contexts.md)
- [Egress Networking](egress_networking.md)
- [Exception Architecture](exception_architecture.md)
- [Fail Points](fail_points.md)
- [Futures and Promises](futures_and_promises.md)
- [Load Balancer Support](load_balancer_support.md)
- [Memory Management](memory_management.md)
- [Parsing Stack Traces](parsing_stack_traces.md)
- [Primary-only Services](primary_only_service.md)
- [Security Architecture Guide](security_guide.md)
- [Server Parameters](server-parameters.md)
- [String Manipulation](string_manipulation.md)
- [Thread Pools](thread_pools.md)
- [MongoDB Voluntary Product Accessibility Template® (VPAT™)](vpat.md)

@@ -1,65 +1,64 @@
# Server-Internal Baton Pattern

Batons are lightweight job queues in _mongod_ and _mongos_ processes that allow
recording the intent to execute a task (e.g., polling on a network socket) and
deferring its execution to a later time. Batons, often by reusing `Client`
threads and through the _Waitable_ interface, move the execution of scheduled
tasks out of the line, potentially hiding the execution cost from the critical
path. A total of four baton classes are available today:

- [Baton][baton]
- [DefaultBaton][defaultBaton]
- [NetworkingBaton][networkingBaton]
- [AsioNetworkingBaton][asioNetworkingBaton]

## Baton Hierarchy

All baton implementations extend _Baton_. They all expose an interface to allow
scheduling tasks on the baton, to demand the awakening of the baton on client
socket disconnect, and to create a _SubBaton_. A _SubBaton_, for any of the
baton types, is essentially a handle to a local object that proxies scheduling
requests to its underlying baton until it is detached (e.g., through destruction
of its handle).

Additionally, a _NetworkingBaton_ enables consumers of a transport layer to
execute I/O themselves, rather than delegating it to other threads. They are
special batons that are able to poll network sockets, which is not feasible
through other baton types. This is essential for minimizing context switches and
improving the readability of stack traces.

### DefaultBaton

DefaultBaton is the most basic baton implementation. A default baton is tightly
associated with an `OperationContext`, and its associated `Client` thread. This
baton provides the platform to execute tasks while a client thread awaits an
event or a timeout (e.g., via `OperationContext::sleepUntil(...)`), essentially
paving the way towards utilizing idle cycles of client threads for useful work.
Tasks can be scheduled on this baton through its associated `OperationContext`
and using `OperationContext::getBaton()::schedule(...)`.

Note that this baton is not available for an `OperationContext` that belongs to
a `ServiceContext` with an `AsioTransportLayer` transport layer. In that case,
the aforementioned interface will return a handle to _AsioNetworkingBaton_.

### AsioNetworkingBaton

This baton is only available for Linux and extends _NetworkingBaton_ to
implement a networking reactor. It utilizes `poll(2)` and `eventfd(2)` to allow
client threads to await events without busy polling.

## Example

For an example of scheduling a task on the `OperationContext` baton, see
[here][example].

## Considerations

Since any task scheduled on a baton is intended for out-of-line execution, it
must be non-blocking and preferably short-lived to ensure forward progress.

[baton]: https://github.com/mongodb/mongo/blob/5906d967c3144d09fab6a4cc1daddb295df19ffb/src/mongo/db/baton.h#L61-L178
[defaultBaton]: https://github.com/mongodb/mongo/blob/9cfe13115e92a43d1b9273ee1d5817d548264ba7/src/mongo/db/default_baton.h#L46-L75
[networkingBaton]: https://github.com/mongodb/mongo/blob/9cfe13115e92a43d1b9273ee1d5817d548264ba7/src/mongo/transport/baton.h#L61-L96
[asioNetworkingBaton]: https://github.com/mongodb/mongo/blob/9cfe13115e92a43d1b9273ee1d5817d548264ba7/src/mongo/transport/baton_asio_linux.h#L60-L529
[example]: https://github.com/mongodb/mongo/blob/262e5a961fa7221bfba5722aeea2db719f2149f5/src/mongo/s/multi_statement_transaction_requests_sender.cpp#L91-L99

@@ -1,27 +1,31 @@
(Note: This is a work-in-progress for the SDP team; contact #server-dev-platform for questions)

To perform a Bazel build via SCons:

- You must be on an arm64 virtual workstation
- You must generate engflow credentials and store them in the correct location (see below)
- Build the Bazel-compatible target: `python3 ./buildscripts/scons.py BAZEL_BUILD_ENABLED=1 --build-profile=fast --ninja=disabled --link-model=static -j 200 --modules= build/fast/mongo/db/commands/libfsync_locked.a`

To generate and install the engflow credentials:

- Navigate to and log in with your mongodb gmail account: https://sodalite.cluster.engflow.com/gettingstarted
- Generate and download the credentials; you will need to move them to the workstation machine (scp, copy paste plain text, etc.)
- Store them (under the same filename they downloaded as) on your machine at the default location our build expects: `/engflow/creds/`
- You should run `chmod 600` on them to make sure they are readable only by your user
- If you don't want to use the cluster, you can pass `BAZEL_FLAGS=--config=local` on the SCons command line or `--config=local` on the bazel command line

To perform a Bazel build and _bypass_ SCons:

- Install Bazelisk: `curl -L https://github.com/bazelbuild/bazelisk/releases/download/v1.17.0/bazelisk-linux-arm64 --output /tmp/bazelisk && chmod +x /tmp/bazelisk`
- Build the Bazel-compatible target: `/tmp/bazelisk build --verbose_failures src/mongo/db/commands:fsync_locked`

To perform a Bazel build using a local Buildfarm (to test remote execution capability):

- For more details on Buildfarm, see https://bazelbuild.github.io/bazel-buildfarm
- (One time only) Build and start the Buildfarm:
  - Change into the `buildfarm` directory: `cd buildfarm`
  - Build the image: `docker-compose build`
  - Start the container: `docker-compose up --detach`
  - Poll until the containers report status `running`: `docker ps --filter status=running --filter name=buildfarm`
- (Whenever you build):
  - Build the Bazel-compatible target with remote execution enabled: `/tmp/bazelisk build --verbose_failures --remote_executor=grpc://localhost:8980 src/mongo/db/commands:fsync_locked`

@@ -1,76 +1,135 @@
# The MongoDB Build System

## Introduction

### System requirements and supported platforms

## How to get Help

### Where to go

### What to bring when you go there (SCons version, server version, SCons command line, versions of relevant tools, `config.log`, etc.)

## Known Issues

### Commonly-encountered issues

#### `--disable-warnings-as-errors`

### Reference to known issues in the ticket system

### How to report a problem

#### For employees

#### For non-employees

## Set up the build environment

### Set up the virtualenv and poetry

See [Building Python Prerequisites](building.md#python-prerequisites)

### The Enterprise Module

#### Getting the module source

#### Enabling the module

## Building the software

### Commonly-used build targets

### Building a standard “debug” build

#### `--dbg`

### What goes where?

#### `$BUILD_ROOT/scons` and its contents

#### `$BUILD_ROOT/$VARIANT_DIR` and its contents

#### `$BUILD_ROOT/install` and its contents

#### `DESTDIR` and `PREFIX`

#### `--build-dir`

### Running core tests to verify the build

### Building a standard “release” build

#### `--separate-debug`

### Installing from the build directory

#### `--install-action`

### Creating a release archive

## Advanced Builds

### Compiler and linker options

#### `CC, CXX, CCFLAGS, CFLAGS, CXXFLAGS`

#### `CPPDEFINES and CPPPATH`

#### `LINKFLAGS`

#### `MSVC_VERSION`

#### `VERBOSE`

### Advanced build options

#### `-j`

#### `--separate-debug`

#### `--link-model`

#### `--allocator`

#### `--cxx-std`

#### `--linker`

#### `--variables-files`

### Cross compiling

#### `HOST_ARCH` and `TARGET_ARCH`

### Using Ninja

#### `--ninja`

### Cached builds

#### Using the SCons build cache

##### `--cache`

##### `--cache-dir`

#### Using `ccache`

##### `CCACHE`

### Using Icecream

#### `ICECC`, `ICECRUN`, `ICECC_CREATE_ENV`

#### `ICECC_VERSION` and `ICECC_VERSION_ARCH`

#### `ICECC_DEBUG`

## Developer builds

### Developer build options

#### `MONGO_{VERSION,GIT_HASH}`

By default, the server build system consults the local git repository

@@ -128,117 +187,196 @@ SCons invocations on almost any branch you are likely to find yourself
using.

#### Using sanitizers

##### `--sanitize`

##### `*SAN_OPTIONS`

#### `--dbg` `--opt`

#### `--build-tools=[stable|next]`

### Setting up your development environment

#### `mongo_custom_variables.py`

##### Guidance on what to put in your custom variables

##### How to suppress use of your custom variables

##### Useful variables files (e.g. `mongodbtoolchain`)

#### Using the Mongo toolchain

##### Why do we have our own toolchain?

##### When is it appropriate to use the MongoDB toolchain?

##### How do I obtain the toolchain?

##### How do I upgrade the toolchain?

##### How do I tell the build system to use it?

### Creating and using build variants

#### Using `--build-dir` to separate variant build artifacts

#### `BUILD_ROOT` and `BUILD_DIR`

#### `VARIANT_DIR`

#### `NINJA_PREFIX` and `NINJA_SUFFIX`

### Building older versions

#### Using `git-worktree`

### Speeding up incremental builds

#### Selecting minimal build targets

#### Compiler arguments

##### `-gsplit-dwarf` and `/DEBUG:FASTLINK`

#### Don’t reinstall what you don’t have to (\*NIX only)

##### `--install-action=hardlink`

#### Speeding up SCons dependency evaluation

##### `--implicit-cache`

##### `--build-fast-and-loose`

#### Using Ninja responsibly

#### What about `ccache`?

## Making source changes

### Adding a new dependency

### Linting and Lint Targets

#### What lint targets are available?

#### Using `clang-format`

### Testing your changes

#### How are test suites defined?

#### Running test suites

#### Adding tests to a suite

#### Running individual tests

## Modifying the build system

### What is SCons?

#### `SConstruct` and `SConscripts`

#### `Environments` and their `Clone`s

##### Overriding and altering variables

#### `Targets` and `Sources`

#### `Nodes`

##### `File` Nodes

##### `Program` and `Library` Nodes

#### `Aliases`, `Depends` and `Requires`

#### `Builders`

#### `Emitters`

#### `Scanners`

#### `Actions`

#### `Configure` objects

#### DAG walk

#### Reference to SCons documentation

### Modules

#### How modules work

#### The Enterprise module

##### The `build.py` file

#### Adding a new module

### Poetry

#### What is Poetry

[Poetry](https://python-poetry.org/) is a python dependency management system. Poetry tries to find dependencies in [pypi](https://pypi.org/) (similar to pip). For more details, visit the poetry website.

#### Why use Poetry

Poetry creates a dependency lock file similar to that of a [Ruby Gemfile](https://bundler.io/guides/gemfile.html#gemfiles) or a [Rust Cargo File](https://doc.rust-lang.org/cargo/guide/cargo-toml-vs-cargo-lock.html). This lock file has exact dependencies that will be the same no matter when they are installed. Even if dependencyA has an update available, the older pinned dependency will still be installed. This means that there will be fewer errors caused by two users having different versions of python dependencies.

#### Poetry Lock File

In a Poetry project there are two files that determine and resolve the dependencies. The first is [pyproject.toml](../pyproject.toml). This file loosely tells poetry what dependencies are needed and the constraints of those dependencies. For example, the following are all valid selections (a sketch of the version semantics follows the list):

1. `dependencyA = "1.0.0" # dependencyA can only ever be 1.0.0`
2. `dependencyA = "^1.0.0" # dependencyA can be any version greater than or equal to 1.0.0 and less than 2.0.0`
3. `dependencyA = "*" # dependencyA can be any version`
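For the caret form in particular, the `packaging` library (unrelated to Poetry, but a common Python dependency) can express the same range, which makes the semantics easy to check:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Poetry's ^1.0.0 is roughly the PEP 440 range >=1.0.0,<2.0.0.
caret_1_0_0 = SpecifierSet(">=1.0.0,<2.0.0")

print(Version("1.4.2") in caret_1_0_0)  # True: newer minor/patch releases are allowed
print(Version("2.0.0") in caret_1_0_0)  # False: a new major version is not
```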

The [poetry.lock](../poetry.lock) file has the exact package versions. This file is generated by poetry by running `poetry lock`. It contains a pinned list of all transitive dependencies that satisfy the requirements in [pyproject.toml](../pyproject.toml).

### `LIBDEPS` and the `LIBDEPS` Linter

#### Why `LIBDEPS`?

Libdeps is a subsystem within the build, which is centered around the LIBrary DEPendency graph. It tracks and maintains the dependency graph as well as lints, analyzes and provides useful metrics about the graph.

#### Different `LIBDEPS` variable types

The `LIBDEPS` variables are how the library relationships are defined within the build scripts. The primary variables are as follows:

- `LIBDEPS`:
  The 'public' type which propagates lower level dependencies onward automatically.
- `LIBDEPS_PRIVATE`:
  Creates a dependency only between the target and the dependency.
- `LIBDEPS_INTERFACE`:
  Same as `LIBDEPS` but excludes itself from the propagation onward.
- `LIBDEPS_DEPENDENTS`:
  Creates a reverse `LIBDEPS_PRIVATE` dependency where the dependency is the one declaring the relationship.
- `PROGDEPS_DEPENDENTS`:
  Same as `LIBDEPS_DEPENDENTS` but for use with Program builders.

Libraries are added to these variables as lists per each SCons builder instance in the SConscripts, depending on what type of relationship is needed (a short SConscript sketch follows). For more detailed information on these types, refer to [`The LIBDEPS variables`](build_system_reference.md#the-libdeps-variables).
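As a purely illustrative SConscript fragment (the library names are made up; real declarations follow the same `env.Library` form used in the linter examples in the reference document):

```python
env.Library(
    target="foo",
    source=["foo.cpp"],
    LIBDEPS=[
        "bar",  # public: consumers of 'foo' also link 'bar' and bar's own LIBDEPS
    ],
    LIBDEPS_PRIVATE=[
        "baz",  # private: only 'foo' itself links 'baz'; nothing propagates onward
    ],
)
```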

#### The `LIBDEPS` lint rules and tags

The libdeps subsystem is capable of linting and automatically detecting issues. Some of these linting rules are automatically checked during build-time (while the SConscripts are read and the build is performed) while others need to be manually run post-build (after the generated graph file has been built). Some rules will include exemption tags which can be added to a library's `LIBDEPS_TAGS` to override a rule for that library.

The build-time linter also has a print option `--libdeps-linting=print` which will print all issues without failing the build and ignoring exemption tags. This is useful for getting an idea of what issues are currently outstanding.

For a complete list of build-time lint rules, please refer to [`Build-time Libdeps Linter`](build_system_reference.md#build-time-libdeps-linter).

#### `LIBDEPS_TAGS`

`LIBDEPS_TAGS` can also be used to supply flags to the libdeps subsystem to do special handling for certain libraries, such as exemptions or inclusions for linting rules and also SCons command line expansion functions.

For a full list of tags, refer to [`LIBDEPS_TAGS`](build_system_reference.md#libdeps_tags).

#### Using the post-build LIBDEPS Linter

To use the post-build tools, you must first build the libdeps dependency graph by building the `generate-libdeps-graph` target.

You must also install the requirements file:

@@ -253,14 +391,18 @@ After the graph file is created, it can be used as input into the `gacli` tool t
python3 buildscripts/libdeps/gacli.py --graph-file build/cached/libdeps/libdeps.graphml
```

Another tool which provides a graphical interface as well as a visual representation of the graph is the graph visualizer. Minimally, it requires passing in a directory in which any files with the `.graphml` extension will be available for analysis. By default it will launch the web interface, which is reachable in a web browser at http://localhost:3000.

```
python3 buildscripts/libdeps/graph_visualizer.py --graphml-dir build/opt/libdeps
```

For more information about the details of using the post-build linting tools, refer to [`post-build linting and analysis`](build_system_reference.md#post-build-linting-and-analysis).

### Debugging build system failures

#### Using `-k` and `-n`

#### `--debug=[explain, time, stacktrace]`

#### `--libdeps-debug`

@@ -1,25 +1,43 @@
# MongoDB Build System Reference

## MongoDB Build System Requirements

### Recommended minimum requirements

### Python modules

### External libraries

### Enterprise module requirements

### Testing requirements

## MongoDB customizations

### SCons modules

### Development tools

#### Compilation database generator

### Build tools

#### IDL Compiler

### Auxiliary tools

#### Ninja generator

#### Icecream tool

#### ccache tool

### LIBDEPS

Libdeps is a subsystem within the build, which is centered around the LIBrary DEPendency graph. It tracks and maintains the dependency graph as well as lints, analyzes and provides useful metrics about the graph.

#### Design

The libdeps subsystem is divided into several stages, described in order of use as follows.

##### SConscript `LIBDEPS` definitions and build-time linting

@@ -37,17 +55,18 @@ The libdeps analyzer module is a python library which provides an Application P
##### The CLI and Visualizer tools

The libdeps analyzer module is used in the libdeps Graph Analysis Command Line Interface (gacli) tool and the libdeps Graph Visualizer web service. Both tools read in the graph file generated from the build and provide the Human Machine Interface (HMI) for analysis and linting.
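Since the generated graph file is plain GraphML, it can also be inspected directly outside of those tools. A minimal sketch using `networkx` (this is not the gacli API; the file path is just the default produced by the `generate-libdeps-graph` target in the earlier examples):

```python
import networkx as nx

# Load the libdeps graph built by the generate-libdeps-graph target.
graph = nx.read_graphml("build/cached/libdeps/libdeps.graphml")

print(f"{graph.number_of_nodes()} libraries, {graph.number_of_edges()} dependency edges")

# Show the direct dependencies recorded for one node in the graph.
node = next(iter(graph.nodes))
print(f"{node} -> {sorted(graph.neighbors(node))}")
```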

#### The `LIBDEPS` variables

The variables include several types of lists to be added to libraries per SCons builder instance:

| Variable               | Use                                    |
| ---------------------- | -------------------------------------- |
| `LIBDEPS`              | transitive dependencies                |
| `LIBDEPS_PRIVATE`      | local dependencies                     |
| `LIBDEPS_INTERFACE`    | transitive dependencies excluding self |
| `LIBDEPS_DEPENDENTS`   | reverse dependencies                   |
| `PROGDEPS_DEPENDENTS`  | reverse dependencies for Programs      |

_`LIBDEPS`_ is the 'public' type, such that libraries that are added to this list become a dependency of the current library, and also become dependencies of libraries which may depend on the current library. This propagation includes not just the libraries in the `LIBDEPS` list, but all `LIBDEPS` of those `LIBDEPS` recursively, meaning that all dependencies of the `LIBDEPS` libraries also become dependencies of the current library and of the libraries which depend on it.

@@ -60,34 +79,40 @@ _`LIBDEPS_DEPENDENTS`_ are added to libraries which will force themselves as dep
_`PROGDEPS_DEPENDENTS`_ are the same as `LIBDEPS_DEPENDENTS`, but intended for use only with Program builders.

#### `LIBDEPS_TAGS`

The `LIBDEPS_TAGS` variable is used to mark certain libdeps for various reasons. Some `LIBDEPS_TAGS` are used to mark certain libraries for the `LIBDEPS_TAG_EXPANSIONS` variable, which is used to create a function which can expand to a string on the command line. Below is a table of available `LIBDEPS` tags:

| Tag                                                      | Description                                                                                        |
| -------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| `illegal_cyclic_or_unresolved_dependencies_allowlisted`  | SCons subst expansion tag to handle dependency cycles                                               |
| `init-no-global-side-effects`                            | SCons subst expansion tag for causing linkers to avoid pulling in all symbols                       |
| `lint-public-dep-allowed`                                | Linting exemption tag exempting the `lint-no-public-deps` tag                                       |
| `lint-no-public-deps`                                    | Linting inclusion tag ensuring a libdep has no `LIBDEPS` declared                                   |
| `lint-allow-non-alphabetic`                              | Linting exemption tag allowing `LIBDEPS` variable lists to be non-alphabetic                        |
| `lint-leaf-node-allowed-dep`                             | Linting exemption tag exempting the `lint-leaf-node-no-deps` tag                                    |
| `lint-leaf-node-no-deps`                                 | Linting inclusion tag ensuring a libdep has no libdeps and is a leaf node                           |
| `lint-allow-nonlist-libdeps`                             | Linting exemption tag allowing a `LIBDEPS` variable to not be a list                                |
| `lint-allow-bidirectional-edges`                         | Linting exemption tag allowing reverse dependencies to also be forward dependencies                 |
| `lint-allow-nonprivate-on-deps-dependents`               | Linting exemption tag allowing reverse dependencies to be transitive                                |
| `lint-allow-dup-libdeps`                                 | Linting exemption tag allowing `LIBDEPS` variables to contain duplicate libdeps on a given library  |
| `lint-allow-program-links-private`                       | Linting exemption tag allowing `Program`s to have `PRIVATE_LIBDEPS`                                 |

##### The `illegal_cyclic_or_unresolved_dependencies_allowlisted` tag

This tag should not be used anymore because the library dependency graph has been successfully converted to a Directed Acyclic Graph (DAG). Prior to this accomplishment, it was necessary to handle
cycles specifically with platform-specific options on the command line.

##### The `init-no-global-side-effects` tag

Adding this flag to a library turns on platform-specific compiler flags which will cause the linker to pull in just the symbols it needs. Note that by default, the build is configured to pull in all symbols from libraries because of the use of static initializers; however, if a library is known to not have any of these initializers, then this flag can be added for some performance improvement.

#### Linting and linter tags

The libdeps linter features automatically detect certain classes of LIBDEPS usage errors. The libdeps linters are implemented as build-time linting and post-build linting procedures to maintain order in usage of the libdeps tool and the build’s library dependency graph. You will need to comply with the rules enforced by the libdeps linter, and fix issues that it raises when modifying the build scripts. There are exemption tags to prevent the linter from blocking things; however, these exemption tags should only be used in extraordinary cases, and with good reason. A goal of the libdeps linter is to drive and maintain the number of exemption tags in use to zero.

##### Exemption Tags

There are a number of existing issues that need to be addressed, but they will be addressed in future tickets. In the meantime, the use of specific strings in the LIBDEPS_TAGS variable can allow the libdeps linter to skip certain issues on given libraries. For example, to have the linter skip enforcement of the lint rule against bidirectional edges for "some_library":

```
env.Library(
    target='some_library',

@@ -97,11 +122,13 @@ env.Library(
```

#### build-time Libdeps Linter

If there is a build-time issue, the build will fail until it is addressed. This linting feature will be on by default and takes about half a second to complete in a full enterprise build (at the time of writing this), but can be turned off by using the `--libdeps-linting=off` option on your SCons invocation.

The current rules and their exemptions are listed below:

1. **A 'Program' can not link a non-public dependency, it can only have LIBDEPS links.**

###### Example

```

@@ -112,12 +139,15 @@ The current rules and their exemptions are listed below:
    LIBDEPS_PRIVATE=['lib2'], # This is a Program, BAD
)
```

###### Rationale

A Program can not be linked into anything else, and therefore the transitiveness does not apply. A default value of LIBDEPS was selected for consistency since most Programs were already doing this at the time the rule was created.

###### Exemption

'lint-allow-program-links-private' on the target node

######

2. **A 'Node' can only directly link a given library once.**

@@ -133,15 +163,19 @@ The current rules and their exemptions are listed below:
    LIBDEPS_INTERFACE=['lib2', 'lib2'], # Linked twice, BAD
)
```

###### Rationale

Libdeps will ignore duplicate links, so this rule is mostly for consistency and neatness in the build scripts.

###### Exemption

'lint-allow-dup-libdeps' on the target node

######

3. **A 'Node' which uses LIBDEPS_DEPENDENTS or PROGDEPS_DEPENDENTS can only have LIBDEPS_PRIVATE links.**

###### Example

```

@@ -153,14 +187,21 @@ The current rules and their exemptions are listed below:
    LIBDEPS_PRIVATE=['lib2'], # OK
)
```

###### Rationale

The node that the library is using LIBDEPS_DEPENDENTS or PROGDEPS_DEPENDENTS to inject its dependency onto should be conditional, therefore there should not be transitiveness for that dependency since it cannot be the source of any resolved symbols.

###### Exemption

'lint-allow-nonprivate-on-deps-dependents' on the target node

######

4. **A 'Node' can not link directly to a library that uses LIBDEPS_DEPENDENTS or PROGDEPS_DEPENDENTS.**

###### Example

```
env.Library(
    target='other_library',

@@ -173,16 +214,21 @@ The current rules and their exemptions are listed below:
    LIBDEPS_DEPENDENTS=['lib3'],
)
```

###### Rationale

A library that is using LIBDEPS_DEPENDENTS or PROGDEPS_DEPENDENTS should only be used for reverse dependency edges. If a node does need to link directly to a library that does have reverse dependency edges, that indicates the library should be split into two separate libraries, containing its direct dependency content and its conditional reverse dependency content.

###### Exemption

'lint-allow-bidirectional-edges' on the target node

######

5. **All libdeps environment vars must be assigned as lists.**

###### Example

```
env.Library(
    target='some_library',

@@ -191,13 +237,19 @@ The current rules and their exemptions are listed below:
    LIBDEPS_PRIVATE=['lib2'], # OK
)
```

###### Rationale

Libdeps will handle non-list environment variables, so this is more for consistency and neatness in the build scripts.

###### Exemption

'lint-allow-nonlist-libdeps' on the target node

######

6. **Libdeps with the tag 'lint-leaf-node-no-deps' shall not link any libdeps.**

###### Example

```

@@ -225,13 +277,19 @@ The current rules and their exemptions are listed below:
The special tag allows certain nodes to be marked and programmatically checked that they remain leaf nodes. An example use-case is when we want to make sure certain nodes never link mongodb code.

###### Exemption

'lint-leaf-node-allowed-dep' on the exempted libdep

###### Inclusion

'lint-leaf-node-no-deps' on the target node

######

7. **Libdeps with the tag 'lint-no-public-deps' shall not publicly link any libdeps.**

###### Example

```
env.Library(
    target='lib2',

@@ -253,17 +311,25 @@ The current rules and their exemptions are listed below:
    ]
)
```

###### Rationale

The special tag allows certain nodes to be marked and programmatically checked that they do not link publicly. Some nodes such as mongod_main have special requirements that this programmatically checks.

###### Exemption

'lint-public-dep-allowed' on the exempted libdep

###### Inclusion

'lint-no-public-deps' on the target node

######

8. **Libdeps shall be sorted alphabetically in LIBDEPS lists in the SCons files.**

###### Example

```
env.Library(
    target='lib2',

@@ -276,17 +342,19 @@ The current rules and their exemptions are listed below:
    ]
)
```

###### Rationale

Keeping the SCons files neat and ordered allows for easier Code Review diffs and generally better maintainability.

###### Exemption

'lint-allow-non-alphabetic' on the exempted libdep

######

##### The build-time print Option

The libdeps linter also has the `--libdeps-linting=print` option which will perform linting and, instead of failing the build on an issue, just print and continue on. It will also ignore exemption tags, and still print the issue because it will not fail the build. This is a good way to see the entirety of existing issues that are exempted by tags, as well as printing other metrics such as time spent linting.

#### post-build linting and analysis

@@ -400,12 +468,14 @@ The script will launch the backend and then build the optimized production front

After the server has started up, it should notify you via the terminal that you can access it at http://localhost:3000 locally in your browser.
## Build system configuration
|
||||
|
||||
### SCons configuration
|
||||
|
||||
#### Frequently used flags and variables
|
||||
|
||||
### MongoDB build configuration
|
||||
|
||||
#### Frequently used flags and variables
|
||||
|
||||
##### `MONGO_GIT_HASH`
|
||||
|
|
@ -425,18 +495,31 @@ of `git describe`, which will use the local tags to derive a version.

### Targets and Aliases

## Build artifacts and installation

### Hygienic builds

### AutoInstall

### AutoArchive

## MongoDB SCons style guide

### Sconscript Formatting Guidelines

#### Vertical list style

#### Alphabetize everything

### `Environment` Isolation

### Declaring Targets (`Program`, `Library`, and `CppUnitTest`)

### Invoking external tools correctly with `Command`s

### Customizing an `Environment` for a target

### Invoking subordinate `SConscript`s

#### `Import`s and `Export`s

### A Model `SConscript` with Comments
@ -6,26 +6,25 @@ way to get started, rather than building from source.

To build MongoDB, you will need:

- A modern C++ compiler capable of compiling C++20. One of the following is required:
  - GCC 11.3 or newer
  - Clang 12.0 (or Apple XCode 13.0 Clang) or newer
  - Visual Studio 2022 version 17.0 or newer (See Windows section below for details)
- On Linux and macOS, the libcurl library and header are required. macOS includes libcurl.
  - Fedora/RHEL - `dnf install libcurl-devel`
  - Ubuntu/Debian - `libcurl-dev` is provided by three packages. Install one of them:
    - `libcurl4-openssl-dev`
    - `libcurl4-nss-dev`
    - `libcurl4-gnutls-dev`
- On Ubuntu, the lzma library is required. Install `liblzma-dev`
- On Amazon Linux, the xz-devel library is required. `yum install xz-devel`
- Python 3.10.x and Pip modules:
  - See the section "Python Prerequisites" below.
- About 13 GB of free disk space for the core binaries (`mongod`,
  `mongos`, and `mongo`) and about 600 GB for the install-all target.

MongoDB supports the following architectures: arm64, ppc64le, s390x,
and x86-64. More detailed platform instructions can be found below.

## MongoDB Tools
@ -37,7 +36,6 @@ repository.

The source for the tools is now available at
[mongodb/mongo-tools](https://github.com/mongodb/mongo-tools).

## Python Prerequisites

In order to build MongoDB, Python 3.10+ is required, and several Python
@ -59,9 +57,9 @@ dedicated to building MongoDB is optional but recommended.

Note: In order to compile C-based Python modules, you'll also need the
Python and OpenSSL C headers. Run:

- Fedora/RHEL - `dnf install python3-devel openssl-devel`
- Ubuntu (20.04 and newer)/Debian (Bullseye and newer) - `apt install python-dev-is-python3 libssl-dev`
- Ubuntu (18.04 and older)/Debian (Buster and older) - `apt install python3.7-dev libssl-dev`

Note: If you are seeing errors involving "Prompt dismissed.." you might need to run the following command before `poetry install`.
@ -73,7 +71,7 @@ If you only want to build the database server `mongod`:

    $ python3 buildscripts/scons.py install-mongod

**_Note_**: For C++ compilers that are newer than the supported
version, the compiler may issue new warnings that cause MongoDB to
fail to build since the build system treats compiler warnings as
errors. To ignore the warnings, pass the switch

@ -81,7 +79,7 @@ errors. To ignore the warnings, pass the switch

    $ python3 buildscripts/scons.py install-mongod --disable-warnings-as-errors

**_Note_**: On memory-constrained systems, you may run into an error such as `g++: fatal error: Killed signal terminated program cc1plus`. To use less memory during building, pass the parameter `-j1` to scons. This can be incremented to `-j2`, `-j3`, and higher as appropriate to find the fastest working option on your system.

    $ python3 buildscripts/scons.py install-mongod -j1
@ -99,21 +97,20 @@ tests, etc):

    $ python3 buildscripts/scons.py install-all-meta

## SCons Targets

The following targets can be named on the scons command line to build and
install a subset of components:

- `install-mongod`
- `install-mongos`
- `install-core` (includes _only_ `mongod` and `mongos`)
- `install-servers` (includes all server components)
- `install-devcore` (includes `mongod`, `mongos`, and `jstestshell` (formerly `mongo` shell))
- `install-all` (includes a complete end-user distribution and tests)
- `install-all-meta` (absolutely everything that can be built and installed)

**_NOTE_**: The `install-core` and `install-servers` targets are _not_
guaranteed to be identical. The `install-core` target will only ever include a
minimal set of "core" server components, while `install-servers` is intended
for a functional end-user installation. If you are testing, you should use the
@ -126,23 +123,21 @@ The build system will produce an installation tree into

`PREFIX` is by default empty. This means that with all of the listed
targets all built binaries will be in `build/install/bin` by default.

## Windows

Build requirements:

- Visual Studio 2022 version 17.0 or newer
- Python 3.10

Or download a prebuilt binary for Windows at www.mongodb.org.

## Debian/Ubuntu

To install dependencies on Debian or Ubuntu systems:

    # apt-get install build-essential

## OS X

Install Xcode 13.0 or newer.
@ -151,16 +146,16 @@ Install Xcode 13.0 or newer.

Install the following ports:

- `devel/libexecinfo`
- `lang/llvm70`
- `lang/python`

Add `CC=clang12 CXX=clang++12` to the `scons` options when building.

## OpenBSD

Install the following ports:

- `devel/libexecinfo`
- `lang/gcc`
- `lang/python`
@ -15,10 +15,10 @@ single client connection during its lifetime. Central to the entry point is the

requests and returns a response message indicating the result of the
corresponding request message. This function is currently implemented by several
subclasses of the parent `ServiceEntryPoint` in order to account for the
differences in processing requests between _mongod_ and _mongos_ -- these
distinctions are reflected in the `ServiceEntryPointMongos` and
`ServiceEntryPointMongod` subclasses (see [here][service_entry_point_mongos_h]
and [here][service_entry_point_mongod_h]). One such distinction is the _mongod_
entry point's use of the `ServiceEntryPointCommon::Hooks` interface, which
provides greater flexibility in modifying the entry point's behavior. This
includes waiting on a read of a particular [read concern][read_concern] level to
@ -28,17 +28,17 @@ for [write concerns][write_concern] as well.

## Strategy

One area in which the _mongos_ entry point differs from its _mongod_ counterpart
is in its usage of the [Strategy class][strategy_h]. `Strategy` operates as a
legacy interface for processing client read, write, and command requests; there
is a near 1-to-1 mapping between its constituent functions and request types
(e.g. `writeOp()` for handling write operation requests, `getMore()` for a
getMore request, etc.). These functions comprise the backbone of the _mongos_
entry point's `handleRequest()` -- that is to say, when a valid request is
received, it is sieved and ultimately passed along to the appropriate Strategy
class member function. The significance of using the Strategy class specifically
with the _mongos_ entry point is that it [facilitates query routing to
shards][mongos_router] in _addition_ to running queries against targeted
databases (see [s/transaction_router.h][transaction_router_h] for finer
details).
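To make the "sieve and dispatch" idea concrete, here is a minimal, self-contained sketch. It is purely illustrative: the names and types below are hypothetical stand-ins, not the actual `Strategy` or `ServiceEntryPoint` code, which is considerably more involved.

```cpp
// Illustrative only: a toy dispatcher showing the kind of 1-to-1 mapping
// between request types and handler functions described above.
#include <functional>
#include <iostream>
#include <map>
#include <string>

struct ToyRequest {
    std::string opType;  // e.g. "write", "getMore", "command"
    std::string body;
};

std::string writeOp(const ToyRequest& r) { return "routed write: " + r.body; }
std::string getMore(const ToyRequest& r) { return "continued cursor: " + r.body; }
std::string runCommand(const ToyRequest& r) { return "ran command: " + r.body; }

// handleRequest() "sieves" the request and hands it to the matching function.
std::string handleRequest(const ToyRequest& r) {
    static const std::map<std::string, std::function<std::string(const ToyRequest&)>> dispatch = {
        {"write", writeOp},
        {"getMore", getMore},
        {"command", runCommand},
    };
    auto it = dispatch.find(r.opType);
    return it != dispatch.end() ? it->second(r) : "unsupported operation";
}

int main() {
    std::cout << handleRequest({"write", "{insert: 'coll'}"}) << std::endl;
    std::cout << handleRequest({"getMore", "cursorId 42"}) << std::endl;
    return 0;
}
```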
@ -50,7 +50,7 @@ system][template_method_pattern], that will likely be used during the lifespan

of a particular server. Construction of a Command should only occur during
server startup. When a new Command is constructed, that Command is stored in a
global `CommandRegistry` object for future reference. There are two kinds of
Command subclasses: `BasicCommand` and `TypedCommand`.

A major distinction between the two is in their implementation of the `parse()`
member function. `parse()` takes in a request and returns a handle to a single
@ -62,7 +62,7 @@ implementation of `TypedCommand::parse()`, on the other hand, varies depending

on the Request type parameter the Command takes in. Since the `TypedCommand`
accepts requests generated by IDL, the parsing function associated with a usable
Request type must allow it to be parsed as an IDL command. In handling requests,
both the _mongos_ and _mongod_ entry points interact with the Command subclasses
through the `CommandHelpers` struct in order to parse requests and ultimately
run them as Commands.
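As a rough, self-contained sketch of the "construct once at startup, register globally, parse per request" flow described above (hypothetical names only, not the real classes from `src/mongo/db/commands.h`):

```cpp
// Hypothetical sketch of the command registration and parse() pattern.
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct ToyRequest {
    std::string commandName;
    std::string payload;
};

// A parsed, runnable handle for one specific request (an "invocation").
struct ToyInvocation {
    std::string description;
    void run() const { std::cout << "running: " << description << std::endl; }
};

class ToyCommand {
public:
    explicit ToyCommand(std::string name) : _name(std::move(name)) {}
    virtual ~ToyCommand() = default;
    const std::string& name() const { return _name; }
    // Each subclass decides how its requests are parsed.
    virtual ToyInvocation parse(const ToyRequest& request) const = 0;

private:
    std::string _name;
};

// Global registry: filled during startup, read-only afterwards.
std::map<std::string, std::unique_ptr<ToyCommand>>& registry() {
    static std::map<std::string, std::unique_ptr<ToyCommand>> commands;
    return commands;
}

class PingCommand : public ToyCommand {
public:
    PingCommand() : ToyCommand("ping") {}
    ToyInvocation parse(const ToyRequest& request) const override {
        return ToyInvocation{"ping with payload '" + request.payload + "'"};
    }
};

int main() {
    // "Server startup": construct and register commands exactly once.
    registry().emplace("ping", std::make_unique<PingCommand>());

    // "Handling a request": look up the command, parse, then run.
    ToyRequest request{"ping", "{}"};
    registry().at(request.commandName)->parse(request).run();
    return 0;
}
```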
@ -81,5 +81,5 @@ For details on transport internals, including ingress networking, see [this docu

[mongos_router]: https://docs.mongodb.com/manual/core/sharded-cluster-query-router/
[transaction_router_h]: ../src/mongo/s/transaction_router.h
[commands_h]: ../src/mongo/db/commands.h
[template_method_pattern]: https://en.wikipedia.org/wiki/Template_method_pattern
[transport_internals]: ../src/mongo/transport/README.md
@ -55,7 +55,7 @@ All `Client`s have an associated lock which protects their internal state includ

associated `OperationContext` from concurrent access. Any mutation to a `Client`’s associated
`OperationContext` (or other protected internal state) _must_ take the `Client` lock before being
performed, as an `OperationContext` can otherwise be killed and destroyed at any time. A `Client`
thread may read its own internal state without taking the `Client` lock, but _must_ take the
`Client` lock when reading another `Client` thread's internal state. Only a `Client`'s owning thread
may write to its `Client`'s internal state, and must take the lock when doing so. `Client`s
implement the standard lockable interface (`lock()`, `unlock()`, and `try_lock()`) to support these

@ -68,7 +68,7 @@ operations. The semantics of the `Client` lock are summarized in the table below

### `Client` thread manipulation

[`Client::cc()`][client-cc-url] may be used to get the `Client` object associated with the currently
executing thread. Prefer passing `Client` objects as parameters over calls to `Client::cc()` when
possible. A [`ThreadClient`][thread-client-url] is an RAII-style class which may be used to construct
and bind a `Client` to the current running thread and automatically unbind it once the `ThreadClient`
@ -1,28 +1,29 @@

# Egress Networking

Egress networking entails outbound communication (i.e. requests) from a client process to a server process (e.g. _mongod_), as well as inbound communication (i.e. responses) from such a server process back to a client process.

## Remote Commands

Remote commands represent the "packages" in which data is transmitted via egress networking. There are two types of remote commands: requests and responses. The [request object][remote_command_request_h] is in essence a wrapper for a command in BSON format, that is to be delivered to and executed by a remote MongoDB node against a database specified by a member in the object. The [response object][remote_command_response_h], in turn, contains data that describes the response to a previously sent request, also in BSON format. Besides the actual response data, the response object also stores useful information such as the duration of running the command specified in the corresponding request, as well as a `Status` member that indicates whether the operation was a success, and the cause of error if not.

There are two variants of both the request and response classes that are used in egress networking. The distinction between the `RemoteCommandRequest` and `RemoteCommandRequestOnAny` classes is that the former specifies a particular host/server to connect to, whereas the latter houses a vector of hosts, for when a command may be run on multiple nodes in a replica set. The distinction between `RemoteCommandResponse` and `RemoteCommandOnAnyResponse` is that the latter includes additional information as to what host the originating request was ultimately run on. It should be noted that the distinctions between the request and response classes are characteristically different; that is to say, whereas the _OnAny_ variant of the request object is an augmented version of the other, the response classes should be understood as being different return types altogether.
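The heavily simplified, hypothetical shapes below are only meant to illustrate that distinction; they are not the real executor types and omit almost all of their state. The `OnAny` request simply carries a list of candidate hosts instead of a single target, while the `OnAny` response is a genuinely different return type that also reports which host ran the command.

```cpp
// Hypothetical, simplified shapes illustrating the request/response variants.
#include <chrono>
#include <string>
#include <vector>

struct ToyStatus {
    bool ok;
    std::string reason;  // cause of error when !ok
};

struct ToyRemoteCommandRequest {
    std::string target;   // a single host to run against
    std::string dbname;   // database the command applies to
    std::string cmdBSON;  // the command, conceptually a BSON object
};

struct ToyRemoteCommandRequestOnAny {
    std::vector<std::string> targets;  // any one of several replica set members may serve it
    std::string dbname;
    std::string cmdBSON;
};

struct ToyRemoteCommandResponse {
    ToyStatus status;                   // success, or the cause of failure
    std::string dataBSON;               // the response document
    std::chrono::milliseconds elapsed;  // how long the command took
};

// A different return type: it additionally records which host answered.
struct ToyRemoteCommandOnAnyResponse : ToyRemoteCommandResponse {
    std::string respondingHost;
};

int main() { return 0; }
```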
## Connection Pooling

[Connection pooling][connection_pool] is largely taken care of by the [executor::connection_pool][connection_pool_h] class. This class houses a collection of `ConnectionPool::SpecificPool` objects, each of which shares a one-to-one mapping with a unique host. This lends itself to a parent-child relationship between a "parent" ConnectionPool and its constituent "children" SpecificPool members. The `ConnectionPool::ControllerInterface` subclass is used to direct the behavior of the SpecificPools that belong to it. The main operations associated with the ControllerInterface are the addition, removal, and updating of hosts (and thereby corresponding SpecificPools) to/from/in the parent pool. SpecificPools are created when a connection to a new host is requested, and expire when `hostTimeout` has passed without there having been any new requests or checked-out connections (i.e. connections in use). A pool can have its expiration status lifted whenever a connection is requested, but once a pool is shut down, the pool becomes unusable. The `hostTimeout` field is one of many parameters belonging to the `ConnectionPool::Options` struct that determines how pools operate.

The `ConnectionPool::ConnectionInterface` is responsible for handling the connections _within_ a pool. The ConnectionInterface's operations include, but are not limited to, connection setup (establishing a connection, authenticating, etc.), refreshing connections, and managing a timer. This interface also maintains the notion of a pool/connection **generation**, which is used to identify whether some particular connection's generation is older than that of the pool it belongs to (i.e. the connection is out-of-date), in which case it is dropped. The ConnectionPool uses a global mutex for access to SpecificPools as well as generation counters. Another component of the ConnectionPool is its `EgressConnectionCloserManager`. The manager consists of multiple `EgressConnectionClosers`, which are used to determine whether hosts should be dropped. In the context of the ConnectionPool, the manager's purpose is to drop _connections_ to hosts based on whether they have been marked as keep open or not.
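As a rough illustration of the generation mechanism described above (hypothetical code, not the real `ConnectionPool`), a per-host pool can stamp each connection with the pool's generation at creation time and drop any pooled connection whose stamp is older than the pool's current generation:

```cpp
// Hypothetical sketch of generation-based connection invalidation.
#include <iostream>
#include <memory>
#include <vector>

struct ToyConnection {
    int generation;  // pool generation at the time the connection was created
    bool open;
};

class ToySpecificPool {
public:
    // Called, for example, when the host is observed to have failed or restarted.
    void bumpGeneration() { ++_generation; }

    std::shared_ptr<ToyConnection> checkOut() {
        while (!_ready.empty()) {
            auto conn = _ready.back();
            _ready.pop_back();
            if (conn->generation == _generation) {
                return conn;  // still current: hand it out for reuse
            }
            conn->open = false;  // out-of-date: drop instead of reusing
        }
        // No usable pooled connection; create a fresh one at the current generation.
        return std::make_shared<ToyConnection>(ToyConnection{_generation, true});
    }

    void checkIn(std::shared_ptr<ToyConnection> conn) { _ready.push_back(std::move(conn)); }

private:
    int _generation = 0;
    std::vector<std::shared_ptr<ToyConnection>> _ready;
};

int main() {
    ToySpecificPool pool;
    auto c1 = pool.checkOut();
    pool.checkIn(c1);
    pool.bumpGeneration();      // e.g. the host restarted
    auto c2 = pool.checkOut();  // c1 is stale, so a new connection is made
    std::cout << std::boolalpha << (c1 != c2) << std::endl;  // prints "true"
    return 0;
}
```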
## Internal Network Clients

Client-side outbound communication in egress networking is primarily handled by the [AsyncDBClient class][async_client_h]. The async client is responsible for initializing a connection to a particular host as well as initializing the [wire protocol][wire_protocol] for client-server communication, after which remote requests can be sent by the client and corresponding remote responses from a database can subsequently be received. In setting up the wire protocol, the async client sends an [isMaster][is_master] request to the server and parses the server's isMaster response to ensure that the status of the connection is OK. An initial isMaster request is constructed in the legacy OP_QUERY protocol, so that clients can still communicate with servers that may not support other protocols. The async client also supports client authentication functionality (i.e. authenticating a user's credentials, client host, remote host, etc.).

The scheduling of requests is managed by the [task executor][task_executor_h], which maintains the notion of **events** and **callbacks**. Callbacks represent work (e.g. remote requests) that is to be executed by the executor, and are scheduled by client threads as well as other callbacks. There are several variations of work scheduling methods, which include: immediate scheduling, scheduling no earlier than a specified time, and scheduling iff a specified event has been signalled. These methods return a handle that can be used while the executor is still in scope for either waiting on or cancelling the scheduled callback in question. If a scheduled callback is cancelled, it remains on the work queue and is technically still run, but is labeled as having been 'cancelled' beforehand. Once a given callback/request is scheduled, the task executor is then able to execute such requests via a [network interface][network_interface_h]. The network interface, connected to a particular host/server, begins the asynchronous execution of commands specified via a request bundled in the aforementioned callback handle. The interface is capable of blocking threads until its associated task executor has work that needs to be performed, and is likewise able to return from an idle state when it receives a signal that the executor has new work to process.
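The sketch below uses hypothetical names (not the real task executor API) to illustrate the handle and cancellation behavior just described: scheduling returns a handle, and cancelling through the handle does not remove the callback from the queue; it still runs, but sees a "cancelled" status.

```cpp
// Hypothetical sketch of schedule/cancel semantics for queued callbacks.
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

struct ToyCallbackState {
    std::function<void(bool cancelled)> work;
    bool cancelled = false;
};
using ToyCallbackHandle = std::shared_ptr<ToyCallbackState>;

class ToyExecutor {
public:
    ToyCallbackHandle schedule(std::function<void(bool cancelled)> work) {
        auto state = std::make_shared<ToyCallbackState>();
        state->work = std::move(work);
        _queue.push_back(state);
        return state;  // caller can wait on or cancel via this handle
    }

    void cancel(const ToyCallbackHandle& handle) {
        handle->cancelled = true;  // stays on the queue; it will still run
    }

    void drain() {
        for (auto& state : _queue) {
            state->work(state->cancelled);  // cancelled work is told it was cancelled
        }
        _queue.clear();
    }

private:
    std::vector<ToyCallbackHandle> _queue;
};

int main() {
    ToyExecutor executor;
    auto send = [](bool cancelled) {
        std::cout << (cancelled ? "skipped (cancelled)" : "sent request") << std::endl;
    };
    executor.schedule(send);
    auto laterCancelled = executor.schedule(send);
    executor.cancel(laterCancelled);
    executor.drain();  // prints "sent request" then "skipped (cancelled)"
    return 0;
}
```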
Client-side legacy networking draws upon the `DBClientBase` class, of which there are multiple subclasses residing in the `src/mongo/client` folder. The [replica set DBClient][dbclient_rs_h] discerns which one of multiple servers in a replica set is the primary at construction time, and establishes a connection (using the `DBClientConnection` wrapper class, also extended from `DBClientBase`) with the replica set via the primary. In cases where the primary server is unresponsive within a specified time range, the RS DBClient will automatically attempt to establish a secondary server as the new primary (see [automatic failover][automatic_failover]).

## See Also

For details on transport internals, including ingress networking, see [this
document][transport_internals].
@ -2,9 +2,9 @@

Documentation about how MongoDB is tested in Evergreen.

- [Burn_in_tags](burn_in_tags.md)
- [Burn_in_tests](burn_in_tests.md)
- [Configuration for Evergreen Integration](configuration.md)
- [Task Timeouts](task_timeouts.md)
- [Task Generation](task_generation.md)
- [Multiversion Testing](multiversion.md)
@ -19,6 +19,7 @@ new or changed javascript tests (note that a javascript test can be included in

those tests will be run 2 times minimum, and 1000 times maximum or for 10 minutes, whichever is reached first.

## ! Run All Affected JStests

The `! Run All Affected JStests` variant has a single `burn_in_tags_gen` task. This task will create &
activate [`burn_in_tests`](burn_in_tests.md) tasks for all required and suggested
variants. The end result is that any jstests that have been modified in the patch will
@ -1,51 +1,53 @@

# Evergreen configuration

This document describes the continuous integration (CI) configuration for MongoDB.

## Projects

There are a number of Evergreen projects supporting MongoDB's CI. For more information on
Evergreen-specific terminology used in this document, please refer to the
[Project Configuration](https://github.com/evergreen-ci/evergreen/wiki/Project-Configuration-Files)
section of the Evergreen wiki.

### `mongodb-mongo-master`

The main project for testing MongoDB's dev environments with a number of build variants,
each one corresponding to a particular compile or testing environment to support development.
Each build variant runs a set of tasks; each task usually runs one or more tests.

### `mongodb-mongo-master-nightly`

Tracks the same branch as `mongodb-mongo-master`; each build variant corresponds to a
(version, OS, architecture) triplet for a supported MongoDB nightly release.

### `sys_perf`

The system performance project.

### `microbenchmarks`

Performance unittests, used mainly for validating areas related to the Query system.

## Project configurations

The above Evergreen projects are defined in the following files:

- `etc/evergreen_yml_components/**.yml`. YAML files containing definitions for tasks, functions, buildvariants, etc.
  They are copied from the existing evergreen.yml file.

- `etc/evergreen.yml`. Imports components from above and serves as the project config for mongodb-mongo-master,
  containing all build variants for development, including all feature-specific, patch build required, and suggested
  variants.

- `etc/evergreen_nightly.yml`. The project configuration for mongodb-mongo-master-nightly, containing only build
  variants for public nightly builds; it imports similar components as evergreen.yml to ensure consistency.

- `etc/sys_perf.yml`. Configuration file for the system performance project.

- `etc/perf.yml`. Configuration for the microbenchmark project.

## Release Branching Process

Only the `mongodb-mongo-master-nightly` project will be branched with required and other
necessary variants (e.g. sanitizers) added back in. Most variants in `mongodb-mongo-master`
would be dropped by default but can be re-introduced to the release branches manually on an
@ -1,48 +1,43 @@

# Multiversion Testing

## Table of contents

- [Multiversion Testing](#multiversion-testing)
  - [Table of contents](#table-of-contents)
  - [Terminology and overview](#terminology-and-overview)
    - [Introduction](#introduction)
    - [Latest vs last-lts vs last-continuous](#latest-vs-last-lts-vs-last-continuous)
    - [Old vs new](#old-vs-new)
    - [Explicit and Implicit multiversion suites](#explicit-and-implicit-multiversion-suites)
    - [Version combinations](#version-combinations)
  - [Working with multiversion tasks in Evergreen](#working-with-multiversion-tasks-in-evergreen)
    - [Exclude tests from multiversion testing](#exclude-tests-from-multiversion-testing)
    - [Multiversion task generation](#multiversion-task-generation)

## Terminology and overview

### Introduction

Some tests exercise specific upgrade/downgrade behavior expected between different versions of MongoDB.
Several versions of MongoDB are spun up during those test runs.

- Multiversion suites - resmoke suites that are running tests with several versions of MongoDB.

- Multiversion tasks - Evergreen tasks that are running multiversion suites. Multiversion tasks in
  most cases include `multiversion` or `downgrade` in their names.

### Latest vs last-lts vs last-continuous

For some versions we use generic names such as `latest`, `last-lts`, and
`last-continuous`.

- `latest` - the current version. In Evergreen, the version that was compiled in the current build.

- `last-lts` - the latest LTS (Long Term Support) Major release version. In Evergreen, the version
  that was downloaded from the last LTS release branch project.

- `last-continuous` - the latest Rapid release version. In Evergreen, the version that was
  downloaded from the Rapid release branch project.

### Old vs new
@ -56,34 +51,53 @@ compiled binaries are downloaded from the old branch projects with

`db-contrib-tool` searches for the latest available compiled binaries on the old branch projects in
Evergreen.

### Explicit and Implicit multiversion suites

Multiversion suites can be explicit or implicit.

- Explicit - JS tests are aware of the binary versions they are running,
  e.g. [multiversion.yml](https://github.com/mongodb/mongo/blob/e91cda950e50aa4c707efbdd0be208481493fc96/buildscripts/resmokeconfig/suites/multiversion.yml).
  The version of binaries is explicitly set in JS tests,
  e.g. [jstests/multiVersion/genericSetFCVUsage/major_version_upgrade.js](https://github.com/mongodb/mongo/blob/397c8da541940b3fbe6257243f97a342fe7e0d3b/jstests/multiVersion/genericSetFCVUsage/major_version_upgrade.js#L33-L44):

```js
const versions = [
    {
        binVersion: "4.4",
        featureCompatibilityVersion: "4.4",
        testCollection: "four_four",
    },
    {
        binVersion: "5.0",
        featureCompatibilityVersion: "5.0",
        testCollection: "five_zero",
    },
    {
        binVersion: "6.0",
        featureCompatibilityVersion: "6.0",
        testCollection: "six_zero",
    },
    {
        binVersion: "last-lts",
        featureCompatibilityVersion: lastLTSFCV,
        testCollection: "last_lts",
    },
    {
        binVersion: "last-continuous",
        featureCompatibilityVersion: lastContinuousFCV,
        testCollection: "last_continuous",
    },
    {
        binVersion: "latest",
        featureCompatibilityVersion: latestFCV,
        testCollection: "latest",
    },
];
```

- Implicit - JS tests know nothing about the binary versions they are running,
  e.g. [retryable_writes_downgrade.yml](https://github.com/mongodb/mongo/blob/e91cda950e50aa4c707efbdd0be208481493fc96/buildscripts/resmokeconfig/suites/retryable_writes_downgrade.yml).
  Most of the implicit multiversion suites are using matrix suites, e.g. `replica_sets_last_lts`:

```bash
$ python buildscripts/resmoke.py suiteconfig --suite=replica_sets_last_lts
@ -118,7 +132,7 @@ The [example](https://github.com/mongodb/mongo/blob/e91cda950e50aa4c707efbdd0be2

of replica set fixture configuration override:

```yaml
fixture:
  num_nodes: 3
  old_bin_version: last_lts
  mixed_bin_versions: new_new_old
@ -128,7 +142,7 @@ The [example](https://github.com/mongodb/mongo/blob/e91cda950e50aa4c707efbdd0be2

of sharded cluster fixture configuration override:

```yaml
fixture:
  num_shards: 2
  num_rs_nodes_per_shard: 2
  old_bin_version: last_lts
@ -139,71 +153,68 @@ The [example](https://github.com/mongodb/mongo/blob/e91cda950e50aa4c707efbdd0be2

of shell fixture configuration override:

```yaml
value:
  executor:
    config:
      shell_options:
        global_vars:
          TestData:
            useRandomBinVersionsWithinReplicaSet: "last-lts"
```

### Version combinations

In implicit multiversion suites the same set of tests may run in similar suites that are using
various mixed version combinations. Those version combinations depend on the type of resmoke
fixture the suite is running with. These are the recommended version combinations to test against based on the suite fixtures:

- Replica set fixture combinations:

  - `last-lts new-new-old` (i.e. suite runs the replica set fixture that spins up the `latest` and
    the `last-lts` versions in a 3-node replica set where the 1st node is the `latest`, 2nd - `latest`,
    3rd - `last-lts`, etc.)
  - `last-lts new-old-new`
  - `last-lts old-new-new`
  - `last-continuous new-new-old`
  - `last-continuous new-old-new`
  - `last-continuous old-new-new`
  - Ex: [change_streams](https://github.com/10gen/mongo/blob/88d59bfe9d5ee2c9938ae251f7a77a8bf1250a6b/buildscripts/resmokeconfig/suites/change_streams.yml) uses a [`ReplicaSetFixture`](https://github.com/10gen/mongo/blob/88d59bfe9d5ee2c9938ae251f7a77a8bf1250a6b/buildscripts/resmokeconfig/suites/change_streams.yml#L50) so the corresponding multiversion suites are
    - [`change_streams_last_continuous_new_new_old`](https://github.com/10gen/mongo/blob/612814f4ce56282c47d501817ba28337c26d7aba/buildscripts/resmokeconfig/matrix_suites/mappings/change_streams_last_continuous_new_new_old.yml)
    - [`change_streams_last_continuous_new_old_new`](https://github.com/10gen/mongo/blob/612814f4ce56282c47d501817ba28337c26d7aba/buildscripts/resmokeconfig/matrix_suites/mappings/change_streams_last_continuous_new_old_new.yml)
    - [`change_streams_last_continuous_old_new_new`](https://github.com/10gen/mongo/blob/612814f4ce56282c47d501817ba28337c26d7aba/buildscripts/resmokeconfig/matrix_suites/mappings/change_streams_last_continuous_old_new_new.yml)
    - [`change_streams_last_lts_new_new_old`](https://github.com/10gen/mongo/blob/612814f4ce56282c47d501817ba28337c26d7aba/buildscripts/resmokeconfig/matrix_suites/mappings/change_streams_last_lts_new_new_old.yml)
    - [`change_streams_last_lts_new_old_new`](https://github.com/10gen/mongo/blob/612814f4ce56282c47d501817ba28337c26d7aba/buildscripts/resmokeconfig/matrix_suites/mappings/change_streams_last_lts_new_old_new.yml)
    - [`change_streams_last_lts_old_new_new`](https://github.com/10gen/mongo/blob/612814f4ce56282c47d501817ba28337c26d7aba/buildscripts/resmokeconfig/matrix_suites/mappings/change_streams_last_lts_old_new_new.yml)

- Sharded cluster fixture combinations:

  - `last-lts new-old-old-new` (i.e. suite runs the sharded cluster fixture that spins up the
    `latest` and the `last-lts` versions in a sharded cluster that consists of 2 shards with 2-node
    replica sets per shard where the 1st node of the 1st shard is the `latest`, 2nd node of 1st
    shard - `last-lts`, 1st node of 2nd shard - `last-lts`, 2nd node of 2nd shard - `latest`, etc.)
  - `last-continuous new-old-old-new`
  - Ex: [change_streams_downgrade](https://github.com/10gen/mongo/blob/a96b83b2fa7010a5823fefac2469b4a06a697cf1/buildscripts/resmokeconfig/suites/change_streams_downgrade.yml) uses a [`ShardedClusterFixture`](https://github.com/10gen/mongo/blob/a96b83b2fa7010a5823fefac2469b4a06a697cf1/buildscripts/resmokeconfig/suites/change_streams_downgrade.yml#L408) so the corresponding multiversion suites are
    - [`change_streams_downgrade_last_continuous_new_old_old_new`](https://github.com/10gen/mongo/blob/612814f4ce56282c47d501817ba28337c26d7aba/buildscripts/resmokeconfig/matrix_suites/mappings/change_streams_downgrade_last_continuous_new_old_old_new.yml)
    - [`change_streams_downgrade_last_lts_new_old_old_new`](https://github.com/10gen/mongo/blob/612814f4ce56282c47d501817ba28337c26d7aba/buildscripts/resmokeconfig/matrix_suites/mappings/change_streams_downgrade_last_lts_new_old_old_new.yml)

- Shell fixture combinations:

  - `last-lts` (i.e. suite runs the shell fixture that spins up `last-lts` as the `old` versions,
    etc.)
  - `last-continuous`
  - Ex: [initial_sync_fuzzer](https://github.com/10gen/mongo/blob/908625ffdec050a71aa2ce47c35788739f629c60/buildscripts/resmokeconfig/suites/initial_sync_fuzzer.yml) uses a Shell Fixture, so the corresponding multiversion suites are
    - [`initial_sync_fuzzer_last_lts`](https://github.com/10gen/mongo/blob/612814f4ce56282c47d501817ba28337c26d7aba/buildscripts/resmokeconfig/matrix_suites/mappings/initial_sync_fuzzer_last_lts.yml)
    - [`initial_sync_fuzzer_last_continuous`](https://github.com/10gen/mongo/blob/612814f4ce56282c47d501817ba28337c26d7aba/buildscripts/resmokeconfig/matrix_suites/mappings/initial_sync_fuzzer_last_continuous.yml)

If `last-lts` and `last-continuous` versions happen to be the same, or last-continuous is EOL, we skip `last-continuous`
and run multiversion suites with only `last-lts` combinations in Evergreen.

## Working with multiversion tasks in Evergreen

### Multiversion task generation

Please refer to mongo-task-generator [documentation](https://github.com/mongodb/mongo-task-generator/blob/master/docs/generating_tasks.md#multiversion-testing)
for generating multiversion tasks in Evergreen.

### Exclude tests from multiversion testing

Sometimes tests are not designed to run in multiversion suites. To avoid implicit multiversion
@ -14,7 +14,7 @@ for details on how it works.

## Configuring a task to be generated

In order to generate a task, we typically create a placeholder task. By convention the name of
these tasks should end in "\_gen". Most of the time, generated tasks should inherit the
[gen_task_template](https://github.com/mongodb/mongo/blob/31864e3866ce9cc54c08463019846ded2ad9e6e5/etc/evergreen_yml_components/definitions.yml#L99-L107)
which configures the required dependencies.

@ -31,9 +31,9 @@ Once a placeholder task in defined, you can reference it just like a normal task

Task generation is performed as a 2-step process.

1. The first step is to generate the configuration for the generated tasks and send that to
   evergreen to actually create the tasks. This is done by the `version_gen` task using the
   `mongo-task-generator` tool. This only needs to be done once for the entire version and rerunning
   this task will result in a no-op.

   The tasks will be generated in an "inactive" state. This allows us to generate all available
   tasks, regardless of whether they are meant to be run or not. This way if we choose to run

@ -45,10 +45,10 @@ Task generation is performed as a 2-step process.

   placeholder tasks from view.

2. After the tasks have been generated, the placeholder tasks are free to run. The placeholder tasks
   simply find the task generated for them and mark it activated. Since generated tasks are
   created in the "inactive" state, this will activate any generated tasks whose placeholder task
   runs. This enables users to select tasks to run on the initial task selection page even though
   the tasks have not yet been generated.

**Note**: While this 2-step process allows a similar user experience to working with normal tasks,
it does create a few UI quirks. For example, evergreen will hide "inactive" tasks in the UI, as a
@ -4,12 +4,12 @@

There are two types of timeouts that [evergreen supports](https://github.com/evergreen-ci/evergreen/wiki/Project-Commands#timeoutupdate):

- **Exec timeout**: The _exec_ timeout is the overall timeout for a task. Once the total runtime for
  a test hits this value, the timeout logic will be triggered. This value is specified by
  **exec_timeout_secs** in the evergreen configuration.
- **Idle timeout**: The _idle_ timeout is the amount of time in which evergreen will wait for
  output to be created before it considers the task hung and triggers timeout logic. This value
  is specified by **timeout_secs** in the evergreen configuration.

**Note**: In most cases, **exec_timeout** is the more useful of the two timeouts.

@ -17,19 +17,19 @@ is specified by **timeout_secs** in the evergreen configuration.

There are a few ways in which the timeout can be determined for a task running in evergreen.

- **Specified in 'etc/evergreen.yml'**: Timeout can be specified directly in the 'evergreen.yml' file,
  both on tasks and build variants. This can be useful for setting default timeout values, but is limited
  since different build variants frequently have different runtime characteristics and it is not possible
  to set timeouts for a task running on a specific build variant.

- **etc/evergreen_timeouts.yml**: The 'etc/evergreen_timeouts.yml' file allows overriding timeouts
  for specific tasks on specific build variants. This provides a work-around for the limitations of
  specifying the timeouts directly in the 'evergreen.yml'. In order to use this method, the task
  must run the "determine task timeout" and "update task timeout expansions" functions at the beginning
  of the task evergreen definition. Most resmoke tasks already do this.

- **buildscripts/evergreen_task_timeout.py**: This is the script that reads the 'etc/evergreen_timeouts.yml'
  file and calculates the timeout to use. Additionally, it will check the historic test results of the
  task being run and see if there is enough information to calculate timeouts based on that. It can
  also be used for more advanced ways of determining timeouts (e.g. the script is used to set much
  more aggressive timeouts on tasks that are run in the commit-queue).
# Exception Architecture

MongoDB code uses the following types of assertions that are available for use:

- `uassert` and `iassert`
  - Checks for per-operation user errors. Operation-fatal.
- `tassert`
- `invariant`
  - Checks process invariant. Process-fatal. Use to detect code logic errors ("pointer should
    never be null", "we should always be locked").

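A quick sketch of the distinction between the user-error and invariant flavors (the function, code
numbers, and messages here are made up for illustration; only the `uassert` and `invariant` macros
come from the list above):

```cpp
#include "mongo/util/assert_util.h"

namespace mongo {
// Hypothetical example: reject bad user input with uassert, and use invariant
// for conditions that can only be false if our own logic is broken.
void setCacheSizeMB(int requestedMB, int* cacheSizeMB) {
    uassert(7654300, "cache size must be positive", requestedMB > 0);  // per-operation user error
    invariant(cacheSizeMB);  // programmer error if this pointer is ever null
    *cacheSizeMB = requestedMB;
}
}  // namespace mongo
```
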
**Note**: Calling the C function `assert` is not allowed. Use one of the above instead.

The following types of assertions are deprecated:

### Choosing a unique location number

The current convention for choosing a unique location number is to use the 5 digit SERVER ticket number
for the ticket being addressed when the assertion is added, followed by a two digit counter to distinguish
between codes added as part of the same ticket. For example, if you're working on SERVER-12345, the first
error code would be 1234500, the second would be 1234501, etc. This convention can also be used for LOGV2
logging id numbers.

The only real constraint for unique location numbers is that they must be unique across the codebase. This is
verified at compile time with a [python script][errorcodes_py].

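As a concrete illustration of the numbering convention, a hypothetical change made while working on
SERVER-12345 might add its assertions like this (the helper function, names, and messages are made
up; only the `uassert` macro and the numbering scheme come from the convention above):

```cpp
#include <string>

#include "mongo/util/assert_util.h"

namespace mongo {
// Hypothetical validation helper added under SERVER-12345. The unique location
// numbers are the 5-digit ticket number followed by a 2-digit counter.
void validateRequest(const std::string& collName, long long batchSize) {
    uassert(1234500, "expected a non-empty collection name", !collName.empty());
    uassert(1234501, "batch size must be positive", batchSize > 0);
}
}  // namespace mongo
```
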
## Exception

MongoDB uses `ErrorCodes` both internally and externally: a subset of error codes (e.g.,
`BadValue`) are used externally to pass errors over the wire and to clients. These error codes are
the means for MongoDB processes (e.g., _mongod_ and _mongo_) to communicate errors, and are visible
to client applications. Other error codes are used internally to indicate the underlying reason for
a failed operation. For instance, `PeriodicJobIsStopped` is an internal error code that is passed
to callback functions running inside a [`PeriodicRunner`][periodic_runner_h] once the runner is

Gotchas to watch out for:

properly.

- Think about the location of your asserts in constructors, as the destructor would not be
  called. But at a minimum, use `wassert` a lot therein; we want to know if something is wrong.
- Do **not** throw in destructors or allow exceptions to leak out (if you call a function that
  may throw).
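
A minimal sketch of the destructor rule, using a made-up RAII helper (nothing here is taken from the
real codebase):

```cpp
// Hypothetical RAII guard, for illustration only: any exception thrown while
// releasing the resource is contained inside the destructor.
class ResourceGuard {
public:
    ~ResourceGuard() {
        try {
            release();  // May throw.
        } catch (...) {
            // Swallow (or log) the error; never let it escape the destructor.
        }
    }

private:
    void release();  // Hypothetical cleanup that can fail.
};
```
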
[raii]: https://en.wikipedia.org/wiki/Resource_acquisition_is_initialization
[error_codes_yml]: ../src/mongo/base/error_codes.yml
[periodic_runner_h]: ../src/mongo/util/periodic_runner.h

A fail point must first be defined using `MONGO_FAIL_POINT_DEFINE(myFailPoint)`. This statement
adds the fail point to a registry and allows it to be evaluated in code. There are three common
patterns for evaluating a fail point:

- Exercise a rarely used branch:
  `if (whenPigsFly || myFailPoint.shouldFail()) { ... }`
- Block until the fail point is unset:
  `myFailPoint.pauseWhileSet();`
- Use the fail point's payload to perform custom behavior:
  `myFailPoint.execute([](const BSONObj& data) { useMyPayload(data); });`

For more complete usage, see the [fail point header][fail_point] or the [fail point
tests][fail_point_test].

A fail point can also be set via setParameter by its name prefixed with "failpoint." (e.g.,
"failpoint.myFailPoint").

Users can also wait until a fail point has been evaluated a certain number of times **_over its
lifetime_**. A `waitForFailPoint` command request will send a response back when the fail point has
been evaluated the given number of times. For ease of use, the `configureFailPoint` JavaScript
helper returns an object that can be used to wait a certain number of times **_from when the fail
point was enabled_**. In C++ tests, users can invoke `FailPoint::waitForTimesEntered()` for similar
behavior. `FailPointEnableBlock` records the number of times the fail point had been evaluated when
it was constructed, accessible via `FailPointEnableBlock::initialTimesEntered()`.

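For example, a C++ test might combine `FailPointEnableBlock` with `waitForTimesEntered()` roughly as
follows (a sketch only: it assumes a fail point named `myFailPoint` defined as above, and some code
under test that evaluates it on another thread):

```cpp
#include "mongo/util/fail_point.h"

namespace mongo {
void waitForMyFailPointOnce() {
    // Enable (and optionally configure) the fail point for this scope.
    FailPointEnableBlock fpb("myFailPoint");

    // ... start the work that evaluates myFailPoint on another thread ...

    // Block until the fail point has been evaluated one more time than it had
    // been when the block was constructed.
    fpb->waitForTimesEntered(fpb.initialTimesEntered() + 1);
}
}  // namespace mongo
```
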
## A Few Definitions

- A `Future<T>` is a type that will eventually contain either a `T`, or an error indicating why the
  `T` could not be produced (in MongoDB, the error will take the form of either an exception or a
  `Status`).
- A `Promise<T>` is a single-shot producer of a value (i.e., a `T`) for an associated `Future<T>`.
  That is, to put a value or error in a `Future<T>` and make it ready for use by consumers, the
  value is emplaced in the corresponding `Promise<T>`.
- A continuation is a functor that can be chained onto a `Future<T>` and that will execute only once
  the `T` (or error) is available and ready. A continuation in this way can "consume" the produced
  `T`, and handle any errors.

## A First Example

To build some intuition around futures and promises, let's see how they might be used. As an
example, we'll look at how they help us rewrite some slow blocking code into fast, concurrent code.
As a distributed system, MongoDB often needs to send RPCs from one machine to another. A sketch of a
simple, synchronous way of doing so might look like this:

```c++
Message call(Message& toSend) {
    ...
}
```

This is fine, but some parts of networking are expensive! `TransportSession::sinkMessage` involves
making expensive system calls to enqueue our message into the kernel's networking stack, and
`TransportSession::sourceMessage` entails waiting for a network round-trip to occur! We don't want
busy worker threads to be forced to wait around to hear back from the kernel for these sorts of
expensive operations. Instead, we'd rather let these threads move on to perform other work, and
handle the response from our expensive networking operations when they're available. Futures and
promises allow us to do this. We can rewrite our example as follows:

```c++
Future<Message> call(Message& toSend) {
    ...
    });
}
```

First, notice that our calls to `TransportSession::sourceMessage` and
`TransportSession::sinkMessage` have been replaced with calls to asynchronous versions of those
functions. These asynchronous versions are future-returning; they don't block, but also don't return

This is explained in more detail in the "How Are Results Propagated
Down Continuation Chains?" section below.

## Filling In Some Details

The example above hopefully showed us how futures can be used to structure asynchronous programs at
a high level, but we've left out some important details about how they work.

### How Are Futures Fulfilled With Values?

In our example, we looked at how some code that needs to wait for results can use `Future`s to be
written in an asynchronous, performant way. But some thread running elsewhere needs to actually
"fulfill" those futures with a value or error. Threads can fulfill the core "promise" of a
`Future<T>` - that it will eventually contain a `T` or an error - by using the appropriately named
`Promise<T>` type. Every pending `Future<T>` is associated with exactly one corresponding
`Promise<T>` that can be used to ready the `Future<T>`, providing it with a value. (Note that a
`Future<T>` may also be "born ready"/already filled with a value when constructed). The `Future<T>`
can be "made ready" by emplacing a value or error in the associated promise with
`Promise<T>::emplaceValue`, `Promise<T>::setError`, or related helper member functions (see the

`getFuture()` member function.

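To make the producer/consumer relationship concrete, here is a minimal sketch. It assumes the
`makePromiseFuture<T>()` helper from `future.h`, which returns a paired `Promise<T>` and
`Future<T>`; everything else is illustrative:

```cpp
#include "mongo/util/assert_util.h"
#include "mongo/util/future.h"

namespace mongo {
void promiseFutureSketch() {
    auto pf = makePromiseFuture<int>();

    // Consumer side: chain a continuation that runs once a value is emplaced.
    auto consumed = std::move(pf.future).then([](int v) { return v + 1; });

    // Producer side (often a different thread): fulfill the promise.
    pf.promise.emplaceValue(41);

    // The continuation has run, so the chained future is ready.
    invariant(consumed.get() == 42);
}
}  // namespace mongo
```
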
### Where Do Continuations Run?

In our example, we chained continuations onto futures using functions like `Future<T>::then()`, and
explained that the continuations we chained will only be invoked once the future we've chained them
onto is ready. But we haven't yet specified how this continuation is invoked: what thread will

Fortunately, the service can enforce these guarantees using two types closely related to
`Future<T>`: the types `SemiFuture<T>` and `ExecutorFuture<T>`.

#### SemiFuture

`SemiFuture`s are like regular futures, except that continuations cannot be chained to them.
Instead, values and errors can only be extracted from them via blocking methods, which threads can
call if they are willing to block. A `Future<T>` can always be transformed into a `SemiFuture<T>`
using the member function `Future<T>::semi()`. Let's look at a quick example to make this clearer:

```c++
// Code producing a `SemiFuture`
SemiFuture<Work> SomeAsyncService::requestWork() {
    ...
}

SemiFuture<Work> sf = SomeAsyncService::requestWork();
...
// sf.onError(...) won't compile for the same reason
auto res = sf.get(); // OK; get blocks until sf is ready
```

Our example begins when a thread makes a request for some asynchronous work to be performed by some
service, using `SomeAsyncService::requestWork()`. As was the case in our initial example, this
thread receives back a future that will be readied when its request has been completed and a value

that requests work from it from using its own internal `_privateExecutor` resource.

#### ExecutorFuture

`ExecutorFuture`s are another variation on the core `Future` type; they are like regular `Future`s,
except for the fact that code constructing an `ExecutorFuture` is required to provide an
[executor][executor] on which any continuations chained to the future will be run. (An executor is

Let's imagine the thread that scheduled work via
`SomeAsyncService::requestWork()` can't afford to block until the result `SemiFuture` is readied.
Instead, it consumes the asynchronous result by specifying a callback to run and an executor on
which to run it like so:

```c++
// Code consuming a `SemiFuture`
SomeAsyncService::requestWork() // <-- temporary `SemiFuture`
    .thenRunOn(_executor) // <-- Transformed into an `ExecutorFuture`
    .then([](Work w) { doMoreWork(w); }); // <-- Which supports chaining
```

By calling `.thenRunOn(_executor)` on the `SemiFuture` returned by
`SomeAsyncService::requestWork()`, we transform it from a `SemiFuture` to an `ExecutorFuture`. This
allows us to again chain continuations to run when the future is ready, but instead of those
continuations being run on whatever thread readied the future, they will be run on `_executor`. In
this way, the result of the future returned by `SomeAsyncService::requestWork()` is able to be
consumed by the `doMoreWork` function, which will run on `_executor`.

### How Are Results Propagated Down Continuation Chains?

In our example for an asyncified `call()` function above, we saw that we could attach continuations
onto futures, like the one returned by `TransportSession::asyncSinkMessage`. We also saw that once
we attached one continuation to a future, we could attach subsequent ones, forming a continuation

this way, we can chain different continuations to a `Future<T>` to consume its result, depending on
what the type of the result is (i.e. a `T` or `Status`). We mentioned above that `.then()` is used
to chain continuations that run when the future to which the continuation is chained resolves
successfully. As a result, when a continuation is chained via `.then()` to a `Future<T>`, the
continuation must accept a `T`, the result of the `Future<T>`, as an argument to consume. In the
case of a `Future<void>`, continuations chained via `.then()` accept no arguments. Similarly, as
`.onError()` is used to chain continuations that run when the future is resolved with an error,
these continuations must accept a `Status` as argument, which contains the error the future it is
chained to resolves with. Lastly, as `.onCompletion()` is used to chain continuations that run in
case a `Future<T>` resolves with success or error, continuations chained via this function must
accept an argument that can contain the results of successful resolution of the chained-to future or
an error. When `T` is non-void, continuations chained via `.onCompletion()` must therefore accept a
`StatusWith<T>` as argument, which will contain a `T` if the chained-to future resolved successfully
and an error status otherwise. If `T` is void, a continuation chained via `.onCompletion()` must
accept a `Status` as argument, indicating whether or not the future the continuation is chained to

Next, the successful result reaches the continuation chained
via `.then()`, which must take no arguments as `TransportLayer::asyncSinkMessage` returns a
`Future<void>`. Because the future returned by `TransportLayer::asyncSinkMessage` resolved
successfully, the continuation chained via `.then()` does run. The result of this continuation is
the future returned by `TransportLayer::asyncSourceMessage`. When this future resolves, the result
will traverse the remaining continuation chain, and find the continuation chained via
`.onCompletion()`, which always accepts the result of a future, however it resolves, and therefore
is run.

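A small sketch of how these argument types line up. The `Future<int>` producer is hypothetical; only
`.then()`, `.onError()`, and `.onCompletion()` and their argument conventions come from the
discussion above:

```cpp
#include "mongo/base/status_with.h"
#include "mongo/util/future.h"

namespace mongo {
// Consume a Future<int> produced elsewhere. Returning void from .onCompletion()
// yields a Future<void>.
Future<void> consumeCount(Future<int> fut) {
    return std::move(fut)
        .then([](int n) {                           // success path: receives the int
            return n * 2;                           // continues the chain as Future<int>
        })
        .onError([](Status s) -> StatusWith<int> {  // error path: receives the Status
            return s;                               // propagate (or translate) the error
        })
        .onCompletion([](StatusWith<int> swN) {     // always runs, success or error
            if (swN.isOK()) {
                // consume swN.getValue() here
            }
        });
}
}  // namespace mongo
```
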
converted to `Status`, and crucially, callables chained as continuations via `.getAsync()` cannot
throw any exceptions, as there is no appropriate context with which to handle an asynchronous
exception. If an exception is thrown from a continuation chained via `.getAsync()`, the entire
process will be terminated (i.e. the program will crash).

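For instance, a fire-and-forget consumer might look roughly like this (a sketch only; the error
handling shown is just a placeholder):

```cpp
#include "mongo/base/status_with.h"
#include "mongo/util/future.h"

namespace mongo {
// Consume a Future<int> without blocking and without returning a new future.
// The callback receives a StatusWith<int> and must not throw.
void consumeCountAsync(Future<int> fut) {
    std::move(fut).getAsync([](StatusWith<int> swResult) noexcept {
        if (!swResult.isOK()) {
            return;  // Record or ignore the error; never throw from here.
        }
        // Use swResult.getValue() ...
    });
}
}  // namespace mongo
```
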
## Notes and Links

and all the related types, check out the [header file][future] and search for the
function you're interested in.

### Future Utilities

We have many utilities written to help make it easier for you to work with futures; check out
[future_util.h][future_util.h] to see them. Their [unit tests][utilUnitTests] also help elucidate
how they can be useful. Additionally, when making requests for asynchronous work through future-ful

README][cancelationArch].

## General Promise/Future Docs

For intro-documentation on programming with promises and futures, this blog post about future use at
[Facebook][fb] and the documentation for the use of promises and futures at [Twitter][twtr] are also
very helpful.

[future]: ../src/mongo/util/future.h
[future_util.h]: ../src/mongo/util/future_util.h
[executor]: ../src/mongo/util/out_of_line_executor.h

# Overview

Golden Data test framework provides the ability to run and manage tests that produce an output which is
verified by comparing it to the checked-in, known valid output. Any differences result in test
failure, and either the code or the expected output has to be updated.

Golden Data tests excel at bulk diffing of failed test outputs and bulk accepting of new test
outputs.

# When to use Golden Data tests?

- Code under test produces a deterministic output: That way tests can consistently succeed or fail.
- Incremental changes to code under test or test fixture result in incremental changes to the
  output.
- As an alternative to ASSERT for large output comparison: Serves the same purpose, but provides
  tools for diffing/updating.
- The outputs can't be objectively verified (e.g. by verifying well known properties). Examples:
  - Verifying that sorting works can be done by verifying that the output is sorted. SHOULD NOT use
    Golden Data tests.
  - Verifying that pretty printing works MAY use Golden Data tests to verify the output, as there
    might not be well known properties, or those properties can easily change.
- As stability/versioning/regression testing. Golden Data tests, by storing recorded outputs, are
  a good candidate for preserving behavior of legacy versions or detecting undesired changes in
  behavior, even in cases when the new behavior meets other correctness criteria.

# Best practices for working with Golden Data tests

- Tests MUST produce text output that is diffable and can be inspected in the pull request.

- Tests MUST produce an output that is deterministic and repeatable, including when running on
  different platforms. Same as with ASSERT_EQ.
- Tests SHOULD produce an output that changes incrementally in response to incremental test or
  code changes.

- Multiple test variations MAY be bundled into a single test. Recommended when testing the same feature
  with different inputs. This helps reviewing the outputs by grouping similar tests together, and also
  reduces the number of output files.

- Changes to test fixture or test code that affect a non-trivial amount of test outputs MUST BE done in
  a separate pull request from production code changes:

  - Pull requests with test-code-only changes can be easily reviewed, even if a large number of test
    outputs are modified. While such changes can still introduce merge conflicts, they don't introduce
    risk of regression (if the outputs were valid).
  - Pull requests with mixed production and test changes are much harder to review.

- Tests in the same suite SHOULD share the fixtures when appropriate. This reduces the cost of adding
  new tests to the suite. Changes to the fixture may only affect expected outputs from that fixture,
  and those outputs can be updated in bulk.

- Tests in different suites SHOULD NOT reuse/share fixtures. Changes to the fixture can affect a large
  number of expected outputs.
  There are exceptions to that rule, and tests in different suites MAY reuse/share fixtures if:

  - The test fixture is considered stable and changes rarely.
  - The test suites are related, either by sharing tests, or by testing similar components.
  - Setup/teardown costs are excessive, and sharing the same instance of a fixture for performance
    reasons can't be avoided.

- Tests SHOULD print both inputs and outputs of the tested code. This makes it easy for reviewers to
  verify that the expected outputs are indeed correct by having both input and output next to each
  other.
  Otherwise finding the input used to produce the new output may not be practical, and it might not even
  be included in the diff.

- When resolving merge conflicts on the expected output files, one of the approaches below SHOULD be
  used:

  - "Accept theirs", rerun the tests and verify the new outputs. This doesn't require knowledge of
    production/test code changes in the "theirs" branch, but requires re-review and re-acceptance of
    changes done by the local branch.
  - "Accept yours", rerun the tests and verify the new outputs. This approach requires knowledge of
    production/test code changes in the "theirs" branch. However, if such changes resulted in
    straightforward and repetitive output changes, like due to a printing code change or fixture change,
    it may be easier to verify than reinspecting local changes.

- Expected test outputs SHOULD be reused across tightly-coupled test suites. The suites are
  tightly-coupled if they:

  - Share the same tests, inputs and fixtures.
  - Test similar scenarios.
  - Test different code paths, but changes to one of the code paths are expected to be accompanied by
    changes to the other code paths as well.

  Tests SHOULD use different test files for legitimate and expected output differences between
  those suites.

  Examples:

  - Functional tests, integration tests and unit tests that test the same behavior in different
    environments.
  - Versioned tests, where the expected behavior is the same for the majority of test inputs/scenarios.

- AVOID manually modifying expected output files. Those files are considered to be auto generated.
  Instead, run the tests and then copy the generated output as a new expected output file. See the "How to
  diff and accept new test outputs" section for instructions.

# How to write Golden Data tests?

Each golden data test should produce a text output that will be later verified. The output format
must be text, but otherwise the test author can choose the most appropriate output format (text, json,
bson, yaml or mixed). If a test consists of multiple variations, each variation should be clearly
separated from the others.

Note: Test output is usually only written. It is ok to focus on just writing serialization/printing
code without a need to provide deserialization/parsing code.

When the actual test output is different from the expected output, the test framework will fail the
test, log both outputs and also create the following files, which can be inspected later:

- <output_path>/actual/<test_path> - with the actual test output
- <output_path>/expected/<test_path> - with the expected test output

## CPP tests

`::mongo::unittest::GoldenTestConfig` - Provides a way to configure test suite(s). Defines where the
expected output files are located in the source repo.

`::mongo::unittest::GoldenTestContext` - Provides an output stream where tests should write their
outputs. Verifies the output against the expected output that is in the source repo.

See: [golden_test.h](../src/mongo/unittest/golden_test.h)

**Example:**

```c++
#include "mongo/unittest/golden_test.h"

...

TEST_F(MySuiteFixture, MyFeatureBTest) {
    ...
}
```

Also see the self-test:
[golden_test_test.cpp](../src/mongo/unittest/golden_test_test.cpp)

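Since the checked-in example above is abbreviated, here is a rough, self-contained sketch of the
shape such a test takes. The config path, suite name, and printed text are made up, and the exact
constructor signatures should be checked against `golden_test.h`:

```cpp
#include <ostream>

#include "mongo/unittest/golden_test.h"
#include "mongo/unittest/unittest.h"

namespace mongo::unittest {

// Hypothetical location of the checked-in expected outputs for this suite.
GoldenTestConfig myGoldenConfig{"src/mongo/my_component/expected_output"};

TEST(MyGoldenSuite, PrettyPrintsAValue) {
    GoldenTestContext ctx(&myGoldenConfig);
    // Print both the input and the output, so reviewers can check the
    // expected file against the input that produced it.
    ctx.outStream() << "input: 2 + 2" << std::endl;
    ctx.outStream() << "output: 4" << std::endl;
    // The context compares what was written against the checked-in expected
    // output and fails the test if they differ.
}

}  // namespace mongo::unittest
```
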
# How to diff and accept new test outputs on a workstation

Use the buildscripts/golden_test.py command line tool to manage the test outputs. This includes:

- diffing all output differences of all tests in a given test run output.
- accepting all output differences of all tests in a given test run output.

## Setup

buildscripts/golden_test.py requires a one-time workstation setup.

Note: this setup is only required to use buildscripts/golden_test.py itself. It is NOT required to
just run the Golden Data tests when not using buildscripts/golden_test.py.

1. Create a yaml config file, as described by [Appendix - Config file reference](#appendix---config-file-reference).
2. Set the GOLDEN_TEST_CONFIG_PATH environment variable to the config file location, so that it is
   available when running tests and when running the buildscripts/golden_test.py tool.

### Automatic Setup

Use the buildscripts/golden_test.py builtin setup to initialize the default config for your workstation.

**Instructions for Linux**

Run the buildscripts/golden_test.py setup utility:

```bash
buildscripts/golden_test.py setup
```

**Instructions for Windows**

Run the buildscripts/golden_test.py setup utility.
You may be asked for a password when not running in a "Run as administrator" shell.

```cmd
c:\python\python310\python.exe buildscripts/golden_test.py setup
```

### Manual Setup

This is the same config that would be set up by the [Automatic Setup](#automatic-setup).

This config uses a unique subfolder for each test run. (default)

- Allows diffing each test run separately.
- Works with multiple source repos.

**Instructions for Linux/macOS:**

This config uses a unique subfolder for each test run. (default)

- Allows diffing each test run separately.
- Works with multiple source repos.

Create ~/.golden_test_config.yml with the following contents:

```yaml
outputRootPattern: /var/tmp/test_output/out-%%%%-%%%%-%%%%-%%%%
diffCmd: git diff --no-index "{{expected}}" "{{actual}}"
```

Update .bashrc, .zshrc:

```bash
export GOLDEN_TEST_CONFIG_PATH=~/.golden_test_config.yml
```

Alternatively, modify /etc/environment or other configuration if needed by a debugger/IDE, etc.

**Instructions for Windows:**

Create %LocalAppData%\.golden_test_config.yml with the following contents:

```yaml
outputRootPattern: 'C:\Users\Administrator\AppData\Local\Temp\test_output\out-%%%%-%%%%-%%%%-%%%%'
diffCmd: 'git diff --no-index "{{expected}}" "{{actual}}"'
```

Add the GOLDEN_TEST_CONFIG_PATH environment variable:

```cmd
runas /profile /user:administrator "setx GOLDEN_TEST_CONFIG_PATH %LocalAppData%\.golden_test_config.yml"
```

## Usage

### List all available test outputs

```bash
$> buildscripts/golden_test.py list
```

### Diff test results from the most recent test run

```bash
$> buildscripts/golden_test.py diff
```

This will run the diffCmd that was specified in the config file.

### Accept test results from the most recent test run

```bash
$> buildscripts/golden_test.py accept
```

This will copy all actual test outputs from that test run to the source repo as the new expected
outputs.

### Get paths from most recent test run (to be used by custom tools)

Get expected and actual output paths for the most recent test run:

```bash
$> buildscripts/golden_test.py get
```

Get the root output path for the most recent test run:

```bash
$> buildscripts/golden_test.py get_root
```

Get all available commands and options:

```bash
$> buildscripts/golden_test.py --help
```

# How to diff test results from a non-workstation test run

## Bulk folder diff the results:

1. Parse the test log to find the root output locations where expected and actual output files were
   written.
2. Then compare the folders to see the differences for tests that failed.

**Example: (linux/macOS)**

```bash
$> diff -ruN --unidirectional-new-file --color=always <expected_root> <actual_root>
```

## Find the outputs of tests that failed.

Parse the logs and find the expected and actual outputs for each failed test.

**Example: (linux/macOS)**

```bash
# Find all expected and actual outputs of tests that have failed
$> cat test.log | grep "^{" | jq -s '.[] | select(.id == 6273501 ) | .attr.testPath,.attr.expectedOutput,.attr.actualOutput'
```

# Appendix - Config file reference

The Golden Data test config file is a YAML file specified as:

```yaml
outputRootPattern:
  type: String
  optional: true
  description:
    Root path pattern that will be used to write expected and actual test outputs for all tests
    in the test run.
    If not specified, a temporary folder location will be used.
    The path pattern string may use '%' characters in the last part of the path. '%' characters in
    the last part of the path will be replaced with random lowercase hexadecimal digits.
  examples:
    /var/tmp/test_output/out-%%%%-%%%%-%%%%-%%%%
    /var/tmp/test_output

diffCmd:
  type: String
  optional: true
  description:
    Shell command to diff a single golden test run output.
    {{expected}} and {{actual}} variables should be used and will be replaced with the expected and
    actual output folder paths respectively.
    This property is not used to decide whether the test passes or fails; it is only used to
    display differences once we've decided that a test failed.
  examples:
    git diff --no-index "{{expected}}" "{{actual}}"
    diff -ruN --unidirectional-new-file --color=always "{{expected}}" "{{actual}}"
```

LibFuzzer implements `int main`, and expects to be linked with an object
file which provides the function under test. You will achieve this by
writing a cpp file which implements

```cpp
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
    // Your code here
}
```

Just make sure your function crashes or triggers an invariant failure when something interesting
happens! As just a few ideas:

- You might choose to call multiple implementations of a single
  operation, and validate that they produce the same output when
  presented the same input.
- You could tease out individual bytes from `Data` and provide them as
  different arguments to the function under test (see the sketch below).

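Here is a sketch of the byte-teasing idea from the list above. `myParse` is a stand-in for whatever
function is being fuzzed; everything apart from the standard `LLVMFuzzerTestOneInput` entry point is
hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical function under test.
void myParse(int mode, const uint8_t* buf, size_t len);

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* Data, size_t Size) {
    if (Size < 1) {
        return 0;  // Not enough bytes to pick the arguments; skip this input.
    }
    // Use the first byte to choose a "mode" argument, and feed the rest of the
    // buffer to the function under test.
    const int mode = Data[0] % 4;
    myParse(mode, Data + 1, Size - 1);
    return 0;
}
```
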
Finally, your cpp file will need a SCons target. There is a method which
defines fuzzer targets, much like how we define unittests. For example:

```python
env.CppLibfuzzerTest(
    target='op_msg_fuzzer',
    source=[
        ...
    ],
)
```

# References

- [LibFuzzer's official
  documentation](https://llvm.org/docs/LibFuzzer.html)

## C++ Linters

### `clang-format`

The `buildscripts/clang_format.py` wrapper script runs the `clang-format` linter. You can see the
usage message for the wrapper by running `buildscripts/clang_format.py --help`.

Ex: `buildscripts/clang_format.py lint`

| Linter         | Configuration File(s) | Help Command          | Documentation                                                                                 |
| -------------- | --------------------- | --------------------- | --------------------------------------------------------------------------------------------- |
| `clang-format` | `.clang-format`       | `clang-format --help` | [https://clang.llvm.org/docs/ClangFormat.html](https://clang.llvm.org/docs/ClangFormat.html)   |

### `clang-tidy`

The `evergreen/run_clang_tidy.sh` shell script runs the `clang-tidy` linter. In order to run
`clang-tidy` you must have a compilation database (`compile_commands.json` file).

Ex: `bash buildscripts/run_clang_tidy.sh`

| Linter       | Configuration File(s) | Help Command        | Documentation                                                                                             |
| ------------ | --------------------- | ------------------- | --------------------------------------------------------------------------------------------------------- |
| `clang-tidy` | `.clang-tidy`         | `clang-tidy --help` | [https://clang.llvm.org/extra/clang-tidy/index.html](https://clang.llvm.org/extra/clang-tidy/index.html)   |

### `errorcodes.py`

The `buildscripts/errorcodes.py` script runs a custom error code linter, which verifies that all
assertion codes are distinct. You can see the usage by running the following command:
`buildscripts/errorcodes.py --help`.

Ex: `buildscripts/errorcodes.py`

### `quickmongolint.py`

The `buildscripts/quickmongolint.py` script runs a simple MongoDB C++ linter. You can see the usage
by running the following command: `buildscripts/quickmongolint.py --help`. You can take a look at
`buildscripts/linter/mongolint.py` to better understand the rules for this linter.

Ex: `buildscripts/quickmongolint.py lint`

## Javascript Linters

The `buildscripts/eslint.py` wrapper script runs the `eslint` javascript linter. You can see the
usage message for the wrapper by running `buildscripts/eslint.py --help`.

Ex: `buildscripts/eslint.py lint`

| Linter   | Configuration File(s)           | Help Command    | Documentation                              |
| -------- | ------------------------------- | --------------- | ------------------------------------------ |
| `eslint` | `.eslintrc.yml` `.eslintignore` | `eslint --help` | [https://eslint.org/](https://eslint.org/) |

## Yaml Linters

The `buildscripts/yamllinters.sh` shell script runs the yaml linters. The supported yaml linters
are: `yamllint` & `evergreen-lint`. `evergreen-lint` is a custom MongoDB linter used specifically
for `evergreen` yaml files.

Ex: `bash buildscripts/yamllinters.sh`

| Linter           | Configuration File(s)     | Help Command                      | Documentation                                                                                   |
| ---------------- | ------------------------- | --------------------------------- | ------------------------------------------------------------------------------------------------ |
| `yamllint`       | `etc/yamllint_config.yml` | `yamllint --help`                 | [https://readthedocs.org/projects/yamllint/](https://readthedocs.org/projects/yamllint/)          |
| `evergreen-lint` | `etc/evergreen_lint.yml`  | `python -m evergreen_lint --help` | [https://github.com/evergreen-ci/config-linter](https://github.com/evergreen-ci/config-linter)    |

## Python Linters

The `buildscripts/pylinters.py` wrapper script runs the Python linters. You can
see the usage message for the wrapper by running the following command:
`buildscripts/pylinters.py --help`. The following linters are supported: `pylint`, `mypy`,
`pydocstyle` & `yapf`.

Ex: `buildscripts/pylinters.py lint`

| Linter       | Configuration File(s) | Help Command        | Documentation                                                                                 |
| ------------ | --------------------- | ------------------- | ----------------------------------------------------------------------------------------------- |
| `pylint`     | `.pylintrc`           | `pylint --help`     | [https://www.pylint.org/](https://www.pylint.org/)                                               |
| `mypy`       | `.mypy.ini`           | `mypy --help`       | [https://readthedocs.org/projects/mypy/](https://readthedocs.org/projects/mypy/)                 |
| `pydocstyle` | `.pydocstyle`         | `pydocstyle --help` | [https://readthedocs.org/projects/pydocstyle/](https://readthedocs.org/projects/pydocstyle/)     |
| `yapf`       | `.style.yapf`         | `yapf --help`       | [https://github.com/google/yapf](https://github.com/google/yapf)                                 |

### SCons Linters

`buildscripts/pylinters.py` has the `lint-scons` and `fix-scons` commands to lint
and fix SCons and build system related code. Currently `yapf` is the only
linter supported for SCons code.

## Using SCons for linting

You can use SCons to run most of the linters listed above via their corresponding Python wrapper
script. SCons also provides the ability to run multiple linters in a single command. At this time,
SCons does not support `clang-tidy` or `buildscripts/yamllinters.sh`.

Here are some examples:

| SCons Target        | Linter(s)                                                                                        | Example                                   |
| ------------------- | ------------------------------------------------------------------------------------------------ | ----------------------------------------- |
| `lint`              | `clang-format` `errorcodes.py` `quickmongolint.py` `eslint` `pylint` `mypy` `pydocstyle` `yapf`   | `buildscripts/scons.py lint`              |
| `lint-fast`         | `clang-format` `errorcodes.py` `eslint` `pylint` `mypy` `pydocstyle` `yapf`                       | `buildscripts/scons.py lint-fast`         |
| `lint-clang-format` | `clang-format`                                                                                     | `buildscripts/scons.py lint-clang-format` |
| `lint-errorcodes`   | `errorcodes.py`                                                                                    | `buildscripts/scons.py lint-errorcodes`   |
| `lint-lint.py`      | `quickmongolint.py`                                                                                | `buildscripts/scons.py lint-lint.py`      |
| `lint-eslint`       | `eslint`                                                                                           | `buildscripts/scons.py lint-eslint`       |
| `lint-pylinters`    | `pylint` `mypy` `pydocstyle` `yapf`                                                                | `buildscripts/scons.py lint-pylinters`    |
| `lint-sconslinters` | `yapf`                                                                                             | `buildscripts/scons.py lint-sconslinters` |


---

endpoints behind load balancers requires proper configuration of the load balancers, `mongos`, and
any drivers or shells used to connect to the database. Three conditions must be fulfilled for
`mongos` to be used behind a load balancer:

- `mongos` must be configured with the [MongoDB Server Parameter](https://docs.mongodb.com/manual/reference/parameters/) `loadBalancerPort` whose value can be specified at program start in any of the ways mentioned in the server parameter documentation.
  This option causes `mongos` to open a second port that expects _only_ load balanced connections. All connections made from load
  balancers _must_ be made over this port, and no regular connections may be made over this port.
- The L4 load balancer _must_ be configured to emit a [proxy protocol][proxy-protocol-url] header
  at the [start of its connection stream](https://github.com/mongodb/mongo/commit/3a18d295d22b377cc7bc4c97bd3b6884d065bb85). `mongos` [supports](https://github.com/mongodb/mongo/commit/786482da93c3e5e58b1c690cb060f00c60864f69) both version 1 and version 2 of the proxy
  protocol standard.
- The connection string used to establish the `mongos` connection must set the `loadBalanced` option;
  e.g., when connecting to a local `mongos` instance, if the `loadBalancerPort` server parameter was set to 20100, the
  connection string must be of the form `"mongodb://localhost:20100/?loadBalanced=true"` (see the sketch below).

`mongos` will emit appropriate error messages on connection attempts if these requirements are not
met.
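
For illustration only, a hypothetical local setup might look like the following; the `--configdb`
string and port numbers are placeholders, not values from the official documentation:

```bash
# Hypothetical example: expose a second, load-balanced-only port on mongos.
$ ./mongos --configdb cfgRS/localhost:27019 --setParameter loadBalancerPort=20100

# Clients then connect *through the L4 load balancer* (which must emit proxy
# protocol headers) using a connection string that opts in to load balancing:
#   mongodb://<load balancer host>:<port>/?loadBalanced=true
```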

---

`docs/logging.md`

The `msg` field predicates a reader's interpretation of the log line. It should
be crafted with care and attention.

- Concisely describe what the log line is reporting, providing enough
  context necessary for interpreting attribute field names and values
- Capitalize the first letter, as in a sentence
- Avoid unnecessary punctuation, but punctuate between sentences if using
  multiple sentences
- Do not conclude with punctuation
- You may occasionally encounter `msg` strings containing fmt-style
  `{expr}` braces. These are legacy artifacts and should be rephrased
  according to these guidelines.
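
As a quick illustration of these guidelines (the id, message, and `sessionId` variable below are invented for the example):

```cpp
// Good: concise, capitalized, no trailing punctuation; context comes from the attribute.
LOGV2(1234500, "Evicting session from cache", "sessionId"_attr = sessionId);

// Avoid: legacy fmt-style braces in the message and trailing punctuation.
// LOGV2(1234501, "Evicting session {sessionId} from cache.", "sessionId"_attr = sessionId);
```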
### Attributes (fields in the attr subdocument)

The bar for understanding should be:

- Someone with reasonable understanding of mongod behavior should understand
  immediately what is being logged
- Someone with reasonable troubleshooting skill should be able to extract doc-
  or code-searchable phrases to learn about what is being logged

#### Precisely describe values and units

Alternatively, specify an `attr` name of “durationMillis” and provide the
number of milliseconds as an integer type.

**Importantly**: downstream analysis tools will rely on this convention, as a
replacement for the "[0-9]+ms$" format of prior logs.

#### Use certain specific terms whenever possible

When logging the below information, do so with these specific terms:

- **namespace** - when logging a value of the form
  "\<db name\>.\<collection name\>". Do not use "collection" or abbreviate to "ns"
- **db** - instead of "database"
- **error** - when an error occurs, instead of "status". Use this for objects
  of type Status and DBException
- **reason** - to provide rationale for an event/action when "error" isn't
  appropriate
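
For instance, a hypothetical log line that follows these naming conventions (the id, `nss`, and `status` variables are invented for illustration):

```cpp
// Prefer "namespace" and "error" over "collection"/"ns" and "status".
LOGV2(1234502,
      "Failed to refresh collection metadata",
      "namespace"_attr = nss,     // e.g. "test.users"
      "error"_attr = status);     // a Status/DBException object
```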

### Examples

- Example 1:

      LOGV2(1041, "Transition to PRIMARY complete");

  JSON Output:

      { ... , "id": 1041, "msg": "Transition to PRIMARY complete", "attr": {} }

- Example 2:

      LOGV2(1042, "Slow query", "duration"_attr = getDurationMillis());

  JSON Output:

      { ..., "id": 1042, "msg": "Slow query", "attr": { "durationMillis": 1000 } }

- For adding STL containers as dynamic attributes, see
  [RollbackImpl::_summarizeRollback][_summarizeRollback]

- For sharing a string between a log line and a status see [this section of
  InitialSyncer::_lastOplogEntryFetcherCallbackForStopTimestamp][_lastOplogEntryFetcherCallbackForStopTimestamp]

# Basic Usage

The macro `MONGO_LOGV2_DEFAULT_COMPONENT` is expanded by all logging macros.
This configuration macro must expand at their point of use to a `LogComponent`
expression, which is implicitly attached to the emitted message. It is
conventionally defined near the top of a `.cpp` file after headers are included,
and before any logging macros are invoked. Example:
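
The example itself is elided from this diff; as an illustration, a `.cpp` file in, say, the replication subsystem would typically contain a definition along these lines (component name chosen for illustration):

```cpp
#define MONGO_LOGV2_DEFAULT_COMPONENT ::mongo::logv2::LogComponent::kReplication
```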

##### Examples

- No attributes.

      LOGV2(1000, "Logging event, no replacement fields is OK");

- Some attributes.

      LOGV2(1002,
            "Replication state change",
            "from"_attr = getOldState(),
            "to"_attr = getNewState());

### Log Component

to different severities there are separate logging macros to be used; they all
take parameters like `LOGV2`:

- `LOGV2_WARNING`
- `LOGV2_ERROR`
- `LOGV2_FATAL`
- `LOGV2_FATAL_NOTRACE`
- `LOGV2_FATAL_CONTINUE`

There are also variations that take `LogOptions` if needed:

- `LOGV2_WARNING_OPTIONS`
- `LOGV2_ERROR_OPTIONS`
- `LOGV2_FATAL_OPTIONS`
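
These are used exactly like `LOGV2`; for example (the id and the `bytesUsed`/`bytesQuota` attributes are invented for illustration):

```cpp
LOGV2_WARNING(1234503,
              "Approaching storage quota",
              "bytesUsed"_attr = bytesUsed,
              "bytesQuota"_attr = bytesQuota);
```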

Fatal level log statements using `LOGV2_FATAL` perform `fassert` after logging,
using the provided ID as assert id. `LOGV2_FATAL_NOTRACE` perform

`_attr` literals with the `DynamicAttributes` is not supported.

When using the `DynamicAttributes` you need to be careful about parameter
lifetimes. The `DynamicAttributes` binds attributes _by reference_ and the
reference must be valid when passing the `DynamicAttributes` to the log
statement.

Many basic types have built-in support:

- Boolean
- Integral types
  - Single `char` is logged as integer
- Enums
  - Logged as their underlying integral type
- Floating point types
  - `long double` is prohibited
- String types
  - `std::string`
  - `StringData`
  - `const char*`
- Duration types
  - Special formatting, see below
- BSON types
  - `BSONObj`
  - `BSONArray`
  - `BSONElement`
- BSON appendable types
  - `BSONObjBuilder::append` overload available
- `boost::optional<T>` of any loggable type `T`

### User-defined types

The system binds and uses serialization functions by looking for functions in
the following priority order:

- Structured serialization functions
  - `void x.serialize(BSONObjBuilder*) const` (member)
  - `BSONObj x.toBSON() const` (member)
  - `BSONArray x.toBSONArray() const` (member)
  - `toBSON(x)` (non-member)
- Stringification functions
  - `toStringForLogging(x)` (non-member)
  - `x.serialize(&fmtMemoryBuffer)` (member)
  - `x.toString()` (member)
  - `toString(x)` (non-member)
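
As a sketch of the first case, a type with a structured serialization member can be logged directly as an attribute (the type, fields, id, and variable name below are invented for illustration):

```cpp
// Hypothetical user-defined type made loggable via a structured serializer.
class MigrationProgress {
public:
    void serialize(BSONObjBuilder* bob) const {
        bob->append("chunksMoved", _chunksMoved);
        bob->append("bytesCopied", _bytesCopied);
    }

private:
    long long _chunksMoved = 0;
    long long _bytesCopied = 0;
};

// The log system binds serialize(BSONObjBuilder*) const and emits a BSON subobject.
LOGV2(1234504, "Migration progress", "progress"_attr = migrationProgress);
```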

Enums cannot have member functions, but they will still try to bind to the
`toStringForLogging(e)` or `toString(e)` non-members. If neither is available,

`toString` (member or nonmember) is a sufficient customization point and should
be preferred as a canonical stringification of the object.

_NOTE: No `operator<<` overload is used even if available_

##### Example

Ranges are loggable via helpers to indicate what type of range it is:

- `seqLog(begin, end)`
- `mapLog(begin, end)`

seqLog indicates that it is a sequential range where the iterators point to
loggable values directly.

##### Examples

- Logging a sequence:

      std::array<int, 20> arrayOfInts = ...;
      LOGV2(1010,
            "Log container directly",
            "values"_attr = arrayOfInts);
      LOGV2(1011,
            "Log iterator range",
            "values"_attr = seqLog(arrayOfInts.begin(), arrayOfInts.end()));
      LOGV2(1012,
            "Log first five elements",
            "values"_attr = seqLog(arrayOfInts.data(), arrayOfInts.data() + 5));

- Logging a map-like container:

      StringMap<BSONObj> bsonMap = ...;
      LOGV2(1013,
            "Log map directly",
            "values"_attr = bsonMap);
      LOGV2(1014,
            "Log map iterator range",
            "values"_attr = mapLog(bsonMap.begin(), bsonMap.end()));

#### Containers and `uint64_t`

    auto asDecimal128 = [](uint64_t i) { return Decimal128(i); };
    auto asString = [](uint64_t i) { return std::to_string(i); };

### Duration types

Duration types have special formatting to match existing practices in the

##### Examples

- "duration" attribute

  C++ expression:

      "duration"_attr = Milliseconds(10)

  JSON format:

      "durationMillis": 10

- Container of Duration objects

  C++ expression:

      "samples"_attr = std::vector<Nanoseconds>{Nanoseconds(200),
                                                Nanoseconds(400)}

  JSON format:

      "samples": [{"durationNanos": 200}, {"durationNanos": 400}]

# Attribute naming abstraction

    LOGV2(2002, "Statement", logAttrs(t));
    LOGV2(2002, "Statement", "name"_attr=t.name, "data"_attr=t.data);

## Handling temporary lifetime with multiple attributes

To avoid lifetime issues (log attributes bind their values by reference) it is

fields filled in format-string). If replacement fields are not provided in the
message string, attribute values will be missing from the assertion message.

##### Examples

    LOGV2_ERROR_OPTIONS(1050000,

Would emit a `uassert` after performing the log that is equivalent to:

    uasserted(ErrorCodes::DataCorruptionDetected,
              "Data corruption detected for RecordId(123456)");

## Unstructured logging for local development

To make it easier to use the log system for tracing in local development, there

    }
    }

---

[relaxed_json_2]: https://github.com/mongodb/specifications/blob/master/source/extended-json.rst
[_lastOplogEntryFetcherCallbackForStopTimestamp]: https://github.com/mongodb/mongo/blob/13caf3c499a22c2274bd533043eb7e06e6f8e8a4/src/mongo/db/repl/initial_syncer.cpp#L1500-L1512
[_summarizeRollback]: https://github.com/mongodb/mongo/blob/13caf3c499a22c2274bd533043eb7e06e6f8e8a4/src/mongo/db/repl/rollback_impl.cpp#L1263-L1305

---

```
addr2line -e mongod -ifC <offset>
```

## `c++filt`

Use [`c++filt`][2] to demangle function names by pasting the whole stack trace to stdin.
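
For instance, assuming the mangled trace has been saved to a file (the file name is illustrative):

```bash
$ c++filt < mangled_backtrace.txt
```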

## Finding the Right Binary

To find the correct binary for a specific log you need to:

# Poetry Project Execution

## Project Impetus

We frequently encounter Python errors caused by the author of a Python dependency publishing an update that breaks backward compatibility. The following tickets are a few examples of this happening:
[SERVER-79126](https://jira.mongodb.org/browse/SERVER-79126), [SERVER-79798](https://jira.mongodb.org/browse/SERVER-79798), [SERVER-53348](https://jira.mongodb.org/browse/SERVER-53348), [SERVER-57036](https://jira.mongodb.org/browse/SERVER-57036), [SERVER-44579](https://jira.mongodb.org/browse/SERVER-44579), [SERVER-70845](https://jira.mongodb.org/browse/SERVER-70845), [SERVER-63974](https://jira.mongodb.org/browse/SERVER-63974), [SERVER-61791](https://jira.mongodb.org/browse/SERVER-61791), and [SERVER-60950](https://jira.mongodb.org/browse/SERVER-60950). We have always known this was a problem and that there was a way to fix it; we finally had the bandwidth to tackle it.

## Project Prework

First, we wanted to test out using poetry, so we converted the mongo-container project to use poetry in [SERVER-76974](https://jira.mongodb.org/browse/SERVER-76974). This showed promise, and we considered it a green light to move forward on converting the server Python to use poetry.

Before we could start the project we had to upgrade Python to a version that was not EoL. This work is captured in [SERVER-72262](https://jira.mongodb.org/browse/SERVER-72262). We upgraded Python to 3.10 on every system except Windows. Windows could not be upgraded due to a test problem relating to some cipher suites [SERVER-79172](https://jira.mongodb.org/browse/SERVER-79172).

## Conversion to Poetry

After the prework was done we wrote, tested, and merged [SERVER-76751](https://jira.mongodb.org/browse/SERVER-76751), which converted the mongo Python dependencies to poetry. This ticket had an absurd number of dependencies and required a significant number of patch builds. The total number of changes was fairly small, but it affected a lot of different projects.

Knowing how much this touched, we expected to see some bugs and were quick to try to fix them. Some of these were caught before merge and some were caught after.

---

### PrimaryOnlyServiceRegistry

The PrimaryOnlyServiceRegistry is a singleton that is installed as a decoration on the
ServiceContext at startup and lives for the lifetime of the mongod process. During mongod global
startup, all PrimaryOnlyServices must be registered against the PrimaryOnlyServiceRegistry before
the ReplicationCoordinator is started up (as it is the ReplicationCoordinator startup that starts up
the registered PrimaryOnlyServices). Specific PrimaryOnlyServices can be looked up from the registry
at runtime, and are handed out by raw pointer, which is safe since the set of registered
PrimaryOnlyServices does not change during runtime. The PrimaryOnlyServiceRegistry is itself a
[ReplicaSetAwareService](../src/mongo/db/repl/README.md#ReplicaSetAwareService-interface), which is
how it receives notifications about changes in and out of Primary state.

### PrimaryOnlyService

The PrimaryOnlyService interface is used to define a new Primary Only Service. A PrimaryOnlyService
is a grouping of tasks (Instances) that run only when the node is Primary and are resumed after
failover. Each PrimaryOnlyService must declare a unique, replicated collection (most likely in the
admin or config databases), where the state documents for all Instances of the service will be
persisted. At stepUp, each PrimaryOnlyService will create and launch Instance objects for each
document found in this collection. This is how PrimaryOnlyService tasks get resumed after failover.

### PrimaryOnlyService::Instance/TypedInstance

The PrimaryOnlyService::Instance interface is used to contain the state and core logic for running a
single task belonging to a PrimaryOnlyService. The Instance interface includes a run()
method which is provided an executor which is used to run all work that is done on behalf of the
Instance. Implementations should not extend PrimaryOnlyService::Instance directly, instead they
should extend PrimaryOnlyService::TypedInstance, which allows individual Instances to be looked up
and returned as pointers to the proper Instance sub-type. The InstanceID for an Instance is the _id
field of its state document.

## Defining a new PrimaryOnlyService

To define a new PrimaryOnlyService one must add corresponding subclasses of both PrimaryOnlyService
and PrimaryOnlyService::TypedInstance. The PrimaryOnlyService subclass just exists to specify what
collection state documents for this service are stored in, and to hand out corresponding Instances
of the proper type. Most of the work of a new PrimaryOnlyService will be implemented in the
PrimaryOnlyService::Instance subclass. PrimaryOnlyService::Instance subclasses will be responsible
for running the work they need to perform to complete their task, as well as for managing and
synchronizing their own in-memory and on-disk state. No part of the PrimaryOnlyService **machinery**
ever performs writes to the PrimaryOnlyService state document collections. All writes to a given
Instance's state document (including creating it initially and deleting it when the work has been
completed) are performed by Instance implementations. This means that for the majority of
PrimaryOnlyServices, the first step of its Instance's run() method will be to insert an initial
state document into the state document collection, to ensure that the Instance is now persisted and
will be resumed after failover. When an Instance is resumed after failover, it is provided the
current version of the state document as it exists in the state document collection. That document
can be used to rebuild the in-memory state for this Instance so that when run() is called it knows
what state it is in and thus what work still needs to be performed, and what work has already been
completed by the previous Primary.

To see an example bare-bones PrimaryOnlyService implementation to use as a reference, check out the
TestService defined in this unit test: https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/primary_only_service_test.cpp

## Behavior during state transitions

At stepUp, each PrimaryOnlyService queries its state document collection, and for each document

### Interrupting Instances at stepDown

At stepDown, there are 3 main ways that Instances are interrupted and we guarantee that no more work
is performed on behalf of any PrimaryOnlyServices. The first is that the executor provided to each
Instance's run() method gets shut down, preventing any more work from being scheduled on behalf of
that Instance. The second is that all OperationContexts created on threads (Clients) that are part
of an Executor owned by a PrimaryOnlyService get interrupted. The third is that each individual
Instance is explicitly interrupted, so that it can unblock any work running on threads that are
_not_ a part of an executor owned by the PrimaryOnlyService that are dependent on that Instance
signaling them (e.g. commands that are waiting on the Instance to reach a certain state). Currently
this happens via a call to an interrupt() method that each Instance must override, but in the future
this is likely to change to signaling a CancellationToken owned by the Instance instead.

## Instance lifetime

Instances are held by shared_ptr in their parent PrimaryOnlyService. Each PrimaryOnlyService
releases all Instance shared_ptrs it owns on stepDown. Additionally, a PrimaryOnlyService will
release an Instance shared_ptr when the state document for that Instance is deleted (via an
OpObserver). Since generally speaking it is logic from an Instance's run() method that will be
responsible for deleting its state document, such logic needs to be careful as the moment the state
document is deleted, the corresponding PrimaryOnlyService is no longer keeping that Instance alive.
If an Instance has any additional logic or internal state to update after deleting its state

# Links to Security Architecture Guide

- [Identity and Access Management](https://github.com/mongodb/mongo/blob/master/src/mongo/db/auth/README.md)
- [TLS](https://github.com/mongodb/mongo/blob/master/src/mongo/util/net/README.md)
- [FTDC](https://github.com/mongodb/mongo/blob/master/src/mongo/db/ftdc/README.md)
- [LibFuzzer](https://github.com/mongodb/mongo/blob/master/docs/libfuzzer.md)
- [SELinux](https://github.com/mongodb/mongodb-selinux/blob/master/README.md)

# Server Parameters

Mongo database and router servers (i.e., `mongod` and `mongos`) provide a number of configuration
options through server parameters. These parameters allow users to configure the behavior of the
server at startup or runtime. For instance, `logLevel` is a server parameter that configures the
logging verbosity.

## How to define new parameters

Parameters are defined by the elements of the `server_parameters` section of an IDL file. The IDL
machinery will parse these files and generate C++ code, and corresponding header files where
appropriate. The generated code will self-register server parameters with the runtime.

Consider the `logLevel` parameter from [`parameters.idl`][parameters.idl] for example:

```yaml
...
server_parameters:
    ...
```

This defines a server parameter called `logLevel`, which is settable at startup or at runtime, and
declares a C++ class for the parameter (i.e., `LogLevelServerParameter`). Refer to the
[Server Parameters Syntax](#server-parameters-syntax) documentation for the complete IDL syntax.

## How to change a defined parameter

Users can set or modify a server parameter at startup and/or runtime, depending on the value
specified for `set_at`. For instance, `logLevel` may be set at both startup and runtime, as
indicated by `set_at` (see the above code snippet).

At startup, server parameters may be set using the `--setParameter` command line option.
At runtime, the `setParameter` command may be used to modify server parameters.
See the [`setParameter` documentation][set-parameter] for details.
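
For illustration, both forms might look like this (the verbosity values are arbitrary):

```bash
# At startup:
$ ./mongod --setParameter logLevel=1

# At runtime, from the shell:
$ ./mongo --eval 'db.adminCommand({setParameter: 1, logLevel: 2})'
```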

## How to get the value provided for a parameter

Server developers may retrieve the value of a server parameter by:

- Accessing the C++ expression that corresponds to the parameter of interest. For example, reading
  from [`serverGlobalParams.quiet`][quiet-param] returns the current value for `quiet`.
- Registering a callback to be notified about changes to the server parameter (e.g.,
  [`onUpdateFTDCFileSize`][ftdc-file-size-param] for `diagnosticDataCollectionFileSizeMB`).

Database users may use the [`getParameter`][get-parameter] command to query the current value for a
server parameter.
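
For example, querying the `logLevel` parameter from the shell against a local instance:

```bash
$ ./mongo --eval 'db.adminCommand({getParameter: 1, logLevel: 1})'
```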

## Server Parameters Syntax

The following shows the IDL syntax for declaring server parameters. Field types are denoted in each
section. For details regarding `string or expression map`, see that section
[below](#string-or-expression-map).

```yaml
server_parameters:
    "nameOfParameter": # string
        set_at: # string or list of strings
        description: # string
        cpp_vartype: # string
        cpp_varname: # string
        cpp_class: # string (name field) or map
            name: # string
            data: # string
            override_ctor: # bool
            override_set: # bool
            override_validate: # bool
        redact: # bool
        test_only: # bool
        default: # string or expression map
        deprecated_name: # string or list of strings
        on_update: # string
        condition:
            expr: # C++ bool expression, evaluated at run time
            constexpr: # C++ bool expression, evaluated at compilation time
            preprocessor: # C preprocessor condition
            min_fcv: # string
            feature_flag: # string
        validator: # Map containing one or more of the below
            lt: # string or expression map
            gt: # string or expression map
            lte: # string or expression map
            gte: # string or expression map
            callback: # string
```

Each entry in the `server_parameters` map represents one server parameter. The name of the parameter
must be unique across the server instance. More information on the specific fields:

- `set_at` (required): Must contain the value `startup`, `runtime`, [`startup`, `runtime`], or
  `cluster`. If `runtime` is specified along with `cpp_varname`, then `decltype(cpp_varname)` must
  refer to a thread-safe storage type, specifically: `AtomicWord<T>`, `AtomicDouble`, `std::atomic<T>`,
  or `boost::synchronized<T>`. Parameters declared as `cluster` can only be set at runtime and exhibit
  numerous differences. See [Cluster Server Parameters](cluster-server-parameters) below.

- `description` (required): Free-form text field currently used only for commenting the generated C++
  code. Future uses may preserve this value for a possible `{listSetParameters:1}` command or other
  programmatic and potentially user-facing purposes.

- `cpp_vartype`: Declares the full storage type. If `cpp_vartype` is not defined, it may be inferred
  from the C++ variable referenced by `cpp_varname`.

- `cpp_varname`: Declares the underlying variable or C++ `struct` member to use when setting or reading the
  server parameter. If defined together with `cpp_vartype`, the storage will be declared as a global
  variable, and externed in the generated header file. If defined alone, a variable of this name will
  be assumed to have been declared and defined by the implementer, and its type will be automatically
  inferred at compile time. If `cpp_varname` is not defined, then `cpp_class` must be specified.

- `cpp_class`: Declares a custom `ServerParameter` class in the generated header using the provided
  string, or the name field in the associated map. The declared class will require an implementation
  of `setFromString()`, and optionally `set()`, `append()`, and a constructor.
  See [Specialized Server Parameters](#specialized-server-parameters) below.

- `default`: String or expression map representation of the initial value.

- `redact`: Set to `true` to replace values of this setting with placeholders (e.g., for passwords).

- `test_only`: Set to `true` to disable this set parameter if `enableTestCommands` is not specified.

- `deprecated_name`: One or more names which can be used with the specified setting and underlying
  storage. Reading or writing a setting using this name will result in a warning in the server log.

- `on_update`: C++ callback invoked after all validation rules have completed successfully and the
  new value has been stored. Prototype: `Status(const cpp_vartype&);`

- `condition`: Up to five conditional rules for deciding whether or not to apply this server
  parameter. `preprocessor` will be evaluated first, followed by `constexpr`, then finally `expr`. If
  no provided setting evaluates to `false`, the server parameter will be registered. `feature_flag` and
  `min_fcv` are evaluated after the parameter is registered, and instead affect whether the parameter
  is enabled. `min_fcv` is a string of the form `X.Y`, representing the minimum FCV version for which
  this parameter should be enabled. `feature_flag` is the name of a feature flag variable upon which
  this server parameter depends -- if the feature flag is disabled, this parameter will be disabled.
  `feature_flag` should be removed when all other instances of that feature flag are deleted, which
  typically is done after the next LTS version of the server is branched. `min_fcv` should be removed
  after it is no longer possible to downgrade to a FCV lower than that version - this occurs when the
  next LTS version of the server is branched.

- `validator`: Zero or many validation rules to impose on the setting. All specified rules must pass
  to consider the new setting valid. `lt`, `gt`, `lte`, `gte` fields provide for simple numeric limits
  or expression maps which evaluate to numeric values. For all other validation cases, specify
  callback as a C++ function or static method. Note that validation rules (including callback) may run
  in any order. To perform an action after all validation rules have completed, `on_update` should be
  preferred instead. Callback prototype: `Status(const cpp_vartype&, const boost::optional<TenantId>&);`
|
||||

Any symbols such as global variables or callbacks used by a server parameter must be imported using
the usual IDL machinery via `globals.cpp_includes`. Similarly, all generated code will be nested
inside the namespace defined by `globals.cpp_namespace`. Consider the following for example:

```yaml
global:
    cpp_namespace: "mongo"
```
||||
|
||||

### String or Expression Map

The default and implicit fields above, as well as the `gt`, `lt`, `gte`, and `lte` validators accept
either a simple scalar string which is treated as a literal value, or a YAML map containing an
attribute called `expr`, which must be a string containing an arbitrary C++ expression to be used
as-is. Optionally, an expression map may also include the `is_constexpr: false` attribute, which
will suspend enforcement of the value being a `constexpr`.

For example, consider:

```yaml
server_parameters:
    connPoolMaxInUseConnsPerHost:
        ...
```
|
||||
|
||||

Here, the server parameter's default value is the evaluation of the C++ expression
`std::numeric_limits<int>::max()`. Additionally, since default was not explicitly given the
`is_constexpr: false` attribute, it will be round-tripped through the following lambda to guarantee
that it does not rely on runtime information.

```cpp
[]{ constexpr auto value = <expr>; return value; }()
```
|
||||
|
||||

### Specialized Server Parameters

When `cpp_class` is specified on a server parameter, a child class of `ServerParameter` will be
created in the `gen.h` file named for either the string value of `cpp_class`, or if it is expressed
as a dictionary, then `cpp_class.name`. A `cpp_class` directive may also contain:

```yaml
server_parameters:
    someParameter:
        cpp_class:
            name: string # Name to assign to the class (e.g., SomeParameterImpl)
            data: string # cpp data type to add to the class as a property named "_data"
            override_ctor: bool # True to allow defining a custom constructor, default: false
            override_set: bool # True to allow defining a custom set() method, default: false
            override_validate: bool # True to allow defining a custom validate() method, default: false
```
|
||||
|
||||

`override_ctor`: If `false`, the inherited constructor from the `ServerParameter` base class will be
used. If `true`, then the implementer must provide a
`{name}::{name}(StringData serverParameterName, ServerParameterType type)` constructor. In addition
to any other work, this custom constructor must invoke its parent's constructor.

`override_set`: If `true`, the implementer must provide a `set` member function as:

```cpp
Status {name}::set(const BSONElement& val, const boost::optional<TenantId>& tenantId);
```

Otherwise the base class implementation `ServerParameter::set` is used. It
invokes `setFromString` using a string representation of `val`, if the `val` is
holding one of the supported types.

`override_validate`: If `true`, the implementer must provide a `validate` member function as:

```cpp
Status {name}::validate(const BSONElement& newValueElement, const boost::optional<TenantId>& tenantId);
```

Otherwise, the base class implementation `ServerParameter::validate` is used. This simply returns
`Status::OK()` without performing any kind of validation of the new BSON element.

If `param.redact` was specified as `true`, then a standard append method will be provided which
injects a placeholder value. If `param.redact` was not specified as `true`, then an implementation
must be provided with the following signature:

```cpp
Status {name}::append(OperationContext*, BSONObjBuilder*, StringData, const boost::optional<TenantId>& tenantId);
```

Lastly, a `setFromString` method must always be provided with the following signature:

```cpp
Status {name}::setFromString(StringData value, const boost::optional<TenantId>& tenantId);
```
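
Putting these pieces together, a minimal `setFromString` might look roughly like the sketch below. The class name, parameter semantics, and error message are invented for illustration; per the rules above, an `append` implementation would also be required unless `redact: true` is set, and the constructor, `set`, and `validate` defaults from the table below would apply.

```cpp
// Sketch only: assumes an IDL entry with cpp_class: {name: HelloMessageParameter, data: std::string}.
Status HelloMessageParameter::setFromString(StringData value,
                                            const boost::optional<TenantId>& tenantId) {
    if (value.empty()) {
        return Status(ErrorCodes::BadValue, "helloMessage must not be empty");
    }
    _data = value.toString();  // "_data" is generated from the cpp_class "data" field.
    return Status::OK();
}
```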
|
||||
|
||||

The following table summarizes `ServerParameter` method override rules.

| `ServerParameter` method   | Override | Default Behavior                                                      |
| --------------------------- | -------- | --------------------------------------------------------------------- |
| constructor                 | Optional | Instantiates only the name and type.                                  |
| `set()`                     | Optional | Calls `setFromString()` on a string representation of the new value.  |
| `setFromString()`           | Required | None, won't compile without implementation.                           |
| `append() // redact=true`   | Optional | Replaces parameter value with '###'.                                  |
| `append() // redact=false`  | Required | None, won't compile without implementation.                           |
| `validate()`                | Optional | Returns `Status::OK()` without any checks.                            |
|
||||

Note that by default, server parameters are not tenant-aware and thus will always have `boost::none`
provided as `tenantId`, unless defined as cluster server parameters (discussed
[below](#cluster-server-parameters)).

Each server parameter encountered will produce a block of code to run at process startup similar to
the following:

```cpp
/**
 * Iteration count to use when creating new users with
}();
```
|
||||
|
||||
Any additional validator and callback would be set on `ret` as determined by the server parameter
|
||||
Any additional validator and callback would be set on `ret` as determined by the server parameter
|
||||
configuration block.
|
||||
|
||||
## Cluster Server Parameters

As indicated earlier, one of the options for the `set_at` field is `cluster`. If this value is
selected, then the generated server parameter will be known as a _cluster server parameter_. These
server parameters are set at runtime via the `setClusterParameter` command and are propagated to all
nodes in a sharded cluster or a replica set deployment. Cluster server parameters should be
preferred to implementing custom parameter propagation whenever possible.

`setClusterParameter` persists the new value of the indicated cluster server parameter onto a
majority of nodes on non-sharded replica sets. On sharded clusters, it majority-writes the new value
onto every shard and the config server. This ensures that every **mongod** in the cluster will be able
to recover the most recently written value for all cluster server parameters on restart.
Additionally, `setClusterParameter` blocks until the majority write succeeds in a replica set
deployment, which guarantees that the parameter value will not be rolled back after being set.
In a sharded cluster deployment, the new value has to be majority-committed on the config shard and

@ -289,38 +310,39 @@ server parameter values every `clusterServerParameterRefreshIntervalSecs` using

`ClusterParameterRefresher` periodic job.

`getClusterParameter` returns the cached value of the requested cluster server parameter on the node
that it is run on. It can accept a single cluster server parameter name, a list of names, or `*` to
return all cluster server parameter values on the node.

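For illustration, the shell sketch below (not taken from the source; the parameter name and its field
are hypothetical) shows the two commands described above being run against the admin database.

```javascript
// Hedged sketch: setting and reading a hypothetical cluster server parameter named
// "exampleClusterParam" with a single intData field. Both commands run on the admin DB.
const admin = db.getSiblingDB("admin");

// Majority-writes the new value and propagates it to every node in the deployment.
admin.runCommand({setClusterParameter: {exampleClusterParam: {intData: 42}}});

// Returns the locally cached value; "*" would return every cluster server parameter.
admin.runCommand({getClusterParameter: "exampleClusterParam"});
```
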
Specifying `cpp_vartype` for cluster server parameters must result in the usage of an IDL-defined
type that has `ClusterServerParameter` listed as a chained structure. This chaining adds the
following members to the resulting type:

- `_id` - cluster server parameters are uniquely identified by their names.
- `clusterParameterTime` - `LogicalTime` at which the current value of the cluster server parameter
  was updated; used by runtime audit configuration, and to prevent concurrent and redundant cluster
  parameter updates.

It is highly recommended to specify validation rules or a callback function via the `param.validator`
field. These validators are called before the new value of the cluster server parameter is written
to disk during `setClusterParameter`.
See [server_parameter_with_storage_test.idl][cluster-server-param-with-storage-test] and
[server_parameter_with_storage_test_structs.idl][cluster-server-param-with-storage-test-structs] for
examples.

### Specialized Cluster Server Parameters

Cluster server parameters can also be specified as specialized server parameters. The table below
summarizes `ServerParameter` method override rules in this case.

| `ServerParameter` method    | Override   | Default Behavior                             |
| ---------------------------- | ---------- | -------------------------------------------- |
| constructor                  | Optional   | Instantiates only the name and type.         |
| `set()`                      | Required   | None, won't compile without implementation.  |
| `setFromString()`            | Prohibited | Returns `ErrorCodes::BadValue`.              |
| `append()`                   | Required   | None, won't compile without implementation.  |
| `validate()`                 | Optional   | Returns `Status::OK()` without any checks.   |
| `reset()`                    | Required   | None, won't compile without implementation.  |
| `getClusterParameterTime()`  | Required   | Returns `LogicalTime::kUninitialized`.       |

Specifying `override_ctor` to true is optional. An override constructor can be useful for allocating
additional resources at the time of parameter registration. Otherwise, the default likely suffices,

@ -363,7 +385,7 @@ disabled due to either of these conditions, `setClusterParameter` on it will alw

`getClusterParameter` will fail on **mongod**, and return the default value on **mongos** -- this
difference in behavior is due to **mongos** being unaware of the current FCV.

See [server_parameter_specialized_test.idl][specialized-cluster-server-param-test-idl] and
[server_parameter_specialized_test.h][specialized-cluster-server-param-test-data] for examples.

### Implementation Details
@ -7,6 +7,7 @@ For string manipulation, use the util/mongoutils/str.h library.

`util/mongoutils/str.h` provides string helper functions for each manipulation.

`str::stream()` is quite useful for assembling strings inline:

```cpp
uassert(12345, str::stream() << "bad ns:" << ns, isOk);
```

@ -27,5 +28,4 @@ class StringData {

See also [`bson/string_data.h`][1].

[1]: ../src/mongo/base/string_data.h

@ -17,10 +17,10 @@ parameter for testing.

Some often-used commands that are test-only:

- [configureFailPoint][fail_point_cmd]
- [emptyCapped][empty_capped_cmd]
- [replSetTest][repl_set_test_cmd]
- [sleep][sleep_cmd]

As a very rough estimate, about 10% of all server commands are test-only. These additional commands
will appear in `db.runCommand({listCommands: 1})` when the server has test commands enabled.

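As a quick sanity check (a sketch, not taken from the source), you can confirm that a test-only
command is registered:

```javascript
// Hedged sketch: verifying that test-only commands are registered. Assumes the server was
// started with test commands enabled, e.g.:
//   mongod --setParameter enableTestCommands=1
const commands = db.runCommand({listCommands: 1}).commands;
print("sleep command registered: " + ("sleep" in commands));
print("configureFailPoint registered: " + ("configureFailPoint" in commands));
```
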
@ -29,10 +29,9 @@ will appear in `db.runCommand({listCommands: 1})` when the server has test comma

A few pointers to relevant code that sets this up:

- [test_commands_enabled.h][test_commands_enabled]

- [MONGO_REGISTER_COMMAND][register_command]

[empty_capped_cmd]: ../src/mongo/db/commands/test_commands.cpp
[fail_point_cmd]: ../src/mongo/db/commands/fail_point_cmd.cpp

@ -1,7 +1,7 @@

# FSM-based Concurrency Testing Framework

## Overview

The FSM tests are meant to exercise concurrency within MongoDB. The suite
consists of workloads, which define discrete units of work as states in a FSM,
and runners, which define which tests to run and how they should be run. Each

@ -42,19 +42,19 @@ some assertions, even when running a mixture of different workloads together.

There are three assertion levels: `ALWAYS`, `OWN_COLL`, and `OWN_DB`. They can
be thought of as follows (a short sketch using the matching assertion helpers
follows this list):

- `ALWAYS`: A statement that remains unequivocally true, regardless of what
  another workload might be doing to the collection I was given (hint: think
  defensively). Examples include "1 = 1" or inserting a document into a
  collection (disregarding any unique indices).

- `OWN_COLL`: A statement that is true only if I am the only workload operating
  on the collection I was given. Examples include counting the number of
  documents in a collection or updating a previously inserted document.

- `OWN_DB`: A statement that is true only if I am the only workload operating on
  the database I was given. Examples include renaming a collection or verifying
  that a collection is capped. The workload typically relies on the use of
  another collection aside from the one given.

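The sketch below (not from the source) shows how the levels typically map onto the matching
assertion helpers; the helper names (`assertAlways`, `assertWhenOwnColl`) are assumptions about the
FSM libraries, and the state and field names are purely illustrative.

```javascript
// Hedged sketch of a state function using the assumed assertion helpers.
function insertAndCount(db, collName) {
    // ALWAYS: statements like this hold no matter what other workloads are doing.
    assertAlways.eq(1, 1);

    // The insert itself is an ALWAYS-safe operation (disregarding unique indexes).
    db[collName].insert({owner: this.tid});

    // OWN_COLL: the count below is only predictable if this workload is the sole
    // writer to db[collName], so it uses the weaker assertion level.
    assertWhenOwnColl.gte(db[collName].find({owner: this.tid}).itcount(), 1);
}
```
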
## Creating your own workload
@ -97,6 +97,7 @@ When finished executing, `$config` must return an object containing the properti

above (some of which are optional, see below).

### Defining states

It's best to also declare states within their own closure so as not to interfere
with the scope of $config. Each state takes two arguments, the db object and the
collection name. For later, note that this db and collection are the only one

@ -107,9 +108,9 @@ with a name as opposed to anonymously - this makes it easier to read backtraces

when things go wrong.

```javascript
$config = (function () {
    /* ... */
    var states = (function () {
        function getRand() {
            return Random.randInt(10);
        }

@ -119,18 +120,17 @@ $config = (function() {
        }

        function scanGT(db, collName) {
            db[collName].find({_id: {$gt: this.start}}).itcount();
        }

        function scanLTE(db, collName) {
            db[collName].find({_id: {$lte: this.start}}).itcount();
        }

        return {
            init: init,
            scanGT: scanGT,
            scanLTE: scanLTE,
        };
    })();

@ -156,13 +156,12 @@ example below, we're denoting an equal probability of moving to either of the
scan states from the init state:

```javascript
$config = (function () {
    /* ... */
    var transitions = {
        init: {scanGT: 0.5, scanLTE: 0.5},
        scanGT: {scanGT: 0.8, scanLTE: 0.2},
        scanLTE: {scanGT: 0.2, scanLTE: 0.8},
    };
    /* ... */
    return {

@ -186,25 +185,31 @@ against the provided `db` you should use the provided
`cluster.executeOnMongodNodes` and `cluster.executeOnMongosNodes` functionality.

```javascript
$config = (function () {
    /* ... */
    function setup(db, collName, cluster) {
        // Workloads should NOT drop the collection db[collName], as doing so
        // is handled by jstests/concurrency/fsm_libs/runner.js before 'setup' is called.
        for (var i = 0; i < 1000; ++i) {
            db[collName].insert({_id: i});
        }
        cluster.executeOnMongodNodes(function (db) {
            db.adminCommand({
                setParameter: 1,
                internalQueryExecYieldIterations: 5,
            });
        });
        cluster.executeOnMongosNodes(function (db) {
            printjson(db.serverCmdLineOpts());
        });
    }

    function teardown(db, collName, cluster) {
        cluster.executeOnMongodNodes(function (db) {
            db.adminCommand({
                setParameter: 1,
                internalQueryExecYieldIterations: 128,
            });
        });
    }
    /* ... */

@ -233,9 +238,9 @@ composition, each workload has its own data, meaning you don't have to worry
about properties being overridden by workloads other than the current one.

```javascript
$config = (function () {
    var data = {
        start: 0,
    };
    /* ... */
    return {

@ -262,7 +267,7 @@ number of threads available due to system or performance constraints.

#### `iterations`

This is just the number of states the FSM will go through before exiting. NOTE:
it is _not_ the number of times each state will be executed.

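For orientation, the sketch below (assembled only from the properties discussed in this document,
not copied from a real workload) shows where these knobs live in the object returned by `$config`.

```javascript
// Hedged sketch: the object returned by a workload's $config closure. The
// states/transitions/data/setup/teardown definitions are elided here.
$config = (function () {
    /* ... states, transitions, data, setup, teardown ... */
    return {
        threadCount: 5,      // number of worker threads requested for this workload
        iterations: 20,      // total number of state transitions, not executions per state
        startState: "init",  // optional (see the next section)
        states: states,
        transitions: transitions,
        data: data,
        setup: setup,
        teardown: teardown,
    };
})();
```
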
#### `startState` (optional)
@ -298,8 +303,8 @@ workload you are extending has a function in its data object called

```javascript
import {extendWorkload} from "jstests/concurrency/fsm_libs/extend_workload.js";
load("jstests/concurrency/fsm_workload_modifiers/indexed_noindex.js"); // for indexedNoindex
import {$config as $baseConfig} from "jstests/concurrency/fsm_workloads/workload_with_index.js";

export const $config = extendWorkload($baseConfig, indexedNoIndex);
```

@ -314,7 +319,6 @@ prefix defined by your workload name is a good idea since the workload file name
can be assumed unique and will allow you to only affect your workload in these
cases.

## Test runners

By default, all runners below are allowed to open a maximum of

@ -345,7 +349,6 @@ all complete, all threads have their teardown function run.

### Existing runners

The existing runners all use `jstests/concurrency/fsm_libs/runner.js` to
@ -358,10 +361,10 @@ is explained in the other components section below. Execution options for

runWorkloads functions, the third argument, can contain the following options
(some depend on the run mode):

- `numSubsets` - Not available in serial mode, determines how many subsets of
  workloads to execute in parallel mode
- `subsetSize` - Not available in serial mode, determines how large each subset of
  workloads executed is

#### fsm_all.js

@ -443,16 +446,16 @@ use of the shell's built-in cluster test helpers like `ShardingTest` and

`ReplSetTest`. clusterOptions are passed to cluster.js for initialization.
clusterOptions include (a sketch of passing them to a runner follows this list):

- `replication`: boolean, whether or not to use replication in the cluster
- `sameCollection`: boolean, whether or not all workloads are passed the same
  collection
- `sameDB`: boolean, whether or not all workloads are passed the same DB
- `setupFunctions`: object, containing at most two functions under the keys
  'mongod' and 'mongos'. This allows you to run a function against all mongod or
  mongos nodes in the cluster as part of the cluster initialization. Each
  function takes a single argument, the db object against which configuration
  can be run (will be set for each mongod/mongos)
- `sharded`: boolean, whether or not to use sharding in the cluster

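The sketch below (not from the source) shows roughly how these options are handed to a runner. The
entry-point name `runWorkloadsSerially` and the workload path are assumptions for illustration; the
actual exports and signatures live in `jstests/concurrency/fsm_libs/runner.js`.

```javascript
// Hedged sketch: passing clusterOptions (second argument) and execution options
// (third argument) to a runner entry point. Names below are assumptions; consult
// jstests/concurrency/fsm_libs/runner.js for the real exports and signatures.
load("jstests/concurrency/fsm_libs/runner.js");

runWorkloadsSerially(
    ["jstests/concurrency/fsm_workloads/my_workload.js"], // hypothetical workload file
    {replication: false, sharded: false, sameCollection: false, sameDB: false},
    {} // execution options; in parallel mode these could include numSubsets/subsetSize
);
```
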
Note that sameCollection and sameDB can increase contention for a resource, but
will also decrease the strength of the assertions by ruling out the use of OwnDB

@ -460,12 +463,12 @@ and OwnColl assertions.

### Miscellaneous Execution Notes

- A `CountDownLatch` (exposed through the v8-based mongo shell, as of MongoDB 3.0)
  is used as a synchronization primitive by the ThreadManager to wait until all
  spawned threads have finished being spawned before starting workload
  execution.
- If more than 20% of the threads fail while spawning, we abort the test. If
  fewer than 20% of the threads fail while spawning we allow the non-failed
  threads to continue with the test. The 20% threshold is somewhat arbitrary;
  the goal is to abort if "mostly all" of the threads failed but to tolerate "a
  few" threads failing.

@ -3,7 +3,7 @@

The hang analyzer is a tool to collect cores and other information from processes
that are suspected to have hung. Any task which exceeds its timeout in Evergreen
will automatically be hang-analyzed, with the collected information compressed
and uploaded to S3.

The hang analyzer can also be invoked locally at any time. For all non-Jepsen
tasks, the invocation is `buildscripts/resmoke.py hang-analyzer -o file -o stdout -m exact -p python`. You may need to substitute `python` with the name of the python binary

@ -13,6 +13,7 @@ you are using, which may be one of `python`, `python3`, or on Windows: `Python`,

For jepsen tasks, the invocation is `buildscripts/resmoke.py hang-analyzer -o file -o stdout -p dbtest,java,mongo,mongod,mongos,python,_test`.

## Interesting Processes

The hang analyzer detects and runs against processes which are considered
interesting.

@ -21,33 +22,37 @@ of `dbtest,java,mongo,mongod,mongos,python,_test`.

In all other scenarios, including local use of the hang-analyzer, an interesting
process is any of:

- process that starts with `python` or `live-record`
- one which has been spawned as a child process of resmoke.

The resmoke subcommand `hang-analyzer` will send SIGUSR1/use SetEvent to signal
resmoke to:

- Print stack traces for all python threads
- Collect core dumps and other information for any non-python child
  processes, see `Data Collection` below
- Re-signal any python child processes to do the same

## Data Collection

Data collection occurs in the following sequence:

- Pause all non-python processes
- Grab debug symbols on non-Sanitizer builds
- Signal python Processes
- Dump cores of as many processes as possible, until the disk quota is exceeded.
  The default quota is 90% of total volume space.

- Collect additional, non-core data. Ideally:
  - Print C++ Stack traces
  - Print MozJS Stack Traces
  - Dump locks/mutexes info
  - Dump Server Sessions
  - Dump Recovery Units
  - Dump Storage engine info
  - Dump java processes (Jepsen tests) with jstack
- SIGABRT (Unix)/terminate (Windows) go processes

Note that the list of non-core data collected is only accurate on Linux. Other
platforms only perform a subset of these operations.

@ -57,11 +62,12 @@ timeouts, and may not have enough time to collect all information before

being terminated by the Evergreen agent. When running locally there is no
timeout, and the hang analyzer may ironically hang indefinitely.

### Implementations

Platform-specific concerns for data collection are handled by dumper objects in
`buildscripts/resmokelib/hang_analyzer/dumper.py`.

- Linux: See `GDBDumper`
- MacOS: See `LLDBDumper`
- Windows: See `WindowsDumper` and `JstackWindowsDumper`
- Java (non-Windows): `JstackDumper`

@ -1,8 +1,11 @@

# Open telemetry (OTel) in resmoke

OTel is one of two systems we use to capture metrics from resmoke. For mongo-tooling-metrics please see the documentation [here](README.md).

## What Do We Capture

Using OTel we capture the following things:

1. How long a resmoke suite takes to run (a collection of js tests)
2. How long each test in a suite takes to run (a single js test)
3. Duration of hooks before and after test/suite

@ -13,17 +16,22 @@ To see this visually navigate to the [resmoke dataset](https://ui.honeycomb.io/m

## A look at source code

### Configuration

The bulk of configuration is done in the
`_set_up_tracing(...)` method in [configure_resmoke.py#L164](https://github.com/10gen/mongo/blob/976ce50f6134789e73c639848b35f10040f0ff4a/buildscripts/resmokelib/configure_resmoke.py#L164). This method includes documentation on how it works.

## BatchedBaggageSpanProcessor

See documentation [batched_baggage_span_processor.py#L8](https://github.com/mongodb/mongo/blob/976ce50f6134789e73c639848b35f10040f0ff4a/buildscripts/resmokelib/utils/batched_baggage_span_processor.py#L8)

## FileSpanExporter

See documentation [file_span_exporter.py#L16](https://github.com/10gen/mongo/blob/976ce50f6134789e73c639848b35f10040f0ff4a/buildscripts/resmokelib/utils/file_span_exporter.py#L16)

## Capturing Data

We mostly capture data by using a decorator on methods. Example taken from [job.py#L200](https://github.com/10gen/mongo/blob/6d36ac392086df85844870eef1d773f35020896c/buildscripts/resmokelib/testing/job.py#L200)

```python
TRACER = trace.get_tracer("resmoke")

@ -32,7 +40,9 @@ def func_name(...):
    span = trace.get_current_span()
    span.set_attribute("attr1", True)
```

This system is nice because the decorator captures exceptions and other failures and a user can never forget to close a span. On occasion we will also start a span using the `with` clause in python. However, the decorator method is preferred since the `with`-based approach below has a larger impact on the readability of the code. This example is taken from [job.py#L215](https://github.com/10gen/mongo/blob/6d36ac392086df85844870eef1d773f35020896c/buildscripts/resmokelib/testing/job.py#L215)

```python
with TRACER.start_as_current_span("func_name", attributes={}):
    func_name(...)
@ -40,4 +50,5 @@ with TRACER.start_as_current_span("func_name", attributes={}):
```

## Insights We Have Made (so far)

Using [this dashboard](https://ui.honeycomb.io/mongodb-4b/environments/production/board/3bATQLb38bh/Server-CI) and [this query](https://ui.honeycomb.io/mongodb-4b/environments/production/datasets/resmoke/result/GFa2YJ6d4vU/a/7EYuMJtH8KX/Slowest-Resmoke-Tests) we can see the most expensive single js tests. We plan to make tickets for teams to fix these long running tests for cloud savings as well as developer time savings.

@ -9,6 +9,7 @@ burden of starting and destroying a dedicated thread.

## Classes

### `ThreadPoolInterface`

The [`ThreadPoolInterface`][thread_pool_interface.h] abstract interface is
an extension of the `OutOfLineExecutor` (see [the executors architecture
guide][executors]) abstract interface, adding `startup`, `shutdown`, and

@ -58,4 +59,3 @@ resources it simulates a thread pool well enough to be used by a

[network_interface_thread_pool.h]: ../src/mongo/executor/network_interface_thread_pool.h
[network_interface.h]: ../src/mongo/executor/network_interface.h
[thread_pool_mock.h]: ../src/mongo/executor/thread_pool_mock.h

docs/vpat.md

@ -4,133 +4,133 @@ Contact for more Information: https://www.mongodb.com/contact

## Summary Table

| Criteria | Supporting Features | Remarks and explanations |
| --------------------------------------------------------------- | --------------------------------------------------------------------------------- | ------------------------ |
| Section 1194.21 Software Applications and Operating Systems | Product has been coded to meet this standard subject to the remarks on the right. | |
| Section 1194.22 Web-based Internet Information and Applications | Product has been coded to meet this standard subject to the remarks on the right. | |
| Section 1194.23 Telecommunications Products | Not Applicable | |
| Section 1194.24 Video and Multi-media Products | Not Applicable | |
| Section 1194.25 Self-Contained, Closed Products | Not Applicable | |
| Section 1194.26 Desktop and Portable Computers | Not Applicable | |
| Section 1194.31 Functional Performance Criteria | Product has been coded to meet this standard subject to the remarks on the right. | |
| Section 1194.41 Information, Documentation and Support | Product has been coded to meet this standard subject to the remarks on the right. | |
## Section 1194.21 Software Applications and Operating Systems – Detail
| Criteria | Supporting Features | Remarks and explanations |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| (a) When software is designed to run on a system that has a keyboard, product functions shall be executable from a keyboard where the function itself or the result of performing a function can be discerned textually. | Product has been coded to meet this standard subject to the remarks on the right. | All functions can be executed from the keyboard. |
| (b) Applications shall not disrupt or disable activated features of other products that are identified as accessibility features, where those features are developed and documented according to industry standards. Applications also shall not disrupt or disable activated features of any operating system that are identified as accessibility features where the application programming interface for those accessibility features has been documented by the manufacturer of the operating system and is available to the product developer. | Product has been coded to meet this standard subject to the remarks on the right. | Does not interfere with Mouse Keys, Sticky Keys, Filter Keys or Toggle Keys. |
| (c\) A well-defined on-screen indication of the current focus shall be provided that moves among interactive interface elements as the input focus changes. The focus shall be programmatically exposed so that Assistive Technology can track focus and focus changes. | Product has been coded to meet this standard subject to the remarks on the right. | Focus is programmatically exposed. |
| (d) Sufficient information about a user interface element including the identity, operation and state of the element shall be available to Assistive Technology. When an image represents a program element, the information conveyed by the image must also be available in text. | Product has been coded to meet this standard subject to the remarks on the right. | Information about each UI element is programmatically exposed. |
| (e) When bitmap images are used to identify controls, status indicators, or other programmatic elements, the meaning assigned to those images shall be consistent throughout an application's performance. | Product has been coded to meet this standard subject to the remarks on the right. | Does not use bitmap images. |
| (f) Textual information shall be provided through operating system functions for displaying text. The minimum information that shall be made available is text content, text input caret location, and text attributes. | Product has been coded to meet this standard subject to the remarks on the right. | Information about each UI element is programmatically exposed. |
| (g) Applications shall not override user selected contrast and color selections and other individual display attributes. | Product has been coded to meet this standard subject to the remarks on the right. | Windows or other OS-level color settings are not over-ruled by product. |
| (h) When animation is displayed, the information shall be displayable in at least one non-animated presentation mode at the option of the user. | Product has been coded to meet this standard subject to the remarks on the right. | Does not use animation in UI. |
| (i) Color coding shall not be used as the only means of conveying information, indicating an action, prompting a response, or distinguishing a visual element. | Product has been coded to meet this standard subject to the remarks on the right. | Color coding is not used. |
| (j) When a product permits a user to adjust color and contrast settings, a variety of color selections capable of producing a range of contrast levels shall be provided. | Product has been coded to meet this standard subject to the remarks on the right. | Does not permit the user to adjust color and contrast settings. |
| (k) Software shall not use flashing or blinking text, objects, or other elements having a flash or blink frequency greater than 2 Hz and lower than 55 Hz. | Product has been coded to meet this standard subject to the remarks on the right. | There is no instance of blinking or flashing objects that are within the danger range of 2hz to 55hz. |
| (l) When electronic forms are used, the form shall allow people using Assistive Technology to access the information, field elements, and functionality required for completion and submission of the form, including all directions and cues. | Product has been coded to meet this standard subject to the remarks on the right. | All functions can be executed from the keyboard. |
## Section 1194.22 Web-based Internet information and applications – Detail
| Criteria | Supporting Features | Remarks and explanations |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| (a) A text equivalent for every non-text element shall be provided (e.g., via "alt", "longdesc", or in element content). | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (b) Equivalent alternatives for any multimedia presentation shall be synchronized with the presentation. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (c\) Web pages shall be designed so that all information conveyed with color is also available without color, for example from context or markup. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (d) Documents shall be organized so they are readable without requiring an associated style sheet. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (e) Redundant text links shall be provided for each active region of a server-side image map. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (f) Client-side image maps shall be provided instead of server-side image maps except where the regions cannot be defined with an available geometric shape. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (g) Row and column headers shall be identified for data tables. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (h) Markup shall be used to associate data cells and header cells for data tables that have two or more logical levels of row or column headers. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (i) Frames shall be titled with text that facilitates frame identification and navigation | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (j) Pages shall be designed to avoid causing the screen to flicker with a frequency greater than 2 Hz and lower than 55 Hz. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (k) A text-only page, with equivalent information or functionality, shall be provided to make a web site comply with the provisions of this part, when compliance cannot be accomplished in any other way. The content of the text-only page shall be updated whenever the primary page changes. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (l) When pages utilize scripting languages to display content, or to create interface elements, the information provided by the script shall be identified with functional text that can be read by Assistive Technology. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (m) When a web page requires that an applet, plug-in or other application be present on the client system to interpret page content, the page must provide a link to a plug-in or applet that complies with §1194.21(a) through (l). | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (n) When electronic forms are designed to be completed on-line, the form shall allow people using Assistive Technology to access the information, field elements, and functionality required for completion and submission of the form, including all directions and cues. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (o) A method shall be provided that permits users to skip repetitive navigation links. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
| (p\) When a timed response is required, the user shall be alerted and given sufficient time to indicate more time is required. | Core MongoDB documentation has been coded to meet this standard subject to the remarks on the right. | Our documentation complies to this criteria and can be found at https://docs.mongodb.com/manual/. |
### Note to 1194.22

The Board interprets paragraphs (a) through (k) of this section as consistent with the following
priority 1 Checkpoints of the Web Content Accessibility Guidelines 1.0 (WCAG 1.0) (May 5 1999) published by the Web
Accessibility Initiative of the World Wide Web Consortium: Paragraph (a) - 1.1, (b) - 1.4, (c\) - 2.1, (d) - 6.1,
(e) - 1.2, (f) - 9.1, (g) - 5.1, (h) - 5.2, (i) - 12.1, (j) - 7.1, (k) - 11.4.
## Section 1194.23 Telecommunications Products – Detail
|
||||
|
||||
|Criteria |Supporting Features|Remarks and explanations|
|
||||
|---|---|---|
|
||||
|(a) Telecommunications products or systems which provide a function allowing voice communication and which do not themselves provide a TTY functionality shall provide a standard non-acoustic connection point for TTYs. Microphones shall be capable of being turned on and off to allow the user to intermix speech with TTY use.|Not Applicable||
|
||||
|(b) Telecommunications products which include voice communication functionality shall support all commonly used cross-manufacturer non-proprietary standard TTY signal protocols.|Not Applicable||
|
||||
|(c\) Voice mail, auto-attendant, and interactive voice response telecommunications systems shall be usable by TTY users with their TTYs.|Not Applicable||
|
||||
|(d) Voice mail, messaging, auto-attendant, and interactive voice response telecommunications systems that require a response from a user within a time interval, shall give an alert when the time interval is about to run out, and shall provide sufficient time for the user to indicate more time is required.|Not Applicable||
|
||||
|(e) Where provided, caller identification and similar telecommunications functions shall also be available for users of TTYs, and for users who cannot see displays.|Not Applicable||
|
||||
|(f) For transmitted voice signals, telecommunications products shall provide a gain adjustable up to a minimum of 20 dB. For incremental volume control, at least one intermediate step of 12 dB of gain shall be provided.|Not Applicable||
|
||||
|(g) If the telecommunications product allows a user to adjust the receive volume, a function shall be provided to automatically reset the volume to the default level after every use.|Not Applicable||
|
||||
|(h) Where a telecommunications product delivers output by an audio transducer which is normally held up to the ear, a means for effective magnetic wireless coupling to hearing technologies shall be provided.|Not Applicable||
|
||||
|(i) Interference to hearing technologies (including hearing aids, cochlear implants, and assistive listening devices) shall be reduced to the lowest possible level that allows a user of hearing technologies to utilize the telecommunications product.|Not Applicable||
|
||||
|(j) Products that transmit or conduct information or communication, shall pass through cross-manufacturer, non-proprietary, industry-standard codes, translation protocols, formats or other information necessary to provide the information or communication in a usable format. Technologies which use encoding, signal compression, format transformation, or similar techniques shall not remove information needed for access or shall restore it upon delivery.|Not Applicable||
|
||||
|(k)(1) Products which have mechanically operated controls or keys shall comply with the following: Controls and Keys shall be tactilely discernible without activating the controls or keys.|Not Applicable||
|
||||
|(k)(2) Products which have mechanically operated controls or keys shall comply with the following: Controls and Keys shall be operable with one hand and shall not require tight grasping, pinching, twisting of the wrist. The force required to activate controls and keys shall be 5 lbs. (22.2N) maximum.|Not Applicable||
|
||||
|(k)(3) Products which have mechanically operated controls or keys shall comply with the following: If key repeat is supported, the delay before repeat shall be adjustable to at least 2 seconds. Key repeat rate shall be adjustable to 2 seconds per character.|Not Applicable||
|(k)(4) Products which have mechanically operated controls or keys shall comply with the following: The status of all locking or toggle controls or keys shall be visually discernible, and discernible either through touch or sound.|Not Applicable||

## Section 1194.24 Video and Multi-media Products – Detail
|Criteria|Supporting Features|Remarks and explanations|
|---|---|---|
|(a) All analog television displays 13 inches and larger, and computer equipment that includes analog television receiver or display circuitry, shall be equipped with caption decoder circuitry which appropriately receives, decodes, and displays closed captions from broadcast, cable, videotape, and DVD signals. As soon as practicable, but not later than July 1, 2002, widescreen digital television (DTV) displays measuring at least 7.8 inches vertically, DTV sets with conventional displays measuring at least 13 inches vertically, and stand-alone DTV tuners, whether or not they are marketed with display screens, and computer equipment that includes DTV receiver or display circuitry, shall be equipped with caption decoder circuitry which appropriately receives, decodes, and displays closed captions from broadcast, cable, videotape, and DVD signals.|Not Applicable||
|(b) Television tuners, including tuner cards for use in computers, shall be equipped with secondary audio program playback circuitry.|Not Applicable||
|(c\) All training and informational video and multimedia productions which support the agency's mission, regardless of format, that contain speech or other audio information necessary for the comprehension of the content, shall be open or closed captioned.|Not Applicable||
|(d) All training and informational video and multimedia productions which support the agency's mission, regardless of format, that contain visual information necessary for the comprehension of the content, shall be audio described.|Not Applicable||
|(e) Display or presentation of alternate text presentation or audio descriptions shall be user-selectable unless permanent.|Not Applicable||

## Section 1194.25 Self-Contained, Closed Products – Detail
|Criteria |Supporting Features|Remarks and explanations|
|---|---|---|
|(a) Self contained products shall be usable by people with disabilities without requiring an end-user to attach Assistive Technology to the product. Personal headsets for private listening are not Assistive Technology.|Not Applicable||
|(b) When a timed response is required, the user shall be alerted and given sufficient time to indicate more time is required.|Not Applicable||
|(c\) Where a product utilizes touchscreens or contact-sensitive controls, an input method shall be provided that complies with §1194.23 (k) (1) through (4).|Not Applicable||
|(d) When biometric forms of user identification or control are used, an alternative form of identification or activation, which does not require the user to possess particular biological characteristics, shall also be provided.|Not Applicable||
|(e) When products provide auditory output, the audio signal shall be provided at a standard signal level through an industry standard connector that will allow for private listening. The product must provide the ability to interrupt, pause, and restart the audio at anytime.|Not Applicable||
|(f) When products deliver voice output in a public area, incremental volume control shall be provided with output amplification up to a level of at least 65 dB. Where the ambient noise level of the environment is above 45 dB, a volume gain of at least 20 dB above the ambient level shall be user selectable. A function shall be provided to automatically reset the volume to the default level after every use.|Not Applicable||
|(g) Color coding shall not be used as the only means of conveying information, indicating an action, prompting a response, or distinguishing a visual element.|Not Applicable||
|(h) When a product permits a user to adjust color and contrast settings, a range of color selections capable of producing a variety of contrast levels shall be provided.|Not Applicable||
|(i) Products shall be designed to avoid causing the screen to flicker with a frequency greater than 2 Hz and lower than 55 Hz.|Not Applicable||
|(j)(1) Products which are freestanding, non-portable, and intended to be used in one location and which have operable controls shall comply with the following: The position of any operable control shall be determined with respect to a vertical plane, which is 48 inches in length, centered on the operable control, and at the maximum protrusion of the product within the 48 inch length on products which are freestanding, non-portable, and intended to be used in one location and which have operable controls.|Not Applicable||
|(j)(2) Products which are freestanding, non-portable, and intended to be used in one location and which have operable controls shall comply with the following: Where any operable control is 10 inches or less behind the reference plane, the height shall be 54 inches maximum and 15 inches minimum above the floor.|Not Applicable||
|(j)(3) Products which are freestanding, non-portable, and intended to be used in one location and which have operable controls shall comply with the following: Where any operable control is more than 10 inches and not more than 24 inches behind the reference plane, the height shall be 46 inches maximum and 15 inches minimum above the floor.|Not Applicable||
|(j)(4) Products which are freestanding, non-portable, and intended to be used in one location and which have operable controls shall comply with the following: Operable controls shall not be more than 24 inches behind the reference plane.|Not Applicable||

## Section 1194.26 Desktop and Portable Computers – Detail
|Criteria|Supporting Features|Remarks and explanations|
|---|---|---|
|(a) All mechanically operated controls and keys shall comply with §1194.23 (k) (1) through (4).|Not Applicable||
|(b) If a product utilizes touchscreens or touch-operated controls, an input method shall be provided that complies with §1194.23 (k) (1) through (4).|Not Applicable||
|(c\) When biometric forms of user identification or control are used, an alternative form of identification or activation, which does not require the user to possess particular biological characteristics, shall also be provided.|Not Applicable||
|(d) Where provided, at least one of each type of expansion slots, ports and connectors shall comply with publicly available industry standards|Not Applicable||

## Section 1194.31 Functional Performance Criteria – Detail
|Criteria|Supporting Features|Remarks and explanations|
|---|---|---|
|(a) At least one mode of operation and information retrieval that does not require user vision shall be provided, or support for Assistive Technology used by people who are blind or visually impaired shall be provided.|Applicable|All user operations of the product are compatible with Assistive Technology.|
|(b) At least one mode of operation and information retrieval that does not require visual acuity greater than 20/70 shall be provided in audio and enlarged print output working together or independently, or support for Assistive Technology used by people who are visually impaired shall be provided.|Applicable|All user operations of the product are compatible with Assistive Technology.|
|(c\) At least one mode of operation and information retrieval that does not require user hearing shall be provided, or support for Assistive Technology used by people who are deaf or hard of hearing shall be provided|Not Applicable|There is no reliance on user hearing.|
|(d) Where audio information is important for the use of a product, at least one mode of operation and information retrieval shall be provided in an enhanced auditory fashion, or support for assistive hearing devices shall be provided.|Not Applicable|There is no reliance on sound.|
|(e) At least one mode of operation and information retrieval that does not require user speech shall be provided, or support for Assistive Technology used by people with disabilities shall be provided.|Not Applicable|There is no reliance on user speech.|
|(f) At least one mode of operation and information retrieval that does not require fine motor control or simultaneous actions and that is operable with limited reach and strength shall be provided.|Not Applicable|There is no reliance on fine motor control.|

## Section 1194.41 Information, Documentation and Support – Detail
|Criteria|Supporting Features|Remarks and explanations|
|---|---|---|
|(a) Product support documentation provided to end-users shall be made available in alternate formats upon request, at no additional charge|Support documentation for this product is available in accessible electronic format or print format.|MongoDB documentation is available online: http://docs.mongodb.org/manual/. Alternatively, users may, from the same link, access the documentation in single-page HTML, EPUB, or PDF format. There is no additional charge for these alternate formats.|
|(b) End-users shall have access to a description of the accessibility and compatibility features of products in alternate formats or alternate methods upon request, at no additional charge.|Description is available in accessible electronic format online.|Information regarding accessibility and compatibility features is available online: https://www.mongodb.com/accessibility/vpat. Links to alternative formats may also be found online, as applicable. There is no additional charge for these alternate formats.|
|(c\) Support services for products shall accommodate the communication needs of end-users with disabilities.|MongoDB, Inc. provides support services via web support at https://support.mongodb.com/.|MongoDB, Inc. customers primarily use Salesforce Service Cloud for communications with support. Salesforce Service Cloud delivers content via a web interface that is accessible to existing screen readers. Salesforce Service Cloud has a VPAT located at http://www.sfdcstatic.com/assets/pdf/misc/VPAT_servicecloud_summer2013.pdf.|

# Javascript Test Guide
At MongoDB we write integration tests in JavaScript. These are tests written to exercise some behavior of a running MongoDB server, replica set, or sharded cluster. This guide aims to provide some general guidelines and best practices on how to write good tests.
## Principles
### Minimize the test case as much as possible while still exercising and testing the desired behavior.
- For example, if you are testing that document deletion works correctly, it may be entirely sufficient to insert a single document and then delete that document; inserting multiple documents would be unnecessary (see the sketch after this list). A guiding principle is to ask yourself how easy it would be for a new person coming to this test to quickly understand it. If multiple documents are inserted into a collection in a test that only tests document deletion, a newcomer might ask: "is it important that the test uses multiple documents, or is that incidental?" It is best to remove these kinds of questions from a reader's mind by keeping only the absolutely essential parts of a test.
- We should always strive for unit testing when possible, so if the functionality you want to test can be covered by a unit test, write a unit test instead.
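A minimal sketch of this principle (the database and collection names are illustrative, and `db` is the shell's default connection):

```javascript
// Minimal deletion test: one document in, one document out.
const coll = db.getSiblingDB(jsTestName()).delete_single_doc;
coll.drop();

assert.commandWorked(coll.insert({_id: 1}));
assert.commandWorked(coll.remove({_id: 1}));
assert.eq(0, coll.find().itcount(), "expected no documents after the delete");
```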
### Add a block comment at the top of the JavaScript test file giving a clear and concise overview of what a test is trying to verify.
- For tests that are more complicated, a brief description of the test steps might be useful as well.
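For example, a header comment might look like the following (the test described is hypothetical):

```javascript
/**
 * Tests that dropping a collection removes it from the output of listCollections.
 *
 * Steps:
 * 1. Create a collection and verify it is listed.
 * 2. Drop the collection.
 * 3. Verify it is no longer listed.
 */
```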
### Keep debuggability in mind.
- Assertion error messages should contain all information relevant to debugging the test. This means the server's response from the failed command should almost always be included in the assertion error message. It can also be helpful to include parameters that vary during the test, to avoid requiring the investigator to use the logs/backtrace to determine what the test was attempting to do.
- Think about how easy it would be to debug your test if something failed and a newcomer only had the logs of the test to look at. This can help guide your decision on which log messages to include and to what level of detail. The `jsTestLog` function is useful for this, as it is good at visually demarcating different phases of a test (see the sketch after this list). As a tip, run your test a few times and just study the log messages, imagining you are an engineer debugging the test with only these logs to look at. Think about how understandable the logs would be to a newcomer. It is easy to add log messages to a test but then forget to check how they actually appear.
- Never insert identical documents unless necessary. It is very useful in debugging to be able to figure out where a given piece of data came from.
- If a test does the same thing multiple times, consider factoring the repeated steps out into a library. Shorter-running tests are easier to debug, and code duplication is always bad.
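A small sketch of both ideas, assuming `coll` was set up earlier in the test (the phase names are illustrative):

```javascript
jsTestLog("Phase 1: insert the seed document");
assert.commandWorked(coll.insert({_id: 1, phase: 1}));

jsTestLog("Phase 2: run the command under test");
const res = coll.getDB().runCommand({collStats: coll.getName()});
// Include the full server response so a failure can be debugged from the logs alone.
assert.commandWorked(res, "collStats failed, response: " + tojson(res));
```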
### Do not hardcode collection or database names, especially if they are used multiple times throughout a test.
It is best to use variable names that describe what a value is used for. For example, naming a variable that stores a collection name `collectionToDrop` is much better than just naming the variable `collName`.
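For example (a minimal sketch; the names are illustrative):

```javascript
const testDB = db.getSiblingDB(jsTestName());
// The name says what the collection is for, rather than a generic "collName".
const collectionToDrop = "collection_to_drop";
assert.commandWorked(testDB.createCollection(collectionToDrop));
assert.commandWorked(testDB.runCommand({drop: collectionToDrop}));
```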
### Make every effort to make your test as deterministic as possible.
- Non-deterministic tests add noise to our build system and, in general, make it harder for yourself and other engineers to determine whether the system really is working correctly. Flaky integration tests should be considered bugs, and we should not allow them to be committed to the server codebase. One way to make jstests more deterministic is to use failpoints to force events to happen in the expected order (see the sketch after this list). However, if we have to use failpoints to make a test deterministic, we should consider writing a unit test instead.
- Note that our fuzzer and concurrency test suites are often an exception to this rule. In those cases we sometimes give up some level of determinism in order to trigger a wider class of rare edge cases. For targeted JavaScript integration tests, however, highly deterministic tests should be the goal.
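A sketch of using a failpoint to force an ordering, using the `configureFailPoint` helper from `jstests/libs/fail_point_util.js` (the failpoint name and namespace are illustrative):

```javascript
import {configureFailPoint} from "jstests/libs/fail_point_util.js";

// Block the insert at a known point so the test can act while it is paused.
const fp = configureFailPoint(db.getMongo(), "hangAfterCollectionInserts");
const awaitInsert = startParallelShell(() => {
    assert.commandWorked(db.getSiblingDB("test").coll.insert({x: 1}));
});

fp.wait();  // Deterministically reach the paused state before continuing.
// ... steps that must happen while the insert is paused go here ...
fp.off();   // Release the failpoint and let the insert finish.
awaitInsert();
```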
### Think hard about all the assumptions that the test relies on.
- For example, if a certain phase of the test ran much slower or much faster, would it cause your test to fail for the wrong reason?
- If your test includes hard-coded timeouts, make sure they are set appropriately. If a test is waiting for a certain condition to be true, and the test should not proceed until that condition is met, it is often correct to just wait "indefinitely" instead of adding some arbitrary timeout value like 30 seconds. In practice this usually means setting some reasonable upper limit, for example 10 minutes.
- Also, for replication tests, make sure data exists on the right nodes at the right time. For example, if you do a write and don't explicitly wait for it to replicate, it might not reach a secondary node before you try to do the next step of the test (see the sketch after this list).
- Does your test require data to be stored persistently? Remember that we have test variants that run on in-memory/ephemeral storage engines.
- There are timeouts on the test suites, and we aim to have all tests in the same suite finish before the timeout. That means we should always keep each test's runtime short.
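A sketch of waiting for replication before reading from a secondary, assuming the `ReplSetTest` fixture from `jstests/libs/replsettest.js` (the namespace is illustrative):

```javascript
import {ReplSetTest} from "jstests/libs/replsettest.js";

const rst = new ReplSetTest({nodes: 2});
rst.startSet();
rst.initiate();

assert.commandWorked(rst.getPrimary().getDB("test").coll.insert({_id: 1}));

// Without this, the read below can race with replication and flakily see zero documents.
rst.awaitReplication();

const secondary = rst.getSecondary();
secondary.setSecondaryOk();
assert.eq(1, secondary.getDB("test").coll.find().itcount(), "document did not reach the secondary");

rst.stopSet();
```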
### Make tests fail as early as possible.
- If something goes wrong early in the test, it's much harder to diagnose when that error only becomes visible much later.
- Wrap every command in `assert.commandWorked` or `assert.commandFailedWithCode`. There is also `assert.commandFailed`, which won't check the returned error code, but we should always try to use `assert.commandFailedWithCode` to make sure the test won't pass on an unexpected error.
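For example (a minimal sketch; the namespaces are illustrative):

```javascript
const testDB = db.getSiblingDB(jsTestName());
const coll = testDB.fail_early;
coll.drop();

// Fail immediately if the setup write does not succeed.
assert.commandWorked(coll.insert({_id: 1}));

// Expect a specific error code, so the test cannot pass on some other, unexpected failure.
assert.commandFailedWithCode(testDB.runCommand({drop: "collection_that_does_not_exist"}),
                             ErrorCodes.NamespaceNotFound);
```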
### Be aware of all the configurations and variants that your test might run under.
- Make sure that your test still works correctly if it is run in a different configuration or on a different platform than the one you might have tested on.
- Varying storage engines and suites can often affect a test's behavior. For example, maybe your test fails unexpectedly if it runs with authentication turned on and an in-memory storage engine. You don't have to run a new test on every possible platform before committing it, but you should be confident that your test doesn't break in an unexpected configuration.
### Avoid assertions that verify properties indirectly.
All assertions in a test should attempt to verify the most specific property possible. For example, if you are trying to test that a certain collection exists, it is better to assert that the collection's exact name exists in the list of collections, as opposed to verifying that the collection count is equal to 1. The desired collection's existence is sufficient for the collection count to be 1, but not necessary (a different collection could exist in its place). Be wary of adding this kind of indirect assertion to a test.
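A sketch of the direct form of such an assertion (the collection name is illustrative):

```javascript
const testDB = db.getSiblingDB(jsTestName());
assert.commandWorked(testDB.createCollection("expected_collection"));

// Assert on the specific property we care about: the exact collection name is present.
const names = testDB.getCollectionNames();
assert(names.includes("expected_collection"),
       "expected_collection is missing, collections: " + tojson(names));

// Weaker, indirect version to avoid: any single collection would satisfy it.
// assert.eq(1, testDB.getCollectionNames().length);
```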
## Modules in Practice
We have fully migrated to modularized JavaScript, so any new test should use modules and adopt the new style.
### Only import/export what you need.
It's always important to keep the test context clean, so we should only import/export what we need.

- Unused imports violate the [no-unused-vars](https://eslint.org/docs/latest/rules/no-unused-vars) rule in ESLint, though we haven't enforced it yet.
- We don't have a linter check for exports, since it's hard to tell whether an export is necessary, but we should only export what is imported by other tests or will be needed in the future.
### Declare variables in proper scope.
In the past, we have seen tests referring to "undeclared" or "redeclared" variables which were actually introduced through `load()`. Now, with modules, the scoping is clearer. We can use global variables properly to set up the test and do not need to worry about polluting other tests.
### Name variables properly when exporting.
To avoid naming conflicts, we should not give exported variables names that are too general, since they could easily conflict with another variable in a test that imports your module. For example, in the following case the module exports a variable named `alphabet`, which leads to a re-declaration error.
```javascript
import {alphabet} from "/matts/module.js";

const alphabet = "xyz"; // ERROR
```
### Prefer `let`/`const` over `var`

`let`/`const` should be preferred over `var`, since they detect double declaration in the first place. In the naming-conflict example above, if the second line used `var`, it could easily go wrong without throwing an error.
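A minimal illustration of the difference:

```javascript
var count = 1;
var count = 2; // Silently allowed: the second declaration simply reassigns.

let total = 1;
let total = 2; // SyntaxError: Identifier 'total' has already been declared.
```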
### Export in ES6 style
For legacy reasons, we have a lot of code that uses the old style of exporting, like the following.
```javascript
const MyModule = (function() {
    function myFeature() {}

    function myOtherFeature() {}

    return {myFeature, myOtherFeature};
})();
```
Instead, we should use the ES6 way to export, as follows.
```javascript
export function myFeature() {}
export function myOtherFeature() {}

// When importing from a test:
import * as MyModule from "/path/to/my_module.js";
```
This helps the language server discover the exported methods and provide code navigation for them.
# Multiversion Tests
## Context

These tests test specific upgrade/downgrade behavior expected between different versions of MongoDB. Some of these tests will persist indefinitely & some of these tests will be removed upon branching. All targeted tests must go in a targeted test directory. Do not put any files in the multiVersion/ top-level directory.

## Generic Tests

These tests test the general functionality of upgrades/downgrades regardless of version. These will persist indefinitely, as they should always pass regardless of MongoDB version.
## Targeted Tests

These tests are specific to the current development cycle. These can/will fail after branching and are subject to removal during branching.

### targetedTestsLastLtsFeatures

These tests rely on a specific last-lts version. After the next major release, last-lts is a different version than expected, so these are subject to failure. Tests in this directory will be removed after the next major release.

### targetedTestsLastContinuousFeatures

These tests rely on a specific last-continuous version. After the next minor release, last-continuous is a different version than expected, so these are subject to failure. Tests in this directory will be removed after the next minor release.
# Internal Client
## Replica set monitoring and host targeting

The internal client driver responsible for routing a command request to a replica set must determine which member to target. Host targeting involves finding which nodes in a topology satisfy the $readPreference. Node eligibility depends on the type of a node (i.e. primary, secondary, etc.) and …

… through the hello response latency. Aside from the RTT, the remaining information … read preferences is gathered through awaitable hello commands asynchronously sent to each node in the topology.
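For illustration, this is how a driver or shell client expresses the read preference that the targeting logic described here must satisfy (a minimal sketch; the connection string and namespace are illustrative):

```javascript
// Target an eligible secondary if one exists, otherwise fall back to the primary.
const conn = new Mongo("mongodb://localhost:27017/?replicaSet=rs0");
conn.getDB("test").example.find().readPref("secondaryPreferred").itcount();
```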
#### Code references
- [**Read Preference**](https://docs.mongodb.com/manual/core/read-preference/)
- [**ReplicaSetMonitorInterface class**](https://github.com/mongodb/mongo/blob/v4.4/src/mongo/client/replica_set_monitor_interface.h)
- [**ReplicaSetMonitorManager class**](https://github.com/mongodb/mongo/blob/v4.4/src/mongo/client/replica_set_monitor_manager.h)
- [**RemoteCommandTargeter class**](https://github.com/mongodb/mongo/blob/v4.4/src/mongo/client/remote_command_targeter.h)
- [**ServerDiscoveryMonitor class**](https://github.com/mongodb/mongo/blob/v4.4/src/mongo/client/server_discovery_monitor.cpp)
- [**ServerPingMonitor class**](https://github.com/mongodb/mongo/blob/v4.4/src/mongo/client/server_ping_monitor.h)
- The specifications for
[**Server Discovery and Monitoring (SDAM)**](https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst)
- [**TopologyDescription class**](https://github.com/mongodb/mongo/blob/v4.4/src/mongo/client/sdam/topology_description.h)
- [**Replication Architecture Guide**](https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/README.md#replication-and-topology-coordinators)
---
At present, usage of JWS in MongoDB is limited to the Linux platform only and is not implemented on other platforms.
Since signature validation is not available on other platforms, use of the unvalidated JWT types, while present, is not useful.

- [Glossary](#glossary)
- [`JWKManager`](#jwkmanager)
- [`JWSValidator`](#jwsvalidator)
- [`JWSValidatedToken`](#jwsvalidatedtoken)
- [Compact Serialization Format](#compact-serialization-format)
## Glossary

- **JWK** (JSON Web Key): A human readable representation of a cryptographic key.
  - See [RFC 7517](https://www.rfc-editor.org/rfc/rfc7517) JSON Web Key
  - Note: This library currently supports [RSA](https://www.rfc-editor.org/rfc/rfc7517#section-9.3) based keys only.
- **JWS** (JSON Web Signature): A cryptographic signature on a JWT, typically presented as a single object with the token and a header.
  - See [RFC 7515](https://www.rfc-editor.org/rfc/rfc7515) JSON Web Signature
  - Note: This library currently supports the [Compact Serialization](https://www.rfc-editor.org/rfc/rfc7515#section-3.1) only.
- **JWT** (JSON Web Token): A JSON object representing a number of claims such as, but not limited to: bearer identity, issuer, and validity.
  - See [RFC 7519](https://www.rfc-editor.org/rfc/rfc7519) JSON Web Token
## JWKManager
### JSON Web Keys

- `JWK`: The base key material type in IDL. This parses only the `kid` (Key ID) and `kty` (Key Type) fields.
  In order to expect and process key specific data, the `kty` must already be known, therefore
  type specific IDL structs are defined as chaining the base `JWK` type.
- `JWKRSA`: Chains the base `JWK` type and adds expected fields `n` and `e` which represent the
  modulus and public-exponent portions of the RSA key respectively.
- `JWKSet`: A simple wrapper struct containing a single field named `keys` of type `array<object>`.
  This allows the [`JWKManager`](#jwkmanager) class to load a `JWKSet` URI resource and pull out a set of keys
  which are expected to conform to the `JWK` interface, and as of this writing, represent `JWKRSA` data specifically.
These types, as well as [`JWSHeader`](#jwsheader) and [`JWT`](#jwt) can be found in [jwt_types.idl](jwt_types.idl).
### Example JWK file

A typical JSON file containing keys may look something like the following:
```json
{
"keys": [
{
"kid": "custom-key-1",
"kty": "RSA",
"n": "ALtUlNS31SzxwqMzMR9jKOJYDhHj8zZtLUYHi3s1en3wLdILp1Uy8O6Jy0Z66tPyM1u8lke0JK5gS-40yhJ-bvqioW8CnwbLSLPmzGNmZKdfIJ08Si8aEtrRXMxpDyz4Is7JLnpjIIUZ4lmqC3MnoZHd6qhhJb1v1Qy-QGlk4NJy1ZI0aPc_uNEUM7lWhPAJABZsWc6MN8flSWCnY8pJCdIk_cAktA0U17tuvVduuFX_94763nWYikZIMJS_cTQMMVxYNMf1xcNNOVFlUSJHYHClk46QT9nT8FWeFlgvvWhlXfhsp9aNAi3pX-KxIxqF2wABIAKnhlMa3CJW41323Js",
"e": "AQAB"
},
{
"kid": "custom-key-2",
"kty": "RSA",
"n": "ANBv7-YFoyL8EQVhig7yF8YJogUTW-qEkE81s_bs2CTsI1oepDFNAeMJ-Krfx1B7yllYAYtScZGo_l60R9Ou4X89LA66bnVRWVFCp1YV1r0UWtn5hJLlAbqKseSmjdwZlL_e420GlUAiyYsiIr6wltC1dFNYyykq62RhfYhM0xpnt0HiN-k71y9A0GO8H-dFU1WgOvEYMvHmDAZtAP6RTkALE3AXlIHNb4mkOc9gwwn-7cGBc08rufYcniKtS0ZHOtD1aE2CTi1MMQMKkqtVxWIdTI3wLJl1t966f9rBHR6qVtTV8Qpq1bquUc2oaHjR4lPTf0Z_hTaELJa5-BBbvJU",
"e": "AQAB"
}
]
}
```

## JWSValidator

Platform specific implementations of the cryptographic functions may be found in:
|
||||
|
||||
* Linux: [jws\_validator\_openssl.cpp](https://github.com/mongodb/mongo/blob/master/src/mongo/crypto/jws_validator_openssl.cpp)
|
||||
* Windows: [jws\_validator\_windows.cpp](https://github.com/mongodb/mongo/blob/master/src/mongo/crypto/jws_validator_windows.cpp) UNIMPLEMENTED
|
||||
* macOS: [jws\_validator\_apple.cpp](https://github.com/mongodb/mongo/blob/master/src/mongo/crypto/jws_validator_apple.cpp) UNIMPLEMENTED
|
||||
* Non-TLS builds: [jws\_validator\_none.cpp](https://github.com/mongodb/mongo/blob/master/src/mongo/crypto/jws_validator_none.cpp) UNIMPLEMENTED
|
||||
- Linux: [jws_validator_openssl.cpp](https://github.com/mongodb/mongo/blob/master/src/mongo/crypto/jws_validator_openssl.cpp)
|
||||
- Windows: [jws_validator_windows.cpp](https://github.com/mongodb/mongo/blob/master/src/mongo/crypto/jws_validator_windows.cpp) UNIMPLEMENTED
|
||||
- macOS: [jws_validator_apple.cpp](https://github.com/mongodb/mongo/blob/master/src/mongo/crypto/jws_validator_apple.cpp) UNIMPLEMENTED
|
||||
- Non-TLS builds: [jws_validator_none.cpp](https://github.com/mongodb/mongo/blob/master/src/mongo/crypto/jws_validator_none.cpp) UNIMPLEMENTED
|
||||
|
||||
## JWSValidatedToken
|
||||
|
||||
|
|
@ -88,12 +88,13 @@ maintaining a type on the post-processed token as well.
An application will construct a `JWSValidatedToken` by passing both a
signed [`JWS Compact Serialization`](#compact-serialization-format) and a [`JWKManager`](#jwkmanager).

1. The token's header is parsed using IDL type [`JWSHeader`](#jwsheader) to determine the `kid` (Key ID) which was used for signing.
2. The [`JWKManager`](#jwkmanager) is queried for a suitable [`JWSValidator`](#jwsvalidator).
3. If the requested `kid` is unknown to the [`JWKManager`](#jwkmanager), it will requery its `JWKSet` URI to reload from the key server.
4. That validator is used to check the provided signature against the header and body payload.
5. The body of the token is parsed using IDL type [`JWT`](#jwt).
6. Relevant validity claims `nbf` ([Not Before](https://www.rfc-editor.org/rfc/rfc7519.html#section-4.1.5)) and `exp` ([Expiration Time](https://www.rfc-editor.org/rfc/rfc7519.html#section-4.1.4)) are verified.

If an error is encountered at any point during this construction, a `DBException` is thrown and no `JWSValidatedToken` is created.

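A minimal sketch of this flow is shown below. The argument order, accessor result types, and include paths are assumptions made for illustration; see `jws_validated_token.h` for the real API.

```cpp
#include <string>

#include "mongo/crypto/jwk_manager.h"          // assumed include path
#include "mongo/crypto/jws_validated_token.h"  // assumed include path

namespace mongo::crypto {

std::string subjectOfToken(JWKManager& keyManager, StringData compactSerialization) {
    // Steps 1-6 above happen inside the constructor; any signature, parsing,
    // or nbf/exp failure surfaces as a DBException and no object is produced.
    JWSValidatedToken validated(keyManager, compactSerialization);

    // Claims must only ever be read from the validated object. getBodyBSON()
    // is named in this document; reading "sub" directly is shown only for
    // illustration -- prefer the typed getBody() accessors in real code.
    return validated.getBodyBSON()["sub"].str();
}

}  // namespace mongo::crypto
```
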
@ -104,7 +105,7 @@ To access fields, use the `getBody()`/`getBodyBSON()` accessors on an as-needed
from the `JWSValidatedToken` object rather than storing these values by themselves.
This ensures that the data being used has been validated.

**Directly parsing the signed JWS compact serialization using _any_ other means
than `JWSValidatedToken` should be considered an error and rejected during code review.**

## Compact Serialization Format

@ -126,7 +127,7 @@ The first section in our example is the `base64url::encode()` output of the `JWS
represented as `JSON`, so it would decode as:

```json
{ "typ": "JWT", "alg": "RS256", "kid": "custom-key-1" }
```

This tells us that the payload in the second field of the compact serialized signature is a `JWT`

@ -134,7 +135,8 @@ This tells us that the payload in the second field of the compact serialized sig
We also see that the `alg`orithm for signing uses [`RS256`](https://www.rfc-editor.org/rfc/rfc7518.html#section-3.3) (`RSASSA-PKCS1-v1_5 using SHA-256`) and that the signature can be verified using the key material associated with `custom-key-1`.

The IDL struct `JWSHeader` may be used to parse this section for access to the relevant `alg` and `kid` fields.

- See also: [RFC 7515 Section 4](https://www.rfc-editor.org/rfc/rfc7515#section-4) JOSE Header

### JWT

@ -148,20 +150,17 @@ So in this example, it would decode as:
    "sub": "user1@mongodb.com",
    "nbf": 1661374077,
    "exp": 2147483647,
    "aud": ["jwt@kernel.mongodb.com"],
    "nonce": "gdfhjj324ehj23k4",
    "mongodb-roles": ["myReadRole"]
}
```

The IDL struct `JWT` will be used to parse this section by [`JWSValidatedToken`](#jwsvalidatedtoken);
however, token payloads SHOULD NOT be inspected without processing them through
the validating infrastructure.

- See [RFC 7519 Section 4](https://www.rfc-editor.org/rfc/rfc7519#section-4) JWT Claims

Note that this token payload contains an additional field not defined by any RFC or IETF draft.
The content of this, or any other unknown field, is treated as opaque and ignored by

@ -1,3 +1,3 @@
# MongoDB Crypto library architecture guides

- [JSON Web Tokens (JWT)](README.JWT.md)

@ -1,13 +1,12 @@
# MongoDB Internals

This document aims to provide a high-level specification for MongoDB's
infrastructure to support client/server interaction and process globals.
Examples of such components are `ServiceContext` and `OperationContext`.
This is a work in progress and more sections will be added gradually.

## Server-Internal Baton Pattern

For details on the server-internal _Baton_ pattern, see [this document][baton].

[baton]: ../../../docs/baton.md

@ -3,19 +3,20 @@
## Shard Role API

Any code that accesses data collections with the intention to read or write is said to be operating
in the _Shard Role_. This contrasts with _Router Role_ operations, which do not access data
collections directly — they only route operations to the appropriate shard.

Shard Role operations are sharding-aware and thus require establishing a consistent view of the _storage engine_, _local catalog_
and _sharding catalog_. The storage engine contains the "data". The local catalog contains
shard-local metadata such as indexes and storage options. The sharding catalog contains the sharding
description (whether the collection is sharded, its shard key pattern, etc.) and the
ownership filter (which shard key ranges are owned by this shard).

Shard Role operations are also responsible for validating routing decisions taken by possibly-stale
upstream routers.

## Acquiring collections

[shard_role.h provides](https://github.com/mongodb/mongo/blob/23c92c3cca727209a68e22d2d9cabe46bac11bb1/src/mongo/db/shard_role.h#L333-L375)
the `acquireCollection*` family of primitives to acquire a consistent view of the catalogs for collections and views. Shard role code is required to use these primitives to access collections/views.

@ -51,31 +52,38 @@ CollectionOrViewAcquisitions acquireCollectionsOrViewsMaybeLockFree(
```

The dimensions of this family of methods are:

- Collection/View: Whether the caller is okay with the namespace potentially corresponding to a view or not.
- Locks/MaybeLockFree: The "MaybeLockFree" variant will skip acquiring locks if it is allowed given the opCtx state. It must only be used for read operations. An operation is allowed to skip locks if all the following conditions are met:

  - (i) it is not part of a multi-document transaction,
  - (ii) it is not already holding write locks,
  - (iii) it does not already have a non-lock-free storage transaction open.

  The normal variant acquires locks.

- One or multiple acquisitions: The "plural" variants allow acquiring multiple collections/views in a single call. Acquiring multiple collections in the same acquireCollections call prevents the global lock from getting recursively locked, which would impede yielding.

For each collection/view the caller desires to acquire, a `CollectionAcquisitionRequest`/`CollectionOrViewAcquisitionRequest` represents its prerequisites, which are:

- `nsOrUUID`: The NamespaceString or UUID of the desired collection/view.
- `placementConcern`: The sharding placementConcern, also known as ShardVersion and DatabaseVersion, that the router attached.
- `operationType`: Whether we are acquiring this collection for reading (`kRead`) or for writing (`kWrite`). `kRead` operations will keep the same orphan filter and range preserver across yields. This way, even if chunk migrations commit, the query plan is guaranteed to keep seeing the documents for the owned ranges at the time the query started.
- Optionally, `expectedUUID`: for requests where `nsOrUUID` takes the NamespaceString form, this is the UUID we expect the collection to have.

If the prerequisites can be met, then the acquisition will succeed and one or multiple `CollectionAcquisition`/`ViewAcquisition` objects are returned. These objects are the entry point for accessing the catalog information, including:

- CollectionPtr: The local catalog.
- CollectionDescription: The sharding catalog.
- ShardingOwnershipFilter: Used to filter out orphaned documents.

Additionally, these objects hold several resources during their lifetime:

- For locked acquisitions, the locks.
- For sharded collections, the _RangePreserver_, which prevents documents that became orphans after having established the collectionAcquisition from being deleted.

As an example:

```
CollectionAcquisition collection =
    acquireCollection(opCtx,
@ -91,24 +99,29 @@ collection.getShardingFilter();
```

## TransactionResources

`CollectionAcquisition`/`CollectionOrViewAcquisition` are reference-counted views to a `TransactionResources` object. `TransactionResources` is the holder of the acquisition's resources, which include the global/db/collection locks (in the case of a locked acquisition), the local catalog snapshot (collectionPtr), the sharding catalog snapshot (collectionDescription) and the ownershipFilter.

Copying a `CollectionAcquisition`/`CollectionOrViewAcquisition` object increases its associated `TransactionResources` reference counter. When it reaches zero, the resources are released.

## Acquisitions and query plans

Query plans are to use `CollectionAcquisitions` as the sole entry point to access the different catalogs (e.g. to access a CollectionPtr, to get the sharding description or the orphan filter). Plans should never store references to the catalogs because they can become invalid after a yield. Upon restore, they will find the `CollectionAcquisition` in a valid state.

## Yielding and restoring

`TransactionResources` can be detached from its current operation context and later attached to a different one -- this is the case for getMore. Acquisitions associated with a particular `TransactionResources` object must only be used by the operation context it is currently attached to.

shard_role.h provides primitives for yielding and restoring. There are two different types of yields: one where the operation will resume on the same operation context (e.g. an update write operation), and one where it will be restored onto a different operation context (e.g. a getMore).

The restore procedure checks that the acquisition prerequisites are still met, namely:

- That the collection still exists and has not been renamed.
- That the sharding placement concern can still be met. For `kWrite` acquisitions, this means that the shard version has not changed. This can be relaxed for `kRead` acquisitions: the shard version is allowed to change, because the RangePreserver guarantees that all documents corresponding to that placement version are still on the shard.

### Yield and restore to the same operation context

[`yieldTransactionResourcesFromOperationContext`](https://github.com/mongodb/mongo/blob/2e0259b3050e4c27d47e353222395d21bb80b9e4/src/mongo/db/shard_role.h#L442-L453)
yields the resources associated with the acquisition, releasing its locks, and returns a `YieldedTransactionResources`
object holding the yielded resources. After that call,
@ -138,11 +151,13 @@ myPlanExecutor.getNext();
```

### Yield and restore to a different operation context

Operations that build a plan executor and return a cursor to be consumed over repeated getMore commands do so by stashing their resources to the CursorManager. [`stashTransactionResourcesFromOperationContext`](https://github.com/mongodb/mongo/blob/2e0259b3050e4c27d47e353222395d21bb80b9e4/src/mongo/db/shard_role.h#L512-L516) yields the `TransactionResources` and detaches it from the current operation context. The yielded `TransactionResources` object is stashed to the CursorManager.

When executing a getMore, the yielded `TransactionResources` object is retrieved from the CursorManager and attached to the new operation context. This is done by constructing the `HandleTransactionResourcesFromCursor` RAII object. Its destructor will re-stash the `TransactionResources` back to the CursorManager. In case of failure during getMore, `HandleTransactionResourcesFromCursor::dismissRestoredResources()` must be called to dismiss its resources.

As an example, build a PlanExecutor and stash it to the CursorManager:

```
CollectionAcquisition collection = acquireCollection(opCtx1, CollectionAcquisitionRequest(nss, kRead, ...));
@ -159,11 +174,13 @@ stashTransactionResourcesFromOperationContext(opCtx, pinnedCursor.getCursor());
// [Command ends]
```

And now getMore consumes more documents from the cursor:

```
// --------
// [getMore command]
auto cursorPin = uassertStatusOK(CursorManager::get(opCtx2)->pinCursor(opCtx2, cursorId));

// Restore the stashed TransactionResources to the current opCtx.
HandleTransactionResourcesFromCursor transactionResourcesHandler(opCtx2, cursorPin.getCursor());
@ -175,4 +192,3 @@ while (...) {
// ~HandleTransactionResourcesFromCursor will re-stash the TransactionResources to 'cursorPin'
```

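The failure path mentioned above can be sketched as a continuation of this example. The cursor-iteration details are elided and only the dismissal is shown; this is illustrative, not the actual getMore implementation.

```cpp
// Continuation of the getMore example above (illustrative sketch).
try {
    // ... iterate the cursor's plan executor and append documents to the reply ...
} catch (const DBException&) {
    // On failure, do not let the RAII destructor re-stash resources for a
    // now-broken cursor; dismiss them instead, as described above.
    transactionResourcesHandler.dismissRestoredResources();
    throw;
}
```
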
@ -16,46 +16,45 @@ API versions.
For any API version V the following changes are prohibited, and must be introduced in a new API
version W.

- Remove StableCommand (where StableCommand is some command in V).
- Remove a documented StableCommand parameter.
- Prohibit a formerly permitted StableCommand parameter value.
- Remove a field from StableCommand's reply.
- Change the type of a field in StableCommand's reply, or expand the set of types it may be.
- Add a new value to a StableCommand reply field's enum-like fixed set of values, e.g. a new index
  type (unless there's an opt-in mechanism besides API version).
- Change semantics of StableCommand in a manner that may cause existing applications to misbehave.
- Change an error code returned in a particular error scenario, if drivers rely on the code.
- Remove a label L from an error returned in a particular error scenario which had returned an error
  labeled with L before.
- Prohibit any currently permitted CRUD syntax element, including but not limited to query and
  aggregation operators, aggregation stages and expressions, and CRUD operators.
- Remove support for a BSON type, or any other BSON format change (besides adding a type).
- Drop support for a wire protocol message type.
- Make the authorization requirements for StableCommand more restrictive.
- Increase hello.minWireVersion (or decrease maxWireVersion, which we won't do).

The following changes are permitted in V:

- Add a command.
- Add an optional command parameter.
- Permit a formerly prohibited command parameter or parameter value.
- Any change in an undocumented command parameter.
- Change any aspect of internal sharding/replication/etc. protocols.
- Add a command reply field.
- Add a new error code (provided this does not break compatibility with existing drivers and
  applications).
- Add a label to an error.
- Change order of fields in reply docs and sub-docs.
- Add a CRUD syntax element.
- Make the authorization requirements for StableCommand less restrictive.
- Add or drop support for an authentication mechanism. Authentication mechanisms may need to be
  removed due to security vulnerabilities and, as such, there is no guarantee about their
  compatibility.
- Deprecate a behavior.
- Increase hello.maxWireVersion.
- Any change in behaviors not in V.
- Performance changes.

### Enforcing Compatibility

@ -72,49 +71,55 @@ from 5.0.0 onwards. This compatibility checker script will run in evergreen patc
and in the commit queue. The script that evergreen runs is [here](https://github.com/mongodb/mongo/blob/4594ea6598ce28d01c5c5d76164b1cfeeba1494f/evergreen/check_idl_compat.sh).

### Running the Compatibility Checker Locally

To run the compatibility checker locally, first run

```
python buildscripts/idl/checkout_idl_files_from_past_releases.py -v idls
```

This creates subfolders of past releases in the `idls` folder. Then, for the old release you want to
check against, run

```
python buildscripts/idl/idl_check_compatibility.py -v --old-include idls/<old_release_dir>/src --old-include idls/<old_release_dir>/src/mongo/db/modules/enterprise/src --new-include src --new-include src/mongo/db/modules/enterprise/src idls/<old_release_dir>/src src
```

For example:

```
python buildscripts/idl/idl_check_compatibility.py -v --old-include idls/r6.0.3/src --old-include idls/r6.0.3/src/mongo/db/modules/enterprise/src --new-include src --new-include src/mongo/db/modules/enterprise/src idls/r6.0.3/src src
```

## Adding new commands and fields

**_Any additions to the Stable API must be approved by the Stable API PM and code reviewed by the
Query Optimization Team._**

Adding a new IDL command requires the `api_version` field, which indicates which Stable API version
this command is in. **_By default, the `api_version` field should be `""`._** Only if you are adding the
command to the Stable API should `api_version` be the API version you are adding it to
(currently `"1"`). **_Adding it to the Stable API means you cannot remove this
command within this API version._**

Adding a new command parameter or reply field requires the `stability` field. This field indicates
whether the command parameter/reply field is part of the Stable API. There are three options for this
field: `unstable`, `internal`, and `stable`. If you are unsure what the `stability` field for the
new command parameter or reply field should be, it **_should be marked as `stability: unstable`_**.

Only if the field should be added to the Stable API should you mark it as
`stability: stable` in IDL. Additionally, in `idl_check_compatibility.py` you must add the field to
the `ALLOWED_STABLE_FIELDS_LIST`. This list was added so that engineers are aware that by making a
field part of the stable API, **_the field cannot be changed in any way that would violate the
Stable API guidelines_** (see [above](https://github.com/mongodb/mongo/blob/master/src/mongo/db/STABLE_API_README.md#compatibility)).
Crucially, this means the field **_cannot be removed or changed to `stability: unstable` or
`stability: internal`_** while we are in the current API version.

The format of adding a field to the list is `<command_name>-<command_param_or_reply_field>-<field_name>`.

### `stability: unstable` vs. `stability: internal`

If the field should not be part of the Stable API, it should be marked as either
`stability: unstable` or `stability: internal`. Both of these mean that the field will not be a part
of the Stable API. The difference is that when we send commands from a mongos to a shard, the shard
will perform parsing validation that checks that all the command fields are part of the Stable API,

@ -125,15 +130,16 @@ marked as `stability: unstable`, unless it will go through this parsing validati
should be marked as `stability: internal`.

### `IGNORE_STABLE_TO_UNSTABLE_LIST`

The `IGNORE_STABLE_TO_UNSTABLE_LIST` exists because there have been cases where a field was added
to the Stable API accidentally, and since the field was strictly internal / not documented to users,
we changed the field to be unstable. (Note that these kinds of changes have to go through the same
approval process.) Normally changing a field from `stability: stable` to `stability: unstable` or
`stability: internal` would throw an error, so the `IGNORE_STABLE_TO_UNSTABLE_LIST` acts as an allow
list for these exceptions.

**_Additions to the `IGNORE_STABLE_TO_UNSTABLE_LIST` must be approved by the Stable API PM and code
reviewed by the Query Optimization Team._**

### The BSON serialization `any` type

@ -141,7 +147,7 @@ The `bson_serialization_type` is used to define the BSON type that an IDL field
In some cases, we need custom serializers defined in C++ to perform more complex logic,
such as validating the given type or accepting multiple types for the field. If we use these custom
serializers, we specify the `bson_serialization_type` to be `any`. However, the compatibility
checker script can’t type check `any`, since the main logic for the type exists outside of the
IDL file. As many commands have valid reasons for using type `any`, we do not restrict usage.
Instead, the command must be added to an [allowlist](https://github.com/mongodb/mongo/blob/6aaad044a819a50a690b932afeda9aa278ba0f2e/buildscripts/idl/idl_check_compatibility.py#L52).
This also applies to any fields marked as `stability: unstable`. This is to prevent unexpected

@ -191,7 +197,7 @@ Rules for feature compatibility version and API version:
### Rule 1

**The first release to support an API version W can add W in its upgraded FCV, but cannot add W in
its downgraded FCV.**

Some API versions will introduce behaviors that require disk format changes or intracluster protocol
changes that don't take effect until setFCV("R"), so for consistency, we always wait for setFCV("R")

@ -200,7 +206,7 @@ before supporting a new API version.
### Rule 2

**So that applications can upgrade without downtime from V to W, at least one release must support
both V and W in its upgraded FCV.**

This permits zero-downtime API version upgrades. If release R in its upgraded FCV "R" supports both
V and W, the customer can first upgrade to R with FCV "R" while their application is running with

@ -2,41 +2,41 @@
## Table of Contents

- [High Level Overview](#high-level-overview)
- [Authentication](#authentication)
  - [SASL](#sasl)
    - [Speculative Auth](#speculative-authentication)
    - [SASL Supported Mechs](#sasl-supported-mechs)
  - [X509 Authentication](#x509-authentication)
  - [Cluster Authentication](#cluster-authentication)
    - [X509 Intracluster Auth](#x509-intracluster-auth-and-member-certificate-rotation)
    - [Keyfile Intracluster Auth](#keyfile-intracluster-auth)
  - [Localhost Auth Bypass](#localhost-auth-bypass)
- [Authorization](#authorization)
  - [AuthName](#authname) (`UserName` and `RoleName`)
  - [Users](#users)
    - [User Roles](#user-roles)
    - [User Credentials](#user-credentials)
    - [User Authentication Restrictions](#user-authentication-restrictions)
  - [Roles](#roles)
    - [Role subordinate Roles](#role-subordinate-roles)
    - [Role Privileges](#role-privileges)
    - [Role Authentication Restrictions](#role-authentication-restrictions)
  - [User and Role Management](#user-and-role-management)
    - [UMC Transactions](#umc-transactions)
  - [Privilege](#privilege)
    - [ResourcePattern](#resourcepattern)
    - [ActionType](#actiontype)
  - [Command Execution](#command-execution)
  - [Authorization Caching](#authorization-caching)
  - [Authorization Manager External State](#authorization-manager-external-state)
  - [Types of Authorization](#types-of-authorization)
    - [Local Authorization](#local-authorization)
    - [LDAP Authorization](#ldap-authorization)
    - [X.509 Authorization](#x509-authorization)
- [Cursors and Operations](#cursors-and-operations)
- [Contracts](#contracts)
- [External References](#external-references)

## High Level Overview

@ -63,7 +63,7 @@ user credentials and roles. The authorization session is then used to check perm
## Authentication

On a server with authentication enabled, all but a small handful of commands require clients to
authenticate before performing any action. This typically occurs with a 1 to 3 round trip
conversation using the `saslStart` and `saslContinue` commands, or through a single call to the
`authenticate` command. See [SASL](#SASL) and [X.509](#X509) below for the details of these
exchanges.

@ -109,8 +109,8 @@ encountered.
To reduce connection overhead time, clients may begin and possibly complete their authentication
exchange as part of the
[`CmdHello`](https://github.com/mongodb/mongo/blob/r4.7.0/src/mongo/db/repl/replication_info.cpp#L234)
exchange. In this mode, the body of the `saslStart` or `authenticate` command used for
authentication may be embedded into the `hello` command under the field `{speculativeAuthenticate:
$bodyOfAuthCmd}`.

@ -122,82 +122,82 @@ included in the `hello` command response.
#### SASL Supported Mechs

When using the [SASL](#SASL) authentication workflow, it is necessary to select a specific mechanism
to authenticate with (e.g. SCRAM-SHA-1, SCRAM-SHA-256, PLAIN, GSSAPI, etc...). If the user has not
included the mechanism in the mongodb:// URI, then the client can ask the server what mechanisms are
available on a per-user basis before attempting to authenticate.

Therefore, during the initial handshake using
[`CmdHello`](https://github.com/mongodb/mongo/blob/r4.7.0/src/mongo/db/repl/replication_info.cpp#L234),
the client will notify the server of the user it is about to authenticate by including
`{saslSupportedMechs: 'username'}` with the `hello` command. The server will then include
`{saslSupportedMechs: [$listOfMechanisms]}` in the `hello` command's response.

This allows clients to proceed with authentication by choosing an appropriate mechanism. The
different named SASL mechanisms are listed below. If a mechanism can use a different storage method,
the storage mechanism is listed as a sub-bullet below.

- [**SCRAM-SHA-1**](https://tools.ietf.org/html/rfc5802)
  - See the section on `SCRAM-SHA-256` for details on `SCRAM`. `SCRAM-SHA-1` uses `SHA-1` for the
    hashing algorithm.
- [**SCRAM-SHA-256**](https://tools.ietf.org/html/rfc7677)
  - `SCRAM` stands for Salted Challenge Response Authentication Mechanism. `SCRAM-SHA-256` implements
    the `SCRAM` protocol and uses `SHA-256` as a hashing algorithm to complement it. `SCRAM`
    involves four messages: client first, server first, client final, and server final. During the client
    first, the client sends the username for lookup. The server uses the username to retrieve the
    relevant authentication information for the client. This generally includes the salt, StoredKey,
    ServerKey, and iteration count. The client then computes a set of values (defined in [section
    3](https://tools.ietf.org/html/rfc5802#section-3) of the `SCRAM` RFC), most notably the client
    proof and the server signature. It sends the client proof (used to authenticate the client) to
    the server, and the server then responds by sending the server proof. The hashing function used
    to hash the client password that is stored by the server is what differentiates `SCRAM-SHA-1`
    from `SCRAM-SHA-256`: `SHA-1` is used in `SCRAM-SHA-1`. `SCRAM-SHA-256` is the preferred mechanism
    over `SCRAM-SHA-1`. Note also that `SCRAM-SHA-256` performs [RFC 4013 SASLprep Unicode
    normalization](https://tools.ietf.org/html/rfc4013) on all provided passwords before hashing,
    while for backward compatibility reasons, `SCRAM-SHA-1` does not.
- [**PLAIN**](https://tools.ietf.org/html/rfc4616)
  - The `PLAIN` mechanism involves two steps for authentication. First, the client concatenates a
    message using the authorization id, the authentication id (also the username), and the password
    for a user and sends it to the server. The server validates that the information is correct and
    authenticates the user. For storage, the server hashes one copy using SHA-1 and another using
    SHA-256 so that the password is not stored in plaintext. Even when using the PLAIN mechanism,
    the same secrets as used for the SCRAM methods are stored and used for validation. The chief
    difference between using PLAIN and SCRAM-SHA-256 (or SCRAM-SHA-1) is that using SCRAM provides
    mutual authentication and avoids transmitting the password to the server. With PLAIN, it is
    less difficult for a MitM attacker to compromise original credentials.
    - **With local users**
      - When the PLAIN mechanism is used with local users, the user information is stored in the
        [user
        collection](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/authorization_manager.cpp#L56)
        on the database. See [authorization](#authorization) for more information.
    - **With Native LDAP**
      - When the PLAIN mechanism uses `Native LDAP`, the credential information is sent to and
        received from LDAP when creating and authorizing a user. The mongo server sends user
        credentials over the wire to the LDAP server and the LDAP server requests a password. The
        mongo server sends the password in plain text and LDAP responds with whether the password is
        correct. Here the communication with the driver and the mongod is the same, but the storage
        mechanism for the credential information is different.
    - **With Cyrus SASL / saslauthd**
      - When using saslauthd, the mongo server communicates with a process called saslauthd running on
        the same machine. The saslauthd process has ways of communicating with many other servers,
        LDAP servers included. Saslauthd works in the same way as Native LDAP except that the
        mongo process communicates using unix domain sockets.
- [**GSSAPI**](https://tools.ietf.org/html/rfc4752)
  - GSSAPI is an authentication mechanism that supports [Kerberos](https://web.mit.edu/kerberos/)
    authentication. GSSAPI is the communication method used to communicate with Kerberos servers and
    with clients. When initializing this auth mechanism, the server tries to acquire its credential
    information from the KDC by calling
    [`tryAcquireServerCredential`](https://github.com/10gen/mongo-enterprise-modules/blob/r4.4.0/src/sasl/mongo_gssapi.h#L36).
    If this is not approved, the server fasserts and the mechanism is not registered. On Windows,
    SChannel provides a `GSSAPI` library for the server to use. On other platforms, the Cyrus SASL
    library is used to make calls to the KDC (Kerberos key distribution center).

The specific properties that each SASL mechanism provides are outlined in the table below.

| Mechanism | Mutual Auth | No Plain Text |
| --------- | ----------- | ------------- |
| SCRAM     | X           | X             |
| PLAIN     |             |               |
| GSS-API   | X           | X             |

### <a name="x509atn"></a>X509 Authentication

@ -205,23 +205,24 @@ The specific properties that each SASL mechanism provides is outlined in this ta
certificate key exchange. When the peer certificate validation happens during the SSL handshake, an
[`SSLPeerInfo`](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/util/net/ssl_types.h#L113-L143)
is created and attached to the transport layer SessionHandle. During `MONGODB-X509` auth, the server
first determines whether the client is a driver or a peer server. The server inspects the
following criteria in this order to determine whether the connecting client is a peer server node:

1. `net.tls.clusterAuthX509.attributes` is set on the server and the parsed certificate's subject name
   contains all of the attributes and values specified in that option.
2. `net.tls.clusterAuthX509.extensionValue` is set on the server and the parsed certificate contains
   the OID 1.3.6.1.4.1.34601.2.1.2 with a value matching the one specified in that option. This OID
   is reserved for the MongoDB cluster membership extension.
3. Neither of the above options is set on the server and the parsed certificate's subject name contains
   the same DC, O, and OU as the certificate the server presents to inbound connections (`tls.certificateKeyFile`).
4. `tlsClusterAuthX509Override.attributes` is set on the server and the parsed certificate's subject name
   contains all of the attributes and values specified in that option.
5. `tlsClusterAuthX509Override.extensionValue` is set on the server and the parsed certificate contains
   the OID 1.3.6.1.4.1.34601.2.1.2 with a value matching the one specified in that option.

If all of these conditions fail, then the server grabs the client's username from the `SSLPeerInfo`
struct and verifies that the client name matches the username provided by the command object and exists
in the `$external` database. In that case, the client is authenticated as that user in `$external`.
Otherwise, authentication fails with ErrorCodes.UserNotFound.

### Cluster Authentication

@ -231,9 +232,10 @@ a server, they can use any of the authentication mechanisms described [below in
section](#sasl). When a mongod or a mongos needs to authenticate to a mongodb server, it does not
pass in distinguishing user credentials to authenticate (all servers authenticate to other servers
as the `__system` user), so most of the options described below will not necessarily work. However,
two options are available for authentication - keyfile auth and X509 auth.

#### X509 Intracluster Auth and Member Certificate Rotation

`X509` auth is described in more detail above, but a precondition to using it is having TLS enabled.
It is possible for customers to rotate their certificates or change the criteria that is used to
determine X.509 cluster membership without any downtime. When the server uses the default criteria

@ -253,9 +255,10 @@ determine X.509 cluster membership without any downtime. When the server uses th
An administrator can update the criteria the server uses to determine cluster membership alongside
certificate rotation without downtime via the following procedure:

1. Update server nodes' config files to contain the old certificate subject DN attributes or extension
   value in `setParameter.tlsClusterAuthX509Override` and the new certificate subject DN attributes
   or extension value in `net.tls.clusterAuthX509.attributes` or `net.tls.clusterAuthX509.extensionValue`.
2. Perform a rolling restart of server nodes so that they all load in the override value and new
   config options.
3. Update server nodes' config files to contain the new certificates in `net.tls.clusterFile`

@ -268,6 +271,7 @@ certificate rotation without downtime via the following procedure:
meeting the old criteria as peers.

#### Keyfile Intracluster Auth

`keyfile` auth instructs servers to authenticate to each other using the `SCRAM-SHA-256` mechanism
as the `local.__system` user, whose password can be found in the named key file. A keyfile is a file
stored on disk that servers load on startup, sending its contents when they behave as clients to another

@ -337,7 +341,7 @@ empty AuthorizedRoles set), and is thus "unauthorized", also known as "pre-auth"
When a client connects to a database and authorization is enabled, authentication sends a request to
get the authorization information of a specific user by calling addAndAuthorizeUser() on the
AuthorizationSession and passing in the `UserName` as an identifier. The `AuthorizationSession` calls
functions defined in the
[`AuthorizationManager`](https://github.com/mongodb/mongo/blob/r4.7.0/src/mongo/db/auth/authorization_manager.h)
(described in the next paragraph) to both get the correct `User` object (defined below) from the

@ -357,11 +361,11 @@ The [AuthName](auth_name.h) template
provides the generic implementation shared by `UserName` and `RoleName`.
Each of these objects is made up of three component pieces of information.

| Field     | Accessor      | Use                                                                                             |
| --------- | ------------- | ----------------------------------------------------------------------------------------------- |
| `_name`   | `getName()`   | The symbolic name associated with the user or role (e.g. 'Alice')                              |
| `_db`     | `getDB()`     | The authentication database associated with the named auth identifier (e.g. 'admin' or 'test') |
| `_tenant` | `getTenant()` | When used in multitenancy mode, this value retains a `TenantId` for authorization checking.     |

[`UserName`](user_name.h) and [`RoleName`](role_name.h) specializations are CRTP defined
to include additional `getUser()` and `getRole()` accessors which proxy to `getName()`,

@ -369,8 +373,8 @@ and provide a set of `constexpr StringData` identifiers relating to their type.
#### Serializations

- `getDisplayName()` and `toString()` create a new string of the form `name@db` for use in log messages.
- `getUnambiguousName()` creates a new string of the form `db.name` for use in generating `_id` fields for authzn documents and generating unique hashes for logical session identifiers (both forms are sketched below).

|
||||
#### Multitenancy
|
||||
|
||||
|
|
@ -385,7 +389,7 @@ When a `TenantId` is associated with an `AuthName`, it will NOT be included in `
|
|||
which is a cache value object from the ReadThroughCache (described in [Authorization
|
||||
Caching](#authorization-caching)). There can be multiple authenticated users for a single `Client`
|
||||
object. The most important elements of a `User` document are the username and the roles set that the
|
||||
user has. While each `AuthorizationManagerExternalState` implementation may define its own
|
||||
user has. While each `AuthorizationManagerExternalState` implementation may define its own
|
||||
storage mechanism for `User` object data, they all ultimately surface this data in a format
|
||||
compatible with the `Local` implementation, stored in the `admin.system.users` collection
|
||||
with a schema as follows:
|
||||
|
|
@ -432,20 +436,20 @@ with a schema as follows:
|
|||
|
||||
In order to define a set of privileges (see [role privileges](#role-privileges) below)
|
||||
granted to a given user, the user must be granted one or more `roles` on their user document,
|
||||
or by their external authentication provider. Again, a user with no roles has no privileges.
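
For illustration only, here is a minimal mongo shell sketch of granting roles at user-creation time; the user name, password, and databases are placeholders rather than anything prescribed by the server:

```
// Hypothetical example: create a user with two built-in roles.
db.getSiblingDB("admin").runCommand({
  createUser: "alice",                      // placeholder user name
  pwd: "example-password",                  // placeholder password
  roles: [
    { role: "readWrite", db: "test" },      // privileges on normal collections in 'test'
    { role: "clusterMonitor", db: "admin" } // cluster-wide monitoring privileges
  ]
})
```

The user's effective privileges are then the union of the privileges imparted by each granted role.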
|
||||
|
||||
#### User Credentials
|
||||
|
||||
The contents of the `credentials` field will depend on the configured authentication
|
||||
mechanisms enabled for the user. For external authentication providers,
|
||||
this will simply contain `$external: 1`. For `local` authentication providers,
|
||||
|
||||
this will contain any necessary parameters for validating authentications
|
||||
such as the `SCRAM-SHA-256` example above.
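
As a rough sketch of what this looks like for a `local` user (names are placeholders), a user can be limited to a single mechanism at creation time and the resulting `credentials` subdocument inspected directly, assuming the caller is authorized to read `admin.system.users`:

```
// Hypothetical example: create a user with only SCRAM-SHA-256 credentials,
// then look at the credentials subdocument stored in admin.system.users.
db.getSiblingDB("admin").runCommand({
  createUser: "carol",
  pwd: "example-password",
  roles: [],
  mechanisms: ["SCRAM-SHA-256"]
})
db.getSiblingDB("admin").system.users.findOne(
  { user: "carol", db: "admin" },
  { credentials: 1 }
)
```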
|
||||
|
||||
#### User Authentication Restrictions
|
||||
|
||||
A user definition may optionally list any number of authentication restrictions.
|
||||
Currently, only endpoint based restrictions are permitted. These require that a
|
||||
|
||||
connecting client must come from a specific IP address range (given in
|
||||
[CIDR notation](https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing)) and/or
|
||||
connect to a specific server address.
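
As a hedged sketch (addresses and names are examples only), the `authenticationRestrictions` field of `createUser` accepts `clientSource` and/or `serverAddress` lists in CIDR or plain address form:

```
// Hypothetical example: this user may only authenticate from 192.0.2.0/24
// and only when connecting to the server listening on 198.51.100.10.
db.getSiblingDB("admin").runCommand({
  createUser: "reportsUser",
  pwd: "example-password",
  roles: [{ role: "read", db: "reporting" }],
  authenticationRestrictions: [
    { clientSource: ["192.0.2.0/24"], serverAddress: ["198.51.100.10"] }
  ]
})
```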
|
||||
|
|
@ -537,21 +541,21 @@ For users possessing a given set of roles, their effective privileges and
|
|||
#### Role Privileges
|
||||
|
||||
Each role imparts privileges in the form of a set of `actions` permitted
|
||||
against a given `resource`. The strings in the `actions` list correspond
|
||||
against a given `resource`. The strings in the `actions` list correspond
|
||||
1:1 with `ActionType` values as specified [here](https://github.com/mongodb/mongo/blob/92cc84b0171942375ccbd2312a052bc7e9f159dd/src/mongo/db/auth/action_type.h).
|
||||
Resources may be specified in any of the following nine formats:
|
||||
|
||||
| `resource` | Meaning |
|
||||
| --- | --- |
|
||||
| {} | Any `normal` collection |
|
||||
| { db: 'test', collection: '' } | All `normal` collections on the named DB |
|
||||
| { db: '', collection: 'system.views' } | The specific named collection on all DBs |
|
||||
| { db: 'test', collection: 'system.view' } | The specific namespace (db+collection) as written |
|
||||
| { cluster: true } | Used only by cluster-level actions such as `replsetConfigure`. |
|
||||
| { system_bucket: '' } | Any collection with a prefix of `system.buckets.` in any db|
|
||||
| { db: '', system_buckets: 'example' } | A collection named `system.buckets.example` in any db|
|
||||
| { db: 'test', system_buckets: '' } | Any collection with a prefix of `system.buckets.` in `test` db|
|
||||
| { db: 'test', system_buckets: 'example' } | A collection named `system.buckets.example` in `test` db|
|
||||
| `resource` | Meaning |
|
||||
| ----------------------------------------- | -------------------------------------------------------------- |
|
||||
| {} | Any `normal` collection |
|
||||
| { db: 'test', collection: '' } | All `normal` collections on the named DB |
|
||||
| { db: '', collection: 'system.views' } | The specific named collection on all DBs |
|
||||
| { db: 'test', collection: 'system.view' } | The specific namespace (db+collection) as written |
|
||||
| { cluster: true } | Used only by cluster-level actions such as `replsetConfigure`. |
|
||||
| { system_bucket: '' } | Any collection with a prefix of `system.buckets.` in any db |
|
||||
| { db: '', system_buckets: 'example' } | A collection named `system.buckets.example` in any db |
|
||||
| { db: 'test', system_buckets: '' } | Any collection with a prefix of `system.buckets.` in `test` db |
|
||||
| { db: 'test', system_buckets: 'example' } | A collection named `system.buckets.example` in `test` db      |
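
To make the table concrete, the sketch below creates a hypothetical role combining two of these resource forms with a small set of actions; the role name and namespaces are placeholders:

```
// Hypothetical example: one privilege on an exact namespace and one on the
// cluster resource, mirroring two rows of the table above.
db.getSiblingDB("admin").runCommand({
  createRole: "exampleRole",
  privileges: [
    { resource: { db: "test", collection: "system.views" }, actions: ["find"] },
    { resource: { cluster: true }, actions: ["serverStatus"] }
  ],
  roles: []
})
```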
|
||||
|
||||
#### Normal resources
|
||||
|
||||
|
|
@ -563,7 +567,7 @@ All other collections are considered `normal` collections.
|
|||
#### Role Authentication Restrictions
|
||||
|
||||
Authentication restrictions defined on a role have the same meaning as
|
||||
those defined directly on users. The effective set of `authenticationRestrictions`
|
||||
|
||||
imposed on a user is the union of all direct and indirect authentication restrictions.
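
For illustration (the role name and address range are placeholders), a restriction can be attached to a role in the same way it is attached to a user; any user holding the role, directly or transitively, then inherits it:

```
// Hypothetical example: a read-only role that only takes effect for clients
// connecting from a private network range.
db.getSiblingDB("admin").runCommand({
  createRole: "officeOnlyRead",
  privileges: [
    { resource: { db: "test", collection: "" }, actions: ["find"] }
  ],
  roles: [],
  authenticationRestrictions: [{ clientSource: ["10.0.0.0/8"] }]
})
```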
|
||||
|
||||
### Privilege
|
||||
|
|
@ -577,21 +581,21 @@ the full set of privileges across all resource and actiontype combinations for th
|
|||
|
||||
#### ResourcePattern
|
||||
|
||||
A resource pattern is a combination of a [MatchType](action_type.idl) with a `NamespaceString` to possibly narrow the scope of that `MatchType`. Most MatchTypes refer to some storage resource, such as a specific collection or database; `kMatchClusterResource`, however, refers to an entire host, replica set, or cluster.
|
||||
|
||||
| MatchType | As encoded in a privilege doc | Usage |
|
||||
| -- | -- | -- |
|
||||
| `kMatchNever` | _Unexpressable_ | A base type only used internally to indicate that the privilege specified by the ResourcePattern can not match any real resource |
|
||||
| `kMatchClusterResource` | `{ cluster : true }` | Commonly used with host and cluster management actions such as `ActionType::addShard`, `ActionType::setParameter`, or `ActionType::shutdown`. |
|
||||
| `kMatchAnyResource` | `{ anyResource: true }` | Matches all storage resources, even [non-normal namespaces](#normal-namespace) such as `db.system.views`. |
|
||||
| `kMatchAnyNormalResource` | `{ db: '', collection: '' }` | Matches all [normal](#normal-namespace) storage resources. Used with [builtin role](builtin_roles.cpp) `readWriteAnyDatabase`. |
|
||||
| `kMatchDatabaseName` | `{ db: 'dbname', collection: '' }` | Matches all [normal](#normal-namespace) storage resources for a specific named database. Used with [builtin role](builtin_roles.cpp) `readWrite`. |
|
||||
| `kMatchCollectionName` | `{ db: '', collection: 'collname' }` | Matches all storage resources, normal or not, which have the exact collection suffix '`collname`'. For example, to provide read-only access to `*.system.js`. |
|
||||
| `kMatchExactNamespace` | `{ db: 'dbname', collection: 'collname' }` | Matches the exact namespace '`dbname`.`collname`'. |
|
||||
| `kMatchAnySystemBucketResource` | `{ db: '', system_buckets: '' }` | Matches the namespace pattern `*.system.buckets.*`. |
|
||||
| `kMatchAnySystemBucketInDBResource` | `{ db: 'dbname', system_buckets: '' }` | Matches the namespace pattern `dbname.system.buckets.*`. |
|
||||
| `kMatchAnySystemBucketInAnyDBResource` | `{ db: '', system_buckets: 'suffix' }` | Matches the namespace pattern `*.system.buckets.suffix`. |
|
||||
| `kMatchExactSystemBucketResource` | `{ db: 'dbname', system_buckets: 'suffix' }` | Matches the exact namespace `dbname.system.buckets.suffix`. |
|
||||
| MatchType | As encoded in a privilege doc | Usage |
|
||||
| -------------------------------------- | -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `kMatchNever` | _Unexpressable_ | A base type only used internally to indicate that the privilege specified by the ResourcePattern can not match any real resource |
|
||||
| `kMatchClusterResource` | `{ cluster : true }` | Commonly used with host and cluster management actions such as `ActionType::addShard`, `ActionType::setParameter`, or `ActionType::shutdown`. |
|
||||
| `kMatchAnyResource` | `{ anyResource: true }` | Matches all storage resources, even [non-normal namespaces](#normal-namespace) such as `db.system.views`. |
|
||||
| `kMatchAnyNormalResource` | `{ db: '', collection: '' }` | Matches all [normal](#normal-namespace) storage resources. Used with [builtin role](builtin_roles.cpp) `readWriteAnyDatabase`. |
|
||||
| `kMatchDatabaseName` | `{ db: 'dbname', collection: '' }` | Matches all [normal](#normal-namespace) storage resources for a specific named database. Used with [builtin role](builtin_roles.cpp) `readWrite`. |
|
||||
| `kMatchCollectionName` | `{ db: '', collection: 'collname' }` | Matches all storage resources, normal or not, which have the exact collection suffix '`collname`'. For example, to provide read-only access to `*.system.js`. |
|
||||
| `kMatchExactNamespace` | `{ db: 'dbname', collection: 'collname' }` | Matches the exact namespace '`dbname`.`collname`'. |
|
||||
| `kMatchAnySystemBucketResource` | `{ db: '', system_buckets: '' }` | Matches the namespace pattern `*.system.buckets.*`. |
|
||||
| `kMatchAnySystemBucketInDBResource` | `{ db: 'dbname', system_buckets: '' }` | Matches the namespace pattern `dbname.system.buckets.*`. |
|
||||
| `kMatchAnySystemBucketInAnyDBResource` | `{ db: '', system_buckets: 'suffix' }` | Matches the namespace pattern `*.system.buckets.suffix`. |
|
||||
| `kMatchExactSystemBucketResource` | `{ db: 'dbname', system_buckets: 'suffix' }` | Matches the exact namespace `dbname.system.buckets.suffix`. |
|
||||
|
||||
As `ResourcePattern`s are based on `NamespaceString`, they naturally include an optional `TenantId`,
|
||||
which scopes the pattern to a specific tenant in serverless. A user with a given `TenantId` can only
|
||||
|
|
@ -614,16 +618,16 @@ with no `TenantId`.
|
|||
|
||||
A "normal" resource is a `namespace` which does not match either of the following patterns:
|
||||
|
||||
| Namespace pattern | Examples | Usage |
|
||||
| -- | -- | -- |
|
||||
| `local.replset.*` | `local.replset.initialSyncId` | Namespaces used by Replication to manage per-host state. |
|
||||
| `*.system.*` | `admin.system.version` `myDB.system.views` | Collections used by the database to support user collections. |
|
||||
| Namespace pattern | Examples | Usage |
|
||||
| ----------------- | ------------------------------------------ | ------------------------------------------------------------- |
|
||||
| `local.replset.*` | `local.replset.initialSyncId` | Namespaces used by Replication to manage per-host state. |
|
||||
| `*.system.*` | `admin.system.version` `myDB.system.views` | Collections used by the database to support user collections. |
|
||||
|
||||
See also: [NamespaceString::isNormalCollection()](../namespace_string.h)
|
||||
|
||||
#### ActionType
|
||||
|
||||
An [ActionType](action_type.idl) is a task which a client may be expected to perform. These are combined with [ResourcePattern](#resourcepattern)s to produce a [Privilege](#privilege). Note that not all `ActionType`s make sense with all `ResourcePattern`s (e.g. `ActionType::shutdown` applied to `ResourcePattern` `{ db: 'test', collection: 'my.awesome.collection' }`); however, the system will generally not prohibit declaring these combinations.
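
As a sketch of that last point (the role and namespace are hypothetical, and this assumes the role already exists), such a mismatched combination can typically still be declared:

```
// Hypothetical example: ActionType::shutdown paired with a collection
// ResourcePattern. The combination is unlikely to be useful, but the
// server generally accepts the declaration.
db.getSiblingDB("admin").runCommand({
  grantPrivilegesToRole: "exampleRole",
  privileges: [
    {
      resource: { db: "test", collection: "my.awesome.collection" },
      actions: ["shutdown"]
    }
  ]
})
```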
|
||||
|
||||
### User and Role Management
|
||||
|
||||
|
|
@ -631,11 +635,11 @@ An [ActionType](action_type.idl) is a task which a client may be expected to per
|
|||
abstraction for mutating the contents of the local authentication database
|
||||
in the `admin.system.users` and `admin.system.roles` collections.
|
||||
These commands are implemented primarily for config and standalone nodes in
|
||||
|
||||
[user_management_commands.cpp](https://github.com/mongodb/mongo/blob/92cc84b0171942375ccbd2312a052bc7e9f159dd/src/mongo/db/commands/user_management_commands.cpp),
|
||||
and as passthrough proxies for mongos in
|
||||
|
||||
[cluster_user_management_commands.cpp](https://github.com/mongodb/mongo/blob/92cc84b0171942375ccbd2312a052bc7e9f159dd/src/mongo/s/commands/cluster_user_management_commands.cpp).
|
||||
All command payloads and responses are defined via IDL in
|
||||
|
||||
[user_management_commands.idl](https://github.com/mongodb/mongo/blob/92cc84b0171942375ccbd2312a052bc7e9f159dd/src/mongo/db/commands/user_management_commands.idl)
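
For example (user, role, and database names are placeholders), granting and revoking a role both round-trip through these command implementations, whether issued against a mongod directly or via the mongos passthroughs:

```
// Hypothetical example of two user management commands.
db.getSiblingDB("test").runCommand({ grantRolesToUser: "alice", roles: ["readWrite"] })
db.getSiblingDB("test").runCommand({ revokeRolesFromUser: "alice", roles: ["readWrite"] })
```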
|
||||
|
||||
#### UMC Transactions
|
||||
|
||||
|
|
@ -645,7 +649,7 @@ validating that the command's arguments refer to extant roles, actions,
|
|||
and other user-defined values.
|
||||
|
||||
The `dropRole` and `dropAllRolesFromDatabase` commands can not be
|
||||
expressed as a single CRUD op. Instead, they must issue all three of the following ops:
|
||||
expressed as a single CRUD op. Instead, they must issue all three of the following ops:
|
||||
|
||||
1. `Update` the users collection to strip the role(s) from all users possessing it directly.
|
||||
1. `Update` the roles collection to strip the role(s) from all other roles possessing it as a subordinate.
|
||||
|
|
@ -826,6 +830,7 @@ and checks the current client's authorized users and authorized impersonated use
|
|||
Contracts](https://github.com/mongodb/mongo/blob/r4.9.0-rc0/src/mongo/db/auth/authorization_contract.h)
|
||||
were added in v5.0 to support API Version compatibility testing. Authorization contracts consist of
|
||||
three pieces:
|
||||
|
||||
1. A list of privileges and checks a command makes against `AuthorizationSession` to check if a user
|
||||
is permitted to run the command. These privileges and checks are declared in an IDL file in the
|
||||
`access_check` section. The contract is compiled into the command definition and is available via
|
||||
|
|
@ -843,18 +848,18 @@ exceptions are `getMore` and `explain` since they inherit their checks from othe
|
|||
|
||||
Refer to the following links for definitions of the Classes referenced in this document:
|
||||
|
||||
| Class | File | Description |
|
||||
| --- | --- | --- |
|
||||
| `ActionType` | [mongo/db/auth/action\_type.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/action_type.h) | High level categories of actions which may be performed against a given resource (e.g. `find`, `insert`, `update`, etc...) |
|
||||
| `AuthenticationSession` | [mongo/db/auth/authentication\_session.h](https://github.com/mongodb/mongo/blob/master/src/mongo/db/auth/authentication_session.h) | Session object to persist Authentication state |
|
||||
| `AuthorizationContract` | [mongo/db/auth/authorization_contract.h](https://github.com/mongodb/mongo/blob/r4.9.0-rc0/src/mongo/db/auth/authorization_contract.h) | Contract generated by IDL|
|
||||
| `AuthorizationManager` | [mongo/db/auth/authorization\_manager.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/authorization_manager.h) | Interface to external state providers |
|
||||
| `AuthorizationSession` | [mongo/db/auth/authorization\_session.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/authorization_session.h) | Representation of currently authenticated and authorized users on the `Client` connection |
|
||||
| `AuthzManagerExternalStateLocal` | [.../authz\_manager\_external\_state\_local.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/authz_manager_external_state_local.h) | `Local` implementation of user/role provider |
|
||||
| `AuthzManagerExternalStateLDAP` | [.../authz\_manager\_external\_state\_ldap.h](https://github.com/10gen/mongo-enterprise-modules/blob/r4.4.0/src/ldap/authz_manager_external_state_ldap.h) | `LDAP` implementation of users/role provider |
|
||||
| `Client` | [mongo/db/client.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/client.h) | An active client session, typically representing a remote driver or shell |
|
||||
| `Privilege` | [mongo/db/auth/privilege.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/privilege.h) | A set of `ActionType`s permitted on a particular `resource` |
|
||||
| `ResourcePattern` | [mongo/db/auth/resource\_pattern.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/resource_pattern.h) | A reference to a namespace, db, collection, or cluster to apply a set of `ActionType` privileges to |
|
||||
| `RoleName` | [mongo/db/auth/role\_name.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/role_name.h) | A typed tuple containing a named role on a particular database |
|
||||
| `User` | [mongo/db/auth/user.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/user.h) | A representation of an authorization user, including all direct and subordinate roles and their privileges and authentication restrictions |
|
||||
| `UserName` | [mongo/db/auth/user\_name.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/user_name.h) | A typed tuple containing a named user on a particular database |
|
||||
| Class | File | Description |
|
||||
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `ActionType` | [mongo/db/auth/action_type.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/action_type.h) | High level categories of actions which may be performed against a given resource (e.g. `find`, `insert`, `update`, etc...) |
|
||||
| `AuthenticationSession` | [mongo/db/auth/authentication_session.h](https://github.com/mongodb/mongo/blob/master/src/mongo/db/auth/authentication_session.h) | Session object to persist Authentication state |
|
||||
| `AuthorizationContract` | [mongo/db/auth/authorization_contract.h](https://github.com/mongodb/mongo/blob/r4.9.0-rc0/src/mongo/db/auth/authorization_contract.h) | Contract generated by IDL |
|
||||
| `AuthorizationManager` | [mongo/db/auth/authorization_manager.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/authorization_manager.h) | Interface to external state providers |
|
||||
| `AuthorizationSession` | [mongo/db/auth/authorization_session.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/authorization_session.h) | Representation of currently authenticated and authorized users on the `Client` connection |
|
||||
| `AuthzManagerExternalStateLocal` | [.../authz_manager_external_state_local.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/authz_manager_external_state_local.h) | `Local` implementation of user/role provider |
|
||||
| `AuthzManagerExternalStateLDAP` | [.../authz_manager_external_state_ldap.h](https://github.com/10gen/mongo-enterprise-modules/blob/r4.4.0/src/ldap/authz_manager_external_state_ldap.h) | `LDAP` implementation of users/role provider |
|
||||
| `Client` | [mongo/db/client.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/client.h) | An active client session, typically representing a remote driver or shell |
|
||||
| `Privilege`                      | [mongo/db/auth/privilege.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/privilege.h)                                                | A set of `ActionType`s permitted on a particular `resource`                                                                                |
|
||||
| `ResourcePattern` | [mongo/db/auth/resource_pattern.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/resource_pattern.h) | A reference to a namespace, db, collection, or cluster to apply a set of `ActionType` privileges to |
|
||||
| `RoleName` | [mongo/db/auth/role_name.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/role_name.h) | A typed tuple containing a named role on a particular database |
|
||||
| `User`                           | [mongo/db/auth/user.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/user.h)                                                          | A representation of an authorization user, including all direct and subordinate roles and their privileges and authentication restrictions |
|
||||
| `UserName` | [mongo/db/auth/user_name.h](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/auth/user_name.h) | A typed tuple containing a named user on a particular database |
|
||||
|
|
|
|||
File diff suppressed because it is too large
Load Diff
|
|
@ -20,12 +20,12 @@ shared, nor do they have identity (i.e. variables with the same numeric value ar
|
|||
different entities). Some SBE values are [modeled off of
|
||||
BSONTypes](https://github.com/mongodb/mongo/blob/f2b093acd48aee3c63d1a0e80a101eeb9925834a/src/mongo/bson/bsontypes.h#L63-L114)
|
||||
while others represent internal C++ types such as
|
||||
[collators](https://github.com/mongodb/mongo/blob/d19ea3f3ff51925e3b45c593217f8901373e4336/src/mongo/db/exec/sbe/values/value.h#L216-L217).
|
||||
[collators](https://github.com/mongodb/mongo/blob/d19ea3f3ff51925e3b45c593217f8901373e4336/src/mongo/db/exec/sbe/values/value.h#L216-L217).
|
||||
|
||||
One type that deserves a special mention is `Nothing`, which indicates the absence of a value. It is
|
||||
often used in SBE to indicate that a result cannot be computed instead of raising an exception
|
||||
(similar to the [Maybe
|
||||
Monad](https://en.wikipedia.org/wiki/Monad_(functional_programming)#An_example:_Maybe) in many
|
||||
Monad](<https://en.wikipedia.org/wiki/Monad_(functional_programming)#An_example:_Maybe>) in many
|
||||
functional programming languages).
|
||||
|
||||
Values are identified by a [1 byte
|
||||
|
|
@ -46,39 +46,39 @@ class](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd
|
|||
EExpressions form a tree and their goal is to produce values during evaluation. It's worth noting
|
||||
that EExpressions aren't tied to expressions in the Mongo Query Language; rather, they are meant to
|
||||
be building blocks that can be combined to express arbitrary query language semantics. Below is an
|
||||
overview of the different EExpression types:
|
||||
overview of the different EExpression types:
|
||||
|
||||
- [EConstant](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/expressions/expression.h#L251-L279):
|
||||
As the name suggests, this expression type stores a single, immutable SBE value. An `EConstant`
|
||||
manages the value's lifetime (that is, it releases the value's memory on destruction if
|
||||
necessary).
|
||||
- [EPrimUnary and
|
||||
EPrimBinary](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/expressions/expression.h#L324-L414):
|
||||
These expressions represent basic logical, arithmetic, and comparison operations that take one and
|
||||
two arguments, respectively.
|
||||
- [EIf](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/expressions/expression.h#L440-L461):
|
||||
Represents an 'if then else' expression.
|
||||
- [EFunction](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/expressions/expression.h#L416-L438):
|
||||
Represents a named, built-in function supported natively by the engine. At the time of writing, there are over [150 such
|
||||
functions](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/expressions/expression.cpp#L564-L567).
|
||||
Note that function parameters are evaluated first and then are passed as arguments to the
|
||||
function.
|
||||
- [EFail](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/expressions/expression.h#L511-L534):
|
||||
Represents an exception and produces a query fatal error if reached at query runtime. It supports numeric error codes and error strings.
|
||||
- [ENumericConvert](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/expressions/expression.h#L536-L566):
|
||||
Represents the conversion of an arbitrary value to a target numeric type.
|
||||
- [EVariable](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/expressions/expression.h#L281-L319)
|
||||
Provides the ability to reference a variable defined elsewhere.
|
||||
- [ELocalBind](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/expressions/expression.h#L463-L485)
|
||||
Provides the ability to define multiple variables in a local scope. They are particularly useful
|
||||
when we want to reference some intermediate value multiple times.
|
||||
- [ELocalLambda](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/expressions/expression.h#L487-L507)
|
||||
Represents an anonymous function which takes a single input parameter. Many `EFunctions` accept
|
||||
these as parameters. A good example of this is the [`traverseF`
|
||||
function](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/vm/vm.cpp#L1329-L1357):
|
||||
it accepts 2 parameters: an input and an `ELocalLambda`. If the input is an array, the
|
||||
`ELocalLambda` is applied to each element in the array, otherwise, it is applied to the input on
|
||||
its own.
|
||||
|
||||
|
||||
EExpressions cannot be executed directly. Rather, [they are
|
||||
compiled](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/expressions/expression.h#L81-L84)
|
||||
|
|
@ -102,6 +102,7 @@ For more details about the VM, including how `EExpression` compilation and `Byte
|
|||
in detail, please reference [the Virtual Machine section below](#virtual-machine).
|
||||
|
||||
## Slots
|
||||
|
||||
To make use of SBE values (either those produced by executing `ByteCode`, or those maintained
|
||||
elsewhere), we need a mechanism to reference them throughout query execution. This is where slots
|
||||
come into play: A slot is a mechanism for reading and writing values at query runtime. Each slot is
|
||||
|
|
@ -109,10 +110,11 @@ come into play: A slot is a mechanism for reading and writing values at query ru
|
|||
SlotId](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/values/slot.h#L41-L48).
|
||||
Put another way, slots conceptually represent values that we care about during query execution,
|
||||
including:
|
||||
- Records and RecordIds retrieved from a collection
|
||||
- The result of evaluating an expression
|
||||
- The individual components of a sort key (where each component is bound to its own slot)
|
||||
- The result of executing some computation expressed in the input query
|
||||
|
||||
|
||||
|
||||
SlotIds by themselves don't provide a means to access or set values, rather, [slots are associated
|
||||
with
|
||||
|
|
@ -120,21 +122,21 @@ SlotAccessors](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5
|
|||
which provide the API to read the values bound to slots as well as to write new values into slots.
|
||||
There are several types of SlotAccessors, but the most common are the following:
|
||||
|
||||
- The
|
||||
[`OwnedValueAccessor`](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/values/slot.h#L113-L212)
|
||||
allows for ownership of values. That is, this accessor is responsible for constructing/destructing
|
||||
values (in the case of deep values, this involves allocating/releasing memory). Note that an
|
||||
`OwnedValueAccessor` _can_ own values, but is not required to do so.
|
||||
- The
|
||||
[`ViewOfValueAccessor`](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/values/slot.h#L81-L111)
|
||||
provides a way to read values that are owned elsewhere.
|
||||
|
||||
|
||||
While a value bound to a slot can only be managed by a single `OwnedValueAccessor`, any number of
|
||||
`ViewOfValueAccessors` can be initialized to read the value associated with that slot.
|
||||
|
||||
A good example of the distinction between these two types of SlotAccessors is a query plan which
|
||||
performs a blocking sort over a collection scan. Suppose we are scanning a restaurants collection
|
||||
and we wish to find the highest rated restaurants. Such a query execution plan might look like:
|
||||
and we wish to find the highest rated restaurants. Such a query execution plan might look like:
|
||||
|
||||
```
|
||||
sort [output = s1] [sortBy = s2]
|
||||
|
|
@ -161,31 +163,34 @@ stages (as opposed to a push-based model where stages offer data to parent stage
|
|||
|
||||
A single `PlanStage` may have any number of children and performs some action, implements some algorithm,
|
||||
or maintains some execution state, such as:
|
||||
- Computing values bound to slots
|
||||
- Managing the lifetime of values in slots
|
||||
- Executing compiled `ByteCode`
|
||||
- Buffering values into memory
|
||||
|
||||
|
||||
|
||||
SBE PlanStages also follow an iterator model and perform query execution through the following steps:
|
||||
- First, a caller prepares a PlanStage tree for execution by calling `prepare()`.
|
||||
- Once the tree is prepared, the caller then calls `open()` to initialize the tree with any state
|
||||
needed for query execution. Note that this may include performing actual execution work, as is
|
||||
done by stages such as `HashAggStage` and `SortStage`.
|
||||
- With the PlanStage tree initialized, query execution can now proceed through iterative calls to
|
||||
`getNext()`. Note that the result set can be obtained in between calls to `getNext()` by reading
|
||||
values from slots.
|
||||
- Finally, `close()` is called to indicate that query execution is complete and release resources.
|
||||
|
||||
|
||||
The following subsections describe [the PlanStage API](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/stages/stages.h#L557-L651) introduced above in greater detail:
|
||||
|
||||
### `virtual void prepare(CompileCtx& ctx) = 0;`
|
||||
|
||||
This method prepares a `PlanStage` (and, recursively, its children) for execution by:
|
||||
- Performing slot resolution, that is, obtaining `SlotAccessors` for all slots that this stage
|
||||
references and verifying that all slot accesses are valid. Typically, this is done by asking
|
||||
child stages for a `SlotAccessor*` via `getAccessor()`.
|
||||
- Compiling `EExpressions` into executable `ByteCode`. Note that `EExpressions` can reference slots
|
||||
through the `ctx` parameter.
|
||||
|
||||
|
||||
|
||||
### `virtual value::SlotAccessor* getAccessor(CompileCtx& ctx, value::SlotId slot) = 0;`
|
||||
|
||||
|
|
@ -203,18 +208,20 @@ slots in said subtree invalid. For more details on slot resolution, consult [the
|
|||
section](#slot-resolution).
|
||||
|
||||
### `virtual void open(bool reOpen) = 0;`
|
||||
|
||||
### `virtual void close() = 0;`
|
||||
|
||||
These two methods mirror one another. While `open()` acquires necessary resources before `getNext()`
|
||||
can be called (that is, before `PlanStage` execution can begin), `close()` releases any resources
|
||||
acquired during `open()`. Acquiring resources for query execution can include actions such as:
|
||||
- Opening storage engine cursors.
|
||||
- Allocating memory.
|
||||
- Populating a buffer with results by exhausting child stages. This is often done by blocking stages
|
||||
which require processing their input to produce results. For example, the
|
||||
[`SortStage`](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/stages/sort.cpp#L340-L349)
|
||||
needs to sort all of the values produced by its children (either in memory or on disk) before it
|
||||
can produce results in sorted order.
|
||||
|
||||
|
||||
|
||||
It is only legal to call `close()` on PlanStages that have called `open()`, and to call `open()` on
|
||||
PlanStages that are closed. In some cases (such as in
|
||||
|
|
@ -222,7 +229,7 @@ PlanStages that are closed. In some cases (such as in
|
|||
a parent stage may `open()` and `close()` a child stage repeatedly. However, doing so may be
|
||||
expensive and ultimately redundant. This is where the `reOpen` parameter of `open()` comes in: when
|
||||
set to `true`, it provides the opportunity to execute an optimized sequence of `close()` and
|
||||
`open()` calls.
|
||||
`open()` calls.
|
||||
|
||||
A good example of this is the [HashAgg
|
||||
stage](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/stages/hash_agg.cpp#L426):
|
||||
|
|
@ -239,8 +246,8 @@ on child stages as needed and update the values held in slots that belong to it.
|
|||
`ADVANCED` to indicate that `getNext()` can still be called and `IS_EOF` to indicate that no more
|
||||
calls to `getNext()` can be made (that is, this `PlanStage` has completed execution). Importantly,
|
||||
`PlanStage::getNext()` does _not_ return any results directly. Rather, it updates the state of
|
||||
slots, which can then be read to obtain the results of the query.
|
||||
|
||||
slots, which can then be read to obtain the results of the query.
|
||||
|
||||
At the time of writing, there are 36 PlanStages. As such, only a handful of common stages'
|
||||
`getNext()` implementations are described below:
|
||||
|
||||
|
|
@ -285,7 +292,7 @@ subsequent `getNext()` calls until `IS_EOF` is returned. This stage supports [Ri
|
|||
Joins](https://github.com/mongodb/mongo/blob/dbbabbdc0f3ef6cbb47500b40ae235c1258b741a/src/mongo/db/exec/sbe/stages/loop_join.h#L47).
|
||||
|
||||
Note that slots from the outer stage can be made visible to [inner stage via
|
||||
LoopJoinStage::_outerCorrelated](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/stages/loop_join.cpp#L105-L107),
|
||||
LoopJoinStage::\_outerCorrelated](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933bd522d3/src/mongo/db/exec/sbe/stages/loop_join.cpp#L105-L107),
|
||||
which adds said slots to the `CompileCtx` during `prepare()`. Conceptually, this is similar to the
|
||||
rules around scoped variables in for loops in many programming languages:
|
||||
|
||||
|
|
@ -297,9 +304,10 @@ for (let [outerSlot1, outerSlot2] of outerSlots) {
|
|||
}
|
||||
}
|
||||
```
|
||||
In the example above, the declaration of `res1` is invalid because values on the inner side are not
|
||||
visible outside of the inner loop, while the declaration of `res2` is valid because values on the
|
||||
outer side are visible to the inner side.
|
||||
|
||||
|
||||
|
||||
Also note that in the example above, the logical result of `LoopJoinStage` is a pairing of the tuple
|
||||
of slots visible from the outer side along with the tuple of slots from the inner side.
|
||||
|
|
@ -313,7 +321,7 @@ Documents:
|
|||
{ "_id" : 0, "name" : "Mihai Andrei", "major" : "Computer Science", "year": 2019}
|
||||
{ "_id" : 1, "name" : "Jane Doe", "major" : "Computer Science", "year": 2020}
|
||||
|
||||
Indexes:
|
||||
{"major": 1}
|
||||
```
|
||||
|
||||
|
|
@ -324,6 +332,7 @@ db.alumni.find({"major" : "Computer Science", "year": 2020});
|
|||
```
|
||||
|
||||
The query plan chosen by the classic optimizer, represented as a `QuerySolution` tree, to answer this query is as follows:
|
||||
|
||||
```
|
||||
{
|
||||
"stage" : "FETCH",
|
||||
|
|
@ -357,9 +366,11 @@ The query plan chosen by the classic optimizer, represented as a `QuerySolution`
|
|||
}
|
||||
}
|
||||
```
|
||||
|
||||
In particular, it is an `IXSCAN` over the `{"major": 1}` index, followed by a `FETCH` and a filter of
|
||||
`year = 2020`. The SBE plan (generated by the [SBE stage builder](#sbe-stage-builder) with the [plan
|
||||
cache](#sbe-plan-cache) disabled) for this query plan is as follows:
|
||||
|
||||
```
|
||||
*** SBE runtime environment slots ***
|
||||
$$RESULT=s7 env: { s1 = Nothing (nothing), s6 = {"major" : 1} }
|
||||
|
|
@ -380,15 +391,15 @@ the numbers in brackets correspond to the `QuerySolutionNode` that each SBE `Pla
|
|||
`FETCH`).
|
||||
|
||||
We can represent the state of query execution in SBE by a table that shows the values bound to slots
|
||||
at a point in time:
|
||||
at a point in time:
|
||||
|
||||
|Slot |Name |Value|Owned by|
|
||||
|-----|-----|-----|--------------|
|
||||
s2 | Seek RID slot | `Nothing` | `ixseek`
|
||||
s5 | Index key slot | `Nothing` | `ixseek`
|
||||
s7 | Record slot | `Nothing` | `seek`
|
||||
s8 | RecordId slot | `Nothing` | `seek`
|
||||
s9 | Slot for the field 'year' | `Nothing` | `seek`
|
||||
| Slot | Name | Value | Owned by |
|
||||
| ---- | ------------------------- | --------- | -------- |
|
||||
| s2 | Seek RID slot | `Nothing` | `ixseek` |
|
||||
| s5 | Index key slot | `Nothing` | `ixseek` |
|
||||
| s7 | Record slot | `Nothing` | `seek` |
|
||||
| s8 | RecordId slot | `Nothing` | `seek` |
|
||||
| s9 | Slot for the field 'year' | `Nothing` | `seek` |
|
||||
|
||||
Initially, all slots hold a value of `Nothing`. Note also that some slots have been omitted for
|
||||
brevity, namely, s3, s4, and s6 (which correspond to a `SnapshotId`, an index identifier and an
|
||||
|
|
@ -402,13 +413,13 @@ calling `getNext()` on the inner `seek` stage. Following the specified index bou
|
|||
will seek to the `{"": "Computer Science"}` index key and fill out slots `s2` and `s5`. At this
|
||||
point, our slots are bound to the following values:
|
||||
|
||||
|Slot |Name |Value|Owned by|
|
||||
|-----|-----|-----|--------------|
|
||||
s2 | Seek RID slot | `<RID for _id: 0>` | `ixseek`
|
||||
s5 | Index key slot | `{"": "Computer Science"}` | `ixseek`
|
||||
s7 | Record slot | `Nothing` | `seek`
|
||||
s8 | RecordId slot | `Nothing` | `seek`
|
||||
s9 | Slot for the field 'year' | `Nothing` | `seek`
|
||||
| Slot | Name | Value | Owned by |
|
||||
| ---- | ------------------------- | -------------------------- | -------- |
|
||||
| s2 | Seek RID slot | `<RID for _id: 0>` | `ixseek` |
|
||||
| s5 | Index key slot | `{"": "Computer Science"}` | `ixseek` |
|
||||
| s7 | Record slot | `Nothing` | `seek` |
|
||||
| s8 | RecordId slot | `Nothing` | `seek` |
|
||||
| s9 | Slot for the field 'year' | `Nothing` | `seek` |
|
||||
|
||||
After `ixseek` returns `ADVANCED`, `nlj` will call `getNext` on the child `limit` stage, which will
|
||||
return `IS_EOF` after one call to `getNext()` on `seek` (in this way, a `limit 1 + seek` plan
|
||||
|
|
@ -416,13 +427,13 @@ executes a logical `FETCH`). The ScanStage will seek its cursor to the RecordId
|
|||
that RID is not the same as `_id`), bind values to slots for the RecordId, Record, and the value for
|
||||
the field 'year', and finally return `ADVANCED`. Our slots now look like so:
|
||||
|
||||
|Slot |Name |Value|Owned by|
|
||||
|-----|-----|-----|--------------|
|
||||
s2 | Seek RID slot | `<RID for _id: 0>` | `ixseek`
|
||||
s5 | Index key slot | `{"": "Computer Science"}` | `ixseek`
|
||||
s7 | Record slot | `{ "_id" : 0, "name" : "Mihai Andrei",`<br />`"major" : "Computer Science", "year": 2019}` | `seek`
|
||||
s8 | RecordId slot | `<RID for _id: 0>` | `seek`
|
||||
s9 | Slot for the field 'year' | `2019` | `seek`
|
||||
| Slot | Name | Value | Owned by |
|
||||
| ---- | ------------------------- | ------------------------------------------------------------------------------------------ | -------- |
|
||||
| s2 | Seek RID slot | `<RID for _id: 0>` | `ixseek` |
|
||||
| s5 | Index key slot | `{"": "Computer Science"}` | `ixseek` |
|
||||
| s7 | Record slot | `{ "_id" : 0, "name" : "Mihai Andrei",`<br />`"major" : "Computer Science", "year": 2019}` | `seek` |
|
||||
| s8 | RecordId slot | `<RID for _id: 0>` | `seek` |
|
||||
| s9 | Slot for the field 'year' | `2019` | `seek` |
|
||||
|
||||
Note that although `s8` and `s2` hold the same value (the RecordId for `_id: 0`), they represent
|
||||
different entities. `s2` holds the starting point for our `seek` stage (provided by `ixseek`),
|
||||
|
|
@ -432,7 +443,7 @@ read, which is also surfaced externally as the query result (provided that our `
|
|||
`seek` will return control to `nlj`, which returns control to `filter`. We now have a value for `s9`
|
||||
and can execute our filter expression. When executing the `ByteCode` for our filter expression with
|
||||
`s9` bound to 2019, the result is `false` because 2019 is not equal to 2020. As such, the `filter`
|
||||
stage must call `getNext()` on `nlj` once more.
|
||||
stage must call `getNext()` on `nlj` once more.
|
||||
|
||||
The slot tables which result from the next call to `FilterStage::getNext()` are left as an exercise
|
||||
to the reader.
|
||||
|
|
@ -446,34 +457,41 @@ README](https://github.com/mongodb/mongo/blob/06a931ffadd7ce62c32288d03e5a38933b
|
|||
for more details.
|
||||
|
||||
## Incomplete Sections Below (TODO)
|
||||
## Runtime Planners
|
||||
|
||||
Outline:
|
||||
|
||||
### `MultiPlanner`
|
||||
|
||||
### `CachedSolutionPlanner`
|
||||
|
||||
### `SubPlanner`
|
||||
|
||||
## Virtual Machine
|
||||
|
||||
Outline:
|
||||
- Compilation of EExpressions
|
||||
- Frames/Labels
|
||||
- ByteCode Execution
|
||||
- Dispatch of instructions
|
||||
- Parameter resolution
|
||||
- Management of values
|
||||
|
||||
|
||||
|
||||
## Slot Resolution
|
||||
|
||||
Outline:
|
||||
- Binding reflectors
|
||||
- Other SlotAccessor types (`SwitchAccessor`, `MaterializedRowAccessor`)
|
||||
|
||||
|
||||
|
||||
## Yielding
|
||||
|
||||
Outline:
|
||||
- What is yielding and why we yield
|
||||
- `doSaveState()/doRestoreState()`
|
||||
- Index Key Consistency/Corruption checks
|
||||
|
||||
|
||||
|
||||
## Block Processing
|
||||
|
|
|
|||
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
## Table of Contents
|
||||
|
||||
- [High Level Overview](#high-level-overview)
|
||||
- [High Level Overview](#high-level-overview)
|
||||
|
||||
## High Level Overview
|
||||
|
||||
|
|
@ -23,9 +23,9 @@ process information for the controller. These sets of collector objects are stor
|
|||
object, allowing all the data to be collected through one call to collect on the
|
||||
`FTDCCollectorCollection`. There are two sets of `FTDCCollectorCollection` objects on the
|
||||
controller:
|
||||
[_periodicCollectors](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/ftdc/controller.h#L200-L201)
|
||||
[\_periodicCollectors](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/ftdc/controller.h#L200-L201)
|
||||
that collects data at a specified time interval, and
|
||||
[_rotateCollectors](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/ftdc/controller.h#L207-L208)
|
||||
[\_rotateCollectors](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/ftdc/controller.h#L207-L208)
|
||||
that collects one set of data every time a file is created.
|
||||
|
||||
At specified time intervals, the FTDC Controller calls collect on the `_periodicCollectors`
|
||||
|
|
|
|||
|
|
@ -2,72 +2,79 @@
|
|||
|
||||
This module can run server health checks and crash an unhealthy server.
|
||||
|
||||
_Note:_ in the 4.4 release, only the mongos proxy server is supported
|
||||
|
||||
## Health Observers
|
||||
|
||||
_Health Observers_ are designed to run one particular health check each. Each observer can be configured to be on or off and to be critical or non-critical, which determines whether a failed check can crash the server. Each observer has a configurable interval for how often it runs its check.
|
||||
|
||||
## Health Observers Parameters
|
||||
|
||||
|
||||
- healthMonitoringIntensities: main configuration for each observer. Can be set at startup and changed at runtime. Valid values:
|
||||
|
||||
- off: this observer is off
|
||||
- critical: if the observer detects a failure, the process will crash
|
||||
- non-critical: if the observer detects a failure, the error will be logged and the process will not crash
|
||||
|
||||
|
||||
Example as startup parameter:
|
||||
|
||||
|
||||
```
|
||||
mongos --setParameter "healthMonitoringIntensities={ \"values\" : [{ \"type\" : \"ldap\", \"intensity\" : \"critical\" } ]}"
|
||||
```
|
||||
|
||||
|
||||
Example as runtime change command:
|
||||
|
||||
|
||||
```
|
||||
db.adminCommand({ "setParameter": 1,
|
||||
healthMonitoringIntensities: {values:
|
||||
[{type: "ldap", intensity: "critical"}] } });
|
||||
```
|
||||
|
||||
|
||||
- healthMonitoringIntervals: how often this health observer will run, in milliseconds.
|
||||
|
||||
Example as startup parameter:
|
||||
|
||||
```
|
||||
mongos --setParameter "healthMonitoringIntervals={ \"values\" : [ { \"type\" : \"ldap\", \"interval\" : 30000 } ] }"
|
||||
```
|
||||
|
||||
Here the LDAP health observer is configured to run every 30 seconds.
|
||||
|
||||
Example as runtime change command:
|
||||
|
||||
```
|
||||
db.adminCommand({"setParameter": 1, "healthMonitoringIntervals":{"values": [{"type":"ldap", "interval": 30000}]} });
|
||||
```
|
||||
|
||||
## LDAP Health Observer
|
||||
|
||||
The LDAP Health Observer checks that at least one of the configured LDAP servers is up and running. At every run, it creates a new connection to every configured LDAP server and runs a simple query. The LDAP health observer uses the same parameters as described in the **LDAP Authorization** section of the manual.
|
||||
|
||||
To enable this observer, use the *healthMonitoringIntensities* and *healthMonitoringIntervals* parameters as described above. The recommended value for the LDAP monitoring interval is 30 seconds.
|
||||
|
||||
To enable this observer, use the _healthMonitoringIntensities_ and _healthMonitoringIntervals_ parameters as described above. The recommended value for the LDAP monitoring interval is 30 seconds.
|
||||
|
||||
## Active Fault
|
||||
|
||||
When a failure is detected and the observer is configured as _critical_, the server will wait for the configured interval before crashing. The interval between failure detection and the crash is configured with the _activeFaultDurationSecs_ parameter:
|
||||
|
||||
- activeFaultDurationSecs: how long to wait from the failure detection to crash, in seconds. This can be configured at startup and changed at runtime.
|
||||
- activeFaultDurationSecs: how long to wait from the failure detection to crash, in seconds. This can be configured at startup and changed at runtime.
|
||||
|
||||
Example:
|
||||
```
|
||||
db.adminCommand({"setParameter": 1, activeFaultDurationSecs: 300});
|
||||
```
|
||||
Example:
|
||||
|
||||
```
|
||||
db.adminCommand({"setParameter": 1, activeFaultDurationSecs: 300});
|
||||
```
|
||||
|
||||
## Progress Monitor

_Progress Monitor_ ensures that health checks do not get stuck, i.e. hang without returning either success or failure. If a health check starts but does not complete, the server will crash. This behavior can be configured with:

- progressMonitor: configure the progress monitor. Values:
  - _interval_: how often to run the liveness check, in milliseconds
  - _deadline_: timeout before crashing the server if a health check is not making progress, in seconds

Example:

```
mongos --setParameter "progressMonitor={ \"interval\" : 1000, \"deadline\" : 300 }"
```

# Query System Internals

_Disclaimer_: This is a work in progress. It is not complete and we will
do our best to complete it in a timely manner.

# Overview

Here we will divide it into the following phases and topics:

- **Command Parsing & Validation:** Which arguments to the command are
  recognized and do they have the right types?
- **Query Language Parsing & Validation:** More complex parsing of
  elements like query predicates and aggregation pipelines, which are
  skipped in the first section due to complexity of parsing rules.
- **Query Optimization**
  - **Normalization and Rewrites:** Before we try to look at data
    access paths, we perform some simplification, normalization and
    "canonicalization" of the query.
  - **Index Tagging:** Figure out which indexes could potentially be
    helpful for which query predicates.
  - **Plan Enumeration:** Given the set of associated indexes and
    predicates, enumerate all possible combinations of assignments
    for the whole query tree and output a draft query plan for each.
  - **Plan Compilation:** For each of the draft query plans, finalize
    the details. Pick index bounds, add any necessary sorts, fetches,
    or projections.
  - **Plan Selection:** Compete the candidate plans against each other
    and select the winner.
  - [**Plan Caching:**](#plan-caching) Attempt to skip the expensive steps above by
    caching the previous winning solution.
- **Query Execution:** Iterate the winning plan and return results to the
  client.

In this documentation we focus on the process for a single node or
replica set where all the data is expected to be found locally.

The following commands are generally maintained by the query team, with
the majority of our focus given to the first two.

- find
- aggregate
- count
- distinct
- mapReduce
- update
- delete
- findAndModify

The code path for each of these starts in a Command, named something
like MapReduceCommand or FindCmd.

This file (specified in a YAML format) is used to generate C++ code. Our
build system will run a python tool to parse this YAML and spit out C++
code which is then compiled and linked. This code is left in a file
ending with '\_gen.h' or '\_gen.cpp', for example
'count_command_gen.cpp'. You'll notice that things like whether it is
optional, the type of the field, and any defaults are included here, so
we don't have to write any code to handle that.

This is actually a special case, and we use a class called the
`LiteParsedPipeline` for this and other similar purposes.

The `LiteParsedPipeline` class is constructed via a semi-parse which
only goes so far as to tease apart which stages are involved. It is a
very simple model of an aggregation pipeline, and is supposed to be
cheaper to construct than doing a full parse. As a general rule of
thumb, we try to keep expensive things from happening until after we've
verified the user has the required privileges to do those things.

This simple model can be used for requests we want to inspect before
proceeding and building a full model of the user's query or request.

Once we have parsed the command and checked authorization, we move on to parsing
parts of the query. Once again, we will focus on the find and aggregate commands.

## Find command parsing

The find command is parsed entirely by the IDL. The IDL parser first creates a FindCommandRequest.
As mentioned above, the IDL parser does all of the required type checking and stores all options for
the query. The FindCommandRequest is then turned into a CanonicalQuery.

## Aggregate Command Parsing

### LiteParsedPipeline

In the process of parsing an aggregation we create two versions of the pipeline: a
LiteParsedPipeline (that contains LiteParsedDocumentSource objects) and the Pipeline (that contains
DocumentSource objects) that is eventually used for execution. See the above section on
authorization checking for more details.

### DocumentSource

Before talking about the aggregate command as a whole, we will first briefly discuss
the concept of a DocumentSource. A DocumentSource represents one stage in an aggregation
pipeline. For each stage in the pipeline, we create another DocumentSource. Each DocumentSource
has its own parser, which performs validation of its internal fields and arguments and then
generates the DocumentSource to be added to the final pipeline.

### Pipeline

The pipeline parser uses the individual document source parsers to parse the entire pipeline
argument of the aggregate command. The parsing process is fairly simple -- for each object in the
user specified pipeline, look up the document source parser for the stage name, and then parse the
object using that parser. The final pipeline is composed of the DocumentSources from the
individual parsers.

### Aggregation Command

When an aggregation is run, the first thing that happens is the request is parsed into a
LiteParsedPipeline. As mentioned above, the LiteParsedPipeline is used to check options and
permissions on namespaces. More checks are done in addition to those performed by the
LiteParsedPipeline, as described above. Note that we use the original BSON for parsing the pipeline
and DocumentSources, as opposed to continuing from the LiteParsedPipeline. This could be improved
in the future.

## Other command parsing

As mentioned above, there are several other commands maintained by the query team. We will quickly
give a summary of how each is parsed, but not get into the same level of detail.

- count: Parsed by IDL and then turned into a CountStage which can be executed in a similar way to
  a find command.
- distinct: The distinct-specific arguments are parsed by IDL into DistinctCommandRequest. Generic
  command arguments and the 'query' field are parsed by custom code into ParsedDistinctCommand, which
  is then used to construct a CanonicalDistinct and eventually turned into an executable stage.
- mapReduce: Parsed by IDL and then turned into an equivalent aggregation command.
- update: Parsed by IDL. An update command can contain both query (find) and pipeline syntax
  (for updates) which each get delegated to their own parsers.
- delete: Parsed by IDL. The filter portion of the delete command is delegated to the find
  parser.
- findAndModify: Parsed by IDL. The findAndModify command can contain find and update syntax. The
  query portion is delegated to the query parser and if this is an update (rather than a delete) it
  uses the same parser as the update command.

# Plan caching

For other heuristics limiting use of CSI see
[`querySatisfiesCsiPlanningHeuristics()`](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/query/query_planner.cpp#L250).

Scanning of CSI is implemented by the
[`columnscan`](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/exec/sbe/stages/column_scan.h)
SBE stage. Unlike `ixscan`, the plans that use `columnscan` don't include a separate fetch stage, as
the columnstore indexes are optimistically assumed to be covering.

### Tests

_JS Tests:_ Most CSI related tests can be found in the `jstests/core/columnstore` folder or by searching
for tests that create an index with the "columnstore" tag. There are also the
[`core_column_store_indexes`](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/buildscripts/resmokeconfig/suites/core_column_store_indexes.yml)
and [`aggregation_column_store_index_passthrough`](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/buildscripts/resmokeconfig/suites/aggregation_column_store_index_passthrough.yml)
suites.

In the column store, keys are paths with a `RecordId` postfix (rather than values), and there are
entries per path in a collection document:

Example input documents:

```
{
    _id: new ObjectId("..."),
    ...
    viewed: true
}
```

High-level view of the column store data format:

```
(_id\01, {vals: [ ObjectId("...") ]})
(_id\02, {vals: [ ObjectId("...") ]})
...
```

_Code spelunking entry points:_

- The [IndexAccessMethod](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/index/index_access_method.h)
  is invoked by the [IndexCatalogImpl](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/catalog/index_catalog_impl.cpp#L1714-L1715).

- The [ColumnStoreAccessMethod](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/index/columns_access_method.h#L39),
  note the [write paths](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/index/columns_access_method.cpp#L269-L286)
  that use the ColumnKeyGenerator.

- The [ColumnKeyGenerator](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/index/column_key_generator.h#L146)
  produces many [UnencodedCellView](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/index/column_key_generator.h#L111)
  via the [ColumnShredder](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/index/column_key_generator.cpp#L163-L176)
  with the [ColumnProjectionTree & ColumnProjectionNode](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/index/column_key_generator.h#L46-L101)
  classes defining the desired path projections.

- The [column_cell.h/cpp](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/index/column_cell.h#L44-L50)
  helpers are leveraged throughout the
  [ColumnStoreAccessMethod write methods](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/index/columns_access_method.cpp#L281)
  to encode ColumnKeyGenerator UnencodedCellView cells into final buffers for storage write.

- The ColumnStoreAccessMethod
  [invokes the WiredTigerColumnStore](https://github.com/mongodb/mongo/blob/r6.3.0-alpha/src/mongo/db/index/columns_access_method.cpp#L318)
  with the final encoded path-cell (key-value) entries for storage.

# About

This directory contains all the logic related to query optimization in the new
common query framework. It contains the models for representing a query and the
logic for implementing optimization via a cascades framework.

# Testing

Developers working on the new optimizer may wish to run a subset of the tests
which is exclusively focused on this codebase. This section details the relevant
tests.

## Unit Tests

The following C++ unit tests exercise relevant parts of the codebase:

- algebra_test (src/mongo/db/query/optimizer/algebra/)
- db_pipeline_test (src/mongo/db/pipeline/)
  - This test suite includes many unrelated test cases, but
    'abt/abt_translation_test.cpp' and 'abt/abt_optimization_test.cpp' are the relevant ones.
- optimizer_test (src/mongo/db/query/optimizer/)
- sbe_abt_test (src/mongo/db/exec/sbe/abt/)

These can be compiled with targets like 'build/install/bin/algebra_test',
although the exact name will depend on your 'installDir' which you have
configured with SCons. It may look more like
'build/opt/install/bin/algebra_test'. If you want to build and run at once, you
can use the '+' shortcut to ninja, like so:

```
ninja <FLAGS> +algebra_test +db_pipeline_test +optimizer_test +sbe_abt_test
```

## JS Integration Tests

In addition to the above unit tests, the following JS suites are helpful in
exercising this codebase:

- **cqf**: [buildscripts/resmokeconfig/suites/cqf.yml](/buildscripts/resmokeconfig/suites/cqf.yml)
- **cqf_disabled_pipeline_opt**:
  [buildscripts/resmokeconfig/suites/cqf_disabled_pipeline_opt.yml](/buildscripts/resmokeconfig/suites/cqf_disabled_pipeline_opt.yml)
- **cqf_parallel**: [buildscripts/resmokeconfig/suites/cqf_parallel.yml](/buildscripts/resmokeconfig/suites/cqf_parallel.yml)
- **query_golden_cqf**: [buildscripts/resmokeconfig/suites/query_golden_cqf.yml](/buildscripts/resmokeconfig/suites/query_golden_cqf.yml)

Descriptions of these suites can be found in
[buildscripts/resmokeconfig/evg_task_doc/evg_task_doc.yml](/buildscripts/resmokeconfig/evg_task_doc/evg_task_doc.yml).

You may run these like so, adjusting the `-j` flag for the appropriate level of
parallel execution for your machine.

```
./buildscripts/resmoke.py run -j4 \
    --suites=cqf,cqf_disabled_pipeline_opt,cqf_parallel,query_golden_cqf
```

## Local Testing Recommendation

Something like this command may be helpful for local testing:

```
ninja <FLAGS> install-devcore build/install/bin/algebra_test \
build/install/bin/db_pipeline_test build/install/bin/optimizer_test \
build/install/bin/sbe_abt_test \
&& ./build/install/bin/sbe_abt_test \
&& ./buildscripts/resmoke.py run --suites=cqf,cqf_parallel,cqf_disabled_pipeline_opt,query_golden_cqf -j4
```

**Note:** You may need to adjust the path to the unit test binary targets if your
SCons install directory is something more like `build/opt/install/bin`.

## Evergreen Testing Recommendation

In addition to the above suites, there is a patch-only variant which enables the CQF feature flag
on a selection of existing suites. The variant, "Query (all feature flags and CQF enabled)", runs
all the tasks from the recommended all-feature-flags variants. When testing on evergreen, you
may want a combination of passthrough tests from the CQF variant and CQF-targeted tests (like the
integration suites mentioned above and unit tests) on interesting variants such as ASAN.

You can define local evergreen aliases to make scheduling these tasks easier and faster than
selecting them individually on each evergreen patch.

# Query Shape

A query shape is a transformed version of a command with literal values replaced by a "canonical"
BSON Type placeholder. Hence, different instances of a command would be considered to have the same
query shape if they are identical once their literal values are abstracted.

For example, these two queries would have the same shape:

```js
db.example.findOne({x: 24});
db.example.findOne({x: 53});
```

While these queries would each have a distinct shape:

```js
db.example.findOne({x: 53, y: 1});
db.example.findOne({x: 53});
db.example.findOne({x: "string"});
```

While different literal _values_ result in the same shape (matching `x` for 24 vs 53), different
BSON _types_ of the literal are considered distinct shapes (matching `x` for 53 vs "string").

You can see which components are considered part of the query shape or not for each command
type in their respective "shape component" classes, whose purpose is to determine which components
are relevant and should be included for determining the shape for a specific type of command. The
structure is as follows:

- [`CmdSpecificShapeComponents`](query_shape.h#L65)
  - [`LetShapeComponent`](cmd_with_let_shape.h#L48)
    - [`AggCmdShapeComponents`](agg_cmd_shape.h#L82)
    - [`FindCmdShapeComponents`](find_cmd_shape.h#L48)

See more information for the different shapes in their respective classes, structured as follows:

- [`Shape`](query_shape.h)
  - [`CmdWithLetShape`](cmd_with_let_shape.h)
    - [`AggCmdShape`](agg_cmd_shape.h)
    - [`FindCmdShape`](find_cmd_shape.h)

## Serialization Options

`SerializationOptions` describes the way we serialize literal values.

There are 3 different serialization options:

- `kUnchanged`: literals are serialized unmodified
  - `{x: 5, y: "hello"}` -> `{x: 5, y: "hello"}`
- `kToDebugTypeString`: human readable format, the type string of the literal is serialized
  - `{x: 5, y: "hello"}` -> `{x: "?number", y: "?string"}`
- `kToRepresentativeParseableValue`: the literal is serialized to one canonical value for the given
  type, which must be parseable
  - `{x: 5, y: "hello"}` -> `{x: 1, y: "?"}`
  - An example of a query which is serialized differently due to the parseable requirement is
    `{x: {$regex: "^p.*"}}`. If we serialized the pattern as if it were a normal string we would end up
    with `{x: {$regex: "?"}}`; however, `"?"` is not a valid regex pattern, so this would fail
    parsing. Instead we serialize it as `{x: {$regex: "\\?"}}` to maintain parseability, since `"\\?"`
    is a valid regex.

See [serialization_options.h](serialization_options.h) for more details.

# Query Stats

This directory is the home of the infrastructure related to recording runtime query statistics for
the database. It is not to be confused with `src/mongo/db/query/stats/`, which is the home of the
logic for computing and maintaining statistics about a collection or index's data distribution.

These statistics are grouped by query stats key and will be collected on any mongod or mongos
process for which query stats collection is enabled, including primaries and secondaries.

## QueryStatsStore

At the center of everything here is the [`QueryStatsStore`](query_stats.h#93-97), which is a
partitioned hash table that maps the hash of a [Query Stats Key](#glossary) (also known as the
_Query Stats Store Key_) to some metrics about how often each one occurs.

### Computing the Query Stats Store Key

A query stats store key contains various dimensions that distinguish a specific query. One main
attribute of the query stats store key is the query shape (`query_shape::Shape`). For example, if
the client does this:

```js
db.example.findOne({x: 24});
db.example.findOne({x: 53});
```

then the `QueryStatsStore` should contain an entry for a single query shape which would record 2
executions and some related statistics (see [`QueryStatsEntry`](query_stats_entry.h) for details).

For more information on query shape, see the [query_shape](../query_shape/README.md) documentation.

The query stats store has _more_ dimensions (i.e. more granularity) to group incoming queries than
just the query shape. For example, these queries would all three have the same shape, but the first
would have a different query stats store entry from the other two:

```js
db.example.find({x: 55});
db.example.find({x: 55}).batchSize(2);
db.example.find({x: 55}).batchSize(3);
```

There are two distinct query stats store entries here - both the examples which include the batch
size will be treated separately from the example which does not specify a batch size.

As one example, you can find the `FindCmdQueryStatsStoreKeyComponents` (including the `batchSize`
shown in this example).

### Query Stats Store Cache Size

The size of the `QueryStatsStore` can be set by the server parameter
[`internalQueryStatsCacheSize`](#server-parameters), and the partitions will be created based off
that. See [`queryStatsStoreManagerRegisterer`](query_stats.cpp#L138-L154) for more details. When the
store exceeds the configured size, entries will be evicted to drop below the max size. Eviction will
be tracked in the new [server status metrics](#server-status-metrics) for queryStats.

## Metric Collection

At a high level, when a query is run and collection of query stats is enabled, during planning we
call [`registerRequest`](query_stats.h#L195-L198) in which the query stats store key will be
generated based on the query's shape and the various other dimensions. The key will always be serialized
and stored on the `opDebug`, and also on the cursor in the case that there are `getMore`s, so that we can
continue to aggregate the operation's metrics. Once the query execution is fully complete, we will look up
the key in the store and update it if it exists, or create a new entry and add it. There are
more details in the [comments](query_stats.h#L158-L216).

### Rate Limiting

Whether or not query stats will be recorded for a specific query execution depends on a Rate
Limiter, which limits the number of recordings per second based on the server parameter
[internalQueryStatsRateLimit](#server-parameters). The goal of the rate limiter is to minimize the
performance impact of query stats collection. Our rate limiter uses the sliding window algorithm,
described [here](rate_limiting.h#82-87).

## Metric Retrieval

To retrieve the stats gathered in the `QueryStatsStore`, there is a new aggregation stage,
`$queryStats`. This stage must be the first in a pipeline and it must be run against the admin
database. The structure of the command is as follows (note `aggregate: 1` reflecting there is no collection):

```js
db.adminCommand({
    aggregate: 1,
    pipeline: [
        {
            $queryStats: {
                transformIdentifiers: {
                    algorithm: "hmac-sha-256",
                    hmacKey: BinData(
                        8,
                        "87c4082f169d3fef0eef34dc8e23458cbb457c3sf3n2",
                    ) /* bindata subtype 8 - a new type for sensitive data */,
                },
            },
        },
    ],
    cursor: {},
});
```

`transformIdentifiers` is optional. If not present, we will generate the regular Query Stats Key. If
present:

- `algorithm` is required and the only currently supported option is "hmac-sha-256".
- `hmacKey` is required.
- We will generate the [One-way Tokenized](#glossary) Query Stats Key by applying "hmac-sha-256"
  to the names of any field, collection, or database. The Application Name field is not transformed.

The query stats store will output one document for each query stats key, which is structured in the
following way:

```js
{
    key: {/* Query Stats Key */},
    asOf: ISODate("..."),
    metrics: {
        /* described below */
    }
}
```

- `key`: Query Stats Key.
- `asOf`: UTC time when $queryStats read this entry from the store. This will not return the same
  UTC time for each result. The data structure used for the store is partitioned, and each partition
  will be read at a snapshot individually. You may see up to the number of partitions in unique
  timestamps returned by one $queryStats cursor.
- `metrics`: the metrics collected; these may be flawed due to:
  - Server restarts, which will reset metrics.
  - LRU eviction, which will reset metrics.
  - Rate limiting, which will skew metrics.
- `metrics.execCount`: Number of recorded observations of this query.
- `metrics.firstSeenTimestamp`: UTC time taken at query completion (including getMores) for the
  first recording of this query stats store entry.
- `metrics.lastSeenTimestamp`: UTC time taken at query completion (including getMores) for the
  latest recording of this query stats store entry.
- `metrics.docsReturned`: Various broken down metrics for the number of documents returned by each
  observation of this query.
- `metrics.firstResponseExecMicros`: Estimated time spent computing and returning the first batch.
- `metrics.totalExecMicros`: Estimated time spent computing and returning all batches, which is the
  same as the above for single-batch queries.
- `metrics.lastExecutionMicros`: Estimated time spent processing the latest query (akin to
  "totalExecMicros", not "firstResponseExecMicros").

#### Permissions

`$queryStats` is restricted by two privilege actions:

- The `queryStatsRead` privilege allows running `$queryStats` without passing the `transformIdentifiers`
  options.
- `queryStatsReadTransformed` allows running `$queryStats` with `transformIdentifiers` set.

These two privileges are included in the clusterMonitor role in Atlas.

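As an illustrative sketch only (the role name and resource scope are assumptions, not an official recipe), a custom role bundling these two privilege actions might be created like this:

```js
// Hypothetical admin-database role granting both $queryStats privilege actions.
db.getSiblingDB("admin").createRole({
    role: "queryStatsViewer", // assumed role name
    privileges: [{resource: {cluster: true}, actions: ["queryStatsRead", "queryStatsReadTransformed"]}],
    roles: [],
});
```
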
### Server Parameters

- `internalQueryStatsCacheSize`:

  - Max query stats store size, specified as a string like "4MB" or "1%". Defaults to 1% of the
    machine's total memory.
  - The query stats store is an LRU cache structure with partitions, so we may be under the cap due
    to implementation.

- `internalQueryStatsRateLimit`:

  - The rate limit is an integer which imposes a maximum number of recordings per second. Default is
    0, which has the effect of disabling query stats collection. Setting the parameter to -1 means
    there will be no rate limit.

- `logComponentVerbosity.queryStats`:

  - Controls the logging behavior for query stats. See [Logging](#logging) for details.

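As a brief sketch (the values below are arbitrary examples, and this assumes these parameters may be changed at runtime), both parameters can be adjusted via `setParameter`:

```js
// Cap the query stats store at 1% of memory and record at most 100 query executions per second.
db.adminCommand({setParameter: 1, internalQueryStatsCacheSize: "1%"});
db.adminCommand({setParameter: 1, internalQueryStatsRateLimit: 100});
```
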
### Logging

Setting `logComponentVerbosity.queryStats` will do the following for each level:

- Level 0 (default): Nothing will be logged.
- Level 1 or higher: Invocations of $queryStats will be logged if and only if the algorithm is
  "hmac-sha-256". The specification of the $queryStats stage is logged, with any provided hmac key
  redacted.
- Level 2 or higher: Nothing extra, reserved for future use.
- Level 3 or higher: All results of any "hmac-sha-256" $queryStats invocation are logged. Each
  result will be its own entry and there will be one final entry that says "we finished".
- Levels 4 and 5 do nothing extra.

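For example, a minimal sketch of raising the query stats log verbosity to level 1 at runtime:

```js
// Log "hmac-sha-256" $queryStats invocations (level 1), as described above.
db.adminCommand({setParameter: 1, logComponentVerbosity: {queryStats: {verbosity: 1}}});
```
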
### Server Status Metrics

The following will be added to the `serverStatus.metrics`:

```js
queryStats: {
    numEvicted: NumberLong(0),
    /* ... */
}
```

# Glossary

**Query Execution**: This term implies the overall execution of what a client would consider one
query, but which may or may not involve one or more getMore commands to iterate a cursor. For
example, a find command and two getMore commands on the returned cursor is one query execution.

# Search

This document is a work-in-progress and just provides a high-level overview of the search implementation.

[Atlas Search](https://www.mongodb.com/docs/atlas/atlas-search/) provides integrated full-text search by running queries with the $search and $searchMeta aggregation stages. You can read about the $vectorSearch aggregation stage in [vector_search](https://github.com/mongodb/mongo/blob/master/src/mongo/db/query/vector_search/README.md).

## Lucene

Diving into the mechanics of search requires a brief rundown of [Apache Lucene](https://lucene.apache.org/) because it is the bedrock of MongoDB's search capabilities. MongoDB employees can read more about Lucene and mongot at [go/mongot](http://go/mongot).

Apache Lucene is an open-source text search library, written in Java. Lucene allows users to store data in three primary ways:

- inverted index: maps each term (in a set of documents) to the documents in which the term appears, in which terms are the unique words/phrases and documents are the pieces of content being indexed. Inverted indexes offer great performance for matching search terms with documents.
- storedFields: stores all field values for one document together in a row-stride fashion. In retrieval, all field values are returned at once per document, so that loading the relevant information about a document is very fast. This is very useful for search features that are improved by row-oriented data access, like search highlighting. Search highlighting marks up the search terms and displays them within the best/most relevant sections of a document.
- DocValues: column-oriented fields with a document-to-value mapping built at index time. As it facilitates column-based data access, it's faster for aggregating field values for counts and facets.

## `mongot`

`mongot` is a MongoDB-specific process written as a wrapper around Lucene and run on Atlas. Using Lucene, `mongot` indexes MongoDB databases to provide our customers with full text search capabilities.

In the current “coupled” search architecture, one `mongot` runs alongside each `mongod` or `mongos`. Each `mongod`/`mongos` and `mongot` pair are on the same physical box/server and communicate via localhost.

`mongot` replicates the data from its collocated `mongod` node using change streams and builds Lucene indexes on that replicated data. `mongot` is guaranteed to be eventually consistent with `mongod`. Check out [mongot_cursor](https://github.com/mongodb/mongo/blob/master/src/mongo/db/query/search/mongot_cursor.h) for the core shared code that establishes and executes communication between `mongod` and `mongot`.

## Search Indexes

In order to run search queries, the user has to create a search index. Search index commands similarly use `mongod`/`mongos` server communication protocols to communicate with a remote search index server, but with an Envoy instance that handles forwarding the command requests to Atlas servers and then eventually to the relevant Lucene/`mongot` instances. `mongot` and Envoy instances are co-located with every `mongod` server instance, and Envoy instances are co-located with `mongos` servers as well. The precise structure of the search index architecture will likely evolve in the future as improvements are made to that system.

Search indexes can be:

- Only on specified fields ("static")
- All fields (“dynamic”)

`mongot` stores the indexed data exclusively, unless the customer has opted into storing entire documents (more expensive).

Note: Indexes can also be managed through the Atlas UI.

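As a rough, hypothetical sketch of creating a dynamic search index from the shell (the collection and index names are assumptions, and the exact command surface may differ):

```js
// Create a search index that dynamically indexes all fields of "coll".
db.runCommand({
    createSearchIndexes: "coll",
    indexes: [{name: "default", definition: {mappings: {dynamic: true}}}],
});
```
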
## $search and $searchMeta stages

There are two text search stages in the aggregation framework (and $search is not available for find commands). [$search](https://www.mongodb.com/docs/atlas/atlas-search/query-syntax/#-search) returns the results of full-text search, and [$searchMeta](https://www.mongodb.com/docs/atlas/atlas-search/query-syntax/#-searchmeta) returns metadata about search results. When used for an aggregation, either search stage must be the first stage in the pipeline. For example:

```
db.coll.aggregate([
    {$search: { /* ... */ }},
    {$match: { /* ... */ }},
    {$project: { /* ... */ }}
]);
```

$search and $searchMeta are parsed as [DocumentSourceSearch](https://github.com/mongodb/mongo/blob/master/src/mongo/db/pipeline/search/document_source_search.h) and [DocumentSourceSearchMeta](https://github.com/mongodb/mongo/blob/master/src/mongo/db/pipeline/search/document_source_search_meta.h), respectively. When using the classic engine, however, DocumentSourceSearch is [desugared](https://github.com/mongodb/mongo/blob/04f19bb61aba10577658947095020f00ac1403c4/src/mongo/db/pipeline/search/document_source_search.cpp#L118) into a sequence that uses the [$\_internalSearchMongotRemote stage](https://github.com/mongodb/mongo/blob/master/src/mongo/db/pipeline/search/document_source_internal_search_mongot_remote.h) and, if the `returnStoredSource` option is false, the [$\_internalSearchIdLookup stage](https://github.com/mongodb/mongo/blob/master/src/mongo/db/pipeline/search/document_source_internal_search_id_lookup.h). In SBE, both $search and $searchMeta are lowered directly from the original document sources.

For example, the stage `{$search: {query: “chocolate”, path: “flavor”}, returnStoredSource: false}` will desugar into the two stages: `{$_internalSearchMongotRemote: {query: “chocolate”, path: “flavor”}, returnStoredSource: false}` and `{$_internalSearchIdLookup: {}}`.

### $\_internalSearchMongotRemote

$\_internalSearchMongotRemote is the foundational stage for all search queries, e.g., $search and $searchMeta. This stage opens a cursor on `mongot` ([here](https://github.com/mongodb/mongo/blob/e530c98e7d44878ed8164ee9167c28afc97067a7/src/mongo/db/pipeline/search/document_source_internal_search_mongot_remote.cpp#L269)) and retrieves results one-at-a-time from the cursor ([here](https://github.com/mongodb/mongo/blob/e530c98e7d44878ed8164ee9167c28afc97067a7/src/mongo/db/pipeline/search/document_source_internal_search_mongot_remote.cpp#L163)).

Within this stage, the underlying [TaskExecutorCursor](https://github.com/mongodb/mongo/blob/e530c98e7d44878ed8164ee9167c28afc97067a7/src/mongo/executor/task_executor_cursor.h) acts as a black box to handle dispatching commands to `mongot` only as necessary. The cursor retrieves a batch of results from `mongot`, iterates through that batch per each `getNext` call, then schedules a `getMore` request to `mongot` whenever the previous batch is exhausted.

Each batch returned from mongot includes a batch of BSON documents and metadata about the query results. Each document contains an \_id and a relevancy score. The relevancy score indicates how well the document’s indexed values matched the user query. Metadata is a user-specified group of fields with information about the result set as a whole, mostly including counts of various groups (or facets).

### $\_internalSearchIdLookup

The $\_internalSearchIdLookup stage is responsible for recreating the entire document to give to the rest of the agg pipeline (in the above example, $match and $project) and for checking to make sure the data returned is up to date with the data on `mongod`, since `mongot`’s indexed data is eventually consistent with `mongod`. For example, if `mongot` returned the \_id of a document that had been deleted, $\_internalSearchIdLookup is responsible for catching that: it won’t find a document matching that \_id and will therefore filter that document out. The stage will also perform shard filtering, where it ensures there are no duplicates from separate shards, and it will retrieve the most up-to-date field values. However, this stage doesn’t account for documents that had been inserted into the collection but not yet propagated to `mongot` via the $changeStream; that’s why search queries are eventually consistent but don’t guarantee strong consistency.

The `timeField` will be used in the `control` object in the buckets collection. `control.min.<time field>`
will be `control.min.t`, and `control.max.<time field>` will be `control.max.t`. The values for `t` in the
user documents are stored in `data.t`.

The meta-data field is always specified as `meta` in the buckets collection. We will return documents with the user-specified meta-data field (in this case `m`), but we will store the meta-data field values under
the field `meta` in the buckets collection. Therefore, when returning documents we will need to
rewrite the field `meta` to the field the user expects: `m`. When optimizing queries, we will need to
rewrite the field the user inserted (`m`) to the field the buckets collection stores: `meta`.

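As a rough sketch of this mapping (the collection, the `temp` field, and all values are made up, and the bucket document is simplified to just the fields discussed above):

```js
// User-facing time-series collection created with timeField "t" and metaField "m".
db.weather.insertOne({t: ISODate("2024-01-01T00:00:00Z"), m: "sensor-A", temp: 20});

// Simplified shape of the corresponding bucket document:
// {
//     control: {min: {t: ISODate("2024-01-01T00:00:00Z")}, max: {t: ISODate("2024-01-01T00:00:00Z")}},
//     meta: "sensor-A",
//     data: {t: {"0": ISODate("2024-01-01T00:00:00Z")}, temp: {"0": 20}}
// }
```
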
## $match on metaField reorder

Vector search is implemented as an aggregation stage that behaves similarly to `$search`.

[`$vectorSearch`](https://github.com/mongodb/mongo/blob/master/src/mongo/db/pipeline/search/document_source_vector_search.h) takes several [parameters](https://github.com/mongodb/mongo/blob/master/src/mongo/db/pipeline/search/document_source_vector_search.idl) that are passed on to `mongot`. These include:
|
||||
|
||||
| Parameter | Description |
|
||||
| --------- | -------- |
|
||||
| queryVector | vector to query |
|
||||
| path | field to search over |
|
||||
| Parameter | Description |
|
||||
| ------------- | ----------------------------------------------------------- |
|
||||
| queryVector | vector to query |
|
||||
| path | field to search over |
|
||||
| numCandidates | number of candidates to consider when performing the search |
|
||||
| limit | maximum number of documents to return |
|
||||
| index | index to use for the search |
|
||||
| filter | optional pre-filter to apply before searching |
|
||||
| limit | maximum number of documents to return |
|
||||
| index | index to use for the search |
|
||||
| filter | optional pre-filter to apply before searching |
|
||||
|
||||
Validation for most of these fields occurs on `mongot`, with the exception of `filter`. `mongot` does not yet support complex MQL semantics, so the `filter` is limited to simple comparisons (e.g. `$eq`, `$lt`, `$gte`) on basic field types. This is validated on `mongod` with a [custom `MatchExpressionVisitor`](https://github.com/mongodb/mongo/blob/master/src/mongo/db/query/vector_search/filter_validator.cpp).
@@ -26,7 +26,7 @@ standalone would. The one difference from a standalone write is that replica set

`OpObserver` that inserts a document to the **oplog** whenever a write to the database happens,
describing the write. The oplog is a capped collection called `oplog.rs` in the `local` database.
There are a few optimizations made for it in WiredTiger, and it is the only collection that doesn't
include an \_id field.
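For a quick look at what these entries contain, the newest oplog document can be inspected from the shell (read-only; the exact fields vary by operation type):

```js
// Most recent oplog entry on this node ("op" is "i" for insert, "u" for update, "c" for command, ...).
db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1).pretty();
```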

If a write does multiple operations, each will have its own oplog entry; for example, inserts with
implicit collection creation create two oplog entries, one for the `create` and one for the
@@ -46,7 +46,7 @@ or **majority**. If **majority** is specified, the write waits for that write to

**committed snapshot** as well, so that it can be read with `readConcern: { level: majority }`
reads. (If this last sentence made no sense, come back to it at the end).
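For example (collection name hypothetical), a majority write followed by a majority read might look like:

```js
// Wait for the write to be acknowledged by a majority of voting members (or time out).
db.orders.insertOne({ _id: 1, status: "new" }, { writeConcern: { w: "majority", wtimeout: 5000 } });

// Read only majority-committed data; this sees the insert once it is in the committed snapshot.
db.orders.find({ _id: 1 }).readConcern("majority");
```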

### Default Write Concern

If a write operation does not explicitly specify a write concern, the server will use a default
write concern. This default write concern will be defined by either the
@@ -54,7 +54,7 @@ write concern. This default write concern will be defined by either the

**implicit default write concern**, implicitly set by the
server based on replica set configuration.

#### Cluster-Wide Write Concern

Users can set the cluster-wide write concern (CWWC) using the
[`setDefaultRWConcern`](https://docs.mongodb.com/manual/reference/command/setDefaultRWConcern/)
@@ -67,7 +67,7 @@ On sharded clusters, the CWWC will be stored on config servers. Shard servers th

store the CWWC. Instead, mongos polls the config server and applies the default write concern to
requests it forwards to shards.
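For instance, a minimal sketch of setting and inspecting the CWWC from the shell:

```js
// Set the cluster-wide default write concern; run against the admin database.
db.adminCommand({ setDefaultRWConcern: 1, defaultWriteConcern: { w: "majority" } });

// Inspect the currently effective defaults.
db.adminCommand({ getDefaultRWConcern: 1 });
```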

#### Implicit Default Write Concern

If there is no cluster-wide default write concern set, the server will set the default. This is
known as the implicit default write concern (IDWC). For most cases, the IDWC will default to
@@ -87,7 +87,7 @@ successfully acknowledge a majority write as the majority for the set is two nod

primary will remain primary with the arbiter's vote. In this case, the DWCF will have preemptively
set the IDWC to `{w: 1}` so the user can still perform writes to the replica set.
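A rough sketch of that formula for replica sets (the helper and its inputs are illustrative, not the server's actual code):

```js
function implicitDefaultWriteConcern(numArbiters, numDataBearingVoters, numVotingNodes) {
  const majorityVoteCount = Math.floor(numVotingNodes / 2) + 1;
  // Arbiters present and too few data-bearing voters to reliably acknowledge majority writes.
  if (numArbiters > 0 && numDataBearingVoters <= majorityVoteCount) {
    return { w: 1 };
  }
  return { w: "majority" };
}

implicitDefaultWriteConcern(1, 2, 3); // PSA set => { w: 1 }
implicitDefaultWriteConcern(0, 3, 3); // PSS set => { w: "majority" }
```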

#### Implicit Default Write Concern and Sharded Clusters

For sharded clusters, the implicit default write concern will always be `{w: "majority"}`.
As mentioned above, mongos will send the default write concern with all requests that it forwards
@@ -108,7 +108,7 @@ the result of the default write concern formula is `{w: 1}`. Similarly, we will

`{w: "majority"}` across the cluster, but we do not want to specify that for PSA sets for reasons
listed above.

#### Replica Set Reconfigs and Default Write Concern

A replica set reconfig will recalculate the default write concern using the Default Write Concern
Formula if CWWC is not set. If the new value of the implicit default write concern is different
@@ -124,11 +124,12 @@ majority write concern needed to set the CWWC, users can run

setting CWWC does not get in the way of being able to do a force reconfig.

#### Code References

- [The definition of an Oplog Entry](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/oplog_entry.idl)
- [Upper layer uses OpObserver class to write Oplog](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/op_observer/op_observer.h#L112), for example, [it is helpful to take a look at OpObserverImpl::logOperation()](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/op_observer/op_observer_impl.cpp#L114)
- [repl::logOplogRecords() is a common function to write Oplogs into Oplog Collection](https://github.com/mongodb/mongo/blob/r7.1.0/src/mongo/db/repl/oplog.cpp#L440)
- [WriteConcernOptions is filled in extractWriteConcern()](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/write_concern.cpp#L71)
- [Upper level uses waitForWriteConcern() to wait for the write concern to be fulfilled](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/write_concern.cpp#L254)

## Life as a Secondary
@@ -214,22 +215,22 @@ the `OplogFetcher` decides to continue, it will wait for the next batch to arriv

not, the `OplogFetcher` will terminate, which will lead to `BackgroundSync` choosing a new sync
source. Reasons for changing sync sources include:

- If the node is no longer in the replica set configuration.
- If the current sync source is no longer in the replica set configuration.
- If the user has requested another sync source via the `replSetSyncFrom` command (see the shell
  example after this list).
- If chaining is disabled and the node is not currently syncing from the primary.
- If the sync source is not the primary, does not have its own sync source, and is not ahead of
  the node. This indicates that the sync source will not receive writes in a timely manner. As a
  result, continuing to sync from it will likely cause the node to be lagged.
- If the most recent OpTime of the sync source is more than `maxSyncSourceLagSecs` seconds behind
  another member's latest oplog entry. This ensures that the sync source is not too far behind
  other nodes in the set. `maxSyncSourceLagSecs` is a server parameter and has a default value of
  30 seconds.
- If the node has discovered another eligible sync source that is significantly closer. A
  significantly closer node has a ping time that is at least `changeSyncSourceThresholdMillis`
  lower than our current sync source. This minimizes the number of nodes that have sync sources
  located far away. `changeSyncSourceThresholdMillis` is a server parameter and has a default value
  of 5 ms.
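For reference, the manual override mentioned in the list above is issued from the shell (hostname is hypothetical):

```js
// Ask this member to sync from a specific member instead of its current sync source.
db.adminCommand({ replSetSyncFrom: "replhost2.example.net:27017" });
```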

### Sync Source Selection

@@ -256,29 +257,29 @@ candidate.

Otherwise, it iterates through all of the nodes and sees which one is the best.

- First the secondary checks the `TopologyCoordinator`'s cached view of the replica set for the
  latest OpTime known to be on the primary. Secondaries do not sync from nodes whose newest oplog
  entry is more than
  [`maxSyncSourceLagSecs`](https://github.com/mongodb/mongo/blob/r4.2.0/src/mongo/db/repl/topology_coordinator.cpp#L302-L315)
  seconds behind the primary's newest oplog entry.
- Secondaries then loop through each node and choose the closest node that satisfies [various
  criteria](https://github.com/mongodb/mongo/blob/r4.2.0/src/mongo/db/repl/topology_coordinator.cpp#L200-L438).
  “Closest” here is determined by the lowest ping time to each node.
- If no node satisfies the necessary criteria, then the `BackgroundSync` waits 1 second and restarts
  the sync source selection process.

#### Sync Source Probing

After choosing a sync source candidate, the `SyncSourceResolver` probes the sync source candidate to
make sure it actually is able to fetch from the sync source candidate’s oplog.

- If the sync source candidate has no oplog or there is an error, the secondary denylists that sync
  source for some time and then tries to find a new sync source candidate.
- If the oldest entry in the sync source candidate's oplog is newer than the node's newest entry,
  then the node denylists that sync source candidate as well because the candidate is too far
  ahead.
- The sync source's **RollbackID** is also fetched to be checked after the first batch is returned
  by the `OplogFetcher`.

If the secondary is too far behind all possible sync source candidates then it goes into maintenance
mode and waits for manual intervention (likely a call to `resync`). If no viable candidates were
@@ -326,14 +327,15 @@ endless loop doing the following:

last optime in the batch.

#### Code References

- [Start background threads like bgSync/oplogApplier/syncSourceFeedback](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_external_state_impl.cpp#L213)
- [BackgroundSync starts SyncSourceResolver and OplogFetcher to sync log](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/bgsync.cpp#L225)
- [SyncSourceResolver chooses a sync source to sync from](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/sync_source_resolver.cpp#L545)
- [OplogBuffer currently uses a BlockingQueue as underlying data structure](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/oplog_buffer_blocking_queue.h#L41)
- [OplogFetcher queries from sync source and puts fetched oplogs in OplogApplier::\_oplogBuffer](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/oplog_fetcher.cpp#L209)
- [OplogBatcher polls oplogs from OplogApplier::\_oplogBuffer and creates an OplogBatch to apply](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/oplog_batcher.cpp#L282)
- [OplogApplier gets batches of oplog entries from the OplogBatcher and applies entries in parallel](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/oplog_applier_impl.cpp#L297)
- [SyncSourceFeedback keeps checking if there are new oplogs applied on this instance and issues `UpdatePositionCmd` to sync source](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/sync_source_feedback.cpp#L157)

## Replication and Topology Coordinators
@@ -393,9 +395,9 @@ issuing remote commands to other nodes.

Each node communicates with other nodes at regular intervals to:

- Check the liveness of the other nodes (heartbeats)
- Stay up to date with the primary (oplog fetching)
- Update their sync source with their progress (`replSetUpdatePosition` commands)

Each oplog entry is assigned a unique `OpTime` to describe when it occurred so other nodes can
compare how up-to-date they are.
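The result of this gossip is visible from the shell; for example (output abridged and illustrative), `rs.status()` reports per-member heartbeat and optime information:

```js
// Heartbeat and replication progress as seen by this node
// (lastHeartbeat is absent for the member's own entry).
rs.status().members.map(m => ({
  name: m.name,
  state: m.stateStr,
  lastHeartbeat: m.lastHeartbeat,
  optime: m.optimeDate
}));
```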
@@ -604,14 +606,15 @@ The `replSetUpdatePosition` command response does not include any information un

error, such as in a `ReplSetConfig` mismatch.

#### Code References

- [OplogFetcher passes on the metadata it received from its sync source](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/oplog_fetcher.cpp#L897)
- [Node handles heartbeat response and schedules the next heartbeat after it receives heartbeat response](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L190)
- [Node responds to heartbeat request](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/repl_set_commands.cpp#L752)
- [Primary advances the replica set's commit point after receiving replSetUpdatePosition command](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L1889)
- [Secondary advances its understanding of the replica set commit point using metadata fetched from its sync source](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L5649)
- [TopologyCoordinator updates commit optime](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/topology_coordinator.cpp#L2885)
- [SyncSourceFeedback triggers replSetUpdatePosition command using Reporter](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/reporter.cpp#L189)
- [Node updates replica set metadata after receiving replSetUpdatePosition command](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/repl_set_commands.cpp#L675)

## Read Concern
@@ -625,11 +628,11 @@ any updates that occurred since the read began may or may not be seen.

read command to specify at what consistency level the read should be satisfied. There are 5 read
concern levels (a shell example follows the list):

- Local
- Majority
- Linearizable
- Snapshot
- Available
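For example (collection name hypothetical), the level is passed with the read command, or via the shell's cursor helper:

```js
// Explicit read concern level on a find command.
db.runCommand({ find: "orders", filter: { status: "new" }, readConcern: { level: "majority" } });

// The cursor helper sets the same option; shown here with a different level.
db.orders.find({ status: "new" }).readConcern("linearizable");
```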

**Local** just returns whatever the most up-to-date data is on the node. On a primary, it does this
by reading from the storage engine's most recent snapshot. On a secondary, it performs a timestamped
@@ -644,10 +647,10 @@ been replicated to a majority of nodes in the replica set. Any data seen in majo

roll back in the future. Thus majority reads prevent **dirty reads**, though they often are
**stale reads**.

Read concern majority reads do not wait for anything to be committed; they just use different
snapshots from local reads. Read concern majority reads usually return as fast as local reads, but
sometimes will block. For example, right after startup or rollback when we do not have a committed
snapshot, majority reads will be blocked. Also, when some of the secondaries are unavailable or
lagging, majority reads could slow down or block.

For information on how majority read concern works within a multi-document transaction, see the
@@ -695,10 +698,12 @@ the local snapshot is beyond the specified OpTime. If read concern majority is s

wait until the committed snapshot is beyond the specified OpTime.

**afterClusterTime** is a read concern option used for supporting **causal consistency**.
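As a brief sketch (database and collection names hypothetical), the shell and drivers attach `afterClusterTime` automatically inside a causally consistent session:

```js
const session = db.getMongo().startSession({ causalConsistency: true });
const coll = session.getDatabase("test").orders;

coll.insertOne({ _id: 1, status: "new" });
// This read carries afterClusterTime at least as large as the insert's cluster time,
// so it observes that write even if it is routed to a lagging secondary.
coll.findOne({ _id: 1 });
```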

<!-- TODO: link to the Causal Consistency section of the Sharding Architecture Guide -->

#### Code References

- [ReadConcernArg is filled in \_extractReadConcern()](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/service_entry_point_common.cpp#L261)

## Read Preference
@@ -936,7 +941,7 @@ atomicity of a transaction that involves multiple shards. One important part of

Protocol is making sure that all shards participating in the transaction are in the
**prepared state**, or guaranteed to be able to commit, before actually committing the transaction.
This will allow us to avoid a situation where the transaction only commits on some of the shards and
aborts on others. Once a node puts a transaction in the prepared state, it _must_ be able to commit
the transaction if we decide to commit the overall cross-shard transaction.
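From the client's point of view this is all driven by a single commit; a minimal sketch against `mongos` (namespaces and shard placement are hypothetical):

```js
const session = db.getMongo().startSession();
const accounts = session.getDatabase("bank").accounts;

session.startTransaction({ writeConcern: { w: "majority" } });
accounts.updateOne({ _id: "alice" }, { $inc: { balance: -5 } }); // document on one shard
accounts.updateOne({ _id: "bob" }, { $inc: { balance: 5 } });    // document on another shard

// mongos hands the commit to the TransactionCoordinator, which prepares both shards
// before deciding to commit, so the transaction never commits on only one of them.
session.commitTransaction();
```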

Another key piece of the Two Phase Commit Protocol is the [**`TransactionCoordinator`**](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/s/transaction_coordinator.h#L70), which is
@@ -1040,7 +1045,7 @@ transaction are visible.

When a node receives the `commitTransaction` command and the transaction is in the prepared state,
it will first [re-acquire](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/transaction/transaction_participant.cpp#L1962) the [RSTL](#replication-state-transition-lock) to prevent any state
transitions from happening while the commit is in progress. It will then [reserve an oplog slot](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/transaction/transaction_participant.cpp#L2021-L2030),
[commit the storage transaction at the `commitTimestamp`](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/transaction/transaction_participant.cpp#L2057-L2059),
[write the `commitTransaction` oplog entry](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/transaction/transaction_participant.cpp#L2065-L2069)
into the oplog, [update the transactions table](https://github.com/mongodb/mongo/blob/master/src/mongo/db/op_observer/op_observer_impl.cpp#L201), transition the `txnState` to `kCommitted`, record
metrics, and [clean up the transaction resources](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/transaction/transaction_participant.cpp#L2073-L2075).
@@ -1096,7 +1101,7 @@ holding these locks to prevent conflicting operations.

### Recovering Prepared Transactions

The prepare state _must_ endure any state transition or failover, so prepared transactions must be recovered and
reconstructed in all situations. If the in-memory state of a prepared transaction is lost, it can be
reconstructed using the information in the prepare oplog entry(s).
@@ -1121,11 +1126,12 @@ transaction using the `TransactionHistoryIterator`. It will check out the sessio

the transaction, apply all the operations from the oplog entry(s) and prepare the transaction.

#### Code references

- Function to [abort unprepared transactions during stepup or stepdown](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/repl/replication_coordinator_impl.cpp#L2766).
- Where we [yield locks for transactions](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/transaction/transaction_participant.cpp#L1282-L1287).
- Where we [restore locks for transactions](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/transaction/transaction_participant.cpp#L1343-L1348).
- Function to [reconstruct prepared transactions from oplog entries](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/repl/transaction_oplog_application.cpp#L804).
- Where we [skip over prepareTransaction oplog entries](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/repl/transaction_oplog_application.cpp#L737-L752) during recovery oplog application.

## Read Concern Behavior Within Transactions
@@ -1135,8 +1141,8 @@ transaction. If no read concern was specified, the default read concern is local

Reads within a transaction behave differently from reads outside of a transaction because of
**speculative** behavior. This means a transaction speculatively executes without ensuring that
the data read won't be rolled back until it commits. No matter the read concern, when a node goes to
commit a transaction, it waits for the data that it read to be majority committed _as long as the
transaction was run with write concern majority_. Because of speculative behavior, this means that
the transaction can only provide the guarantees of majority read concern, that data that it read
won't roll back, if it is run with write concern majority.
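In practice this is the pairing clients should rely on; a minimal sketch (names hypothetical):

```js
const session = db.getMongo().startSession();

// The durability of what this transaction read comes from the write concern: only
// because the commit waits for majority is the read also guaranteed not to roll back.
session.startTransaction({ readConcern: { level: "local" }, writeConcern: { w: "majority" } });
session.getDatabase("test").orders.findOne({ _id: 1 });
session.commitTransaction();
```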
@@ -1172,8 +1178,9 @@ the [`all_durable`](#replication-timestamp-glossary) timestamp when the transact

which ensures a snapshot with no oplog holes.

#### Code references

- [Noop write for read-only transactions](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/transaction/transaction_participant.cpp#L1940-L1944).
- Function to [set a read snapshot for transactions](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/transaction/transaction_participant.cpp#L1170).

## Transaction Oplog Application
@@ -1291,10 +1298,11 @@ an [empty transaction](https://github.com/mongodb/mongo/blob/07e1e93c566243983b4

since the operations should be applied by its split transactions.

#### Code references

- [Filling writer vectors for unprepared transactions on terminal applyOps.](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/repl/oplog_applier_impl.cpp#L1018-L1033)
- [Applying writes in parallel](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/repl/oplog_applier_impl.cpp#L809-L832) via the writer thread pool.
- Function to [unstash transaction resources](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/transaction/transaction_participant.cpp#L1462) from the RecoveryUnit to the OperationContext.
- Function to [stash transaction resources](https://github.com/mongodb/mongo/blob/be38579dc72a40988cada1f43ab6695dcff8cc36/src/mongo/db/transaction/transaction_participant.cpp#L1427) from the OperationContext to the RecoveryUnit.

## Transaction Errors
@@ -1377,43 +1385,45 @@ mode. Only then can it acquire the global lock in its desired mode.

## Step Up

There are a number of ways that a node will run for election:

- If it hasn't seen a primary within the election timeout (which defaults to 10 seconds).
- If it realizes that it has higher priority than the primary, it will wait and run for
  election (also known as a **priority takeover**). The amount of time the node waits before calling
  an election is directly related to its priority in comparison to the priority of the rest of the set
  (so higher priority nodes will call for a priority takeover faster than lower priority nodes).
  Priority takeovers allow users to specify a node that they would prefer be the primary.
- Newly elected primaries attempt to catch up to the latest applied OpTime in the replica
  set. Until this process (called primary catchup) completes, the new primary will not accept
  writes. If a secondary realizes that it is more up-to-date than the primary and the primary takes
  longer than `catchUpTakeoverDelayMillis` (default 30 seconds), it will run for election. This
  behavior is known as a **catchup takeover**. If primary catchup is taking too long, catchup
  takeover can help allow the replica set to accept writes sooner, since a more up-to-date node will
  not spend as much time (or any time) in catchup. See the [Transitioning to `PRIMARY`](https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/README.md#transitioning-to-primary) section for
  further details on primary catchup.
- The `replSetStepUp` command can be run on an eligible node to cause it to run for election
  immediately (see the shell example after this list). We don't expect users to call this command,
  but it is run internally for election handoff and testing.
- When a node is stepped down via the `replSetStepDown` command, if the `enableElectionHandoff`
  parameter is set to true (the default), it will choose an eligible secondary to run the
  `replSetStepUp` command on a best-effort basis. This behavior is called **election handoff**. This
  will mean that the replica set can shorten failover time, since it skips waiting for the election
  timeout. If `replSetStepDown` was called with `force: true` or the node was stepped down while
  `enableElectionHandoff` is false, then nodes in the replica set will wait until the election
  timeout triggers to run for election.
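For reference, the command mentioned above is issued against the admin database:

```js
// Ask this (electable, caught-up) member to call an election immediately.
db.adminCommand({ replSetStepUp: 1 });
```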

### Code references

- [election timeout](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L345) ([defaults](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/repl_set_config.idl#L101))
- [priority takeover](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L449)
- [priority takeover: priority check](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/topology_coordinator.cpp#L1568-L1578)
- [priority takeover: wait time calculation](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/repl_set_config.cpp#L705-L709)
- [newly elected primary catchup](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L4714)
- [primary catchup completion](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L4799-L4813)
- [primary start accepting writes](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L1361)
- [catchup takeover](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L466)
- [catchup takeover: takeover check](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L466)
- [election handoff](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L2924)
- [election handoff: skip wait](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L2917-L2921)

### Candidate Perspective
@@ -1441,11 +1451,12 @@ If the candidate received votes from a majority of nodes, including itself, the

election.

#### Code references

- [dry-run election](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl_elect_v1.cpp#L203)
- [skipping dry-run](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl_elect_v1.cpp#L185)
- [real election](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl_elect_v1.cpp#L277)
- [candidate process vote response](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/vote_requester.cpp#L114)
- [candidate checks election result](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl_elect_v1.cpp#L416)

### Voter Perspective
@@ -1469,8 +1480,9 @@ future elections. This ensures that even if a node restarts, it does not vote fo

same term.

#### Code references

- [node processing vote request](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/topology_coordinator.cpp#L3429)
- [recording LastVote durably](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L5739)

### Transitioning to `PRIMARY`
@@ -1512,14 +1524,15 @@ and logs “transition to primary complete”. At this point, new writes will be

primary.

#### Code references

- [clearing the sync source, notify nodes of election, prepare catch up](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L4697-L4707)
- [catchup to latest optime known via heartbeats](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L4800)
- [catchup-timeout](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L4746)
- [always allow chaining for catchup](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L5231)
- [enter drain mode after catchup attempt](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L4783)
- [exit drain mode](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L1205)
- [term bump](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L1300)
- [drop temporary collections](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_external_state_impl.cpp#L532)

## Step Down
@@ -1528,14 +1541,15 @@ primary.

The `replSetStepDown` command is one way that a node relinquishes its position as primary. Stepdown via the
`replSetStepDown` command is called "conditional" because it may or may not succeed. Success in this case
depends on the params passed to the command (see the shell example after this list) as well as the state of
the nodes of the replica set.

- If the `force` option is set to `true`:
  - In this case the primary node will wait for `secondaryCatchUpPeriodSecs`, a `replSetStepDown` parameter,
    before stepping down regardless of whether the other nodes have caught up or are electable.
- If the `force` option is omitted or set to `false`, the following conditions must be met for the command to
  succeed:
  - The [`lastApplied`](#replication-timestamp-glossary) OpTime of the primary must be replicated to a majority
    of the nodes
  - At least one of the up-to-date secondaries must also be electable
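For reference, a conditional stepdown from the shell (the values are illustrative):

```js
// Step down for 120 seconds, waiting up to 15 seconds for a caught-up, electable
// secondary before failing the command.
db.adminCommand({ replSetStepDown: 120, secondaryCatchUpPeriodSecs: 15, force: false });
```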

When a `replSetStepDown` command comes in, the node begins to check if it can step down. First, the
node attempts to acquire the [RSTL](#replication-state-transition-lock). In order to do so, it must
@@ -1551,40 +1565,43 @@ Upon a successful stepdown, it yields locks held by

Finally, we log stepdown metrics and update our member state to `SECONDARY`.

#### Code references

- [User-facing documentation](https://www.mongodb.com/docs/manual/reference/command/replSetStepDown/#command-fields).
- [Replication coordinator stepDown method](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L2729)
- [ReplSetStepDown command class](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/repl_set_commands.cpp#L527)
- [The node loops trying to step down](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L2836)
- [A majority of nodes need to have reached the last applied optime](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/topology_coordinator.cpp#L2733)
- [At least one caught up node needs to be electable](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/topology_coordinator.cpp#L2738)
- [Set the LeaderMode to kSteppingDown](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/topology_coordinator.cpp#L1721)
- [Upon a successful stepdown, it yields locks held by prepared transactions](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L2899)

### Unconditional

Stepdowns can also occur for the following reasons:

- If the primary learns of a higher term
- Liveness timeout: If a primary stops being able to transitively communicate with a majority of
  nodes. The primary does not need to be able to communicate directly with a majority of nodes. If
  primary A can’t communicate with node B, but A can communicate with C which can communicate with B,
  that is okay. If you consider the minimum spanning tree on the cluster where edges are connections
  from nodes to their sync source, then as long as the primary is connected to a majority of nodes, it
  will stay primary.
- Force reconfig via the `replSetReconfig` command
- Force reconfig via heartbeat: If we learn of a newer config through heartbeats, we will
  schedule a replica set config change.

During unconditional stepdown, we do not check preconditions before attempting to step down. Similar
to conditional stepdowns, we must kill any conflicting user/system operations before acquiring the
RSTL and yield locks of prepared transactions following a successful stepdown.

#### Code references

- [Stepping down on learning of a higher term](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L6066)
- [Liveness timeout checks](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/topology_coordinator.cpp#L1236-L1249)
- [Stepping down on liveness timeout](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L424)
- [ReplSetReconfig command class](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/repl_set_commands.cpp#L431)
- [Stepping down on reconfig](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L4010)
- [Stepping down on heartbeat](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L980)

### Concurrent Stepdown Attempts
@@ -1676,16 +1693,17 @@ the `stableTimestamp` refers to a point in time that is before the last update,

lose the session information that was never applied as part of the coalescing.

As an example, consider the following:

1. During a single batch of secondary oplog application:
   i). User data write for stmtId=0 at t=10.
   ii). User data write for stmtId=1 at t=11.
   iii). User data write for stmtId=2 at t=12.
   iv). Session txn record write at t=12 with stmtId=2 as lastWriteOpTime. In particular, no
   session txn record write for t=10 with stmtId=0 as lastWriteOpTime or for t=11 with stmtId=1 as lastWriteOpTime because they were coalesced by the [SessionUpdateTracker](https://github.com/mongodb/mongo/blob/9d601c939bca2a4304dca2d3c8abd195c1f070af/src/mongo/db/repl/session_update_tracker.cpp#L217-L221).
2. Rollback to stable timestamp t=10.
3. The session txn record won't exist with stmtId=0 as lastWriteOpTime (because the write was
   entirely skipped by oplog application) despite the user data write for stmtId=0 being reflected
   on-disk. Without any fix, this allows stmtId=0 to be re-executed by this node if it became primary.
   (An illustrative look at such a session txn record follows this list.)
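For orientation, the session txn records referred to above live in `config.transactions`; a record looks roughly like the following (shape abridged, values illustrative):

```js
// One record per logical session; lastWriteOpTime points at the latest write in
// that session's oplog chain.
db.getSiblingDB("config").transactions.findOne();
// {
//   _id: { id: UUID("..."), uid: BinData(0, "...") },
//   txnNum: NumberLong(5),
//   lastWriteOpTime: { ts: Timestamp(1700000000, 12), t: NumberLong(1) },
//   lastWriteDate: ISODate("...")
// }
```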

As a solution, we traverse the oplog to find the last completed retryable write statements that occur before or at the `stableTimestamp`, and use this information to restore the `config.transactions`
table. More specifically, we perform a forward scan of the oplog starting from the first entry
@@ -1771,8 +1789,8 @@ Before the data clone phase begins, the node will do the following:

have data because it means that initial sync didn't complete. We also check this flag to prevent
reading from the oplog while initial sync is in progress.
2. [Reset the in-memory FCV to `kUnsetDefaultLastLTSBehavior`.](https://github.com/10gen/mongo/blob/b718dc1aa3ffb3e6df4f61a30d54cda578cf2830/src/mongo/db/repl/initial_syncer.cpp#L689). This is to ensure compatibility between the sync source and sync
target. If the sync source is actually in a different feature compatibility version, we will find
out when we clone from the sync source.
3. Find a sync source.
4. Drop all of its data except for the local database and recreate the oplog.
5. Get the Rollback ID (RBID) from the sync source to ensure at the end that no rollbacks occurred
@ -1792,8 +1810,8 @@ out when we clone from the sync source.
be the same as the `beginApplyingTimestamp`.

9. [Set the in-memory FCV to the sync source's FCV.](https://github.com/10gen/mongo/blob/b718dc1aa3ffb3e6df4f61a30d54cda578cf2830/src/mongo/db/repl/initial_syncer.cpp#L1153). This is because during the cloning phase, we do expect to clone the sync source's "admin.system.version" collection eventually (which contains the FCV document), but we can't guarantee that we will clone "admin.system.version" first. Setting the in-memory FCV value to the sync source's FCV first will ensure that we clone collections using the same FCV as the sync source. However, we won't persist the FCV to disk nor will we update our minWireVersion until we clone the actual document.
10. Create an `OplogFetcher` and start fetching and buffering oplog entries from the sync source
    to be applied later. Operations are buffered to a collection so that they are not limited by the
    amount of memory available.

## Data clone phase
`find` does not exhaust the cursor, the sync source will keep sending batches until there are none
left.

The cloners are resilient to transient errors. If a cloner encounters an error marked with the
`RetriableError` label in
[`error_codes.yml`](https://github.com/mongodb/mongo/blob/r4.3.2/src/mongo/base/error_codes.yml), it
will retry whatever network operation it was attempting. It will continue attempting to retry for a
length of time set by the server parameter `initialSyncTransientErrorRetryPeriodSeconds`, after
which it will consider the failure permanent. A permanent failure means it will choose a new sync
source and retry all of initial sync, up to a number of times set by the server parameter
`numInitialSyncAttempts`. One notable exception, where we do not retry the entire operation, is for
the actual querying of the collection data. For querying, we use a feature called **resume
tokens**. We set a flag on the query: `$_requestResumeToken`. This causes each batch we receive
from the sync source to contain an opaque token which indicates our current position in the
collection. After storing a batch of data, we store the most recent resume token in a member
variable of the `CollectionCloner`. Then, when retrying we provide this resume token in the query,
allowing us to avoid having to re-fetch the parts of the collection we have already stored.
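
The retry-with-resume-token behavior can be pictured with the sketch below. The `Batch` type, the `fetch` callback, and the numeric retry budget are hypothetical stand-ins for the real `CollectionCloner` machinery, whose retry budget is actually the `initialSyncTransientErrorRetryPeriodSeconds` time window.

```cpp
// Sketch of resuming a collection query from the last stored resume token.
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

struct Batch {
    std::vector<std::string> docs;
    std::string resumeToken;  // opaque position returned with every batch
    bool exhausted = false;
};

// fetch(resumeToken) asks the sync source for the next batch starting at the
// given position; an empty token means "start from the beginning".
std::vector<std::string> cloneCollection(const std::function<Batch(const std::string&)>& fetch,
                                         int maxTransientRetries) {
    std::vector<std::string> stored;
    std::string lastResumeToken;  // last token whose batch has already been stored
    int retriesLeft = maxTransientRetries;
    while (true) {
        Batch batch;
        try {
            batch = fetch(lastResumeToken);
        } catch (const std::runtime_error&) {
            // Transient error: re-issue the query from the stored resume token
            // rather than restarting the collection from scratch.
            if (--retriesLeft < 0)
                throw;  // considered a permanent failure
            continue;
        }
        retriesLeft = maxTransientRetries;
        stored.insert(stored.end(), batch.docs.begin(), batch.docs.end());
        lastResumeToken = batch.resumeToken;
        if (batch.exhausted)
            return stored;
    }
}

int main() {
    int calls = 0;
    auto fetch = [&](const std::string& token) -> Batch {
        ++calls;
        if (calls == 2) throw std::runtime_error("transient network error");
        if (token.empty()) return {{"doc1", "doc2"}, "tok2", false};
        return {{"doc3"}, "tok3", true};
    };
    auto docs = cloneCollection(fetch, /*maxTransientRetries=*/5);
    std::cout << "cloned " << docs.size() << " documents\n";  // 3, none re-fetched
}
```
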
The `initialSyncTransientErrorRetryPeriodSeconds` is also used to control retries for the oplog
`ReplicationCoordinator` starts steady state replication.

#### Code References

- [ReplicationCoordinator starts initial sync if the node is started up without any data](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L887)
- [Follow this flowchart for initial sync call stack.](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/initial_syncer.h#L278)
- [Initial syncer uses AllDatabaseCloner/DatabaseCloner/CollectionCloner to clone data from sync source, where the state transition is defined in runStages().](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/base_cloner.cpp#L268)
- [AllDatabaseCloner creates and runs each DatabaseCloner in its post stage.](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/all_database_cloner.cpp#L263)
- [DatabaseCloner creates and runs each CollectionCloner in its post stage.](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/database_cloner.cpp#L137)
- [InitialSyncer uses RollbackChecker to check if there is a rollback on sync source during initial sync.](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/initial_syncer.cpp#L2014)
- [Set lastApplied OpTime as initialDataTimestamp to storage engine after initial sync finishes.](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/initial_syncer.cpp#L586-L590)
- [Start steady state replication after initial sync completes.](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L847)
# Reconfiguration

MongoDB replica sets consist of a set of members, where a _member_ corresponds to a single
participant of the replica set, identified by a host name and port. We refer to a _node_ as the
mongod server process that corresponds to a particular replica set member. A replica set
_configuration_ [consists](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/repl/repl_set_config.idl#L133-L135) of a list of members in a replica set along with some member specific
settings as well as global settings for the set. We alternately refer to a configuration as a
_config_, for brevity. Each member of the config has a [member
id](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/repl/member_id.h#L42-L45), which is a
unique integer identifier for that member. A config is defined in the
[ReplSetConfig](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/repl/repl_set_config.h#L156)
To update the current configuration, a client may execute the [`replSetReconfig`](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/repl/repl_set_commands.cpp#L424-L426) command with the
new, desired config. Reconfigurations [can be run
](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/repl/repl_set_commands.cpp#L446-L448)in
_safe_ mode or in _force_ mode. We alternately refer to reconfigurations as _reconfigs_, for
brevity. Safe reconfigs, which are the default, can only be run against primary nodes and ensure the
replication safety guarantee that majority committed writes will not be rolled back. Force reconfigs
can be run against either a primary or secondary node and their usage may cause the rollback of
Note that in a static configuration, the safety of the Raft protocol depends on the fact that any
two quorums (i.e. majorities) of a replica set have at least one member in common i.e. they satisfy
the _quorum overlap_ property. For any two arbitrary configurations, however, this is not the case.
So, extra restrictions are placed on how nodes are allowed to move between configurations. First,
all safe reconfigs enforce a [single node
change](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/repl/repl_set_config_checks.cpp#L101-L109)
configuration:

1. **[Config
   Replication](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L4023-L4026)**:
   The current config, C, must be installed on at least a majority of voting nodes in C.
2. **[Oplog
   Commitment](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L4045-L4052)**:
   Any oplog entries that were majority committed in the previous config, C0, must be replicated to at
   least a majority of voting nodes in the current config, C1.

Condition 1 ensures that any configs earlier than C can no longer independently form a quorum to
elect a node or commit a write. Condition 2 ensures that committed writes in any older configs are
now committed by the rules of the current configuration. This guarantees that any leaders elected in
a subsequent configuration will contain these entries in their log upon assuming role as leader.
When both conditions are satisfied, we say that the current config is _committed_.
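
The two conditions can be sketched as follows; the structs are hypothetical stand-ins for the member and config state that the `ReplicationCoordinator` actually tracks.

```cpp
// Sketch of the two checks a primary waits on before a new config is committed.
#include <cstdint>
#include <iostream>
#include <vector>

struct Member {
    bool voting;
    int64_t installedConfigVersion;  // version of the config this node has installed
    uint64_t lastAppliedOpTime;      // newest oplog entry applied by this node
};

static bool isMajority(int count, int votingTotal) {
    return count > votingTotal / 2;
}

// Condition 1: the current config C is installed on a majority of voting nodes in C.
bool configReplicatedToMajority(const std::vector<Member>& members, int64_t currentVersion) {
    int voting = 0, haveConfig = 0;
    for (const auto& m : members) {
        if (!m.voting) continue;
        ++voting;
        if (m.installedConfigVersion >= currentVersion) ++haveConfig;
    }
    return isMajority(haveConfig, voting);
}

// Condition 2: writes committed in the previous config are replicated to a
// majority of voting nodes in the current config.
bool oplogCommittedInCurrentConfig(const std::vector<Member>& members, uint64_t prevCommitPoint) {
    int voting = 0, caughtUp = 0;
    for (const auto& m : members) {
        if (!m.voting) continue;
        ++voting;
        if (m.lastAppliedOpTime >= prevCommitPoint) ++caughtUp;
    }
    return isMajority(caughtUp, voting);
}

int main() {
    std::vector<Member> current = {{true, 2, 105}, {true, 2, 100}, {true, 1, 90}};
    bool committed = configReplicatedToMajority(current, /*currentVersion=*/2) &&
                     oplogCommittedInCurrentConfig(current, /*prevCommitPoint=*/100);
    std::cout << (committed ? "config is committed\n" : "still waiting\n");
}
```
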
We wait for both of these conditions to become true at the
[beginning](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/repl/repl_set_commands.cpp#L450-L466)
pair, where `term` is compared first, and then `version`, analogous to the rules for optime
comparison. The `term` of a config is the term of the primary that originally created that config,
and the `version` is a [monotonically increasing number](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L4112) assigned to each config. When executing a
reconfig, the version of the new config must be greater than the version of the current config. If
the `(version, term)` pair of config A is greater than that of config B, then it is considered
"newer" than config B. If a node hears about a newer config via a heartbeat from another node, it
will [schedule a
to the top of the oplog (the latest entry in the oplog) and start steady state replication.

## Recover from Unstable Checkpoint

We may not have a recovery timestamp if we need to recover from an **unstable checkpoint**. MongoDB
takes unstable checkpoints by setting the [`initialDataTimestamp`](#replication-timestamp-glossary)
to the `kAllowUnstableCheckpointsSentinel`. Recovery from an unstable checkpoint replays the oplog
inconsistent.

#### Timestamps related to both prepared and non-prepared transactions:

- **`prepareTimestamp`**: The timestamp of the ‘prepare’ oplog entry for a prepared transaction. This
  is the earliest timestamp at which it is legal to commit the transaction. This timestamp is provided
  to the storage engine to block reads that are trying to read prepared data until the storage engine
  knows whether the prepared transaction has committed or aborted.

- **`commit oplog entry timestamp`**: The timestamp of the ‘commitTransaction’ oplog entry for a
  prepared transaction, or the timestamp of the ‘applyOps’ oplog entry for a non-prepared transaction.
  In a cross-shard transaction each shard may have a different commit oplog entry timestamp. This is
  guaranteed to be greater than the `prepareTimestamp`. When the `stable_timestamp` advances to this
  point, the transaction can’t be rolled-back; hence, it is referred to as the transaction's
  `durable_timestamp` in [WT](https://source.wiredtiger.com/develop/timestamp_txn_api.html).

- **`commitTimestamp`**: The timestamp at which we committed a multi-document transaction, referred
  to as `commit_timestamp` in [WT](https://source.wiredtiger.com/develop/timestamp_txn_api.html). This will
  be the `commitTimestamp` field in the `commitTransaction` oplog entry for a prepared transaction, or
  the timestamp of the ‘applyOps’ oplog entry for a non-prepared transaction. In a cross-shard
  transaction this timestamp is the same across all shards. The effects of the transaction are visible
  as of this timestamp. Note that `commitTimestamp` and the `commit oplog entry timestamp` are the
  same for non-prepared transactions because we do not write down the oplog entry until we commit the
  transaction. For a prepared transaction, we have the following guarantee: `prepareTimestamp` <=
  `commitTimestamp` <= `commit oplog entry timestamp`
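
A tiny illustration of that ordering guarantee, using a hypothetical struct:

```cpp
// Sketch of the prepared-transaction timestamp ordering invariant.
#include <cassert>
#include <cstdint>

struct PreparedTxnTimestamps {
    uint64_t prepareTimestamp;
    uint64_t commitTimestamp;
    uint64_t commitOplogEntryTimestamp;

    bool valid() const {
        return prepareTimestamp <= commitTimestamp &&
               commitTimestamp <= commitOplogEntryTimestamp;
    }
};

int main() {
    PreparedTxnTimestamps ts{/*prepare=*/10, /*commit=*/12, /*commitOplogEntry=*/15};
    assert(ts.valid());
    // For a non-prepared transaction, commitTimestamp and the commit oplog entry
    // timestamp are the same value, so the same check holds trivially.
}
```
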
# Non-replication subsystems dependent on replication state transitions.

The PrimaryOnlyService interface is more sophisticated than the ReplicaSetAwareService interface and
is designed specifically for services built on persistent state machines that must be driven to
conclusion by the Primary node of the replica set, even across failovers. Check out [this
document](../../../../docs/primary_only_service.md) for more information about PrimaryOnlyServices.
# Sharding Architecture Guide

This page contains details of the source code architecture of the MongoDB Sharding system. It is intended to be used by engineers working on the core server, with some sections containing low-level details which are most appropriate for new engineers on the sharding team.

It is not intended to be a tutorial on how to operate sharding as a user and it requires that the reader is already familiar with the general concepts of [sharding](https://docs.mongodb.com/manual/sharding/#sharding), the [architecture of a MongoDB sharded cluster](https://docs.mongodb.com/manual/sharding/#sharded-cluster), and the concept of a [shard key](https://docs.mongodb.com/manual/sharding/#shard-keys).

## Sharding terminology and acronyms

- Config Data: All the [catalog containers](README_sharding_catalog.md#catalog-containers) residing on the CSRS.
- Config Shard: Same as CSRS.
- CRUD operation: Comes from [Create, Read, Update, Delete](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete), and indicates operations which modify a collection's data as opposed to the catalog.
- CSRS: **C**onfig **S**erver as a **R**eplica **S**et. This is a fancy name for the [config server](https://www.mongodb.com/docs/manual/core/sharded-cluster-config-servers/). Comes from the times of version 3.2 and earlier, when there was a legacy type of Config server called [SCCC](https://www.mongodb.com/docs/manual/release-notes/3.4-compatibility/#removal-of-support-for-sccc-config-servers) which didn't operate as a replica set.
- CSS: [Collection Sharding State](https://github.com/mongodb/mongo/blob/master/src/mongo/db/s/collection_sharding_state.h#L59)
- DDL operation: Comes from [Data Definition Language](https://en.wikipedia.org/wiki/Data_definition_language), and indicates operations which modify the catalog (e.g., create collection, create index, drop database) as opposed to CRUD, which modifies the data.
- DSS: [Database Sharding State](https://github.com/mongodb/mongo/blob/master/src/mongo/db/s/database_sharding_state.h#L42)
- Routing Info: The subset of data stored in the [catalog containers](README_sharding_catalog.md#catalog-containers) which is used for making routing decisions. As of the time of this writing, the contents of _config.databases_, _config.collections_, _config.indexes_ and _config.chunks_.
- SS: [Sharding State](https://github.com/mongodb/mongo/blob/master/src/mongo/s/sharding_state.h#L51)

## Sharding code architecture

The graph further down visualises the architecture of the MongoDB Sharding system and the relationships between its various components and the links below point to the relevant sections describing these components.

- [Sharding catalog](README_sharding_catalog.md#sharding-catalog)
- [Router role](README_sharding_catalog.md#router-role)
- [Shard role](README_sharding_catalog.md#router-role)
- [Routing Info Consistency Model](README_routing_info_cache_consistency_model.md)
- [Shard versioning protocol](README_versioning_protocols.md)
- [Balancer](README_balancer.md)
- [Range deleter](README_range_deleter.md)
- [DDL Operations](README_ddl_operations.md)
- [Migrations](README_migrations.md)
- [UserWriteBlocking](README_user_write_blocking.md)
- [Sessions and Transactions](README_sessions_and_transactions.md)
- [Startup and Shutdown](README_startup_and_shutdown.md)
- [Query Sampling for Shard Key Analyzer](README_analyze_shard_key.md)
and the user's query patterns.
It returns two kinds of metrics called `keyCharacteristics` and `readWriteDistribution`:

- `keyCharacteristics` consists of metrics about the cardinality, frequency and monotonicity
  of the shard key, calculated based on documents sampled from the collection.
- `readWriteDistribution` consists of the metrics about query routing patterns
  and the hotness of shard key ranges, calculated based on sampled queries.

The remainder of this document describes how queries are sampled in order to report
`readWriteDistribution` metrics.

## How the Query Analyzer Works
In a sharded cluster,
the overall `samplesPerSecond` configured via the `configureQueryAnalyzer` command
is divided among mongoses
proportional to the number of queries that each mongos handles,
so a mongos that handles proportionally more queries than other mongoses
also samples proportionally more frequently.
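
The proportional split can be sketched as below; the struct and the numbers are illustrative and this is not the server's actual `QueryAnalysisSampler` implementation.

```cpp
// Sketch: split an overall samplesPerSecond among routers in proportion to the
// query load each one handles.
#include <iostream>
#include <string>
#include <vector>

struct Router {
    std::string name;
    double avgQueriesPerSecond;  // e.g. a moving average of queries handled by this router
};

std::vector<double> splitSampleRate(const std::vector<Router>& routers, double samplesPerSecond) {
    double total = 0;
    for (const auto& r : routers) total += r.avgQueriesPerSecond;
    std::vector<double> rates;
    for (const auto& r : routers) {
        double share = total > 0 ? r.avgQueriesPerSecond / total : 1.0 / routers.size();
        rates.push_back(samplesPerSecond * share);
    }
    return rates;
}

int main() {
    std::vector<Router> routers = {{"mongos0", 300}, {"mongos1", 100}};
    auto rates = splitSampleRate(routers, /*samplesPerSecond=*/4.0);
    for (size_t i = 0; i < routers.size(); ++i)
        std::cout << routers[i].name << " samples/sec: " << rates[i] << "\n";
    // mongos0 gets 3.0 and mongos1 gets 1.0: the busier router samples more often.
}
```
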
in a similar way: computing an exponential moving average of number of samples and updating
its sample rate from the primary.

### Recording and Persisting a Sampled Query

Recording and persisting of sampled queries is designed to minimize the performance impact of

Sampled queries have a TTL of seven days from the insertion time
(configurable by server parameter `sampledQueriesExpirationSeconds`).

### Code references

[**QueryAnalysisSampler class**](https://github.com/10gen/mongo/blob/a1e2e227762163d8fc405f56708d41ec3d34b550/src/mongo/s/query_analysis_sampler.h#L58)
# Balancer
The [balancer](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/s/balancer/balancer.h) is an application built on top of sharding that monitors the data distribution of sharded collections and issues commands to improve that distribution. It is enabled by default, and can be disabled and reenabled for the entire cluster or per-collection at any time. It can also be configured to run during specific hours each day, a period called the balancing window.
The balancer runs as a background daemon of the config server primary, and consists of two separate threads which issue requests generated by action policies. During their executions, each thread will query the different balancer policies to determine which operations to run. The three action policies are the [ChunkSelectionPolicy](#chunkselectionpolicy), which handles normal collection balancing actions, the [DefragmentationPolicy](#defragmentationpolicy), which handles collection defragmentation, and the [AutoMergerPolicy](#automergerpolicy), which handles merging of contiguous chunks.

## MainThread

The [main thread](https://github.com/mongodb/mongo/blob/6e25f79b5a12ebcb768297e15bf5cd9775a48d48/src/mongo/db/s/balancer/balancer.cpp#L695-L934) runs continuously but in "rounds" with [a delay](https://github.com/mongodb/mongo/blob/6e25f79b5a12ebcb768297e15bf5cd9775a48d48/src/mongo/db/s/balancer/balancer.cpp#L83-L89) between each round. This thread is responsible for issuing splits generated by the chunk selection policy as well as migrations from both the chunk selection and defragmentation policies. Its operation cycle looks like the following.
During each balancer round, the main thread will dispatch a batch of splits and wait for them all to complete, and then it will dispatch a batch of migrations and wait for them to complete. All of the commands issued during the round will be waited for before the round ends. Any encountered errors are logged, but otherwise ignored by the main thread.

### Jumbo chunks

If the main thread issues a migration for a chunk that is larger than twice the maximum chunk size, the migration will fail telling the balancer that the chunk is too large to be migrated. In this case, the main thread will try to split the chunk. If this fails (due to there being too many of the same shard key value in the chunk), the chunk will be marked as **jumbo**. This flag is used by the chunk selection policy to avoid selecting large chunks for migration in the future.
A jumbo chunk may still be migrated off a shard when a `moveChunk/moveRange` command is issued with `forceJumbo: true`. Users can take advantage of this option to manually redistribute jumbo chunks across a cluster.
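
A condensed sketch of this handling, with hypothetical helpers standing in for the real migration and split machinery:

```cpp
// Sketch: a failed migration of an oversized chunk triggers a split attempt,
// and a chunk that cannot be split is flagged as jumbo and skipped later.
#include <iostream>

struct Chunk {
    long long dataSizeBytes;
    bool jumbo = false;
};

// Stand-ins for the real migration and split commands.
bool tryMigrate(const Chunk& c, long long maxChunkSizeBytes) {
    return c.dataSizeBytes <= 2 * maxChunkSizeBytes;  // "too large to migrate" otherwise
}
bool trySplit(const Chunk&) {
    // Fails when, for example, every document shares the same shard key value.
    return false;
}

void handleMigration(Chunk& chunk, long long maxChunkSizeBytes) {
    if (tryMigrate(chunk, maxChunkSizeBytes))
        return;  // migration succeeded
    if (!trySplit(chunk)) {
        chunk.jumbo = true;  // chunk selection will skip it in future rounds
        std::cout << "chunk marked as jumbo\n";
    }
}

int main() {
    Chunk chunk{/*dataSizeBytes=*/400LL * 1024 * 1024};
    handleMigration(chunk, /*maxChunkSizeBytes=*/128LL * 1024 * 1024);
}
```
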

## Secondary Thread

The [secondary thread](https://github.com/mongodb/mongo/blob/6e25f79b5a12ebcb768297e15bf5cd9775a48d48/src/mongo/db/s/balancer/balancer.cpp#L504-L693) waits on a condition variable and needs to be signaled by some other process before it will start working - usually this other process is a client thread or the main balancer thread. The secondary thread is responsible for issuing the non-migration commands from the defragmentation policy (such as merge and datasize) and auto-merger policy (such as mergeAllChunksOnShard). When both policies are active, the secondary thread will randomly pick commands either from the defragmentation policy or from the auto merger policy.
Contrary to the main thread, the secondary thread dispatches server commands asynchronously. It is possible to have up to 50 outstanding operations on the secondary thread at any time, though there are configurable parameters that introduce a wait between scheduling operations in order to reduce the impact of the issued commands on the rate of catalog cache refreshes in secondary nodes.
The default value of the [defragmentation throttling parameter](https://github.com/mongodb/mongo/blob/6e25f79b5a12ebcb768297e15bf5cd9775a48d48/src/mongo/db/s/sharding_config_server_parameters.idl#L39-L48) is 1 second, meaning that the issuing of two different defragmentation actions will be spaced 1 second apart at least.
The default value of the [auto merger throttling parameter](https://github.com/mongodb/mongo/blob/d0a76b2b3ea3f5800cb9023fbb83830288569f8f/src/mongo/db/s/sharding_config_server_parameters.idl#L57) is 15 seconds, meaning that the issuing of two different auto-merge actions for the same collection will be spaced 15 seconds apart at least.

## ClusterStatistics and CollectionDataSizeInfoForBalancing

All of the migration decisions made by the two balancer policies require information about cluster data distribution. Each balancer policy has a reference to the [ClusterStatistics](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/s/balancer/cluster_statistics.h) which is an interface that allows the policy to obtain the data distribution and shard utilization statistics for the cluster. This includes information about which chunks are owned by which shards, the zones defined for the collection, and which chunks are a part of which zones. The chunk selection policy also needs more specific information about the data size for each collection on each shard and uses the [CollectionDataSizeInfoForBalancing](https://github.com/mongodb/mongo/blob/6e25f79b5a12ebcb768297e15bf5cd9775a48d48/src/mongo/db/s/balancer/balancer_policy.h#L202-L212) to track this information.

## ChunkSelectionPolicy

The [chunk selection policy](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/s/balancer/balancer_chunk_selection_policy.h) is responsible for generating operations to maintain the balance of sharded collections within the cluster. A collection is considered balanced when all shards have an equal amount of its data.

### Splits

The chunk selection policy will generate split requests if a chunk crosses zone boundaries. The split command will create a smaller chunk whose min and max values are equal to the zone boundaries.

### Migrations

The chunk selection policy will look through all of the sharded collections to create a list of migrations for the balancer to issue in order to distribute data more evenly across shards.

For each collection, the policy will select all eligible ranges to be migrated, prioritizing them with the following criteria (a simplified sketch in code follows the list):

1. If any range is owned by a shard that is draining (being removed), select that range.
2. If any ranges are violating zones (on the wrong shard), select that range.
3. If neither of the above is true, the policy will select ranges to move in order to obtain an even amount of data per shard. It will look at the most overloaded shard (the one with the most data) and the least loaded shard (with the least data) and will select a range from the larger shard if the difference in data size between them is greater than 3 times the max chunk size.
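
The simplified sketch referenced above, with hypothetical structs in place of the real chunk selection policy types:

```cpp
// Sketch of the selection priorities: draining shards, then zone violations,
// then data-size imbalance greater than 3x the max chunk size.
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct RangeInfo {
    std::string name;
    bool ownerIsDraining = false;
    bool violatesZone = false;
    long long ownerShardDataSize = 0;  // data size of the shard currently owning the range
};

std::optional<RangeInfo> selectRangeToMove(const std::vector<RangeInfo>& ranges,
                                           long long leastLoadedShardDataSize,
                                           long long maxChunkSize) {
    for (const auto& r : ranges)
        if (r.ownerIsDraining) return r;  // 1. ranges on draining shards first
    for (const auto& r : ranges)
        if (r.violatesZone) return r;     // 2. then ranges violating zones
    // 3. otherwise move data off the most overloaded shard if the imbalance is
    //    larger than 3 times the max chunk size.
    const RangeInfo* best = nullptr;
    for (const auto& r : ranges)
        if (!best || r.ownerShardDataSize > best->ownerShardDataSize) best = &r;
    if (best && best->ownerShardDataSize - leastLoadedShardDataSize > 3 * maxChunkSize)
        return *best;
    return std::nullopt;
}

int main() {
    std::vector<RangeInfo> ranges = {{"a", false, false, 900}, {"b", false, true, 100}};
    auto pick = selectRangeToMove(ranges, /*leastLoadedShardDataSize=*/50, /*maxChunkSize=*/100);
    if (pick) std::cout << "move range " << pick->name << "\n";  // picks "b" (zone violation)
}
```
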
The chunk selection policy will submit as many migrations as it can during each balancing round. However, the defragmentation policy has priority when it comes to issuing migrations, so the chunk selection policy will not be able to schedule migrations on any shards that are migrating data for defragmentation.

## DefragmentationPolicy

Collection defragmentation is the process of merging as many chunks as possible in a collection with the goal of reducing the overall number of chunks in a collection. Chunks are the subdivisions of user data stored in the sharding routing table and used to route commands, so reducing the number of chunks in a collection shrinks the routing information and makes sharding refreshes faster. The [defragmentation policy](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/s/balancer/balancer_defragmentation_policy.h) is responsible for generating commands throughout the defragmentation process. Defragmentation consists of three phases - MergeAndMeasureChunks, MoveAndMergeChunks, and MergeChunks. Once a user initiates defragmentation for a collection, that collection will only be considered by the defragmentation policy and will not undergo standard balancing while defragmentation is running.

### MergeAndMeasureChunksPhase

The first phase of defragmentation consists of merge and datasize commands for all chunks in the collection. On each shard, the defragmentation policy will generate a merge request for each set of consecutive chunks on the shard. After each merge completes, a datasize command will be generated for the chunk that was just created. The datasize values are persisted to the config.chunks entries to be used by the second phase of defragmentation, and they are cleaned up when defragmentation finishes. The merge and datasize commands are issued by the balancer secondary thread, which will issue the requests in parallel, scheduling a callback to notify the defragmentation policy of the outcome.

### MoveAndMergeChunksPhase

The second phase of defragmentation is the most complex. Using the datasizes calculated in phase 1, the second phase will create migration requests for any chunk that is less than 25% of the collection max chunk size. Each so-called small chunk will be migrated to a shard that contains a chunk consecutive with it. If there are two such shards, the potential recipients will be given a score based on the following criteria, in order of importance (a simplified scoring sketch follows the list):
1. Is this recipient shard the current shard this chunk is on?
2. Is this chunk smaller than the one it will be merged with on the recipient shard?
3. Will merging this chunk with the one on the recipient make the resulting chunk big enough that it is no longer a small chunk?
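
The scoring sketch referenced above. The structs and the weights are illustrative only; the real logic lives in the defragmentation policy.

```cpp
// Sketch: score candidate recipients for a small chunk; higher score wins.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Candidate {
    std::string shardName;
    bool isCurrentShard;          // criterion 1: chunk already lives on this shard
    long long neighborChunkSize;  // size of the consecutive chunk it would merge with
};

int scoreCandidate(const Candidate& c, long long smallChunkSize, long long smallChunkThreshold) {
    // Illustrative weights so that earlier criteria dominate later ones.
    int score = 0;
    if (c.isCurrentShard) score += 4;                                              // criterion 1
    if (smallChunkSize < c.neighborChunkSize) score += 2;                          // criterion 2
    if (smallChunkSize + c.neighborChunkSize >= smallChunkThreshold) score += 1;   // criterion 3
    return score;
}

int main() {
    long long smallChunkSize = 10, threshold = 32;  // threshold ~ 25% of the max chunk size
    std::vector<Candidate> candidates = {{"shard0", true, 8}, {"shard1", false, 30}};
    auto best = std::max_element(candidates.begin(), candidates.end(),
                                 [&](const Candidate& a, const Candidate& b) {
                                     return scoreCandidate(a, smallChunkSize, threshold) <
                                            scoreCandidate(b, smallChunkSize, threshold);
                                 });
    std::cout << "merge towards " << best->shardName << "\n";
}
```
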
After the migration has been completed by the main balancer thread, the defragmentation policy will create a merge request for the migrated chunk and the consecutive chunk it was moved to meet. This merge action will be executed by the secondary balancer thread. This process will continue until all chunks are bigger than 25% of the max chunk size.

### MergeChunksPhase

The last phase of defragmentation is very similar to the first phase but without issuing datasize commands for the chunks. This phase will issue merge requests for all consecutive chunks on all shards. As with the first phase, all of these commands will be issued by the balancer secondary thread.

### Error Handling

There are two classes of errors during defragmentation, retriable and non-retriable errors. When an operation generated by the defragmentation policy hits a retriable error, the policy will issue the same operation again until it succeeds. For a non-retriable error, defragmentation will restart execution, returning to the beginning of a phase. For both the MergeAndMeasureChunksPhase and the MergeChunksPhase, defragmentation will restart from the beginning of that phase. For the MoveAndMergeChunksPhase, defragmentation will restart from the MergeAndMeasureChunksPhase.

## AutoMergerPolicy

Starting from v7.0, sharding has a component called auto-merger that is periodically scanning through chunks in order to spot [mergeable](#definition-of-mergeable-chunks) ones and squash them together. Unless explicitly disabled, the auto-merger periodically checks if there are mergeable chunks every [autoMergerIntervalSecs](https://github.com/mongodb/mongo/blob/d0a76b2b3ea3f5800cb9023fbb83830288569f8f/src/mongo/db/s/sharding_config_server_parameters.idl#L50) - a configurable parameter defaulted to 1 hour - and eventually starts issuing `mergeAllChunksOnShard` actions via the balancer secondary thread.
The algorithm implemented by the auto-merger policy can be summarized as follows:

Formally, two or more contiguous **non-jumbo** chunks are then required to fulfill the following conditions to be merged (a sketch of the check follows the conditions):

- Have never been migrated.
**OR**
- The last migration involving either of them has happened more than [minSnapshotHistoryWindowInSeconds](https://www.mongodb.com/docs/manual/reference/parameters/#mongodb-parameter-param.minSnapshotHistoryWindowInSeconds) ago **AND** more than [transactionLifetimeLimitSeconds](https://www.mongodb.com/docs/manual/reference/parameters/#mongodb-parameter-param.transactionLifetimeLimitSeconds) ago.
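
A sketch of this check with hypothetical types; it assumes the caller already verified that the two chunks are contiguous and belong to the same collection and shard, and the parameter values are illustrative.

```cpp
// Sketch of the mergeability check for two contiguous chunks.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <optional>

struct ChunkInfo {
    bool jumbo = false;
    std::optional<uint64_t> lastMigrationUnixSecs;  // empty history == never migrated
};

bool mergeable(const ChunkInfo& a, const ChunkInfo& b, uint64_t nowUnixSecs,
               uint64_t minSnapshotHistoryWindowInSeconds,
               uint64_t transactionLifetimeLimitSeconds) {
    if (a.jumbo || b.jumbo)
        return false;
    // The AND of the two windows means the last migration must be older than both.
    uint64_t quietPeriod =
        std::max(minSnapshotHistoryWindowInSeconds, transactionLifetimeLimitSeconds);
    auto oldEnough = [&](const ChunkInfo& c) {
        return !c.lastMigrationUnixSecs ||                             // never migrated, or
               nowUnixSecs - *c.lastMigrationUnixSecs > quietPeriod;   // migrated long enough ago
    };
    return oldEnough(a) && oldEnough(b);
}

int main() {
    ChunkInfo a;                // never migrated
    ChunkInfo b{false, 1'000};  // last migrated at t=1000
    std::cout << std::boolalpha
              << mergeable(a, b, /*now=*/2'000, /*minSnapshotHistoryWindowInSeconds=*/300,
                           /*transactionLifetimeLimitSeconds=*/60)
              << "\n";  // true: the last migration is older than both windows
}
```
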

#### Example

_The following example assumes that history is empty for all chunks and no chunk is flagged as jumbo, so all contiguous intervals belonging to the same shard are mergeable._

Let's consider the following portion of routing table representing chunks belonging to the collection **db.coll** with shard key **x**:

| CHUNK | MIN   | MAX   | SHARD  |
| ----- | ----- | ----- | ------ |
| A     | x: 0  | x: 10 | Shard0 |
| B     | x: 10 | x: 20 | Shard0 |
| C     | x: 20 | x: 30 | Shard0 |
| D     | x: 30 | x: 40 | Shard0 |
| E     | x: 40 | x: 50 | Shard1 |
| F     | x: 50 | x: 60 | Shard1 |
| G     | x: 60 | x: 70 | Shard0 |
| H     | x: 70 | x: 80 | Shard0 |
| I     | x: 80 | x: 90 | Shard1 |

When the auto-merger runs, mergeable chunks are squashed and the final result looks like the following:

| CHUNK   | MIN   | MAX   | SHARD  |
| ------- | ----- | ----- | ------ |
| A-B-C-D | x: 0  | x: 40 | Shard0 |
| E-F     | x: 40 | x: 60 | Shard1 |
| G-H     | x: 60 | x: 80 | Shard0 |
| I       | x: 80 | x: 90 | Shard1 |

# DDL Operations

On the Sharding team, we use the term _DDL_ to mean any operation that needs to update any subset of [catalog containers](README_sharding_catalog.md#catalog-containers). Within this definition, there are standard DDLs that use the DDL coordinator infrastructure as well as non-standard DDLs that each have their own implementations.

## Standard DDLs
Most DDL operations are built upon the DDL coordinator infrastructure which provides some [retriability](#retriability), [synchronization](#synchronization), and [recoverability](#recovery) guarantees.

Each of these operations has a _coordinator_ - a node that drives the execution of the operation. In [most operations](https://github.com/mongodb/mongo/blob/e61bf27c2f6a83fed36e5a13c008a32d563babe2/src/mongo/db/s/sharding_ddl_coordinator_service.cpp#L60-L120), this coordinator is the database primary, but in [a few others](https://github.com/mongodb/mongo/blob/e61bf27c2f6a83fed36e5a13c008a32d563babe2/src/mongo/db/s/config/configsvr_coordinator_service.cpp#L75-L94) the coordinator is the CSRS. These coordinators extend either the [RecoverableShardingDDLCoordinator class](https://github.com/mongodb/mongo/blob/9fe03fd6c85760920398b7891fde74069f5457db/src/mongo/db/s/sharding_ddl_coordinator.h#L266) or the [ConfigSvrCoordinator class](https://github.com/mongodb/mongo/blob/9fe03fd6c85760920398b7891fde74069f5457db/src/mongo/db/s/config/configsvr_coordinator.h#L47), which make up the DDL coordinator infrastructure.

The diagram below shows a simplified example of a DDL operation's execution. The coordinator can be one of the shards or the config server, and the commands sent to that node will just be applied locally.
The checkpoints are majority write concern updates to a persisted document on the coordinator. This document - called a state document - contains all the information about the running operation including the operation type, namespaces involved, and which checkpoint the operation has reached. An initial checkpoint must be the first thing any coordinator does in order to ensure that the operation will continue to run even [in the presence of failovers](#recovery). Subsequent checkpoints allow retries to skip phases that have already been completed.

### Retriability

Most DDL operations must complete after they have started. An exception to this is often a _CheckPreconditions_ phase at the beginning of a coordinator in which the operation will check some conditions and will be allowed to exit if these conditions are not met. After this, however, the operation will continue to retry until it succeeds. This is because the updates to the sharding metadata would cause inconsistencies if the critical section were released partially through the operation. For this reason, DDL operations should not throw non-retriable errors after the initial phase of checking preconditions.

### Synchronization

DDL operations are serialized on the coordinator by acquisition of the DDL locks, handled by the [DDL Lock Manager](https://github.com/mongodb/mongo/blob/r7.0.0-rc7/src/mongo/db/s/ddl_lock_manager.h). DDL locks are local to the coordinator and only in memory, so they must be reacquired during [recovery](#recovery).
The DDL locks follow a [multiple granularity hierarchical approach](https://en.wikipedia.org/wiki/Multiple_granularity_locking), which means DDL locks must be acquired in a specific order using the [intentional locking protocol](https://en.wikipedia.org/wiki/Multiple_granularity_locking#:~:text=MGL%20also%20uses%20intentional%20%22locks%22) appropriately. With that, we ensure that a DDL operation acting over a whole database serializes with another DDL operation targeting a collection from that database, and, at the same time, two DDL operations targeting different collections can run concurrently.
Every DDL lock resource should be taken in the following order:
1. DDL Database lock
2. DDL Collection lock
Therefore, if a DDL operation needs to update collection metadata, a DDL lock will be acquired first on the database in IX mode and then on the collection in X mode. On the other hand, if a DDL operation only updates the database metadata (like dropDatabase), only the DDL lock on the database will be taken in X mode.
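
A sketch of the ordering rule with a hypothetical lock manager type (the real one is the DDL Lock Manager linked above):

```cpp
// Sketch: the database DDL lock is always taken before the collection DDL lock,
// with an intent mode (IX) at the database level for collection-scoped DDLs.
#include <iostream>
#include <string>

enum class LockMode { IX, X };

struct DDLLockManager {
    void lockDatabase(const std::string& db, LockMode mode) {
        std::cout << "lock db " << db << (mode == LockMode::X ? " [X]\n" : " [IX]\n");
    }
    void lockCollection(const std::string& ns, LockMode mode) {
        std::cout << "lock coll " << ns << (mode == LockMode::X ? " [X]\n" : " [IX]\n");
    }
};

// A collection-level DDL (e.g. create collection) takes the database lock in IX
// and then the collection lock in X; a database-level DDL (e.g. dropDatabase)
// only takes the database lock in X.
void lockForCollectionDDL(DDLLockManager& mgr, const std::string& db, const std::string& coll) {
    mgr.lockDatabase(db, LockMode::IX);
    mgr.lockCollection(db + "." + coll, LockMode::X);
}

void lockForDatabaseDDL(DDLLockManager& mgr, const std::string& db) {
    mgr.lockDatabase(db, LockMode::X);
}

int main() {
    DDLLockManager mgr;
    lockForCollectionDDL(mgr, "test", "orders");  // serializes with dropDatabase("test")
    lockForDatabaseDDL(mgr, "other");             // but not with DDLs on other databases
}
```
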

Some operations also acquire additional DDL locks, such as renameCollection, which will acquire the database and collection DDL locks for the target namespace after acquiring the DDL locks on the source collection.

Finally, at the end of the operation, all of the locks are released in reverse order.

### Recovery

DDL coordinators are resilient to elections and sudden crashes because they are implemented as [primary only services](https://github.com/mongodb/mongo/blob/r6.0.0/docs/primary_only_service.md#primaryonlyservice) that - by definition - get automatically resumed when the node of a shard steps up.
When a new primary node is elected, the DDL primary only service is rebuilt, and any ongoing coordinators will be restarted based on their persisted state document. During this recovery phase, any new requests for DDL operations are put on hold, waiting for existing coordinators to be re-instantiated to avoid conflicts with the DDL locks.

### Sections about specific standard DDL operations

- [User write blocking](README_user_write_blocking.md)

## Non-Standard DDLs

Some DDL operations do not follow the structure outlined in the section above. These operations are [chunk migration](README_migrations.md), resharding, and refine collection shard key. There are also other operations such as add and remove shard that do not modify the sharding catalog but do modify local metadata and need to coordinate with DDL operations. These operations also do not use the DDL coordinator infrastructure, but they do take the DDL lock to synchronize with other DDLs.
Both chunk migration and resharding have to copy user data across shards. This is too time intensive to happen entirely while holding the collection critical section, so these operations have separate machinery to transfer the data and commit the changes. These commands do not commit transactionally across the shards and the config server, rather they commit on the config server and rely on shards pulling the updated commit information from the config server after learning via a router that there is new information. They also do not have the same requirement as standard DDL operations that they must complete after starting except after entering their commit phases.
Refine shard key commits only on the config server, again relying on shards to pull updated information from the config server after hearing about this more recent information from a router. In this case, this was done not because of the cost of transferring data, but so that refine shard key did not need to involve the shards. This allows the refineShardKey command to run quickly and not block operations.

### Sections explaining specific non-standard DDL operations

- [Chunk Migration](README_migrations.md)
# Migrations
The migration procedure allows data for sharded collections to be moved from one shard to another. This procedure is normally initiated by the balancer, but can also be triggered manually by the user by calling the `moveRange` command. MoveRange can move a range of data within existing data boundaries, or it can perform a split and a move if called on a sub-range of data.
Migrations involve the config server and two shards. The first shard is the donor or source shard, which currently owns the range being moved. The second shard is the recipient or destination shard, which is the shard the range is being moved to.
### Synchronization

A shard can only be a donor or a recipient at any given time, not both. Because of this there can only be a single migration between a pair of shards at a time. This synchronization is handled by the [ActiveMigrationsRegistry](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/s/active_migrations_registry.h). The migration is registered on the donor and on the recipient at the beginning of the migration, and no other migrations can begin while another migration is registered. This registration is done by persisting a document describing the migration with write concern majority, ensuring that there will be no conflicting migrations even in the case of failover.
Most DDL operations require that the set of shards that comprise a database or collection remain fixed for the duration of the DDL. They achieve this using the [setAllowMigrations](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/s/config/configsvr_set_allow_migrations_command.cpp) command, which stops any new migrations from starting in a specific collection and then waits for all ongoing migrations to complete, aborting any that haven't reached the commit stage yet.

### Recovery

If a change in primary occurs on a shard that is acting as the donor or recipient of a migration, the migration will be recovered on step up by the new primary. The registration of the migration on the donor and recipient at the beginning of the migration persists a migration task document. When a new primary steps up, it will begin an asynchronous recovery for the migration documents that it has on disk.
If the migration document already contains the decision for the migration (committed or aborted), the new primary uses the persisted decision to complete the migration. If the decision has not been persisted, the primary will query the config server to see if the migration committed or not. If it has already been committed, the new primary will complete the migration using that decision. Otherwise, the primary will abort the migration.
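
A sketch of that decision, with hypothetical types standing in for the real migration recovery code:

```cpp
// Sketch: step-up recovery decision for a pending migration document.
#include <iostream>
#include <optional>

enum class Decision { kCommitted, kAborted };

struct MigrationDoc {
    std::optional<Decision> persistedDecision;  // present once the outcome was recorded locally
};

// Stand-in for asking the config server whether the migration was committed.
bool configServerSaysCommitted() {
    return false;
}

Decision recoverMigrationOnStepUp(const MigrationDoc& doc) {
    if (doc.persistedDecision)
        return *doc.persistedDecision;  // finish using the recorded outcome
    // No local decision: the config server is the source of truth. If it recorded
    // the commit, finish the migration; otherwise abort it.
    return configServerSaysCommitted() ? Decision::kCommitted : Decision::kAborted;
}

int main() {
    MigrationDoc pending;  // no persisted decision
    std::cout << (recoverMigrationOnStepUp(pending) == Decision::kAborted ? "abort\n" : "commit\n");
}
```
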
|
||||
|
||||
### Session Migration
|
||||
|
||||
The session information for a given collection also needs to be transferred during a migration. This is done in parallel with the user data migration, and has a very similar workflow. The request to start cloning on the recipient shard begins the user data transfer as shown in the diagram above, and it also starts a separate cloning process for the session information. The status polling by the donor to determine when to enter the critical section takes into account both the user data transfer progress and the session data transfer progress.
|
||||
|
||||
## Ownership Filter
|
||||
|
||||
Because the [range deletions](#range-deletions) happen asynchronously, there is a period of time during which documents exist on a shard that doesn't actually own them. This can happen either on the recipient shard while the migration is still ongoing and the ownership has not changed yet, on the donor shard after the migration has committed, or on the recipient after the migration has been aborted. These documents are called orphan documents, and they should not be visible to the user.
|
||||
|
||||
Shards store a copy of the data placement information for each collection in which they own data. This information is called the *filtering metadata* because it is used to filter orphaned documents out of queries. A query obtains a reference to the current filtering metadata for a collection, called the *ownership filter*, by calling CollectionShardingRuntime::getOwnershipFilter(). A cluster timestamp can be specified which will cause an earlier version of the filtering metadata to be returned.
|
||||
|
||||
Shards store a copy of the data placement information for each collection in which they own data. This information is called the _filtering metadata_ because it is used to filter orphaned documents out of queries. A query obtains a reference to the current filtering metadata for a collection, called the _ownership filter_, by calling CollectionShardingRuntime::getOwnershipFilter(). A cluster timestamp can be specified which will cause an earlier version of the filtering metadata to be returned.
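
A minimal sketch of how a query-layer component could apply such a filter is shown below; the `Doc`, `OwnershipFilter` and `keyBelongsToMe` names are hypothetical stand-ins for the real filtering metadata returned by `CollectionShardingRuntime::getOwnershipFilter()`.

```cpp
#include <functional>
#include <vector>

// Hypothetical document with a single integer shard key field.
struct Doc {
    int shardKey;
};

// Hypothetical ownership filter: answers whether this shard owns a shard key
// value at the metadata version the filter was obtained at.
struct OwnershipFilter {
    std::function<bool(int)> keyBelongsToMe;
};

// Query-side sketch: drop documents whose shard key is not owned by this
// shard, i.e. filter out orphan documents.
std::vector<Doc> filterOrphans(const std::vector<Doc>& docs, const OwnershipFilter& filter) {
    std::vector<Doc> owned;
    for (const auto& d : docs) {
        if (filter.keyBelongsToMe(d.shardKey)) {
            owned.push_back(d);
        }
    }
    return owned;
}
```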
|
||||
|
||||
## Range Deletions
|
||||
|
||||
In order to improve the performance of migrations, the deletion of documents from the donor shard after the migration is not part of the migration process. Instead, a document is persisted describing the range, and the [range deleter](https://github.com/mongodb/mongo/blob/r6.2.0/src/mongo/db/s/range_deleter_service.h) - running on shard primary nodes - will clean up the local documents later. This also applies to any documents that were copied to the recipient in the event that the migration failed.
|
||||
|
||||
At the beginning of the migration, before the donor sends the request to begin cloning, it persists a document locally describing the range that is to be deleted along with a pending flag indicating that the range is not yet ready to be deleted. The recipient does likewise after receiving the request to begin cloning but before requesting the first batch from the donor. If the migration succeeds, the range is marked as ready to delete on the donor and the range deletion document is deleted from the recipient. If the migration fails, the range deletion document is deleted from the donor and the range is marked as ready on the recipient. The range deleter service will be notified that the document was marked as ready through the op observer, scheduling the range for deletion as soon as the update marking the range as ready is committed.
|
||||
|
|
|
|||
|
|
@ -1,40 +1,52 @@
|
|||
# Range deletions
|
||||
|
||||
The `config.rangeDeletions` collection is a shard-local internal collection containing a document for each range needing to be eventually cleared up; documents in `config.rangeDeletions` are usually referred to as "range deletion tasks" or "range deleter documents".
|
||||
|
||||
The complete format of range deletion tasks is defined in [range\_deletion\_tasks.idl](https://github.com/mongodb/mongo/blob/master/src/mongo/db/s/range_deletion_task.idl), with the relevant fields being the following:
|
||||
- `collectionUUID`: the UUID of the collection the range belongs to
|
||||
- `range`: the `[min, max)` shard key range to be deleted
|
||||
- `pending`: boolean flag present if the range is not yet ready for deletion
|
||||
The complete format of range deletion tasks is defined in [range_deletion_tasks.idl](https://github.com/mongodb/mongo/blob/master/src/mongo/db/s/range_deletion_task.idl), with the relevant fields being the following:
|
||||
|
||||
- `collectionUUID`: the UUID of the collection the range belongs to
|
||||
- `range`: the `[min, max)` shard key range to be deleted
|
||||
- `pending`: boolean flag present if the range is not yet ready for deletion
|
||||
|
||||
When a migration starts, a `pending` range deletion task is created on both the donor and the recipient side, and these documents are later modified depending on the migration outcome (see the sketch after this list):
|
||||
- Commit: donor range deletion document flagged as ready (remove `pending` flag) AND recipient range deletion document deleted.
|
||||
- Abort: donor range deletion document deleted AND recipient range deletion document flagged as ready (remove `pending` flag).
|
||||
|
||||
- Commit: donor range deletion document flagged as ready (remove `pending` flag) AND recipient range deletion document deleted.
|
||||
- Abort: donor range deletion document deleted AND recipient range deletion document flagged as ready (remove `pending` flag).
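
The sketch below models the commit/abort transitions above with hypothetical types (`RangeDeletionTask`, `RangeDeletionStore`); the real documents live in `config.rangeDeletions` and are updated with the appropriate write concerns, which this sketch does not model.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical model of a range deletion task document; the real schema lives
// in range_deletion_task.idl.
struct RangeDeletionTask {
    std::string collectionUUID;
    std::pair<int, int> range;  // [min, max) on the shard key
    bool pending = true;        // present while the range is not yet ready for deletion
};

// Per-shard store of range deletion tasks, keyed by a task id.
using RangeDeletionStore = std::map<std::string, RangeDeletionTask>;

// Commit: the donor's task becomes ready for deletion, the recipient's task is removed.
void onMigrationCommit(RangeDeletionStore& donor, RangeDeletionStore& recipient,
                       const std::string& taskId) {
    donor.at(taskId).pending = false;
    recipient.erase(taskId);
}

// Abort: the donor keeps its documents, the recipient cleans up the copied documents.
void onMigrationAbort(RangeDeletionStore& donor, RangeDeletionStore& recipient,
                      const std::string& taskId) {
    donor.erase(taskId);
    recipient.at(taskId).pending = false;
}
```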
|
||||
|
||||
## Range deleter service
|
||||
|
||||
The [range deleter service](https://github.com/mongodb/mongo/blob/v7.0/src/mongo/db/s/range_deleter_service.h) is a primary-only service living on shards that is driven by the state persisted in `config.rangeDeletions`: for each collection with at least one range deletion task, the service keeps track in memory of the mapping `<collection uuid, [list of ranges containing orphaned documents]>`.
|
||||
|
||||
Its main functions are the following (a minimal sketch follows this list):
|
||||
- Scheduling an orphaned range for deletion only when ongoing reads retaining that range have locally drained.
|
||||
- Performing the actual deletion of orphaned docs belonging to a range (this happens on a dedicated thread that only performs range deletions).
|
||||
- Providing a way to know when orphan deletions on specific ranges have completed (callers can get a future to wait for a range to become "orphan free"; such a future is notified only when all range deletion tasks overlapping with the specified range have completed).
|
||||
|
||||
- Scheduling an orphaned range for deletion only when ongoing reads retaining that range have locally drained.
|
||||
- Performing the actual deletion of orphaned docs belonging to a range (this happens on a dedicated thread that only performs range deletions).
|
||||
- Providing a way to know when orphan deletions on specific ranges have completed (callers can get a future to wait for a range to become "orphan free"; such a future is notified only when all range deletion tasks overlapping with the specified range have completed).
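
Below is a minimal sketch of that in-memory bookkeeping using plain standard-library futures; the class and method names (`RangeDeleterServiceSketch`, `registerTask`, `completeTask`) are illustrative, and the real service additionally waits for ongoing reads to drain and runs deletions on a dedicated thread, none of which is modeled here.

```cpp
#include <future>
#include <list>
#include <map>
#include <string>
#include <utility>

class RangeDeleterServiceSketch {
public:
    using Range = std::pair<int, int>;  // [min, max) on the shard key

    // Registers a range deletion task for a collection.
    void registerTask(const std::string& collUUID, Range range) {
        _state[collUUID].push_back(Task{std::move(range), {}});
    }

    // Returns a future that becomes ready when the given range has been
    // processed (in reality: when all overlapping tasks have completed).
    std::future<void> getOverlappingRangeDeletionsFuture(const std::string& collUUID,
                                                         const Range& range) {
        for (auto& task : _state[collUUID]) {
            if (task.range == range) {
                task.waiters.push_back(std::promise<void>());
                return task.waiters.back().get_future();
            }
        }
        // No overlapping task: the range is already orphan free.
        std::promise<void> ready;
        ready.set_value();
        return ready.get_future();
    }

    // Called once the deletion of a range has completed: notify waiters and
    // drop the task from the in-memory state.
    void completeTask(const std::string& collUUID, const Range& range) {
        auto& tasks = _state[collUUID];
        for (auto it = tasks.begin(); it != tasks.end(); ++it) {
            if (it->range == range) {
                for (auto& w : it->waiters) {
                    w.set_value();
                }
                tasks.erase(it);
                return;
            }
        }
    }

private:
    struct Task {
        Range range;
        std::list<std::promise<void>> waiters;
    };
    std::map<std::string, std::list<Task>> _state;  // collection uuid -> orphaned ranges
};
```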
|
||||
|
||||
### Initialization (service in `INITIALIZING` state)
|
||||
|
||||
When a shard node steps up, a thread to asynchronously recover range deletions is spawned (being asynchronous, it does not block step-up); recovery happens in the following steps:
|
||||
|
||||
1. Disallow writes on `config.rangeDeletions`
|
||||
2. For each document in `config.rangeDeletions`:
|
||||
   - Register a task on the range deleter service
|
||||
   - Register a task on the range deleter service
|
||||
3. Re-allow writes on `config.rangeDeletions`
|
||||
|
||||
### Steady state (service in `UP` state)
|
||||
|
||||
The [range deleter service observer](https://github.com/mongodb/mongo/blob/v7.0/src/mongo/db/s/range_deleter_service_op_observer.h) keeps the in-memory state in sync with the persistent state by reacting to document modifications on `config.rangeDeletions` in the following ways:
|
||||
- Register a task on the service when a range deletion task document is flagged as ready (insert of a document without the `pending` flag, or update of an existing document that removes the flag)
|
||||
- De-register a task from the service when a range deletion task is completed (document deleted)
|
||||
|
||||
- Register a task on the service when a range deletion task document is flagged as ready (insert of a document without the `pending` flag, or update of an existing document that removes the flag)
|
||||
- De-register a task from the service when a range deletion task is completed (document deleted)
|
||||
|
||||
### Shutdown (service in `DOWN` state)
|
||||
|
||||
When a shard primary steps down, the whole service is destroyed:
|
||||
- Ongoing orphan cleanups, if any, are stopped
|
||||
- The in-memory state is cleared
|
||||
|
||||
- Ongoing orphan cleanups, if any, are stopped
|
||||
- The in-memory state is cleared
|
||||
|
||||
The shutdown of the service is structured in the following way in order not to block step-down:
|
||||
- Call for shutdown of the thread performing range deletions
|
||||
- On the next step-up, join that thread (it is very unlikely that the shutdown has not completed by the time a step-down is followed by a step-up, so this should effectively be a no-op)
|
||||
|
||||
- Call for shutdown of the thread performing range deletions
|
||||
- On the next step-up, join that thread (it is very unlikely that the shutdown has not completed by the time a step-down is followed by a step-up, so this should effectively be a no-op)
|
||||
|
|
|
|||
|
|
@ -1,32 +1,37 @@
|
|||
# Consistency Model of the Routing Info Cache
|
||||
|
||||
This section builds upon the definitions of the sharding catalog in [this section](README_sharding_catalog.md#catalog-containers) and elaborates on the consistency model of the [CatalogCache](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/catalog_cache.h#L134), which is what backs the [Router role](README_sharding_catalog.md#router-role).
|
||||
|
||||
## Timelines
|
||||
Let's define the set of operations which a DDL coordinator performs over a given catalog object as the **timeline** of that object. The timelines of different objects can be **causally dependent** (or just *dependent* for brevity) on one another, or they can be **independent**.
|
||||
|
||||
Let's define the set of operations which a DDL coordinator performs over a given catalog object as the **timeline** of that object. The timelines of different objects can be **causally dependent** (or just _dependent_ for brevity) on one another, or they can be **independent**.
|
||||
|
||||
For example, creating a sharded collection only happens after a DBPrimary has been created for the owning database, therefore the timeline of a collection is causally dependent on the timeline of the owning database. Similarly, placing a database on a shard can only happen after that shard has been added, therefore the timeline of a database is dependent on the timeline of the shards data.
|
||||
|
||||
On the other hand, two different clients creating two different sharded collections under two different DBPrimaries are two timelines which are independent from each other.
|
||||
|
||||
## Routing info cache objects
|
||||
|
||||
The list below enumerates the current set of catalog objects in the routing info cache, their cardinality (how many exist in the cluster), their dependencies and the DDL coordinators which are responsible for their timelines:
|
||||
|
||||
* ConfigData: Cardinality = 1, Coordinator = CSRS, Causally dependent on the clusterTime on the CSRS.
|
||||
* ShardsData: Cardinality = 1, Coordinator = CSRS, Causally dependent on ConfigData.
|
||||
* Database: Cardinality = NumDatabases, Coordinator = (CSRS with a hand-off to the DBPrimary after creation), Causally dependent on ShardsData.
|
||||
* Collection: Cardinality = NumCollections, Coordinator = DBPrimary, Causally dependent on Database.
|
||||
* CollectionPlacement: Cardinality = NumCollections, Coordinator = (DBPrimary with a hand-off to the Donor Shard for migrations), Causally dependent on Collection.
|
||||
* CollectionIndexes: Cardinality = NumCollections, Coordinator = DBPrimary, Causally dependent on Collection.
|
||||
- ConfigData: Cardinality = 1, Coordinator = CSRS, Causally dependent on the clusterTime on the CSRS.
|
||||
- ShardsData: Cardinality = 1, Coordinator = CSRS, Causally dependent on ConfigData.
|
||||
- Database: Cardinality = NumDatabases, Coordinator = (CSRS with a hand-off to the DBPrimary after creation), Causally dependent on ShardsData.
|
||||
- Collection: Cardinality = NumCollections, Coordinator = DBPrimary, Causally dependent on Database.
|
||||
- CollectionPlacement: Cardinality = NumCollections, Coordinator = (DBPrimary with a hand-off to the Donor Shard for migrations), Causally dependent on Collection.
|
||||
- CollectionIndexes: Cardinality = NumCollections, Coordinator = DBPrimary, Causally dependent on Collection.
|
||||
|
||||
## Consistency model
|
||||
|
||||
Since the sharded cluster is a distributed system, it would be prohibitive to have each user operation go to the CSRS in order to obtain an up-to-date view of the routing information. Therefore the cache's consistency model needs to be relaxed.
|
||||
|
||||
Currently, the cache exposes a view of the routing table which preserves the causal dependency of only *certain* dependent timelines and provides no guarantees for timelines which are not related.
|
||||
Currently, the cache exposes a view of the routing table which preserves the causal dependency of only _certain_ dependent timelines and provides no guarantees for timelines which are not related.
|
||||
|
||||
The only dependent timelines which are preserved are:
|
||||
* Everything dependent on ShardsData: Meaning that if a database or collection placement references shard S, then shard S will be present in the ShardRegistry
|
||||
* CollectionPlacement and Collection: Meaning that if the cache references placement version V, then it will also reference the collection description which corresponds to that placement
|
||||
* CollectionIndexes and Collection: Meaning that if the cache references index version V, then it will also reference the collection description which corresponds to that index version
|
||||
|
||||
- Everything dependent on ShardsData: Meaning that if a database or collection placement references shard S, then shard S will be present in the ShardRegistry
|
||||
- CollectionPlacement and Collection: Meaning that if the cache references placement version V, then it will also reference the collection description which corresponds to that placement
|
||||
- CollectionIndexes and Collection: Meaning that if the cache references index version V, then it will also reference the collection description which corresponds to that index version
|
||||
|
||||
For example, if the CatalogCache returns a chunk which is placed on shard S1, the same caller is guaranteed to see shard S1 in the ShardRegistry, rather than potentially get ShardNotFound. The inverse is not guaranteed: if a shard S1 is found in the ShardRegistry, there is no guarantee that any collections that have chunks on S1 will be in the CatalogCache.
|
||||
|
||||
|
|
@ -35,27 +40,32 @@ Similarly, because collections have independent timelines, there is no guarantee
|
|||
The consistency model described in the previous section can be implemented in a number of ways, ranging from always fetching the most up-to-date snapshot of all the objects from the CSRS to a more precise (lazy) fetching of just an object and its dependencies. The current implementation of sharding opts for the latter approach. In order to achieve this, it assigns "timestamps" to all the objects in the catalog and imposes relationships between these timestamps such that the "related to" relationship is preserved.
|
||||
|
||||
### Object timestamps
|
||||
|
||||
The objects and their timestamps are as follows:
|
||||
* ConfigData: `configTime`, which is the most recent majority timestamp on the CSRS
|
||||
* ShardData: `topologyTime`, which is an always-increasing value that increments as shards are added and removed and is stored in the `config.shards` document
|
||||
* Database*: `databaseTimestamp`, which is an always-increasing value that increments each time a database is created or moved
|
||||
* CollectionPlacement*: `collectionTimestamp/epoch/majorVersion/minorVersion`, henceforth referred to as the `collectionVersion`
|
||||
* CollectionIndexes*: `collectionTimestamp/epoch/indexVersion`, henceforth referred to as the `indexVersion`
|
||||
|
||||
- ConfigData: `configTime`, which is the most recent majority timestamp on the CSRS
|
||||
- ShardData: `topologyTime`, which is an always-increasing value that increments as shards are added and removed and is stored in the `config.shards` document
|
||||
- Database\*: `databaseTimestamp`, which is an always-increasing value that increments each time a database is created or moved
|
||||
- CollectionPlacement\*: `collectionTimestamp/epoch/majorVersion/minorVersion`, henceforth referred to as the `collectionVersion`
|
||||
- CollectionIndexes\*: `collectionTimestamp/epoch/indexVersion`, henceforth referred to as the `indexVersion`
|
||||
|
||||
Because of the "related to" relationships explained above, there is a strict dependency between the various timestamps (please refer to the following section as well for more detail):
|
||||
* `configTime > topologyTime`: If a node is aware of `topologyTime`, it will be aware of the `configTime` of the write which added the new shard (please refer to the section on [object timestamps selection](#object-timestamps-selection) for more information on why the relationship is "greater-than")
|
||||
* `databaseTimestamp > topologyTime`: The `topologyTime` which includes the addition of the DBPrimary shard (please refer to the section on [object timestamps selection](#object-timestamps-selection) for more information on why the relationship is "greater-than")
|
||||
* `collectionTimestamp > databaseTimestamp`: The `databaseTimestamp` which includes the creation of that database
|
||||
|
||||
- `configTime > topologyTime`: If a node is aware of `topologyTime`, it will be aware of the `configTime` of the write which added the new shard (please refer to the section on [object timestamps selection](#object-timestamps-selection) for more information on why the relationship is "greater-than")
|
||||
- `databaseTimestamp > topologyTime`: The `topologyTime` which includes the addition of the DBPrimary shard (please refer to the section on [object timestamps selection](#object-timestamps-selection) for more information on why the relationship is "greater-than")
|
||||
- `collectionTimestamp > databaseTimestamp`: The `databaseTimestamp` which includes the creation of that database
|
||||
|
||||
Because every object in the cache depends on the `configTime` and the `topologyTime`, which are singletons in the system, these values are propagated on every communication within the cluster. Any change to the `topologyTime` informs the ShardRegistry that there is new information present on the CSRS, so that a subsequent `getShard` will refresh if necessary (i.e., if the caller asks for a DBPrimary which references a newly added shard).
|
||||
|
||||
As a result, the process of sending a request to a DBPrimary is as follows (sketched in code after this list):
|
||||
* Ask for a database object from the CatalogCache
|
||||
* The CatalogCache fetches the database object from the CSRS (only if it has been told that there is a more recent object in the persistent store), which implicitly fetches the `topologyTime` and the `configTime`
|
||||
* Ask for the DBPrimary shard object from the ShardRegistry
|
||||
* The ShardRegistry ensures that it has caught up at least to the topologyTime that the fetch of the DBPrimary brought and, if necessary, reaches out to the CSRS
|
||||
|
||||
- Ask for a database object from the CatalogCache
|
||||
- The CatalogCache fetches the database object from the CSRS (only if it has been told that there is a more recent object in the persistent store), which implicitly fetches the `topologyTime` and the `configTime`
|
||||
- Ask for the DBPrimary shard object from the ShardRegistry
|
||||
- The ShardRegistry ensures that it has caught up at least to the topologyTime that the fetch of the DBPrimary brought and, if necessary, reaches out to the CSRS
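
The sketch below models this flow with hypothetical `CatalogCacheSketch` and `ShardRegistrySketch` types and plain integer timestamps; the real classes refresh from the CSRS and use cluster times, which this sketch only hints at in comments.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// A cached database entry carries the topology time that was gossiped along with it.
struct DatabaseEntry {
    std::string primaryShard;
    long long topologyTime;
};

struct CatalogCacheSketch {
    std::map<std::string, DatabaseEntry> databases;
    DatabaseEntry getDatabase(const std::string& dbName) const {
        return databases.at(dbName);  // a real cache may refresh from the CSRS here
    }
};

class ShardRegistrySketch {
public:
    ShardRegistrySketch(std::map<std::string, std::string> shards, long long topologyTime)
        : _shards(std::move(shards)), _topologyTime(topologyTime) {}

    // Ensures the registry has caught up to the topology time carried by the
    // database entry before resolving the shard, refreshing if it is behind.
    std::string getShard(const std::string& shardId, long long requiredTopologyTime) {
        if (requiredTopologyTime > _topologyTime) {
            refreshFromConfigServer(requiredTopologyTime);
        }
        auto it = _shards.find(shardId);
        if (it == _shards.end()) {
            throw std::runtime_error("ShardNotFound");
        }
        return it->second;  // e.g. the shard's connection string
    }

private:
    void refreshFromConfigServer(long long newTopologyTime) {
        // In the server this re-reads config.shards; here we only bump the time.
        _topologyTime = newTopologyTime;
    }

    std::map<std::string, std::string> _shards;
    long long _topologyTime;
};

// Routing a request: get the database entry, then resolve its primary shard,
// letting the registry catch up to the gossiped topology time if needed.
std::string routeToDbPrimary(const CatalogCacheSketch& cache, ShardRegistrySketch& shards,
                             const std::string& dbName) {
    const DatabaseEntry db = cache.getDatabase(dbName);
    return shards.getShard(db.primaryShard, db.topologyTime);
}
```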
|
||||
|
||||
## Object timestamps selection
|
||||
|
||||
In the replication subsystem, the optime for an oplog entry is usually generated when that oplog entry is written to the oplog. Because of this, it is difficult to make an oplog entry contain its own optime, or for a document to contain the optime of when it was written.
|
||||
|
||||
As a consequence of the above, since the `topologyTime`, `databaseTimestamp` and `collectionTimestamp` are chosen before the write to the relevant collection happens, they are always less than the optime of the oplog entry for that write. This is not a problem, because none of these documents are visible before the majority timestamp has advanced to include the respective writes.
|
||||
|
|
|
|||
|
|
@ -1,4 +1,3 @@
|
|||
|
||||
# Logical Sessions
|
||||
|
||||
Some operations, such as retryable writes and transactions, require durably storing metadata in the
|
||||
|
|
@ -31,6 +30,7 @@ which the session's metadata expires.
|
|||
|
||||
The logical session cache is an in-memory cache of sessions that are open and in use on a certain
|
||||
node. Each node (router, shard, config server) has its own in-memory cache. A cache entry contains (see the sketch after this list):
|
||||
|
||||
1. `_id` - The session’s logical session id
|
||||
1. `user` - The session’s logged-in username (if authentication is enabled)
|
||||
1. `lastUse` - The date and time that the session was last used
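
A minimal sketch of the shape of such an entry; the field names mirror the list above, while the concrete types are illustrative (in practice the session id is a UUID).

```cpp
#include <chrono>
#include <optional>
#include <string>

// Hypothetical shape of an entry in the logical session cache.
struct LogicalSessionCacheEntry {
    std::string id;                                 // _id: the logical session id
    std::optional<std::string> user;                // logged-in username, if auth is enabled
    std::chrono::system_clock::time_point lastUse;  // last time the session was used
};
```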
|
||||
|
|
@ -40,7 +40,7 @@ as the "sessions collection." The sessions collection has different placement be
|
|||
whether the user is running a standalone node, a replica set, or a sharded cluster.
|
||||
|
||||
| Cluster Type | Sessions Collection Durable Storage |
|
||||
|-----------------|------------------------------------------------------------------------------------------------------------------|
|
||||
| --------------- | ---------------------------------------------------------------------------------------------------------------- |
|
||||
| Standalone Node | Sessions collection exists on the same node as the in-memory cache. |
|
||||
| Replica Set | Sessions collection exists on the primary node and replicates to secondaries. |
|
||||
| Sharded Cluster | Sessions collection is a regular sharded collection - can exist on multiple shards and can have multiple chunks. |
|
||||
|
|
@ -87,17 +87,17 @@ the following steps will be performed:
|
|||
#### Configurable parameters related to the logical session cache
|
||||
|
||||
| Parameter | Value Type | Default Value | Startup/Runtime | Description |
|
||||
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------|----------------------|-----------------|----------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------- | -------------------- | --------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| [disableLogicalSessionCacheRefresh](https://github.com/mongodb/mongo/blob/9cbbb66d7536ab4f92baf99ef5332e96be0e4153/src/mongo/db/logical_session_cache.idl#L49-L54) | boolean | false | Startup | Disables the logical session cache's periodic "refresh" and "reap" functions on this node. Recommended for testing only. |
|
||||
| [logicalSessionRefreshMillis](https://github.com/mongodb/mongo/blob/9cbbb66d7536ab4f92baf99ef5332e96be0e4153/src/mongo/db/logical_session_cache.idl#L34-L40) | integer | 300000ms (5 minutes) | Startup | Changes how often the logical session cache runs its periodic "refresh" and "reap" functions on this node. |
|
||||
| [localLogicalSessionTimeoutMinutes](https://github.com/mongodb/mongo/blob/9cbbb66d7536ab4f92baf99ef5332e96be0e4153/src/mongo/db/logical_session_id.idl#L191-L196) | integer | 30 minutes | Startup | Changes the TTL index timeout for the sessions collection. In sharded clusters, this parameter is supported only on the config server. |
|
||||
|
||||
#### Code references
|
||||
|
||||
* [Place where a session is placed (or replaced) in the logical session cache](https://github.com/mongodb/mongo/blob/1f94484d52064e12baedc7b586a8238d63560baf/src/mongo/db/logical_session_cache.h#L71-L75)
|
||||
* [The logical session cache refresh function](https://github.com/mongodb/mongo/blob/1f94484d52064e12baedc7b586a8238d63560baf/src/mongo/db/logical_session_cache_impl.cpp#L207-L355)
|
||||
* [The periodic job to clean up the session catalog and transactions table (the "reap" function)](https://github.com/mongodb/mongo/blob/1f94484d52064e12baedc7b586a8238d63560baf/src/mongo/db/logical_session_cache_impl.cpp#L141-L205)
|
||||
* [Location of the session catalog and transactions table cleanup code on mongod](https://github.com/mongodb/mongo/blob/1f94484d52064e12baedc7b586a8238d63560baf/src/mongo/db/session/session_catalog_mongod.cpp#L331-L398)
|
||||
- [Place where a session is placed (or replaced) in the logical session cache](https://github.com/mongodb/mongo/blob/1f94484d52064e12baedc7b586a8238d63560baf/src/mongo/db/logical_session_cache.h#L71-L75)
|
||||
- [The logical session cache refresh function](https://github.com/mongodb/mongo/blob/1f94484d52064e12baedc7b586a8238d63560baf/src/mongo/db/logical_session_cache_impl.cpp#L207-L355)
|
||||
- [The periodic job to clean up the session catalog and transactions table (the "reap" function)](https://github.com/mongodb/mongo/blob/1f94484d52064e12baedc7b586a8238d63560baf/src/mongo/db/logical_session_cache_impl.cpp#L141-L205)
|
||||
- [Location of the session catalog and transactions table cleanup code on mongod](https://github.com/mongodb/mongo/blob/1f94484d52064e12baedc7b586a8238d63560baf/src/mongo/db/session/session_catalog_mongod.cpp#L331-L398)
|
||||
|
||||
## The logical session catalog
|
||||
|
||||
|
|
@ -127,7 +127,7 @@ the first kill request.
|
|||
|
||||
The runtime state in a node's in-memory session catalog is made durable in the node's
|
||||
`config.transactions` collection, also called its transactions table. The in-memory session catalog
|
||||
is
|
||||
is
|
||||
[invalidated](https://github.com/mongodb/mongo/blob/56655b06ac46825c5937ccca5947dc84ccbca69c/src/mongo/db/session/session_catalog_mongod.cpp#L324)
|
||||
if the `config.transactions` collection is dropped and whenever there is a rollback. When
|
||||
invalidation occurs, all active sessions are killed, and the in-memory transaction state is marked
|
||||
|
|
@ -135,10 +135,11 @@ as invalid to force it to be
|
|||
[reloaded from storage the next time a session is checked out](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/session/session_catalog_mongod.cpp#L426).
|
||||
|
||||
#### Code references
|
||||
* [**SessionCatalog class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/session/session_catalog.h)
|
||||
* [**MongoDSessionCatalog class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/session/session_catalog_mongod.h)
|
||||
* [**RouterSessionCatalog class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/s/session_catalog_router.h)
|
||||
* How [**mongod**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/service_entry_point_common.cpp#L537) and [**mongos**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/s/commands/strategy.cpp#L412) check out a session prior to executing a command.
|
||||
|
||||
- [**SessionCatalog class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/session/session_catalog.h)
|
||||
- [**MongoDSessionCatalog class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/session/session_catalog_mongod.h)
|
||||
- [**RouterSessionCatalog class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/s/session_catalog_router.h)
|
||||
- How [**mongod**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/service_entry_point_common.cpp#L537) and [**mongos**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/s/commands/strategy.cpp#L412) check out a session prior to executing a command.
|
||||
|
||||
## Retryable writes
|
||||
|
||||
|
|
@ -191,8 +192,8 @@ originally been returned. In version 5.0 and earlier, the default behavior is to
|
|||
[record the document image into the oplog](https://github.com/mongodb/mongo/blob/33ad68c0dc4bda897a5647608049422ae784a15e/src/mongo/db/op_observer_impl.cpp#L191)
|
||||
as a no-op entry. The oplog entries generated would look something like:
|
||||
|
||||
* `{ op: "d", o: {_id: 1}, ts: Timestamp(100, 2), preImageOpTime: Timestamp(100, 1), lsid: ..., txnNumber: ...}`
|
||||
* `{ op: "n", o: {_id: 1, imageBeforeDelete: "foobar"}, ts: Timestamp(100, 1)}`
|
||||
- `{ op: "d", o: {_id: 1}, ts: Timestamp(100, 2), preImageOpTime: Timestamp(100, 1), lsid: ..., txnNumber: ...}`
|
||||
- `{ op: "n", o: {_id: 1, imageBeforeDelete: "foobar"}, ts: Timestamp(100, 1)}`
|
||||
|
||||
There's a cost in "explicitly" replicating these images via the oplog. We've addressed this cost
|
||||
with 5.1 where the default is to instead [save the image into a side collection](https://github.com/mongodb/mongo/blob/33ad68c0dc4bda897a5647608049422ae784a15e/src/mongo/db/op_observer_impl.cpp#L646-L650)
|
||||
|
|
@ -215,7 +216,7 @@ For retry images saved in the image collection, the source will "downconvert" op
|
|||
`needsRetryImage: true` into two oplog entries, simulating the old format. As chunk migrations use
|
||||
internal commands, [this downconverting procedure](https://github.com/mongodb/mongo/blob/0beb0cacfcaf7b24259207862e1d0d489e1c16f1/src/mongo/db/s/session_catalog_migration_source.cpp#L58-L97)
|
||||
is installed under the hood. For resharding and tenant migrations, a new aggregation stage,
|
||||
[_internalFindAndModifyImageLookup](https://github.com/mongodb/mongo/blob/e27dfa10b994f6deff7c59a122b87771cdfa8aba/src/mongo/db/pipeline/document_source_find_and_modify_image_lookup.cpp#L61),
|
||||
[\_internalFindAndModifyImageLookup](https://github.com/mongodb/mongo/blob/e27dfa10b994f6deff7c59a122b87771cdfa8aba/src/mongo/db/pipeline/document_source_find_and_modify_image_lookup.cpp#L61),
|
||||
was introduced to perform the identical substitution. In order for this stage to have a valid timestamp
|
||||
to assign to the forged no-op oplog entry as a result of the "downconvert", we must always assign an
|
||||
extra oplog slot when writing the original retryable findAndModify oplog entry with
|
||||
|
|
@ -233,11 +234,12 @@ those timestamps:
|
|||
|Delete, NeedsRetryImage=preImage |2|Reserved for forged no-op entry eventually used by tenant migrations/resharding|Delete oplog entry|NeedsRetryImage: preImage|
|
||||
|
||||
#### Code references
|
||||
* [**TransactionParticipant class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/transaction_participant.h)
|
||||
* How a write operation [checks if a statement has been executed](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/ops/write_ops_exec.cpp#L811-L816)
|
||||
* How mongos [assigns statement ids to writes in a batch write command](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/s/write_ops/batch_write_op.cpp#L483-L486)
|
||||
* How mongod [assigns statement ids to insert operations](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/ops/write_ops_exec.cpp#L573)
|
||||
* [Retryable writes specifications](https://github.com/mongodb/specifications/blob/49589d66d49517f10cc8e1e4b0badd61dbb1917e/source/retryable-writes/retryable-writes.rst)
|
||||
|
||||
- [**TransactionParticipant class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/transaction_participant.h)
|
||||
- How a write operation [checks if a statement has been executed](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/ops/write_ops_exec.cpp#L811-L816)
|
||||
- How mongos [assigns statement ids to writes in a batch write command](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/s/write_ops/batch_write_op.cpp#L483-L486)
|
||||
- How mongod [assigns statement ids to insert operations](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/ops/write_ops_exec.cpp#L573)
|
||||
- [Retryable writes specifications](https://github.com/mongodb/specifications/blob/49589d66d49517f10cc8e1e4b0badd61dbb1917e/source/retryable-writes/retryable-writes.rst)
|
||||
|
||||
## Transactions
|
||||
|
||||
|
|
@ -283,12 +285,12 @@ transaction runs on. The command is retryable as long as no new transaction has
|
|||
and the session is still alive. The number of participant shards and the number of write shards determine
|
||||
the commit path for the transaction, as sketched in code after the list below.
|
||||
|
||||
* If the number of participant shards is zero, the mongos skips the commit and returns immediately.
|
||||
* If the number of participant shards is one, the mongos forwards `commitTransaction` directly to that shard.
|
||||
* If the number of participant shards is greater than one:
|
||||
* If the number of write shards is zero, the mongos forwards `commitTransaction` to each shard individually.
|
||||
* Otherwise, the mongos sends `coordinateCommitTransaction` with the participant list to the coordinator shard to
|
||||
initiate two-phase commit.
|
||||
- If the number of participant shards is zero, the mongos skips the commit and returns immediately.
|
||||
- If the number of participant shards is one, the mongos forwards `commitTransaction` directly to that shard.
|
||||
- If the number of participant shards is greater than one:
|
||||
- If the number of write shards is zero, the mongos forwards `commitTransaction` to each shard individually.
|
||||
- Otherwise, the mongos sends `coordinateCommitTransaction` with the participant list to the coordinator shard to
|
||||
initiate two-phase commit.
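
The commit-path selection just described can be summarized by the following sketch; the enum and function names are illustrative, not the real router code.

```cpp
#include <cstddef>

enum class CommitPath {
    kSkipCommit,         // no participants: nothing to commit
    kSingleShardCommit,  // forward commitTransaction to the only participant
    kReadOnlyCommit,     // send commitTransaction to each participant individually
    kTwoPhaseCommit,     // send coordinateCommitTransaction to the coordinator shard
};

// Direct transcription of the rules above, based on how many shards
// participated and how many of them performed writes.
CommitPath chooseCommitPath(std::size_t numParticipantShards, std::size_t numWriteShards) {
    if (numParticipantShards == 0) {
        return CommitPath::kSkipCommit;
    }
    if (numParticipantShards == 1) {
        return CommitPath::kSingleShardCommit;
    }
    return numWriteShards == 0 ? CommitPath::kReadOnlyCommit : CommitPath::kTwoPhaseCommit;
}
```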
|
||||
|
||||
To recover the commit decision after the original mongos has become unreachable, the client can send `commitTransaction`
|
||||
along with the `recoveryToken` to a different mongos. This will not initiate committing the transaction, instead
|
||||
|
|
@ -303,22 +305,23 @@ information about the transaction it is trying commit. This document is deleted
|
|||
|
||||
Below are the steps in the two-phase commit protocol (a simplified sketch follows the list).
|
||||
|
||||
* Prepare Phase
|
||||
1. The coordinator writes the participant list to the `config.transaction_coordinators` document for the
|
||||
transaction, and waits for it to be majority committed.
|
||||
1. The coordinator sends [`prepareTransaction`](https://github.com/mongodb/mongo/blob/r4.4.0-rc7/src/mongo/db/repl/README.md#lifetime-of-a-prepared-transaction) to the participants, and waits for vote responses. Each participant
|
||||
shard responds with a vote, marks the transaction as prepared, and updates the `config.transactions`
|
||||
document for the transaction.
|
||||
1. The coordinator writes the decision to the `config.transaction_coordinators` document and waits for it to
|
||||
be majority committed. If the `coordinateCommitTransactionReturnImmediatelyAfterPersistingDecision` server parameter is
|
||||
true (default), the `coordinateCommitTransaction` command returns immediately after waiting for the client's write concern
|
||||
(i.e., letting the remaining work continue in the background).
|
||||
- Prepare Phase
|
||||
|
||||
* Commit Phase
|
||||
1. If the decision is 'commit', the coordinator sends `commitTransaction` to the participant shards, and waits
|
||||
for responses. If the decision is 'abort', it sends `abortTransaction` instead. Each participant shard marks
|
||||
the transaction as committed or aborted, and updates the `config.transactions` document.
|
||||
1. The coordinator deletes the coordinator document with write concern `{w: 1}`.
|
||||
1. The coordinator writes the participant list to the `config.transaction_coordinators` document for the
|
||||
transaction, and waits for it to be majority committed.
|
||||
1. The coordinator sends [`prepareTransaction`](https://github.com/mongodb/mongo/blob/r4.4.0-rc7/src/mongo/db/repl/README.md#lifetime-of-a-prepared-transaction) to the participants, and waits for vote responses. Each participant
|
||||
shard responds with a vote, marks the transaction as prepared, and updates the `config.transactions`
|
||||
document for the transaction.
|
||||
1. The coordinator writes the decision to the `config.transaction_coordinators` document and waits for it to
|
||||
be majority committed. If the `coordinateCommitTransactionReturnImmediatelyAfterPersistingDecision` server parameter is
|
||||
true (default), the `coordinateCommitTransaction` command returns immediately after waiting for the client's write concern
|
||||
(i.e., letting the remaining work continue in the background).
|
||||
|
||||
- Commit Phase
|
||||
1. If the decision is 'commit', the coordinator sends `commitTransaction` to the participant shards, and waits
|
||||
for responses. If the decision is 'abort', it sends `abortTransaction` instead. Each participant shard marks
|
||||
the transaction as committed or aborted, and updates the `config.transactions` document.
|
||||
1. The coordinator deletes the coordinator document with write concern `{w: 1}`.
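
A heavily simplified sketch of the two phases follows; every call stands in for a real network round trip or a majority-committed write that this sketch does not perform, and the participant and coordinator-document types are hypothetical.

```cpp
#include <string>
#include <vector>

enum class Vote { kCommit, kAbort };
enum class Decision { kCommit, kAbort };

// Hypothetical participant shard interface; the stubs stand in for the
// prepareTransaction/commitTransaction/abortTransaction commands.
struct Participant {
    Vote prepare() { return Vote::kCommit; }
    void commit() {}
    void abort() {}
};

// Hypothetical model of the config.transaction_coordinators document.
struct CoordinatorDoc {
    std::vector<std::string> participantList;
    Decision decision = Decision::kAbort;
};

Decision runTwoPhaseCommit(std::vector<Participant>& participants, CoordinatorDoc& doc) {
    // Prepare phase: doc.participantList has already been persisted with
    // majority write concern (omitted); gather votes from all participants.
    Decision decision = Decision::kCommit;
    for (auto& p : participants) {
        if (p.prepare() == Vote::kAbort) {
            decision = Decision::kAbort;
            break;
        }
    }
    doc.decision = decision;  // persist the decision (majority committed, omitted)

    // Commit phase: deliver the decision to every participant; afterwards the
    // coordinator document would be deleted with {w: 1}.
    for (auto& p : participants) {
        if (decision == Decision::kCommit) {
            p.commit();
        } else {
            p.abort();
        }
    }
    return decision;
}
```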
|
||||
|
||||
The prepare phase is skipped if the coordinator already has the participant list and the commit decision persisted.
|
||||
This can be the case if the coordinator was created as part of step-up recovery.
|
||||
|
|
@ -332,9 +335,10 @@ been started on the session and the session is still alive. In both cases, the m
|
|||
to all participant shards.
|
||||
|
||||
#### Code references
|
||||
* [**TransactionRouter class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/s/transaction_router.h)
|
||||
* [**TransactionCoordinatorService class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/s/transaction_coordinator_service.h)
|
||||
* [**TransactionCoordinator class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/s/transaction_coordinator.h)
|
||||
|
||||
- [**TransactionRouter class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/s/transaction_router.h)
|
||||
- [**TransactionCoordinatorService class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/s/transaction_coordinator_service.h)
|
||||
- [**TransactionCoordinator class**](https://github.com/mongodb/mongo/blob/r4.3.4/src/mongo/db/s/transaction_coordinator.h)
|
||||
|
||||
## Internal Transactions
|
||||
|
||||
|
|
@ -359,11 +363,11 @@ We expect that retryable write commands that start retryable internal transactio
|
|||
1. Write statements within the command are guaranteed to apply only once regardless of how many times a client retries.
|
||||
2. The response for the command is guaranteed to be reconstructable on retry.
|
||||
|
||||
To do this, retryable write statements executed inside of a retryable internal transaction try to emulate the behavior of ordinary retryable writes.
|
||||
To do this, retryable write statements executed inside of a retryable internal transaction try to emulate the behavior of ordinary retryable writes.
|
||||
|
||||
Each statement inside of a retryable write command should have a corresponding entry within a retryable internal transaction with the same `stmtId` as the original write statement. When a transaction participant for a retryable internal transaction notices a write statement with a previously seen `stmtId`, it will not execute the statement and instead generate the original response for the already executed statement using the oplog entry generated by the initial execution. The check for previously executed statements is done using the `retriedStmtIds` array, which contains the `stmtIds` of already retried statements, inside of a write command's response.
|
||||
Each statement inside of a retryable write command should have a corresponding entry within a retryable internal transaction with the same `stmtId` as the original write statement. When a transaction participant for a retryable internal transaction notices a write statement with a previously seen `stmtId`, it will not execute the statement and instead generate the original response for the already executed statement using the oplog entry generated by the initial execution. The check for previously executed statements is done using the `retriedStmtIds` array, which contains the `stmtIds` of already retried statements, inside of a write command's response.
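
As a rough sketch of that retry rule, assuming a hypothetical response type and an already-executed `stmtId` set (reconstruction of the original response from the oplog is omitted):

```cpp
#include <cstdint>
#include <set>
#include <vector>

using StmtId = std::int32_t;

// Hypothetical response shape: which statements were recognized as retries.
struct WriteCommandResponse {
    std::vector<StmtId> retriedStmtIds;
};

// A statement whose stmtId was already executed is not re-executed; it is
// reported in retriedStmtIds and its original response would be reconstructed
// from the oplog entry produced by the initial execution.
WriteCommandResponse executeStatements(const std::vector<StmtId>& stmtIds,
                                       const std::set<StmtId>& alreadyExecuted) {
    WriteCommandResponse response;
    for (StmtId stmtId : stmtIds) {
        if (alreadyExecuted.count(stmtId)) {
            response.retriedStmtIds.push_back(stmtId);  // skip execution, report as retried
            continue;
        }
        // ... execute the write statement and record stmtId with its oplog entry ...
    }
    return response;
}
```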
|
||||
|
||||
In cases where a client retryable write command implicitly expects an auxiliary operation to be executed atomically with its current request, a retryable internal transaction may contain additional write statements that are not explicitly requested by a client retryable write command. An example could be that the client expects to atomically update an index when executing a write. Since these auxiliary write statements do not have a corresponding entry within the original client command, the `stmtId` field for these statements will be set to `{stmtId: kUninitializedStmtId}`. These auxiliary write statements are non-retryable; thus it is crucial that we use the `retriedStmtIds` to determine which client write statements were already successfully retried, to avoid re-applying the corresponding auxiliary write statements. Additionally, these statements will be excluded from the history check involving `retriedStmtIds`.
|
||||
In cases where a client retryable write command implicitly expects an auxiliary operation to be executed atomically with its current request, a retryable internal transaction may contain additional write statements that are not explicitly requested by a client retryable write command. An example could be that the client expects to atomically update an index when executing a write. Since these auxiliary write statements do not have a corresponding entry within the original client command, the `stmtId` field for these statements will be set to `{stmtId: kUninitializedStmtId}`. These auxiliary write statements are non-retryable; thus it is crucial that we use the `retriedStmtIds` to determine which client write statements were already successfully retried, to avoid re-applying the corresponding auxiliary write statements. Additionally, these statements will be excluded from the history check involving `retriedStmtIds`.
|
||||
|
||||
To guarantee that we can reconstruct the response regardless of retries, we do a "cross sectional" write history check for retryable writes and retryable internal transactions prior to running a client retryable write/retryable internal transaction command. This ensures we do not double apply non-idempotent operations, and instead recover the response for a successful execution when appropriate. To support this, the [RetryableWriteTransactionParticipantCatalog](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/transaction/transaction_participant.h#L1221-L1299) was added as a decoration on an external session and it stores the transaction participants for all active retryable writes on the session, which we use to do our [write history check](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/transaction/transaction_participant.cpp#L3205-L3208).
|
||||
|
||||
|
|
@ -371,12 +375,11 @@ To guarantee that we can reconstruct the response regardless of retries, we do a
|
|||
|
||||
To reconstruct responses for retryable internal transactions, we use the applyOps oplog entry, which contains an inner entry with the operation run under the `o` field that has a corresponding `stmtId`. We use the `stmtId` and `opTime` cached in the `TransactionParticipant` to lookup the operation in the applyOps oplog entry, which gives us the necessary details to reconstruct the original write response. The process for reconstructing retryable write responses works the same way.
|
||||
|
||||
|
||||
#### Special considerations for findAndModify
|
||||
|
||||
`findAndModify` additionally requires the storage of pre/post images. Upon committing or preparing an internal transaction, we insert a document into `config.image_collection` containing the pre/post image. The operation entry for the findAndModify statement inside the applyOps oplog entry will have a `needsRetryImage` field that is set to `true` to indicate that a pre/post image should be loaded from the side collection when reconstructing the write response. We can do the lookup using a transaction's `lsid` and `txnNumber`.
|
||||
|
||||
Currently, a retryable internal transaction can only support a **single** `findAndModify` statement at a time, due to the limitation that `config.image_collection` can only support storing one pre/post image entry for a given `(lsid, txnNumber)`.
|
||||
Currently, a retryable internal transaction can only support a **single** `findAndModify` statement at a time, due to the limitation that `config.image_collection` can only support storing one pre/post image entry for a given `(lsid, txnNumber)`.
|
||||
|
||||
#### Retryability across failover and restart
|
||||
|
||||
|
|
@ -390,27 +393,25 @@ Due to the use of `txnUUID` in the lsid for de-duplication purposes, retries of
|
|||
|
||||
2. If the client retries on a different mongos than the original write was run on, the new mongos will not have visibility over in-progress internal transactions run by another mongos, so this retry will not be blocked and legally begin execution. When the new mongos begins execution of the retried command, it will send commands with `startTransaction` to relevant transaction participants. The transaction participants will then [check if there is already an in-progress internal transaction that will conflict](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/transaction/transaction_participant.cpp#L2827-L2846) with the new internal transaction that is attempting to start. If so, then the transaction participant will throw `RetryableTransactionInProgress`, which will be caught and cause the new transaction to [block until the existing transaction is finished](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/service_entry_point_common.cpp#L1029-L1036).
|
||||
|
||||
|
||||
#### Supporting retryability across chunk migration and resharding
|
||||
|
||||
The session history, oplog entries, and image collection entries involving the chunk being migrated are cloned from the donor shard to the recipient shard during chunk migration. Once the recipient receives the relevant oplog entries from the donor, it will [nest and apply each of the received oplog entries in a no-op oplog entry](https://github.com/mongodb/mongo/blob/0d84f4bab0945559abcd5b00be5ec322c5214642/src/mongo/db/s/session_catalog_migration_destination.cpp#L204-L347). Depending on the type of operation run, the behavior will differ as follows.
|
||||
The session history, oplog entries, and image collection entries involving the chunk being migrated are cloned from the donor shard to the recipient shard during chunk migration. Once the recipient receives the relevant oplog entries from the donor, it will [nest and apply each of the received oplog entries in a no-op oplog entry](https://github.com/mongodb/mongo/blob/0d84f4bab0945559abcd5b00be5ec322c5214642/src/mongo/db/s/session_catalog_migration_destination.cpp#L204-L347). Depending on the type of operation run, the behavior will differ as follows.
|
||||
|
||||
* If a non-retryable write/non-retryable internal transaction is run, then the donor shard will [send a sentinel no-op oplog entry](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/s/session_catalog_migration_destination.cpp#L204-L354), which when parsed by the TransactionParticipant upon getting a retry against the recipient shard will [throw IncompleteTransactionHistory](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/transaction/transaction_participant.cpp#L323-L331).
|
||||
- If a non-retryable write/non-retryable internal transaction is run, then the donor shard will [send a sentinel no-op oplog entry](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/s/session_catalog_migration_destination.cpp#L204-L354), which when parsed by the TransactionParticipant upon getting a retry against the recipient shard will [throw IncompleteTransactionHistory](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/transaction/transaction_participant.cpp#L323-L331).
|
||||
|
||||
* If a retryable write/retryable internal transaction is run, then the donor shard will send a ["downconverted" oplog entry](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/s/session_catalog_migration_source.cpp#L669-L680), which when parsed by the TransactionParticipant upon getting a retry against the recipient shard will return the original write response.
|
||||
- If a retryable write/retryable internal transaction is run, then the donor shard will send a ["downconverted" oplog entry](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/s/session_catalog_migration_source.cpp#L669-L680), which when parsed by the TransactionParticipant upon getting a retry against the recipient shard will return the original write response.
|
||||
|
||||
`Note`: "Downconverting" in this context, is the process of extracting the operation information inside an applyOps entry for an internal transaction and constructing a new retryable write oplog entry with `lsid` and `txnNumber` set to the associated client's session id and txnNumber.
|
||||
`Note`: "Downconverting" in this context, is the process of extracting the operation information inside an applyOps entry for an internal transaction and constructing a new retryable write oplog entry with `lsid` and `txnNumber` set to the associated client's session id and txnNumber.
|
||||
|
||||
For resharding, the process is similar to how chunk migrations are handled. The session history, oplog entries, and image collection entries for operations run during resharding are cloned from the donor shard to the recipient shard. The only difference is that the recipient in this case will handle the "downconverting", nesting, and applying of the received oplog entries. The two cases discussed above apply to resharding as well.
|
||||
|
||||
|
||||
#### Code References
|
||||
|
||||
* [**Session checkout logic**](https://github.com/mongodb/mongo/blob/0d84f4bab0945559abcd5b00be5ec322c5214642/src/mongo/db/session/session_catalog_mongod.cpp#L694)
|
||||
* [**Cross-section history check logic**](https://github.com/mongodb/mongo/blob/0d84f4bab0945559abcd5b00be5ec322c5214642/src/mongo/db/transaction/transaction_participant.cpp#L3206)
|
||||
* [**Conflicting internal transaction check logic**](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/transaction/transaction_participant.cpp#L2827-L2846)
|
||||
* [**Refreshing client and internal sessions logic**](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/transaction/transaction_participant.cpp#L2889-L2899)
|
||||
* [**RetryableWriteTransactionParticipantCatalog**](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/transaction/transaction_participant.h#L1221-L1299)
|
||||
- [**Session checkout logic**](https://github.com/mongodb/mongo/blob/0d84f4bab0945559abcd5b00be5ec322c5214642/src/mongo/db/session/session_catalog_mongod.cpp#L694)
|
||||
- [**Cross-section history check logic**](https://github.com/mongodb/mongo/blob/0d84f4bab0945559abcd5b00be5ec322c5214642/src/mongo/db/transaction/transaction_participant.cpp#L3206)
|
||||
- [**Conflicting internal transaction check logic**](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/transaction/transaction_participant.cpp#L2827-L2846)
|
||||
- [**Refreshing client and internal sessions logic**](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/transaction/transaction_participant.cpp#L2889-L2899)
|
||||
- [**RetryableWriteTransactionParticipantCatalog**](https://github.com/mongodb/mongo/blob/d8ce3ee2e020d1ab2fa611a2a0f0a222b06b9779/src/mongo/db/transaction/transaction_participant.h#L1221-L1299)
|
||||
|
||||
### Transaction API
|
||||
|
||||
|
|
@ -420,11 +421,11 @@ Additionally, the API can use router commands when running on a mongod. Each com
|
|||
|
||||
Transactions for non-retryable operations or operations without a session initiated through the API use sessions from the [InternalSessionPool](https://github.com/mongodb/mongo/blob/master/src/mongo/db/internal_session_pool.h) to prevent the creation and maintenance of many single-use sessions.
|
||||
|
||||
To use the transaction API, [instantiate a transaction client](https://github.com/mongodb/mongo/blob/63f99193df82777239f038666270e4bfb2be3567/src/mongo/s/commands/cluster_find_and_modify_cmd.cpp#L250-L253) by providing the opCtx, an executor, and a resource yielder. Then, run the commands to be grouped in the same transaction session on the transaction object. Some examples of this are listed below.
|
||||
To use the transaction API, [instantiate a transaction client](https://github.com/mongodb/mongo/blob/63f99193df82777239f038666270e4bfb2be3567/src/mongo/s/commands/cluster_find_and_modify_cmd.cpp#L250-L253) by providing the opCtx, an executor, and a resource yielder. Then, run the commands to be grouped in the same transaction session on the transaction object. Some examples of this are listed below.
|
||||
|
||||
* [Cluster Find and Modify Command](https://github.com/mongodb/mongo/blob/63f99193df82777239f038666270e4bfb2be3567/src/mongo/s/commands/cluster_find_and_modify_cmd.cpp#L255-L265)
|
||||
* [Queryable Encryption](https://github.com/mongodb/mongo/blob/63f99193df82777239f038666270e4bfb2be3567/src/mongo/db/commands/fle2_compact.cpp#L636-L648)
|
||||
* [Cluster Write Command - WouldChangeOwningShard Error](https://github.com/mongodb/mongo/blob/63f99193df82777239f038666270e4bfb2be3567/src/mongo/s/commands/cluster_write_cmd.cpp#L162-L190)
|
||||
- [Cluster Find and Modify Command](https://github.com/mongodb/mongo/blob/63f99193df82777239f038666270e4bfb2be3567/src/mongo/s/commands/cluster_find_and_modify_cmd.cpp#L255-L265)
|
||||
- [Queryable Encryption](https://github.com/mongodb/mongo/blob/63f99193df82777239f038666270e4bfb2be3567/src/mongo/db/commands/fle2_compact.cpp#L636-L648)
|
||||
- [Cluster Write Command - WouldChangeOwningShard Error](https://github.com/mongodb/mongo/blob/63f99193df82777239f038666270e4bfb2be3567/src/mongo/s/commands/cluster_write_cmd.cpp#L162-L190)
|
||||
|
||||
## The historical routing table
|
||||
|
||||
|
|
@ -448,8 +449,9 @@ currently own the target chunks. Otherwise, it will return the shards that owned
|
|||
at that clusterTime and will throw a `StaleChunkHistory` error if it cannot find them.
|
||||
|
||||
#### Code references
|
||||
* [**ChunkManager class**](https://github.com/mongodb/mongo/blob/r4.3.6/src/mongo/s/chunk_manager.h#L233-L451)
|
||||
* [**RoutingTableHistory class**](https://github.com/mongodb/mongo/blob/r4.3.6/src/mongo/s/chunk_manager.h#L70-L231)
|
||||
* [**ChunkHistory class**](https://github.com/mongodb/mongo/blob/r4.3.6/src/mongo/s/catalog/type_chunk.h#L131-L145)
|
||||
|
||||
- [**ChunkManager class**](https://github.com/mongodb/mongo/blob/r4.3.6/src/mongo/s/chunk_manager.h#L233-L451)
|
||||
- [**RoutingTableHistory class**](https://github.com/mongodb/mongo/blob/r4.3.6/src/mongo/s/chunk_manager.h#L70-L231)
|
||||
- [**ChunkHistory class**](https://github.com/mongodb/mongo/blob/r4.3.6/src/mongo/s/catalog/type_chunk.h#L131-L145)
|
||||
|
||||
---
|
||||
|
|
|
|||
|
|
@ -1,11 +1,15 @@
|
|||
# Sharding Catalog
|
||||
|
||||
Depending on the team, the definition of "the catalog" can be different. Here, we will define it as a combination of the following:
|
||||
* **Catalog objects:** The set of conceptual "objects" which we talk about in the core server without regard to how they are implemented or stored. Examples are shards, databases, collections, indexes, collMods and views; but not config servers, caches or internal system collections.
|
||||
* [**Catalog containers:**](#catalog-containers) The set of WT tables, system collections and in-memory caches that store all or part of the descriptions of the *Catalog Objects*, without regard to the protocols that are used when being read or written to. Examples are the [*__mdb_catalog*](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/storage/storage_engine_impl.cpp#L75), *config.databases*, *config.collections*, *config.chunks*, [*CollectionCatalog*](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/catalog/collection_catalog.h#L50), [*CatalogCache*](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/catalog_cache.h#L134), [*SS*](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/sharding_state.h#L51), [*DSS*](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/s/database_sharding_state.h#L45), [*CSS*](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/s/collection_sharding_state.h#L59) and any WT tables backing the data for the user collections; but not the actual classes that implement them or the shard versioning protocol
|
||||
* [**Sharding catalog API:**](#sharding-catalog-api) The actual C++ classes and methods representing the above concepts that developers use in order to program distributed data applications, along with their contracts. Examples are [*CatalogCache*](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/catalog_cache.h#L134), [*SS*](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/sharding_state.h#L51), [*DSS*](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/s/database_sharding_state.h#L45), [*CSS*](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/s/collection_sharding_state.h#L59), DDL Coordinator and the shard versioning protocol; but not the transactions API, replication subsystem or the networking code.
|
||||
|
||||
- **Catalog objects:** The set of conceptual "objects" which we use to talk about in the core server without regard to how they are implemented or stored. Examples are shards, databases, collections, indexes, collMods and views; but not config servers, caches or internal system collections.
- [**Catalog containers:**](#catalog-containers) The set of WT tables, system collections and in-memory caches that store all or part of the descriptions of the _Catalog Objects_, without regard to the protocols that are used when being read or written to. Examples are the [_\_\_mdb_catalog_](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/storage/storage_engine_impl.cpp#L75), _config.databases_, _config.collections_, _config.chunks_, [_CollectionCatalog_](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/catalog/collection_catalog.h#L50), [_CatalogCache_](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/catalog_cache.h#L134), [_SS_](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/sharding_state.h#L51), [_DSS_](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/s/database_sharding_state.h#L45), [_CSS_](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/s/collection_sharding_state.h#L59) and any WT tables backing the data for the user collections; but not the actual classes that implement them or the shard versioning protocol
- [**Sharding catalog API:**](#sharding-catalog-api) The actual C++ classes and methods representing the above concepts that developers use in order to program distributed data applications, along with their contracts. Examples are [_CatalogCache_](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/catalog_cache.h#L134), [_SS_](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/sharding_state.h#L51), [_DSS_](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/s/database_sharding_state.h#L45), [_CSS_](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/db/s/collection_sharding_state.h#L59), DDL Coordinator and the shard versioning protocol; but not the transactions API, replication subsystem or the networking code.
## Catalog containers

The catalog containers store all or part of the descriptions of the various catalog objects. There are two types of containers, persisted and in-memory (caches), and the diagram below visualises their relationships. The dotted lines between the containers indicate the flow of data between them.

```mermaid
classDiagram
CollectionCatalog <.. __mdb_catalog

@@ -31,32 +35,38 @@ classDiagram

```
### Authoritative containers

Put naively, a container is said to be "authoritative" if it can be "frozen in time" and its data can be trusted to be definitive for the respective catalog object. For example, if the CSRS is "frozen", we can safely trust the _config.chunks_ collection to know where data is located; however, it is currently not possible to "freeze" a shard and trust it to know which set of chunks it owns. On the other hand, if a shard is "frozen" we can safely trust it about which global indexes exist on any collections that it owns.

Based on the above, as it stands, different containers on different nodes are authoritative for different parts of the catalog. Most authoritative containers are on the CSRS. In the future we would like the shards to be authoritative for everything they own, with the CSRS acting just as a materialised view (i.e., a cache) of all shards' catalogs.
### Synchronisation

The most important requirement of any sharded feature is that it scales linearly with the size of the data or the workload.

In order to scale, sharding utilises "optimistic" distributed synchronisation protocols to avoid creating nodes which are a bottleneck (i.e., the CSRS). One of these protocols, named [shard versioning](README_versioning_protocols.md), allows the routers to use cached information to send queries to one or more shards, and only read from the CSRS if the state of the world changes (e.g. chunk migration).

The main goal of these protocols is to maintain certain causal relationships between the different catalog containers, where _routers_ operate on cached information and rely on the _shards_ to "correct" them if the data is no longer where the router thinks it is.

## Sharding catalog API

The purpose of the Sharding Catalog API is to present server engineers with an abstract programming model, which hides the complexities of the catalog containers and the protocols used to keep them in sync.

Even though the code currently doesn't always reflect it, this abstract programming model in practice looks like a tree of nested router and shard loops.

The [_router loop_](#router-role) takes some cached routing information, sends requests to a set of shards along with some token describing the cached information it used (i.e., the _shard version_) and must be prepared for any of the targeted shards to return a stale shard version exception, indicating that the router is stale. Upon receiving that exception, the router "refreshes" and tries again.

The [_shard loop_](#shard-role) takes a request from a router, checks whether the cache it used is up-to-date and if so, serves the request, otherwise returns a stale shard version exception.

### Router role
When a piece of code is running in a router loop, it is also said to be executing in the Router role. Currently, the code for the router role is scattered across at least the following utilities (a conceptual sketch of the retry loop follows the list):
- [ShardRegistry](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/client/shard_registry.h#L164)
- [CatalogCache](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/catalog_cache.h#L134)
- [Router](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/router.h#L41)
- [Stale Shard Version Helpers](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/stale_shard_version_helpers.h#L71-L72)
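
Below is a conceptual sketch of the router loop, with hypothetical helper names; the real implementations are the utilities listed above (for example the stale shard version helpers), and their exact signatures differ.

```cpp
// Conceptual router loop: try with cached routing info, refresh and retry on
// a stale-version error coming back from a shard.
template <typename Callable>
auto runInRouterRole(OperationContext* opCtx,
                     const NamespaceString& nss,
                     Callable&& callback) {
    while (true) {
        auto routingInfo = getCachedRoutingInfo(opCtx, nss);  // hypothetical
        try {
            // The callback attaches the shard version from 'routingInfo' to
            // every request it sends.
            return callback(routingInfo);
        } catch (const ExceptionFor<ErrorCodes::StaleConfig>&) {
            // A shard told us our cached information is stale: invalidate the
            // cache entry and go around the loop again.
            invalidateCachedRoutingInfo(opCtx, nss);  // hypothetical
        }
    }
}
```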
### Shard role
For a piece of code to be executing in the shard role, it must be holding some kind of synchronisation which guarantees the stability of the catalog for that scope. See [here](https://github.com/mongodb/mongo/blob/master/src/mongo/db/README_shard_role_api.md) for details about the Shard Role API.

@@ -1,15 +1,18 @@

# Node startup and shutdown

## Startup and sharding component initialization

The mongod initialization process is split into three phases. The first phase runs on startup and initializes the set of stateless components based on the cluster role. The second phase then initializes additional components that must be initialized with state read from the config server. The third phase is run on the [transition to primary](https://github.com/mongodb/mongo/blob/879d50a73179d0dd94fead476468af3ee4511b8f/src/mongo/db/repl/replication_coordinator_external_state_impl.cpp#L822-L901) and starts services that only run on primaries.

### Shard Server initialization

#### Phase 1:

1. On a shard server, the `CollectionShardingState` factory is set to an instance of the `CollectionShardingStateFactoryShard` implementation. The component lives on the service context.
1. The sharding [OpObservers are created](https://github.com/mongodb/mongo/blob/0e08b33037f30094e9e213eacfe16fe88b52ff84/src/mongo/db/mongod_main.cpp#L1000-L1001) and registered with the service context. The `MigrationChunkClonerSourceOpObserver` class forwards operations during migration to the chunk cloner. The `ShardServerOpObserver` class is used to handle the majority of sharding related events. These include loading the shard identity document when it is inserted and performing range deletions when they are marked as ready.

#### Phase 2:

1. The [shardIdentity document is loaded](https://github.com/mongodb/mongo/blob/37ff80f6234137fd314d00e2cd1ff77cde90ce11/src/mongo/db/s/sharding_initialization_mongod.cpp#L366-L373) if it already exists on startup. For shards, the shard identity document specifies the config server connection string. If the shard does not have a shardIdentity document, it has not been added to a cluster yet, and the "Phase 2" initialization happens when the shard receives a shardIdentity document as part of addShard.
1. If the shard identity document was found, then the [ShardingState is initialized](https://github.com/mongodb/mongo/blob/37ff80f6234137fd314d00e2cd1ff77cde90ce11/src/mongo/db/s/sharding_initialization_mongod.cpp#L416-L462) from its fields.
1. The global sharding state is set on the Grid. The Grid contains the sharding context for a running server. It exists both on mongod and mongos because the Grid holds all the components needed for routing, and both mongos and shard servers can act as routers.

@@ -18,37 +21,46 @@ The mongod intialization process is split into three phases. The first phase run

1. The remaining sharding components are [initialized for the current replica set role](https://github.com/mongodb/mongo/blob/37ff80f6234137fd314d00e2cd1ff77cde90ce11/src/mongo/db/s/sharding_initialization_mongod.cpp#L255-L286) before the Grid is marked as initialized.

#### Phase 3:

Shard servers [start up several services](https://github.com/mongodb/mongo/blob/879d50a73179d0dd94fead476468af3ee4511b8f/src/mongo/db/repl/replication_coordinator_external_state_impl.cpp#L885-L894) that only run on primaries.

### Config Server initialization

#### Phase 1:
The sharding [OpObservers are created](https://github.com/mongodb/mongo/blob/0e08b33037f30094e9e213eacfe16fe88b52ff84/src/mongo/db/mongod_main.cpp#L1000-L1001) and registered with the service context. The config server registers the OpObserverImpl and ConfigServerOpObserver observers.

#### Phase 2:

The global sharding state is set on the Grid. The Grid contains the sharding context for a running server. The config server does not need to be provided with the config server connection string explicitly as it is part of its local state.

#### Phase 3:

Config servers [run some services](https://github.com/mongodb/mongo/blob/879d50a73179d0dd94fead476468af3ee4511b8f/src/mongo/db/repl/replication_coordinator_external_state_impl.cpp#L866-L867) that only run on primaries.

### Mongos initialization

#### Phase 2:
The global sharding state is set on the Grid. The Grid contains the sharding context for a running server. Mongos is provided with the config server connection string as a startup parameter.

#### Code references

- Function to [initialize global sharding state](https://github.com/mongodb/mongo/blob/eeca550092d9601d433e04c3aa71b8e1ff9795f7/src/mongo/s/sharding_initialization.cpp#L188-L237).
- Function to [initialize sharding environment](https://github.com/mongodb/mongo/blob/37ff80f6234137fd314d00e2cd1ff77cde90ce11/src/mongo/db/s/sharding_initialization_mongod.cpp#L255-L286) on shard server.
- Hook for sharding [transition to primary](https://github.com/mongodb/mongo/blob/879d50a73179d0dd94fead476468af3ee4511b8f/src/mongo/db/repl/replication_coordinator_external_state_impl.cpp#L822-L901).
## Shutdown
If the mongod server is primary, it will [try to step down](https://github.com/mongodb/mongo/blob/0987c120f552ab6d347f6b1b6574345e8c938c32/src/mongo/db/mongod_main.cpp#L1046-L1072). Mongod and mongos then run their respective shutdown tasks, which clean up the remaining sharding components.

#### Code references

- [Shutdown logic](https://github.com/mongodb/mongo/blob/2bb2f2225d18031328722f98fe05a169064a8a8a/src/mongo/db/mongod_main.cpp#L1163) for mongod.
- [Shutdown logic](https://github.com/mongodb/mongo/blob/30f5448e95114d344e6acffa92856536885e35dd/src/mongo/s/mongos_main.cpp#L336-L354) for mongos.
### Quiesce mode on shutdown
mongos enters quiesce mode prior to shutdown, to allow short-running operations to finish.
During this time, new and existing operations are allowed to run, but `isMaster`/`hello`
requests return a `ShutdownInProgress` error, to indicate that clients should start routing

@@ -74,6 +86,6 @@ When mongos establishes outgoing connections to mongod nodes in the cluster, it
rather than `isMaster`.

#### Code references

- [isMaster command](https://github.com/mongodb/mongo/blob/r4.8.0-alpha/src/mongo/s/commands/cluster_is_master_cmd.cpp#L248) for mongos.
- [hello command](https://github.com/mongodb/mongo/blob/r4.8.0-alpha/src/mongo/s/commands/cluster_is_master_cmd.cpp#L64) for mongos.

@@ -27,7 +27,7 @@ The `GlobalUserWriteBlockState` stores whether user write blocking is enabled in
associated with the given `OperationContext` is
enabled](https://github.com/mongodb/mongo/blob/25377181476e4140c970afa5b018f9b4fcc951e8/src/mongo/db/s/global_user_write_block_state.cpp#L59-L67).
The `WriteBlockBypass` stores whether the user that initiated the write is able to perform writes
when user write blocking is enabled. On internal requests (i.e. from other `mongod` or `mongos`
instances in the sharded cluster/replica set), the request originator propagates `WriteBlockBypass`
[through the request
metadata](https://github.com/mongodb/mongo/blob/182616b7b45a1e360839c612c9ee8acaa130fe17/src/mongo/rpc/metadata.cpp#L115).

@@ -1,9 +1,11 @@

# Versioning Protocols

When a command is sent to a router, the router must decide which shards to forward this request to - a process called routing. Routing in MongoDB acts optimistically, meaning that a router will use whatever information it has cached to decide which shards to send the request to, and then rely on the shards to return an error if this information is stale.

This process is implemented via the shard versioning protocol and it is what prevents the config servers from becoming a bottleneck for commands while ensuring that the router eventually sends the command to the correct set of shards.

## Shard Versioning Protocol
When a router uses its cached information to send a request to a shard, it attaches a token describing the information it used. This token is the [database version](#database-version) for unsharded collections and the [shard version](#shard-version) for sharded collections.
When a shard receives the request, it will check this token to make sure that it matches the shard's local information. If it matches, then the request will proceed. If the version does not match, the shard will throw [an exception](https://github.com/mongodb/mongo/blob/r6.0.0/src/mongo/s/stale_exception.h).
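
Conceptually, the shard-side check looks like the following (hypothetical names; the real exceptions are defined in `stale_exception.h`, linked above):

```cpp
// Sketch of the check a shard performs on the version token attached by the
// router before serving a request.
void checkReceivedVersion(const VersionToken& received, const VersionToken& known) {
    if (received == known) {
        return;  // the router's cached information is up to date; serve the request
    }
    // Tell the router its cache is stale so it refreshes and retries. For
    // sharded collections this surfaces as StaleConfig, for databases as
    // StaleDbRoutingVersion.
    throwStaleVersionException(received, known);  // hypothetical
}
```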

@@ -40,24 +42,28 @@ sequenceDiagram

The protocol is the same when using a DBVersion; the only difference is that StaleDbRoutingVersion is returned to the router instead of StaleConfig. In practice, both the Database Version and the Shard Version are more complicated than an increasing integer, and their components are described below.

## Database Version

A database version is represented as DBV<U, T, M> and consists of three elements:

1. **U** (the uuid): a unique identifier to distinguish different instances of the database. The UUID remains unchanged for the lifetime of the database, changing when the database is dropped and recreated.
2. **T** (the timestamp): a new unique identifier introduced in version 5.0 which also remains unchanged for the lifetime of a database. The difference between the uuid and timestamp is that timestamps are comparable, allowing for ordering database versions in which the UUID/Timestamp do not match.
3. **M** (last modified): an integer incremented when the database changes its primary shard.

## Shard Version

The shard version is represented as SV<E, T, M, m, I> and consists of five elements:

1. **E** (the epoch): a unique identifier that distinguishes an instance of the collection.
2. **T** (the timestamp): a new unique identifier introduced in version 5.0. The difference between the epoch and timestamp is that timestamps are comparable, allowing for ordering shard versions in which the epoch/timestamp do not match.
3. **M** (major version): an integer used to indicate a change in data placement, as from a migration.
4. **m** (minor version): an integer used to indicate a change to data boundaries within a shard such as from a split or merge.
5. **I** (index version): a timestamp representing the time of the last modification to a global index in the collection.

The epoch and timestamp serve the same functionality, that of uniquely identifying an instance of the collection. For this reason, we group them together and call them the [**collection generation**](https://github.com/mongodb/mongo/blob/10fd84b6850ef672ff6ed367ca9292ad8db262d2/src/mongo/s/chunk_version.h#L38-L80). Likewise, the major and minor versions work together to describe the layout of data on the shards. Together, they are called the [**collection placement**](https://github.com/mongodb/mongo/blob/10fd84b6850ef672ff6ed367ca9292ad8db262d2/src/mongo/s/chunk_version.h#L82-L113) (or placement version). The [index version](https://github.com/mongodb/mongo/blob/r6.2.1/src/mongo/s/index_version.h) (or collection indexes) stands alone, describing the global indexes present in a collection. The relationship between these components can be visualized as the following.
```mermaid
classDiagram

@@ -71,31 +77,43 @@ classDiagram

link CollectionPlacement "https://github.com/mongodb/mongo/blob/10fd84b6850ef672ff6ed367ca9292ad8db262d2/src/mongo/s/chunk_version.h#L82-L113"
end
```

A change in the CollectionGeneration implies that the CollectionPlacement must have changed as well, since the collection itself has changed. The index version is independent of this hierarchy.
Each shard has its own shard version, which consists of the collection generation, the index version, and the maximum placement version of the ranges located on the shard. Similarly, the overall collection version consists of the collection generation, index version, and the maximum placement version of any range in the collection.
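
To make the composition concrete, here is a conceptual sketch of these components as plain structs (hypothetical types; the real ones are the `CollectionGeneration`, `CollectionPlacement` and index version classes linked above):

```cpp
// Conceptual sketch of the pieces that make up a shard version.
struct CollectionGeneration {
    OID epoch;            // E: unique identifier for an instance of the collection
    Timestamp timestamp;  // T: comparable identifier introduced in 5.0

    bool operator==(const CollectionGeneration& other) const {
        return epoch == other.epoch && timestamp == other.timestamp;
    }
};

struct CollectionPlacement {
    uint32_t majorVersion;  // M: bumped on data placement changes (e.g. migration)
    uint32_t minorVersion;  // m: bumped on boundary changes (e.g. split/merge)
};

struct ShardVersion {
    CollectionGeneration generation;  // a generation change invalidates everything
    CollectionPlacement placement;    // only comparable within the same generation
    Timestamp indexVersion;           // I: last modification to a global index
};
```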
### Operations that change the shard versions
Changes of the shard version indicate that some routing information has changed, and routers need to request updated information. Changes in different components of the shard version indicate different routing information changes.

#### Generation Changes

A change in the collection generation indicates that the collection has changed so significantly that all previous placement information is incorrect. Changes in this component can be caused by dropping and recreating the collection, refining its shard key, renaming it, or resharding it. This will indicate that all routing information is stale, and all routers need to fetch new information.

#### Placement Version Changes

A placement version change indicates that something has changed about what data is placed on what shard. The most important operation that changes the placement version is migration; however, split, merge and some other operations change it as well, even though they don't actually move any data around. These changes are more targeted than generation changes, and will only cause the router to refresh if it is targeting a shard that was affected by the operation.

#### Index Version Changes

An index version change indicates that there has been some change in the global index information of the collection, such as from adding or removing a global index.

## Routing Information Refreshes

For sharded collections, there are two sets of information that compose the routing information - the chunk placement information and the collection index information. The config server is [authoritative](README_sharding_catalog.md#authoritative-containers) for the placement information, while both the shards and the config server are authoritative for the index information.
When a router receives a stale config error, it will refresh whichever component is stale. If the router has an older CollectionGeneration or CollectionPlacement, it will refresh the placement information, whereas if it has an older IndexVersion, it will refresh the index information.

### Placement Information Refreshes

MongoS and shard primaries refresh their placement information from the config server. Shard secondaries, however, refresh from the shard primaries through a component called the Shard Server Catalog Cache Loader. When a shard primary refreshes from a config server, it persists the refreshed information to disk. This information is then replicated to secondaries who will refresh their cache from this on-disk information.

#### Incremental and Full Refreshes

A full refresh clears all cached information and replaces the cache with the information that exists on the node’s source, whereas an incremental refresh only replaces modified routing information from the node’s source.
Incremental refreshes will happen whenever there has been a [placement version change](#placement-version-changes), while [collection generation changes](#generation-changes) will cause a full refresh.
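
A conceptual sketch of that decision, reusing the hypothetical `ShardVersion` struct from the previous section (hypothetical helper names, not the actual catalog cache code):

```cpp
// Decide how to refresh cached routing information after a stale-version error.
void refreshRoutingInfo(const ShardVersion& cached, const ShardVersion& latest) {
    if (!(cached.generation == latest.generation)) {
        // Dropped/recreated, renamed, refined or resharded collection: all
        // previously cached placement information is unusable.
        performFullRefresh();  // hypothetical
    } else if (cached.placement.majorVersion < latest.placement.majorVersion ||
               (cached.placement.majorVersion == latest.placement.majorVersion &&
                cached.placement.minorVersion < latest.placement.minorVersion)) {
        // Same collection instance: only fetch the routing entries that
        // changed since the cached placement version.
        performIncrementalRefresh(cached.placement);  // hypothetical
    }
}
```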

### Index Information Refreshes

Index information refreshes are always done from the config server. The router will fetch the whole index information from the config server and replace what it has in its cache with the new information.

@@ -13,7 +13,7 @@ global uniqueness enforcement.

Unlike local indexes, global indexes are maintained in the sharding catalog, and are known to the
entire cluster rather than individually by each shard. The sharding catalog is responsible for
mapping a [base collection](#glossary) to its global indexes and vice-versa, for storing the index
specifications, and for routing index key writes to the owning shards.

The global index keys are stored locally in a shard in an internal system collection referred
to as the [global index container](#glossary). Unlike local index tables, a global index container

@@ -56,12 +56,12 @@ Sample index entry:

```
{_id: {shk1: 1 .. shkN: 1, _id: 1}, ik: BinData(KeyString({ik1: 1, .. ikN: 1})), tb: BinData(TypeBits({ik1: 1, .. ikN: 1}))}
```
- Top-level `_id` is the [document key](#glossary).
- `ik` is the index key. The key is stored in its [KeyString](https://github.com/mongodb/mongo/blob/dab0694cd327eb0f7e540de5dee97c69f84ea45d/src/mongo/db/catalog/README.md#keystring)
form without [TypeBits](https://github.com/mongodb/mongo/blob/dab0694cd327eb0f7e540de5dee97c69f84ea45d/src/mongo/db/catalog/README.md#typebits), as BSON binary data with subtype 0 (Generic binary subtype).
- `tb` are the [TypeBits](https://github.com/mongodb/mongo/blob/dab0694cd327eb0f7e540de5dee97c69f84ea45d/src/mongo/db/catalog/README.md#typebits).
This field is only present when not empty, and is stored as BSON binary data with subtype 0
(Generic binary subtype).
The global index collection is [clustered](https://github.com/mongodb/mongo/blob/dab0694cd327eb0f7e540de5dee97c69f84ea45d/src/mongo/db/catalog/README.md#clustered-collections)
by `_id`, it has a local unique secondary index on `ik` and is planned to be sharded by `ik`.
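
The shape of such an entry can be sketched as follows (illustrative only; exact KeyString builder names differ across versions, and the real write path goes through the global index machinery rather than hand-built BSON):

```cpp
// Sketch: build one global index key entry {_id: <document key>, ik: ..., tb: ...}.
BSONObj makeGlobalIndexEntry(const BSONObj& documentKey, const BSONObj& indexKey) {
    KeyString::Builder ks(
        KeyString::Version::kLatestVersion, indexKey, Ordering::make(BSONObj()));

    BSONObjBuilder entry;
    entry.append("_id", documentKey);
    // 'ik' is the KeyString form of the index key, without TypeBits,
    // stored as generic BinData (subtype 0).
    entry.appendBinData("ik", ks.getSize(), BinDataGeneral, ks.getBuffer());

    // 'tb' is only present when the TypeBits are non-empty.
    const auto& typeBits = ks.getTypeBits();
    if (!typeBits.isAllZeros()) {
        entry.appendBinData("tb", typeBits.getSize(), BinDataGeneral, typeBits.getBuffer());
    }
    return entry.obj();
}
```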

@@ -82,20 +82,26 @@ DDL operations replicate as `createGlobalIndex` and `dropGlobalIndex` command ty

Key insert and delete operations replicate as `xi` and `xd` CRUD types and do not generate
change stream events. On a secondary, these entries are applied in parallel with other CRUD
operations, and serialized based on the container's UUID and the entry's document key.

# DDL operations

TODO (SERVER-65567)

# Index builds

TODO (SERVER-65618)

# Maintenance of a built index

TODO (SERVER-65513)

# Glossary

**Global index container**: the internal system collection that stores the range of keys owned by
the shard for a specific global index.
**Base collection**: the user collection the global index belongs to.
**Document key**: the key that uniquely identifies a document in the base collection. It is composed
of the \_id value of the base collection's document followed by the shard key value(s) of the
base collection's document.

@@ -1,121 +1,166 @@

# Resharding internals

Resharding is the way for users to redistribute their data across the cluster. It is critical to
help maintain the data distribution and achieve high query performance in a sharded cluster.

# ReshardCollection Command

The resharding operation starts from a `reshardCollection` command from the user. The client sends a
`reshardCollection` command to mongos and mongos sends an internal `_shardsvrReshardCollection`
command [to the primary of admin database](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/s/commands/cluster_reshard_collection_cmd.cpp#L121).
The primary will [create a `ReshardCollectionCoordinatorDocument` and `ReshardCollectionCoordinator`](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/shardsvr_reshard_collection_command.cpp#L106). The `ReshardCollectionCoordinator` then issues
[`_configsvrReshardCollection` command to the config shard](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/reshard_collection_coordinator.cpp#L177) to ask the config server to start the resharding process.
The config server will do some validations for the incoming reshardCollection request and then
create a `ReshardingCoordinatorDocument` and `ReshardingCoordinator`, which will drive the state
machine of the resharding coordinator.

# Resharding State Machine

The whole resharding process is operated by three state machines, the `ReshardingCoordinatorService`,
the `ReshardingDonorService` and the `ReshardingRecipientService`. Each of the three state machines
refers to its corresponding state document called [ReshardingCoordinatorDocument](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/coordinator_document.idl),
[ReshardingDonorDocument](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/donor_document.idl)
and [ReshardingRecipientDocument](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/recipient_document.idl).
The `ReshardingDonorService` and `ReshardingRecipientService` will be started by the monitoring
thread once [it sees the resharding fields on the collection](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/shard_filtering_metadata_refresh.cpp#L464-L468). They also use the state documents to notify the other state machines during certain
states to make sure the whole process is moving forward.
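
As a rough map of the states described below, the coordinator side moves through phases like the following (conceptual sketch using the state names from this document; the real state enums live in the IDL state documents linked above and contain additional internal states):

```cpp
// Conceptual sketch of the coordinator-side resharding phases described below.
enum class ReshardingCoordinatorPhase {
    kInitializing,       // insert the coordinator doc, add recipientFields
    kPreparingToDonate,  // add donorFields, collect minFetchTimestamps
    kCloning,            // recipients copy data as of the cloneTimestamp
    kApplying,           // recipients apply oplog entries past the cloneTimestamp
    kBlockingWrites,     // donors block writes, recipients reach strict consistency
    kCommit,             // switch the temporary collection in for the original
    kQuiesced,           // keep the coordinator doc around to dedupe retries
};
```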
## ReshardingCoordinatorService

As the name suggests, this is the service that coordinates the resharding process. We'll break down
its responsibility by its states.

### Initializing

The ReshardingCoordinatorService will [insert the coordinator doc](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_coordinator_service.cpp#L931) to `config.reshardingOperations` and [add `recipientFields`](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_coordinator_service.cpp#L935) to the collection metadata. After that, it
[calculates the participant shards](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_coordinator_service.cpp#L2195)
and moves to the `kPreparingToDonate` state.

### Preparing to Donate
The coordinator will [add donorFields](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_coordinator_service.cpp#L2219)
to the collection metadata so the `ReshardingDonorService` will be started. Then the coordinator
will wait until [all donor shards have reported their `minFetchTimestamp`](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_coordinator_service.cpp#L2263) and move to `kCloning` state. The coordinator will pick
the highest `minFetchTimestamp` as the `cloneTimestamp` for all the recipient shards to perform
collection cloning on a snapshot at this timestamp.
Note: the `minFetchTimestamp` is a timestamp that a donor shard guarantees that after this
timestamp, all oplog entries on this donor shard contain recipient shard information.
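
The cloneTimestamp selection described above can be sketched as follows (illustrative helper, not the actual coordinator code):

```cpp
// The coordinator picks the highest reported minFetchTimestamp as the
// cloneTimestamp, so every donor's oplog after that point is guaranteed to
// carry recipient information.
Timestamp computeCloneTimestamp(const std::vector<Timestamp>& donorMinFetchTimestamps) {
    invariant(!donorMinFetchTimestamps.empty());
    return *std::max_element(donorMinFetchTimestamps.begin(),
                             donorMinFetchTimestamps.end());
}
```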

### Cloning

The coordinator notifies all recipients to refresh so they can start cloning. The coordinator then
simply [waits until all recipient shards finish cloning](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_coordinator_service.cpp#L2292) and then moves to the `kApplying` state.

### Applying

The coordinator will [wait for the `_canEnterCritial` future to be fulfilled](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_coordinator_service.cpp#L2344) and then move to `kBlockingWrites`.

### Blocking Writes

The coordinator will wait until all recipients are in the `kStrictConsistency` state and then it will move
to `kCommit`.

### Committing

In this state, the coordinator's job is to keep the donors and recipients in sync on switching the
original collection and the temporary resharding collection. Once they all successfully commit the
changes of this resharding, the reshardCollection operation is considered successful.

### Quiesced

The `kQuiesced` state is introduced to avoid a wrong attempt to start a new resharding operation.
It helps in the case of retrying a reshardCollection. In this state, we keep the coordinator doc
for a certain period of time so if the client issues the same `reshardCollection` again, we know
it's a duplicate and won't run resharding one more time.

## ReshardingDonorService

### Preparing to Donate

The donor will [do a no-op write and use the OpTime as the `minFetchTimestamp`](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_donor_service.cpp#L655), so this can make sure that all future oplog entries on this
donor shard contain the recipient shard information. After reporting the `minFetchTimestamp` to
the coordinator, the donor is ready to be cloned.

### Donating Initial Data

The donor will wait for the coordinator to coordinate the cloning process and let the recipients clone
data. It will move to `kDonatingOplogEntries` state once the cloning is completed.

### Donating Oplog Entries

The donor will wait for the coordinator to coordinate the oplog application stage, where recipients will
fetch and apply oplog entries that are written after the `minFetchTimestamp`.

### Preparing to Block Writes

The donor will [write a no-op oplog](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_donor_service.cpp#L776)
entry to block writes on the source collection.

### Blocking Writes
The donor shard will block writes until it gets a decision from the coordinator whether this
resharding can be committed or not.

## ReshardingRecipientService

### Awaiting Fetch Timestamp
The recipient service waits until the coordinator collects the `minFetchTimestamp` from all donor shards,
then transitions to the `kCreatingCollection` state.

### Creating Collection

The recipient service will create a temporary collection and write into it all the
data that is supposed to end up on this shard. After resharding is committed,
it will drop the original collection and rename this temp collection to replace the old collection.
There is an optimization here that the recipient [won't create any index](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_recipient_service.cpp#L702) except the `_id` index. This helps to reduce the write
amplification during cloning. All indexes will be recreated during the `kBuildingIndex` state.

### Cloning

This is the most time consuming part of resharding where the recipient shards clone all needed data
to this shard based on the new shardKey range. The [Resharding Collection Cloning](#resharding-collection-cloning)
section will go into more details of the cloning process.

### Building Index

The recipient will find out all the indexes that exist on the old collection and [create them](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_recipient_service.cpp#L904) on the new collection including the shardKey index
if necessary. The index building process is driven by the `IndexBuildsCoordinator`.

### Applying

The recipient waits for all oplog entries that are written after the `cloneTimestamp` to be applied on the
recipient shard. This is done together with the cloning part in `ReshardingDataReplication`, which
we will cover in the [Resharding Collection Cloning](#resharding-collection-cloning) section.

### Strict Consistency

The recipient service will enter strict consistency and wait for the coordinator to collect the status
of all donors and shards to decide whether this resharding operation can be committed.

# Resharding Collection Cloning

The collection cloning includes two parts: cloning the collection at a certain timestamp and applying
oplog entries after that timestamp.

## ReshardingCollectionCloner

The collection cloner runs as part of the `ReshardingRecipientService`; it reads the needed
collection data from the donor shard, then writes the data into the temp resharding collection. This
is achieved by a natural order scan on the donor. The `ReshardingCollectionCloner` [crafts a query](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_collection_cloner.cpp#L564) to do the natural order scan with a [resume token](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_collection_cloner.cpp#L589), which helps it retry on any transient error.
The cloning is done in parallel on the recipient by `ReshardingClonerFetcher`, where we can have
[multiple reader threads](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_collection_cloner.cpp#L503)
to do reads against different donor shards and [multiple writer threads](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_collection_cloner.cpp#L502) to write the fetched data on the recipient. The writer thread
also needs to [maintain the resumeToken](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_collection_cloner.cpp#L664)
for that shard, so one important thing here is that all data from the same donor shard should be
inserted sequentially, which is implemented in a way that all data from one donor shard will only be
processed by one writer thread.
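
The per-donor ordering requirement can be sketched like this (hypothetical helper, not the actual `ReshardingClonerFetcher` code):

```cpp
// All batches fetched from a given donor shard are handed to the same writer
// thread, so inserts from that donor are applied sequentially and the donor's
// resume token only advances once its batch is durably written.
size_t writerThreadFor(const ShardId& donorShardId, size_t numWriterThreads) {
    return std::hash<std::string>{}(donorShardId.toString()) % numWriterThreads;
}
```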

## ReshardingOplogFetcher and ReshardingOplogApplier

The `ReshardingOplogFetcher` [fetches oplog entries from `minFetchTimestamp`](https://github.com/mongodb/mongo/blob/c8778bfa3b21e9f6c6ac125ca48b816dc1994bf0/src/mongo/db/s/resharding/resharding_data_replication.cpp#L158-L161) on the corresponding donor shard and the
`ReshardingOplogApplier` will apply those oplog entries to the recipient shard.

@@ -1,66 +1,74 @@

# Serverless Internals
## Shard Split
A shard split is one of the serverless scaling primitives, allowing for scale out by migrating data for one or many tenants from an existing replica set to a newly formed replica set.
The following diagram illustrates the lifetime of a shard split operation:

### Protocol
A shard is split by calling the `commitShardSplit` command, and is generally issued by a cloud component such as the atlasproxy. The shard split protocol consists of an exchange of messages between two shards: the donor and recipient. This exchange is orchestrated by the donor shard in a PrimaryOnlyService implementation, which has the following steps:
1. **Start the split operation**
The donor receives a `commitShardSplit` command with a `recipientSetName`, `recipientTagName`, and list of tenants that should be split into the recipient. The `recipientTagName` identifies recipient nodes in the donor config, and the `recipientSetName` is the setName for the recipient replica set.
All active index builds for collections belonging to tenants which will be split are [aborted](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/serverless/shard_split_donor_service.cpp#L649-L652) at the start of the split operation. All index builds for tenants being split will be blocked for the duration of the operation.
|
||||
|
||||
Finally, the donor [reserves an oplog slot](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/serverless/shard_split_donor_service.cpp#L926), called the `blockTimestamp`, after which all user requests for tenants being split will be blocked. It then durably records a state document update to the `kBlocking` state at the `blockTimestamp`, and enters the split critical section.
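As a rough picture of what the donor receives, here is a hypothetical command document. Only `recipientSetName` and `recipientTagName` come from the text above; the other field names (for example `tenantIds` and `migrationId`) are assumptions of this sketch, not the command's IDL definition.

```python
import uuid

# Hypothetical shape of the commitShardSplit request built by a cloud component.
commit_shard_split = {
    "commitShardSplit": 1,
    "migrationId": uuid.uuid4(),          # assumed identifier for the operation
    "recipientSetName": "recipientRs",    # setName the split-off nodes will adopt
    "recipientTagName": "recipientNode",  # replica set tag marking recipient nodes
    "tenantIds": ["tenantA", "tenantB"],  # tenants whose data moves to the recipient
}
print(commit_shard_split)
```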
|
||||
|
||||
2. **Wait for recipient nodes to catch up**
|
||||
Before proceeding with any split-specific steps, the donor must wait for all recipient nodes to catch up to the `blockTimestamp`. This wait is accomplished by calling [ReplicationCoordinator::awaitReplication with a custom tagged writeConcern](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/serverless/shard_split_donor_service.cpp#L702), which targets nodes in the local config with the `recipientTagName`. Note that because of how replica set tags are implemented, each recipient node must have a different value for the `recipientTagName` ([learn more](https://www.mongodb.com/docs/manual/tutorial/configure-replica-set-tag-sets/#std-label-configure-custom-write-concern)). Donor nodes are guaranteed to be caught up because we [wait for majority write](https://github.com/mongodb/mongo/blob/c2a1125bc0bb729acfec94a94be924b2bb65d128/src/mongo/db/serverless/shard_split_donor_service.cpp#L663-L667) of the state document establishing the `blockTimestamp`.
|
||||
|
||||
3. **Applying the split**
|
||||
The donor then [prepares a "split config"](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/serverless/shard_split_donor_service.cpp#L718-L730) which is a copy of the current config with recipient nodes removed, an increased version, and a new subdocument (`recipientConfig`) which contains the config recipient nodes will apply during the split. The recipient config is a copy of the current config with donor nodes removed, recipient nodes reindexed from zero, and a new set name. The donor then calls `replSetReconfig` on itself with the split config.
|
||||
|
||||
Recipient nodes learn of the split config through heartbeats. When a recipient node sees a split config, it will first [wait for its oplog buffers to drain](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L682). This guarantees that the `lastAppliedOpTime` reported by each node in their `hello` responses gives an accurate view of which node is furthest along in application.
|
||||
|
||||
After draining, the recipient node will install the embedded recipient config. Once the config is successfully installed the recipient node will clear its [`lastCommittedOpTime` and `currentCommittedOpTime`](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L1065-L1066) and [restart oplog application](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L1068-L1070). We clear these two pieces of metadata to guarantee that recipient nodes never propagate opTimes from the donor timeline.
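A minimal sketch of deriving the split config and the embedded recipient config from a donor config, following the description in this step. This is illustrative Python over a simplified config shape; version handling and field names beyond those mentioned above are assumptions.

```python
def make_split_config(current_config, recipient_tag, recipient_set_name):
    """Sketch: donor members only, bumped version, recipient config embedded."""
    def is_recipient(member):
        return recipient_tag in member.get("tags", {})

    recipient_members = [m for m in current_config["members"] if is_recipient(m)]
    donor_members = [m for m in current_config["members"] if not is_recipient(m)]

    recipient_config = {
        "_id": recipient_set_name,          # new set name
        "version": 1,                       # version handling simplified for the sketch
        # Recipient nodes are re-indexed from zero in their new set.
        "members": [dict(m, _id=i) for i, m in enumerate(recipient_members)],
    }
    return {
        "_id": current_config["_id"],
        "version": current_config["version"] + 1,
        "members": donor_members,
        "recipientConfig": recipient_config,
    }

donor_config = {
    "_id": "donorRs",
    "version": 7,
    "members": [
        {"_id": 0, "host": "d1:27017"},
        {"_id": 1, "host": "d2:27017"},
        {"_id": 2, "host": "r1:27017", "tags": {"recipientNode": "one"}},
        {"_id": 3, "host": "r2:27017", "tags": {"recipientNode": "two"}},
    ],
}
print(make_split_config(donor_config, "recipientNode", "recipientRs"))
```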
|
||||
|
||||
4. **Accepting the split**
|
||||
The donor [creates one SingleServerDiscoveryMonitor per recipient node](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/serverless/shard_split_donor_service.cpp#L561) at the beginning of a split operation in order to monitor recipient nodes for split acceptance. The primary criterion for split acceptance is that each recipient node reports the recipient set name; however, the split monitors will also [track the highest lastAppliedOpTime seen](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/serverless/shard_split_utils.cpp#L329) for each recipient node so that we can later choose which node to elect as the recipient primary.
|
||||
|
||||
Once all nodes have correctly reported the recipient set name, the donor will [send a replSetStepUp command](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/serverless/shard_split_donor_service.cpp#L850) to the node with the highest `lastAppliedOpTime`, guaranteeing that the election will succeed. After sending this command the donor will [wait for a majority write](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/serverless/shard_split_donor_service.cpp#L856) on the recipient by sending an `appendOplogNote` command with a majority write concern to the new recipient primary. We need to ensure that the new primary's first oplog entry is majority committed; otherwise it is possible for a node with an older `lastAppliedOpTime` to be elected and cause the chosen recipient primary to roll back before its `lastAppliedOpTime`.
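A small sketch of the selection rule described above: only accept the split when every recipient node reports the new set name, then step up the node that is furthest along. The response layout is an assumption of the example.

```python
# Each monitor reports (host, setName seen in its hello response, lastAppliedOpTime).
hello_responses = [
    ("r1:27017", "recipientRs", (1, 105)),
    ("r2:27017", "recipientRs", (1, 112)),
    ("r3:27017", "recipientRs", (1, 99)),
]

def choose_recipient_primary(responses, recipient_set_name):
    # Split acceptance: every recipient node must report the recipient set name.
    if not all(set_name == recipient_set_name for _, set_name, _ in responses):
        return None
    # Step up the node that is furthest ahead so the election cannot fail and the
    # chosen primary cannot be rolled back past its lastAppliedOpTime by a peer.
    return max(responses, key=lambda r: r[2])[0]

print(choose_recipient_primary(hello_responses, "recipientRs"))  # r2:27017
```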
|
||||
|
||||
5. **Committing the split**
|
||||
Finally, the donor commits the split decision by performing an [update to its state document to the kCommitted state](https://github.com/mongodb/mongo/blob/646eed48d0da896588759030f2ec546ac6fbbd48/src/mongo/db/serverless/shard_split_donor_service.cpp#L869-L870). User requests which were blocked will now be rejected with a `TenantMigrationCommitted` error, indicating that the sender should update its routing tables and retry the request against the recipient.
|
||||
|
||||
### Error Handling
|
||||
|
||||
`commitShardSplit` will return [TenantMigrationCommitted](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/serverless/shard_split_commands.cpp#L171-L173), [CommandFailed](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/serverless/shard_split_commands.cpp#L166-L169), [`ConflictingServerlessOperation`](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/serverless/serverless_operation_lock_registry.cpp#L52-L54), or any retryable errors encountered during the operation's execution. On retryable error, callers are expected to retry the operation against the new donor primary. A ConflictingServerlessOperation *may* be retried; however, the caller should do extra work to ensure the conflicting operation has completed before retrying.
|
||||
|
||||
### Access Blocking
|
||||
[Access blockers](#access-blocking-1) are [installed](https://github.com/mongodb/mongo/blob/87b60722e3c5ddaf7bc73d1ba08b31b437ef4f48/src/mongo/db/serverless/shard_split_donor_op_observer.cpp#L155-L161) on all nodes as soon as a split operation performs its first state transition to kAbortingIndexBuilds. They are initially configured to allow all reads and writes. When the donor primary transitions to the kBlocking state (entering the critical section) it first instructs its access blockers to begin [blocking writes](https://github.com/mongodb/mongo/blob/87b60722e3c5ddaf7bc73d1ba08b31b437ef4f48/src/mongo/db/serverless/shard_split_donor_service.cpp#L918), ensuring that no writes to tenant data can commit with a timestamp after the `blockTimestamp`. We begin to block reads once the kBlocking state document [update is committed](https://github.com/mongodb/mongo/blob/87b60722e3c5ddaf7bc73d1ba08b31b437ef4f48/src/mongo/db/serverless/shard_split_donor_op_observer.cpp#L201). Writes begin blocking on secondaries when the kBlocking state change is [committed on the secondary](https://github.com/mongodb/mongo/blob/87b60722e3c5ddaf7bc73d1ba08b31b437ef4f48/src/mongo/db/serverless/shard_split_donor_op_observer.cpp#L195); this ensures that an access blocker is already installed and blocking writes if there is a donor primary failover.
|
||||
|
||||
Access blockers are removed when the state document backing a shard split operation is [deleted](https://github.com/mongodb/mongo/blob/87b60722e3c5ddaf7bc73d1ba08b31b437ef4f48/src/mongo/db/serverless/shard_split_donor_op_observer.cpp#L437). Since garbage collection of split operation state documents is [not immediate](https://github.com/mongodb/mongo/blob/87b60722e3c5ddaf7bc73d1ba08b31b437ef4f48/src/mongo/db/serverless/shard_split_donor_service.cpp#L1178-L1182), access blockers will continue to block reads and writes to tenant data for some time after the operation has completed its critical section. If the split operation is aborted, then access blockers will be removed as soon as the state document [records a decision and is marked garbage-collectable](https://github.com/mongodb/mongo/blob/87b60722e3c5ddaf7bc73d1ba08b31b437ef4f48/src/mongo/db/serverless/shard_split_donor_op_observer.cpp#L297-L304) (the `expireAt` field is set). Otherwise, access blockers will be removed when [the state document is deleted](https://github.com/mongodb/mongo/blob/87b60722e3c5ddaf7bc73d1ba08b31b437ef4f48/src/mongo/db/serverless/shard_split_donor_op_observer.cpp#L435-L438). Access blockers are removed from recipient nodes [after installing the recipient config](https://github.com/mongodb/mongo/blob/e476ee17e9258f540d97a51baf471f5496488e33/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L878-L887), as they are no longer donors.
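A minimal state-machine model of the donor access blocker behavior described above: allow everything, block writes on entering the critical section, block reads once the kBlocking state document commits, and reject with `TenantMigrationCommitted` after the decision is durable. The class and method names are illustrative, not the server's `TenantMigrationAccessBlocker` API.

```python
class DonorAccessBlockerSketch:
    """Illustrative model only; not the server's access blocker class."""

    def __init__(self):
        self.block_writes = False
        self.block_reads = False
        self.committed = False

    def on_enter_blocking(self):      # donor primary starts the critical section
        self.block_writes = True

    def on_blocking_committed(self):  # kBlocking state document update is committed
        self.block_reads = True

    def on_split_committed(self):     # kCommitted decision is durable
        self.committed = True

    def check_write(self):
        if self.committed:
            raise RuntimeError("TenantMigrationCommitted: retry against the recipient")
        if self.block_writes:
            return "blocked"          # queued until the critical section ends
        return "ok"

blocker = DonorAccessBlockerSketch()
blocker.on_enter_blocking()
print(blocker.check_write())          # blocked
```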
|
||||
|
||||
Access blockers may be removed in a few other scenarios:
|
||||
- [When the shard split namespace is dropped](https://github.com/mongodb/mongo/blob/87b60722e3c5ddaf7bc73d1ba08b31b437ef4f48/src/mongo/db/serverless/shard_split_donor_op_observer.cpp#L456)
|
||||
- [When it fails to insert the initial state document](https://github.com/mongodb/mongo/blob/87b60722e3c5ddaf7bc73d1ba08b31b437ef4f48/src/mongo/db/serverless/shard_split_donor_op_observer.cpp#L168-L169)
|
||||
|
||||
Access blockers are recovered:
|
||||
- On startup after the [local config is loaded](https://github.com/mongodb/mongo/blob/65154f6a1356de6ca09e04975a0acdfb1a0351ef/src/mongo/db/repl/replication_coordinator_impl.cpp#L537)
|
||||
- After initial sync has completed in [InitialSyncer::\_teardown](https://github.com/mongodb/mongo/blob/65154f6a1356de6ca09e04975a0acdfb1a0351ef/src/mongo/db/repl/initial_syncer.cpp#L580)
|
||||
- On rollback during the [RollbackImpl::\_runPhaseFromAbortToReconstructPreparedTxns](https://github.com/mongodb/mongo/blob/65154f6a1356de6ca09e04975a0acdfb1a0351ef/src/mongo/db/repl/rollback_impl.cpp#L655)
|
||||
|
||||
### Cleanup
|
||||
|
||||
Once a shard split operation has completed it will return either [CommandFailed](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/serverless/shard_split_commands.cpp#L166-L169) (if the operation was aborted for any reason), or [TenantMigrationCommitted](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/serverless/shard_split_commands.cpp#L171-L173) (if the operation succeeded). At this point it is the caller's responsibility to take any necessary post-operation actions (such as updating routing tables) before calling `forgetShardSplit` on the donor primary. Calling this command will cause the donor primary to mark the operation garbage-collectable, by [setting the expireAt field in the operation state document](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/serverless/shard_split_donor_service.cpp#L1140-L1141) to the current time plus a configurable delay, `repl::shardSplitGarbageCollectionDelayMS`, which has a [default value of 15 minutes](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/repl/repl_server_parameters.idl#L688-L696). The operation will wait for the delay and then [delete the state document](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/serverless/shard_split_donor_service.cpp#L1186), which in turn removes access blockers installed for the operation. It is then the responsibility of the caller to remove orphaned data on the donor and recipient.
|
||||
|
||||
### Serverless server parameter
|
||||
The [replication.serverless](https://github.com/mongodb/mongo/blob/e75a51a7dcbe842e07a24343438706d865de96dc/src/mongo/db/mongod_options_replication.idl#L77) server parameter allows starting a mongod without providing a replica set name. It cannot be used at the same time as [replication.replSet](https://github.com/mongodb/mongo/blob/e75a51a7dcbe842e07a24343438706d865de96dc/src/mongo/db/mongod_options_replication.idl#L64) or [replication.replSetName](https://github.com/mongodb/mongo/blob/e75a51a7dcbe842e07a24343438706d865de96dc/src/mongo/db/mongod_options_replication.idl#L70). When `replication.serverless` is used, the replica set name is learned through [replSetInitiate](https://www.mongodb.com/docs/manual/reference/command/replSetInitiate/) or [through a heartbeat](https://github.com/mongodb/mongo/blob/e75a51a7dcbe842e07a24343438706d865de96dc/src/mongo/db/repl/replication_coordinator_impl.cpp#L5848) from another mongod. A mongod can only learn its replica set name once.
|
||||
|
||||
Using `replication.serverless` also enables a node to apply a recipient config to join a new recipient set as part of a split.
|
||||
|
||||
### Glossary
|
||||
|
||||
**recipient config**
|
||||
The config for the recipient replica set.
**split config**
A config based on the original config which excludes the recipient nodes, has an increased version, and includes the recipient config as a subdocument.

**blockTimestamp**
Timestamp after which reads and writes are blocked on the donor replica set for all tenants involved until completion of the split.
## Shard Merge
|
||||
|
||||
A shard merge is one of the serverless scaling primitives, allowing for scale in by migrating all tenant data from an underutilized replica set to another existing replica set. The initial replica set will be decommissioned by the cloud control plane after completion of the operation.

The following diagram illustrates the lifetime of a shard merge operation:
|
|||
### Protocol
|
||||
|
||||
1. **Start the merge operation**
|
||||
The donor primary receives the `donorStartMigration` command to begin the operation. The [TenantMigrationDonorOpObserver](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/tenant_migration_donor_op_observer.cpp#L82) creates a donor access blocker for each tenant and a global donor access blocker.
|
||||
|
||||
All active index builds for collections belonging to tenants which will be migrated are [aborted](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/tenant_migration_donor_service.cpp#L949-L968) at the start of the merge operation. All index builds for tenants being migrated will be blocked for the duration of the operation.
|
||||
|
||||
The donor then reserves an oplog slot, called the `startMigrationDonorTimestamp`. It then [durably records](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/tenant_migration_donor_service.cpp#L982) a state document update to the `kDataSync` state at the `startMigrationDonorTimestamp` and sends the `recipientSyncData` command to the recipient primary with the `startMigrationDonorTimestamp` and waits for a response.
|
||||
|
||||
2. **Recipient copies donor data**
|
||||
The recipient primary receives the `recipientSyncData` command and [durably persists](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/shard_merge_recipient_service.cpp#L2428) a state document used to track migration progress. The [ShardMergeRecipientOpObserver](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/shard_merge_recipient_op_observer.cpp#L163-L167) creates a recipient access blocker for each tenant. The primary then opens a backup cursor on the donor, records the checkpoint timestamp, and inserts the list of WiredTiger files that need to be cloned into the [donated files collection](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/shard_merge_recipient_service.cpp#L1034-L1046). The backup cursor is kept alive (by periodic `getMore`s) until all recipient nodes have copied donor data. WiredTiger will not modify file data on the donor while the cursor is open.
|
||||
|
||||
Additionally, the recipient primary will ensure that its majority commit timestamp is greater than the backup cursor timestamp from the donor. We [advance](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/repl/shard_merge_recipient_service.cpp#L1787-L1789) the cluster time to `donorBackupCursorCheckpointTimestamp` and then write a majority committed noop.
|
||||
|
||||
A `ShardMergeRecipientOpObserver` on each recipient node will [watch for inserts](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/shard_merge_recipient_op_observer.cpp#L198) into the donated files collection and then [clone and import](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/tenant_file_importer_service.cpp#L299-L303) all file data via the `TenantFileImporterService`. When the data is consistent and all files have been imported, the recipient replies `OK` to the `recipientSyncData` command and kills the backup cursor.
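The backup-cursor-driven copy can be summarized as: open the cursor (which pins donor files), record the checkpoint timestamp, publish the file list, and keep the cursor alive until every recipient node has cloned the files. The following is a self-contained simulation under those stated assumptions; the stub class stands in for the real backup cursor and is not a driver API.

```python
class BackupCursorStub:
    """Stand-in for the donor's backup cursor; real file lists come from WiredTiger."""
    def __init__(self):
        self.checkpoint_timestamp = 100
        self.files = ["collection-0.wt", "index-1.wt", "_mdb_catalog.wt"]
        self.open = True
    def get_more(self):   # periodic getMore keeps the cursor (and file pins) alive
        assert self.open
    def close(self):
        self.open = False

def clone_donor_files(cursor, recipient_nodes):
    # Publish the file list (the "donated files" collection); every recipient node
    # copies the files before the cursor is closed.
    donated_files = list(cursor.files)
    copied = {node: [] for node in recipient_nodes}
    while any(len(copied[n]) < len(donated_files) for n in recipient_nodes):
        cursor.get_more()                           # keep-alive while copies are in flight
        for node in recipient_nodes:
            if len(copied[node]) < len(donated_files):
                copied[node].append(donated_files[len(copied[node])])
    cursor.close()                                  # donor files may change again after this
    return cursor.checkpoint_timestamp, copied

ts, copied = clone_donor_files(BackupCursorStub(), ["recipient1", "recipient2"])
print(ts, {n: len(f) for n, f in copied.items()})   # 100 {'recipient1': 3, 'recipient2': 3}
```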
3. **Donor enters blocking state**

Upon receiving a `recipientSyncData` response, the donor reserves an oplog slot, updates the state document to the `kBlocking` state, and sets the `blockTimestamp` to prevent writes. The donor then sends a second `recipientSyncData` command to the recipient with `returnAfterReachingDonorTimestamp` set to the `blockTimestamp` and waits for a reply.

4. **Recipient oplog catchup**

After the cloned data is consistent, the recipient primary enters the oplog catchup phase. Here, the primary fetches and [applies](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/shard_merge_recipient_service.cpp#L2230) any donor oplog entries that were written between the backup cursor checkpoint timestamp and the `blockTimestamp`. When all entries have been majority replicated and we have [ensured](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/shard_merge_recipient_service.cpp#L599-L602) that the recipient's logical clock has advanced to at least `returnAfterReachingDonorTimestamp`, the recipient replies `OK` to the second `recipientSyncData` command.

5. **Committing the merge**

After receiving a successful response to the `recipientSyncData` command, the Donor updates its state document to `kCommitted` and sets the `commitOrAbortOpTime`. After the commit, the Donor will respond to `donorStartMigration` with `OK`. At this point, all traffic should be re-routed to the Recipient. Finally, the cloud control plane will send `donorForgetMigration` to the Donor (which will in turn send `recipientForgetMigration` to the Recipient) to mark the migration as garbage collectable.
|
||||
|
||||
## Access Blocking
|
||||
|
||||
During the critical section of a serverless operation the server will queue user requests for data involved in the operation, waiting to produce a response until after the critical section has completed. This process is called “blocking”, and the server provides this functionality by maintaining a [map of namespace to tenant access blocker](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/repl/tenant_migration_access_blocker_registry.h#L242-L243). This registry is consulted when deciding to block (a minimal sketch of this lookup follows the list below):
|
||||
- **commands** in the ServiceEntryPoint ([InvokeCommand::run](https://github.com/mongodb/mongo/blob/bc57b7313bce890cf1a7d6cdf20f1ec25949698f/src/mongo/db/service_entry_point_common.cpp#L886-L888), or [CheckoutSessionAndInvokeCommand::run](https://github.com/mongodb/mongo/blob/bc57b7313bce890cf1a7d6cdf20f1ec25949698f/src/mongo/db/service_entry_point_common.cpp#L886-L888))
|
||||
- **linearizable reads** in the [RunCommandImpl::\_epilogue](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/service_entry_point_common.cpp#L1249)
|
||||
- **writes** in [OpObserverImpl::onBatchedWriteCommit](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/op_observer/op_observer_impl.cpp#L1882-L1883), [OpObserverImpl::onUnpreparedTransactionCommit](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/op_observer/op_observer_impl.cpp#L1770-L1771), and the [\_logOpsInner oplog helper](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/repl/oplog.cpp#L429-L430)
|
||||
- **index builds** in [ReplIndexBuildState::tryAbort](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/repl_index_build_state.cpp#L495), IndexBuildsCoordinatorMongod::\_startIndexBuild ([here](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/index_builds_coordinator_mongod.cpp#L282), [here](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/index_builds_coordinator_mongod.cpp#L356-L357))
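The sketch below models the registry lookup that the call sites above perform before blocking. It is illustrative Python over a plain dict; extracting the tenant from the namespace prefix and the class shape are assumptions of the example, not the registry's real interface.

```python
class BlockerSketch:
    def __init__(self, block_writes=False):
        self.block_writes = block_writes

# Registry keyed by tenant (namespace prefix); populated when a serverless
# operation inserts its state document.
registry = {"tenantA": BlockerSketch(block_writes=True)}

def blocker_for_namespace(ns: str):
    # "tenantA_mydb.orders" -> "tenantA"
    tenant = ns.split("_", 1)[0] if "_" in ns else None
    return registry.get(tenant)

def should_block_write(ns: str) -> bool:
    blocker = blocker_for_namespace(ns)
    return blocker is not None and blocker.block_writes

print(should_block_write("tenantA_mydb.orders"))   # True
print(should_block_write("otherTenant_mydb.foo"))  # False
```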
|
||||
|
||||
## Mutual Exclusion
|
||||
|
||||
Of the three types of serverless operation (tenant migration, shard merge, and shard split), no new operation may start if there are any active operations of another serverless operation type. The serverless operation lock allows multiple Tenant Migrations to run simultaneously, but it does not allow running operations of a different type at the same time.
|
||||
|
||||
This so-called “serverless operation lock” is acquired the first time a state document is inserted for a particular operation ([shard split](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/serverless/shard_split_donor_op_observer.cpp#L150-L151), [tenant migration donor](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/repl/tenant_migration_donor_op_observer.cpp#L58-L60), [tenant migration recipient](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/repl/tenant_migration_recipient_op_observer.cpp#L127-L129), [shard merge recipient](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/shard_merge_recipient_op_observer.cpp#L152-L154)). Once the lock is acquired, any attempt to insert a state document of a different operation type will [result in a ConflictingServerlessOperation](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/serverless/serverless_operation_lock_registry.cpp#L52-L54). The lock is released when an operation durably records its decision and marks its state document as garbage collectable ([shard split](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/serverless/shard_split_donor_op_observer.cpp#L261-L263), [tenant migration donor](https://github.com/mongodb/mongo/blob/1c4fafd4ae5c082f36a8af1442aa48174962b1b4/src/mongo/db/repl/tenant_migration_donor_op_observer.cpp#L169-L171), [tenant migration recipient](https://github.com/mongodb/mongo/blob/a723af8863c5fae1eee7b0a891066e923468e974/src/mongo/db/repl/tenant_migration_recipient_op_observer.cpp#L152-L154), [shard merge recipient](https://github.com/mongodb/mongo/blob/f05053d2cb65b84eaed4db94c25e9fe4be82d78c/src/mongo/db/repl/shard_merge_recipient_op_observer.cpp#L280-L282)). Serverless operation locks continue to be held even after a stepdown for the same reason access blockers are: if an election occurs later, the lock is already held, which prevents conflicting operations on the newly elected primary.
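A toy model of the lock semantics just described: any number of operations of one kind may hold it, but an operation of a different kind is rejected until all holders release. Names here are illustrative; this is not the `ServerlessOperationLockRegistry` API.

```python
class ConflictingServerlessOperation(Exception):
    pass

class ServerlessOperationLockSketch:
    """Illustrative model of the serverless operation lock."""

    def __init__(self):
        self.kind = None          # e.g. "shardSplit", "tenantMigrationDonor"
        self.active_ids = set()

    def acquire(self, kind, op_id):   # on first state document insert
        if self.kind is not None and self.kind != kind:
            raise ConflictingServerlessOperation(f"{self.kind} already in progress")
        self.kind = kind
        self.active_ids.add(op_id)

    def release(self, op_id):         # when the decision is marked garbage-collectable
        self.active_ids.discard(op_id)
        if not self.active_ids:
            self.kind = None

lock = ServerlessOperationLockSketch()
lock.acquire("tenantMigrationDonor", "m1")
lock.acquire("tenantMigrationDonor", "m2")   # same kind: allowed
try:
    lock.acquire("shardSplit", "s1")         # different kind: rejected
except ConflictingServerlessOperation as e:
    print(e)
```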
|
||||
|
||||
## Change Streams
|
||||
|
||||
Change Stream data for a Serverless cluster is stored in a handful of tenantId-prefixed collections:
|
||||
|
||||
- change collection: `<tenantId>_config.system.change_collection`
|
||||
- pre-images: `<tenantId>_config.system.preimages`
|
||||
- cluster parameters: `<tenantId>_config.system.cluster_parameters`
|
||||
|
||||
A Shard Split operation will copy these collections from donor to recipient via Initial Sync. Upon completion, these collections will be cleaned up on the donor (by the cloud control plane) along with all other tenant-specific databases.
A Shard Merge operation will copy these collections from donor to recipient via the backup cursor file copy and the oplog catchup phase described above.
We extract the 'o2' entry from a given noop oplog entry written during this phase (which will contain the original entry on the donor timeline) and write it to the tenant's change collection (see [here](https://github.com/mongodb/mongo/blob/26a441e07f3885dc8b3d9ef9b564eb4f5143bded/src/mongo/db/change_stream_change_collection_manager.cpp#L133-L135) for implementation details). Change collection entries written on the recipient during oplog catchup must be written on the donor timeline so that a change stream can be resumed on the recipient after the Shard Merge.
|
||||
|
||||
For pre-image support, two oplog entry fields (`donorOpTime` and `donorApplyOpsIndex`, see [here](https://github.com/mongodb/mongo/blob/26a441e07f3885dc8b3d9ef9b564eb4f5143bded/src/mongo/db/repl/oplog_entry.idl#L168-L180)) were added in order to ensure that pre-image entries written on the recipient will be identical to those on the donor. These fields are conditionally set on oplog entries written during the oplog catchup phase of a Shard Merge and used to determine which timestamp and applyOps index to use when writing pre-images. See [here](https://github.com/mongodb/mongo/blob/07b38e091b48acd305469d525b81aebf3aeadbf1/src/mongo/db/repl/oplog.cpp#L1237-L1268) for details.
# Storage Engine API
|
||||
|
||||
The purpose of the Storage Engine API is to allow for pluggable storage engines in MongoDB (refer
|
||||
to the [Storage FAQ][]). This document gives a brief overview of the API, and provides pointers
|
||||
to places with more detailed documentation. Where referencing code, links are to the version that
|
||||
was current at the time when the reference was made. Always compare with the latest version for
|
||||
changes not yet reflected here. For questions on the API that are not addressed by this material,
|
||||
use the [mongodb-dev][] Google group. Everybody involved in the Storage Engine API will read your
|
||||
post.
See <https://github.com/mongodb-partners/mongo-rocks> for a good example of the API in use.
For more context and information on how this API is used, see the
|
||||
[Execution Architecture Guide](https://github.com/mongodb/mongo/blob/master/src/mongo/db/catalog/README.md).
## Concepts
|
||||
|
||||
### Record Stores
|
||||
|
||||
A database contains one or more collections, each with a number of indexes, and a catalog listing
|
||||
them. All MongoDB collections are implemented with record stores: one for the documents themselves,
|
||||
and one for each index. By using the KVEngine class, you only have to deal with the abstraction, as
the StorageEngineImpl implements the StorageEngine interface, using record stores for the catalog and
indexes.
|
||||
|
||||
#### Record Identities
|
||||
|
||||
A RecordId is a unique identifier, assigned by the storage engine, for a specific document or entry
|
||||
in a record store at a given time. For storage engines based in the KVEngine the record identity is
|
||||
fixed, but other storage engines may change it when updating a document. Note that changing record
|
||||
|
|
@ -42,10 +41,12 @@ ids can be very expensive, as indexes map to the RecordId. A single document wit
|
|||
have thousands of index entries, resulting in very expensive updates.
|
||||
|
||||
#### Cloning and bulk operations
|
||||
|
||||
Currently all cloning, [initial sync][] and other operations are done in terms of operating on
|
||||
individual documents, though there is a BulkBuilder class for more efficiently building indexes.
|
||||
|
||||
### Locking and Concurrency
|
||||
|
||||
MongoDB uses multi-granular intent locking; see the [Concurrency FAQ][]. In all cases, this will
|
||||
ensure that operations to meta-data, such as creation and deletion of record stores, are serialized
|
||||
with respect to other accesses.
|
||||
|
|
@ -55,6 +56,7 @@ manages. MongoDB will only use intent locks for the most common operations, leav
|
|||
at the record store layer up to the storage engine.
|
||||
|
||||
### Transactions
|
||||
|
||||
Each operation creates an OperationContext with a new RecoveryUnit, implemented by the storage
|
||||
engine, that lives until the operation finishes. Currently, query operations that return a cursor
|
||||
to the client live as long as that client cursor, with the operation context switching between its
|
||||
|
|
@ -63,22 +65,26 @@ an extra recovery unit as well. The recovery unit must implement transaction sem
|
|||
below.
|
||||
|
||||
#### Atomicity
|
||||
|
||||
Writes must only become visible when explicitly committed, and in that case all pending writes
|
||||
become visible atomically. Writes that are not committed before the unit of work ends must be
|
||||
rolled back. In addition to writes done directly through the Storage API, such as document updates
|
||||
and creation of record stores, other custom changes can be registered with the recovery unit.
|
||||
|
||||
#### Consistency
|
||||
|
||||
Storage engines must ensure that atomicity and isolation guarantees span all record stores, as
|
||||
otherwise the guarantee of atomic updates on a document and all its indexes would be violated.
|
||||
|
||||
#### Isolation
|
||||
|
||||
Storage engines must provide snapshot isolation, either through locking, through multi-version
|
||||
concurrency control (MVCC) or otherwise. The first read implicitly establishes the snapshot.
|
||||
Operations can always see all changes they make in the context of a recovery unit, but other
|
||||
operations cannot until a successful commit.
|
||||
|
||||
#### Durability
|
||||
|
||||
Once a transaction is committed, it is not necessarily durable: if, and only if the server fails,
|
||||
as a result of power loss or otherwise, the database may recover to an earlier point in time.
|
||||
However, atomicity of transactions must remain preserved. Similarly, in a replica set, a primary
|
||||
|
|
@ -91,6 +97,7 @@ may use a group commit, bundling a number of transactions to achieve durability.
|
|||
storage engine may wait for durability at commit time.
|
||||
|
||||
### Write Conflicts
|
||||
|
||||
Systems with optimistic concurrency control (OCC) or multi-version concurrency control (MVCC) may
|
||||
find that a transaction conflicts with other transactions, that executing an operation would result
|
||||
in deadlock or violate other resource constraints. In such cases the storage engine may throw a
WriteConflictException to signal the transient failure. MongoDB will handle the exception
and restart the transaction.
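The handle-and-retry behavior can be shown with a short loop. This mirrors the pattern described above in spirit only; the exception class and function names are illustrative, not the server's writeConflictRetry helper.

```python
import random

class WriteConflictException(Exception):
    """Transient failure thrown by an OCC/MVCC storage engine."""

def write_conflict_retry(txn_body, max_attempts=100):
    # Abort the unit of work and run the whole transaction body again on conflict.
    for _ in range(max_attempts):
        try:
            return txn_body()
        except WriteConflictException:
            continue   # a real implementation would back off and log here
    raise RuntimeError("too many write conflicts")

def flaky_transaction():
    if random.random() < 0.5:
        raise WriteConflictException()
    return "committed"

print(write_conflict_retry(flaky_transaction))
```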
|
||||
|
||||
### Point-in-time snapshot reads
|
||||
|
||||
Two functions on the RecoveryUnit help storage engines implement point-in-time reads: setTimestamp()
|
||||
and selectSnapshot(). setTimestamp() is used by write transactions to label any forthcoming writes
|
||||
with a timestamp; these timestamps are then used to produce a point-in-time read transaction via a
|
||||
call to selectSnapshot() at the start of the read. The storage engine must produce the effect of
|
||||
reading from a snapshot that includes only writes with timestamps at or earlier than the
|
||||
selectSnapshot timestamp. This means that a point-in-time read may slice across prior write
|
||||
transactions by hiding only some data from a given write transaction, if that transaction had a
|
||||
different timestamp set prior to each write it did.
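A toy timestamped key-value store illustrates the slicing behavior described above: writes are labeled with timestamps, and a snapshot read hides anything newer, even later writes of the same transaction if it set a different timestamp before each write. This models the semantics only; it is not the RecoveryUnit API.

```python
class ToyTimestampedStore:
    """Toy model of setTimestamp()/selectSnapshot() semantics."""

    def __init__(self):
        self.versions = {}            # key -> list of (timestamp, value)

    def set_timestamp_and_put(self, key, value, timestamp):
        self.versions.setdefault(key, []).append((timestamp, value))

    def select_snapshot(self, read_timestamp):
        def read(key):
            visible = [v for ts, v in self.versions.get(key, []) if ts <= read_timestamp]
            return visible[-1] if visible else None
        return read

store = ToyTimestampedStore()
# One write transaction that sets a new timestamp before each of its writes.
store.set_timestamp_and_put("a", 1, timestamp=10)
store.set_timestamp_and_put("b", 2, timestamp=20)

read_at_15 = store.select_snapshot(15)
print(read_at_15("a"), read_at_15("b"))   # 1 None  -- the read slices across the transaction
```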
|
||||
|
||||
## Classes to implement
|
||||
|
||||
A storage engine should generally implement the following classes. See their definitions for more
|
||||
details.
|
||||
|
||||
- [KVEngine](kv/kv_engine.h)
|
||||
- [RecordStore](record_store.h)
|
||||
- [RecoveryUnit](recovery_unit.h)
|
||||
- [SeekableRecordCursor](record_store.h)
|
||||
- [SortedDataInterface](sorted_data_interface.h)
|
||||
- [ServerStatusSection](../commands/server_status.h)
|
||||
- [ServerParameter](../server_parameters.h)
|
||||
|
||||
[Concurrency FAQ]: http://docs.mongodb.org/manual/faq/concurrency/
|
||||
[initial sync]: http://docs.mongodb.org/manual/core/replica-set-sync/#replica-set-initial-sync
|
||||
|
|
|
|||
|
|
@ -1,15 +1,18 @@
|
|||
# Execution Control
|
||||
|
||||
## Throughput Probing
|
||||
|
||||
### Server Parameters
|
||||
- `throughputProbingInitialConcurrency -> gInitialConcurrency`: initial number of concurrent read and write transactions
|
||||
- `throughputProbingMinConcurrency -> gMinConcurrency`: minimum concurrent read and write transactions
|
||||
- `throughputProbingMaxConcurrency -> gMaxConcurrency`: maximum concurrent read and write transactions
|
||||
- `throughputProbingReadWriteRatio -> gReadWriteRatio`: ratio of read and write tickets where 0.5 indicates 1:1 ratio
|
||||
- `throughputProbingConcurrencyMovingAverageWeight -> gConcurrencyMovingAverageWeight`: weight of new concurrency measurement in the exponentially-decaying moving average
|
||||
- `throughputProbingStepMultiple -> gStepMultiple`: step size for throughput probing
|
||||
|
||||
### Pseudocode
|
||||
|
||||
```
|
||||
setConcurrency(concurrency)
|
||||
ticketsAllottedToReads := clamp((concurrency * gReadWriteRatio), gMinConcurrency, gMaxConcurrency)
|
||||
|
|
@ -17,7 +20,7 @@ setConcurrency(concurrency)
|
|||
|
||||
getCurrentConcurrency()
|
||||
return ticketsAllocatedToReads + ticketsAllocatedToWrites
|
||||
|
||||
|
||||
exponentialMovingAverage(stableConcurrency, currentConcurrency)
|
||||
return (currentConcurrency * gConcurrencyMovingAverageWeight) + (stableConcurrency * (1 - gConcurrencyMovingAverageWeight))
|
||||
|
||||
|
|
@ -57,12 +60,13 @@ probeDown(currentThroughput)
|
|||
```
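The pseudocode above is only partially reproduced here, so the following is a runnable Python rendering of the visible helpers (clamp, ticket allotment, current concurrency, and the exponential moving average). The write-ticket formula and the parameter values are assumptions of this sketch.

```python
def clamp(value, lo, hi):
    return max(lo, min(hi, value))

# Values standing in for the server parameters listed above.
gMinConcurrency = 5
gMaxConcurrency = 128
gReadWriteRatio = 0.5
gConcurrencyMovingAverageWeight = 0.2

tickets_allotted_to_reads = 0
tickets_allotted_to_writes = 0

def set_concurrency(concurrency):
    global tickets_allotted_to_reads, tickets_allotted_to_writes
    tickets_allotted_to_reads = clamp(int(concurrency * gReadWriteRatio),
                                      gMinConcurrency, gMaxConcurrency)
    # Assumed symmetric formula for writes; the original pseudocode line is not shown here.
    tickets_allotted_to_writes = clamp(int(concurrency * (1 - gReadWriteRatio)),
                                       gMinConcurrency, gMaxConcurrency)

def get_current_concurrency():
    return tickets_allotted_to_reads + tickets_allotted_to_writes

def exponential_moving_average(stable_concurrency, current_concurrency):
    w = gConcurrencyMovingAverageWeight
    return current_concurrency * w + stable_concurrency * (1 - w)

set_concurrency(64)
print(get_current_concurrency())                                  # 64
print(exponential_moving_average(32, get_current_concurrency()))  # 38.4
```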
|
||||
|
||||
### Diagram
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
A(Stable Probe) --> |at minimum and tickets not exhausted|A
|
||||
|
||||
A --> |"(above minimum and tickets not exhausted) or at maximum"|C(Probe Down)
|
||||
subgraph
|
||||
C --> |throughput increased|F{{Decrease stable concurrency}}
|
||||
C --> |throughput did not increase|G(Go back to stable concurrency)
|
||||
end
|
||||
|
|
@ -70,7 +74,7 @@ F --> H
|
|||
G --> H
|
||||
|
||||
A --> |below maximum and tickets exhausted| B(Probe Up)
|
||||
subgraph
|
||||
B --> |throughput increased|D{{Increase stable concurrency}}
|
||||
B --> |throughput did not increase|E{{Go back to stable concurrency}}
|
||||
end
|
||||
|
|
|
|||
|
|
@ -6,31 +6,33 @@ measurements while organizing the actual data in buckets.
|
|||
|
||||
A minimally configured time-series collection is defined by providing the [timeField](timeseries.idl)
|
||||
at creation. Optionally, a meta-data field may also be specified to help group
|
||||
measurements in the buckets. MongoDB also supports an expiration mechanism on measurements through
|
||||
the `expireAfterSeconds` option.
|
||||
|
||||
A time-series collection `mytscoll` in the `mydb` database is represented in the [catalog](../catalog/README.md) by a
|
||||
combination of a view and a system collection:
|
||||
- The view `mydb.mytscoll` is defined with the bucket collection as the source collection with
  certain properties:
  - Writes (inserts only) are allowed on the view. Every document inserted must contain a time field.
  - Querying the view implicitly unwinds the data in the underlying bucket collection to return
    documents in their original non-bucketed form.
    - The aggregation stage [$_internalUnpackBucket](../pipeline/document_source_internal_unpack_bucket.h) is used to
      unwind the bucket data for the view. For more information about this stage and query rewrites for
      time-series collections see [query/timeseries/README](../query/timeseries/README.md).
- The system collection has the namespace `mydb.system.buckets.mytscoll` and is where the actual
  data is stored.
  - Each document in the bucket collection represents a set of time-series data within a period of time.
  - If a meta-data field is defined at creation time, this will be used to organize the buckets so that
    all measurements within a bucket have a common meta-data value.
  - Besides the time range, buckets are also constrained by the total number and size of measurements.
|
||||
|
||||
Time-series collections can also be sharded. For more information about sharding-specific implementation
|
||||
details, see [db/s/README_timeseries.md](../s/README_timeseries.md).
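Before looking at the bucket schema below, the grouping rules described above can be sketched in a few lines: measurements are grouped by meta-data value, each open bucket tracks per-field minimums and maximums in a control block, and a bucket is closed when it reaches a size limit. The cap and field names here are illustrative, not the server's actual limits or document layout.

```python
from collections import defaultdict

MAX_MEASUREMENTS_PER_BUCKET = 1000   # illustrative cap only

buckets = defaultdict(list)          # open bucket (list of measurements) per meta value

def insert_measurement(measurement, time_field="t", meta_field="m"):
    key = measurement.get(meta_field)
    bucket = buckets[key]
    if len(bucket) >= MAX_MEASUREMENTS_PER_BUCKET:
        buckets[key] = bucket = []   # close the full bucket and open a new one
    bucket.append(measurement)
    times = [m[time_field] for m in bucket]
    # The bucket document's control block tracks min/max per field; only time is shown here.
    return {"control": {"min": {time_field: min(times)},
                        "max": {time_field: max(times)}},
            "meta": key,
            "count": len(bucket)}

print(insert_measurement({"t": 5, "m": "sensorA", "temp": 20.5}))
print(insert_measurement({"t": 9, "m": "sensorA", "temp": 21.0}))
```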
|
||||
|
||||
## Bucket Collection Schema
|
||||
|
||||
Uncompressed bucket (version 1):
|
||||
|
||||
```
|
||||
{
|
||||
_id: <Object ID with time component equal to control.min.<time field>>,
|
||||
|
|
@ -74,11 +76,13 @@ Uncompressed bucket (version 1):
|
|||
}
|
||||
}
|
||||
```
|
||||
|
||||
There are two types of compressed buckets, version 2 and version 3. They differ only in that the
|
||||
entries in the data field of version 2 buckets are sorted on the time field, whereas this is not
|
||||
enforced for version 3 buckets.
|
||||
|
||||
Compressed bucket (version 2 and version 3):
|
||||
|
||||
```
|
||||
{
|
||||
_id: <Object ID with time component equal to control.min.<time field>>,
|
||||
|
|
@ -99,7 +103,7 @@ Compressed bucket (version 2 and version 3):
|
|||
},
|
||||
closed: <bool>, // Optional, signals the database that this document will not receive any
|
||||
// additional measurements.
|
||||
count: <int> // The number of measurements contained in this bucket. Only present in
|
||||
// compressed buckets.
|
||||
},
|
||||
meta: <meta-data field (if specified at creation) value common to all measurements in this bucket>,
|
||||
|
|
@ -141,13 +145,14 @@ rather than collection scans, indexes may be created on the time, meta-data, and
|
|||
of a time-series collection. Starting in v6.0, indexes on time-series collection measurement fields
|
||||
are permitted. The index key specification provided by the user via `createIndex` will be converted
|
||||
to the underlying buckets collection's schema.
|
||||
- The details for mapping the index specification between the time-series collection and the
|
||||
underlying buckets collection may be found in
|
||||
[timeseries_index_schema_conversion_functions.h](timeseries_index_schema_conversion_functions.h).
|
||||
- Newly supported index types in v6.0 and up
|
||||
[store the original user index definition](https://github.com/mongodb/mongo/blob/cf80c11bc5308d9b889ed61c1a3eeb821839df56/src/mongo/db/timeseries/timeseries_commands_conversion_helper.cpp#L140-L147)
|
||||
on the transformed index definition. When mapping the bucket collection index to the time-series
|
||||
collection index, the original user index definition is returned.
|
||||
|
||||
Once the indexes have been created, they can be inspected through the `listIndexes` command or the
|
||||
`$indexStats` aggregation stage. `listIndexes` and `$indexStats` against a time-series collection
|
||||
|
|

@@ -160,27 +165,30 @@ field.
time-series collections.

Supported index types on the time field:

- [Single](https://docs.mongodb.com/manual/core/index-single/).
- [Compound](https://docs.mongodb.com/manual/core/index-compound/).
- [Hashed](https://docs.mongodb.com/manual/core/index-hashed/).
- [Wildcard](https://docs.mongodb.com/manual/core/index-wildcard/).
- [Sparse](https://docs.mongodb.com/manual/core/index-sparse/).
- [Multikey](https://docs.mongodb.com/manual/core/index-multikey/).
- [Indexes with collations](https://docs.mongodb.com/manual/indexes/#indexes-and-collation).

Supported index types on the metaField or its subfields:

- All of the supported index types on the time field.
- [2d](https://docs.mongodb.com/manual/core/2d/) from v6.0.
- [2dsphere](https://docs.mongodb.com/manual/core/2dsphere/) from v6.0.
- [Partial](https://docs.mongodb.com/manual/core/index-partial/) from v6.0.

Supported index types on measurement fields in v6.0 and up only:

- [Single](https://docs.mongodb.com/manual/core/index-single/) from v6.0.
- [Compound](https://docs.mongodb.com/manual/core/index-compound/) from v6.0.
- [2dsphere](https://docs.mongodb.com/manual/core/2dsphere/) from v6.0.
- [Partial](https://docs.mongodb.com/manual/core/index-partial/) from v6.0.
- [TTL](https://docs.mongodb.com/manual/core/index-ttl/) from v6.3. Must be used in conjunction with
  a `partialFilterExpression` based on the metaField or its subfields.

Index types that are not supported on time-series collections include
[unique](https://docs.mongodb.com/manual/core/index-unique/), and

@@ -210,7 +218,7 @@ full document (a so-called "classic" update), we create a DocDiff directly (a "d
update).

Any time a bucket document is updated without going through the `BucketCatalog`, the writer needs
to notify the `BucketCatalog` by calling `timeseries::handleDirectWrite` or `BucketCatalog::clear`
so that it can update its internal state and avoid writing any data which may corrupt the bucket
format.

@@ -223,7 +231,7 @@ the `_id` of an archived bucket (more details below). In other cases, this will
to use for a query.

The filters will include an exact match on the `metaField`, a range match on the `timeField`, size
filters on the `timeseriesBucketMaxCount` and `timeseriesBucketMaxSize` server parameters, and a
missing or `false` value for `control.closed`. At least for v6.3, the filters will also specify
`control.version: 1` to disallow selecting compressed buckets. The last restriction is for
performance, and may be removed in the future if we improve decompression speed or deem the benefits

@@ -264,7 +272,7 @@ The maximum span of time that a single bucket is allowed to cover is controlled

When a new bucket is opened by the `BucketCatalog`, the timestamp component of its `_id`, and
equivalently the value of its `control.min.<time field>`, will be taken from the first measurement
inserted to the bucket and rounded down based on the `bucketRoundingSeconds`. This rounding will
generally be accomplished by basic modulus arithmetic operating on the number of seconds since the
epoch, i.e. for an input timestamp `t` and a rounding value `r`, the rounded timestamp will be
taken as `t - (t % r)`.

@@ -339,10 +347,12 @@ Time-series deletes support retryable writes with the existing mechanisms. For t
they are run through the Internal Transaction API to make sure the two writes to storage are atomic.

# References

See:
[MongoDB Blog: Time Series Data and MongoDB: Part 2 - Schema Design Best Practices](https://www.mongodb.com/blog/post/time-series-data-and-mongodb-part-2-schema-design-best-practices)

# Glossary

**bucket**: A group of measurements with the same meta-data over a limited period of time.

**bucket collection**: A system collection used for storing the buckets underlying a time-series

@@ -1,9 +1,11 @@
# Executors

Executors are objects used to schedule and execute work asynchronously. Users can schedule tasks on
executors, and the executor will go through scheduled tasks in FIFO order and execute them
asynchronously. The various types of executors provide different properties described below.

## OutOfLineExecutor

The `OutOfLineExecutor` is the base class for other asynchronous execution APIs.
The `OutOfLineExecutor` declares a function `void schedule(Task task)` to delegate asynchronous
execution of `task` to the executor, and each executor type implements `schedule` when extending
@@ -12,40 +14,46 @@ the caller.
For more details and the semantics of executor APIs, see the comments in the [header file](https://github.com/mongodb/mongo/blob/master/src/mongo/util/out_of_line_executor.h).
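
The following is a minimal, self-contained sketch of the idea, not the actual `OutOfLineExecutor`, which has its own `Task` type and error-handling semantics: tasks handed to `schedule` are queued and run in FIFO order on a single worker thread.

```cpp
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Toy executor: runs scheduled callables in FIFO order on one background thread.
class SimpleFifoExecutor {
public:
    SimpleFifoExecutor() : _worker([this] { _run(); }) {}

    ~SimpleFifoExecutor() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _shutdown = true;
        }
        _cv.notify_one();
        _worker.join();
    }

    // Hand a task to the executor; it will run asynchronously, in FIFO order.
    void schedule(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _tasks.push(std::move(task));
        }
        _cv.notify_one();
    }

private:
    void _run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(_mutex);
                _cv.wait(lk, [this] { return _shutdown || !_tasks.empty(); });
                if (_tasks.empty()) {
                    return;  // Shutdown requested and nothing left to run.
                }
                task = std::move(_tasks.front());
                _tasks.pop();
            }
            task();
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::queue<std::function<void()>> _tasks;
    bool _shutdown = false;
    std::thread _worker;  // Declared last so the other members are ready first.
};

int main() {
    SimpleFifoExecutor executor;
    executor.schedule([] { std::cout << "first\n"; });
    executor.schedule([] { std::cout << "second\n"; });
    // The destructor drains the remaining queue before joining the worker thread.
    return 0;
}
```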

### OutOfLineExecutor Wrappers

Each `OutOfLineExecutor` wrapper enforces some property on a provided executor; a minimal sketch of
this wrapping pattern follows the list. The wrappers include:

- `GuaranteedExecutor`: ensures the scheduled task runs exactly once.
- `GuaranteedExecutorWithFallback`: wraps a preferred and a fallback executor and allows the
  preferred executor to pass tasks to the fallback. The wrapped executors ensure the scheduled tasks
  run exactly once.
- `CancelableExecutor`: accepts a cancelation token and an executor to add cancelation support to
  the wrapped executor.
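
As a rough illustration of the wrapping pattern only: the `Executor` interface and `AtMostOnceExecutor` below are invented for this sketch, and the real `GuaranteedExecutor` has stronger semantics than what is shown here.

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <utility>

// A deliberately simplified executor interface used only for this sketch.
class Executor {
public:
    virtual ~Executor() = default;
    virtual void schedule(std::function<void()> task) = 0;
};

// Runs tasks immediately on the calling thread.
class InlineExecutor : public Executor {
public:
    void schedule(std::function<void()> task) override {
        task();
    }
};

// Wrapper that forwards tasks to another executor but guards each one so it
// runs at most once, illustrating how a wrapper layers a property on top of
// the executor it decorates.
class AtMostOnceExecutor : public Executor {
public:
    explicit AtMostOnceExecutor(std::shared_ptr<Executor> inner) : _inner(std::move(inner)) {}

    void schedule(std::function<void()> task) override {
        auto ran = std::make_shared<bool>(false);
        _inner->schedule([ran, task = std::move(task)] {
            if (*ran) {
                return;  // Already ran; ignore a duplicate invocation.
            }
            *ran = true;
            task();
        });
    }

private:
    std::shared_ptr<Executor> _inner;
};

int main() {
    auto wrapped = std::make_shared<InlineExecutor>();
    AtMostOnceExecutor executor(wrapped);
    executor.schedule([] { std::cout << "ran once\n"; });
    return 0;
}
```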

## TaskExecutor

`TaskExecutor` is an abstract class inheriting from `OutOfLineExecutor` that supports the notion
of events and callbacks. `TaskExecutor` provides an interface for:

- Scheduling tasks, with functionality for cancellation or scheduling at a later time if desired.
- Creating events, having threads subscribe to the events, and notifying the subscribed threads
  when desired (see the sketch after this list).
- Scheduling remote and exhaust commands from a single or multiple remote hosts.
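
The event portion of this interface can be pictured with a standard-library one-shot event; this is a generic illustration, not the `TaskExecutor` event API.

```cpp
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

// Minimal one-shot event: threads can wait on it, and a signaler wakes them all.
class SimpleEvent {
public:
    void wait() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _signaled; });
    }

    void signal() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _signaled = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _signaled = false;
};

int main() {
    SimpleEvent event;
    std::thread waiter([&event] {
        event.wait();  // Blocks until another thread signals the event.
        std::cout << "event signaled\n";
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    event.signal();
    waiter.join();
    return 0;
}
```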

### Example Usage

- [Scheduling work and cancellation](https://github.com/mongodb/mongo/blob/311b84df538a5ee9ab4db507f610d8b814bb2099/src/mongo/executor/task_executor_test_common.cpp#L197-L209)
- [Scheduling remote commands](https://github.com/mongodb/mongo/blob/311b84df538a5ee9ab4db507f610d8b814bb2099/src/mongo/executor/task_executor_test_common.cpp#L568-L586)
- [Using `scheduleWorkAt`](https://github.com/mongodb/mongo/blob/311b84df538a5ee9ab4db507f610d8b814bb2099/src/mongo/executor/task_executor_test_common.cpp#L532-L566)
- [Event waiting and signaling](https://github.com/mongodb/mongo/blob/311b84df538a5ee9ab4db507f610d8b814bb2099/src/mongo/executor/task_executor_test_common.cpp#L378-L401)
- [Using `sleepUntil`](https://github.com/mongodb/mongo/blob/311b84df538a5ee9ab4db507f610d8b814bb2099/src/mongo/executor/task_executor_test_common.cpp#L509-L530)

### Task Executor Variants

- `ThreadPoolTaskExecutor`: implements the `TaskExecutor` interface and uses a thread pool to
  execute any work scheduled on the executor.
- `ScopedTaskExecutor`: wraps a `TaskExecutor` and cancels any outstanding operations on
  destruction.
- `PinnedConnectionTaskExecutor`: wraps a `TaskExecutor` and acts as a `ScopedTaskExecutor` that
  additionally runs all RPCs/remote operations scheduled through it over the same transport connection.
- `TaskExecutorCursor`: manages a remote cursor that uses an asynchronous task executor to run all
  stages of the command cursor protocol (initial command, getMore, killCursors). Offers a `pinConnections`
  option that utilizes a `PinnedConnectionTaskExecutor` to run all operations on the cursor over the
  same transport connection.
- `TaskExecutorPool`: represents a pool of `TaskExecutors`. Work which requires a `TaskExecutor` can
  ask for an executor from the pool. This allows for work to be distributed across several executors.

@@ -1,41 +1,41 @@
# IDL (Interface Definition Language)

- [IDL (Interface Definition Language)](#idl-interface-definition-language)
- [Key Features](#key-features)
- [Overview](#overview)
- [Getting started](#getting-started)
- [Getting Started with Commands](#getting-started-with-commands)
- [The IDL file](#the-idl-file)
- [Global](#global)
- [Imports](#imports)
- [Enums](#enums)
- [String Enums](#string-enums)
- [Integer Enums](#integer-enums)
- [Reference](#reference)
- [Types](#types)
- [Type Overview](#type-overview)
- [Basic Types](#basic-types)
- [Custom Types](#custom-types)
- [Any Types](#any-types)
- [Type Reference](#type-reference)
- [Structs](#structs)
- [BSON Lifetime](#bson-lifetime)
- [Chained Structs (aka struct reuse by composition)](#chained-structs-aka-struct-reuse-by-composition)
- [Struct Reference](#struct-reference)
- [Struct Fields Attribute Reference](#struct-fields-attribute-reference)
- [Field Validator Reference](#field-validator-reference)
- [Commands](#commands)
- [Commands Reference](#commands-reference)
- [Access Check Reference](#access-check-reference)
- [Check or Privilege](#check-or-privilege)
- [IDL Compiler Overview](#idl-compiler-overview)
- [Trees](#trees)
- [Passes](#passes)
- [Error Handling and Recovery](#error-handling-and-recovery)
- [Testing](#testing)
- [Extending IDL](#extending-idl)
- [Implementation Details](#implementation)
- [Best Practices](#best-practices)

Interface Definition Language (IDL) is a custom Domain Specific Language (DSL) originally designed
to generate code to meet MongoDB's needs for handling BSON. Server parameters and configuration

@@ -50,9 +50,9 @@ like XDR, ASN.1, MIDL, and Google's Protocol Buffers.
1. Generate C++ classes that represent BSON documents of a specific schema, and parse/serialize
   between BSON documents and the class (aka `struct` in IDL) using `BSONObj` and `BSONObjBuilder`
2. Generate C++ classes that represent MongoDB BSON commands of a specific schema, and
   parse/serialize between BSON documents
   - Commands are a subset of `struct` but understand the unique requirements of commands. Also, can
     parse `OpMsg`'s document sequences.
3. Parse and serialize Enums as strings or integers
4. Declare, parse and serialize server parameters (aka setParameters)
5. Declare, parse and serialize configuration options

@@ -96,8 +96,8 @@ Example Document:

```json
{
    "intField": 42,
    "stringField": "question"
}
```

@@ -107,10 +107,10 @@ to represent this in IDL, write the following file:

```yaml
global:
    cpp_namespace: "mongo"

imports:
    - "mongo/db/basic_types.idl"

structs:
    example:
```

@@ -173,11 +173,11 @@ private:

IDL generates 5 sets of key methods; a toy sketch of this surface follows the list.

- `constructor` - a C++ constructor with only the required fields as arguments
- `parse` - a static function that parses a BSON document to the C++ class
- `serialize`/`toBSON` - a method that serializes the C++ class to BSON
- `get*` - methods to get the value of a field after parsing
- `set*` - methods to set a field in the C++ class before serialization
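
To make the list concrete, here is a toy, hand-written class with the same surface: constructor, `parse`, `serialize`, getters and setters. It is not IDL output; real generated code works against `BSONObj`/`BSONObjBuilder` and MongoDB's error handling, while this sketch uses a plain `std::map` stand-in for the document.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

// Toy stand-in for a BSON document: field name -> string value.
using FakeDocument = std::map<std::string, std::string>;

// Hand-written illustration of the five kinds of methods listed above.
class ExampleSketch {
public:
    // 1. Constructor taking only the required fields.
    ExampleSketch(std::int32_t intField, std::string stringField)
        : _intField(intField), _stringField(std::move(stringField)) {}

    // 2. Static parse function: document -> C++ object.
    static ExampleSketch parse(const FakeDocument& doc) {
        if (doc.count("intField") == 0 || doc.count("stringField") == 0) {
            throw std::runtime_error("missing required field");
        }
        return ExampleSketch(std::stoi(doc.at("intField")), doc.at("stringField"));
    }

    // 3. Serialize: C++ object -> document.
    FakeDocument serialize() const {
        return {{"intField", std::to_string(_intField)}, {"stringField", _stringField}};
    }

    // 4. Getters.
    std::int32_t getIntField() const { return _intField; }
    const std::string& getStringField() const { return _stringField; }

    // 5. Setters.
    void setIntField(std::int32_t value) { _intField = value; }
    void setStringField(std::string value) { _stringField = std::move(value); }

private:
    std::int32_t _intField;
    std::string _stringField;
};

int main() {
    ExampleSketch parsed = ExampleSketch::parse({{"intField", "42"}, {"stringField", "question"}});
    parsed.setIntField(43);
    std::cout << parsed.getIntField() << " " << parsed.getStringField() << "\n";
    return 0;
}
```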

To use the generated class in a C++ file, write the following code:
|
@ -218,10 +218,10 @@ Example Command:
|
|||
|
||||
```json
|
||||
{
|
||||
"hasEncryptedFields" : "testCollection",
|
||||
"encryptionType" : "queryableEncryption",
|
||||
"comment" : "Example command",
|
||||
"$db" : "testDB"
|
||||
"hasEncryptedFields": "testCollection",
|
||||
"encryptionType": "queryableEncryption",
|
||||
"comment": "Example command",
|
||||
"$db": "testDB"
|
||||
}
|
||||
```
|
||||
|
||||
|
|
@ -229,8 +229,8 @@ which has a reply
|
|||
|
||||
```json
|
||||
{
|
||||
"answer" : "yes",
|
||||
"ok" : 1
|
||||
"answer": "yes",
|
||||
"ok": 1
|
||||
}
|
||||
```
|
||||
|
||||
|
|
@ -240,10 +240,10 @@ to represent this in IDL, write the following file:
|
|||
|
||||
```yaml
|
||||
global:
|
||||
cpp_namespace: "mongo"
|
||||
cpp_namespace: "mongo"
|
||||
|
||||
imports:
|
||||
- "mongo/db/basic_types.idl"
|
||||
- "mongo/db/basic_types.idl"
|
||||
|
||||
structs:
|
||||
hasEncryptedFieldReply:
|
||||
|
|

@@ -274,49 +274,51 @@ To see how to integrate a command IDL file in SCons, see the example above for s

An IDL file consists of a series of top-level sections (i.e. YAML maps).

- `global` - Global settings that affect code generation
- `imports` - List of other IDL files that contain enums, types and structs this file refers to
- `enums` - List of enums to generate code for
- `types` - List of types which instruct IDL how to deserialize/serialize primitives
- `structs` - List of BSON documents to deserialize/serialize to C++ classes
- `commands` - List of BSON commands used by MongoDB RPC to deserialize/serialize to C++ classes
- `server_parameters` - See [docs/server-parameters.md](../../../docs/server-parameters.md)
- `configs` - TODO SERVER-79135
- `feature_flags` - TODO SERVER-79135
- `generic_argument_lists` - List of arguments common to all command requests - not documented
- `generic_reply_field_lists` - List of arguments common to all command replies - not documented

## Global

- `cpp_namespace` - string - The C++ namespace for all generated classes and enums to belong to.
  Must start with `mongo`.
- `cpp_includes` - sequence - A list of C++ headers to include in the generated `.h` file. You
  should not list generated IDL headers here as includes for them are automatically generated from
  `imports`.
- `configs` - map - A section that defines global settings for configuration options
  - `source` - sequence - a subset of [`yaml`, `cli`, `ini`]
    - `cli` - configuration option handled by command line
    - `yaml` - configuration option handled by yaml config file
    - `ini` - configuration option handled by deprecated ini file format. Do not use for new flags.
  - `section` - string - Name of displayed section in `--help`
  - `initializer` - map
    - `register` - string - Name of generated function to add configuration options.

      If not provided, an anonymous MONGO_MODULE_STARTUP_OPTIONS_REGISTER initializer will be
      declared which will automatically register the config settings named in this file at startup
      time. This initializer will be named "idl_" followed by a string of hex digits. Currently this
      string is the SHA1 hash of the header's filename, but this should not be used in dependency
      rules since it may change at a later time.

      If provided, all registration logic will be implemented in a public function of the form
      `Status registerName(optionenvironment::OptionSection* options_ptr)`. It is up to additional
      code to decide how and when this registration function is called.

    - `store` - string - Name of generated function to store configuration options.

      This behaves like `register`, but using a MONGO_STARTUP_OPTIONS_STORE initializer in the
      not-provided case, and declaring `Status storeName(const optionenvironment::Environment& params)`
      in the provided case.

An example for a typical global section is:

@@ -327,14 +329,14 @@ global:
```
    - "mongo/idl/idl_test_types.h"
```

`mongo` is the C++ namespace for the generated code. One header is listed because the IDL types
depend on it in this imaginary example.

## Imports

The `imports` section is a list of other IDL files to include. If your IDL references other enums,
types, or structs, the imports section must list the IDL file with the definition or IDL throws an error.
_Note_: The IDL compiler does not generate code for imported things, it generates code for the file
listed on the command line. For instance, if your IDL file imports a struct named `ImportedStruct`,
the generated code calls its `ImportedStruct::parse` function but does not generate the
`ImportedStruct::parse` definition or declaration.

@@ -352,7 +354,7 @@ imports:
```
    - "mongo/db/basic_types.idl"
```

_Note_: [src/mongo/db/basic_types.idl](../db/basic_types.idl) is a foundational file for IDL. This
file defines the standard types of IDL. Without this file, IDL does not know how to read and write a
string or integer for instance.

@@ -423,9 +425,9 @@ std::int32_t IntEnum_serializer(IntEnum value);

Each `enum` can have the following pieces (a simplified sketch of string-enum helpers follows the list):

- `description` - string - A comment to add to the generated C++
- `type` - string - can be either `string` or `int`
- `values` - map - a map of `enum value name` -> `enum value`
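
As a rough, hand-written illustration of string-enum helpers (the generated names and error handling differ, and the enum below is made up):

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical enum; in IDL this would come from an `enums:` entry.
enum class ColorEnum { kRed, kGreen, kBlue };

// Serialize: enum value -> string spelling.
std::string ColorEnum_serializer(ColorEnum value) {
    switch (value) {
        case ColorEnum::kRed:
            return "red";
        case ColorEnum::kGreen:
            return "green";
        case ColorEnum::kBlue:
            return "blue";
    }
    throw std::invalid_argument("unknown ColorEnum value");
}

// Parse: string spelling -> enum value, rejecting unknown spellings.
ColorEnum ColorEnum_parse(const std::string& value) {
    if (value == "red")
        return ColorEnum::kRed;
    if (value == "green")
        return ColorEnum::kGreen;
    if (value == "blue")
        return ColorEnum::kBlue;
    throw std::invalid_argument("unknown ColorEnum spelling: " + value);
}

int main() {
    std::cout << ColorEnum_serializer(ColorEnum_parse("green")) << "\n";  // prints "green"
    return 0;
}
```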
Like struct.fields[], enum.values[] may be given as either a simple mapping `name: value` and indeed
most are, but they may also map names to a dictionary of information:

@@ -470,7 +472,7 @@ reference](#struct-fields-attribute-reference) for more information.

Type supports builtin BSON types like int32, int64, and string. These are types built into
`BSONElement`/`BSONObjBuilder`. It also supports custom types to give the code full control of
parsing and serialization. _Note_: IDL has no builtin types. The
[src/mongo/db/basic_types.idl](../db/basic_types.idl) file declares all common BSON types and must
be manually imported into every file. This separation makes unit testing easier and allows IDL to be
extendable by separating most type concerns from the python code.

@@ -504,19 +506,19 @@ string:

The five key things to note in this example:

- `bson_serialization_type` - a list of types BSON generated code should check a type is before
  calling the deserializer. In this case, IDL generated code checks if the BSON type is `string`.
- `cpp_type` - The C++ type to store the deserialized value as. This is the type of the member variable
  in the generated C++ class when this type is instantiated in a struct.
- `deserializer` - a method to call to deserialize the type. Typically this is a function that takes
  `BSONElement` as a parameter. The IDL generator has custom rules for `BSONElement`.
- `serializer` - omitted in this example because `BSONObjBuilder` has builtin support for
  `std::string`
- `is_view` - indicates whether the type is a view or not. If the type is a view, then it's
  possible that objects of the type will not own all of its members. If the type is not a view,
  then objects of the type are guaranteed to own all of its members. This field is optional and
  defaults to True. To reduce the size of the C++ representation of structs including this type,
  you can specify this field as False if the type is not a view type.
### Custom Types

@@ -560,46 +562,35 @@ IDLAnyType:

### Type Reference

- `description` - string - A comment to add to the generated C++
- `bson_serialization_type` - string or sequence - a list of types BSON generated code should check
  a type is before calling the deserializer. Can also be `any`.
  [buildscripts/idl/idl/bson.py](../../../buildscripts/idl/idl/bson.py) lists the supported types.
- `bindata_subtype` - string - if `bson_serialization_type` is `bindata`, this is the required
  bindata subtype. [buildscripts/idl/idl/bson.py](../../../buildscripts/idl/idl/bson.py) lists the
  supported bindata subtypes.
- `cpp_type` - The C++ type to store the deserialized value as. This is the type of the member variable
  in the generated C++ class when a struct/command uses this type.
  - `std::string` - When using `std::string`, the getters/setters use `mongo::StringData` instead
  - `std::vector<_>` - When using `std::vector<_>`, the getters/setters use
    `mongo::ConstDataRange` instead
- `deserializer` - string - a method name to call to deserialize the type. Typically this is a function
  that takes `BSONElement` as a parameter. The IDL generator has custom rules for `BSONElement`.
  - By default, IDL assumes it is an instance method of `cpp_type`.
  - If prefixed with `::`, assumes the function is a global static function
  - By default, the deserializer's function signature is `<function_name>(<cpp_type>)`.
  - For `object` types, the deserializer's function signature is `<function_name>(const BSONObj& obj)`
  - For `any` types, the deserializer's function signature is `<function_name>(BSONElement element)`.
- `serializer` - string - a method name to call to serialize the type.
  - By default, IDL assumes it is an instance method of `cpp_type`.
  - If prefixed with `::`, assumes the function is a global static function
  - By default, the serializer's function signature is `<type_append> <function_name>(const <cpp_type>&)`
    where `type_append` is a type `BSONObjBuilder` understands.
  - For `object` types, the serializer's function signature is `<function_name>(const BSONObj& obj)`
  - For `any` types that are not in an array, the serializer's function signature is
    `<function_name>(StringData fieldName, BSONObjBuilder* builder)`.
  - For `any` types that are in an array, the serializer's function signature is
    `<function_name>(BSONArrayBuilder* builder)`.
- `deserialize_with_tenant` - bool - if set, adds `TenantId` as the first parameter to
  `deserializer`
- `internal_only` - bool - undocumented, DO NOT USE
- `default` - string - default value for a type. A field in a struct inherits this value if a field
  does not set a default. See struct's `default` rules for more information.
- `is_view` - indicates whether the type is a view or not. If the type is a view, then it's
  possible that objects of the type will not own all of its members. If the type is not a view,
  then objects of the type are guaranteed to own all of its members.

## Structs

@@ -610,7 +601,7 @@ a command is its name, common fields across commands). See `commands` [below](#c

The generated C++ parsers for structs are strict by default. This means that they throw an error on
fields that they do not know about. Use `strict: false` to change this behavior. Mark persisted structs
with `strict: false` for future backwards compatibility needs.

A struct consists of one or more fields. All fields are required by default. The generated parser
errors if a field is missing from the BSON document. On serialization, if a field has not been set,

@@ -642,7 +633,7 @@ exampleStruct:
```
default: 42
```

This generates a C++ class with methods to parse and serialize the struct. _Note_: This code has
been simplified from the full code IDL generates.

```cpp
@@ -733,85 +724,85 @@ is not affected as this option is only syntactic sugar.
```

### Struct Reference

- `description` - string - A comment to add to the generated C++
- `fields` - sequence - see [fields attributes reference below](#struct-fields-attribute-reference)
- `strict` - bool - defaults to true, a strict parser errors if an unknown field is encountered by
  the generated parser. Persisted structs should set this to `false` to allow them to encounter
  documents from future versions of MongoDB without throwing an error.
- `chained_types` - mapping - undocumented
- `chained_structs` - mapping - a list of structs to include in this struct. IDL adds the chained
  structs as member variables in the generated C++ class. IDL also adds a getter for each chained
  struct.
- `inline_chained_structs` - bool - if true, exposes chained struct getters as members of this
  struct in generated code.
- `immutable` - bool - if true, does not generate mutable getters for structs
- `generate_comparison_operators` - bool - if true, generates support for C++ operators: `==`,
  `!=`, `<`, `>`, `<=`, `>=`
- `non_const_getter` - bool - if true, generates mutable getters for non-struct fields
- `cpp_validator_func` - string - name of a C++ function to call after a BSON document has been
  deserialized. Function has signature of `void <function_name>(<struct_name>* obj)`. Method is
  expected to throw a C++ exception (i.e. `uassert`) if validation fails.
- `is_command_reply` - bool - if true, marks the struct as a command reply. A struct marked
  `is_command_reply` generates a parser that ignores known generic or common fields across all
  replies when parsing replies (i.e. `ok`, `errmsg`, etc)
- `is_generic_cmd_list` - string - choice [`arg`, `reply`], if set, generates functions `bool
  hasField(StringData)` and `bool shouldForwardToShards(StringData)` for each field in the struct
- `query_shape_component` - bool - true indicates special serialization code will be generated
  to serialize as a query shape
- `unsafe_dangerous_disable_extra_field_duplicate_checks` - bool - undocumented, DO NOT USE

### Struct Fields Attribute Reference

- `description` - string - A comment to add to the generated C++
- `cpp_name` - string - Optional name to use for member variable and getters/setters. Defaults to
  `camelCase` of field name.
- `type` - string or mapping - supports a single type, `array<type>`, or variant. Can also be
  arrays.
  - string name of a type must be a `enum`, `type`, or `struct` that is defined in an IDL file or
    imported
  - string can also be `array<type>` where type must be a `enum`, `type`, `struct`, or `variant`.
    The C++ type will be `std::vector<type>` in this case
  - Mappings or Variants - IDL supports a variant that chooses among a set of IDL types. You can
    have a variant of strings and structs.
    - Variant string support differentiates the type to choose based on the BSON type.
    - Variant struct support differentiates the type to choose based on the _first_ field of the
      struct. The first field must be unique in each struct across the structs. When parsing a
      BSON object as a variant of multiple structs, the parser assumes that the first field
      declared in the IDL struct is always the first field in its BSON representation.
      See `bulkWrite` for an example.
- `ignore` - bool - true means the field generates no code but is ignored by the generated deserializer.
  Used to deprecate fields that no longer have an effect but allow strict parsers to ignore them.
- `optional` - bool - true means the field is optional. Generated C++ type is
  `boost::optional<type>`.
- `default` - string - the default value of type. Types with default values are not required to be
  found in the original document or set before serialization
- `supports_doc_sequence` - bool - true indicates the field can be found in an `OpMsg`'s document
  sequence. Must use the generated `<struct>::parse(OpMsgRequest)` parser to use this
- `comparison_order` - sequence - comparison order for fields
- `validator` - see [validator reference](#field-validator-reference)
- `non_const_getter` - bool - true indicates it generates a mutable getter
- `unstable` - bool - deprecated, prefer `stability` = `unstable` instead
- `stability` - string - choice [`unstable`, `stable`] - if `unstable`, parsing the field throws an
  error if strict API checking is enabled
- `always_serialize` - bool - whether to always serialize optional fields even if none
- `forward_to_shards` - bool - used by generic arg code to generate `shouldForwardToShards`, no
  effect on BSON deserialization/serialization
- `forward_from_shards` - bool - used by generic arg code to generate `shouldForwardFromShards`, no
  effect on BSON deserialization/serialization
- `query_shape` - choice of [`anonymize`, `literal`, `parameter`, `custom`] - see
  [src/mongo/db/query/query_shape.h]

### Field Validator Reference

Validators generate functions that ensure a value is valid when it is parsed or set via a setter.
Comparisons are generated with the corresponding C++ operators (a minimal sketch of such a generated
check follows the list).

- `gt` - string - Validates field is greater than `string`
- `lt` - string - Validates field is less than `string`
- `gte` - string - Validates field is greater than or equal to `string`
- `lte` - string - Validates field is less than or equal to `string`
- `callback` - string - A static function to call of the shape `Status <function_name>(const
  <cpp_type> value)`. For non-simple types, `value` is passed by const-reference.
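
A hand-written sketch of the kind of comparison check these attributes describe (not actual IDL output; the field, the bounds, and the `SimpleStatus` stand-in are invented for the example):

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Tiny stand-in for MongoDB's Status type, just for this sketch.
struct SimpleStatus {
    bool ok;
    std::string reason;
};

// What a validator with `gte: 0` and `lte: 100` on an int32 field conceptually
// does: compare with the C++ operators and fail with an explanatory message
// when the value is out of range.
SimpleStatus validatePercentage(std::int32_t value) {
    if (!(value >= 0)) {
        return {false, "value must be greater than or equal to 0"};
    }
    if (!(value <= 100)) {
        return {false, "value must be less than or equal to 100"};
    }
    return {true, ""};
}

int main() {
    std::cout << validatePercentage(42).ok << "\n";   // prints 1
    std::cout << validatePercentage(250).ok << "\n";  // prints 0
    return 0;
}
```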

## Commands

@@ -841,67 +832,63 @@ Commands can also specify their replies that they return. Replies are regular `s

### Commands Reference

- `description` - [see structs](#struct-reference)
- `chained_types` - [see structs](#struct-reference)
- `chained_structs` - [see structs](#struct-reference)
- `fields` - [see structs](#struct-reference)
- `cpp_name` - [see structs](#struct-reference)
- `strict` - [see structs](#struct-reference)
- `generate_comparison_operators` - [see structs](#struct-reference)
- `inline_chained_structs` - [see structs](#struct-reference)
- `immutable` - [see structs](#struct-reference)
- `non_const_getter` - [see structs](#struct-reference)
- `namespace` - string - choice of a string [`concatenate_with_db`, `concatenate_with_db_or_uuid`,
  `ignored`, `type`]. Instructs how the value of the command field should be parsed
  - `concatenate_with_db` - Indicates the command field is a string and should be treated as a
    collection name. Typically used by commands that deal with collections. Automatically
    concatenated with `$db` by the IDL parser. Adds a method `const NamespaceString getNamespace()`
    to the generated class.
  - `concatenate_with_db_or_uuid` - Indicates the command field is a string or uuid, and should be
    treated as a collection name. Typically used by commands that deal with collections.
    Automatically concatenated with `$db` by the IDL parser. Adds a method `const
    NamespaceStringOrUUID& getNamespaceOrUUID()` to the generated class.
  - `ignored` - Ignores the value of the command field. Used by commands that ignore their command
    argument entirely
  - `type` - Indicates the command takes a custom type for the first field. `type` field must be
    set.
- `type` - string - name of IDL type or struct to parse the command field as
- `command_name` - string - IDL generated parser expects the command to be named the name of the YAML
  map. This can be overwritten with `command_name`. Commands should be `camelCase`
- `command_alias` - string - allows commands to have multiple names. DO NOT USE. Some older commands
  have both `lowercase` and `camelCase` names.
- `reply_type` - string - IDL struct that this command replies with. Reply struct must have
  `is_command_reply` set
- `api_version` - string - Typically set to the empty string `""`. Only set to a non-empty string if
  command is part of the stable API. Generates a class name
  `<command_name>CommandNameCmdVersion1Gen` derived from `TypedCommand` that commands should be
  derived from.
- `is_deprecated` - bool - indicates command is deprecated
- `allow_global_collection_name` - bool - if true, command can accept both collection names and
  non-collection names. Used by the `aggregate` command
- `access_check` - mapping - see [access check reference](#access-check-reference)

### Access Check Reference

A list of privileges the command checks. Only applicable for commands that are a part of
API Version 1. Checked at runtime when test commands are enabled.

- `none` - bool - No privileges required
- `simple` - mapping - single [check or privilege](#check-or-privilege)
- `complex` - sequence - list of [check and/or privilege](#check-or-privilege)

#### Check or Privilege

- `check` - string - checks a part of the access control system like is_authenticated. See
  [`src/mongo/db/auth/access_checks.idl`](../db/auth/access_checks.idl) for a complete list.
- `privilege` - mapping
  - `resource_pattern` - string - a resource pattern to check for a given set of privileges. See
    `MatchType` enum in [`src/mongo/db/auth/action_type.idl`](../db/auth/action_type.idl) for
    complete list.
  - `action_type` - sequence - list of action types the command may check. See `ActionType` enum in
    [`src/mongo/db/auth/action_type.idl`](../db/auth/action_type.idl) for complete list.
- `agg_stage` - string - aggregation only. Name of aggregation stage. Used to appease the idl
  compatibility checker.
## IDL Compiler Overview

@@ -1003,20 +990,17 @@ members. If the struct is not a view, then objects of the type are guaranteed to

members. This is determined by recursively checking the fields of a struct. This info is used
during generation to determine whether or not a struct will need a `BSONObj` anchor.

## Best Practices

IDL has been in use since 2017. In that time, here are a few best practices:

1. strict or non-strict parsers - Structs that are persisted to disk should set `strict: false`.
   It's better for upgrade/downgrade. Commands should set `strict: true` or omit it as `strict:
   true` is the default.
   1. For persistence: For upgrade/downgrade, if a persisted document with a strict parser has a
      field added in new version N+1 and then the user downgrades to old version N, the strict
      parser will throw an exception and reject the document. If this document was part of the
      storage catalog, for instance, the server would fail to start.
   2. For commands: By using strict parsers, the server gains the ability to add fields without
      the risk of clients accidentally sending fields with the same name that had been ignored.
2. Extending existing structs/commands - all new fields in a struct/command must be marked optional
   to support backwards compatibility. For new structs/commands, there should be some required
   fields. It does not matter if the struct is not persisted, non-optional fields break backwards

@@ -1026,8 +1010,8 @@ IDL has been in use since 2017. In that time, here are a few best practices:

   that `BSONObj` when it is returned from its getter. This means that once the `BSONObj` that was
   passed to `parse()` goes out of scope, the object will point to freed memory. Use `object_owned`
   if this is not desired. `object_owned` incurs extra memory allocations though.
   1. An alternative is to use either the `parseSharingOwnership` or `parseOwned` methods. These
      methods will ensure the IDL-generated class has an anchor to the `BSONObj`. See comments in
      the generated class. It is not advisable though to use these methods during normal command
      request processing. The network buffer that holds the inbound request is available during the
      lifetime of the request even though IDL does not anchor the network buffer.
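
To make the ownership pitfall concrete, here is a small sketch. It is not generated code: `FooRequest` is a hypothetical IDL-generated struct with an `object`-typed field `payload`, and the exact generated signatures live in the corresponding `*_gen.h` header; only the `parse`/`parseSharingOwnership` entry points named above are assumed.

```c++
// Sketch only: `FooRequest` stands in for any IDL-generated struct with an
// `object`-typed field; it is not a real type in the tree.
#include "mongo/bson/bsonobj.h"
#include "mongo/bson/bsonobjbuilder.h"
#include "mongo/idl/idl_parser.h"

namespace mongo {

BSONObj danglingPayload() {
    BSONObj request = BSON("foo" << 1 << "payload" << BSON("x" << 1));
    auto parsed = FooRequest::parse(IDLParserContext("foo"), request);
    // getPayload() returns an unowned view into `request`. Once `request` is
    // destroyed at the end of this function, the returned BSONObj points at
    // freed memory - exactly the bug described above.
    return parsed.getPayload();
}

FooRequest anchoredRequest(const BSONObj& request) {
    // parseSharingOwnership (or parseOwned) makes the generated object keep an
    // anchor to the backing BSONObj, so views handed out by its getters remain
    // valid for as long as the returned object is alive.
    return FooRequest::parseSharingOwnership(IDLParserContext("foo"), request);
}

}  // namespace mongo
```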

@@ -1,5 +1,4 @@

# MongoDB TLA+/PlusCal Specifications

This folder contains formal specifications of various components in order to prove their
correctness. Some are experiments, some reflect MongoDB's actual implementation. See the

@@ -8,7 +8,7 @@ The basics of implementing a check are in the [clang docs](https://releases.llvm

#### Basic usage of the custom checks

The current directory contains the individual check source files, the main `MongoTidyModule.cpp` source file which registers the checks, and the SConscript responsible for building the check library module. The module will be installed into the DESTDIR, by default `build/install/lib/libmongo_tidy_checks.so`.

Our internal `buildscripts/clang_tidy.py` will automatically check this location and attempt to load the module if it exists. If it is installed to a non-default location, you will need to supply the `--check-module` argument with the location of the module.

@@ -28,7 +28,7 @@ Below is a checklist of all the steps to make sure to perform when writing a new

1. Implement the check in the respectively named `.h` and `.cpp` files.
2. Add the check's `#include` to the `MongoTidyModule.cpp`.
3. Register the check class with a check name in the `MongoTidyModule.cpp`.
4. Add the `.cpp` file to the source list in the `SConscript` file.
5. Write a unittest file named `tests/test_{CHECK_NAME}.cpp` which minimally reproduces the issue.
6. Add the test file to the list of test sources in `tests/SConscript`.

@@ -38,6 +38,3 @@ Below is a checklist of all the steps to make sure to perform when writing a new

#### Questions and Troubleshooting

If you have any questions, please reach out to the `#server-build-help` Slack channel.

@@ -13,6 +13,7 @@ during the simulation by registering _observer_ requests that are maintained in

queue.

More specifically:

1. The event queue waits until it has received an _event_ request from each reported actor thread.
2. When it is sufficiently full, the queue determines the minimum time of a queued _event_
   request, and advances the mock clock to that time.

@@ -38,6 +39,7 @@ name of the target _must_ end in `_simulator`.

A `*_simulator` executable simply runs all selected workloads and outputs logs to stdout.
These logs can be piped to `process_logs.py` to generate graphs. Ex.

```
$ throughput_probing_simulator | process_logs.py -o ~/sim_output
```

@@ -5,13 +5,14 @@ throughput probing algorithm used for execution control under a simulated worklo

output consists of graphs showing the actual and optimal ticket allocations over time.

The rough workflow for an individual simulation is:

1. Specify an optimal concurrency level, and the throughput at that concurrency level.
2. Provide a well-behaved model for what throughput we will observe at different concurrency
   levels, and what operation latencies would produce that throughput.
3. Use that model to specify a workload driver which simulates operations with the given
   latencies based on the current concurrency level selected by the throughput probing
   algorithm.
4. Run the mock workload for a specified duration.

## File Structure

@@ -19,4 +20,4 @@ New workloads can be added in [`workloads.cpp`](workloads.cpp)

The existing `TicketedWorkloadDriver` is quite flexible, and should suffice for most workloads by
simply tuning the workload characteristics, runtime, etc. New workload characteristic types can be
defined in [`workload_characteristics.h`](../workload_characteristics.h).

@@ -1,4 +1,5 @@

# Transport Internals

## Ingress Networking

Ingress networking refers to a server accepting incoming connections

@@ -7,6 +8,7 @@ server, issues commands, and receives responses. A server can be

configured to accept connections on several network endpoints.

### Session and ServiceEntryPoint

Once a client connection is accepted, a `Session` object is created to manage
it. This in turn is given to the `ServiceEntryPoint` singleton for the server.
The `ServiceEntryPoint` creates a `SessionWorkflow` for each `Session`, and

@@ -14,6 +16,7 @@ maintains a collection of these. The `Session` represents the client side of

that conversation, and the `ServiceEntryPoint` represents the server side.

### SessionWorkflow

While `Session` manages only the transfer of `Messages`, the `SessionWorkflow`
organizes these into a higher layer: the MongoDB protocol. It organizes `Session`
messages into simple request and response sequences represented internally as
|
|||
behavior is also managed by the `SessionWorkflow`.
|
||||
|
||||
### Builders
|
||||
In order to return the results to the user whether it be a document or a response
|
||||
code, MongoDB uses the [ReplyBuilderInterface]. This interface helps to build
|
||||
|
||||
In order to return the results to the user whether it be a document or a response
|
||||
code, MongoDB uses the [ReplyBuilderInterface]. This interface helps to build
|
||||
message bodies for replying to commands or even returning documents.
|
||||
|
||||
This builder interface includes a standard BodyBuilder that builds reply
|
||||
This builder interface includes a standard BodyBuilder that builds reply
|
||||
messages in serialized-BSON format.
|
||||
|
||||
A Document body builder ([DocSequenceBuilder]) is also defined to help build a
|
||||
A Document body builder ([DocSequenceBuilder]) is also defined to help build a
|
||||
reply that can be used to build a response centered around a document.
|
||||
|
||||
The various builders supplied in the `ReplyBuilderInterface` can be appended
|
||||
together to generate responses containing document bodies, error codes, and
|
||||
The various builders supplied in the `ReplyBuilderInterface` can be appended
|
||||
together to generate responses containing document bodies, error codes, and
|
||||
other appropriate response types.
|
||||
|
||||
This interface acts as a cursor to build a response message to be sent out back
|
||||
This interface acts as a cursor to build a response message to be sent out back
|
||||
to the client.
|
||||
|
||||
## See Also
|
||||
- For details on egress networking, see [Egress Networking][egress_networking].
|
||||
- For details on command dispatch, see [Command Dispatch][command_dispatch].
|
||||
- For details on *NetworkingBaton* and *AsioNetworkingBaton*, see [Baton][baton].
|
||||
- For more detail about `SessionWorkflow`, see WRITING-10398 (internal).
|
||||
|
||||
- For details on egress networking, see [Egress Networking][egress_networking].
|
||||
- For details on command dispatch, see [Command Dispatch][command_dispatch].
|
||||
- For details on _NetworkingBaton_ and _AsioNetworkingBaton_, see [Baton][baton].
|
||||
- For more detail about `SessionWorkflow`, see WRITING-10398 (internal).
|
||||
|
||||
[ServiceExecutor]: service_executor.h
|
||||
[SessionWorkflow]: session_workflow.h
|
||||
|
|
|
|||
|
|

@@ -5,40 +5,45 @@ server. This is a work in progress and more sections will be added gradually.

## Fail Points

For details on the server-internal _FailPoint_ pattern, see [this document][fail_points].

[fail_points]: ../../../docs/fail_points.md

## Cancellation Sources and Tokens

### Intro

When writing asynchronous code, we often schedule code or operations to run at some point in the future, in a different execution context. Sometimes, we want to cancel that scheduled work - to stop it from ever running if it hasn't yet run, and possibly to interrupt its execution if it is safe to do so. For example, in the MongoDB server, we might want to:

- Cancel work scheduled on executors
- Cancel asynchronous work chained as continuations on futures
- Write and use services that asynchronously perform cancelable work for consumers in the background

In the MongoDB server, we have two types that together make it easy to manage the cancellation of this sort of asynchronous work: CancellationSources and CancellationTokens.

### The CancellationSource and CancellationToken Types

A `CancellationSource` manages the cancellation state for some unit of asynchronous work. This unit of asynchronous work can consist of any number of cancelable operations that should all be canceled together, e.g. functions scheduled to run on an executor, continuations on futures, or operations run by a service implementing cancellation. A `CancellationSource` is constructed in an uncanceled state, and cancellation can be requested by calling the member function `CancellationSource::cancel()`.

A `CancellationSource` can be used to produce associated CancellationTokens with the member function `CancellationSource::token()`. These CancellationTokens can be passed to asynchronous operations to make them cancelable. By accepting a `CancellationToken` as a parameter, an asynchronous operation signals that it will attempt to cancel the operation when the `CancellationSource` associated with that `CancellationToken` has been canceled.

When passed a `CancellationToken`, asynchronous operations are able to handle the cancellation of the `CancellationSource` associated with that `CancellationToken` in two ways (a short sketch combining both follows the list below):

- The `CancellationToken::isCanceled()` member function can be used to check at any point in time if the `CancellationSource` the `CancellationToken` was obtained from has been canceled. The code implementing the asynchronous operation can therefore check the value of this member function at appropriate points and, if the `CancellationSource` has been canceled, refuse to run the work or stop running work if it is ongoing.

- The `CancellationToken::onCancel()` member function returns a `SemiFuture` that will be resolved successfully when the underlying `CancellationSource` has been canceled or resolved with an error if it is destructed before being canceled. Continuations can therefore be chained on this future that will run when the associated `CancellationSource` has been canceled. Importantly, because `CancellationToken::onCancel()` returns a `SemiFuture`, implementors of asynchronous operations must provide an execution context in which they want their chained continuation to run. Normally, this continuation should be scheduled to run on an executor, by passing one to `SemiFuture::thenRunOn()`.

- Alternatively, the continuation can be forced to run inline by transforming the `SemiFuture` into an inline future, by using `SemiFuture::unsafeToInlineFuture()`. This should be used very cautiously. When a continuation is chained to the `CancellationToken::onCancel()` future via `SemiFuture::unsafeToInlineFuture()`, the thread that calls `CancellationSource::cancel()` will be forced to run the continuation inline when it makes that call. Note that this means if a service chains many continuations in this way on `CancellationToken`s obtained from the same `CancellationSource`, then whatever thread calls `CancellationSource::cancel()` on that source will be forced to run all of those continuations, potentially blocking that thread from making further progress for a non-trivial amount of time. Do not use `SemiFuture::unsafeToInlineFuture()` in this way unless you are sure you can block the thread that cancels the underlying `CancellationSource` until cancellation is complete. Additionally, remember that because the `SemiFuture` returned by `CancellationToken::onCancel()` is resolved as soon as that `CancellationToken` is canceled, if you attempt to chain a continuation on that future when the `CancellationToken` has _already_ been canceled, that continuation will be ready to run right away. Ordinarily, this just means the continuation will immediately be scheduled on the provided executor, but if `SemiFuture::unsafeToInlineFuture` is used to force the continuation to run inline, it will run inline immediately, potentially leading to deadlocks if you're not careful.
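
Here is the promised sketch of both patterns. It is not code from the tree: `doOneUnitOfWork()` and `cleanUp()` are hypothetical helpers, and only the `CancellationSource`/`CancellationToken` and futures APIs described above are assumed.

```c++
// Pattern 1: poll the token at convenient points and bail out early.
void pollingWorker(CancellationToken token) {
    for (int i = 0; i < 1000 && !token.isCanceled(); ++i) {
        doOneUnitOfWork();  // hypothetical helper
    }
}

// Pattern 2: chain a continuation on onCancel(); it runs on `exec` when the
// associated CancellationSource is canceled.
void registerCancellationCleanup(ExecutorPtr exec, CancellationToken token) {
    token.onCancel().thenRunOn(exec).getAsync([](Status status) {
        if (status.isOK()) {  // OK means the source was actually canceled
            cleanUp();        // hypothetical helper
        }  // an error status means the source was destroyed without being canceled
    });
}

// The owner of the work holds the source and decides when to cancel.
void ownerSide(ExecutorPtr exec) {
    CancellationSource source;
    registerCancellationCleanup(exec, source.token());
    // ... kick off pollingWorker(source.token()) on some thread or executor ...
    source.cancel();  // isCanceled() becomes true; the onCancel() future resolves
}
```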

### Example of a Service Performing Cancelable, Asynchronous Work

We'll use the WaitForMajorityService as an example of how work can be scheduled and cancelled using CancellationSources and CancellationTokens. First, we'll see how a consumer might use a service implementing cancellation. Then, we'll see how a service might implement cancellation internally.

#### Using a Cancelable Service

The WaitForMajorityService allows consumers to asynchronously wait until an `opTime` is majority committed. Consumers can request a wait on a specific `opTime` by calling the function `WaitForMajorityService::waitUntilMajority(OpTime opTime, CancellationToken token)`. This call will return a future that will be resolved when that `opTime` has been majority committed or otherwise set with an error.

In some cases, though, a consumer might realize that it no longer wants to wait on an `opTime`. For example, the consumer might be going through shut-down, and it needs to clean up all of its resources cleanly right away. Or, it might just realize that the `opTime` is no longer relevant, and it would like to tell the `WaitForMajorityService` that it no longer needs to wait on it, so the `WaitForMajorityService` can conserve its own resources.

Consumers can easily cancel existing requests to wait on `opTime`s in situations like these by making use of the `CancellationToken` argument accepted by `WaitForMajorityService::waitUntilMajority`. For any `opTime` waits that should be cancelled together, they simply pass `CancellationTokens` from the same `CancellationSource` into the requests to wait on those `opTimes`:
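
The code block that originally follows this sentence falls inside an unchanged hunk and is not reproduced in this diff. A rough sketch of what such calls look like, assuming the `waitUntilMajority(OpTime, CancellationToken)` signature above and a decoration-style `WaitForMajorityService::get(serviceContext)` accessor (an assumption, not copied from the tree), might be:

```c++
// Sketch only: accessor spelling and surrounding setup are assumptions.
CancellationSource source;

auto opTimeFuture = WaitForMajorityService::get(serviceContext)
                        .waitUntilMajority(opTime, source.token());
auto opTimeFuture2 = WaitForMajorityService::get(serviceContext)
                        .waitUntilMajority(opTime2, source.token());
```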

@@ -53,14 +58,16 @@ And whenever they want to cancel the waits on those `opTime`s, they can simply c

```c++
source.cancel()
```

After this call, the `WaitForMajorityService` will stop waiting on `opTime` and `opTime2`. And the futures returned by all calls to `waitUntilMajority` that were passed `CancellationToken`s from the `CancellationSource source` (in this case, `opTimeFuture` and `opTimeFuture2`) will be resolved with `ErrorCodes::CallbackCanceled`.

#### Implementing a Cancelable Service

Now we'll see how `WaitForMajorityService` might be implemented, at a high level, to support the cancelable API we saw in the last section. The `WaitForMajorityService` will need to ensure that calls to `WaitForMajorityService::waitUntilMajority(opTime, token)` schedule work to wait until `opTime` is majority committed, and that this scheduled work will stop if `token` has been canceled. To do so, it can use either of the functions `CancellationToken` has that expose the underlying cancellation state: it can either call `CancellationToken::isCanceled()` at some appropriate time on `token` and conditionally stop waiting on `opTime`, or it can chain a continuation onto the future returned by `CancellationToken::onCancel()` that will stop the wait. This continuation will run when the `token` is canceled, as cancellation resolves the future returned by `CancellationToken::onCancel()`.

To keep the example general, we're going to elide some details: for now, assume that calling `stopWaiting(opTime)` performs all the work needed for the `WaitForMajorityService` to stop waiting on `opTime`. Additionally, assume that `opTimePromise` is the promise that resolves the future returned by the call to `waitUntilMajority(opTime, token)`. Then, to implement cancellation for some request to wait on `opTime` with an associated token `token`, the `WaitForMajorityService` can add something like the following code to the function it uses to accept requests:

```c++
SemiFuture<void> WaitForMajorityService::waitUntilMajority(OpTime opTime, CancellationToken token) {
    // ... Create request that will be processed by a background thread

@@ -72,19 +79,22 @@ SemiFuture<void> WaitForMajorityService::waitUntilMajority(OpTime opTime, Cancel

    });
}
```

Whenever `token` is canceled, the continuation above will run, which will stop the `WaitForMajorityService` from waiting on `opTime`, and resolve the future originally returned from the call to `waitUntilMajority(opTime, token)` with an error. There's just one more detail -- we don't want the cancellation and ordinary completion of the work to race. If `token` is canceled _after_ we've finished waiting for `opTime` to be majority committed, there's no work to cancel, and we can't set `opTimePromise` twice! To fix this, we can simply protect `opTimePromise` with an atomic, ensuring that it will be set exactly once. Then, before we either perform the cancellation work or fulfill the promise by ordinary means, we can use the atomic to check that the promise has not already been completed. This is the gist of what it takes to write a service performing cancelable work! To see the full details of making a cancelable `WaitForMajorityService`, see this [commit](https://github.com/mongodb/mongo/commit/4fa2fcb16107c860448b58cd66798bae140e7263).

### Cancellation Hierarchies

In the above example, we saw how we can use a single `CancellationSource` and associated tokens to cancel work. This works well when we can associate a specific group of asynchronous tasks with a single `CancellationSource`. Sometimes, we may want sub-tasks of a larger operation to be individually cancelable, but also to be able to cancel _all_ tasks associated with the larger operation at once. CancellationSources can be managed hierarchically to make this sort of situation manageable. A hierarchy between CancellationSources is created by passing a `CancellationToken` associated with one `CancellationSource` to the constructor of a newly-created `CancellationSource` as follows:

```c++
CancellationSource parentSource;
CancellationSource childSource(parentSource.token());
```

As the naming suggests, we say that the `parentSource` and `childSource` `CancellationSources` are in a hierarchy, with `parentSource` higher up in the hierarchy. When a `CancellationSource` higher up in a cancellation hierarchy is canceled, all descendant `CancellationSources` are automatically canceled as well. Conversely, the `CancellationSources` lower down in a cancellation hierarchy can be canceled without affecting any other `CancellationSources` higher up or at the same level of the hierarchy.

As an example of this sort of hierarchy of cancelable operations, let's consider the case of [hedged reads](https://docs.mongodb.com/manual/core/read-preference-hedge-option/). Note that hedged reads do not currently use `CancellationTokens` in their implementation; this is for example purposes only. When a query is specified with the "hedged read" option on a sharded cluster, mongos will route the read to two replica set members for each shard targeted by the query and return the first response it receives per-shard. Therefore, as soon as it receives a response from one replica set member on a shard, it can cancel the other request it made to a different member on the same shard. We can use the pattern discussed above to do this sort of cancellation. At a high level, the code might look something like this:

```c++
// Assuming we already have a CancellationSource called hedgedReadSource for the entire
// hedged-read operation, we create child CancellationSources used to manage the cancellation

@@ -98,10 +108,10 @@ auto readTwoFuture = routeRequestToHost(query, host2, hostTwoSource.token());

// whenAny(F1, F2, ..., FN) takes a list of N future types and returns a future
// that is resolved as soon as any of the input futures are ready. The value of
// the returned future is a struct containing the result of the future that resolved
// as well as its index in the input list.
auto firstResponse = whenAny(std::move(readOneFuture), std::move(readTwoFuture)).then(
    [hostOneSource = std::move(hostOneSource), hostTwoSource = std::move(hostTwoSource)]
    (auto readResultAndIndex)
    {
        if (readResultAndIndex.result.isOK()) {

@@ -116,6 +126,7 @@ auto firstResponse = whenAny(std::move(readOneFuture), std::move(readTwoFuture))

        }
    });
```

We can see the utility of the hierarchy of `CancellationSources` by examining the case where the client indicates that it would like to kill the entire hedged read operation. Rather than having to track every `CancellationSource` used to manage different requests performed throughout the operation, we can call

```c++

@@ -127,40 +138,49 @@ and all of the operations taking place as a part of the hedged read will be canc

There's also a performance benefit to cancellation hierarchies: since the request to each host is only a part of the work performed by the larger hedged-read operation, at least one request will complete well before the entire operation does. Since all of the cancellation callback state for work done by, say, the request to `host1`, is owned by `hostOneSource`, rather than the parent `hedgedReadSource`, it can independently be cleaned up and the relevant memory freed before the entire hedged read operation is complete. For more details, see the comment for the constructor `CancellationSource(const CancellationToken& token)` in [cancellation.h](https://github.com/mongodb/mongo/blob/99d28dd184ada37720d0dae1f3d8c35fec85bd4b/src/mongo/util/cancellation.h#L216-L229).

### Integration With Future Types

`CancellationSources` and `CancellationTokens` integrate neatly with the variety of `Future` types to make it easy to cancel work chained onto `Future` continuations.

#### ExecutorFutures

Integration with `ExecutorFutures` is provided primarily by the `CancelableExecutor` type. If you have some work that you'd like to run on an Executor `exec`, but want to cancel that work if a `CancellationToken` `token` is canceled, you can simply use `CancelableExecutor::make(exec, token)` to get an executor that will run work on `exec` only if `token` has not been canceled when that work is ready to run. As an example, take the following code snippet:

```c++
ExecutorFuture(exec).then([] { doThing1(); })
    .thenRunOn(CancelableExecutor::make(exec, token))
    .then([] { doThing2(); })
    .thenRunOn(exec)
    .then([] { doThing3(); })
    .onError([](Status) { doThing4(); })
```

In this example, `doThing1()` will run on the executor `exec`; when it has completed, `doThing2()` will run on `exec` only if `token` has not yet been canceled. If `token` wasn't canceled, `doThing3()` will run after `doThing2()`; if `token` was canceled, then the error `CallbackCanceled` will be propagated down the continuation chain until a continuation and executor that accept the error are found (in this case, `doThing4()` will run on `exec`).

#### Future, SemiFuture, and SharedSemiFuture

The primary interface for waiting on futures in a cancelable manner is the free function template:

```c++
template <typename FutureT, typename Value = typename FutureT::value_type>
SemiFuture<Value> future_util::withCancellation(FutureT&& f, const CancellationToken& token);
```

Note that this function also works with executor futures. This function returns a SemiFuture\<T\> that is resolved when either the input future `f` is resolved or the input `CancellationToken token` is canceled - whichever comes first. The returned `SemiFuture` is set with the result of the input future when it resolves first, and with an `ErrorCodes::CallbackCanceled` status if cancellation occurs first.

For example, if we have a `Future<Request> requestFuture` and `CancellationToken token`, and we want to do some work when _either_ `requestFuture` is resolved _or_ `token` is canceled, we can simply do the following:

```c++
future_util::withCancellation(requestFuture, token)
    .then([](Request r) { /* requestFuture was fulfilled; handle it */ })
    .onError<ErrorCodes::CallbackCanceled>([](Status s) { /* handle cancellation */ })
    .onError([](Status s) { /* handle other errors */ })
```

### Links to Relevant Code + Example Tests

- [CancellationSource/CancellationToken implementations](https://github.com/mongodb/mongo/blob/master/src/mongo/util/cancellation.h)
- [CancellationSource/CancellationToken unit tests](https://github.com/mongodb/mongo/blob/master/src/mongo/util/cancellation_test.cpp)
- [CancelableExecutor implementation](https://github.com/mongodb/mongo/blob/master/src/mongo/executor/cancelable_executor.h)
- [CancelableExecutor unit tests](https://github.com/mongodb/mongo/blob/master/src/mongo/executor/cancelable_executor_test.cpp)
- [future_util::withCancellation implementation](https://github.com/mongodb/mongo/blob/99d28dd184ada37720d0dae1f3d8c35fec85bd4b/src/mongo/util/future_util.h#L658)
- [future_util::withCancellation unit tests](https://github.com/mongodb/mongo/blob/99d28dd184ada37720d0dae1f3d8c35fec85bd4b/src/mongo/util/future_util_test.cpp#L1268-L1343)

@@ -18,11 +18,11 @@ an immutable container. Otherwise, a standard container may make more sense.

The currently supported containers are all based on classes from the
[`immer`](https://sinusoid.es/immer/) library.

- [`immutable::map`](map.h): ordered map interface backed by `immer::flex_vector`
- [`immutable::set`](set.h): ordered set interface backed by `immer::flex_vector`
- [`immutable::unordered_map`](unordered_map.h): typedef for `immer::map`
- [`immutable::unordered_set`](unordered_set.h): typedef for `immer::set`
- [`immutable::vector`](vector.h): typedef for `immer::vector`

Both ordered and unordered map and set variants support heterogeneous lookup.
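
To illustrate the value semantics these wrappers inherit, here is a small standalone sketch using the upstream `immer` types directly (not the `immutable::` wrappers, whose exact interfaces live in the headers linked above): every update returns a new container and leaves the old one untouched.

```c++
// Standalone sketch against the upstream immer library, not MongoDB code.
#include <immer/map.h>

#include <cassert>

int main() {
    const immer::map<int, int> v0;  // empty, immutable
    const auto v1 = v0.set(1, 10);  // returns a *new* map; v0 is unchanged
    const auto v2 = v1.set(1, 20);  // structural sharing keeps copies cheap

    assert(v0.count(1) == 0);
    assert(*v1.find(1) == 10);
    assert(*v2.find(1) == 20);
    return 0;
}
```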

@@ -2,40 +2,40 @@

## Table Of Contents

- [High Level Overview](#high-level-overview)
- [ASIO](#asio)
- [FIPS Mode](#fips-mode)
- [Authentication & Authorization](#authentication--authorization)
- [The Transport Layer](#the-transport-layer)
- [SNI](#sni)
- [Protocol Theory](#protocol-theory)
- [The TLS Handshake](#the-tls-handshake)
- [Ciphers](#ciphers)
- [X.509](#x509)
- [Certificate Authorities](#certificate-authorities)
- [Certificate Metadata](#certificate-metadata)
- [Certificate Expiration and Revocation](#certificate-expiration-and-revocation)
- [Member Certificates](#member-certificates)

## High Level Overview

TLS stands for **Transport Layer Security**. It used to be known as **SSL (Secure Sockets Layer)**, but was renamed to
TLS in 1999. TLS is a specification for the secure transport of data over a network connection using both
[**symmetric**](https://en.wikipedia.org/wiki/Symmetric-key_algorithm) and
[**asymmetric**](https://en.wikipedia.org/wiki/Public-key_cryptography) key cryptography, most commonly with certificates.
There is no "official" implementation of TLS, not even from the [RFC](https://tools.ietf.org/html/rfc5246) authors.
There are, however, several different implementations commonly used in different environments. MongoDB uses three different
implementations of TLS:

1. [OpenSSL](https://www.openssl.org/docs/) on _Linux_
   - OpenSSL is also available on MacOS and Windows, but we do not officially support those configurations anymore.
2. [SChannel](https://docs.microsoft.com/en-us/windows-server/security/tls/tls-ssl-schannel-ssp-overview) which is made
   by Microsoft and is available exclusively on _Windows_.
3. [Secure Transport](https://developer.apple.com/documentation/security/secure_transport) which is made by Apple and is
   available exclusively on _MacOS_.

We manage TLS through an interface called
[`SSLManagerInterface`](https://github.com/mongodb/mongo/blob/master/src/mongo/util/net/ssl_manager.h#L181). There are
three different implementations of this interface for each implementation of TLS respectively:

1. [`SSLManagerOpenSSL`](ssl_manager_openssl.cpp)

@@ -44,15 +44,15 @@ three different implementations of this interface for each implementation of TLS

Every SSLManager has a set of key methods that describe the general idea of establishing a connection over TLS:

- [`connect`](https://github.com/mongodb/mongo/blob/master/src/mongo/util/net/ssl_manager.h#L193): initiates a TLS connection
- [`accept`](https://github.com/mongodb/mongo/blob/master/src/mongo/util/net/ssl_manager.h#L201): waits for a peer to
  initiate a TLS connection
- [`parseAndValidatePeerCertificate`](https://github.com/mongodb/mongo/blob/master/src/mongo/util/net/ssl_manager.h#L268):
  parses a certificate acquired from a peer during connection negotiation and validates it.

The SSLManagers are wrappers around these TLS implementations such that they conform to our practices and standards, and
so that all interaction with them can be done through a standard interface. Note that `connect` and `accept`
are only for the _synchronous_ code paths used by the transport layer. For the _asynchronous_ paths used by the
transport layer, we use **ASIO**.

Also note that we used to do synchronous networking through the [`mongo::Socket`](sock.h) codepath; however, this has

@@ -60,65 +60,65 @@ been deprecated and will eventually be removed.

### ASIO

We use a third-party asynchronous networking library called [**ASIO**](https://think-async.com/Asio/) to accept and
establish connections, as well as read and write wire protocol messages. ASIO's
[`GenericSocket`](https://www.boost.org/doc/libs/1_66_0/doc/html/boost_asio/reference/generic__stream_protocol/socket.html) object is used to
read and write from sockets. We do this to handle an interesting quirk about MongoDB's use of TLS. Typically, services
will listen for cleartext connections on one port, and TLS connections on another port. MongoDB, however, only listens
on one port. Since we give administrators the ability to upgrade clusters to TLS, we need to use the same port to avoid
having to edit the replset config. Because the transport layer has to be ready to receive TLS or cleartext traffic, it
has to be able to determine on the fly what it is speaking by inspecting the first message it receives from an incoming
connection to determine if it needs to speak cleartext or TLS. If the transport layer believes it is speaking TLS,
[it will create a new `asio::ssl::stream<GenericSocket>` object which wraps the underlying physical connection's `GenericSocket` object](../../transport/asio/asio_session.h).
The outer `Socket` accepts requests to perform reads and writes, but encrypts/decrypts data before/after interacting
with the physical socket. ASIO also provides the
[implementation](https://github.com/mongodb/mongo/blob/master/src/mongo/transport/asio/asio_session.h#L75)
for the TLS socket for OpenSSL, but we provide [our own implementation](ssl) for SChannel and Secure Transport.

### FIPS Mode

[`tlsFIPSMode`](https://docs.mongodb.com/manual/reference/program/mongod/#cmdoption-mongod-tlsfipsmode)
is a configuration option that allows TLS to operate in [**FIPS** mode](https://www.openssl.org/docs/fips.html),
meaning that all cryptography must be done on a FIPS-140 certified device (physical or virtual). As such, `tlsFIPSMode`
can only be enabled if a FIPS-compliant library is installed.

### Authentication & Authorization

MongoDB can use the X.509 certificates presented by clients during the TLS handshake for
[**Authentication**](https://docs.mongodb.com/manual/tutorial/configure-x509-client-authentication/) and **Authorization**,
although X.509 authorization is uncommon. X.509 authorization can be done by embedding roles
into certificates along with the public key and certificate metadata. Internal authorization, using privilege documents
stored in the database, is more commonly used for acquiring access rights, and X.509 authorization is not a typical
use case of certificates in general. We do not provide any utility for generating these unique certificates, but the
logic for parsing them is in
[`parsePeerRoles`](https://github.com/mongodb/mongo/blob/master/src/mongo/util/net/ssl_manager.h#L294).

X.509 certificates _are_ commonly used for _authentication_.
[`SSLManager::parseAndValidatePeerCertificate`](https://github.com/mongodb/mongo/blob/master/src/mongo/util/net/ssl_manager.h#L268)
extracts information required for authentication, such as the client name, during connection negotiation. We use
TLS for authentication, as being able to successfully perform a cryptographic handshake after key exchange should be as
sufficient proof of authenticity as providing a username and password. This idea is the same as using one's RSA public
key to authenticate to an SSH server instead of having to type a password. When a server is configured to authenticate
with TLS (using the
[**MONGODB-X509**](https://docs.mongodb.com/manual/reference/program/mongo/#cmdoption-mongo-authenticationmechanism)
mechanism), it receives a certificate from the client during the initial TLS handshake, extracts the subject name, which
MongoDB will consider a username, and performs the cryptographic handshake. If the handshake succeeds, then the server
will note that the client proved ownership of the presented certificate. The authentication logic happens in the
[`authX509`](https://github.com/mongodb/mongo/blob/master/src/mongo/client/authenticate.cpp#L127) function, although
there are many callers of this function that use it in different ways. Later, when that client tries to authenticate,
the server will know that the previous TLS handshake has proved their authenticity, and will grant them the appropriate
access rights.

### The Transport Layer

The _transport layer_ calls appropriate TLS functions inside of its own related functions.
[`TransportLayerManager::connect`](https://github.com/mongodb/mongo/blob/master/src/mongo/transport/transport_layer_manager.h#L66)
will make a call to
[`AsioSession::handshakeSSLForEgress`](https://github.com/mongodb/mongo/blob/master/src/mongo/transport/asio_transport_layer.cpp#L496)
when it needs to speak TLS. This works the same for
[`asyncConnect`](https://github.com/mongodb/mongo/blob/master/src/mongo/transport/transport_layer.h#L85) as well.
[`TransportLayerManager`](../../transport/transport_layer_manager.h) provides these and
[`getReactor`](https://github.com/mongodb/mongo/blob/master/src/mongo/transport/transport_layer_manager.cpp#L78) as
wrappers around synonymous functions in whichever [`SSLManager`](https://github.com/mongodb/mongo/blob/master/src/mongo/util/net/ssl_manager.h#L181)
the server is using.
MongoDB uses a TLS feature called [**Server Name Indication (SNI)**](https://www.cloudflare.com/learning/ssl/what-is-sni/)
to transport server names over the network. The shell will use SNI to communicate what it believes the server's name is
when it initiates a connection to a server. SNI is a small, _optional_ parameter of a TLS handshake packet where a client
can advertise to a host what the client believes the host's name to be. SNI is normally used for hosts to know which
certificate to use for an incoming connection when multiple domains are hosted at the same IP address. MongoDB also uses
it often for communication between nodes in a replica set. TLS requires a client to initiate a handshake with a server,
but _in replica sets, two servers have to communicate with each other._ Because of this, _one server will take the role
of a client_ and initiate the handshake to another server. In
[**Split Horizon** routing](https://en.wikipedia.org/wiki/Split_horizon_route_advertisement), nodes will use SNI to
figure out which horizon they are communicating on. Most cloud providers will allocate private, _internal network_ IP
addresses that nodes can use to communicate with each other without reaching out to the internet, but each individual
node will also have a public IP at which remote clients can reach it. This is useful, for example, on deployments
that have nodes across multiple cloud providers or across different regions that would not be able to exist on the same
private network (horizon). When contacted by another node or client, a server will be able to tell the horizon on which
it is communicating based on the name placed in the SNI field.

SNI also has the unfortunate property of being sent in the packet that _initiates_ a TLS connection. As a result,
_SNI cannot be encrypted_: if SNI is used, anyone who intercepts your packets can know which host, specifically, you are
trying to reach.
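
To make this concrete, the SNI value a client sends can be set and observed with `openssl s_client`. This is a minimal
sketch: the hostnames are hypothetical, it assumes a TLS-enabled `mongod` listening on port 27017, and it requires a
reasonably recent OpenSSL (1.1.0+ for `-brief`).

```bash
# Connect to a node's public address, but advertise a different (e.g. internal
# horizon) name in the SNI field. The server can use that name to decide which
# horizon the connection belongs to and which certificate to present.
openssl s_client \
    -connect mongo0.public.example.com:27017 \
    -servername mongo0.internal.example.com \
    -brief </dev/null

# The ClientHello is sent before any keys are negotiated, so the SNI value above
# travels in cleartext and is visible to anyone capturing that first packet.
```
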
## Protocol Theory

TLS works by having two peers with a _client/server relationship_ perform a handshake wherein they agree upon how
messages will be encrypted and decrypted as they are sent between parties. Once the handshake completes
successfully, encrypted data will be sent across a regular transport protocol and then decrypted once it arrives on the
other side. For example, in the popular HTTPS protocol, a TLS handshake is negotiated, and then encrypted HTTP packets
are formed into TLS records before being sent across the wire and decrypted on the other side.

### The TLS Handshake

As defined by TLS's specification, a TLS handshake will always consist of the following steps (a sketch of observing a
handshake follows the list):

1. The client connects to a TLS-enabled server and presents a _list of supported **cipher suites**_ and supported
   _TLS versions_. This step is called the **Client Hello**.
2. The server will then _pick a cipher suite_ that it supports and send it to the client, along with its certificate.
   This is called the **Server Hello**. The server's certificate will contain the server's _public key_, the server's
   _name_, and the _signature of a trusted Certificate Authority_.
3. The client will then **validate** the server certificate's signature against the CA's public key, which is stored
   _on disk_ as a single certificate or as part of a certificate store. The client will also ensure that the server's
   certificate corresponds to the host to which it is trying to connect.
4. If the certificate is valid, then the client and server will perform a **key exchange** to generate a unique, random
   **session key**. This key is used for both encryption and decryption.
5. Using the session key, encrypted data is sent across the network.
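
As a sketch of steps 1-4 (hypothetical hostname and CA file; `openssl s_client` drives the Client Hello, and `-brief`
prints the negotiated protocol version, cipher suite, and the result of certificate verification):

```bash
# Initiate a handshake against a TLS-enabled mongod, validating the server's
# certificate against the given CA certificate.
openssl s_client \
    -connect db0.example.com:27017 \
    -CAfile ca.pem \
    -brief </dev/null
```
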
### Ciphers

A **cipher** is an algorithm or function for turning cleartext data into an encrypted form.

A [**cipher suite**](https://en.wikipedia.org/wiki/Cipher_suite) is a named collection of algorithms to be used in a TLS
conversation. It contains, at the minimum, a _symmetric cipher_ for bulk encryption, and a _hash algorithm_. For TLS
versions older than 1.3, cipher suites also included a _key exchange algorithm_ for generating a session key, and an
_asymmetric signature algorithm_. A **forward-secret cipher suite** is a type of cipher suite that can be used to
produce and exchange a forward-secret session key. **Forward-secrecy** means that even if a party's private key is
compromised, previous sessions cannot be decrypted without knowing that session's key.
Any ciphers used by TLS 1.3 or newer have to be forward-secret ciphers. MongoDB supports two types of
key-exchange algorithms that both provide forward-secrecy:

1. [Ephemeral Elliptic Curve Diffie-Hellman (ECDHE)](https://en.wikipedia.org/wiki/Elliptic-curve_Diffie–Hellman)
2. [Ephemeral Diffie-Hellman (DHE)](https://en.wikipedia.org/wiki/Diffie–Hellman_key_exchange)

If a client advertises both of these types of cipher suites, _the server will prefer ECDHE by default_, although this
can be re-configured. This is a property of the TLS libraries we use that we chose not to override.

One important requirement of a forward-secret system is that the forward-secret key is _never transmitted over the
network,_ meaning that the client and server both have to derive the same forward-secret key independently.
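
For a quick look at which forward-secret suites a given OpenSSL build offers, the `openssl ciphers` command can be used.
This is a sketch; the exact output depends on the OpenSSL version, and `-s -tls1_3` requires OpenSSL 1.1.1 or newer.

```bash
# List the ECDHE- and DHE-based suites, with their key-exchange (Kx),
# authentication (Au), encryption (Enc), and MAC columns.
openssl ciphers -v 'ECDHE:DHE'

# TLS 1.3 suites are forward-secret by construction and are listed separately.
openssl ciphers -v -s -tls1_3
```
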
## X.509

X.509 certificates contain a _public key,_ an _identity,_ and a _signature,_ usually from a trusted Certificate
Authority.

### Certificate Authorities

A **Certificate Authority (CA)** is a trusted third party used to definitively bind names to public keys. When a service
wants to be verified by a Certificate Authority, it will submit a **Certificate Signing Request (CSR)**, containing the
service's identity and public key, to the Certificate Authority of its choosing. The CA will then cryptographically
"sign" the certificate using its private key. When a MongoDB client receives a certificate, it will validate it using a
CA certificate containing the CA's _public_ key. mongod, mongos, and mongo must be
[configured](https://docs.mongodb.com/manual/tutorial/configure-ssl/#mongod-and-mongos-certificate-key-file) to use a CA
certificate **key file**, or the **system certificate store,** which is a keychain of trusted CA public keys common to
many systems, services, and browsers. The keychain's set of public keys is typically refreshed via
software updates. MongoDB clients can use the system keychain if a path to a CA certificate is not provided. This is used
most commonly by clients connecting to _Atlas clusters_. Certificate validation via a key file or the system keychain
_does not require use of the network_.
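
The CSR workflow described above can be sketched with `openssl`. All file names and subject values here are
hypothetical, and a real deployment would keep the CA key well away from the services it signs for.

```bash
# 1. The CA generates a key pair and a self-signed CA certificate; the CA
#    certificate (public key) is what clients use for validation.
openssl req -new -x509 -days 365 -nodes \
    -subj "/O=ExampleOrg/CN=Example Root CA" \
    -keyout ca.key -out ca.pem

# 2. The service generates its own key pair and a CSR containing its identity
#    and public key.
openssl req -new -nodes \
    -subj "/O=ExampleOrg/CN=db0.example.com" \
    -keyout server.key -out server.csr

# 3. The CA "signs" the CSR with its private key, producing the service's
#    certificate.
openssl x509 -req -days 365 \
    -in server.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
    -out server.crt
```

A `mongod` would then be pointed at `ca.pem` for validating its peers, and at a PEM file containing both `server.crt`
and `server.key` as its certificate key file.
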
### Certificate Metadata

Certificates are used to attach an identifier to a public key. There are two main types of identifiers that can be
embedded in certificates: Common Name (CN) and Subject Alternative Name (SAN).

A **Subject Alternative Name (SAN)** allows DNS names or other identifying information to be embedded into a certificate.
This information is useful for contacting administrators of the domain, or finding other information about the domain. A
SAN can include any combination of the following:

- DNS names
- Email addresses
- IP addresses
- URIs
- Directory Names (DNs)
- General (arbitrary) names

A **Common Name (CN)** is allowed to _only_ contain hostnames.

Because of the richness of the information in a SAN, it is much preferred over the CN as an identifier. If a SAN is not
provided in a certificate, the client will use the CN for validation; however, this is not ideal, as using the CN for
host validation has been deprecated [since May 2000](https://tools.ietf.org/html/rfc2818#section-3.1).
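
As an illustration, a certificate request carrying SAN entries can be produced, and an existing certificate's
identifiers inspected, with `openssl` (hypothetical names; `-addext` and `-ext` assume OpenSSL 1.1.1 or newer):

```bash
# Request a certificate whose SAN carries a DNS name and an IP address in
# addition to the CN in the subject.
openssl req -new -nodes \
    -subj "/O=ExampleOrg/CN=db0.example.com" \
    -addext "subjectAltName=DNS:db0.example.com,IP:203.0.113.10" \
    -keyout server.key -out server.csr

# Print the subject (which contains the CN) and the SAN of an existing
# certificate.
openssl x509 -in server.pem -noout -subject -ext subjectAltName
```
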
### Certificate Expiration and Revocation

All certificates can be set to expire on an exact date. When a certificate expires, the service has to renew it, ideally
before the expiration date.
Signatures can also be manually **revoked** by the CA before the expiration date if the certificate's private key
is compromised. Traditionally, this is done through a **Certificate Revocation List (CRL)**. This is a list, stored
on disk, of certificates that have been revoked by a CA. These lists have to be re-downloaded periodically in order to
know which certificates have recently been revoked. The alternative to this is the
[**Online Certificate Status Protocol (OCSP)**](https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol), which
allows certificate revocation to be checked online. OCSP is
[supported by MongoDB](https://docs.mongodb.com/manual/core/security-transport-encryption/#ocsp-online-certificate-status-protocol).
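
Two quick checks that correspond to these mechanisms (file and host names are hypothetical):

```bash
# Print the expiration date of a certificate on disk.
openssl x509 -in server.pem -noout -enddate

# Ask the server to staple an OCSP response during the handshake; if the server
# does so, the stapled response printed in the output indicates whether the
# certificate has been revoked.
openssl s_client -connect db0.example.com:27017 -status </dev/null
```
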
### Member Certificates

MongoDB uses two types of certificates:

- **Client certificates**, which are used by clients to authenticate to a server.
- **Member certificates**, which are used by members of a _sharded cluster_ or _replica set_ to authenticate to each other.

Member certificates are specified with
[`net.tls.clusterFile`](https://docs.mongodb.com/manual/reference/configuration-options/#net.tls.clusterFile). This
parameter points to a certificate key file (usually in .pem format) which contains both a certificate and a private key.
All member certificates in a cluster _must be issued by the same CA._ Members of a cluster will present these
certificates on outbound connections to other nodes and authenticate to each other in the same way that a client would
authenticate to them. If `net.tls.clusterFile` is not specified, then
[`net.tls.certificateKeyFile`](https://docs.mongodb.com/manual/reference/configuration-options/#net.tls.certificateKeyFile)
will be used.
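
As a sketch, a replica-set member using a dedicated member certificate might be started as follows. The paths are
hypothetical, and the `--tls*` command-line options used here mirror the corresponding `net.tls.*` configuration
settings.

```bash
# --tlsCAFile:             CA certificate used to validate peer certificates;
#                          every member certificate must be issued by this CA.
# --tlsCertificateKeyFile: certificate + private key presented to clients.
# --tlsClusterFile:        certificate + private key presented on outbound
#                          connections to other members.
mongod --replSet rs0 \
    --tlsMode requireTLS \
    --clusterAuthMode x509 \
    --tlsCAFile /etc/ssl/mongodb/ca.pem \
    --tlsCertificateKeyFile /etc/ssl/mongodb/server.pem \
    --tlsClusterFile /etc/ssl/mongodb/member.pem
```

If `--tlsClusterFile` were omitted, the node would present its `--tlsCertificateKeyFile` certificate to other members as
well.
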
By default, nodes will only consider a peer certificate to be a member certificate if the
_Organization (O)_, _Organizational Unit (OU)_, and _Domain Component (DC)_ that might be contained
in the certificate's _Subject Name_ match those contained in _its own_ subject name. This behavior
can be customized to check for different attributes via `net.tls.clusterAuthX509.attributes` or
`net.tls.clusterAuthX509.extensionValue`. See the [`auth`](../../db/auth/README.md) documentation
for more information about X.509 intracluster auth.
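
To see the attributes that this default matching rule compares, a member certificate's subject can be printed
(hypothetical file name):

```bash
# Print the Subject Name; with the default settings, the O, OU, and DC
# components shown here must match across all members' certificates.
openssl x509 -in member.pem -noout -subject
```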