mirror of https://github.com/mongodb/mongo
289 lines
14 KiB
Markdown
289 lines
14 KiB
Markdown
# Modules POC
|
|
|
|
This folder contains a POC implementation of a module metrics tracker and enforcement. This documentation includes
|
|
basic information about modules, and commands which will run the scanner across the entire first-party codebase and merge the results. All
|
|
commands are assumed to run at the root of the checkout, inside of a correctly activated python
|
|
virtual env.
|
|
|
|
## What is a module
|
|
|
|
A module:
|
|
|
|
- Provides a coherent public API
|
|
- Has internal details that are not intended to be directly accessed from outside the module
|
|
- Is a set of files covering the API (headers), implementations (headers and cpp files), and tests
|
|
|
|
### Submodules
|
|
|
|
TODO
|
|
|
|
## Assigning files to modules
|
|
|
|
The file `modules_poc/modules.yaml` contains a list of modules, each containing
|
|
a list of files. Each file must be contained in only one module. Note that
|
|
module assignment is not required to map neatly to team ownership.
|
|
|
|
In cases where multiple globs match a file, the current rule is that the
|
|
longest glob wins. This is used as a simpler-to-implement version of
|
|
most-specific glob wins, which we may switch to in the future.
|
|
|
|
When submitting a review, you are strongly encouraged to include
|
|
a generated diff of the changes to the modules list. This can be done by running:
|
|
|
|
```bash
|
|
modules_poc/mod_mapping.py --dump-modules-list > modules.old
|
|
# make your changes
|
|
diff -u0 modules.old <(modules_poc/mod_mapping.py --dump-modules-list)
|
|
```
|
|
|
|
Github will nicely format the diff if you put it in a block like this:
|
|
|
|
````markdown
|
|
```diff
|
|
@@ -1234 +1234 @@
|
|
-path/to/file -- old_module
|
|
+path/to/file -- new_module
|
|
```
|
|
````
|
|
|
|
<!--
|
|
if you are reading the markdown source, ignore the lines with 4 backticks (````),
|
|
and only use a single wrapping with ```diff
|
|
-->
|
|
|
|
### Showing assigned and unassigned files
|
|
|
|
Run `modules_poc/mod_mapping.py --dump-modules` to produce a `modules_dump.yaml`
|
|
file in current directory. This file is a multi-level map from
|
|
module name to team name to directory path to list of file names.
|
|
For unassigned files it uses `__NONE__` as the module name, and for unowned
|
|
files it uses `__NO_OWNER__` as the team, both of which conveniently sort first.
|
|
For owned files it uses the part of the team-name after `@10gen/` with `-`
|
|
replaced with `_` to be friendlier to querying. In cases where multiple teams
|
|
own a file, the file is duplicated to each team's list.
|
|
|
|
This file can be viewed directly in VSCode. The yaml plugin's breadcrumbs and
|
|
folding are very helpful. [`yq`](https://github.com/kislyuk/yq)
|
|
([`jq`](https://jqlang.org) for yaml) is also a powerful tool. Here are a few
|
|
examples using it, some of which produce enough output to be worth opening in vscode:
|
|
|
|
```bash
|
|
# list of teams
|
|
yq '[.[] | keys] | add | sort | unique[]' -r modules_dump.yaml
|
|
# unassigned files owned by server-programmability
|
|
yq '.__NONE__.server_programmability' modules_dump.yaml
|
|
# files owned by server-programmability across all modules (or lack thereof)
|
|
yq '.[] |= (.server_programmability | values)' modules_dump.yaml
|
|
# assigned files owned by server-programmability outside of the core module
|
|
yq '.[] |= (.server_programmability | values) | del(.core) | del(.__NONE__)' modules_dump.yaml
|
|
# assigned files owned by server-programmability in modules that don't start with core
|
|
yq '.[] |= (.server_programmability | values) | with_entries(select(.key | startswith("core") | not)) | del(.__NONE__)' modules_dump.yaml
|
|
# unowned files as a flat list
|
|
yq '.[].__NO_OWNER__ | values | to_entries | map("\(.key)/\(.value[])") | .[] ' modules_dump.yaml -r | sort
|
|
# unowned files grouped by directory
|
|
yq '[.[].__NO_OWNER__ | to_entries? | .[]] | group_by(.key) | map({key: .[0].key, value: ([.[].value] | add | sort)}) | from_entries' modules_dump.yaml
|
|
# assigned files owned by non server-programmability teams inside of the core module (grouped by teams, then directory)
|
|
yq '.core | with_entries(select(.key != "server_programmability"))' modules_dump.yaml
|
|
```
|
|
|
|
## Specifying public and private module APIs
|
|
|
|
To make an API or class available for use by other modules, add a
|
|
tag to its header declaration.
|
|
|
|
```
|
|
class MONGO_MOD_PUBLIC Foo {
|
|
|
|
};
|
|
|
|
MONGO_MOD_PUBLIC int foo();
|
|
```
|
|
|
|
Availability specification can also be done at the namespace level.
|
|
|
|
```
|
|
namespace MONGO_MOD_PRIVATE my_details {
|
|
|
|
} // namespace MONGO_MOD_PRIVATE my details
|
|
```
|
|
|
|
Elements inside a class or namespace default to the visibility of the
|
|
enclosing scope. Note that the canonical version of "inside" can be
|
|
subtle, with, e.g., member functions being "inside" the class definition,
|
|
not the location the member function is defined. All forward
|
|
declarations of the same function or class should have the same visibility
|
|
tags, and forward declarations across module boundaries should be avoided.
|
|
|
|
If visibility is not specified at any containing scope,
|
|
it defaults to `MONGO_MOD_PRIVATE` (except in cases where the
|
|
header doesn't include `mongo/util/modules.h`, where the default is `UNKNOWN`
|
|
to facilitate incrementally tagging APIs).
|
|
|
|
Documentation for individual `MONGO_MOD_*` tags is present in
|
|
[`mongo/util/modules.h`](../src/mongo/util/modules.h).
|
|
|
|
### Running the scanner
|
|
|
|
This will build the `merged_decls.json` file in the current directory:
|
|
|
|
```bash
|
|
buildscripts/poetry_sync.sh # make sure the python env has the right packages installed
|
|
python modules_poc/merge_decls.py
|
|
```
|
|
|
|
`merge_decls.py` takes an optional flag `--[no-]intra-module` to indicate whether you want to include
|
|
intra module accesses and declarations that are only used from within their module or submodules. It
|
|
defaults to `--intra-module`, which provides the most information to consumers.
|
|
|
|
Running `merge_decls.py` also validates that private APIs aren't being used
|
|
outside of where they are permitted. If any are, the script will fail, though
|
|
`merged_decls.json` will still be generated, and the
|
|
invalid uses will be printed to stdout.
|
|
|
|
<!--
|
|
Commenting this section out for now since it no longer works now that the merger
|
|
invokes the scanner automatically. Unsure if we want to support this use case again.
|
|
|
|
If you only wish to include the files linked in to a given executable, replace the `bazel build` command with the following commands:
|
|
|
|
```bash
|
|
TARGET="//src/mongo/db:mongod"
|
|
bazel cquery --config=mod-scanner "filter(//src/mongo, kind(cc_*, deps($TARGET)))" | awk '{print $1}' > targets.file
|
|
bazel build --config=mod-scanner --target_pattern_file=targets.file
|
|
```
|
|
-->
|
|
|
|
### Automated marking visibility for private headers
|
|
|
|
`private_headers.py` can be used to find all files where there are no currently detected external usages
|
|
and automatically mark them as private. This does not necessarily mean that all automatically marked headers
|
|
are intended to be private. A human should review to ensure that the marked headers match intent. Filters
|
|
can be applied to adjust the files that are modified.
|
|
|
|
For headers matching the filter, the script will also warn of usages of `_forTest` external to the file
|
|
family.
|
|
|
|
Example usage:
|
|
|
|
```
|
|
./modules_poc/private_headers.py --team=server_programmability --module=core --glob="src/mongo/executor/*"
|
|
```
|
|
|
|
`--dry-run` can be used to view all of the changes without applying them.
|
|
|
|
### Note for implementers
|
|
|
|
You can also scan a single file which is useful when iterating on this. You can
|
|
either pass it the same flags used to compile, or pass it just a cpp file and it
|
|
will figure out the flags from your `compile_commands.json`. It will create a file
|
|
called `decls.yaml` to the current directory when run this way.
|
|
|
|
```bash
|
|
modules_poc/mod_scanner.py src/mongo/bson/bsonobj.cpp
|
|
```
|
|
|
|
## Browsing
|
|
|
|
You can start the browser by running `modules_poc/browse.py`. If `merged_decls.json` is not present, it will
|
|
ask if you want it to run the scanner for you. You can rescan at any time by pressing <kbd>r</kbd>.
|
|
It will show the available keybindings on the right, which can be toggled
|
|
by pressing <kbd>?</kbd>. If you are running from a VSCode or neovim terminal, you can press
|
|
<kbd>g</kbd> to go to any location in your editor. You can also press <kbd>p</kbd> to toggle an
|
|
embedded preview of the location the current line is currently on (you probably want to hide the
|
|
help when doing this). You can press <kbd>Tab ↹</kbd> to switch between the tree and preview.
|
|
|
|
The browser is primarily intended to assist in labeling public APIs, so the files are sorted with
|
|
the most number of unlabeled declarations ("unknowns") first. Only declarations that are used outside of their
|
|
module are counted and shown. You can search for a file by pressing <kbd>f</kbd> or press
|
|
<kbd>m</kbd> to filter the files by module.
|
|
|
|
As an advanced feature, you can pass a custom file to `browse.py` and it will
|
|
use it rather than the default `merged_decls.json`. It does need still to have
|
|
the same shape as the original. This works best with [jq] filtering to do
|
|
advanced filtering. For example, here is a command that will only show
|
|
declarations where some TUs will only see a forward declaration from another
|
|
module, and will assume that that module is the owner (we need to fix this):
|
|
|
|
```bash
|
|
./modules_poc/browse.py <(jq '[.[] | select(.other_mods)]' merged_decls.json)
|
|
```
|
|
|
|
In general, your `jq` query should be of the form `[.[] | select( SOME QUERY )]`
|
|
to avoid breaking the format expectations. For more advanced analysis, using
|
|
`jq` directly is a good idea.
|
|
|
|
## Uploading
|
|
|
|
Run the following command to upload
|
|
|
|
```bash
|
|
python modules_poc/upload.py $MONGO_URI # fill this in
|
|
```
|
|
|
|
If the upload fails with an error connecting and you need to update the IP
|
|
whitelist for your virtual workstation, `curl -4 wtfismyip.com/text` is a good
|
|
way to see your public IP address
|
|
|
|
## Future Work
|
|
|
|
- [ ] Once we no longer have errant forward declarations in the wrong module, we can
|
|
make the processing a lot faster by having the scanner only write out things
|
|
that are used across modules (or in the `__NONE__` module if we still want to
|
|
track that).
|
|
- [ ] We should explore if using the indexing API (eg `clang_indexSourceFile`) will
|
|
yield better results. In particular, there is a flag to opt-in to visiting all
|
|
implicit instantiations which I think is currently a blind spot. Unfortunately
|
|
it isn't exposed in the python API yet, so we would need to add it there
|
|
first.
|
|
- [ ] Other interesting options would be a clang-tidy plugin or a clang plugin. We
|
|
already have a lot of infrastructure to support clang-tidy plugins, but they
|
|
will ignore any lines with `// NOLINT` comments. A clang plugin is
|
|
particularly interesting if it will be able to run inside of clangd so we can
|
|
show warnings when accessing an unfortunately-public API across modules (we
|
|
may want to mark it as deprecated in that case) and errors when accessing
|
|
private.
|
|
- [x] Parallelize the merge script. Right now it is single threaded. We can use
|
|
`multiprocessing` or similar to parallelize it. It should use a queue of input
|
|
files and have workers merge them into a local `all_decls` map and then the
|
|
main thread should merge the results from each worker. If we just pass the
|
|
path to the workers, this will also parallelize reading the files and parsing
|
|
the json.
|
|
- [x] We should try to report `loc` as the header declaring the entity, but right
|
|
now it will report the cpp file where it is defined. This is currently
|
|
important since we use the definition to decide the canonical location and
|
|
module when merging. This may cause issues for free functions if the namespace
|
|
is marked public in the header. The latter issue can be worked around when
|
|
merging by using the visibility from other files if the current visibility is
|
|
unknown. But we should pick the right location for `loc` regardless.
|
|
- [ ] Try to collapse template instantiations that clang's `specialized_template`
|
|
helper fails to. I don't know why it fails to, but it seems to on many of
|
|
our templates. Maybe we should just merge all declarations with the same `loc`?
|
|
If we do that, we should try to prefer decls where `kind` is a template.
|
|
- [ ] Split "unfortunately public" into 3 categories:
|
|
|
|
1. `NEEDS_REPLACEMENT`: The current API isn't ideal for a public API, but consumers need its
|
|
functionality. It is on the module owner to provide a better API.
|
|
2. `USE_INSTEAD("replacement")`: A replacement for this API has been provided that external
|
|
consumers should switch to. It is on the module consumer to update their code.
|
|
3. `CURRENTLY_USED`: This is a marker we can put on code as we improve the scanner if it finds
|
|
new usages of private APIs that were hidden in older versions. The module owner should examine
|
|
these and decide if they should be public or marked for replacement.
|
|
|
|
- Browser enhancements:
|
|
- [x] Add some search functionality to the browser, at least for files.
|
|
- [ ] Add a way to show a flat list of all decls in a file, rather than
|
|
the nested view.
|
|
- [x] Include `spelling` in the output from the scanner, and use that tol
|
|
highlight that part of the decl in the list
|
|
- [ ] Have scanner include the source ranges (extents) of each usage so that
|
|
we can highlight correctly in the viewer. Should also include the name
|
|
of the entity performing the usage. Probably best to use something like
|
|
`pretty_location()` but with `spelling` rather than `display_name`
|
|
- [ ] Make it easier to filter and sort the decls for exploration. One way would be
|
|
to use [jq](https://jqlang.github.io/jq/) expressions. Need to be careful that
|
|
the data is still in the original shape. Could take two expressions, one that
|
|
goes into a [`select()`](https://jqlang.github.io/jq/manual/#select) and another
|
|
that goes into a [`sort_by()`](https://jqlang.github.io/jq/manual/#sort-sort_by).
|
|
For now you can do `browse.py <(jq '[.[] | select( ... )]' merged_decls.json)`, but we
|
|
should come up with something better.
|