mirror of https://github.com/astral-sh/ruff
Add an overview of Ruff's compilation pipeline to the docs (#5719)
## Summary

I originally wrote this in Notion but it seems preferable to publish it publicly in the documentation. Feedback welcome!
This commit is contained in:
parent 25e491ad6f
commit fee0f43925
@@ -685,3 +685,44 @@ Module {
parse it, serialize the parsed representation and write it back. Used to check how good our
representation is so that fixes don't rewrite irrelevant parts of a file.
- `cargo dev format_dev`: See the `ruff_python_formatter` README.md
## Subsystems
### Compilation Pipeline
If we view Ruff as a compiler, in which the inputs are paths to Python files and the outputs are
diagnostics, then our current compilation pipeline proceeds as follows:
1. **File discovery**: Given paths like `foo/`, locate all Python files in any specified subdirectories, taking into account our hierarchical settings system and any `exclude` options.
1. **Package resolution**: Determine the “package root” for every file by traversing over its parent directories and looking for `__init__.py` files. (A minimal sketch of this traversal is shown after this list.)
1. **Cache initialization**: For every “package root”, initialize an empty cache.
1. **Analysis**: For every file, in parallel:
    1. **Cache read**: If the file is cached (i.e., its modification timestamp hasn't changed since it was last analyzed), short-circuit and return the cached diagnostics. (A simplified version of this lookup is sketched after this list.)
    1. **Tokenization**: Run the lexer over the file to generate a token stream.
    1. **Indexing**: Extract metadata from the token stream, such as comment ranges, `# noqa` locations, `# isort: off` locations, “doc lines”, etc.
    1. **Token-based rule evaluation**: Run any lint rules that are based on the contents of the token stream (e.g., commented-out code).
    1. **Filesystem-based rule evaluation**: Run any lint rules that are based on the contents of the filesystem (e.g., lack of `__init__.py` file in a package).
    1. **Logical line-based rule evaluation**: Run any lint rules that are based on logical lines (e.g., stylistic rules).
    1. **Parsing**: Run the parser over the token stream to produce an AST. (This consumes the token stream, so anything that relies on the token stream needs to happen before parsing; the resulting per-file ordering is sketched after this list.)
    1. **AST-based rule evaluation**: Run any lint rules that are based on the AST. This includes the vast majority of lint rules. As part of this step, we also build the semantic model for the current file as we traverse over the AST. Some lint rules are evaluated eagerly, as we iterate over the AST, while others are deferred until after the initial traversal (e.g., unused imports, since we can’t determine whether an import is unused until we’ve finished analyzing the entire file).
    1. **Import-based rule evaluation**: Run any lint rules that are based on the module’s imports (e.g., import sorting). These could, in theory, be included in the AST-based rule evaluation phase; they’re just separated for simplicity.
    1. **Physical line-based rule evaluation**: Run any lint rules that are based on physical lines (e.g., line-length).
    1. **Suppression enforcement**: Remove any violations that are suppressed via `# noqa` directives or `per-file-ignores`. (A simplified `# noqa` filter is sketched after this list.)
    1. **Cache write**: Write the generated diagnostics to the package cache using the file as a key.
1. **Reporting**: Print diagnostics in the specified format (text, JSON, etc.), to the specified output channel (stdout, a file, etc.).
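
For illustration, here is a minimal sketch of the package-resolution step: walk up from the file's directory as long as each parent contains an `__init__.py`, and take the topmost such directory as the package root. The `package_root` helper is hypothetical, not Ruff's actual API, and the real resolver also has to account for things like namespace packages and excluded directories:

```rust
use std::path::{Path, PathBuf};

/// Walk up from the file's directory as long as each directory contains an
/// `__init__.py`, and return the topmost such directory as the package root.
/// (`package_root` is a hypothetical helper, not Ruff's real API.)
fn package_root(file: &Path) -> Option<PathBuf> {
    let mut root = None;
    let mut dir = file.parent();
    while let Some(d) = dir {
        if d.join("__init__.py").is_file() {
            root = Some(d.to_path_buf());
            dir = d.parent();
        } else {
            break;
        }
    }
    root
}

fn main() {
    // For `foo/bar/baz.py`, this prints `foo/bar` if only `foo/bar/__init__.py`
    // exists, or `foo` if `foo/__init__.py` exists as well.
    if let Some(root) = package_root(Path::new("foo/bar/baz.py")) {
        println!("package root: {}", root.display());
    }
}
```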
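The cache-read short-circuit can be pictured as a lookup keyed on the file path and guarded by the file's modification timestamp. The `CacheEntry` type and `check_file` function below are illustrative assumptions, not the real cache implementation:

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Illustrative cache entry: the modification time observed when the file was
/// last analyzed, plus the diagnostics produced for it.
struct CacheEntry {
    modified: SystemTime,
    diagnostics: Vec<String>, // stand-in for the real diagnostic type
}

/// If the file's current mtime matches the cached mtime, reuse the cached
/// diagnostics; otherwise re-run the analysis and refresh the cache entry.
fn check_file(
    path: &Path,
    cache: &mut HashMap<PathBuf, CacheEntry>,
    analyze: impl Fn(&Path) -> Vec<String>,
) -> io::Result<Vec<String>> {
    let modified = fs::metadata(path)?.modified()?;
    if let Some(entry) = cache.get(path) {
        if entry.modified == modified {
            return Ok(entry.diagnostics.clone());
        }
    }
    let diagnostics = analyze(path);
    cache.insert(
        path.to_path_buf(),
        CacheEntry {
            modified,
            diagnostics: diagnostics.clone(),
        },
    );
    Ok(diagnostics)
}

fn main() -> io::Result<()> {
    let mut cache = HashMap::new();
    // Hypothetical analysis: pretend every file yields one diagnostic.
    // (Returns an error if `example.py` doesn't exist on disk.)
    let analyze = |_: &Path| vec!["example diagnostic".to_string()];
    let diagnostics = check_file(Path::new("example.py"), &mut cache, analyze)?;
    println!("{diagnostics:?}");
    Ok(())
}
```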
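The ordering constraints within the per-file analysis fall out of the fact that parsing consumes the token stream. The following sketch uses stand-in types and stub functions purely to show the sequencing; none of these names are Ruff's real API:

```rust
// Stand-in types; the real ones live in the `ruff_python_*` crates.
struct Token;
struct Ast;
struct Indexes;

// Stub stages, each returning placeholder data.
fn tokenize(_source: &str) -> Vec<Token> { Vec::new() }
fn index(_tokens: &[Token]) -> Indexes { Indexes }
fn token_rules(_tokens: &[Token]) -> Vec<String> { Vec::new() }
fn logical_line_rules(_tokens: &[Token]) -> Vec<String> { Vec::new() }
fn parse(_tokens: Vec<Token>) -> Ast { Ast }
fn ast_rules(_ast: &Ast) -> Vec<String> { Vec::new() }
fn physical_line_rules(_source: &str) -> Vec<String> { Vec::new() }
fn filter_suppressed(diagnostics: Vec<String>, _indexes: &Indexes) -> Vec<String> { diagnostics }

/// The per-file sequencing: everything that needs the token stream runs
/// before `parse`, because `parse` takes the tokens by value.
fn lint_file(source: &str) -> Vec<String> {
    let tokens = tokenize(source);                   // tokenization
    let indexes = index(&tokens);                    // indexing
    let mut diagnostics = token_rules(&tokens);      // token-based rules
    diagnostics.extend(logical_line_rules(&tokens)); // logical line-based rules
    let ast = parse(tokens);                         // parsing consumes the tokens
    diagnostics.extend(ast_rules(&ast));             // AST-based rules
    diagnostics.extend(physical_line_rules(source)); // physical line-based rules
    filter_suppressed(diagnostics, &indexes)         // `# noqa` enforcement
}

fn main() {
    println!("{:?}", lint_file("x = 1\n"));
}
```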
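Finally, suppression enforcement boils down to filtering the collected diagnostics against the `# noqa` locations gathered during indexing. This sketch only handles blanket `# noqa` comments on the violating line and uses an assumed `Diagnostic` type; the real rule also parses code lists like `# noqa: E501` and applies `per-file-ignores`:

```rust
use std::collections::HashSet;

/// Simplified diagnostic: just a rule code and a 1-based line number.
struct Diagnostic {
    code: &'static str,
    line: usize,
}

/// Drop diagnostics whose line carries a blanket `# noqa` comment.
fn filter_suppressed(source: &str, diagnostics: Vec<Diagnostic>) -> Vec<Diagnostic> {
    let noqa_lines: HashSet<usize> = source
        .lines()
        .enumerate()
        .filter(|(_, line)| line.contains("# noqa"))
        .map(|(i, _)| i + 1)
        .collect();
    diagnostics
        .into_iter()
        .filter(|diagnostic| !noqa_lines.contains(&diagnostic.line))
        .collect()
}

fn main() {
    let source = "import os  # noqa\nimport sys\n";
    let diagnostics = vec![
        Diagnostic { code: "F401", line: 1 }, // suppressed by the `# noqa`
        Diagnostic { code: "F401", line: 2 }, // kept
    ];
    for diagnostic in filter_suppressed(source, diagnostics) {
        println!("{}:{}", diagnostic.line, diagnostic.code);
    }
}
```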