## Summary
Resolves #13294, follow-up to #13882.
In #13882, it was concluded that a fix should not be offered for raw
strings. This change implements that decision: the five rules in
question are no longer always fixable.
## Test Plan
`cargo nextest run` and `cargo insta test`.
---------
Co-authored-by: Micha Reiser <micha@reiser.io>
## Summary
This PR updates the linter, specifically the token-based rules, to work
on the tokens that come after a syntax error.
For context, the token-based rules previously only diagnosed the tokens up to
the first lexical error. This PR adds error resilience by
introducing a `TokenIterWithContext` which tracks the `nesting` level
and tries to keep it in sync with what the lexer is seeing. This isn't 100%
accurate because if the parser recovered from an unclosed parenthesis in
the middle of a line, the context won't reduce the nesting level until
it sees the newline token at the end of that line.
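A minimal, self-contained sketch of that idea, with simplified token kinds standing in for the real Ruff types (this is an illustration, not the actual implementation):
```rust
/// Simplified token kinds for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    Lpar,
    Rpar,
    Newline,
    Other,
}

/// Iterator wrapper that tracks the parenthesization (`nesting`) level so that
/// token-based rules can keep running past a recovered syntax error.
struct TokenIterWithContext<I: Iterator<Item = TokenKind>> {
    tokens: I,
    nesting: u32,
}

impl<I: Iterator<Item = TokenKind>> Iterator for TokenIterWithContext<I> {
    type Item = TokenKind;

    fn next(&mut self) -> Option<TokenKind> {
        let token = self.tokens.next()?;
        match token {
            TokenKind::Lpar => self.nesting += 1,
            TokenKind::Rpar => self.nesting = self.nesting.saturating_sub(1),
            // A logical newline while we still think we're parenthesized means
            // the parser recovered from an unclosed parenthesis: reset.
            TokenKind::Newline if self.nesting > 0 => self.nesting = 0,
            _ => {}
        }
        Some(token)
    }
}

fn main() {
    let tokens = [TokenKind::Lpar, TokenKind::Other, TokenKind::Newline, TokenKind::Other];
    let iter = TokenIterWithContext { tokens: tokens.into_iter(), nesting: 0 };
    for token in iter {
        println!("{token:?}");
    }
}
```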
resolves: #11915
## Test Plan
* Add test cases for a bunch of rules that are affected by this change.
* Run the fuzzer for an extended period and fix any other bugs it surfaced.
## Summary
This PR updates the parser so that it no longer builds the `CommentRanges`;
instead, it is built by the linter and the formatter when
required.
For the linter, it is built and owned by the `Indexer`, while for the
formatter it is built from the `Tokens` struct and passed as an
argument.
## Test Plan
`cargo insta test`
## Summary
This PR updates the entire parser stack in multiple ways:
### Make the lexer lazy
* https://github.com/astral-sh/ruff/pull/11244
* https://github.com/astral-sh/ruff/pull/11473
Previously, Ruff's lexer would act as an iterator. The parser would
collect all the tokens in a vector first and then process the tokens to
create the syntax tree.
The first task in this project is to update the entire parsing flow to
make the lexer lazy. This includes the `Lexer`, `TokenSource`, and
`Parser`. For context, the `TokenSource` is a wrapper around the `Lexer`
to filter out the trivia tokens[^1]. Now, the parser asks the token
source for the next token, and only then does the lexer continue and
emit it. This means that the lexer needs to be aware of the
"current" token. When `next_token` is called, the current token is
updated with the newly lexed token.
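Roughly, the "current token" bookkeeping looks like the following sketch (heavily simplified, with a `char` standing in for a real token; not the actual Ruff lexer):
```rust
/// A heavily simplified lazy lexer: tokens are produced on demand and the most
/// recently lexed token is kept around as the "current" token.
struct Lexer<'src> {
    source: &'src str,
    offset: usize,
    current: Option<char>, // stand-in for a real `Token`
}

impl<'src> Lexer<'src> {
    fn new(source: &'src str) -> Self {
        Self { source, offset: 0, current: None }
    }

    /// The token the lexer is currently positioned at.
    fn current(&self) -> Option<char> {
        self.current
    }

    /// Lex the next token on demand, updating the current token.
    fn next_token(&mut self) -> Option<char> {
        let ch = self.source[self.offset..].chars().next()?;
        self.offset += ch.len_utf8();
        self.current = Some(ch);
        self.current
    }
}

fn main() {
    let mut lexer = Lexer::new("x=1");
    // The parser drives the lexer: nothing is lexed until it asks.
    while let Some(token) = lexer.next_token() {
        println!("current token: {token:?} == {:?}", lexer.current());
    }
}
```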
The main motivation for making the lexer lazy is to allow re-lexing a token
in a different context. This is going to be really useful for making the
parser error resilient. For example, currently the emitted tokens
remain the same even if the parser can recover from an unclosed
parenthesis. This is important because the lexer emits a
`NonLogicalNewline` in a parenthesized context but a normal `Newline` in a
non-parenthesized context. This distinction between newline kinds is also used
to emit the indentation tokens, which the parser relies on
to determine the start and end of a block.
Additionally, this allows us to implement the following functionalities:
1. Checkpoint - rewind infrastructure: The idea here is to create a
checkpoint and continue lexing. At a later point, this checkpoint can be
used to rewind the lexer back to the provided checkpoint.
2. Remove the `SoftKeywordTransformer` and instead use lookahead or
speculative parsing to determine whether a soft keyword is a keyword or
an identifier.
3. Remove the `Tok` enum. The `Tok` enum represents the tokens emitted
by the lexer but it contains owned data which makes it expensive to
clone. The new `TokenKind` enum just represents the type of token which
is very cheap.
This raises the question of how the parser will get the owned value
that was stored on `Tok`. This is solved by introducing a new
`TokenValue` enum which contains only the subset of token kinds that carry
an owned value. It is stored on the lexer and is requested by the
parser when it wants to process the data. For example:
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L1260-L1262)
[^1]: Trivia tokens are `NonLogicalNewline` and `Comment`
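A rough sketch of the `TokenKind`/`TokenValue` split (simplified stand-in types, not the actual Ruff definitions):
```rust
/// Cheap-to-copy token type with no owned data.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    Name,
    Newline,
}

/// Owned data for the subset of tokens that carry a value; stored on the
/// lexer and handed to the parser only when it asks for it.
#[derive(Debug, Default)]
enum TokenValue {
    #[default]
    None,
    Name(String),
}

struct Lexer {
    value: TokenValue,
}

impl Lexer {
    /// Hand the owned value of the current token to the caller, leaving a
    /// default in its place, so the `TokenKind` itself stays cheap to copy.
    fn take_value(&mut self) -> TokenValue {
        std::mem::take(&mut self.value)
    }
}

fn main() {
    let mut lexer = Lexer { value: TokenValue::Name("foo".to_string()) };
    for kind in [TokenKind::Name, TokenKind::Newline] {
        if kind == TokenKind::Name {
            // Only now does the parser request the owned data.
            println!("{:?} -> {:?}", kind, lexer.take_value());
        } else {
            println!("{kind:?}");
        }
    }
}
```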
### Remove `SoftKeywordTransformer`
* https://github.com/astral-sh/ruff/pull/11441
* https://github.com/astral-sh/ruff/pull/11459
* https://github.com/astral-sh/ruff/pull/11442
* https://github.com/astral-sh/ruff/pull/11443
* https://github.com/astral-sh/ruff/pull/11474
For context,
https://github.com/RustPython/RustPython/pull/4519/files#diff-5de40045e78e794aa5ab0b8aacf531aa477daf826d31ca129467703855408220
added support for soft keywords in the parser, using infinite
lookahead to classify a soft keyword as a keyword or an identifier. This
is a clever approach because it wraps the existing `Lexer` and works
on top of it, which keeps the logic for lexing and re-lexing a soft
keyword separate. The change here removes the
`SoftKeywordTransformer` and lets the parser determine this based on
context, lookahead, and speculative parsing.
* **Context:** The transformer needs to know whether the lexer is at a
statement position or a simple-statement position.
This is because a `match` token starts a compound statement while a
`type` token starts a simple statement. **The parser already knows
this.**
* **Lookahead:** Now that the parser knows the context, it can perform a
lookahead of up to two tokens to classify the soft keyword. The logic
for this is described in the PRs implementing it for the `type` and `match`
soft keywords.
* **Speculative parsing:** This is where the checkpoint-rewind
infrastructure helps. For the `match` soft keyword, there are certain cases
that can't be classified based on lookahead alone. The idea is to
create a checkpoint and keep parsing. Based on whether the parsing was
successful and which tokens lie ahead, we can classify the remaining
cases (a simplified sketch of this pattern follows below). Refer to #11443 for more details.
If the soft keyword is being parsed in an identifier context, it'll be
converted to an identifier and the emitted token will be updated as
well. Refer to
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L487-L491).
The `case` soft keyword doesn't require any special handling because
it'll be a keyword only in the context of a match statement.
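A simplified sketch of the checkpoint-rewind pattern (the token representation and the `match` classification here are deliberately naive assumptions, not Ruff's real logic):
```rust
/// A simplified parser over a token slice, with checkpoint/rewind support.
struct Parser<'a> {
    tokens: &'a [&'a str],
    position: usize,
}

#[derive(Clone, Copy)]
struct Checkpoint {
    position: usize,
}

impl<'a> Parser<'a> {
    fn checkpoint(&self) -> Checkpoint {
        Checkpoint { position: self.position }
    }

    fn rewind(&mut self, checkpoint: Checkpoint) {
        self.position = checkpoint.position;
    }

    fn bump(&mut self) -> Option<&'a str> {
        let token = self.tokens.get(self.position).copied();
        self.position += 1;
        token
    }

    /// Decide whether a leading `match` is a keyword (start of a match
    /// statement) or just an identifier, by speculatively parsing ahead.
    fn match_is_keyword(&mut self) -> bool {
        let checkpoint = self.checkpoint();
        self.bump(); // consume `match`
        // Extremely simplified stand-in for "try to parse the subject
        // expression followed by a colon".
        let looks_like_statement = matches!(self.bump(), Some(subject) if !subject.is_empty())
            && self.bump() == Some(":");
        // Either way, rewind so normal parsing starts from the beginning.
        self.rewind(checkpoint);
        looks_like_statement
    }
}

fn main() {
    let mut statement = Parser { tokens: &["match", "x", ":"], position: 0 };
    let mut expression = Parser { tokens: &["match", "(", "x", ")"], position: 0 };
    println!("{}", statement.match_is_keyword()); // true
    println!("{}", expression.match_is_keyword()); // false
}
```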
### Update the parser API
* https://github.com/astral-sh/ruff/pull/11494
* https://github.com/astral-sh/ruff/pull/11505
Now that the lexer is in sync with the parser, and the parser helps
determine whether a soft keyword is a keyword or an identifier, the
lexer cannot be used on its own. The reason is that it's not
sensitive to the context (which is correct). This means that the parser
API needs to be updated to not allow any direct access to the lexer.
Previously, there were two ways to parse the source code:
1. Passing the source code itself
2. Passing the tokens
Now that the lexer and parser work together, the API
corresponding to (2) can no longer exist. The final API is described in this
PR: https://github.com/astral-sh/ruff/pull/11494.
### Refactor the downstream tools (linter and formatter)
* https://github.com/astral-sh/ruff/pull/11511
* https://github.com/astral-sh/ruff/pull/11515
* https://github.com/astral-sh/ruff/pull/11529
* https://github.com/astral-sh/ruff/pull/11562
* https://github.com/astral-sh/ruff/pull/11592
Finally, the last set of changes involves updating all references to the
lexer and the `Tok` enum. This was done in two parts:
1. Update all the references in a way that doesn't require any changes
from this PR, i.e., it can be done independently:
* https://github.com/astral-sh/ruff/pull/11402
* https://github.com/astral-sh/ruff/pull/11406
* https://github.com/astral-sh/ruff/pull/11418
* https://github.com/astral-sh/ruff/pull/11419
* https://github.com/astral-sh/ruff/pull/11420
* https://github.com/astral-sh/ruff/pull/11424
2. Update all the remaining references to use the changes made in this
PR
For (2), various strategies were used:
1. Introduce a new `Tokens` struct which wraps the token vector and adds
methods to query a certain subset of tokens (see the sketch after this list). These include:
    1. `up_to_first_unknown`, which replaces the `tokenize` function
    2. `in_range` and `after`, which replace the `lex_starts_at` function:
the former returns the tokens within the given range while the
latter returns all the tokens after the given offset
2. Introduce a new `TokenFlags`, a set of flags to query certain
information from a token. Currently, this information is limited to
string-type tokens but can be expanded to include other information
in the future as needed. https://github.com/astral-sh/ruff/pull/11578
3. Move the `CommentRanges` to the parsed output because this
information is common to both the linter and the formatter. This removes
the need for the `tokens_and_ranges` function.
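A rough sketch of the kind of queries the `Tokens` wrapper exposes (simplified token type and hand-rolled ranges; the method names mirror those above, but the bodies are illustrative assumptions):
```rust
/// A simplified token with a byte range, standing in for Ruff's real token.
#[derive(Debug, Clone, Copy)]
struct Token {
    start: usize,
    end: usize,
    is_unknown: bool,
}

/// Wrapper over the token vector with range-based queries, replacing the old
/// `tokenize` / `lex_starts_at` entry points.
struct Tokens(Vec<Token>);

impl Tokens {
    /// All tokens up to (but not including) the first `Unknown` token.
    fn up_to_first_unknown(&self) -> &[Token] {
        let end = self
            .0
            .iter()
            .position(|token| token.is_unknown)
            .unwrap_or(self.0.len());
        &self.0[..end]
    }

    /// Tokens fully contained in the given byte range.
    fn in_range(&self, start: usize, end: usize) -> &[Token] {
        let first = self.0.partition_point(|token| token.start < start);
        let last = self.0.partition_point(|token| token.end <= end);
        &self.0[first..last]
    }

    /// All tokens starting at or after the given offset.
    fn after(&self, offset: usize) -> &[Token] {
        let first = self.0.partition_point(|token| token.start < offset);
        &self.0[first..]
    }
}

fn main() {
    let tokens = Tokens(vec![
        Token { start: 0, end: 3, is_unknown: false },
        Token { start: 4, end: 5, is_unknown: false },
        Token { start: 6, end: 7, is_unknown: true },
    ]);
    println!("{:?}", tokens.up_to_first_unknown());
    println!("{:?}", tokens.in_range(0, 5));
    println!("{:?}", tokens.after(4));
}
```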
## Test Plan
- [x] Update and verify the test snapshots
- [x] Make sure the entire test suite is passing
- [x] Make sure there are no changes in the ecosystem checks
- [x] Run the fuzzer on the parser
- [x] Run this change on dozens of open-source projects
### Running this change on dozens of open-source projects
Refer to the PR description to get the list of open source projects used
for testing.
Now, the following tests were done between `main` and this branch:
1. Compare the output of `--select=E999` (syntax errors)
2. Compare the output of default rule selection
3. Compare the output of `--select=ALL`
**Conclusion: all outputs were the same.**
## What's next?
The next step is to introduce re-lexing logic and update the parser to
feed the recovery information to the lexer so that it can emit the
correct token. This moves us one step closer to having error resilience
in the parser and gives Ruff the ability to lint even if the
source code contains syntax errors.
## Summary
This PR follows up from #11420 to move `UP034` to use `TokenKind`
instead of `Tok`.
The main reason for a separate PR is to keep the review easy.
This required a lot more updates because the rule used an index (`i`) to
keep track of the current position in the token vector. Now, as the tokens
are exposed as an iterator, we use `next` to move the iterator forward and
extract the relevant information.
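A rough illustration of the index-to-iterator change (this is not the actual `UP034` logic; the token kinds and the check are simplified assumptions):
```rust
/// Simplified token kinds for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    Name,
    Lpar,
    Rpar,
    Other,
}

/// Find positions where a name token is immediately followed by `(`.
/// Previously this would be written as `while i < tokens.len()` with
/// `tokens[i + 1]` lookups; with an iterator we advance with `next` and
/// look ahead with `peek`.
fn name_followed_by_lpar(tokens: &[TokenKind]) -> Vec<usize> {
    let mut positions = Vec::new();
    let mut iter = tokens.iter().copied().enumerate().peekable();
    while let Some((index, kind)) = iter.next() {
        if kind == TokenKind::Name && matches!(iter.peek(), Some((_, TokenKind::Lpar))) {
            positions.push(index);
        }
    }
    positions
}

fn main() {
    let tokens = [TokenKind::Name, TokenKind::Lpar, TokenKind::Other, TokenKind::Rpar];
    println!("{:?}", name_followed_by_lpar(&tokens));
}
```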
This is part of https://github.com/astral-sh/ruff/issues/11401
## Test Plan
`cargo test`
## Summary
This PR moves the following rules to use `TokenKind` instead of `Tok`:
* `PLE2510`, `PLE2512`, `PLE2513`, `PLE2514`, `PLE2515`
* `E701`, `E702`, `E703`
* `ISC001`, `ISC002`
* `COM812`, `COM818`, `COM819`
* `W391`
I've paused here because the next set of rules
(`pyupgrade::rules::extraneous_parentheses`) indexes into the token
slice but we only have an iterator implementation. So, I want to isolate
that change to make sure the logic is still the same when I move to
using the iterator approach.
This is part of #11401
## Test Plan
`cargo test`
## Summary
Move the `blanket-noqa` rule from the token checker to the noqa checker.
This allows us to make use of the line directives already computed in
the noqa checker.
## Test Plan
Verified test results are unchanged.
## Summary
This PR moves the `Q003` rule to the AST checker.
This is the final rule that used the docstring detection state machine,
so this PR removes that state machine as well.
resolves: #7595, resolves: #7808
## Test Plan
- [x] `cargo test`
- [x] Make sure there are no changes in the ecosystem
## Summary
Closes #10228
This PR makes the blank-line rules keep track of the cell boundaries when
running on a notebook, so that the rules don't trigger when a line is
the first line of a cell.
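A sketch of the intent (all of the types and field names here are hypothetical simplifications, not Ruff's actual checker state):
```rust
/// Illustration only: skip blank-line diagnostics when the current line is
/// the first line of a notebook cell.
struct LogicalLineInfo {
    line_index: usize,
    blank_lines_before: u32,
}

struct Checker {
    /// Line indices at which a new notebook cell starts; empty for .py files.
    cell_starts: Vec<usize>,
}

impl Checker {
    fn is_first_line_of_cell(&self, line: &LogicalLineInfo) -> bool {
        self.cell_starts.contains(&line.line_index)
    }

    fn check_blank_lines(&self, line: &LogicalLineInfo) {
        // Expected behaviour from this PR: the "too many blank lines" family
        // of rules shouldn't fire at a cell boundary.
        if self.is_first_line_of_cell(line) {
            return;
        }
        if line.blank_lines_before > 2 {
            println!("E303: too many blank lines ({})", line.blank_lines_before);
        }
    }
}

fn main() {
    let checker = Checker { cell_starts: vec![0, 10] };
    checker.check_blank_lines(&LogicalLineInfo { line_index: 10, blank_lines_before: 4 }); // skipped
    checker.check_blank_lines(&LogicalLineInfo { line_index: 12, blank_lines_before: 4 }); // E303
}
```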
## Test Plan
The example given in #10228 is added as a fixture, along with a few
tests from the main blank lines fixtures.
## Summary
Continuing with #7595, this PR moves the `Q004` rule to the AST checker.
## Test Plan
- [x] Existing test cases should pass
- [x] No ecosystem updates
## Summary
Continuing with https://github.com/astral-sh/ruff/issues/7595, this PR
moves the `Q001`, `Q002`, `Q003` rules to the AST based checker.
## Test Plan
Make sure all of the existing test cases pass and verify there are no
ecosystem changes.
## Summary
Fixes https://github.com/astral-sh/ruff/issues/10039
The [recommendation for typing stub
files](https://typing.readthedocs.io/en/latest/source/stubs.html#blank-lines)
is to use **one** blank line to group related definitions and
otherwise omit blank lines.
The newly added blank line rules (`E3*`) didn't account for typing stub
files and enforced two empty lines at the top level and one empty line
otherwise, making it impossible to group related definitions.
This PR updates the `E3*` rules for stub files to:
* Not enforce blank lines. The use of blank lines in typing definitions
is entirely up to the user.
* Allow at most one empty line, including between top-level statements
(see the sketch below).
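A minimal sketch of the intended limits (the function and its inputs are hypothetical; the stub-file behaviour additionally never *requires* blank lines, which isn't captured here):
```rust
/// Illustration only: how the allowed blank-line limits might differ for
/// `.pyi` stub files (names and numbers are simplified assumptions).
fn max_allowed_blank_lines(is_stub_file: bool, is_top_level: bool) -> u32 {
    if is_stub_file {
        // Stubs: allow at most one blank line anywhere, and never require one.
        1
    } else if is_top_level {
        2
    } else {
        1
    }
}

fn main() {
    assert_eq!(max_allowed_blank_lines(true, true), 1);
    assert_eq!(max_allowed_blank_lines(false, true), 2);
    assert_eq!(max_allowed_blank_lines(false, false), 1);
    println!("ok");
}
```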
## Test Plan
Added unit tests (It may look odd that many snapshots are empty but the
point is that the rule should no longer emit diagnostics)
## Summary
This PR changes the `E3*` rules to respect the `isort`
`lines-after-imports` and `lines-between-types` settings. Specifically,
the following rules required changes:
* `TooManyBlankLines`: Respects both settings.
* `BlankLinesTopLevel`: Respects `lines-after-imports`. It doesn't need to
respect `lines-between-types` because it only applies to classes and
functions.
The downside of this approach is that both `isort` and the blank-line rules
emit a diagnostic when there are too many blank lines. The fixes aren't
identical (the blank-line fix is less opinionated), but the blank-line rules
accept the fix of `isort`.
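A rough sketch of how the interaction could be resolved (hypothetical settings shape; the `-1` sentinel meaning "unset" is an assumption of this sketch):
```rust
/// Illustration only: the expected number of blank lines after the import
/// block falls back to the E3* default when `isort.lines-after-imports` is
/// left at its "unset" sentinel value.
struct IsortSettings {
    lines_after_imports: isize,
}

fn expected_blank_lines_after_imports(isort: &IsortSettings, default: u32) -> u32 {
    if isort.lines_after_imports >= 0 {
        isort.lines_after_imports as u32
    } else {
        default
    }
}

fn main() {
    let configured = IsortSettings { lines_after_imports: 1 };
    let unset = IsortSettings { lines_after_imports: -1 };
    assert_eq!(expected_blank_lines_after_imports(&configured, 2), 1);
    assert_eq!(expected_blank_lines_after_imports(&unset, 2), 2);
    println!("ok");
}
```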
<details>
<summary>Outdated approach</summary>
Fixes
https://github.com/astral-sh/ruff/issues/10077#issuecomment-1961266981
This PR changes the blank line rules to not enforce the number of blank
lines after imports (top-level) if isort is enabled and leave it to
isort to enforce the right number of lines (depends on the
`isort.lines-after-imports` and `isort.lines-between-types` settings).
The reason to give `isort` precedence over the blank line rules is that
they are configurable. Users that always want two blank lines after
imports can use `isort.lines-after-imports=2` to enforce that
(specifically for imports).
This PR does not fix the incompatibility with the formatter in pyi files
that only uses 0 to 1 blank lines. I'll address this separately.
</details>
## Review
The first commit is a small refactor that simplifies implementing the
fix (and makes it easier to reason about what's mutable and what's not).
## Test Plan
I added a new test and verified that it fails with an error that the fix
never converges. I verified the snapshot output after implementing the
fix.
---------
Co-authored-by: Hoël Bagard <34478245+hoel-bagard@users.noreply.github.com>
## Summary
Part of #7595
This PR moves the `RUF001` and `RUF002` rules to the AST checker. This
removes the use of docstring detection from these rules.
## Test Plan
As this is just a refactor, make sure existing test cases pass.
## Summary
Part of #970.
This adds Pylint's [R2044
empty-comment](https://pylint.pycqa.org/en/latest/user_guide/messages/refactor/empty-comment.html)
lint, as well as an always-safe fix.
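A minimal sketch of the detection and fix-range logic (illustration only; the real rule works on the comment ranges produced by the parser rather than scanning raw lines for `#`):
```rust
/// Illustration only: decide whether a line ends in an "empty" comment, i.e.
/// a `#` followed by nothing but whitespace, and compute the deletion range.
fn empty_comment_range(line: &str) -> Option<(usize, usize)> {
    // NOTE: a real implementation uses comment ranges from the parser so that
    // a `#` inside a string is never mistaken for a comment.
    let hash = line.find('#')?;
    let text = &line[hash + 1..];
    if text.trim().is_empty() {
        // Delete from just after the last non-whitespace character before the
        // `#` (or from the start of the line) through the end of the line.
        let start = line[..hash].rfind(|c: char| !c.is_whitespace()).map_or(0, |i| i + 1);
        Some((start, line.len()))
    } else {
        None
    }
}

fn main() {
    assert_eq!(empty_comment_range("x = 1  #   "), Some((5, 11)));
    assert_eq!(empty_comment_range("# a real comment"), None);
    assert_eq!(empty_comment_range("    #"), Some((0, 5)));
    println!("ok");
}
```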
## Test Plan
The included snapshot verifies the following:
- A line containing only a non-empty comment is not changed
- A line containing leading whitespace before a non-empty comment is not
changed
- A line containing only an empty comment has its content deleted
- A line containing only leading whitespace before an empty comment has
its content deleted
- A line containing only leading and trailing whitespace on an empty
comment has its content deleted
- A line containing trailing whitespace after a non-empty comment is not
changed
- A line containing only a single newline character (i.e. a blank line)
is not changed
- A line containing code followed by a non-empty comment is not changed
- A line containing code followed by an empty comment has its content
deleted after the last non-whitespace character
- Lines containing code and no comments are not changed
- Empty comment lines within block comments are ignored
- Empty comments within triple-quoted sections are ignored
## Comparison to Pylint
Running Ruff and Pylint 3.0.3 with Python 3.12.0 against the
`empty_comment.py` file added in this PR, we see the following:
* Identical behavior:
* empty_comment.py:3:0: R2044: Line with empty comment (empty-comment)
* empty_comment.py:4:0: R2044: Line with empty comment (empty-comment)
* empty_comment.py:5:0: R2044: Line with empty comment (empty-comment)
* empty_comment.py:18:0: R2044: Line with empty comment (empty-comment)
* Differing behavior:
* Pylint doesn't ignore empty comments in block comments commonly used
for visual spacing; I decided these were fine in this implementation
since many projects use these and likely do not want them removed.
* empty_comment.py:28:0: R2044: Line with empty comment (empty-comment)
* Pylint detects "empty comments" within the triple-quoted section at
the bottom of the file, which is arguably a bug in the Pylint
implementation since these are not truly comments. These are ignored by
this implementation.
* empty_comment.py:37:0: R2044: Line with empty comment (empty-comment)
* empty_comment.py:38:0: R2044: Line with empty comment (empty-comment)
* empty_comment.py:39:0: R2044: Line with empty comment (empty-comment)
## Summary
This PR updates the string nodes (`ExprStringLiteral`,
`ExprBytesLiteral`, and `ExprFString`) to account for implicit string
concatenation.
### Motivation
In Python, implicitly concatenated strings are joined while parsing
because the interpreter doesn't require the information for each part.
While that's feasible for an interpreter, it falls short for a static
analysis tool where having such information is more useful. Currently,
various parts of the code use the lexer to get the individual string
parts.
One of the main challenges this solves is string formatting.
Currently, the formatter relies on the lexer to get the individual
string parts, and formats them, including the comments, accordingly. But
with PEP 701, f-strings can also contain comments. Without this change,
it becomes very difficult to add support for f-string formatting.
### Implementation
The initial proposal was made in this discussion:
https://github.com/astral-sh/ruff/discussions/6183#discussioncomment-6591993.
Various AST designs were explored for this task; they are
available in the linked internal document[^1].
The selected variant keeps the nodes as they are,
except that the `implicit_concatenated` field is removed and a
new value struct is added to each `Expr*` struct. This private
struct contains the actual implementation of how the AST is
designed for both single and implicitly concatenated strings.
This is achieved through an enum with two variants,
`Single` and `Concatenated`, to avoid allocating a vector even for single
strings. Various public methods are available on the value struct
to query information about the node.
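A simplified sketch of that enum shape (illustrative only, with the surrounding node machinery omitted):
```rust
/// Illustration only (simplified): avoid a `Vec` allocation for the common
/// case of a single string while still modelling implicit concatenation.
#[derive(Debug)]
struct StringLiteral {
    value: String,
}

#[derive(Debug)]
enum StringLiteralValue {
    Single(StringLiteral),
    Concatenated(Vec<StringLiteral>),
}

impl StringLiteralValue {
    /// Iterate over the individual parts, whether there is one or many.
    fn parts(&self) -> impl Iterator<Item = &StringLiteral> {
        match self {
            StringLiteralValue::Single(part) => std::slice::from_ref(part).iter(),
            StringLiteralValue::Concatenated(parts) => parts.iter(),
        }
    }

    fn is_implicit_concatenated(&self) -> bool {
        matches!(self, StringLiteralValue::Concatenated(_))
    }
}

fn main() {
    let value = StringLiteralValue::Concatenated(vec![
        StringLiteral { value: "foo".to_string() },
        StringLiteral { value: "bar".to_string() },
    ]);
    println!("{}", value.is_implicit_concatenated());
    for part in value.parts() {
        println!("{}", part.value);
    }
}
```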
The nodes are structured in the following way:
```
ExprStringLiteral - "foo" "bar"
|- StringLiteral - "foo"
|- StringLiteral - "bar"
ExprBytesLiteral - b"foo" b"bar"
|- BytesLiteral - b"foo"
|- BytesLiteral - b"bar"
ExprFString - "foo" f"bar {x}"
|- FStringPart::Literal - "foo"
|- FStringPart::FString - f"bar {x}"
|- StringLiteral - "bar "
|- FormattedValue - "x"
```
[^1]: Internal document:
https://www.notion.so/astral-sh/Implicit-String-Concatenation-e036345dc48943f89e416c087bf6f6d9?pvs=4
#### Visitor
The way the nodes are structured, the entire string, including
all the implicitly concatenated parts, is a single node
containing individual nodes for each part. The previous section shows a
representation of that tree for all the string nodes. This means that
new visitor methods are added to visit the individual parts of string,
bytes, and f-string nodes for `Visitor`, `PreorderVisitor`, and
`Transformer`.
## Test Plan
- `cargo insta test --workspace --all-features --unreferenced reject`
- Verify that the ecosystem results are unchanged
## Summary
This PR updates the `E703` rule to avoid flagging any semicolons if
they're present after the last expression in a notebook cell. These are
intended to hide the cell output.
Part of #8669
## Test Plan
Add test notebook and update the snapshots.
## Summary
I think it's reasonable to avoid raising `INP001` for scripts, and
shebangs are one sufficient way to detect scripts.
Closes https://github.com/astral-sh/ruff/issues/8690.
## Summary
It seems like the range of an `ExprStringLiteral` can be somewhat
unreliable when the string is part of an implicit concatenation with an
f-string. Using the tokens themselves is more reliable.
Closes #8680.
Closes https://github.com/astral-sh/ruff/issues/7784.
When using the autofixer for `Q000`, it does not remove the backslashes
from quotes that no longer need escaping.
This new rule checks for such backslashes (regardless of whether they come
from the autofixer or not) and can remove them.
fixes #8617
## Summary
This fixes the bug where the `flake8-commas` rules weren't taking the
new f-string tokens into account.
## Test Plan
Add new test cases around f-strings for all of `flake8-commas`'s rules.
fixes: #8556
## Summary
Throughout the codebase, we have this pattern:
```rust
let mut diagnostic = ...
if checker.patch(Rule::UnusedVariable) {
// Do the fix.
}
diagnostics.push(diagnostic)
```
This was helpful when we computed fixes lazily; however, we now compute
fixes eagerly, and this is _only_ used to ensure that we don't generate
fixes for rules marked as unfixable.
We often forget to add this, and it leads to bugs in enforcing
`--unfixable`.
This PR instead removes all of these checks, moving the responsibility
of enforcing `--unfixable` up to `check_path`. This is similar to how
@zanieb handled the `--extend-unsafe` logic: we post-process the
diagnostics to remove any fixes that should be ignored.
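A sketch of that post-processing step (hypothetical types; `Diagnostic` and the rule/fix representation are simplified stand-ins):
```rust
/// Illustration only: instead of guarding every fix with
/// `checker.patch(rule)`, strip fixes from the finished diagnostics once,
/// in one place, based on the unfixable set.
use std::collections::HashSet;

#[derive(Debug)]
struct Diagnostic {
    rule: &'static str,
    fix: Option<String>, // stand-in for a real `Fix`
}

fn strip_unfixable(diagnostics: &mut [Diagnostic], unfixable: &HashSet<&str>) {
    for diagnostic in diagnostics {
        if unfixable.contains(diagnostic.rule) {
            diagnostic.fix = None;
        }
    }
}

fn main() {
    let mut diagnostics = vec![
        Diagnostic { rule: "F841", fix: Some("remove unused variable".to_string()) },
        Diagnostic { rule: "E703", fix: Some("remove semicolon".to_string()) },
    ];
    let unfixable: HashSet<&str> = ["F841"].into_iter().collect();
    strip_unfixable(&mut diagnostics, &unfixable);
    println!("{diagnostics:?}");
}
```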
## Stack Summary
This stack splits `Settings` into `FormatterSettings` and `LinterSettings` and moves it into `ruff_workspace`. This change is necessary to add the `FormatterSettings` to `Settings` without adding `ruff_python_formatter` as a dependency to `ruff_linter` (and the linter should not contain the formatter settings).
A quick overview of our settings struct at play:
* `Options`: 1:1 representation of the options in the `pyproject.toml` or `ruff.toml`. Used for deserialization.
* `Configuration`: Resolved `Options`, potentially merged from multiple configurations (when using `extend`). The representation is very close if not identical to the `Options`.
* `Settings`: The resolved configuration that uses a data format optimized for reading. Optional fields are initialized with their default values. Initialized by `Configuration::into_settings`.
The goal of this stack is to split `Settings` into tool-specific resolved `Settings` structs that are independent of each other. This has the advantage that the individual crates don't need to know anything about the other tools. The downside is that information gets duplicated between the `Settings` structs. Right now the duplication is minimal (`line-length`, `tab-width`), but we may need to come up with a solution if more expensive data needs sharing.
This stack focuses on `Settings`. Splitting `Configuration` into some smaller structs is something I'll follow up on later.
## PR Summary
This PR extracts the linter-specific settings into a new `LinterSettings` struct and adds it as a `linter` field to the `Settings` struct. This is in preparation for moving `Settings` from `ruff_linter` to `ruff_workspace`.
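A sketch of the resulting shape (field names are illustrative assumptions, not the actual structs):
```rust
/// Illustration only: tool-specific settings live in their own structs, and
/// the workspace-level `Settings` aggregates them.
#[derive(Debug, Default)]
struct LinterSettings {
    line_length: usize,
    tab_width: usize,
}

#[derive(Debug, Default)]
struct FormatterSettings {
    line_length: usize,
    tab_width: usize,
}

/// Lives in `ruff_workspace`; neither tool crate needs to know about the other.
#[derive(Debug, Default)]
struct Settings {
    linter: LinterSettings,
    formatter: FormatterSettings,
}

fn main() {
    let settings = Settings {
        linter: LinterSettings { line_length: 88, tab_width: 4 },
        formatter: FormatterSettings { line_length: 88, tab_width: 4 },
    };
    // Duplication (line length, tab width) is accepted for now in exchange for
    // keeping the two tools' settings independent of each other.
    println!("{settings:#?}");
}
```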
## Test Plan
`cargo test`