## Summary
This PR implements a variety of optimizations to improve performance of
the Eradicate rule, which always shows up in all-rules benchmarks and
bothers me. (These improvements are not hugely important, but it was
kind of a fun Friday thing to spend a bit of time on.)
The improvements include:
- Doing cheaper work first (checking for some explicit substrings
upfront).
- Using `aho-corasick` to speed up an exact substring search.
- Merging multiple regular expressions using a `RegexSet`.
- Removing some unnecessary `\s*` and other pieces from the regular
expressions (since we already trim strings before matching on them).
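As a sketch of how these layers can compose (the substrings and patterns below are illustrative, not the rule's actual internals):
```rust
use aho_corasick::AhoCorasick;
use regex::RegexSet;

// Illustrative pre-filter: most comments contain none of these substrings,
// so the merged regex pass below rarely runs at all.
fn looks_like_commented_out_code(comment: &str) -> bool {
    let candidates = AhoCorasick::new(["import", "print(", "="]).unwrap();
    if candidates.find(comment).is_none() {
        return false;
    }
    // One merged pass over all code-like patterns instead of N separate regexes.
    let patterns = RegexSet::new([
        r"^(from|import)\s+\w+",
        r"^print\(",
        r"^\w+\s*=\s*\S",
    ])
    .unwrap();
    patterns.is_match(comment)
}
```
In practice both matchers would be built once (e.g. lazily in a static), not per call.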
## Test Plan
I benchmarked this function in a standalone crate using a variety of
cases. Criterion reports that this version is up to 80% faster, and
almost every case is at least 50% faster:
```
Eradicate/Detection/# Warn if we are installing over top of an existing installation. This can
time: [101.84 ns 102.32 ns 102.82 ns]
change: [-77.166% -77.062% -76.943%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
Eradicate/Detection/#from foo import eradicate
time: [74.872 ns 75.096 ns 75.314 ns]
change: [-84.180% -84.131% -84.079%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Eradicate/Detection/# encoding: utf8
time: [46.522 ns 46.862 ns 47.237 ns]
change: [-29.408% -28.918% -28.471%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) high mild
1 (1.00%) high severe
Eradicate/Detection/# Issue #999
time: [16.942 ns 16.994 ns 17.058 ns]
change: [-57.243% -57.064% -56.815%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
Eradicate/Detection/# type: ignore
time: [43.074 ns 43.163 ns 43.262 ns]
change: [-17.614% -17.390% -17.152%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
Eradicate/Detection/# user_content_type, _ = TimelineEvent.objects.using(db_alias).get_or_create(
time: [209.40 ns 209.81 ns 210.23 ns]
change: [-32.806% -32.630% -32.470%] (p = 0.00 < 0.05)
Performance has improved.
Eradicate/Detection/# this is = to that :(
time: [72.659 ns 73.068 ns 73.473 ns]
change: [-68.884% -68.775% -68.655%] (p = 0.00 < 0.05)
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
7 (7.00%) high mild
2 (2.00%) high severe
Eradicate/Detection/#except Exception:
time: [92.063 ns 92.366 ns 92.691 ns]
change: [-64.204% -64.052% -63.909%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
Eradicate/Detection/#print(1)
time: [68.359 ns 68.537 ns 68.725 ns]
change: [-72.424% -72.356% -72.278%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild
Eradicate/Detection/#'key': 1 + 1,
time: [79.604 ns 79.865 ns 80.135 ns]
change: [-69.787% -69.667% -69.549%] (p = 0.00 < 0.05)
Performance has improved.
```
## Summary
This is a follow-up to #7469 that attempts to achieve similar gains, but
without introducing malachite. Instead, this PR removes the `BigInt`
type altogether, opting for a simple enum that allows us to
store small integers directly and only allocate for values that don't
fit in an `i64`:
```rust
/// A Python integer literal. Represents both small (fits in an `i64`) and large integers.
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct Int(Number);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Number {
    /// A "small" number that can be represented as an `i64`.
    Small(i64),
    /// A "large" number that cannot be represented as an `i64`.
    Big(Box<str>),
}

impl std::fmt::Display for Number {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Number::Small(value) => write!(f, "{value}"),
            Number::Big(value) => write!(f, "{value}"),
        }
    }
}
```
We typically don't care about numbers greater than `isize` -- our only
uses are comparisons against small constants (like `1`, `2`, `3`, etc.),
so there's no real loss of information, except in one or two rules where
we're now a little more conservative (with the worst-case being that we
don't flag, e.g., an `itertools.pairwise` that uses an extremely large
value for the slice start constant). For simplicity, a few diagnostics
now show a dedicated message when they see integers that are out of the
supported range (e.g., `outdated-version-block`).
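As an illustration of that conservative behavior, a rule comparing against a small constant can simply bail out on `Big` values (this helper is a hypothetical sketch, not necessarily the PR's actual API):
```rust
impl Number {
    /// Hypothetical helper: the value as an `i64`, if it fits.
    /// Rules that compare against small constants get `None` for big
    /// literals and skip the check, which is the conservative worst case
    /// described above.
    fn as_i64(&self) -> Option<i64> {
        match self {
            Number::Small(value) => Some(*value),
            Number::Big(_) => None,
        }
    }
}
```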
An additional benefit here is that we get to remove a few dependencies,
especially `num-bigint`.
## Test Plan
`cargo test`
## Stack Summary
This stack splits `Settings` into `FormatterSettings` and `LinterSettings` and moves it into `ruff_workspace`. This change is necessary to add the `FormatterSettings` to `Settings` without adding `ruff_python_formatter` as a dependency to `ruff_linter` (and the linter should not contain the formatter settings).
A quick overview of our settings structs at play:
* `Options`: 1:1 representation of the options in the `pyproject.toml` or `ruff.toml`. Used for deserialization.
* `Configuration`: Resolved `Options`, potentially merged from multiple configurations (when using `extend`). The representation is very close if not identical to the `Options`.
* `Settings`: The resolved configuration that uses a data format optimized for reading. Optional fields are initialized with their default values. Initialized by `Configuration::into_settings` .
The goal of this stack is to split `Settings` into tool-specific resolved `Settings` that are independent of each other. This has the advantage that the individual crates don't need to know anything about the other tools. The downside is that information gets duplicated between `Settings`. Right now the duplication is minimal (`line-length`, `tab-width`), but we may need to come up with a solution if more expensive data needs sharing.
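A minimal sketch of the resulting shape (field sets are illustrative; today only the small overlap is duplicated):
```rust
// Lives in `ruff_workspace` and composes the tool-specific settings.
pub struct Settings {
    pub linter: LinterSettings,
    pub formatter: FormatterSettings,
}

// Each tool owns its resolved settings; the duplicated fields are the
// currently-minimal overlap (`line-length`, `tab-width`).
pub struct LinterSettings {
    pub line_length: usize,
    pub tab_width: usize,
    // ... linter-only fields ...
}

pub struct FormatterSettings {
    pub line_length: usize,
    pub tab_width: usize,
    // ... formatter-only fields ...
}
```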
This stack focuses on `Settings`. Splitting `Configuration` into some smaller structs is something I'll follow up on later.
## PR Summary
This PR moves the `ResolverSettings` and `Settings` struct to `ruff_workspace`. `LinterSettings` remains in `ruff_linter` because it gets passed to lint rules, the `Checker` etc.
## Test Plan
`cargo test`
## Summary
The tokenizer was split into a forward and a backwards tokenizer. The
backwards tokenizer uses the same names as the forward one (e.g.
`next_token`). The backwards tokenizer is given the comment ranges that
we already built, which it uses to skip comments.
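A rough sketch of how pre-built comment ranges let a backwards tokenizer jump over comments without re-lexing them (illustrative, not the actual API):
```rust
/// Given a byte offset and the comment ranges built during forward lexing
/// (sorted, non-overlapping `(start, end)` pairs), return the position from
/// which to resume backwards tokenization: the comment's start if the offset
/// falls inside a comment, the offset itself otherwise.
fn skip_comment_backwards(offset: usize, comment_ranges: &[(usize, usize)]) -> usize {
    match comment_ranges.binary_search_by(|(start, _)| start.cmp(&offset)) {
        // The offset is exactly at a comment start: nothing to skip over.
        Ok(_) => offset,
        Err(0) => offset,
        Err(i) => {
            let (start, end) = comment_ranges[i - 1];
            if offset < end {
                start
            } else {
                offset
            }
        }
    }
}
```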
---------
Co-authored-by: Micha Reiser <micha@reiser.io>
Bumps [shlex](https://github.com/comex/rust-shlex) from 1.1.0 to 1.2.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/comex/rust-shlex/blob/master/CHANGELOG.md">shlex's
changelog</a>.</em></p>
<blockquote>
<h1>1.2.0</h1>
<ul>
<li>Adds <code>bytes</code> module to support operating directly on byte
strings.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/comex/rust-shlex/commits">compare view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Micha Reiser <micha@reiser.io>
## Summary
This PR upgrades `strum` from 0.24.x to 0.25.x.
The breaking changes are:
* strum macros now use syn 2
* The `to_string` behavior changed when using `default`. I did a quick search; we aren't using `strum(default)`.
`strum` now has a `#[derive(EnumIs)]` macro that generates `is_` methods.
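For example, a minimal sketch of the new derive:
```rust
use strum_macros::EnumIs;

#[derive(EnumIs)]
enum Applicability {
    Safe,
    Unsafe,
}

fn main() {
    let applicability = Applicability::Safe;
    // `EnumIs` generates an `is_*` predicate per variant.
    assert!(applicability.is_safe());
    assert!(!applicability.is_unsafe());
}
```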
## Test Plan
`cargo test`
## Summary
The only commit is [api: add follow_root_links() option to WalkDir](dcc527d832), which adds a new option controlling whether `walkdir` should follow a root symlink or not.
The new option defaults to `true`, which is the same behavior as before.
## Test Plan
`cargo test`
## Summary
This PR bumps the pyproject-toml crate to 0.7.0. The only difference is that it now depends on indexmap 2. I reviewed the indexmap 2 changes and they don't seem relevant to us.
I used this opportunity to remove the default features from `serde_with`, which removes our indexmap 1 dependency (and some other unused dependencies).
## Test Plan
`cargo test`
## Motivation
The `ast::Arguments` for call arguments are split into positional
arguments (args) and keyword arguments (keywords). We currently assume
that a call consists of args first and then keywords, which is generally
the case, but not always:
```python
f(*args, a=2, *args2, **kwargs)

class A(*args, a=2, *args2, **kwargs):
    pass
```
The consequence is accidentally reordering arguments
(https://github.com/astral-sh/ruff/pull/7268).
## Summary
`Arguments::args_and_keywords` returns an iterator of an `ArgOrKeyword`
enum that yields args and keywords in the correct order. I've fixed the
obvious `args` and `keywords` usages, but there might be some cases with
wrong assumptions remaining.
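A rough sketch of the shape (stand-in types; the real iterator merges on the AST nodes' text ranges):
```rust
use std::ops::Range;

// Illustrative stand-ins for the AST types; only the ranges matter here.
struct Arg { range: Range<usize> }
struct Keyword { range: Range<usize> }

enum ArgOrKeyword<'a> {
    Arg(&'a Arg),
    Keyword(&'a Keyword),
}

/// Yield args and keywords merged by source position, so that
/// `f(*args, a=2, *args2, **kwargs)` comes out in written order.
fn args_and_keywords<'a>(args: &'a [Arg], keywords: &'a [Keyword]) -> Vec<ArgOrKeyword<'a>> {
    let mut merged: Vec<ArgOrKeyword<'a>> = args
        .iter()
        .map(ArgOrKeyword::Arg)
        .chain(keywords.iter().map(ArgOrKeyword::Keyword))
        .collect();
    merged.sort_by_key(|item| match item {
        ArgOrKeyword::Arg(arg) => arg.range.start,
        ArgOrKeyword::Keyword(keyword) => keyword.range.start,
    });
    merged
}
```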
## Test Plan
The generator got new test cases; otherwise, the stacked PR
(https://github.com/astral-sh/ruff/pull/7268), which uncovered this, serves as the test.
## Summary
This PR updates the `FileCache` to include an optional `NotebookIndex`
to support caching for Jupyter Notebooks.
We only require the index to compute the diagnostics and thus we don't
really need to store the entire `Notebook` on the `Diagnostics` struct.
This means we only need the index to be stored in the cache to
reconstruct the `Diagnostics`.
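A minimal sketch of the resulting cache entry (field names and the index's contents are illustrative):
```rust
/// Illustrative stub: maps rows of the concatenated source back to cells,
/// which is all that's needed to render notebook diagnostics.
struct NotebookIndex {
    row_to_cell: Vec<u32>,
}

struct FileCache {
    // ... existing fields (key, timestamps, cached messages) ...
    /// `Some` only for Jupyter notebooks; `None` for plain Python files.
    notebook_index: Option<NotebookIndex>,
}
```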
## Test Plan
Updated an existing test case to run over the fixtures in the
`ruff_notebook` crate, where there are multiple Jupyter notebooks.
Locally, the following commands were run in order:
1. Remove the cache: `rm -rf .ruff_cache`
2. Run without cache: `cargo run --bin ruff -- check --isolated
crates/ruff_notebook/resources/test/fixtures/jupyter/unused_variable.ipynb
--no-cache`
3. Run with cache: `cargo run --bin ruff -- check --isolated
crates/ruff_notebook/resources/test/fixtures/jupyter/unused_variable.ipynb`
4. Check whether the `.ruff_cache` directory was created or not
5. Run with cache again and verify: `cargo run --bin ruff -- check
--isolated
crates/ruff_notebook/resources/test/fixtures/jupyter/unused_variable.ipynb`
## Benchmarks
https://github.com/astral-sh/ruff/pull/6863#issuecomment-1715675186

fixes: #6671
## Summary
This PR updates the remaining lexer test cases to use the snapshots.
This is mainly a mechanical refactor.
## Motivation
The main motivation is so that when we add the token range values to the
test case output, it's easier to update the test cases.
The reason they were not using snapshots before was the usage of the
`test_case` macro. The macro is mainly used for different EOL test cases. If we
just generated the snapshots directly, the snapshot names would be suffixed
with `-1`, `-2`, etc., as the test function is still the same. So, we create
the snapshots ourselves with the platform name for the respective EOL
test cases.
## Test Plan
`cargo test`
## Summary
This PR adds the `--preview` and `--no-preview` options to the `format` command (hidden) and passes them through to the formatter.
## Test Plan
I added a `dbg!(f.options().preview())` statement in `FormatNodeRule::fmt` and verified that the option gets passed correctly to the formatter.
## Summary
This PR updates the revision of the `LibCST` dependency to 9c263aa897
in order to fix https://github.com/astral-sh/ruff/issues/4899
## Test Plan
A test case including the carriage return (`\r`) character was added for
`F504`, followed by `cargo test`.
fixes: #4899
## Summary
This PR moves `ruff/jupyter` into its own `ruff_notebook` crate. Beyond
the move itself, there were a few challenges:
1. `ruff_notebook` relies on the source map abstraction. I've moved the
source map into `ruff_diagnostics`, since it doesn't have any
dependencies on its own and is used alongside diagnostics.
2. `ruff_notebook` has a couple of tests for end-to-end linting and
autofixing. I had to leave these tests in `ruff` itself.
3. We had code in `ruff/jupyter` that relied on Python lexing, in order
to provide a more targeted error message in the event that a user saves
a `.py` file with a `.ipynb` extension. I removed this in order to avoid
a dependency on the parser; it felt like it wasn't worth taking on that
dependency just for this.
## Test Plan
`cargo test`
## Summary
Returns an exit code of 1 if any files would be reformatted:
```
ruff on charlie/format-check:main [$?⇡] is 📦 v0.0.286 via 🐍 v3.11.2 via 🦀 v1.72.0
❯ cargo run -p ruff_cli -- format foo.py --check
Compiling ruff_cli v0.0.286 (/Users/crmarsh/workspace/ruff/crates/ruff_cli)
Finished dev [unoptimized + debuginfo] target(s) in 1.69s
Running `target/debug/ruff format foo.py --check`
warning: `ruff format` is a work-in-progress, subject to change at any time, and intended only for experimentation.
1 file would be reformatted
ruff on charlie/format-check:main [$?⇡] is 📦 v0.0.286 via 🐍 v3.11.2 via 🦀 v1.72.0 took 2s
❯ echo $?
1
```
Closes #6966.
**Summary** Add recursive formatting based on `ruff check` file
discovery for `ruff format`, as a prototype for the formatter alpha.
This allows e.g. `format ../projects/django/`. It's still lacking
support for any settings except line length.
Note that, just like the existing `ruff format`, this will become part of the
production build, i.e. you'll be able to use it - hidden by default and
with a prominent warning - with `ruff format .` after the next release.
Error handling works in my manual tests (the colors also work):
```
$ target/debug/ruff format scripts/
warning: `ruff format` is a work-in-progress, subject to change at any time, and intended for internal use only.
```
(the above changes `add_rule.py` where we have the wrong bin op
breaking)
```
$ target/debug/ruff format ../projects/django/
warning: `ruff format` is a work-in-progress, subject to change at any time, and intended for internal use only.
Failed to format /home/konsti/projects/django/tests/test_runner_apps/tagged/tests_syntax_error.py: source contains syntax errors: ParseError { error: UnrecognizedToken(Name { name: "syntax_error" }, None), offset: 131, source_path: "<filename>" }
```
```
$ target/debug/ruff format a
warning: `ruff format` is a work-in-progress, subject to change at any time, and intended for internal use only.
Failed to read /home/konsti/ruff/a/d.py: Permission denied (os error 13)
```
**Test Plan** Missing! I'm not sure if it's worth building tests at this
stage, or what they should look like.
## Summary
This PR updates the lexer tests to use the snapshot testing framework.
It also
makes the following changes:
* Remove the use of macros in the lexer tests
* Use `test_case` for EOL tests
## Test Plan
```
cargo test --package ruff_python_parser --lib --all-features -- lexer::tests --nocapture
```
In https://github.com/astral-sh/ruff/pull/6616 we are adding support for
nested replacements in format specifiers which makes actually formatting
strings infeasible without a great deal of complexity. Since we're not
using these functions (they just exist for runtime use in RustPython),
we can just remove them.
## Summary
Fixes some TODOs introduced in
https://github.com/astral-sh/ruff/pull/6538. In short, given an
expression like `1 if x > 0 else "Hello, world!"`, we now return a union
type that says the expression can resolve to either an `int` or a `str`.
The system remains very limited: it only works for obvious primitive
types, and there's no attempt to do inference on any more complex
variables. (If any expression yields `Unknown` or `TypeError`, we
propagate that result throughout and abort on the client's end.)
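A rough sketch of the propagation (names and types are illustrative, not the actual implementation):
```rust
#[derive(Clone, PartialEq)]
enum ResolvedType {
    Int,
    Str,
    Union(Vec<ResolvedType>),
    Unknown,
}

/// Resolve `body if test else orelse` by unioning both branches; any
/// `Unknown` poisons the whole expression, matching the propagate-and-abort
/// behavior described above.
fn resolve_if_expr(body: ResolvedType, orelse: ResolvedType) -> ResolvedType {
    match (body, orelse) {
        (ResolvedType::Unknown, _) | (_, ResolvedType::Unknown) => ResolvedType::Unknown,
        (left, right) if left == right => left,
        (left, right) => ResolvedType::Union(vec![left, right]),
    }
}
```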
**Summary** Some files seem notoriously slow in the formatter (seconds in debug mode). This time was, however, almost exclusively spent in the diff algorithm used to compute the similarity index, so I replaced that. I kept `similar` for printing the actual diff to avoid rewriting that too, with the disadvantage that we now have two diff libraries in format_dev.
I used this PR to remove the spinner from tracing-indicatif and changed `flamegraph --perfdata perf.data` to `flamegraph --perfdata perf.data --no-inline`, as the former wouldn't finish for me on release builds with debug info.
## Summary
This PR adds the implementation for the new Jupyter AST nodes i.e.,
`ExprLineMagic` and `StmtLineMagic`.
## Test Plan
Add test cases for `unparse` containing magic commands
resolves: #6087
Reimplements https://github.com/astral-sh/ruff/pull/3104
Closes https://github.com/astral-sh/ruff/issues/5726
Note that we will generate the hash for a cache key twice in normal
operation. Once to check for the cached item and again to update the
cache. We could optimize this by generating the hash once in
`diagnostics::lint_file` and passing the `u64` into `get` and `update`.
We'd probably want to wrap it in a `CacheKeyHash` enum for type safety.
## Test Plan
Unit tests for Windows and Unix.
Manual test with case from issue
```
❯ touch fake.py
❯ chmod +x fake.py
❯ ./target/debug/ruff --select EXE fake.py
fake.py:1:1: EXE002 The file is executable but no shebang is present
Found 1 error.
❯ chmod -x fake.py
❯ ./target/debug/ruff --select EXE fake.py
```
I don't know whether we want to make this change but here's some data...
Binary size:
- `main`: 30,384
- `charlie/match-phf`: 30,416
llvm-lines:
- `main`: 1,784,148
- `charlie/match-phf`: 1,789,877
llvm-lines and binary size are both unchanged (or differ by < 5) when moving
from `u8` to `u32` return types, and even when moving to `char` keys and
values. I didn't expect this, but I'm not very knowledgeable on this
topic.
Performance:
```
Confusables/match/src time: [4.9102 µs 4.9352 µs 4.9777 µs]
change: [+1.7469% +2.2421% +2.8710%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
2 (2.00%) low mild
4 (4.00%) high mild
6 (6.00%) high severe
Confusables/match-with-skip/src
time: [2.0676 µs 2.0945 µs 2.1317 µs]
change: [+0.9384% +1.6000% +2.3920%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) high mild
5 (5.00%) high severe
Confusables/phf/src time: [31.087 µs 31.188 µs 31.305 µs]
change: [+1.9262% +2.2188% +2.5496%] (p = 0.00 < 0.05)
Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
3 (3.00%) low mild
6 (6.00%) high mild
6 (6.00%) high severe
Confusables/phf-with-skip/src
time: [2.0470 µs 2.0486 µs 2.0502 µs]
change: [-0.3093% -0.1446% +0.0106%] (p = 0.08 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
```
The `-with-skip` variants add our optimization which first checks
whether the character is ASCII. So `match` is way, way faster than PHF,
but it tends not to matter since almost all source code is ASCII anyway.
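The skip optimization is roughly the following (a sketch; the real table covers many more codepoints):
```rust
/// Map a confusable character to its ASCII look-alike.
fn confusable_for(c: char) -> Option<char> {
    // Fast path: almost all source code is ASCII, and ASCII characters are
    // never confusables, so skip the lookup entirely.
    if c.is_ascii() {
        return None;
    }
    match c {
        '\u{FF01}' => Some('!'), // fullwidth exclamation mark
        '\u{FF1B}' => Some(';'), // fullwidth semicolon
        '\u{2212}' => Some('-'), // minus sign
        _ => None,
    }
}
```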
**Summary** I accidentally merged earlier while the RustPython parser
rev was still pointing to the feature branch instead of to the merged
main. This makes the rev point to the RustPython parser repo's main again.
**Summary** Fix implemented in
https://github.com/astral-sh/RustPython-Parser/pull/35: Previously,
empty lambda arguments (e.g. `lambda: 1`) would get the range of the
entire expression, which leads to incorrect comment placement. Now empty
lambda arguments get an empty range between the `lambda` and the `:`
tokens.
**Test Plan** Added a regression test.
149 instances of unstable formatting remaining.
```
$ cargo run --bin ruff_dev --release -- format-dev --stability-check --error-file formatter-ecosystem-errors.txt --multi-project target/checkouts > formatter-ecosystem-progress.txt
$ rg "Unstable formatting" target/formatter-ecosystem-errors.txt | wc -l
149
```
## Summary
It can happen that we can't read a file (a python file, a jupyter
notebook or pyproject.toml), which needs to be handled and handled
consistently for all file types. Instead of using `Err` or `error!`, we
emit E602 with the io error as message and continue. This PR makes sure
we handle all three cases consistently, emit E602.
I'm not convinced that it should be possible to disable io errors, but
we now handle the regular case consistently and at least print warning
consistently.
I went with `warn!` but i can change them all to `error!`, too.
It also checks the error case when a pyproject.toml is not readable. The
error message is not very helpful, but it's now a bit clearer that
ruff itself failed, versus this being a diagnostic.
## Examples
This is how an `Err` of `run` looks now (screenshot omitted).
With an unreadable file and `IOError` disabled (screenshot omitted), we lint
zero files, but we count files before linting, not during, so we exit 0.
I'm not sure if it should (or if we should take a different path with a
manual `ExitStatus`), but this currently also triggers when `files` is
empty (screenshot omitted).
## Test Plan
Unix only: Create a temporary directory with files with permissions
`000` (not readable by the owner) and run on that directory. Since this
breaks the assumptions of most of the test code (single file, `ruff`
instead of `ruff_cli`), the test code is rather cumbersome and looks a
bit misplaced; I'm happy about suggestions to fit it in closer with the
other tests or streamline it in other ways. I added another test for
when the entire directory is not readable.
## Summary
This crate now contains utilities for dealing with trivia more broadly:
whitespace, newlines, "simple" trivia lexing, etc. So renaming it to
reflect its increased responsibilities.
To avoid conflicts, I've also renamed `Token` and `TokenKind` to
`SimpleToken` and `SimpleTokenKind`.
## Summary
The motivation here is that it will make this rule easier to rewrite as
a deferred check. Right now, we can't run this rule in the deferred
phase, because it depends on the `except_handler` to power its autofix.
Instead of lexing the `except_handler`, we can use the `SimpleTokenizer`
from the formatter, and just lex forwards and backwards.
For context, this rule detects the unused `e` in:
```python
try:
    pass
except ValueError as e:
    pass
```
## Summary
Previously, `StmtIf` was defined recursively as
```rust
pub struct StmtIf {
    pub range: TextRange,
    pub test: Box<Expr>,
    pub body: Vec<Stmt>,
    pub orelse: Vec<Stmt>,
}
```
Every `elif` was represented as an `orelse` with a single `StmtIf`. This
means that this representation couldn't differentiate between
```python
if cond1:
    x = 1
else:
    if cond2:
        x = 2
```
and
```python
if cond1:
    x = 1
elif cond2:
    x = 2
```
It also makes many checks harder than they need to be because we have to
recurse just to iterate over an entire if-elif-else and because we're
lacking nodes and ranges on the `elif` and `else` branches.
We change the representation to a flat
```rust
pub struct StmtIf {
    pub range: TextRange,
    pub test: Box<Expr>,
    pub body: Vec<Stmt>,
    pub elif_else_clauses: Vec<ElifElseClause>,
}

pub struct ElifElseClause {
    pub range: TextRange,
    pub test: Option<Expr>,
    pub body: Vec<Stmt>,
}
```
where `test: Some(_)` represents an `elif` and `test: None` an `else`.
This representation is a different tradeoff: e.g., we need to allocate the
`Vec<ElifElseClause>`, the `elif`s are now different from the `if`s
(which matters in rules where we want to check both `if`s and `elif`s), and
the type system doesn't guarantee that the `test: None` else is actually
last. We're also now a bit more inconsistent, since all other `else`s,
those from `for`, `while` and `try`, still don't have nodes. With the
new representation some things became easier, e.g. finding the `elif`
token (we can use the start of the `ElifElseClause`) and formatting
comments for if-elif-else (no more dangling-comment splitting; we only
have to insert the dangling comment after the colon manually and set
`leading_alternate_branch_comments`; everything else is taken care of by
having nodes for each branch and the usual placement.rs fixups).
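For example, visiting every branch of a chain is now a flat loop over the clauses (a sketch against the types above):
```rust
/// Visit every branch of an if-elif-else chain without recursion.
fn visit_branches(stmt_if: &StmtIf) {
    // The `if` branch itself.
    visit_body(&stmt_if.body);
    for clause in &stmt_if.elif_else_clauses {
        match &clause.test {
            Some(_test) => { /* an `elif` branch */ }
            None => { /* the trailing `else` branch */ }
        }
        visit_body(&clause.body);
    }
}

fn visit_body(_body: &[Stmt]) {
    // ... rule or formatter logic ...
}
```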
## Merge Plan
This PR requires coordination between the parser repo and the main ruff
repo. I've split the ruff part into two stacked PRs that have to be
merged together (only the second one fixes all tests): the first, for the
formatter, to be reviewed by @michareiser, and the second, for the linter,
to be reviewed by @charliermarsh.
* MH: Review and merge
https://github.com/astral-sh/RustPython-Parser/pull/20
* MH: Review and merge or move later in stack
https://github.com/astral-sh/RustPython-Parser/pull/21
* MH: Review and approve
https://github.com/astral-sh/RustPython-Parser/pull/22
* MH: Review and approve formatter PR
https://github.com/astral-sh/ruff/pull/5459
* CM: Review and approve linter PR
https://github.com/astral-sh/ruff/pull/5460
* Merge linter PR in formatter PR, fix ecosystem checks (ecosystem
checks can't run on the formatter PR and won't run on the linter PR, so
we need to merge them first)
* Merge https://github.com/astral-sh/RustPython-Parser/pull/22
* Create tag in the parser, update linter+formatter PR
* Merge linter+formatter PR https://github.com/astral-sh/ruff/pull/5459
---------
Co-authored-by: Micha Reiser <micha@reiser.io>
## Summary
For formatter instabilities, the messages we get look something like
this:
```text
Unstable formatting /home/konsti/ruff/target/checkouts/deepmodeling:dpdispatcher/dpdispatcher/slurm.py
@@ -47,9 +47,9 @@
- script_header_dict["slurm_partition_line"] = (
- NOT_YET_IMPLEMENTED_ExprJoinedStr
- )
+ script_header_dict[
+ "slurm_partition_line"
+ ] = NOT_YET_IMPLEMENTED_ExprJoinedStr
Unstable formatting /home/konsti/ruff/target/checkouts/deepmodeling:dpdispatcher/dpdispatcher/pbs.py
@@ -26,9 +26,9 @@
- pbs_script_header_dict["select_node_line"] += (
- NOT_YET_IMPLEMENTED_ExprJoinedStr
- )
+ pbs_script_header_dict[
+ "select_node_line"
+ ] += NOT_YET_IMPLEMENTED_ExprJoinedStr
```
For ruff crashes, you don't even get that, just the file that crashed
it. To extract the actual bug, you'd need to manually remove parts of
the file, rerunning to see if the bug still occurs (and reverting if it
doesn't) until you have a minimal example.
With this script, you run
```shell
cargo run --bin ruff_shrinking -- target/checkouts/deepmodeling:dpdispatcher/dpdispatcher/slurm.py target/minirepo/code.py "Unstable formatting" "target/debug/ruff_dev format-dev --stability-check target/minirepo"
```
and get
```python
class Slurm():
    def gen_script_header(self, job):
        if resources.queue_name != "":
            script_header_dict["slurm_partition_line"] = f"#SBATCH --partition {resources.queue_name}"
```
which is a nice minimal example.
I've been using this script and it would be easier for me if this were
part of main. The main disadvantage to merging is that it adds
additional dependencies.
## Test Plan
I've been using this for a number of minimizations. This is an internal
helper script you only run manually. I could add a test that minimizes a
rule violation, if required.
---------
Co-authored-by: Micha Reiser <micha@reiser.io>
## Summary
Comparing repos with black requires that we use the same settings as black,
notably line length and magic trailing comma behaviour. Excludes and
preserving quotes (vs. a preference for either quote style) are not yet
implemented because they weren't needed for the test projects.
In the other two commits, I fixed the output when the progress bar is
hidden (this way is recommended in the indicatif docs), added a
`scratch.pyi` file to gitignore because black formats stub files
differently, and also updated the ecosystem README with the projects JSON
without forks.
## Test Plan
I added a `line-length` vs `line_length` test. Otherwise, only my
personal usage atm; a PR to integrate the script into the CI to check
some projects will follow.
## Summary
Do not raise `EXE001` and `EXE002` if WSL is detected. Uses the
[`wsl`](https://crates.io/crates/wsl) crate.
Closes #5445.
## Test Plan
`cargo test`
I don't use Windows, so I was unable to test on a WSL environment. It
would be good if someone who runs Windows could check the functionality.
## Summary
It turns out that just doing this match directly without `AhoCorasick`
is much faster, like 2x (and removes one dependency, though we likely
already rely on this transitively).
## Summary
Historically, we only used a fork to enable building without pyo3. But
pyo3 is an optional feature. I may've just not understood how to
accomplish this way back when.
## Summary
This extends the `ruff_dev` formatter script util. Instead of only doing
stability checks, you can now choose different compatible options on the
CLI and get statistics.
* It adds an option that formats all files that ruff would check, to allow
looking at an entire black-formatted repository with `git diff`
* It computes the [Jaccard
index](https://en.wikipedia.org/wiki/Jaccard_index) as a measure of
deviation between input and output, which is useful as single number
metric for assessing our current deviations from black.
* It adds progress bars to both the single projects as well as the
multi-project mode.
* It adds an option to write the multi-project output to a file
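As a sketch of the metric from the Jaccard bullet above (the real computation runs over diffs; line sets are used here for illustration):
```rust
use std::collections::HashSet;

/// Jaccard index |A ∩ B| / |A ∪ B| over the sets of lines in the
/// (black-formatted) input and the ruff-formatted output: 1.0 means identical.
fn jaccard(input: &str, output: &str) -> f64 {
    let a: HashSet<&str> = input.lines().collect();
    let b: HashSet<&str> = output.lines().collect();
    let union = a.union(&b).count();
    if union == 0 {
        return 1.0; // both empty: treat as identical
    }
    a.intersection(&b).count() as f64 / union as f64
}
```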
Sample usage:
```
$ cargo run --bin ruff_dev -- format-dev --stability-check crates/ruff/resources/test/cpython
$ cargo run --bin ruff_dev -- format-dev --stability-check /home/konsti/projects/django
Syntax error in /home/konsti/projects/django/tests/test_runner_apps/tagged/tests_syntax_error.py: source contains syntax errors (parser error): BaseError { error: UnrecognizedToken(Name { name: "syntax_error" }, None), offset: 131, source_path: "<filename>" }
Found 0 stability errors in 2755 files (jaccard index 0.911) in 9.75s
$ cargo run --bin ruff_dev -- format-dev --write /home/konsti/projects/django
```
Options:
```
Several utils related to the formatter which can be run on one or more repositories. The selected set of files in a repository is the same as for `ruff check`.
* Check formatter stability: Format a repository twice and ensure that the first and second formatting look the same. * Format: Format the files in a repository to be able to check them with `git diff` * Statistics: Compute the Jaccard index between the (assumed to be black formatted) input and the ruff formatted output
Usage: ruff_dev format-dev [OPTIONS] [FILES]...
Arguments:
[FILES]...
Like `ruff check`'s files. See `--multi-project` if you want to format an ecosystem checkout
Options:
--stability-check
Check stability
We want to ensure that once formatted content stays the same when formatted again, which is known as formatter stability or formatter idempotency, and that the formatter prints syntactically valid code. As our test cases cover only a limited amount of code, this allows checking entire repositories.
--write
Format the files. Without this flag, the python files are not modified
--format <FORMAT>
Control the verbosity of the output
[default: default]
Possible values:
- minimal: Filenames only
- default: Filenames and reduced diff
- full: Full diff and invalid code
-x, --exit-first-error
Print only the first error and exit; `-x` is the same as in pytest
--multi-project
Checks each project inside a directory, useful e.g. if you want to check all of the ecosystem checkouts
--error-file <ERROR_FILE>
Write all errors to this file in addition to stdout. Only used in multi-project mode
```
## Test Plan
I ran this on django (2755 files, jaccard index 0.911) and discovered a
magic trailing comma problem and that we really needed to implement
import formatting. I ran the script on cpython to identify
https://github.com/astral-sh/ruff/pull/5558.
## Summary
I'll write up a more detailed description tomorrow, but in short, this
PR removes our regex-based implementation in favor of "manual" parsing.
I tried a couple different implementations. In the benchmarks below:
- `Directive/Regex` is our implementation on `main`.
- `Directive/Find` just uses `text.find("noqa")`, which is insufficient,
since it doesn't cover case-insensitive variants like `NOQA`, and
doesn't handle multiple `noqa` matches in a single line, like `# Here's
a noqa comment # noqa: F401`. But it's kind of a baseline.
- `Directive/Memchr` uses three `memchr` iterative finders (one for
`noqa`, `NOQA`, and `NoQA`).
- `Directive/AhoCorasick` is roughly the variant checked in here.
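The aho-corasick variant looks roughly like this (a sketch; each candidate is then parsed by hand, and in practice the automaton is built once, not per call):
```rust
use aho_corasick::AhoCorasick;

/// Find the offset of the first case-insensitive `noqa` candidate, if any.
fn find_noqa_candidate(comment: &str) -> Option<usize> {
    let finder = AhoCorasick::builder()
        .ascii_case_insensitive(true)
        .build(["noqa"])
        .unwrap();
    finder.find(comment).map(|m| m.start())
}
```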
The raw results:
```
Directive/Regex/# noqa: F401
time: [273.69 ns 274.71 ns 276.03 ns]
change: [+1.4467% +1.8979% +2.4243%] (p = 0.00 < 0.05)
Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
3 (3.00%) low mild
8 (8.00%) high mild
4 (4.00%) high severe
Directive/Find/# noqa: F401
time: [66.972 ns 67.048 ns 67.132 ns]
change: [+2.8292% +2.9377% +3.0540%] (p = 0.00 < 0.05)
Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
1 (1.00%) low severe
3 (3.00%) low mild
8 (8.00%) high mild
3 (3.00%) high severe
Directive/AhoCorasick/# noqa: F401
time: [76.922 ns 77.189 ns 77.536 ns]
change: [+0.4265% +0.6862% +0.9871%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
3 (3.00%) high mild
4 (4.00%) high severe
Directive/Memchr/# noqa: F401
time: [62.627 ns 62.654 ns 62.679 ns]
change: [-0.1780% -0.0887% -0.0120%] (p = 0.03 < 0.05)
Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low severe
5 (5.00%) low mild
3 (3.00%) high mild
2 (2.00%) high severe
Directive/Regex/# noqa: F401, F841
time: [321.83 ns 322.39 ns 322.93 ns]
change: [+8602.4% +8623.5% +8644.5%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low severe
2 (2.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
Directive/Find/# noqa: F401, F841
time: [78.618 ns 78.758 ns 78.896 ns]
change: [+1.6909% +1.8771% +2.0628%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
Directive/AhoCorasick/# noqa: F401, F841
time: [87.739 ns 88.057 ns 88.468 ns]
change: [+0.1843% +0.4685% +0.7854%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
5 (5.00%) low mild
3 (3.00%) high mild
3 (3.00%) high severe
Directive/Memchr/# noqa: F401, F841
time: [80.674 ns 80.774 ns 80.860 ns]
change: [-0.7343% -0.5633% -0.4031%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
4 (4.00%) low severe
9 (9.00%) low mild
1 (1.00%) high mild
Directive/Regex/# noqa time: [194.86 ns 195.93 ns 196.97 ns]
change: [+11973% +12039% +12103%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) low mild
1 (1.00%) high mild
Directive/Find/# noqa time: [25.327 ns 25.354 ns 25.383 ns]
change: [+3.8524% +4.0267% +4.1845%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
6 (6.00%) high mild
3 (3.00%) high severe
Directive/AhoCorasick/# noqa
time: [34.267 ns 34.368 ns 34.481 ns]
change: [+0.5646% +0.8505% +1.1281%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
Directive/Memchr/# noqa time: [21.770 ns 21.818 ns 21.874 ns]
change: [-0.0990% +0.1464% +0.4046%] (p = 0.26 > 0.05)
No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) low mild
4 (4.00%) high mild
2 (2.00%) high severe
Directive/Regex/# type: ignore # noqa: E501
time: [278.76 ns 279.69 ns 280.72 ns]
change: [+7449.4% +7469.8% +7490.5%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
Directive/Find/# type: ignore # noqa: E501
time: [67.791 ns 67.976 ns 68.184 ns]
change: [+2.8321% +3.1735% +3.5418%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe
Directive/AhoCorasick/# type: ignore # noqa: E501
time: [75.908 ns 76.055 ns 76.210 ns]
change: [+0.9269% +1.1427% +1.3955%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
Directive/Memchr/# type: ignore # noqa: E501
time: [72.549 ns 72.723 ns 72.957 ns]
change: [+1.5881% +1.9660% +2.3974%] (p = 0.00 < 0.05)
Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
10 (10.00%) high mild
5 (5.00%) high severe
Directive/Regex/# type: ignore # nosec
time: [66.967 ns 67.075 ns 67.207 ns]
change: [+1713.0% +1715.8% +1718.9%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low severe
3 (3.00%) low mild
2 (2.00%) high mild
4 (4.00%) high severe
Directive/Find/# type: ignore # nosec
time: [18.505 ns 18.548 ns 18.597 ns]
change: [+1.3520% +1.6976% +2.0333%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
Directive/AhoCorasick/# type: ignore # nosec
time: [16.162 ns 16.206 ns 16.252 ns]
change: [+1.2919% +1.5587% +1.8430%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
Directive/Memchr/# type: ignore # nosec
time: [39.192 ns 39.233 ns 39.276 ns]
change: [+0.5164% +0.7456% +0.9790%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
2 (2.00%) low severe
4 (4.00%) low mild
3 (3.00%) high mild
4 (4.00%) high severe
Directive/Regex/# some very long comment that # is interspersed with characters but # no directive
time: [81.460 ns 81.578 ns 81.703 ns]
change: [+2093.3% +2098.8% +2104.2%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low mild
2 (2.00%) high mild
Directive/Find/# some very long comment that # is interspersed with characters but # no directive
time: [26.284 ns 26.331 ns 26.387 ns]
change: [+0.7554% +1.1027% +1.3832%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe
Directive/AhoCorasick/# some very long comment that # is interspersed with characters but # no direc...
time: [28.643 ns 28.714 ns 28.787 ns]
change: [+1.3774% +1.6780% +2.0028%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Directive/Memchr/# some very long comment that # is interspersed with characters but # no directive
time: [55.766 ns 55.831 ns 55.897 ns]
change: [+1.5802% +1.7476% +1.9021%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) low mild
```
While memchr is faster than aho-corasick in some of the common cases
(like `# noqa: F401`), the latter is way, way faster when there _isn't_
a match (like 2x faster -- see the last two cases). Since most comments
_aren't_ `noqa` comments, this felt like the right tradeoff. Note that
all implementations are significantly faster than the regex version.
(I know I originally reported a 10x speedup, but I ended up improving
the regex version a bit in some prior PRs, so it got unintentionally
faster via some refactors.)
There's also one behavior change in here, which is that we now allow
variable spaces, e.g., `#noqa` or `# noqa`. Previously, we required
exactly one space. This thus closes #5177.
## Summary
This PR uses rayon to parallelize the stability check by scheduling each project as its own task.
## Test Plan
I ran the ecosystem check. It now makes use of all cores (except at the end, there are some large projects).
## Performance
The check now completes in minutes where it took about 30 minutes before.
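The scheduling itself is a one-liner with rayon (a sketch, assuming a hypothetical per-project `check_project` task):
```rust
use rayon::prelude::*;
use std::path::{Path, PathBuf};

fn check_all(projects: Vec<PathBuf>) {
    // Each project becomes its own rayon task; the thread pool keeps all
    // cores busy until only the largest projects remain.
    projects.par_iter().for_each(|project| check_project(project));
}

fn check_project(_project: &Path) {
    // ... format twice, compare the results ...
}
```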
## Summary
This PR adds some snapshot tests for the resolver based on executing
resolutions within a "mock" of the Airflow repo (that is: a folder that
contains a subset of the repo's files, but all empty, and with an
only-partially-complete virtual environment). It's intended to act as a
lightweight integration test, to enable us to test resolutions on a
"real" project without adding a dependency on Airflow itself.
## Summary
This PR contains the first step towards enabling robust first-party,
third-party, and standard library import resolution in Ruff (including
support for `typeshed`, stub files, native modules, etc.) by porting
Pyright's import resolver to Rust.
The strategy taken here was to start with a more-or-less direct port of
the Pyright's TypeScript resolver. The code is intentionally similar,
and the test suite is effectively a superset of Pyright's test suite for
its own resolver. Due to the nature of the port, the code is very, very
non-idiomatic for Rust. The code is also entirely unused outside of the
test suite, and no effort has been made to integrate it with the rest of
the codebase.
Future work will include:
- Refactoring the code (now that it works) to match Rust and Ruff
idioms.
- Further testing, in practice, to ensure that the resolver can resolve
imports in a complex project, when provided with a virtual environment
path.
- Caching, to minimize filesystem lookups and redundant resolutions.
- Integration into Ruff itself (use Ruff's existing settings, find rules
that can make use of robust resolution, etc.)
## Summary
This formats call expressions with magic trailing comma and parentheses
behaviour, but without call chaining.
## Test Plan
Lots of new test fixtures, including some that don't work yet
## Summary
Experimental release for Jupyter Notebook integration.
Currently, this requires a user to explicitly opt-in using the
[include](https://beta.ruff.rs/docs/settings/#include) configuration:
```toml
[tool.ruff]
include = ["*.py", "*.pyi", "**/pyproject.toml", "*.ipynb"]
```
Or, a user can pass in the file directly:
```sh
ruff check path/to/notebook.ipynb
```
For known limitations, please refer to #5188
## Test Plan
The following command should work without the `--all-features` flag:
```sh
cargo dev round-trip /path/to/notebook.ipynb
```
The following command should work with the above config file along with
`select = ["ALL"]`:
```sh
cargo run --bin ruff -- check --no-cache --config=../test-repos/openai-cookbook/pyproject.toml --fix ../test-repos/openai-cookbook/
```
Passing the Jupyter notebook directly:
```sh
cargo run --bin ruff -- check --no-cache --isolated --select=ALL --fix ../test-repos/openai-cookbook/examples/Classification_using_embeddings.ipynb
```
## Summary
This PR adds tests that verify that the magic trailing comma is not respected if disabled in the formatter options.
Our test setup now allows creating a `<fixture-name>.options.json` file that contains an array of configurations that should be tested.
## Test Plan
It's all about tests :)
## Summary
This PR implements formatting for non-f-string strings that do not use implicit concatenation.
Docstring formatting is out of the scope of this PR.
<!-- What's the purpose of the change? What does it do, and why? -->
## Test Plan
I added a few tests for simple string literals.
## Performance
Ouch. This is hitting performance somewhat hard. This is probably because we now iterate each string a couple of times:
1. To detect if it is an implicit string continuation
2. To detect if the string contains any new lines
3. To detect the preferred quote
4. To normalize the string
Edit: I integrated the detection of newlines into the preferred quote detection so that we only iterate the string three times.
We can probably do better by merging the implicit string continuation with the quote detection and newline detection by iterating till the end of the string part and returning the offset. We then use our simple tokenizer to skip over any comments or whitespace until we find the first non-trivia token. From there we keep doing this in a loop until we reach the end of the string. I'll leave this improvement for later.
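A sketch of the merged pass (illustrative, not the actual normalization code):
```rust
/// Scan the string body once, returning the preferred quote character and
/// whether the literal contains a newline.
fn analyze_string(body: &str) -> (char, bool) {
    let (mut singles, mut doubles, mut has_newline) = (0u32, 0u32, false);
    for c in body.chars() {
        match c {
            '\'' => singles += 1,
            '"' => doubles += 1,
            '\n' => has_newline = true,
            _ => {}
        }
    }
    // Prefer double quotes; switch to single quotes only if that means
    // fewer escapes.
    let preferred = if doubles > singles { '\'' } else { '"' };
    (preferred, has_newline)
}
```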
## Summary
In https://github.com/astral-sh/RustPython-Parser/pull/8, we modified
RustPython to include ranges for any identifiers that aren't
`Expr::Name` (which already has an identifier).
For example, the `e` in `except ValueError as e` was previously
un-ranged. To extract its range, we had to do some lexing of our own.
This change should improve performance and let us remove a bunch of
code.
## Test Plan
`cargo test`
## Summary
This PR upgrades RustPython to pull in the changes to `Arguments` (zip
defaults with their identifiers) and all the renames to `CmpOp` and
friends.
## Summary
Maintain consistency while deserializing a Jupyter notebook to JSON. The
following changes were made:
1. Use a string array to store the source value as that's the default
(5781720423/nbformat/v4/nbjson.py (L56-L57))
2. Remove unused structs and enums
3. Reorder the keys in alphabetical order as that's the default.
(5781720423/nbformat/v4/nbjson.py (L51))
### Side effect
Removing the `preserve_order` feature means that the order of keys in
JSON output (`--format json`) will be in alphabetical order. This is
because the value is represented using `serde_json::Value` which
internally is a `BTreeMap`, thus sorting it by the string key. For
posterity, if this turns out to be not ideal, we could define a
struct representing the JSON object, and the order of struct fields would
determine the order in the JSON string.
## Test Plan
Add a test case to assert the raw JSON string.
## Summary
This changes the caching design from one cache file per source file, to
one cache file per package. This greatly reduces the number of cache
files that are opened and written, while maintaining roughly the same
(combined) size as bincode is very compact.
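A rough sketch of the per-package layout (field names and entry contents are illustrative):
```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// One of these is serialized (compactly, with bincode) per package,
/// instead of one cache file per source file.
struct PackageCache {
    /// The package root this cache belongs to.
    package_root: PathBuf,
    /// Per-file entries, keyed by path relative to the package root.
    files: HashMap<PathBuf, FileCacheEntry>,
}

struct FileCacheEntry {
    /// Used to invalidate the entry when the source file changes.
    last_modified_ms: u64,
    /// The cached lint results for the file (illustrative placeholder type).
    messages: Vec<String>,
}
```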
Below are some very much not scientific performance tests. It uses
projects/sources to check:
* small.py: a single, 31-byte Python file with 2 errors.
* test.py: a single, 43k Python file with 8 errors.
* fastapi: FastAPI repo, 1134 files checked, 0 errors.
Source | Before # files | After # files | Before size | After size
-------|-------|-------|-------|-------
small.py | 1 | 1 | 20 K | 20 K
test.py | 1 | 1 | 60 K | 60 K
fastapi | 1134 | 518 | 4.5 M | 2.3 M
One question that might come up is why fastapi still has 518 cache files
and not 1? That is because this uses the existing package
resolution, which sees examples, docs, etc. as separate from the "main"
source code (in the fastapi directory in the repo). In the future it
might be worth considering switching to a one-cache-file-per-repo strategy.
This new design is not perfect and does have a number of known issues.
First, like the old design it doesn't remove the cache for a source file
that has been (re)moved until `ruff clean` is called.
Second, this currently uses a large mutex around the mutation of the
package cache (e.g. inserting results). This could be (or become) a
bottleneck. It's future work to test and improve this (if needed).
Third, the packages are currently opened and stored in a sequential
loop; this could be done in parallel. This is also future work.
## Test Plan
Run `ruff check` (with caching enabled) twice on any Python source code
and it should produce the same results.
## Summary
We want to ensure that once formatted content stays the same when
formatted again, which is known as formatter stability or formatter
idempotency, and that the formatter prints syntactically valid code. As
our test cases cover only a limited amount of code, this allows checking
entire repositories.
This adds a new subcommand to `ruff_dev` which can be invoked as `cargo
run --bin ruff_dev -- check-formatter-stability <repo>`. While initially
only intended to check stability, it has also found cases where the
formatter printed invalid syntax or panicked.
## Test Plan
Running this on cpython is already identifying bugs
(https://github.com/astral-sh/ruff/pull/5089)
## Summary
Add support for applying auto-fixes in Jupyter Notebook.
### Solution
Cell offsets are the boundaries for each cell in the concatenated source
code. They are represented using `TextSize` and include the start and
end offsets, thus creating a range for each cell. These offsets are
updated using the `SourceMap` markers.
### SourceMap
`SourceMap` contains markers, constructed from each edit, which track
each original source position against its transformed position. (The
original PR included a drawing of the markers; omitted here.) The
`Notebook` looks at these markers and updates the cell offsets after
each linter loop. If you look closely, each destination takes into
account all of the markers before it.
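A rough sketch of the offset update (types illustrative):
```rust
/// A marker maps a position in the original source to its position in the
/// transformed source.
#[derive(Copy, Clone)]
struct Marker {
    source: u32,
    dest: u32,
}

/// Shift a cell offset by the last marker at or before it. Markers are
/// sorted by `source`, and each `dest` already accumulates all prior edits,
/// which is why only one marker needs to be consulted.
fn update_offset(offset: u32, markers: &[Marker]) -> u32 {
    match markers.iter().rev().find(|marker| marker.source <= offset) {
        Some(marker) => offset - marker.source + marker.dest,
        None => offset,
    }
}
```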
The index is constructed only when required as it's only used to render
the diagnostics. So, a `OnceCell` is used for this purpose. The cell
offsets, cell content and the index will be updated after each iteration
of linting in the mentioned order. The order is important here as the
content is updated as per the new offsets and index is updated as per
the new content.
## Limitations
### 1
Styling rules such as the ones in `pycodestyle` will not be applicable
everywhere in a Jupyter notebook, especially at the cell boundaries. Let's
take an example where a rule suggests having 2 blank lines before a
function and the cells contain the following code:
```python
import something
# ---
def first():
    pass
def second():
    pass
```
(Again, the comment is only to visualize cell boundaries.)
In the concatenated source code, the 2 blank lines will be added, but they
shouldn't actually be added when viewed in terms of the Jupyter notebook:
it's as if the function `first` were at the start of a file.
`nbqa` solves this by recording newlines before and after running
`autopep8`, then running the tool and restoring the newlines at the end
(refer https://github.com/nbQA-dev/nbQA/pull/807).
## Test Plan
Three commands were run in order with common flags (`--select=ALL
--no-cache --isolated`) to isolate the stage in which a problem occurs:
1. Only diagnostics
2. Fix with diff (`--fix --diff`)
3. Fix (`--fix`)
### https://github.com/facebookresearch/segment-anything
```
-------------------------------------------------------------------------------
Language Files Lines Code Comments Blanks
-------------------------------------------------------------------------------
Jupyter Notebooks 3 0 0 0 0
|- Markdown 3 98 0 94 4
|- Python 3 513 468 4 41
(Total) 611 468 98 45
-------------------------------------------------------------------------------
```
```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/segment-anything/**/*.ipynb --fix
...
Found 180 errors (89 fixed, 91 remaining).
```
### https://github.com/openai/openai-cookbook
```
-------------------------------------------------------------------------------
Language Files Lines Code Comments Blanks
-------------------------------------------------------------------------------
Jupyter Notebooks 65 0 0 0 0
|- Markdown 64 3475 12 2507 956
|- Python 65 9700 7362 1101 1237
(Total) 13175 7374 3608 2193
===============================================================================
```
```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/openai-cookbook/**/*.ipynb --fix
error: Failed to parse /path/to/openai-cookbook/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb:cell 4:29:18: unexpected token '-'
...
Found 4227 errors (2165 fixed, 2062 remaining).
```
### https://github.com/tensorflow/docs
```
-------------------------------------------------------------------------------
Language Files Lines Code Comments Blanks
-------------------------------------------------------------------------------
Jupyter Notebooks 150 0 0 0 0
|- Markdown 1 55 0 46 9
|- Python 1 402 289 60 53
(Total) 457 289 106 62
-------------------------------------------------------------------------------
```
```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/tensorflow-docs/**/*.ipynb --fix
error: Failed to parse /path/to/tensorflow-docs/site/en/guide/extension_type.ipynb:cell 80:1:1: unexpected token Indent
error: Failed to parse /path/to/tensorflow-docs/site/en/r1/tutorials/eager/custom_layers.ipynb:cell 20:1:1: unexpected token Indent
error: Failed to parse /path/to/tensorflow-docs/site/en/guide/data.ipynb:cell 175:5:14: unindent does not match any outer indentation level
error: Failed to parse /path/to/tensorflow-docs/site/en/r1/tutorials/representation/unicode.ipynb:cell 30:1:1: unexpected token Indent
...
Found 12726 errors (5140 fixed, 7586 remaining).
```
### https://github.com/tensorflow/models
```
-------------------------------------------------------------------------------
Language Files Lines Code Comments Blanks
-------------------------------------------------------------------------------
Jupyter Notebooks 46 0 0 0 0
|- Markdown 1 11 0 6 5
|- Python 1 328 249 19 60
(Total) 339 249 25 65
-------------------------------------------------------------------------------
```
```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/tensorflow-models/**/*.ipynb --fix
...
Found 4856 errors (2690 fixed, 2166 remaining).
```
resolves: #1218
fixes: #4556
## Summary
`ruff_newlines` becomes `ruff_python_whitespace`, and includes the
existing "universal newline" handlers alongside the Python
whitespace-specific utilities.
* Use phf for confusables to reduce llvm lines
## Summary
This replaces FxHashMap for the confusables with a perfect hash map from the [phf crate](https://github.com/rust-phf/rust-phf) to reduce the generated llvm instructions.
A perfect hash function is one that doesn't have any collisions. We can build one because we know all keys at compile time. This improves hashmap efficiency, even though this is likely not noticeable in our case (unless someone has a large non-English crate to test on).
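For illustration, a tiny perfect-hash map with `char` keys (a hand-picked subset; the real table is generated and much larger):
```rust
use phf::phf_map;

/// Built at compile time: no collisions, no runtime hashing state, and no
/// huge initializer closure for llvm to churn through.
static CONFUSABLES: phf::Map<char, char> = phf_map! {
    '\u{FF01}' => '!', // fullwidth exclamation mark
    '\u{FF1B}' => ';', // fullwidth semicolon
    '\u{2212}' => '-', // minus sign
};

fn confusable_for(c: char) -> Option<char> {
    CONFUSABLES.get(&c).copied()
}
```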
The original hashmap contained a lot of duplicates, which I had to remove when `phf_map` complained; I did so by sorting the keys.
The important part is that it reduces the llvm instructions generated (#3808, `RUSTFLAGS="-Csymbol-mangling-version=v0" cargo llvm-lines -p ruff --lib | head -20`):
```
Lines Copies Function name
----- ------ -------------
1740502 38973 (TOTAL)
27423 (1.6%, 1.6%) 1 (0.0%, 0.0%) ruff[cef4c65d96248843]::rules::ruff::rules::confusables::CONFUSABLES::{closure#0}
10193 (0.6%, 2.2%) 1 (0.0%, 0.0%) <ruff[cef4c65d96248843]::codes::RuleCodePrefix>::iter
8107 (0.5%, 2.6%) 1 (0.0%, 0.0%) <ruff[cef4c65d96248843]::codes::Rule>::noqa_code
7345 (0.4%, 3.0%) 1 (0.0%, 0.0%) <ruff[cef4c65d96248843]::checkers::ast::Checker as ruff_python_ast[3778b140caf21545]::visitor::Visitor>::visit_stmt
6412 (0.4%, 3.4%) 1 (0.0%, 0.0%) <<ruff[cef4c65d96248843]::settings::options::Options as serde[d89b1b632568f5a3]::de::Deserialize>::deserialize::__Visitor as serde[d89b1b632568f5a3]::de::Visitor>::visit_map::<toml_edit[7e3a6c5e67260672]::de::spanned::SpannedDeserializer<toml_edit[7e3a6c5e67260672]::de::value::ValueDeserializer>>
6412 (0.4%, 3.8%) 1 (0.0%, 0.0%) <<ruff[cef4c65d96248843]::settings::options::Options as serde[d89b1b632568f5a3]::de::Deserialize>::deserialize::__Visitor as serde[d89b1b632568f5a3]::de::Visitor>::visit_map::<toml_edit[7e3a6c5e67260672]::de::table::TableMapAccess>
6409 (0.4%, 4.2%) 1 (0.0%, 0.0%) <<ruff[cef4c65d96248843]::settings::options::Options as serde[d89b1b632568f5a3]::de::Deserialize>::deserialize::__Visitor as serde[d89b1b632568f5a3]::de::Visitor>::visit_map::<toml_edit[7e3a6c5e67260672]::de::datetime::DatetimeDeserializer>
5696 (0.3%, 4.5%) 1 (0.0%, 0.0%) <ruff[cef4c65d96248843]::checkers::ast::Checker as ruff_python_ast[3778b140caf21545]::visitor::Visitor>::visit_expr
4448 (0.3%, 4.7%) 1 (0.0%, 0.0%) ruff[cef4c65d96248843]::flake8_to_ruff::converter::convert
3702 (0.2%, 4.9%) 1 (0.0%, 0.0%) <&ruff[cef4c65d96248843]::registry::Linter as core[da82827a87f140f9]::iter::traits::collect::IntoIterator>::into_iter
3349 (0.2%, 5.1%) 1 (0.0%, 0.0%) <ruff[cef4c65d96248843]::registry::Linter>::code_for_rule
3132 (0.2%, 5.3%) 1 (0.0%, 0.0%) <ruff[cef4c65d96248843]::codes::Rule as core[da82827a87f140f9]::fmt::Debug>::fmt
3130 (0.2%, 5.5%) 1 (0.0%, 0.0%) <&str as core[da82827a87f140f9]::convert::From<&ruff[cef4c65d96248843]::codes::Rule>>::from
3130 (0.2%, 5.7%) 1 (0.0%, 0.0%) <&str as core[da82827a87f140f9]::convert::From<ruff[cef4c65d96248843]::codes::Rule>>::from
3130 (0.2%, 5.9%) 1 (0.0%, 0.0%) <ruff[cef4c65d96248843]::codes::Rule as core[da82827a87f140f9]::convert::AsRef<str>>::as_ref
3128 (0.2%, 6.0%) 1 (0.0%, 0.0%) <ruff[cef4c65d96248843]::codes::RuleIter>::get
2669 (0.2%, 6.2%) 1 (0.0%, 0.0%) <<ruff[cef4c65d96248843]::settings::options::Options as serde[d89b1b632568f5a3]::de::Deserialize>::deserialize::__Visitor as serde[d89b1b632568f5a3]::de::Visitor>::visit_seq::<toml_edit[7e3a6c5e67260672]::de::array::ArraySeqAccess>
```
After:
```
Lines Copies Function name
----- ------ -------------
1710487 38900 (TOTAL)
10193 (0.6%, 0.6%) 1 (0.0%, 0.0%) <ruff[52408f46d2058296]::codes::RuleCodePrefix>::iter
8107 (0.5%, 1.1%) 1 (0.0%, 0.0%) <ruff[52408f46d2058296]::codes::Rule>::noqa_code
7345 (0.4%, 1.5%) 1 (0.0%, 0.0%) <ruff[52408f46d2058296]::checkers::ast::Checker as ruff_python_ast[5588cd60041c8605]::visitor::Visitor>::visit_stmt
6412 (0.4%, 1.9%) 1 (0.0%, 0.0%) <<ruff[52408f46d2058296]::settings::options::Options as serde[d89b1b632568f5a3]::de::Deserialize>::deserialize::__Visitor as serde[d89b1b632568f5a3]::de::Visitor>::visit_map::<toml_edit[7e3a6c5e67260672]::de::spanned::SpannedDeserializer<toml_edit[7e3a6c5e67260672]::de::value::ValueDeserializer>>
6412 (0.4%, 2.2%) 1 (0.0%, 0.0%) <<ruff[52408f46d2058296]::settings::options::Options as serde[d89b1b632568f5a3]::de::Deserialize>::deserialize::__Visitor as serde[d89b1b632568f5a3]::de::Visitor>::visit_map::<toml_edit[7e3a6c5e67260672]::de::table::TableMapAccess>
6409 (0.4%, 2.6%) 1 (0.0%, 0.0%) <<ruff[52408f46d2058296]::settings::options::Options as serde[d89b1b632568f5a3]::de::Deserialize>::deserialize::__Visitor as serde[d89b1b632568f5a3]::de::Visitor>::visit_map::<toml_edit[7e3a6c5e67260672]::de::datetime::DatetimeDeserializer>
5696 (0.3%, 3.0%) 1 (0.0%, 0.0%) <ruff[52408f46d2058296]::checkers::ast::Checker as ruff_python_ast[5588cd60041c8605]::visitor::Visitor>::visit_expr
4448 (0.3%, 3.2%) 1 (0.0%, 0.0%) ruff[52408f46d2058296]::flake8_to_ruff::converter::convert
3702 (0.2%, 3.4%) 1 (0.0%, 0.0%) <&ruff[52408f46d2058296]::registry::Linter as core[da82827a87f140f9]::iter::traits::collect::IntoIterator>::into_iter
3349 (0.2%, 3.6%) 1 (0.0%, 0.0%) <ruff[52408f46d2058296]::registry::Linter>::code_for_rule
3132 (0.2%, 3.8%) 1 (0.0%, 0.0%) <ruff[52408f46d2058296]::codes::Rule as core[da82827a87f140f9]::fmt::Debug>::fmt
3130 (0.2%, 4.0%) 1 (0.0%, 0.0%) <&str as core[da82827a87f140f9]::convert::From<&ruff[52408f46d2058296]::codes::Rule>>::from
3130 (0.2%, 4.2%) 1 (0.0%, 0.0%) <&str as core[da82827a87f140f9]::convert::From<ruff[52408f46d2058296]::codes::Rule>>::from
3130 (0.2%, 4.4%) 1 (0.0%, 0.0%) <ruff[52408f46d2058296]::codes::Rule as core[da82827a87f140f9]::convert::AsRef<str>>::as_ref
3128 (0.2%, 4.5%) 1 (0.0%, 0.0%) <ruff[52408f46d2058296]::codes::RuleIter>::get
2669 (0.2%, 4.7%) 1 (0.0%, 0.0%) <<ruff[52408f46d2058296]::settings::options::Options as serde[d89b1b632568f5a3]::de::Deserialize>::deserialize::__Visitor as serde[d89b1b632568f5a3]::de::Visitor>::visit_seq::<toml_edit[7e3a6c5e67260672]::de::array::ArraySeqAccess>
2659 (0.2%, 4.9%) 1 (0.0%, 0.0%) <&ruff[52408f46d2058296]::codes::Pylint as core[da82827a87f140f9]::iter::traits::collect::IntoIterator>::into_iter
```
I'd assume this has a positive effect on both compile time and runtime, but I don't have a good way to measure the actual effect on compile times.
## Test Plan
Check CI for any performance regressions.
This should fix #3808 if we merge it.
* clippy
* Update update_ambiguous_characters.py
* Use dummy verbatim formatter for all nodes
* Use new formatter infrastructure in CLI and test
* Expose the new formatter in the CLI
* Merge import blocks
This adds a new rule `InvalidPyprojectToml` that lints pyproject.toml by checking whether https://github.com/PyO3/pyproject-toml-rs can parse it. The linting is currently quite basic: we don't check, for example, whether the name is actually a valid Python project name or appropriately normalized. It does catch errors such as invalid dependency requirements or problems with the license specification. It is open to being extended in the future (validating names, SPDX expressions, classifiers, ...), either in ruff or in pyproject-toml-rs. A minimal sketch of the idea follows.
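Conceptually, the rule amounts to deserializing the file with pyproject-toml-rs and surfacing any parse error as the diagnostic message. A rough sketch under that assumption (the function and wiring are illustrative, not ruff's actual code; it assumes the `pyproject-toml` and `toml` crates as dependencies):
```rust
use pyproject_toml::PyProjectToml;

/// Returns a RUF200-style message if `contents` is not a valid
/// pyproject.toml, or `None` if it parses cleanly.
fn check_pyproject(contents: &str) -> Option<String> {
    match toml::from_str::<PyProjectToml>(contents) {
        Ok(_) => None,
        Err(err) => Some(format!("Failed to parse pyproject.toml: {err}")),
    }
}

fn main() {
    // A `[project]` table without `name` violates the metadata spec,
    // so deserialization fails and the rule would fire.
    let bad = "[project]\nversion = \"0.1.0\"\n";
    assert!(check_pyproject(bad).is_some());
}
```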
Test plan:
```
scripts/ecosystem_all_check.sh check --select RUF200
```
This led to a bunch of
```
RUF200 Failed to parse pyproject.toml: missing field `name`
```
(e.g. https://github.com/amitsk/fastapi-todos/blob/main/pyproject.toml), which is indeed invalid per the spec (https://packaging.python.org/en/latest/specifications/declaring-project-metadata/#specification).
Filtering those out, the following other problems were found by `cd target/ecosystem_all_results/ && rg RUF200`:
```
UCL-ARC:rred-reports.stdout.txt
1:pyproject.toml:27:16: RUF200 Failed to parse pyproject.toml: Version specifier `>='3.9'` doesn't match PEP 440 rules
EndlessTrax:python-start-project.stdout.txt
1:pyproject.toml:14:16: RUF200 Failed to parse pyproject.toml: Expected package name starting with an alphanumeric character, found '#'
redjax:gardening-api.stdout.txt
1:pyproject.toml:7:11: RUF200 Failed to parse pyproject.toml: Version `` doesn't match PEP 440 rules
ajslater:codex.stdout.txt
2: 3:17 RUF200 Failed to parse pyproject.toml: invalid type: sequence, expected a string
LDmitriy7:404_AvatarsBot.stdout.txt
1:pyproject.toml:3:11: RUF200 Failed to parse pyproject.toml: Version `` doesn't match PEP 440 rules
ajslater:comicbox.stdout.txt
1:pyproject.toml:3:17: RUF200 Failed to parse pyproject.toml: invalid type: sequence, expected a string
manueldevillena:forecast-earnings.stdout.txt
1:pyproject.toml:24:12: RUF200 Failed to parse pyproject.toml: Expected one of `@`, `(`, `<`, `=`, `>`, `~`, `!`, `;`, found `^`
redjax:ohio_utility_scraper.stdout.txt
1:pyproject.toml:11:11: RUF200 Failed to parse pyproject.toml: Version `` doesn't match PEP 440 rules
agronholm:typeguard.stdout.txt
1:pyproject.toml:40:8: RUF200 Failed to parse pyproject.toml: Expected a valid marker name, found 'python_implementation'
cyuss:decathlon-turnover.stdout.txt
1:pyproject.toml:7:12: RUF200 Failed to parse pyproject.toml: invalid type: string "Youcef", expected a table with 'name' and 'email' keys
ajslater:boilerplate.stdout.txt
1:pyproject.toml:3:17: RUF200 Failed to parse pyproject.toml: invalid type: sequence, expected a string
kaparoo:lightning-project-template.stdout.txt
1:pyproject.toml:56:16: RUF200 Failed to parse pyproject.toml: You can't mix a >= operator with a local version (`+cu117`)
dijital20:pytexas2023-decorators.stdout.txt
1:pyproject.toml:5:11: RUF200 Failed to parse pyproject.toml: Version `` doesn't match PEP 440 rules
pfouque:django-anymail-history.stdout.txt
1:pyproject.toml:137:12: RUF200 Failed to parse pyproject.toml: Version specifier `> = 1.2.0` doesn't match PEP 440 rules
pfouque:django-fakemessages.stdout.txt
1:pyproject.toml:130:12: RUF200 Failed to parse pyproject.toml: Version specifier `> = 1.2.0` doesn't match PEP 440 rules
pypa:build.stdout.txt
1:tests/packages/test-invalid-requirements/pyproject.toml:2:12: RUF200 Failed to parse pyproject.toml: Expected one of `@`, `(`, `<`, `=`, `>`, `~`, `!`, `;`, found `i`
4:tests/packages/test-no-requires/pyproject.toml:1:1: RUF200 Failed to parse pyproject.toml: missing field `requires`
UnoYakshi:FRAAND.stdout.txt
2: 3:11 RUF200 Failed to parse pyproject.toml: Version `` doesn't match PEP 440 rules
DHolmanCoding:python-template.stdout.txt
1:pyproject.toml:22:1: RUF200 Failed to parse pyproject.toml: missing field `requires`
```
Overall, this emitted errors in 43 out of 3408 projects (counted with `rg -c RUF200 target/ecosystem_all_results/ | wc -l`).
Co-authored-by: Micha Reiser <micha@reiser.io>