Commit Graph

49 Commits

Micha Reiser 515de2d062
Move `Token`, `TokenKind` and `Tokens` to `ruff-python-ast` (#21760) 2025-12-02 20:10:46 +01:00
Brent Westbrook 38c074e67d
Catch syntax errors in nested interpolations before Python 3.12 (#20949)
Summary
--

This PR fixes the issue I added in #20867 and noticed in #20930. Cases like this cause an error on any Python version:

```py
f"{1:""}"
```

which gave me a false sense of security before. Cases like this are still invalid only before 3.12 and weren't flagged after the changes in #20867:

```py
f'{1: abcd "{'aa'}" }'
           # ^  reused quote
f'{1: abcd "{"\n"}" }'
            # ^  backslash
```

I didn't recognize these as nested interpolations that also need to be checked for invalid expressions, so filtering out the whole format spec wasn't quite right. And `elements.interpolations()` only iterates over the outermost interpolations, not the nested ones.

There's basically no code change in this PR; I just moved the existing check from `parse_interpolated_string`, which parses the entire string, to `parse_interpolated_element`. This kind of seems more natural anyway and avoids having to try to recursively visit nested elements after the fact in `parse_interpolated_string`. So viewing the diff with something like

```
git diff --color-moved --ignore-space-change --color-moved-ws=allow-indentation-change main
```

should make this more clear.

Test Plan
--

New tests
2025-10-20 09:03:13 -04:00
Brent Westbrook 8b9ab48ac6
Fix syntax error false positives for escapes and quotes in f-strings (#20867)
Summary
--

Fixes #20844 by refining the unsupported syntax error check for [PEP 701] f-strings before Python 3.12 to allow backslash escapes and escaped outer quotes in the format spec part of f-strings. These are only disallowed within the f-string expression part on earlier versions. Using the examples from the PR:

```pycon
>>> f"{1:\x64}"
'1'
>>> f"{1:\"d\"}"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Invalid format specifier '"d"' for object of type 'int'
```

Note that the second case is a runtime error, but this is actually avoidable if you override `__format__`, so despite being pretty weird, this could actually be a valid use case.

```pycon
>>> class C:
...     def __format__(*args, **kwargs): return "<C>"
...
>>> f"{C():\"d\"}"
'<C>'
```

At first I thought narrowing the range we check to exclude the format spec would only work for escapes, but it turns out that cases like `f"{1:""}"` are already covered by an existing `ParseError`, so we can just narrow the range of both our escape and quote checks.

Our comment check also seems to be working correctly because it's based on the actual tokens. A case like [this](https://play.ruff.rs/9f1c2ff2-cd8e-4ad7-9f40-56c0a524209f):

```python
f"""{1:# }"""
```

doesn't include a comment token; instead, the `#` is part of an `InterpolatedStringLiteralElement`.

Test Plan
--

New inline parser tests

[PEP 701]: https://peps.python.org/pep-0701/
2025-10-15 09:23:16 -04:00
Micha Reiser 4fc7dd300c
Improved error recovery for unclosed strings (including f- and t-strings) (#20848) 2025-10-15 09:50:56 +02:00
Ibraheem Ahmed 7abc41727b
[ty] Shrink size of `AstNodeRef` (#20028)
## Summary

Removes the `module_ptr` field from `AstNodeRef` in release mode, and changes `NodeIndex` to a `NonZeroU32` to reduce the size of `Option<AstNodeRef<_>>` fields.

I believe CI runs in debug mode, so this won't show up in the memory
report, but this reduces memory by ~2% in release mode.
2025-08-22 17:03:22 -04:00
Dylan 008bbfdf5a
Disallow implicit concatenation of t-strings and other string types (#19485)
As of [this cpython PR](https://github.com/python/cpython/pull/135996),
it is not allowed to concatenate t-strings with non-t-strings,
implicitly or explicitly. Expressions such as `"foo" t"{bar}"` are now
syntax errors.

This PR updates some AST nodes and parsing to reflect this change.

The structural change is that `TStringPart` is no longer needed, since,
as in the case of `BytesStringLiteral`, the only possibilities are that
we have a single `TString` or a vector of such (representing an implicit
concatenation of t-strings). This removes a level of nesting from many
AST expressions (which is what all the snapshot changes reflect), and
simplifies some logic in the implementation of visitors, for example.

The other change of note is in the parser. When we meet an implicit
concatenation of string-like literals, we now count the number of
t-string literals. If these do not exhaust the total number of
implicitly concatenated pieces, then we emit a syntax error. To recover
from this syntax error, we encode any t-string pieces as _invalid_
string literals (which means we flag them as invalid, record their
range, and record the value as `""`). Note that if at least one of the
pieces is an f-string we prefer to parse the entire string as an
f-string; otherwise we parse it as a string.

This logic is exactly the same as how we currently treat
`BytesStringLiteral` parsing and error recovery - and carries with it
the same pros and cons.

Finally, note that I have not implemented any changes in the
implementation of the formatter. As far as I can tell, none are needed.
I did change a few of the fixtures so that we are always concatenating
t-strings with t-strings.
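
To make the new rule concrete, a hedged sketch (my own examples, not taken from the PR's fixtures):

```py
name = "world"
greeting = t"hello " t"{name}"  # implicit concatenation of two t-strings: still allowed
broken = "hello " t"{name}"     # SyntaxError: t-strings can't be concatenated with other string types
```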
2025-07-27 12:41:03 +00:00
Dylan 53fc0614da
Fix `unreachable` panic in parser (#19183)
Parsing the (invalid) expression `f"{\t"i}"` caused a panic because the `TStringMiddle` character was "unreachable" due to the way the parser recovered from the line continuation (it ate the t-string start).

The cause of the issue is as follows: 

The parser begins parsing the f-string and expects to see a list of
objects, essentially alternating between _interpolated elements_ and
ordinary strings. It is happy to see the first left brace, but then
there is a lexical error caused by the line-continuation character. So
instead of the parser seeing a list of elements with just one member, it
sees a list that starts like this:

- Interpolated element with an invalid token, stored as a `Name`
- Something else built from tokens beginning with `TStringStart` and
`TStringMiddle`

When it sees the `TStringStart` error recovery says "that's a list
element I don't know what to do with, let's skip it". When it sees
`TStringMiddle` it says "oh, that looks like the middle of _some
interpolated string_ so let's try to parse it as one of the literal
elements of my `FString`". Unfortunately, the function being used to
parse individual list elements thinks (arguably correctly) that it's not
possible to have a `TStringMiddle` sitting in your `FString`, and hits
`unreachable`.

Two potential ways (among many) to solve this issue are:

1. Allow a `TStringMiddle` as a valid "literal" part of an f-string
during parsing (with the hope/understanding that this would only occur
in an invalid context)
2. Skip the `TStringMiddle` as an "unexpected/invalid list item" in the
same way that we skipped `TStringStart`.

I have opted for the second approach since it seems somehow more morally
correct, even though it loses more information. To implement this, the
recovery context needs to know whether we are in an f-string or t-string
- hence the changes to that enum. As a bonus we get slightly more
specific error messages in some cases.

Closes #18860
2025-07-20 22:04:14 +00:00
Dylan c5b58187da
Add syntax error when conversion flag does not immediately follow exclamation mark (#18706)
Closes #18671

Note that while this has, I believe, always been invalid syntax, it was
reported as a different syntax error until Python 3.12:

Python 3.11:

```pycon
>>> x = 1
>>> f"{x! s}"
  File "<stdin>", line 1
    f"{x! s}"
             ^
SyntaxError: f-string: invalid conversion character: expected 's', 'r', or 'a'
```

Python 3.12:

```pycon
>>> x = 1
>>> f"{x! s}"
  File "<stdin>", line 1
    f"{x! s}"
        ^^^
SyntaxError: f-string: conversion type must come right after the exclamanation mark
```
2025-06-16 11:44:42 -05:00
Dylan 1889a5e6eb
[syntax-errors] Raise unsupported syntax error for template strings prior to Python 3.14 (#18664)
Closes #18662

One question is whether we would like the range to exclude the quotes?
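
For context, a minimal example of the construct being flagged (assuming the usual PEP 750 syntax; not taken from the PR):

```py
name = "world"
template = t"hello {name}"  # template string literal (PEP 750); a syntax error before Python 3.14
```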
2025-06-13 14:04:37 -05:00
Ibraheem Ahmed c9dff5c7d5
[ty] AST garbage collection (#18482)
## Summary

Garbage collect ASTs once we are done checking a given file. Queries
with a cross-file dependency on the AST will reparse the file on demand.
This reduces ty's peak memory usage by ~20-30%.

The primary change of this PR is adding a `node_index` field to every
AST node, that is assigned by the parser. `ParsedModule` can use this to
create a flat index of AST nodes any time the file is parsed (or
reparsed). This allows `AstNodeRef` to simply index into the current
instance of the `ParsedModule`, instead of storing a pointer directly.

The indices are somewhat hackily (using an atomic integer) assigned by
the `parsed_module` query instead of by the parser directly. Assigning
the indices in source-order in the (recursive) parser turns out to be
difficult, and collecting the nodes during semantic indexing is
impossible as `SemanticIndex` does not hold onto a specific
`ParsedModuleRef`, which the pointers in the flat AST are tied to. This
means that we have to do an extra AST traversal to assign and collect
the nodes into a flat index, but the small performance impact (~3% on
cold runs) seems worth it for the memory savings.

Part of https://github.com/astral-sh/ty/issues/214.
2025-06-13 08:40:11 -04:00
Dylan 9bbf4987e8
Implement template strings (#17851)
This PR implements template strings (t-strings) in the parser and
formatter for Ruff.

Minimal changes necessary to compile were made in other parts of the code (e.g. ty, the linter, etc.). These will be covered properly in follow-up PRs.
2025-05-30 15:00:56 -05:00
Micha Reiser 9ae698fe30
Switch to Rust 2024 edition (#18129) 2025-05-16 13:25:28 +02:00
Abhijeet Prasad Bodas f5096f2050
[parser] Flag single unparenthesized generator expr with trailing comma in arguments. (#17893)
Fixes #17867

## Summary

The CPython parser does not allow generator expressions which are the
sole arguments in an argument list to have a trailing comma.
With this change, we start flagging such instances.

## Test Plan

Added new inline tests.
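
A hedged illustration of the rule (my own examples, not from the PR's tests):

```py
sum(x for x in range(10))     # OK: sole unparenthesized generator argument
sum((x for x in range(10)),)  # OK: the generator is parenthesized
sum(x for x in range(10),)    # SyntaxError: trailing comma after a sole unparenthesized generator
```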
2025-05-07 14:11:35 -04:00
Micha Reiser 8a4158c5f8
Upgrade to Rust 1.86 and bump MSRV to 1.84 (#17171)

## Summary

I decided to disable the new
[`needless_continue`](https://rust-lang.github.io/rust-clippy/master/index.html#needless_continue)
rule because I often found the explicit `continue` more readable than an empty block or than inverting the condition of another branch.


## Test Plan

`cargo test`

---------

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
2025-04-03 15:59:44 +00:00
Brent Westbrook 2711e08eb8
[syntax-errors] Fix false positive for parenthesized tuple index (#16948)
Summary
--

Fixes #16943 by checking if the tuple is not parenthesized before
emitting an error.

Test Plan
--

New inline test based on the initial report
2025-03-24 10:34:38 -04:00
Junhson Jean-Baptiste 2a4d835132
Use the common `OperatorPrecedence` for the parser (#16747)
## Summary

This change continues to resolve #16071 (and continues the work started
in #16162). Specifically, this PR changes the code in the parser so that
it uses the `OperatorPrecedence` struct from `ruff_python_ast` instead
of its own version. This is part of an effort to get rid of the
redundant definitions of `OperatorPrecedence` throughout the codebase.

Note that this PR only makes this change for `ruff_python_parser` -- we
still want to make a similar change for the formatter (namely the
`OperatorPrecedence` defined in the expression part of the formatter,
the pattern one is different). I separated the work to keep the PRs
small and easily reviewable.

## Test Plan

Because this is an internal change, I didn't add any additional tests.
Existing tests do pass.
2025-03-21 09:40:37 +05:30
Brent Westbrook dcf31c9348
[syntax-errors] PEP 701 f-strings before Python 3.12 (#16543)
## Summary

This PR detects the use of PEP 701 f-strings before 3.12. This one
sounded difficult and ended up being pretty easy, so I think there's a
good chance I've over-simplified things. However, from experimenting in
the Python REPL and checking with [pyright], I think this is correct.
pyright actually doesn't even flag the comment case, but Python does.

I also checked pyright's implementation for
[quotes](98dc4469cc/packages/pyright-internal/src/analyzer/checker.ts (L1379-L1398))
and
[escapes](98dc4469cc/packages/pyright-internal/src/analyzer/checker.ts (L1365-L1377))
and think I've approximated how they do it.

Python's error messages also point to the simple approach of these
characters simply not being allowed:

```pycon
Python 3.11.11 (main, Feb 12 2025, 14:51:05) [Clang 19.1.6 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f'''multiline {
... expression # comment
... }'''
  File "<stdin>", line 3
    }'''
        ^
SyntaxError: f-string expression part cannot include '#'
>>> f'''{not a line \
... continuation}'''
  File "<stdin>", line 2
    continuation}'''
                    ^
SyntaxError: f-string expression part cannot include a backslash
>>> f'hello {'world'}'
  File "<stdin>", line 1
    f'hello {'world'}'
              ^^^^^
SyntaxError: f-string: expecting '}'
```

And since escapes aren't allowed, I don't think there are any tricky
cases where nested quotes or comments can sneak in.

It's also slightly annoying that the error is repeated for every nested
quote character, but that also mirrors pyright, although they highlight
the whole nested string, which is a little nicer. However, their check
is in the analysis phase, so I don't think we have such easy access to
the quoted range, at least without adding another mini visitor.

## Test Plan

New inline tests

[pyright]:
https://pyright-play.net/?pythonVersion=3.11&strict=true&code=EYQw5gBAvBAmCWBjALgCgO4gHaygRgEoAoEaCAIgBpyiiBiCLAUwGdknYIBHAVwHt2LIgDMA5AFlwSCJhwAuCAG8IoMAG1Rs2KIC6EAL6iIxosbPmLlq5foRWiEAAcmERAAsQAJxAomnltY2wuSKogA6WKIAdABWfPBYqCAE%2BuSBVqbpWVm2iHwAtvlMWMgB2ekiolUAgq4FjgA2TAAeEMieSADWCsoV5qoaqrrGDJ5MiDz%2B8ABuLqosAIREhlXlaybrmyYMXsDw7V4AnoysyAmQ5SIhwYo3d9cheADUeKlv5O%2BpQA
2025-03-18 11:12:15 -04:00
Brent Westbrook 3a32e56445
[syntax-errors] Unparenthesized assignment expressions in sets and indexes (#16404)
## Summary
This PR detects unparenthesized assignment expressions used in set
literals and comprehensions and in sequence indexes. The link to the
release notes in https://github.com/astral-sh/ruff/issues/6591 just has
this entry:
> * Assignment expressions can now be used unparenthesized within set
literals and set comprehensions, as well as in sequence indexes (but not
slices).

with no other information, so hopefully the test cases I came up with
cover all of the changes. I also tested these out in the Python REPL and
they actually worked in Python 3.9 too. I'm guessing this may be another
case that was "formally made part of the language spec in Python 3.10,
but usable -- and commonly used -- in Python >=3.9" as @AlexWaygood
added to the body of #6591 for context managers. So we may want to
change the version cutoff, but I've gone along with the release notes
for now.

## Test Plan

New inline parser tests and linter CLI tests.
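
A hedged sketch of the constructs involved (my own examples, following the release-note wording above):

```py
s = {y := 2, 3}                    # unparenthesized walrus in a set literal
c = {last := x for x in range(3)}  # ...and in a set comprehension

lst = [10, 20, 30]
lst[i := 1]                        # ...and in a sequence index
lst[(i := 1):2]                    # but slices still require the parentheses
```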
2025-03-14 15:06:42 -04:00
Brent Westbrook 4f2851982d
[syntax-errors] Star expression in index before Python 3.11 (#16544)
Summary
--

This PR detects tuple unpacking expressions in index/subscript
expressions before Python 3.11.

Test Plan
--

New inline tests
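
For reference, the construct being detected (a minimal example, assuming the PEP 646 subscript syntax):

```py
data = {(1, 2): "ok"}
indices = (1, 2)
data[*indices]  # starred expression in a subscript; PEP 646, valid from Python 3.11
```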
2025-03-14 14:51:34 +00:00
Brent Westbrook 2382fe1f25
[syntax-errors] Tuple unpacking in `for` statement iterator clause before Python 3.9 (#16558)
Summary
--

This PR reuses a slightly modified version of the
`check_tuple_unpacking` method added for detecting unpacking in `return`
and `yield` statements to detect the same issue in the iterator clause
of `for` loops.

I ran into the same issue with a bare `for x in *rest: ...` example
(invalid even on Python 3.13) and added it as a comment on
https://github.com/astral-sh/ruff/issues/16520.

I considered just making this an additional `StarTupleKind` variant as
well, but this change was in a different version of Python, so I kept it
separate.

Test Plan
--

New inline tests.
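
A hedged example of the two shapes discussed above (mine, not from the PR):

```py
for x in *a, *b:  # unpacking in the iterator clause; valid from Python 3.9
    ...

for x in *rest:   # bare starred iterator; invalid on every Python version
    ...
```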
2025-03-13 15:55:17 -04:00
Brent Westbrook b3c884f4f3
[syntax-errors] Parenthesized keyword argument names after Python 3.8 (#16482)
Summary
--

Unlike the other syntax errors detected so far, parenthesized keyword
arguments are only allowed *before* 3.8. It sounds like they were only
accidentally allowed before that [^1].

As an aside, you get a pretty confusing error from Python for this, so
it's nice that we can catch it:

```pycon
>>> def f(**kwargs): ...
... f((a)=1)
...
  File "<python-input-0>", line 2
    f((a)=1)
       ^^^
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
>>>
```
Test Plan
--
Inline tests.

[^1]: https://github.com/python/cpython/issues/78822
2025-03-06 12:18:13 -05:00
Brent Westbrook 6c14225c66
[syntax-errors] Tuple unpacking in `return` and `yield` before Python 3.8 (#16485)
Summary
--

Checks for tuple unpacking in `return` and `yield` statements before
Python 3.8, as described [here].

Test Plan
--
Inline tests.

[here]: https://github.com/python/cpython/issues/76298
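
A minimal sketch of the constructs (my own example, following the linked CPython issue):

```py
def f(rest):
    return 1, *rest  # unparenthesized unpacking in `return`; valid from Python 3.8

def g(rest):
    yield 1, *rest   # ...and in `yield`
```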
2025-03-06 11:57:20 -05:00
Brent Westbrook 318f503714
[syntax-errors] Named expressions in decorators before Python 3.9 (#16386)
Summary
--

This PR detects the relaxed grammar for decorators proposed in [PEP
614](https://peps.python.org/pep-0614/) on Python 3.8 and lower.

The 3.8 grammar for decorators is
[here](https://docs.python.org/3.8/reference/compound_stmts.html#grammar-token-decorators):

```
decorators                ::=  decorator+
decorator                 ::=  "@" dotted_name ["(" [argument_list [","]] ")"] NEWLINE
dotted_name               ::=  identifier ("." identifier)*
```

in contrast to the current grammar
[here](https://docs.python.org/3/reference/compound_stmts.html#grammar-token-python-grammar-decorators)

```
decorators                ::= decorator+
decorator                 ::= "@" assignment_expression NEWLINE
assignment_expression ::= [identifier ":="] expression
```

Test Plan
--

New inline parser tests.
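
For illustration, a decorator allowed only by the relaxed grammar (adapted from PEP 614's motivating example; assumes a `buttons` list):

```py
@buttons[0].clicked.connect  # arbitrary expression; rejected by the 3.8 dotted-name grammar
def on_click():
    ...
```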
2025-03-05 17:08:18 +00:00
Brent Westbrook e924ecbdac
[syntax-errors] `except*` before Python 3.11 (#16446)
Summary
--

One of the simpler ones, just detect the use of `except*` before 3.11.

Test Plan
--

New inline tests.
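
The construct in question, for reference (a minimal example, not from the PR):

```py
try:
    ...
except* ValueError as eg:  # except* (PEP 654 exception groups); requires Python 3.11+
    ...
```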
2025-03-02 18:20:18 +00:00
Brent Westbrook 4431978262
[syntax-errors] Assignment expressions before Python 3.8 (#16383)
## Summary
This PR is the first in a series derived from
https://github.com/astral-sh/ruff/pull/16308, each of which adds support
for detecting one version-related syntax error from
https://github.com/astral-sh/ruff/issues/6591. This one should be
the largest because it also includes the addition of the
`Parser::add_unsupported_syntax_error` method.

Otherwise I think the general structure will be the same for each syntax
error:
* Detecting the error in the parser
* Inline parser tests for the new error
* New ruff CLI tests for the new error

## Test Plan
As noted above, there are new inline parser tests, as well as new ruff
CLI
tests. Once https://github.com/astral-sh/ruff/pull/16379 is resolved,
there should also be new mdtests for red-knot,
but this PR does not currently include those.
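
For reference, the construct being detected (a minimal example, not from the PR):

```py
if (n := 10) > 5:  # assignment expression (PEP 572); a syntax error before Python 3.8
    print(n)
```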
2025-02-28 17:13:46 -05:00
Brent Westbrook 97d0659ce3
Pass `ParserOptions` to the parser (#16220)
## Summary

This is part of the preparation for detecting syntax errors in the
parser from https://github.com/astral-sh/ruff/pull/16090/. As suggested
in [this
comment](https://github.com/astral-sh/ruff/pull/16090/#discussion_r1953084509),
I started working on a `ParseOptions` struct that could be stored in the
parser. For this initial refactor, I only made it hold the existing
`Mode` option, but for syntax errors, we will also need it to have a
`PythonVersion`. For that use case, I'm picturing something like a
`ParseOptions::with_python_version` method, so you can extend the
current calls to something like

```rust
ParseOptions::from(mode).with_python_version(settings.target_version)
```

But I thought it was worth adding `ParseOptions` alone without changing
any other behavior first.

Most of the diff is just updating call sites taking `Mode` to take
`ParseOptions::from(Mode)` or those taking `PySourceType`s to take
`ParseOptions::from(PySourceType)`. The interesting changes are in the
new `parser/options.rs` file and smaller parts of `parser/mod.rs` and
`ruff_python_parser/src/lib.rs`.

## Test Plan

Existing tests, this should not change any behavior.
2025-02-19 10:50:50 -05:00
Alex Waygood cb71393332
Simplify the `StringFlags` trait (#15944) 2025-02-04 18:14:28 +00:00
Micha Reiser 138e70bd5c
Upgrade to Rust 1.80 (#12586) 2024-07-30 19:18:08 +00:00
Dhruv Manilawala 978909fcf4
Raise syntax error for unparenthesized generator expr in multi-argument call (#12445)
## Summary

This PR fixes a bug so that a syntax error is raised when an unparenthesized generator expression is used as an argument to a call with more than one argument.

For reference, the grammar is:
```
primary:
    | ...
    | primary genexp 
    | primary '(' [arguments] ')' 
    | ...

genexp:
    | '(' ( assignment_expression | expression !':=') for_if_clauses ')' 
```

The `genexp` requires the parentheses, as mentioned in the grammar. So, the grammar for a call expression is either a name followed by a generator expression or a name followed by a list of arguments. In the former case, the parentheses are omitted because the generator expression provides them, while in the latter case, the parentheses are explicitly provided for a list of arguments, which means that the generator expression requires its own parentheses.

This was discovered in https://github.com/astral-sh/ruff/issues/12420.

## Test Plan

Add test cases for valid and invalid syntax.

Make sure that the parser from CPython also raises this at the parsing
step:
```console
$ python3.13 -m ast parser/_.py
  File "parser/_.py", line 1
    total(1, 2, x for x in range(5), 6)
                ^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized

$ python3.13 -m ast parser/_.py
  File "parser/_.py", line 1
    sum(x for x in range(10), 10)
        ^^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized
```
2024-07-22 14:44:20 +05:30
Micha Reiser 5109b50bb3
Use `CompactString` for `Identifier` (#12101) 2024-07-01 10:06:02 +02:00
Micha Reiser da78de0439
Remove allocation in `parse_identifier` (#12103) 2024-06-29 15:00:24 +02:00
renovate[bot] 53a80a5c11
Update Rust crate rustc-hash to v2 (#12001) 2024-06-23 20:46:42 -04:00
Dhruv Manilawala 96da136e6a
Move token and error structs into related modules (#11957)
## Summary

This PR does some housekeeping, moving certain structs into related modules. Specifically,
1. Move `LexicalError` from `lexer.rs` to `error.rs` which also contains
the `ParseError`
2. Move `Token`, `TokenFlags` and `TokenValue` from `lexer.rs` to
`token.rs`
2024-06-21 10:07:19 +00:00
Dhruv Manilawala a525b4be3d
Separate terminator token for f-string elements kind (#11842)
## Summary

This PR separates the terminator token for f-string elements depending on the context. A list of f-string elements can occur either in a regular f-string or in the format spec of an f-string. The terminator token is different depending on that context.
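
A hedged illustration of the two contexts (my own example):

```py
width = 10
f"{1:{width}}"  # the format spec `{width}` is itself a list of f-string
                # elements, terminated by `}` rather than by the closing quote
```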

## Test Plan

`cargo insta test` and verify the updated snapshots.
2024-06-12 13:57:35 +05:30
Dhruv Manilawala 6c1fa1d440
Use speculative parsing for with-items (#11770)
## Summary

This PR updates the with-items parsing logic to use speculative parsing
instead.

### Existing logic

First, let's understand the previous logic:
1. The parser sees `(`, it doesn't know whether it's part of a
parenthesized with items or a parenthesized expression
2. Consider it a parenthesized with items and perform a hand-rolled
speculative parsing
3. Then, verify the assumption and if it's incorrect convert the parsed
with items into an appropriate expression which becomes part of the
first with item

Here, in (3) there are lots of edge cases which we have to deal with:
1. Trailing comma with a single element should be [converted to the
expression as
is](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2140-L2153))
2. Trailing comma with multiple elements should be [converted to a tuple
expression](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2155-L2178))
3. Limit the allowed expression based on whether it's
[(1)](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2144-L2152))
or
[(2)](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2157-L2171))
4. [Consider postfix
expressions](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2181-L2200))
after (3)
5. [Consider `if`
expressions](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2203-L2208))
after (3)
6. [Consider binary
expressions](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2210-L2228))
after (3)

Consider other cases like
* [Single generator
expression](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2020-L2035))
* [Expecting a
comma](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2122-L2130))

And, this is all possible only if we allow parsing these expressions in
the [with item parsing
logic](9b2cf569b2/crates/ruff_python_parser/src/parser/statement.rs (L2287-L2334)).

### Speculative parsing

With #11457 merged, we can simplify this logic by changing step (3) from above to just rewind the parser back to the `(` if our assumption (parenthesized with-items) was incorrect, and then continue parsing it as a parenthesized expression.

This also behaves a lot like what a PEG parser does, which is to consider the first grammar rule and, if it fails, consider the second grammar rule, and so on.
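
To make the ambiguity concrete, two statements that both start with `with (` (examples assumed, not from the PR):

```py
with (open("a") as f, open("b") as g):  # parenthesized with-items
    ...

with (x for x in y):  # parenthesized expression (a generator) as the context manager
    ...
```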

resolves: #11639 

## Test Plan

- [x] Verify the updated snapshots
- [x] Run the fuzzer on around 3000 valid source code (locally)
2024-06-06 08:59:56 +00:00
Dhruv Manilawala bf5b62edac
Maintain synchronicity between the lexer and the parser (#11457)
## Summary

This PR updates the entire parser stack in multiple ways:

### Make the lexer lazy

* https://github.com/astral-sh/ruff/pull/11244
* https://github.com/astral-sh/ruff/pull/11473

Previously, Ruff's lexer would act as an iterator. The parser would
collect all the tokens in a vector first and then process the tokens to
create the syntax tree.

The first task in this project is to update the entire parsing flow to
make the lexer lazy. This includes the `Lexer`, `TokenSource`, and
`Parser`. For context, the `TokenSource` is a wrapper around the `Lexer`
to filter out the trivia tokens[^1]. Now, the parser will ask the token
source to get the next token and only then the lexer will continue and
emit the token. This means that the lexer needs to be aware of the
"current" token. When the `next_token` is called, the current token will
be updated with the newly lexed token.

The main motivation to make the lexer lazy is to allow re-lexing a token
in a different context. This is going to be really useful to make the
parser error resilient. For example, currently the emitted tokens remain the same even if the parser can recover from an unclosed parenthesis. This is important because the lexer emits a `NonLogicalNewline` in a parenthesized context but a normal `Newline` in a non-parenthesized context. These different kinds of newlines are also used to emit the indentation tokens, which are important for the parser as they're used to determine the start and end of a block.

Additionally, this allows us to implement the following functionalities:
1. Checkpoint - rewind infrastructure: The idea here is to create a
checkpoint and continue lexing. At a later point, this checkpoint can be
used to rewind the lexer back to the provided checkpoint.
2. Remove the `SoftKeywordTransformer` and instead use lookahead or
speculative parsing to determine whether a soft keyword is a keyword or
an identifier
3. Remove the `Tok` enum. The `Tok` enum represents the tokens emitted
by the lexer but it contains owned data which makes it expensive to
clone. The new `TokenKind` enum just represents the type of token which
is very cheap.

This brings up a question as to how will the parser get the owned value
which was stored on `Tok`. This will be solved by introducing a new
`TokenValue` enum which only contains a subset of token kinds which has
the owned value. This is stored on the lexer and is requested by the
parser when it wants to process the data. For example:
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L1260-L1262)

[^1]: Trivia tokens are `NonLogicalNewline` and `Comment`

### Remove `SoftKeywordTransformer`

* https://github.com/astral-sh/ruff/pull/11441
* https://github.com/astral-sh/ruff/pull/11459
* https://github.com/astral-sh/ruff/pull/11442
* https://github.com/astral-sh/ruff/pull/11443
* https://github.com/astral-sh/ruff/pull/11474

For context,
https://github.com/RustPython/RustPython/pull/4519/files#diff-5de40045e78e794aa5ab0b8aacf531aa477daf826d31ca129467703855408220
added support for soft keywords in the parser which uses infinite
lookahead to classify a soft keyword as a keyword or an identifier. This
is a brilliant idea as it basically wraps the existing Lexer and works
on top of it which means that the logic for lexing and re-lexing a soft
keyword remains separate. The change here is to remove
`SoftKeywordTransformer` and let the parser determine this based on
context, lookahead and speculative parsing.

* **Context:** The transformer needs to know the position of the lexer
between it being at a statement position or a simple statement position.
This is because a `match` token starts a compound statement while a
`type` token starts a simple statement. **The parser already knows
this.**
* **Lookahead:** Now that the parser knows the context it can perform
lookahead of up to two tokens to classify the soft keyword. The logic
for this is mentioned in the PRs implementing it for the `type` and `match` soft keywords.
* **Speculative parsing:** This is where the checkpoint - rewind
infrastructure helps. For `match` soft keyword, there are certain cases
for which we can't classify based on lookahead. The idea here is to
create a checkpoint and keep parsing. Based on whether the parsing was
successful and what tokens are ahead we can classify the remaining
cases. Refer to #11443 for more details.

If the soft keyword is being parsed in an identifier context, it'll be
converted to an identifier and the emitted token will be updated as
well. Refer
8196720f80/crates/ruff_python_parser/src/parser/expression.rs (L487-L491).

The `case` soft keyword doesn't require any special handling because
it'll be a keyword only in the context of a match statement.

### Update the parser API

* https://github.com/astral-sh/ruff/pull/11494
* https://github.com/astral-sh/ruff/pull/11505

Now that the lexer is in sync with the parser, and the parser helps to
determine whether a soft keyword is a keyword or an identifier, the
lexer cannot be used on its own. The reason being that it's not
sensitive to the context (which is correct). This means that the parser
API needs to be updated to not allow any access to the lexer.

Previously, there were multiple ways to parse the source code:
1. Passing the source code itself
2. Or, passing the tokens

Now that the lexer and parser are working together, the API
corresponding to (2) cannot exist. The final API is mentioned in this
PR description: https://github.com/astral-sh/ruff/pull/11494.

### Refactor the downstream tools (linter and formatter)

* https://github.com/astral-sh/ruff/pull/11511
* https://github.com/astral-sh/ruff/pull/11515
* https://github.com/astral-sh/ruff/pull/11529
* https://github.com/astral-sh/ruff/pull/11562
* https://github.com/astral-sh/ruff/pull/11592

And, the final set of changes involves updating all references of the
lexer and `Tok` enum. This was done in two-parts:
1. Update all the references in a way that doesn't require any changes
from this PR i.e., it can be done independently
	* https://github.com/astral-sh/ruff/pull/11402
	* https://github.com/astral-sh/ruff/pull/11406
	* https://github.com/astral-sh/ruff/pull/11418
	* https://github.com/astral-sh/ruff/pull/11419
	* https://github.com/astral-sh/ruff/pull/11420
	* https://github.com/astral-sh/ruff/pull/11424
2. Update all the remaining references to use the changes made in this
PR

For (2), there were various strategies used:
1. Introduce a new `Tokens` struct which wraps the token vector and adds
methods to query a certain subset of tokens. These include:
	1. `up_to_first_unknown`, which replaces the `tokenize` function
	2. `in_range` and `after`, which replace the `lex_starts_at` function:
the former returns the tokens within the given range while the latter
returns all the tokens after the given offset
2. Introduce a new `TokenFlags` which is a set of flags to query certain
information from a token. Currently, this information is only limited to
any string type token but can be expanded to include other information
in the future as needed. https://github.com/astral-sh/ruff/pull/11578
3. Move the `CommentRanges` to the parsed output because this
information is common to both the linter and the formatter. This removes
the need for `tokens_and_ranges` function.

## Test Plan

- [x] Update and verify the test snapshots
- [x] Make sure the entire test suite is passing
- [x] Make sure there are no changes in the ecosystem checks
- [x] Run the fuzzer on the parser
- [x] Run this change on dozens of open-source projects

### Running this change on dozens of open-source projects

Refer to the PR description to get the list of open source projects used
for testing.

Now, the following tests were done between `main` and this branch:
1. Compare the output of `--select=E999` (syntax errors)
2. Compare the output of default rule selection
3. Compare the output of `--select=ALL`

**Conclusion: all output were same**

## What's next?

The next step is to introduce re-lexing logic and update the parser to
feed the recovery information to the lexer so that it can emit the
correct token. This moves us one step closer to having error resilience
in the parser and provides Ruff the possibility to lint even if the
source code contains syntax errors.
2024-06-03 18:23:50 +05:30
Dhruv Manilawala e28e737296
Update `FStringElements` to deref to a slice (#11570)
Ref: https://github.com/astral-sh/ruff/pull/11400#discussion_r1615600354
2024-05-27 15:52:13 +00:00
Dhruv Manilawala 4b41e4de7f
Create a newtype wrapper around `Vec<FStringElement>` (#11400)
## Summary

This PR adds a newtype wrapper around `Vec<FStringElement>` that derefs
to a `&Vec<FStringElement>`.

Both f-string and format specifier are made up of `Vec<FStringElement>`.
By creating a newtype wrapper around it, we can share the methods for
both parent types.
2024-05-13 16:04:04 +00:00
Dimitri Papadopoulos Orfanos 3b0584449d
Fix a few typos found by codespell (#11404)
## Summary

Just fix typos.

## Test Plan

CI jobs.

---------

Co-authored-by: Dhruv Manilawala <dhruvmanila@gmail.com>
2024-05-13 13:22:35 +00:00
Dhruv Manilawala 6ecb4776de
Rename `AnyStringKind` -> `AnyStringFlags` (#11405)
## Summary

This PR renames `AnyStringKind` to `AnyStringFlags` and `AnyStringFlags`
to `AnyStringFlagsInner`.

The main motivation is to have consistent usage of "kind" and "flags".
For each string kind, it's "flags" like `StringLiteralFlags`,
`BytesLiteralFlags`, and `FStringFlags` but it was `AnyStringKind` for
the "any" variant.
2024-05-13 13:18:07 +00:00
Alex Waygood 6774f27f4b
Refactor the `ExprDict` node (#11267)
Co-authored-by: Micha Reiser <micha@reiser.io>
2024-05-07 11:46:10 +00:00
Dhruv Manilawala 38d2562f41
Refactor unary expression parsing (#11088)
## Summary

This PR refactors unary expression parsing with the following changes:
* Ability to get `OperatorPrecedence` from a unary operator (`UnaryOp`)
* Implement methods on `TokenKind`
	* Add `as_unary_operator` which returns an `Option<UnaryOp>`
	* Add `as_unary_arithmetic_operator` which returns an `Option<UnaryOp>` (used for pattern parsing)
	* Rename `is_unary` to `is_unary_arithmetic_operator` (used in the linter)

resolves: #10752 

## Test Plan

Verify that the existing test cases pass, no ecosystem changes, run the
Python based fuzzer on 3000 random inputs and run it on dozens of
open-source repositories.
2024-04-23 04:55:02 +00:00
Dhruv Manilawala 7eba967e16
Refactor binary expression parsing (#11073)
## Summary

This PR refactors the binary expression parsing in a way to make it
readable and easy to understand. It draws inspiration from the suggested
edits in the linked messages in #10752.

### Changes

* Ability to get the precedence of an operator
	* From a boolean operator (`BoolOp`) to `OperatorPrecedence`
	* From a binary operator (`Operator`) to `OperatorPrecedence`
	* No comparison operator because all of them have the same precedence
* Implement methods on `TokenKind` to convert it to an appropriate operator enum
	* Add `as_boolean_operator` which returns an `Option<BoolOp>`
	* Add `as_binary_operator` which returns an `Option<Operator>`
	* No `as_comparison_operator` because it requires lookahead and I'm not sure if `token.as_comparison_operator(peek)` is a good way to implement it
* Introduce `BinaryLikeOperator`
	* Constructed from two tokens using the methods from the second point
	* Add a `precedence` method using the conversion methods mentioned in the first point
* Make most of the functions in `TokenKind` private to the module
* Use `self` instead of `&self` for `TokenKind`

fixes: #11072

## Test Plan

Refer #11088
2024-04-23 04:42:40 +00:00
Dhruv Manilawala 5b81b8368d
Make associativity a property of operator precedence (#11065)
## Summary

This PR does a few things but the main change is that it makes
associativity a property of operator precedence.

1. Rename `Precedence` -> `OperatorPrecedence`
2. Rename `parse_expression_with_precedence` ->
`parse_binary_expression_or_higher`
3. Move `current_binding_power` to `OperatorPrecedence::try_from_tokens`
[^1]
4. Add an `OperatorPrecedence::is_right_associative` method
5. Move from `increment_precedence` to using `<=` / `<` to check if the
parsing loop needs to stop [^2] (see the sketch below)

[^1]: Another alternative would be to have two separate methods to avoid
lookahead as it's required only for one case (`not in`). So,
`try_from_current_token(current).or_else(|| try_from_next_token(current,
peek))`
[^2]: This will allow us to easily make the refactors mentioned in
#10752
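
For illustration, a self-contained sketch (not ruff's actual implementation) of precedence climbing where associativity is a property of the precedence level, using `<` / `<=` to stop the loop:

```py
# Hypothetical toy version of the idea; ruff's real parser works on tokens
# and `OperatorPrecedence`, not strings.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "**": 3}
RIGHT_ASSOCIATIVE = {"**"}

def parse_expr(tokens, pos=0, min_prec=1):
    lhs, pos = tokens[pos], pos + 1
    while pos < len(tokens):
        op = tokens[pos]
        prec = PRECEDENCE.get(op)
        if prec is None or prec < min_prec:
            break  # the operator binds less tightly than allowed; stop this loop
        # Right-associative operators recurse at the same level so that
        # `2 ** 3 ** 2` parses as `2 ** (3 ** 2)`; left-associative ones
        # recurse one level higher, which makes the inner loop stop on `<=`.
        next_min = prec if op in RIGHT_ASSOCIATIVE else prec + 1
        rhs, pos = parse_expr(tokens, pos + 1, next_min)
        lhs = (op, lhs, rhs)
    return lhs, pos

tree, _ = parse_expr("2 ** 3 ** 2".split())
assert tree == ("**", "2", ("**", "3", "2"))
```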

## Test Plan

Make sure the precedence parsing algorithm is still correct by running
the test suite, fuzz testing it and running it against a dozen or so
open-source repositories.
2024-04-23 04:28:46 +00:00
Dhruv Manilawala c30735d4a7
Add `ExpressionContext` for expression parsing (#11055)
## Summary

This PR adds a new `ExpressionContext` struct which is used in
expression parsing.

This solves the following problem:
1. Allowing starred expression with different precedence
2. Allowing yield expression in certain context
3. Remove ambiguity with `in` keyword when parsing a `for ... in`
statement

For context, (1) was solved by adding `parse_star_expression_list` and
`parse_star_expression_or_higher` in #10623, (2) was solved by by adding
`parse_yield_expression_or_else` in #10809, and (3) was fixed in #11009.
All of the mentioned functions have been removed in favor of the context
flags.

As mentioned in #11009, an ideal solution would be to implement an
expression context which is what this PR implements. This is passed
around as function parameter and the call stack is used to automatically
reset the context.

### Recovery

How should the parser recover if the target expression is invalid when
an expression can consume the `in` keyword?

1. Should the `in` keyword be part of the target expression?
2. Or, should the expression parsing stop as soon as `in` keyword is
encountered, no matter the expression?

For example:
```python
for yield x in y: ...

# Here, should this be parsed as
for (yield x) in (y): ...
# Or
for (yield x in y): ...
# where the `in iter` part is missing
```

Or, for binary expression parsing:
```python
for x or y in z: ...

# Should this be parsed as
for (x or y) in z: ...
# Or
for (x or y in z): ...
# where the `in iter` part is missing
```

This need not be solved now, but is very easy to change. For context
this PR does the following:
* For binary, comparison, and unary expressions, stop at `in`
* For lambda, yield expressions, consume the `in`

## Test Plan

1. Add test cases for the `for ... in` statement and verify the
snapshots
2. Make sure the existing test suite pass
3. Run the fuzzer for around 3000 generated source code
4. Run the updated logic on a dozen or so open source repositories
(codename "parser-checkouts")
2024-04-23 04:19:05 +00:00
Dhruv Manilawala b7066e64e7
Consider binary expr for parenthesized with items parsing (#11012)
## Summary

This PR fixes the bug in with items parsing where it would fail to recognize that the parenthesized expression is part of a larger binary expression.
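
A hypothetical example of the shape involved (assumed from the description above, not quoted from the PR):

```py
with (a) + b:  # the `(a)` must be recognized as part of the binary expression `(a) + b`
    ...
```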

## Test Plan

Add test cases and verified the snapshots.
2024-04-18 21:39:30 +05:30
Dhruv Manilawala 6c4d779140
Consider `if` expression for parenthesized with items parsing (#11010)
## Summary

This PR fixes the bug in parenthesized with items parsing where the `if`
expression would result into a syntax error.

The reason being that once we identify that the ambiguous left
parenthesis belongs to the context expression, the parser converts the
parsed with item into an equivalent expression. Then, the parser continues to parse any postfix expressions. Now, attribute, subscript, and call are taken into account as they're grouped in `parse_postfix_expression`, but the `if` expression has its own parsing function.

Use `parse_if_expression` once all postfix expressions have been parsed.
Ideally, I think that `if` could be included in postfix expression
parsing as they can be chained as well (`x if True else y if True else
z`).
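
A hypothetical example of the failing shape (assumed from the description above):

```py
with (a) if cond else b:  # the `if` expression previously produced a syntax error
    ...
```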

## Test Plan

Add test cases and verified the snapshots.
2024-04-18 14:30:15 +00:00
Dhruv Manilawala 8020d486f6
Reset `FOR_TARGET` context for all kinds of parentheses (#11009)
## Summary

This PR fixes a bug in the new parser which involves the parser context w.r.t. the `for` statement. This is specifically around the `in` keyword, which can be present in the target expression and shouldn't be considered part of the `for` statement header. Ideally it should use a context which is passed between functions, thus using the call stack to set / unset a specific variant; this will be done in a follow-up PR as it requires some amount of refactoring.
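
A hedged example of the kind of target this affects (my own, not from the PR):

```py
for d[x in y] in z: ...  # the inner `in` sits inside brackets and must not
                         # end the `for` header
```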

## Test Plan

Add test cases and update the snapshots.
2024-04-18 19:37:50 +05:30
Dhruv Manilawala 13ffb5bc19
Replace LALRPOP parser with hand-written parser (#10036)
(Supersedes #9152, authored by @LaBatata101)

## Summary

This PR replaces the current parser generated from LALRPOP with a
hand-written recursive descent parser.

It also updates the grammar for [PEP
646](https://peps.python.org/pep-0646/) so that the parser outputs the
correct AST. For example, in `data[*x]`, the index expression is now a
tuple with a single starred expression instead of just a starred
expression.

Beyond the performance improvements, the parser is also error resilient
and can provide better error messages. The behavior as seen by any
downstream tools isn't changed. That is, the linter and formatter can
still assume that the parser will _stop_ at the first syntax error. This
will be updated in the following months.

For more details about the change here, refer to the PR corresponding to
the individual commits and the release blog post.

## Test Plan

Write _lots_ and _lots_ of tests for both valid and invalid syntax and
verify the output.

## Acknowledgements

- @MichaReiser for reviewing 100+ parser PRs and continuously providing
guidance throughout the project
- @LaBatata101 for initiating the transition to a hand-written parser in
#9152
- @addisoncrump for implementing the fuzzer which helped
[catch](https://github.com/astral-sh/ruff/pull/10903)
[a](https://github.com/astral-sh/ruff/pull/10910)
[lot](https://github.com/astral-sh/ruff/pull/10966)
[of](https://github.com/astral-sh/ruff/pull/10896)
[bugs](https://github.com/astral-sh/ruff/pull/10877)

---------

Co-authored-by: Victor Hugo Gomes <labatata101@linuxmail.org>
Co-authored-by: Micha Reiser <micha@reiser.io>
2024-04-18 17:57:39 +05:30