mirror of https://github.com/astral-sh/ruff
14 Commits

# `96da136e6a` Move token and error structs into related modules (#11957)

## Summary

This PR does some housekeeping by moving certain structs into related modules. Specifically:

1. Move `LexicalError` from `lexer.rs` to `error.rs`, which also contains `ParseError`
2. Move `Token`, `TokenFlags` and `TokenValue` from `lexer.rs` to `token.rs`

# `1e0642fac8` Use re-lexing for normal list parsing (#11871)

## Summary

This PR is a follow-up to #11845 that adds the re-lexing logic for normal list parsing. Normal list parsing means parsing elements without any separator in between, i.e., there can only be trivia tokens between two elements. Currently, this is only used for parsing **assignment statements** and **f-string elements**. Assignment statements cannot occur in a parenthesized context, but f-strings can have curly braces, so this PR is specifically for them.

I don't think this is an ideal recovery, but the problem is that both the lexer and the parser could add an error for f-strings. If the lexer adds an error, it emits an `Unknown` token instead, while the parser adds the error directly. I think we'd need to move all f-string errors to be emitted by the parser instead. That way the parser can correctly inform the lexer that it's out of an f-string, and the lexer can pop the current f-string context off the stack (see the sketch after this entry).

## Test Plan

Add test cases, update the snapshots, and run the fuzzer.
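
The idea of the parser telling the lexer to leave the current f-string context can be pictured as a stack of contexts that the parser pops during recovery. The following is a minimal, self-contained sketch with hypothetical names; it is not ruff's actual lexer API.

```rust
/// One entry per f-string the lexer is currently inside of (they can nest).
#[derive(Debug)]
struct FStringContext {
    /// Byte offset where the f-string started, useful for error ranges.
    start: usize,
    /// Whether the lexer is currently inside the format spec (`{x:...}`).
    in_format_spec: bool,
}

#[derive(Debug, Default)]
struct Lexer {
    fstrings: Vec<FStringContext>,
}

impl Lexer {
    /// Called when the lexer sees the start of an f-string.
    fn push_fstring(&mut self, start: usize) {
        self.fstrings.push(FStringContext { start, in_format_spec: false });
    }

    /// Called by the parser during error recovery: leave the innermost
    /// f-string so the lexer stops expecting f-string element tokens.
    fn pop_fstring(&mut self) -> Option<FStringContext> {
        self.fstrings.pop()
    }
}

fn main() {
    let mut lexer = Lexer::default();
    lexer.push_fstring(0);
    // ... the parser hits an error inside the f-string element list ...
    let recovered = lexer.pop_fstring();
    println!("popped f-string context: {recovered:?}");
}
```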

# `8499abfa7f` Implement re-lexing logic for better error recovery (#11845)

## Summary

This PR implements the re-lexing logic in the parser. This logic is only applied when recovering from an error during list parsing. The logic is as follows (a rough sketch of steps 2-5 follows this entry):

1. During list parsing, if an unexpected token is encountered and the parser detects that an outer context can understand it and thus recover from it, it invokes the re-lexing logic in the lexer
2. This logic first checks whether the lexer is in a parenthesized context and returns if it isn't. Thus, the logic is a no-op if the lexer isn't in a parenthesized context
3. It then reduces the nesting level by 1. It shouldn't reset it to 0 because otherwise the recovery from nested list parsing would be incorrect
4. Then, it tries to find the last newline character going backwards from the current position of the lexer. This skips any whitespace, but if it encounters any character other than a newline or whitespace, it aborts
5. Now, if there's a newline character, it needs to be re-lexed in a logical context, which means that the lexer needs to emit it as a `Newline` token instead of `NonLogicalNewline`
6. If the re-lexing gives a different token than the current one, the token source needs to update its token collection to remove all the tokens that come after the new current position

It turns out that the list parsing isn't that happy with the results, so it requires some re-arranging such that the following two errors are raised correctly:

1. Expected comma
2. Recovery context error

For (1), the following scenarios need to be considered:

* Missing comma between two elements
* Half-parsed element because the grammar doesn't allow it (for example, named expressions)

For (2), the following scenarios need to be considered:

1. If the parser is at a comma, there's a missing element; otherwise the comma would've been consumed by the first `eat` call above. And the parser doesn't take the re-lexing route on a comma token
2. If it's the first element and the current token is not a comma, it's an invalid element

resolves: #11640

## Test Plan

- [x] Update existing test snapshots and validate them
- [x] Add additional test cases specific to the re-lexing logic and validate the snapshots
- [x] Run the fuzzer on 3000+ valid inputs
- [x] Run the fuzzer on invalid inputs
- [x] Run the parser on various open source projects
- [x] Make sure the ecosystem changes are none
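
To make the steps above concrete, here is a rough sketch of steps 2-5: check the parenthesized context, reduce the nesting level, walk back over whitespace to a newline, and restart lexing there. The types and field names are simplified stand-ins, not ruff's actual lexer.

```rust
struct Lexer<'src> {
    source: &'src str,
    /// Current byte offset of the lexer.
    offset: usize,
    /// Number of unclosed `(`, `[` or `{`.
    nesting: u32,
}

impl<'src> Lexer<'src> {
    /// Returns `true` if the lexer moved back to a newline that should now be
    /// emitted as a logical `Newline` token.
    fn re_lex_logical_token(&mut self) -> bool {
        // Step 2: a no-op unless we're inside parentheses.
        if self.nesting == 0 {
            return false;
        }
        // Step 3: reduce (don't reset) the nesting level.
        self.nesting -= 1;

        // Step 4: walk backwards over whitespace looking for a newline.
        let bytes = self.source.as_bytes();
        let mut pos = self.offset;
        while pos > 0 {
            match bytes[pos - 1] {
                b' ' | b'\t' => pos -= 1,
                b'\n' | b'\r' => {
                    // Step 5: restart lexing just before the newline so it is
                    // re-lexed as `Newline` instead of `NonLogicalNewline`.
                    self.offset = pos - 1;
                    return true;
                }
                // Any other character: abort, there is nothing to re-lex.
                _ => return false,
            }
        }
        false
    }
}

fn main() {
    // `call(a, b` is unclosed; the lexer currently sits at the start of `print`.
    let source = "call(a, b\nprint(x)";
    let mut lexer = Lexer { source, offset: 10, nesting: 1 };
    assert!(lexer.re_lex_logical_token());
    assert_eq!(lexer.offset, 9); // now pointing at the `\n`
}
```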

# `60ea72a6bc` Add list terminator kind for error recovery (#11843)

## Summary

This PR adds a new enum to determine the kind of terminator token, i.e., whether it actually terminates the list or is only used for error recovery (an illustrative sketch follows this entry). This is important because the parser should take the error recovery route when the terminator token is used for error recovery, in which case it will then try to re-lex the token. I haven't updated any references to use this new enum, as doing so would change the snapshots. I plan to do that in a follow-up PR so that it's easier to reason about.

## Test Plan

`cargo insta test`
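
A minimal sketch of such an enum and how a parser might branch on it; the name `ListTerminatorKind` and its variants are hypothetical here and may not match what was added in this PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ListTerminatorKind {
    /// The token genuinely terminates the list (e.g. `)` for call arguments).
    Regular,
    /// The token only terminates the list so that the parser can recover
    /// from an error; the parser may then try to re-lex it.
    ErrorRecovery,
}

fn main() {
    let kind = ListTerminatorKind::ErrorRecovery;
    // The parser would branch on the kind when it hits a terminator token.
    match kind {
        ListTerminatorKind::Regular => println!("stop parsing the list"),
        ListTerminatorKind::ErrorRecovery => println!("take the recovery route and re-lex"),
    }
}
```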

# `a525b4be3d` Separate terminator token for f-string elements kind (#11842)

## Summary

This PR separates the terminator token for f-string elements depending on the context. A list of f-string elements can occur either in a regular f-string or in the format spec of an f-string. The terminator token differs depending on that context (the two contexts are illustrated after this entry).

## Test Plan

`cargo insta test` and verify the updated snapshots.
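
For illustration, here are the two contexts and what ends each element list; the enum and method names below are hypothetical, not the ones used in ruff.

```rust
#[derive(Debug, Clone, Copy)]
enum FStringElementsKind {
    /// Elements of the f-string itself: `f"text {expr} more text"`.
    Regular,
    /// Elements inside a format spec: `f"{value:{width}.{precision}}"`.
    FormatSpec,
}

impl FStringElementsKind {
    /// A human-readable description of what terminates the element list.
    fn terminator(self) -> &'static str {
        match self {
            // A regular f-string's elements end at the closing quote.
            FStringElementsKind::Regular => "end of the f-string",
            // A format spec's elements end at the `}` closing the replacement field.
            FStringElementsKind::FormatSpec => "closing `}`",
        }
    }
}

fn main() {
    for kind in [FStringElementsKind::Regular, FStringElementsKind::FormatSpec] {
        println!("{kind:?} elements terminate at: {}", kind.terminator());
    }
}
```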

# `549cc1e437` Build `CommentRanges` outside the parser (#11792)

## Summary

This PR updates the parser so that it no longer builds the `CommentRanges`; instead, they are built by the linter and the formatter when required. For the linter, they are built and owned by the `Indexer`, while for the formatter they are built from the `Tokens` struct and passed as an argument (a sketch of deriving comment ranges from tokens follows this entry).

## Test Plan

`cargo insta test`
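
A minimal sketch of the idea of deriving comment ranges from an already-lexed token stream. The `Token`/`TokenKind` types below are simplified stand-ins for ruff's actual `Tokens` and `CommentRanges`.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Comment,
    Name,
    Newline,
}

#[derive(Debug)]
struct Token {
    kind: TokenKind,
    range: Range<usize>,
}

/// Collect the source ranges of all comment tokens in one pass.
fn comment_ranges(tokens: &[Token]) -> Vec<Range<usize>> {
    tokens
        .iter()
        .filter(|token| token.kind == TokenKind::Comment)
        .map(|token| token.range.clone())
        .collect()
}

fn main() {
    // Tokens for `x  # trailing comment`.
    let tokens = vec![
        Token { kind: TokenKind::Name, range: 0..1 },
        Token { kind: TokenKind::Comment, range: 3..21 },
        Token { kind: TokenKind::Newline, range: 21..22 },
    ];
    println!("{:?}", comment_ranges(&tokens));
}
```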

# `1b7d08c2c9` Consider `:` to terminate parenthesized with items (#11775)

## Summary

This PR is a follow-up to this discussion (https://github.com/astral-sh/ruff/pull/11770#discussion_r1628917209) which adds the `:` token to the terminator set for parenthesized with items. The main motivation is to avoid parsing too much in speculative mode. This is evident from the following _before_ and _after_ parsed with-items lists for the following code:

```py
with (item1, item2:
    foo
```

Before (3 items):

```
parsed_with_items: [
    ParsedWithItem {
        item: WithItem {
            range: 6..11,
            context_expr: Name(
                ExprName {
                    range: 6..11,
                    id: "item1",
                    ctx: Load,
                },
            ),
            optional_vars: None,
        },
        is_parenthesized: false,
    },
    ParsedWithItem {
        item: WithItem {
            range: 13..18,
            context_expr: Name(
                ExprName {
                    range: 13..18,
                    id: "item2",
                    ctx: Load,
                },
            ),
            optional_vars: None,
        },
        is_parenthesized: false,
    },
    ParsedWithItem {
        item: WithItem {
            range: 24..27,
            context_expr: Name(
                ExprName {
                    range: 24..27,
                    id: "foo",
                    ctx: Load,
                },
            ),
            optional_vars: None,
        },
        is_parenthesized: false,
    },
]
```

After (2 items):

```
parsed_with_items: [
    ParsedWithItem {
        item: WithItem {
            range: 6..11,
            context_expr: Name(
                ExprName {
                    range: 6..11,
                    id: "item1",
                    ctx: Load,
                },
            ),
            optional_vars: None,
        },
        is_parenthesized: false,
    },
    ParsedWithItem {
        item: WithItem {
            range: 13..18,
            context_expr: Name(
                ExprName {
                    range: 13..18,
                    id: "item2",
                    ctx: Load,
                },
            ),
            optional_vars: None,
        },
        is_parenthesized: false,
    },
]
```

## Test Plan

`cargo insta test`

# `6c1fa1d440` Use speculative parsing for with-items (#11770)

## Summary

This PR updates the with-items parsing logic to use speculative parsing instead (a generic checkpoint/rewind sketch follows this entry).

### Existing logic

First, let's understand the previous logic:

1. The parser sees `(`; it doesn't know whether it's part of parenthesized with items or a parenthesized expression
2. Consider it parenthesized with items and perform a hand-rolled speculative parse
3. Then, verify the assumption and, if it's incorrect, convert the parsed with items into an appropriate expression which becomes part of the first with item

Here, in (3) there are lots of edge cases which we have to deal with:

1. Trailing comma with a single element should be [converted to the expression as is](
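
The entry above is truncated in this mirror, but the core pattern it describes (take a checkpoint, try one interpretation, rewind and try another if the assumption turns out to be wrong) can be sketched generically. This is a toy illustration, not ruff's parser API.

```rust
#[derive(Clone, Copy)]
struct Checkpoint {
    position: usize,
}

struct Parser {
    tokens: Vec<&'static str>,
    position: usize,
}

impl Parser {
    fn checkpoint(&self) -> Checkpoint {
        Checkpoint { position: self.position }
    }

    fn rewind(&mut self, checkpoint: Checkpoint) {
        self.position = checkpoint.position;
    }

    /// Try one interpretation; if it doesn't pan out, rewind and try the other.
    fn parse_with_items_or_expression(&mut self) -> &'static str {
        let checkpoint = self.checkpoint();
        if self.try_parse_parenthesized_with_items() {
            "parenthesized with-items"
        } else {
            self.rewind(checkpoint);
            "parenthesized expression"
        }
    }

    fn try_parse_parenthesized_with_items(&mut self) -> bool {
        // Placeholder: a real implementation would parse ahead and verify the
        // assumption (e.g. by checking what follows the closing `)`).
        self.position = self.tokens.len();
        false
    }
}

fn main() {
    let mut parser = Parser { tokens: vec!["(", "a", ",", "b", ")"], position: 0 };
    println!("{}", parser.parse_with_items_or_expression());
    assert_eq!(parser.position, 0); // rewound after the failed attempt
}
```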

# `3b19df04d7` Use cursor offset for lexer checkpoint (#11734)

## Summary

This PR updates the lexer checkpoint to store the cursor offset instead of cloning the cursor itself. This reduces the size of `LexerCheckpoint` from 136 to 112 bytes and also removes the need for a lifetime (the before/after shape of the checkpoint is sketched after this entry).

## Test Plan

`cargo insta test`
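
A simplified sketch of the change: a checkpoint that clones the cursor has to carry the cursor's lifetime, while a checkpoint that stores only the offset does not. The types below are stand-ins, not ruff's actual `Cursor`/`LexerCheckpoint`.

```rust
struct Cursor<'src> {
    source: &'src str,
    offset: usize,
}

// Before (simplified): carries a lifetime because it owns a cloned cursor.
#[allow(dead_code)]
struct OldCheckpoint<'src> {
    cursor: Cursor<'src>,
}

// After (simplified): just remembers where the cursor was.
struct NewCheckpoint {
    offset: usize,
}

impl<'src> Cursor<'src> {
    fn checkpoint(&self) -> NewCheckpoint {
        NewCheckpoint { offset: self.offset }
    }

    fn rewind(&mut self, checkpoint: NewCheckpoint) {
        self.offset = checkpoint.offset;
    }
}

fn main() {
    let mut cursor = Cursor { source: "x = 1", offset: 0 };
    let checkpoint = cursor.checkpoint();
    cursor.offset = 3; // lex ahead...
    cursor.rewind(checkpoint); // ...and rewind
    assert_eq!(cursor.offset, 0);
    assert_eq!(cursor.source, "x = 1");
}
```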

# `bf5b62edac` Maintain synchronicity between the lexer and the parser (#11457)

## Summary

This PR updates the entire parser stack in multiple ways:

### Make the lexer lazy

* https://github.com/astral-sh/ruff/pull/11244
* https://github.com/astral-sh/ruff/pull/11473

Previously, Ruff's lexer would act as an iterator. The parser would collect all the tokens in a vector first and then process them to create the syntax tree. The first task in this project is to update the entire parsing flow to make the lexer lazy. This includes the `Lexer`, `TokenSource`, and `Parser`. For context, the `TokenSource` is a wrapper around the `Lexer` that filters out the trivia tokens[^1]. Now, the parser will ask the token source for the next token and only then will the lexer continue and emit the token (a simplified sketch of this pull-based flow follows this entry). This means that the lexer needs to be aware of the "current" token. When `next_token` is called, the current token is updated with the newly lexed token.

The main motivation for making the lexer lazy is to allow re-lexing a token in a different context. This is going to be really useful for making the parser error resilient. For example, currently the emitted tokens remain the same even if the parser can recover from an unclosed parenthesis. This matters because the lexer emits a `NonLogicalNewline` in a parenthesized context and a normal `Newline` in a non-parenthesized context. These different kinds of newlines are also used to emit the indentation tokens, which are important for the parser as they're used to determine the start and end of a block.

Additionally, this allows us to implement the following functionalities:

1. Checkpoint - rewind infrastructure: The idea here is to create a checkpoint and continue lexing. At a later point, this checkpoint can be used to rewind the lexer back to the provided checkpoint.
2. Remove the `SoftKeywordTransformer` and instead use lookahead or speculative parsing to determine whether a soft keyword is a keyword or an identifier.
3. Remove the `Tok` enum. The `Tok` enum represents the tokens emitted by the lexer, but it contains owned data which makes it expensive to clone. The new `TokenKind` enum just represents the type of token, which is very cheap. This brings up a question as to how the parser will get the owned value that was stored on `Tok`. This is solved by introducing a new `TokenValue` enum which only contains the subset of token kinds that have an owned value. It is stored on the lexer and is requested by the parser when it wants to process the data. For example:
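
The example at the end of the entry above is cut off in this mirror. As a stand-in, here is a simplified sketch of the pull-based flow it describes: the parser drives the lexer one token at a time, the lexer returns only a cheap `TokenKind`, and the owned data for the current token is kept on the lexer and handed over on request. All names are illustrative; ruff's actual `TokenKind`/`TokenValue` are richer.

```rust
use std::collections::VecDeque;
use std::mem;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Name,
    Int,
    Newline,
    EndOfFile,
}

#[derive(Debug, Default)]
enum TokenValue {
    #[default]
    None,
    Name(String),
    Int(i64),
}

struct Lexer {
    /// A canned token stream stands in for real lexing of source text.
    queue: VecDeque<(TokenKind, TokenValue)>,
    /// Owned data for the *current* token, taken by the parser on demand.
    current_value: TokenValue,
}

impl Lexer {
    /// Lex (here: pop) the next token only when the parser asks for it.
    fn next_token(&mut self) -> TokenKind {
        match self.queue.pop_front() {
            Some((kind, value)) => {
                self.current_value = value;
                kind
            }
            None => TokenKind::EndOfFile,
        }
    }

    /// Hand the owned value of the current token over to the parser.
    fn take_value(&mut self) -> TokenValue {
        mem::take(&mut self.current_value)
    }
}

fn main() {
    let mut lexer = Lexer {
        queue: vec![
            (TokenKind::Name, TokenValue::Name("x".to_string())),
            (TokenKind::Int, TokenValue::Int(1)),
            (TokenKind::Newline, TokenValue::None),
        ]
        .into(),
        current_value: TokenValue::None,
    };
    // The parser drives the lexer token by token and requests owned values
    // only for the kinds that carry data.
    while lexer.next_token() != TokenKind::EndOfFile {
        println!("{:?}", lexer.take_value());
    }
}
```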

# `04a922866a` Add basic docs for the parser crate (#11199)

## Summary

This PR adds a basic README for the `ruff_python_parser` crate and updates the CONTRIBUTING docs with the fuzzer and benchmark sections. Additionally, it also updates some inline documentation within the parser crate and splits the `parse_program` function into `parse_single_expression` and `parse_module`, which are called by matching against the `Mode`.

This PR doesn't go into too much internal detail around the parser logic, for the following reasons:

1. Where should the docs go? Should they be module docs in `lib.rs` or in the README?
2. The parser is still evolving and could include a lot of refactors with the future work (feedback loop and improved error recovery and resilience)

---------

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>

# `c30735d4a7` Add `ExpressionContext` for expression parsing (#11055)

## Summary

This PR adds a new `ExpressionContext` struct which is used in expression parsing. This solves the following problems:

1. Allowing starred expressions with different precedence
2. Allowing yield expressions in certain contexts
3. Removing ambiguity with the `in` keyword when parsing a `for ... in` statement

For context, (1) was solved by adding `parse_star_expression_list` and `parse_star_expression_or_higher` in #10623, (2) was solved by adding `parse_yield_expression_or_else` in #10809, and (3) was fixed in #11009. All of the mentioned functions have been removed in favor of the context flags.

As mentioned in #11009, an ideal solution would be to implement an expression context, which is what this PR implements. It is passed around as a function parameter and the call stack is used to automatically reset the context (a sketch of that pattern follows this entry).

### Recovery

How should the parser recover if the target expression is invalid when an expression can consume the `in` keyword?

1. Should the `in` keyword be part of the target expression?
2. Or, should the expression parsing stop as soon as the `in` keyword is encountered, no matter the expression?

For example:

```python
for yield x in y: ...

# Here, should this be parsed as
for (yield x) in (y): ...
# Or
for (yield x in y): ...  # where the `in iter` part is missing
```

Or, for binary expression parsing:

```python
for x or y in z: ...

# Should this be parsed as
for (x or y) in z: ...
# Or
for (x or y in z): ...  # where the `in iter` part is missing
```

This need not be solved now, but is very easy to change. For context, this PR does the following:

* For binary, comparison, and unary expressions, stop at `in`
* For lambda and yield expressions, consume the `in`

## Test Plan

1. Add test cases for the `for ... in` statement and verify the snapshots
2. Make sure the existing test suite passes
3. Run the fuzzer on around 3000 generated source files
4. Run the updated logic on a dozen or so open source repositories (codename "parser-checkouts")
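
A minimal sketch of the pattern described above: the context is a small `Copy` struct passed down by value, so each callee can derive a modified context for its children while the caller's copy stays untouched, and the call stack "resets" it automatically. Field and method names here are illustrative, not ruff's actual `ExpressionContext`.

```rust
#[derive(Debug, Clone, Copy, Default)]
struct ExpressionContext {
    /// Allow starred expressions (`*x`) at this position.
    allow_starred: bool,
    /// Allow `yield` expressions at this position.
    allow_yield: bool,
    /// Stop at the `in` keyword (used for `for ... in ...` targets).
    terminate_at_in: bool,
}

impl ExpressionContext {
    /// Derive the context used while parsing a `for` target.
    fn for_target(self) -> ExpressionContext {
        ExpressionContext { terminate_at_in: true, ..self }
    }
}

fn parse_for_statement(context: ExpressionContext) {
    // The target is parsed with `in` acting as a terminator...
    parse_expression(context.for_target());
    // ...while the iterator is parsed with the caller's context unchanged,
    // because the context was passed by value and never mutated in place.
    parse_expression(context);
}

fn parse_expression(context: ExpressionContext) {
    println!("parsing with {context:?}");
}

fn main() {
    parse_for_statement(ExpressionContext::default());
}
```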

# `d3cd61f804` Use empty range when there's "gap" in token source (#11032)

## Summary

This fixes a bug where the parser would panic when there is a "gap" in the token source.

What's a gap? The reason the check is `<=` instead of just `==` is that there could be whitespace between the two tokens. For example:

```python
# last token end
#      | current token (newline) start
#      v v
def foo \n
#      ^
# assume there's trailing whitespace here
```

Or, there could be tokens that are considered "trivia" and thus aren't emitted by the token source. These are comments and non-logical newlines. For example:

```python
# last token end
#      v
def foo # comment\n
#                ^ current token (newline) start
```

In either of the above cases, there's a "gap" between the end of the last token and the start of the current token (a rough sketch of the fallback follows this entry).

## Test Plan

Add test cases and update the snapshots.
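
A rough sketch of the fallback this entry describes: when the previous token ends at or before the start of the current token (the "gap" case), use an empty range at the end of the previous token instead of panicking. Ruff itself works with `TextRange`/`TextSize`; a plain `Range<usize>` stands in here, and the exact call site in the parser differs.

```rust
use std::ops::Range;

/// Range to attach to a "missing" node when recovery happens between tokens.
fn missing_node_range(last_token_end: usize, current_token_start: usize) -> Range<usize> {
    // `<=` rather than `==`: whitespace and trivia tokens (comments,
    // non-logical newlines) can sit between the two tokens.
    assert!(last_token_end <= current_token_start, "tokens out of order");
    // An empty range right after the last token.
    last_token_end..last_token_end
}

fn main() {
    // `def foo ` followed by a newline: `foo` ends at offset 7 and the
    // trailing space pushes the newline token's start to offset 8.
    assert_eq!(missing_node_range(7, 8), 7..7);
}
```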

# `13ffb5bc19` Replace LALRPOP parser with hand-written parser (#10036)

(Supersedes #9152, authored by @LaBatata101)

## Summary

This PR replaces the current parser generated from LALRPOP with a hand-written recursive descent parser. It also updates the grammar for [PEP 646](https://peps.python.org/pep-0646/) so that the parser outputs the correct AST. For example, in `data[*x]`, the index expression is now a tuple with a single starred expression instead of just a starred expression.

Beyond the performance improvements, the parser is also error resilient and can provide better error messages. The behavior as seen by any downstream tools isn't changed. That is, the linter and formatter can still assume that the parser will _stop_ at the first syntax error. This will be updated in the following months.

For more details about the change here, refer to the PRs corresponding to the individual commits and the release blog post.

## Test Plan

Write _lots_ and _lots_ of tests for both valid and invalid syntax and verify the output.

## Acknowledgements

- @MichaReiser for reviewing 100+ parser PRs and continuously providing guidance throughout the project
- @LaBatata101 for initiating the transition to a hand-written parser in #9152
- @addisoncrump for implementing the fuzzer which helped [catch](https://github.com/astral-sh/ruff/pull/10903) [a](https://github.com/astral-sh/ruff/pull/10910) [lot](https://github.com/astral-sh/ruff/pull/10966) [of](https://github.com/astral-sh/ruff/pull/10896) [bugs](https://github.com/astral-sh/ruff/pull/10877)

---------

Co-authored-by: Victor Hugo Gomes <labatata101@linuxmail.org>
Co-authored-by: Micha Reiser <micha@reiser.io>