ruff/crates/ruff_python_parser/src/lib.rs

//! This crate can be used to parse Python source code into an Abstract
//! Syntax Tree.
//!
//! ## Overview
//!
//! The process by which source code is parsed into an AST can be broken down
//! into two general stages: [lexical analysis] and [parsing].
//!
//! During lexical analysis, the source code is converted into a stream of lexical
//! tokens that represent the smallest meaningful units of the language. For example,
//! the source code `print("Hello world")` would _roughly_ be converted into the following
//! stream of tokens:
//!
//! ```text
//! Name("print"), LeftParen, String("Hello world"), RightParen
//! ```
//!
//! These tokens are then consumed by the `ruff_python_parser`, which matches them against a set of
//! grammar rules to verify that the source code is syntactically valid and to construct
//! an AST that represents the source code.
//!
//! During parsing, the `ruff_python_parser` consumes the tokens generated by the lexer and constructs
//! a tree representation of the source code. The tree is made up of nodes that represent
//! the different syntactic constructs of the language. If the source code is syntactically
//! invalid, parsing fails and an error is returned. After a successful parse, the AST can
//! be used to perform further analysis on the source code. Continuing with the example
//! above, the AST generated by the `ruff_python_parser` would _roughly_ look something like this:
//!
//! ```text
//! node: Expr {
//!     value: {
//!         node: Call {
//!             func: {
//!                 node: Name {
//!                     id: "print",
//!                     ctx: Load,
//!                 },
//!             },
//!             args: [
//!                 node: Constant {
//!                     value: Str("Hello world"),
//!                     kind: None,
//!                 },
//!             ],
//!             keywords: [],
//!         },
//!     },
//! },
//! ```
//!
//! **Note:** The Tokens/ASTs shown above are not the exact tokens/ASTs generated by the `ruff_python_parser`.
//! Refer to the [playground](https://play.ruff.rs) for the correct representation.
//!
//! ## Source code layout
//!
//! The functionality of this crate is split into several modules:
//!
//! - token: This module contains the definition of the tokens that are generated by the lexer.
//! - [lexer]: This module contains the lexer and is responsible for generating the tokens.
//! - parser: This module contains the parser, which produces a [`Parsed`] output and is responsible for generating the AST.
//! - mode: This module contains the definition of the different modes that the `ruff_python_parser` can be in.
//!
//! [lexical analysis]: https://en.wikipedia.org/wiki/Lexical_analysis
//! [parsing]: https://en.wikipedia.org/wiki/Parsing
//! [lexer]: crate::lexer
pub use crate::error::{
InterpolatedStringErrorType, LexicalErrorType, ParseError, ParseErrorType,
UnsupportedSyntaxError, UnsupportedSyntaxErrorKind,
};
pub use crate::parser::ParseOptions;
use crate::parser::Parser;
use ruff_python_ast::token::Tokens;
use ruff_python_ast::{
Expr, Mod, ModExpression, ModModule, PySourceType, StringFlags, StringLiteral, Suite,
};
use ruff_text_size::{Ranged, TextRange};
mod error;
pub mod lexer;
mod parser;
pub mod semantic_errors;
mod string;
mod token;
mod token_set;
mod token_source;
pub mod typing;
/// Parse a full Python module usually consisting of multiple lines.
///
/// This is a convenience function that can be used to parse a full Python program without having to
/// specify the [`Mode`] or the location. It is probably what you want to use most of the time.
///
/// # Example
///
/// For example, parsing a simple function definition and a call to that function:
///
/// ```
/// use ruff_python_parser::parse_module;
///
/// let source = r#"
/// def foo():
///     return 42
///
/// print(foo())
/// "#;
///
/// let module = parse_module(source);
/// assert!(module.is_ok());
/// ```
pub fn parse_module(source: &str) -> Result<Parsed<ModModule>, ParseError> {
Parser::new(source, ParseOptions::from(Mode::Module))
.parse()
.try_into_module()
.unwrap()
.into_result()
}
/// Parses a single Python expression.
///
/// This convenience function can be used to parse a single expression without having to
/// specify the [`Mode`] or the location.
///
/// # Example
///
/// For example, parsing a single expression denoting the addition of two numbers:
///
/// ```
/// use ruff_python_parser::parse_expression;
///
/// let expr = parse_expression("1 + 2");
/// assert!(expr.is_ok());
/// ```
pub fn parse_expression(source: &str) -> Result<Parsed<ModExpression>, ParseError> {
Parser::new(source, ParseOptions::from(Mode::Expression))
.parse()
.try_into_expression()
.unwrap()
.into_result()
}
/// Parses a Python expression for the given range in the source.
///
/// This function lets you specify the range of the expression in the source code. Other than
/// that, it behaves exactly like [`parse_expression`].
///
/// # Example
///
/// Parsing one of the numeric literals that is part of an addition expression:
///
/// ```
/// use ruff_python_parser::parse_expression_range;
/// # use ruff_text_size::{TextRange, TextSize};
///
/// let parsed = parse_expression_range("11 + 22 + 33", TextRange::new(TextSize::new(5), TextSize::new(7)));
/// assert!(parsed.is_ok());
/// ```
pub fn parse_expression_range(
source: &str,
range: TextRange,
) -> Result<Parsed<ModExpression>, ParseError> {
let source = &source[..range.end().to_usize()];
Parser::new_starts_at(source, range.start(), ParseOptions::from(Mode::Expression))
.parse()
.try_into_expression()
.unwrap()
.into_result()
}
/// Parses a Python expression as if it is parenthesized.
///
/// It behaves similarly to [`parse_expression_range`] but allows what would be valid within parentheses.
///
/// # Example
///
/// Parsing an expression that would be valid within parentheses:
///
/// ```
/// use ruff_python_parser::parse_parenthesized_expression_range;
/// # use ruff_text_size::{TextRange, TextSize};
///
/// let parsed = parse_parenthesized_expression_range("'''\n int | str'''", TextRange::new(TextSize::new(3), TextSize::new(14)));
/// assert!(parsed.is_ok());
/// ```
pub fn parse_parenthesized_expression_range(
source: &str,
range: TextRange,
) -> Result<Parsed<ModExpression>, ParseError> {
let source = &source[..range.end().to_usize()];
let parsed = Parser::new_starts_at(
source,
range.start(),
ParseOptions::from(Mode::ParenthesizedExpression),
)
.parse();
parsed.try_into_expression().unwrap().into_result()
}
/// Parses a Python expression from a string annotation.
///
/// # Example
///
/// Parsing a string annotation:
///
/// ```
/// use ruff_python_parser::parse_string_annotation;
/// use ruff_python_ast::{StringLiteral, StringLiteralFlags, AtomicNodeIndex};
/// use ruff_text_size::{TextRange, TextSize};
///
/// let string = StringLiteral {
/// value: "'''\n int | str'''".to_string().into_boxed_str(),
/// flags: StringLiteralFlags::empty(),
/// range: TextRange::new(TextSize::new(0), TextSize::new(16)),
/// node_index: AtomicNodeIndex::NONE
/// };
/// let parsed = parse_string_annotation("'''\n int | str'''", &string);
/// assert!(!parsed.is_ok());
/// ```
pub fn parse_string_annotation(
source: &str,
string: &StringLiteral,
) -> Result<Parsed<ModExpression>, ParseError> {
let range = string
.range()
.add_start(string.flags.opener_len())
.sub_end(string.flags.closer_len());
let source = &source[..range.end().to_usize()];
if string.flags.is_triple_quoted() {
parse_parenthesized_expression_range(source, range)
} else {
parse_expression_range(source, range)
}
}
/// Parse the given Python source code using the specified [`ParseOptions`].
///
/// This function is the most general function to parse Python code. Based on the [`Mode`] supplied
/// via the [`ParseOptions`], it can be used to parse a single expression, a full Python program,
/// an interactive expression or a Python program containing IPython escape commands.
///
/// # Example
///
/// If we want to parse a simple expression, we can use the [`Mode::Expression`] mode during
/// parsing:
///
/// ```
/// use ruff_python_parser::{parse, Mode, ParseOptions};
///
/// let parsed = parse("1 + 2", ParseOptions::from(Mode::Expression));
/// assert!(parsed.is_ok());
/// ```
///
/// Alternatively, we can parse a full Python program consisting of multiple lines:
///
/// ```
/// use ruff_python_parser::{parse, Mode, ParseOptions};
///
/// let source = r#"
/// class Greeter:
///
///     def greet(self):
///         print("Hello, world!")
/// "#;
/// let parsed = parse(source, ParseOptions::from(Mode::Module));
/// assert!(parsed.is_ok());
/// ```
///
/// Additionally, we can parse a Python program containing IPython escapes:
///
/// ```
/// use ruff_python_parser::{parse, Mode, ParseOptions};
///
/// let source = r#"
/// %timeit 1 + 2
/// ?str.replace
/// !ls
/// "#;
/// let parsed = parse(source, ParseOptions::from(Mode::Ipython));
/// assert!(parsed.is_ok());
/// ```
pub fn parse(source: &str, options: ParseOptions) -> Result<Parsed<Mod>, ParseError> {
parse_unchecked(source, options).into_result()
}
/// Parse the given Python source code using the specified [`ParseOptions`].
///
/// This is the same as the [`parse`] function except that it doesn't check for any [`ParseError`]
/// and returns the [`Parsed`] as is.
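///
/// # Example
///
/// A minimal sketch of inspecting a failed parse instead of receiving an [`Err`]. The
/// input below is an arbitrary snippet chosen only because it is syntactically incomplete:
///
/// ```
/// use ruff_python_parser::{parse_unchecked, Mode, ParseOptions};
///
/// // The parenthesis is never closed, so the source is invalid.
/// let parsed = parse_unchecked("x = (1 +", ParseOptions::from(Mode::Module));
/// assert!(parsed.has_invalid_syntax());
/// assert!(!parsed.errors().is_empty());
/// ```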
pub fn parse_unchecked(source: &str, options: ParseOptions) -> Parsed<Mod> {
Parser::new(source, options).parse()
}
/// Parse the given Python source code using the specified [`PySourceType`].
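///
/// # Example
///
/// A minimal sketch, assuming a regular Python file ([`PySourceType::Python`]). The source
/// string is illustrative only:
///
/// ```
/// use ruff_python_ast::PySourceType;
/// use ruff_python_parser::parse_unchecked_source;
///
/// let parsed = parse_unchecked_source("import os\n", PySourceType::Python);
/// assert!(parsed.has_valid_syntax());
/// // The module body holds the single `import` statement.
/// assert_eq!(parsed.suite().len(), 1);
/// ```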
pub fn parse_unchecked_source(source: &str, source_type: PySourceType) -> Parsed<ModModule> {
// SAFETY: Safe because `PySourceType` always parses to a `ModModule`
Parser::new(source, ParseOptions::from(source_type))
.parse()
.try_into_module()
.unwrap()
}
/// Represents the parsed source code.
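///
/// # Example
///
/// A short sketch of reading the result of a successful parse through the accessors on
/// this type. The two-statement source is illustrative only:
///
/// ```
/// use ruff_python_parser::parse_module;
///
/// let parsed = parse_module("x = 1\ny = 2\n").unwrap();
/// assert!(parsed.has_valid_syntax());
/// assert!(parsed.errors().is_empty());
/// // `suite` exposes the module body, one entry per top-level statement.
/// assert_eq!(parsed.suite().len(), 2);
/// ```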
#[derive(Debug, PartialEq, Clone, get_size2::GetSize)]
pub struct Parsed<T> {
syntax: T,
tokens: Tokens,
errors: Vec<ParseError>,
unsupported_syntax_errors: Vec<UnsupportedSyntaxError>,
}
impl<T> Parsed<T> {
/// Returns the syntax node represented by this parsed output.
pub fn syntax(&self) -> &T {
&self.syntax
}
/// Returns all the tokens for the parsed output.
pub fn tokens(&self) -> &Tokens {
&self.tokens
}
/// Returns a list of syntax errors found during parsing.
pub fn errors(&self) -> &[ParseError] {
&self.errors
}
/// Returns a list of version-related syntax errors found during parsing.
pub fn unsupported_syntax_errors(&self) -> &[UnsupportedSyntaxError] {
&self.unsupported_syntax_errors
}
/// Consumes the [`Parsed`] output and returns the contained syntax node.
pub fn into_syntax(self) -> T {
self.syntax
}
/// Consumes the [`Parsed`] output and returns a list of syntax errors found during parsing.
pub fn into_errors(self) -> Vec<ParseError> {
self.errors
}
/// Returns `true` if the parsed source code is valid i.e., it has no [`ParseError`]s.
///
/// Note that this does not include version-related [`UnsupportedSyntaxError`]s.
///
/// See [`Parsed::has_no_syntax_errors`] for a version that takes these into account.
pub fn has_valid_syntax(&self) -> bool {
self.errors.is_empty()
}
/// Returns `true` if the parsed source code is invalid i.e., it has [`ParseError`]s.
///
/// Note that this does not include version-related [`UnsupportedSyntaxError`]s.
///
/// See [`Parsed::has_no_syntax_errors`] for a version that takes these into account.
pub fn has_invalid_syntax(&self) -> bool {
!self.has_valid_syntax()
}
/// Returns `true` if the parsed source code does not contain any [`ParseError`]s *or*
/// [`UnsupportedSyntaxError`]s.
///
/// See [`Parsed::has_valid_syntax`] for a version specific to [`ParseError`]s.
pub fn has_no_syntax_errors(&self) -> bool {
self.has_valid_syntax() && self.unsupported_syntax_errors.is_empty()
}
/// Returns `true` if the parsed source code contains any [`ParseError`]s *or*
/// [`UnsupportedSyntaxError`]s.
///
/// See [`Parsed::has_invalid_syntax`] for a version specific to [`ParseError`]s.
pub fn has_syntax_errors(&self) -> bool {
!self.has_no_syntax_errors()
}
/// Returns the [`Parsed`] output as a [`Result`], returning [`Ok`] if it has no syntax errors,
/// or [`Err`] containing a slice of all the [`ParseError`]s encountered.
///
/// Note that any [`unsupported_syntax_errors`](Parsed::unsupported_syntax_errors) will not
/// cause [`Err`] to be returned.
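///
/// # Example
///
/// A minimal sketch of borrowing the errors without consuming the output. `1 +` is an
/// arbitrary invalid expression used for illustration:
///
/// ```
/// use ruff_python_parser::{parse_unchecked, Mode, ParseOptions};
///
/// let parsed = parse_unchecked("1 +", ParseOptions::from(Mode::Expression));
/// match parsed.as_result() {
///     Ok(_) => unreachable!("the right operand is missing"),
///     Err(errors) => assert!(!errors.is_empty()),
/// }
/// ```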
pub fn as_result(&self) -> Result<&Parsed<T>, &[ParseError]> {
if self.has_valid_syntax() {
Ok(self)
} else {
Err(&self.errors)
}
}
/// Consumes the [`Parsed`] output and returns a [`Result`] which is [`Ok`] if it has no syntax
/// errors, or [`Err`] containing the first [`ParseError`] encountered.
///
/// Note that any [`unsupported_syntax_errors`](Parsed::unsupported_syntax_errors) will not
/// cause [`Err`] to be returned.
pub(crate) fn into_result(self) -> Result<Parsed<T>, ParseError> {
if self.has_valid_syntax() {
Ok(self)
} else {
Err(self.into_errors().into_iter().next().unwrap())
}
}
}
impl Parsed<Mod> {
/// Attempts to convert the [`Parsed<Mod>`] into a [`Parsed<ModModule>`].
///
/// This method checks if the `syntax` field of the output is a [`Mod::Module`]. If it is, the
/// method returns [`Some(Parsed<ModModule>)`] with the contained module. Otherwise, it
/// returns [`None`].
///
/// [`Some(Parsed<ModModule>)`]: Some
pub fn try_into_module(self) -> Option<Parsed<ModModule>> {
match self.syntax {
Mod::Module(module) => Some(Parsed {
syntax: module,
tokens: self.tokens,
errors: self.errors,
unsupported_syntax_errors: self.unsupported_syntax_errors,
}),
Mod::Expression(_) => None,
}
}
/// Attempts to convert the [`Parsed<Mod>`] into a [`Parsed<ModExpression>`].
///
/// This method checks if the `syntax` field of the output is a [`Mod::Expression`]. If it is,
/// the method returns [`Some(Parsed<ModExpression>)`] with the contained expression.
/// Otherwise, it returns [`None`].
///
/// [`Some(Parsed<ModExpression>)`]: Some
pub fn try_into_expression(self) -> Option<Parsed<ModExpression>> {
match self.syntax {
Mod::Module(_) => None,
Mod::Expression(expression) => Some(Parsed {
syntax: expression,
tokens: self.tokens,
errors: self.errors,
unsupported_syntax_errors: self.unsupported_syntax_errors,
}),
}
}
}
impl Parsed<ModModule> {
/// Returns the module body contained in this parsed output as a [`Suite`].
pub fn suite(&self) -> &Suite {
&self.syntax.body
}
/// Consumes the [`Parsed`] output and returns the module body as a [`Suite`].
pub fn into_suite(self) -> Suite {
self.syntax.body
}
}
impl Parsed<ModExpression> {
/// Returns the expression contained in this parsed output.
pub fn expr(&self) -> &Expr {
&self.syntax.body
}
/// Returns a mutable reference to the expression contained in this parsed output.
pub fn expr_mut(&mut self) -> &mut Expr {
&mut self.syntax.body
}
/// Consumes the [`Parsed`] output and returns the contained [`Expr`].
pub fn into_expr(self) -> Expr {
*self.syntax.body
}
}
/// Controls the different modes in which a source file can be parsed.
///
/// The mode specifies how the code must be parsed.
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq)]
pub enum Mode {
/// The code consists of a sequence of statements.
Module,
/// The code consists of a single expression.
Expression,
/// The code consists of a single expression and is parsed as if it were parenthesized. The
/// parentheses themselves aren't required. This allows valid multiline expressions without
/// the need for parentheses and is specifically useful for parsing string annotations.
ParenthesizedExpression,
/// The code consists of a sequence of statements which can include the
/// escape commands that are part of IPython syntax.
///
/// ## Supported escape commands:
///
/// - [Magic command system] which is limited to [line magics] and can start
///   with `%` or `%%`.
/// - [Dynamic object information] which can start with `?` or `??`.
/// - [System shell access] which can start with `!` or `!!`.
/// - [Automatic parentheses and quotes] which can start with `/`, `;`, or `,`.
///
/// [Magic command system]: https://ipython.readthedocs.io/en/stable/interactive/reference.html#magic-command-system
/// [line magics]: https://ipython.readthedocs.io/en/stable/interactive/magics.html#line-magics
/// [Dynamic object information]: https://ipython.readthedocs.io/en/stable/interactive/reference.html#dynamic-object-information
/// [System shell access]: https://ipython.readthedocs.io/en/stable/interactive/reference.html#system-shell-access
/// [Automatic parentheses and quotes]: https://ipython.readthedocs.io/en/stable/interactive/reference.html#automatic-parentheses-and-quotes
Ipython,
}
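/// Parses a [`Mode`] from its string name.
///
/// # Example
///
/// A short sketch of the mapping implemented below. Note that, as written, no name maps to
/// [`Mode::ParenthesizedExpression`], and unknown names yield a [`ModeParseError`]:
///
/// ```
/// use ruff_python_parser::Mode;
///
/// assert_eq!("exec".parse::<Mode>().unwrap(), Mode::Module);
/// assert_eq!("single".parse::<Mode>().unwrap(), Mode::Module);
/// assert_eq!("eval".parse::<Mode>().unwrap(), Mode::Expression);
/// assert_eq!("ipython".parse::<Mode>().unwrap(), Mode::Ipython);
/// // Any other name is rejected.
/// assert!("module".parse::<Mode>().is_err());
/// ```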
impl std::str::FromStr for Mode {
type Err = ModeParseError;
fn from_str(s: &str) -> Result<Self, ModeParseError> {
match s {
"exec" | "single" => Ok(Mode::Module),
"eval" => Ok(Mode::Expression),
"ipython" => Ok(Mode::Ipython),
_ => Err(ModeParseError),
}
}
}
/// A type that can be represented as [Mode].
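///
/// # Example
///
/// A minimal sketch using the [`PySourceType`] implementation provided by this crate:
///
/// ```
/// use ruff_python_ast::PySourceType;
/// use ruff_python_parser::{AsMode, Mode};
///
/// assert_eq!(PySourceType::Python.as_mode(), Mode::Module);
/// assert_eq!(PySourceType::Stub.as_mode(), Mode::Module);
/// assert_eq!(PySourceType::Ipynb.as_mode(), Mode::Ipython);
/// ```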
pub trait AsMode {
fn as_mode(&self) -> Mode;
}
impl AsMode for PySourceType {
fn as_mode(&self) -> Mode {
match self {
PySourceType::Python | PySourceType::Stub => Mode::Module,
PySourceType::Ipynb => Mode::Ipython,
}
}
}
/// Returned when a given mode is not valid.
#[derive(Debug)]
pub struct ModeParseError;
impl std::fmt::Display for ModeParseError {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
write!(f, r#"mode must be "exec", "eval", "ipython", or "single""#)
}
}