|
A parsing expression grammar, or PEG, is a type of analytic formal grammar that describes a formal language in terms of a set of rules for recognizing strings in the language. A parsing expression grammar essentially represents a recursive descent parser in a pure schematic form that only expresses syntax and is independent of the way an actual parser might be implemented or what it might be used for. Parsing expression grammars look similar to regular expressions or context-free grammars (CFG) in Backus-Naur form (BNF) notation, but have a different interpretation. In computer science a formal grammar is an abstract structure that describes a formal language precisely, i. ...
In computer science and linguistics, a formal grammar, or sometimes simply grammar, is a precise description of a formal language â that is, of a set of strings. ...
In mathematics, logic, and computer science, a formal language is a language that is defined by precise mathematical or machine processable formulas. ...
In computer programming and formal language theory, (and other branches of mathematics), a string is an ordered sequence of symbols. ...
A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar. ...
In computing, a regular expression is a string that is used to describe or match a set of strings, according to certain syntax rules. ...
In linguistics and computer science, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form V â w where V is a nonterminal symbol and w is a string consisting of terminals and/or non-terminals. ...
The Backus-Naur form (BNF) (also known as Backus normal form) is a metasyntax used to express context-free grammars: that is, a formal way to describe formal languages. ...
Unlike CFGs, PEGs are not ambiguous; if a string parses, it has exactly one valid parse tree. This suits PEGs well to parsing computer languages, but not natural languages. In computer science, a grammar is said to be an ambiguous grammar if there is some string that it can generate in more than one way (i. ...
A parse tree or concrete syntax tree is a tree that represents the syntactic structure of a string according to some formal grammar. ...
The term natural language is used to distinguish languages spoken and signed (by hand signals and facial expressions) by humans for general-purpose communication from constructs such as writing, computer-programming languages or the languages used in the study of formal logic, especially mathematical logic. ...
Definition
Formally, a parsing expression grammar consists of: Each parsing rule in P has the form A ← e, where A is a nonterminal symbol and e is a parsing expression. A parsing expression is a hierarchical expression similar to a regular expression, which is constructed in the following fashion: A nonterminal symbol is that symbol which has the capability of being further defined in terms of terminals and/or non-terminals. ...
A terminal symbol, in BNF jargon, is a symbol that represents a constant value. ...
An expression is a combination of numbers, operators, grouping symbols (such as brackets and parentheses) and/or free variables and bound variables arranged in a meaningful way which can be evaluated. ...
In computing, a regular expression is a string that is used to describe or match a set of strings, according to certain syntax rules. ...
- An atomic parsing expression consists of:
- any terminal symbol,
- any nonterminal symbol, or
- the empty string ε.
- Given any existing parsing expressions e, e1, and e2, a new parsing expression can be constructed using the following operators:
- Sequence: e1 e2
- Ordered choice: e1 / e2
- Zero-or-more: e*
- One-or-more: e+
- Optional: e?
- And-predicate: &e
- Not-predicate: !e
Unlike in a context-free grammar or other generative grammars, in a parsing expression grammar there must be exactly one rule in the grammar having a given nonterminal on its left-hand side. That is, rules act as definitions in a PEG, and each nonterminal must have one and only one definition. In linguistics and computer science, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form V â w where V is a nonterminal symbol and w is a string consisting of terminals and/or non-terminals. ...
It has been suggested that this article or section be merged with Generative linguistics. ...
Interpretation of parsing expressions Each nonterminal in a parsing expression grammar essentially represents a parsing function in a recursive descent parser, and the corresponding parsing expression represents the "code" comprising the function. Each parsing function conceptually takes an input string as its argument, and yields one of the following results: Graph of example function, The mathematical concept of a function expresses the intuitive idea of deterministic dependence between two quantities, one of which is viewed as primary (the independent variable, argument of the function, or its input) and the other as secondary (the value of the function, or output). A...
A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar. ...
- success, in which the function may optionally move forward or "consume" one or more characters of the input string supplied to it, or
- failure, in which case no input is consumed.
A nonterminal may succeed without actually consuming any input, and this is considered an outcome distinct from failure. An atomic parsing expression consisting of a single terminal succeeds if the first character of the input string matches that terminal, and in that case consumes the input character; otherwise the expression yields a failure result. An atomic parsing expression consisting of the empty string always trivially succeeds without consuming any input. An atomic parsing expression consisting of a nonterminal A represents a recursive call to the nonterminal-function A. A visual form of recursion known as the Droste effect. ...
The sequence operator e1 e2 first invokes e1, and if e1 succeeds, subsequently invokes e2 on the remainder of the input string left unconsumed by e1, and returns the result. If either e1 or e2 fails, then the sequence expression e1 e2 fails. The choice operator e1 / e2 first invokes e1, and if e1 succeeds, returns its result immediately. Otherwise, if e1 fails, then the choice operator backtracks to the original input position at which it invoked e1, but then calls e2 instead, returning e2's result. Backtracking is a type of algorithm that is a refinement of brute force search. ...
The zero-or-more, one-or-more, and optional operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression e, respectively. Unlike in context-free grammars and regular expressions, however, these operators always behave greedily, consuming as much input as possible and never backtracking. (Regular expressions start by matching greedily, but they backtrack and try shorter matches if they fail to match.) For example, the expression a* will always consume as many a's as are consecutively available in the input string, and the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match. In linguistics and computer science, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form V â w where V is a nonterminal symbol and w is a string consisting of terminals and/or non-terminals. ...
In computing, a regular expression is a string that is used to describe or match a set of strings, according to certain syntax rules. ...
The greedy algorithm determines the minimum number of US coins to give while making change. ...
Finally, the and-predicate and not-predicate operators implement syntactic predicates. The expression &e invokes the sub-expression e, and then succeeds if e succeeds and fails if e fails, but in either case never consumes any input. Conversely, the expression !e succeeds if e fails and fails if e succeeds, again consuming no input in either case. Because these can use an arbitrarily complex sub-expression e to "look ahead" into the input string without actually consuming it, they provide a powerful syntactic lookahead and disambiguation facility. A syntactic predicate specifies the syntactic validity of applying a production in a formal grammar and is analogous to a semantic predicate that specifies the semantic validity of applying a production. ...
Lookahead is a tool in algorithms for looking ahead a few more input items before making a cost effective decision at one stage of the algorithm. ...
Examples This is a PEG that recognizes mathematical formulas that apply the basic four operations to non-negative integers. - Value ← [0-9]+ / '(' Expr ')'
- Product ← Value (('*' / '/') Value)*
- Sum ← Product (('+' / '-') Product)*
- Expr ← Sum
In the above example, the terminal symbols are characters of text, represented by characters in single quotes, such as '(' and ')'. The range [0-9] is also a shortcut for ten characters, indicating any one of the digits 0 through 9. (This range syntax is the same as the syntax used by regular expressions.) The nonterminal symbols are the ones that expand to other rules: Value, Product, Sum, and Expr. In computing, a regular expression is a string that is used to describe or match a set of strings, according to certain syntax rules. ...
The examples below drop quotation marks in order to be easier to read. Lowercase letters are terminal symbols, while capital letters in italics are nonterminals. Actual PEG parsers would require the lowercase letters to be in quotes. The parsing expression (a/b)* matches and consumes an arbitrary-length sequence of a's and b's. The rule S ← a S? b describes the simple context-free "matching language" . The following parsing expression grammar describes the classic non-context-free language : The introduction to this article provides insufficient context for those unfamiliar with the subject matter. ...
- S ← &(A !b) a+ B !c
- A ← a A? b
- B ← b B? c
The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a context-free grammar, this construct yields the classic dangling else ambiguity.) In linguistics and computer science, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form V â w where V is a nonterminal symbol and w is a string consisting of terminals and/or non-terminals. ...
The dangling else is a well-known problem in computer programming in which a seemingly well-defined grammar can become ambiguous. ...
- Emo Philips A word, phrase, sentence, or other communication is called ambiguous if it can be reasonably interpreted in more than one way. ...
- S ← if C then S else S / if C then S
The parsing expression foo &(bar) matches and consumes the text "foo" but only if it is followed by the text "bar". The parsing expression foo !(bar) matches the text "foo" but only if it is not followed by the text "bar". The expression !(a+ b) a matches a single "a" but only if it is not the first in an arbitrarily-long sequence of a's followed by a b. The following recursive rule matches Pascal-style nested comment syntax, (* which can (* nest *) like this *). The comment symbols appear in double quotes to distinguish them from PEG operators. - Begin ← "(*"
- End ← "*)"
- C ← Begin N* End
- N ← C / (!Begin !End Z)
- Z ← any single character
Implementing parsers from parsing expression grammars Any parsing expression grammar can be converted directly into a recursive descent parser[citation needed]. Due to the unlimited lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time performance in the worst case. A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar. ...
Lookahead is a tool in algorithms for looking ahead a few more input items before making a cost effective decision at one stage of the algorithm. ...
In complexity theory, exponential time is the computation time of a problem where the time to complete the computation, m(n), is bounded by an exponential function of the problem size, n (i. ...
By memoizing the results of intermediate parsing steps and ensuring that each parsing function is only invoked at most once at a given input position, however, it is possible to convert any parsing expression grammar into a packrat parser, which always runs in linear time at the cost of substantially greater storage space requirements. Memoization is a technique used to speed up computer programs by storing the results of functions for later reuse, rather than recomputing them. ...
In computational complexity, an algorithm is said to take linear time, or O(n) time, if the time it requires is proportional to the size of the input, which is usually denoted n. ...
A packrat parser[1] is a form of parser similar to a recursive descent parser in construction, except that during the parsing process it memoizes the intermediate results of all invocations of the mutually recursive parsing functions. Because of this memoization, a packrat parser has the ability to parse many context-free grammars and any parsing expression grammar (including some that do not represent context-free languages) in linear time. An example of parsing a mathematical expression. ...
A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar. ...
Memoization is a technique used to speed up computer programs by storing the results of functions for later reuse, rather than recomputing them. ...
Mutual recursion is a form of recursion where two mathematical or computational functions are defined in terms of each other. ...
Memoization is a technique used to speed up computer programs by storing the results of functions for later reuse, rather than recomputing them. ...
In linguistics and computer science, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form V â w where V is a nonterminal symbol and w is a string consisting of terminals and/or non-terminals. ...
In computational complexity, an algorithm is said to take linear time, or O(n) time, if the time it requires is proportional to the size of the input, which is usually denoted n. ...
It is also possible to build LL parsers and LR parsers from parsing expression grammars, but the unlimited lookahead capability of the grammar formalism is lost in this case. An LL parser is a top-down parser for a subset of the context-free grammars. ...
In computer science, an LR parser is a parser for context-free grammars that reads input from Left to right and produces a Rightmost derivation. ...
Advantages PEGs make a good replacement for regular expressions, because they are strictly more powerful. For example, a regular expression inherently cannot find matched pairs of parentheses, because it is not recursive, but a PEG can. A regular expression (abbreviated as regexp, regex or regxp) is a string that describes or matches a set of strings, according to certain syntax rules. ...
Any PEG can be parsed in linear time by using a packrat parser, as described above. Parsers for languages expressed as a CFG, such as LR parsers, require a separate [[toke--61.2.50.96 12:28, 4 September 2007 (UTC)nization]] step to be done first, which breaks up the input based on the location of spaces, punctuation, etc. The tokenization is necessary because of the way these parsers use lookahead to parse CFGs that meet certain requirements in linear time. PEGs do not require tokenization to be a separate step, and tokenization rules can be written in the same way as any other grammar rule. In computer science, an LR parser is a parser for context-free grammars that reads input from Left to right and produces a Rightmost derivation. ...
Many CFGs contain inherent ambiguities, even when they're intended to describe unambiguous languages. The "dangling else" problem in C, C++, and Java is one example. These problems are often resolved by applying a rule outside of the grammar. In a PEG, these ambiguities never arise, because of prioritization. The dangling else is a well-known problem in computer programming in which a seemingly well-defined grammar can become ambiguous. ...
Disadvantages PEGs are new and not widely implemented. In contrast, regular expressions and CFGs have been around for decades, the code to parse them has been extensively optimized, and many programmers are familiar with how to use them. PEGs cannot express left-recursive rules, where a rule refers to itself without moving forward in the string. For example, in the arithmetic grammar above, it would be tempting to move some rules around so that the precedence order of products and sums could be expressed in one line: In computer science, left recursion is a special case of recursion. ...
- Value ← [0-9.]+ / '(' Expr ')'
- Product ← Expr (('*' / '/') Expr)*
- Sum ← Expr (('+' / '-') Expr)*
- Expr ← Product / Sum / Value
The problem with this is that it says that to match an Expr, you need to first see if a Product matches there, and to match a Product, you need to first see if an Expr matches there. This is not possible. However, left-recursive rules can always be rewritten to eliminate left-recursion. For example, a left-recursive rule can repeat a certain expression indefinitely, as in the CFG rule: - string-of-a ← string-of-a 'a' | 'a'
This can be rewritten in a PEG using the plus operator: - string-of-a ← 'a'+
PEGs are also associated with packrat parsing, which requires storage proportional to the total input size, rather than the depth of the parse tree as with LR parsers.[1]
PEG parser generators There are several programs that can be used to create a PEG parser: - PGE, a grammar engine that emits Perl 6-style rules as Parrot virtual machine bytecode.
- Pappy, a parser generator for Haskell.
- Frisby, another parser generator for Haskell.
- Rats!, a parser generator for Java.
- Narwhal, a parser generator for C.
- grammar::peg, a TCL library.
- cl-peg, a Common Lisp library.
- packrat, a packrat parser library for CHICKEN Scheme
- Lpeg, a Lua library.
- PyPy rlib parsing, a packrat parser generator for Python. (source)
- peg and leg, parser generators for C with a yacc-like interface.
- [1], YARD a public-domain PEG parsing framework that uses C++ templates to generate parsers at compile-time.
Some of these parser generators build semantic actions into the PEG syntax: after an expression, you can write a segment of code that specifies what value that expression should have. Others output an abstract syntax tree, which you can manipulate to arrive at a value. The Parser Grammar Engine (originally Parrot Grammar Engine) or PGE is a compiler and runtime for a Perl 6 rules for the Parrot virtual machine. ...
Perl 6 rules are Perl 6s regular expression, pattern matching and general-purpose parsing facility, and are a core part of the language. ...
Parrot is a register-based virtual machine being developed using the C programming language and intended to run dynamic languages efficiently. ...
Bytecode is a binary representation of an executable program designed to be executed by a virtual machine rather than by dedicated hardware. ...
yacc is a computer program that serves as the standard parser generator on Unix systems. ...
In computer science, an abstract syntax tree (AST) is a finite, labeled, directed tree, where the internal nodes are labeled by operators, and the leaf nodes represent the operands of the node operators. ...
See also In computer science and linguistics, a formal grammar, or sometimes simply grammar, is a precise description of a formal language â that is, of a set of strings. ...
A regular expression (abbreviated as regexp, regex or regxp) is a string that describes or matches a set of strings, according to certain syntax rules. ...
REBOL, the Relative Expression Based Object Language (pronounced [rebl]), is a data exchange and programming language designed specifically for network communications and distributed computing. ...
Top-Down Parsing Language (TDPL) is a type of analytic formal grammar developed by Alexander Birman in the early 1970s in order to study formally the behavior of a common class of practical top-down parsers that support a limited form of backtracking. ...
References - ^ a b Ford, Bryan (September 2002). Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking. Massachusetts Institute of Technology. Retrieved on 2007-07-27.
Year 2007 (MMVII) is the current year, a common year starting on Monday of the Gregorian calendar and the AD/CE era. ...
is the 208th day of the year (209th in leap years) in the Gregorian calendar. ...
External links |