FACTOID # 150: The number of tourists in San Marino is almost 19 times the resident population.
 
 Home   Encyclopedia   Statistics   Countries A-Z   Flags   Maps   Education   Forum   FAQ   About 
 
WHAT'S NEW
RELATED ARTICLES
People who viewed "Parsing" also viewed:
RECENT ARTICLES
More Recent Articles »
 

SEARCH ALL

FACTS & STATISTICS   

Search encyclopedia, statistics and forums:

 

 

(* = Graphable)

 

 


Encyclopedia > Parsing

An example of parsing a mathematical expression.
An example of parsing a mathematical expression.

In computer science and linguistics, parsing (more formally syntactic analysis) is the process of analyzing a sequence of tokens to determine its grammatical structure with respect to a given formal grammar. A parser is the component of a compiler that carries out this task. Image File history File links Parsing-example. ... Image File history File links Parsing-example. ... Computer science, or computing science, is the study of the theoretical foundations of information and computation and their implementation and application in computer systems. ... Linguistics is the scientific study of language, which can be theoretical or applied. ... It has been suggested that this article or section be merged with Tokenizing. ... In computer science and linguistics, a formal grammar, or sometimes simply grammar, is a precise description of a formal language — that is, of a set of strings. ... A diagram of the operation of a typical multi-language, multi-target compiler. ...


Parsing transforms input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input. Lexical analysis creates tokens from a sequence of input characters and it is these tokens that are processed by a parser to build a data structure such as parse tree or abstract syntax trees. Lexical analysis is the processing of an input sequence of characters (such as the source code of a computer program) to produce, as output, a sequence of symbols called lexical tokens, or just tokens. For example, lexers for many programming languages convert the character sequence 123 abc into two tokens... A parse tree is a tree that represents the syntactic structure of a string according to some formal grammar. ... In computer science, abstract syntax tree (AST) is a data structure representing something which has been parsed, often used as a compiler or interpreters internal representation of a computer program while it is being optimized and from which code generation is performed. ...


Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of inflected languages, such as the Romance languages or Latin. Inflection of the Spanish lexeme for cat, with blue representing the masculine gender, pink representing the feminine gender, grey representing the form used for mixed-gender, and green representing the plural number. ... The Romance languages (sometimes referred to as Romanic languages) are a branch of the Indo-European language family, comprising all the languages that descend from Latin, the language of the Roman Empire. ... Latin is an ancient Indo-European language originally spoken in Latium, the region immediately surrounding Rome. ...


There are tools, called parser generators, that can automatically generate a parser (in some programming language) from a grammar written in Backus-Naur form (e.g. Yacc - yet another compiler compiler). The Backus-Naur form (BNF) (also known as Backus normal form) is a metasyntax used to express context-free grammars: that is, a formal way to describe formal languages. ... Yacc is a piece of computer software that serves as the standard parser generator on Unix systems. ...

Contents

Human languages

In some machine translation and natural language processing systems, human languages are parsed by computer programs. Human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language. In order to parse natural language data, researchers must first agree on the grammar to be used. The choice of syntax is affected by both linguistic and computational concerns; for instance some parsing systems use lexical functional grammar, but in general, parsing for grammars of this type is known to be NP-complete. Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn Treebank. Shallow parsing aims to find only the boundaries of major constituents such as noun phrases. Another popular strategy for avoiding linguistic controversy is dependency grammar parsing. Machine translation, sometimes referred to by the acronym MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. ... Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. ... Syntactic ambiguity is a property of sentences which may be parsed in more that one way. ... For the topic in theoretical computer science, see Formal grammar Grammar is the study of rules governing the use of language. ... Broadly conceived, linguistics is the study of human language, and a linguist is someone who engages in this study. ... Lexical functional grammar (LFG) is a reaction to the direction research in the area of transformational grammar began to take in the 1970s. ... In complexity theory, the NP-complete problems are the most difficult problems in NP, in the sense that they are the ones most likely not to be in P. The reason is that if you could find a way to solve an NP-complete problem quickly, then you could use... The Head-driven phrase structure grammar (HPSG) is a non-derivational generative grammar theory developed by Carl Pollard and Ivan Sag (1985). ... A treebank is a text corpus in which each sentence has been annotated with syntactic structure. ... Shallow parsing (light parsing) is an analysis of a sentence which identifies the constituents (noun groups, verbs,...), but does not specify their internal structure, nor their role in the main sentence. ... Dependency grammar (DG) is a class of syntactic theories separate from generative grammar. ...


Most modern parsers are at least partly statistical; that is, they rely on a corpus of training data which has already been annotated (parsed by hand). This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts. (See machine learning.) Approaches which have been used include straightforward PCFGs (probabilistic context free grammars), maximum entropy, and neural nets. Most of the more successful systems use lexical statistics (that is, they consider the identities of the words involved, as well as their part of speech). However such systems are vulnerable to overfitting and require some kind of smoothing to be effective. A graph of a normal bell curve showing statistics used in educational assessment and comparing various grading methods. ... As a broad subfield of artificial intelligence, machine learning is concerned with the design and development of algorithms and techniques that allow computers to learn. At a general level, there are two types of learning: inductive, and deductive. ... A stochastic context-free grammar (SCFG; also probabilistic context-free grammar, PCFG) is a context-free grammar in which each production is augmented with a probability. ... The principle of maximum entropy is a method for analyzing the available information in order to determine a unique epistemic probability distribution. ... A neural network is an interconnected group of neurons. ... In grammar, a part of speech or word class is defined as the role that a word (or sometimes a phrase) plays in a sentence. ... Noisy (roughly linear) data is fit to both linear and polynomial functions. ...


Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties as with manually-designed grammars for programming languages. As mentioned earlier some grammar formalisms are very computationally difficult to parse; in general, even if the desired structure is not context-free, some kind of context-free approximation to the grammar is used to perform a first pass. Algorithms which use context-free grammars often rely on some variant of the CKY algorithm, usually with some heuristic to prune away unlikely analyses to save time. (See chart parsing.) However some systems trade speed for accuracy using, eg, linear-time versions of the shift-reduce algorithm. A somewhat recent development has been parse reranking in which the parser proposes some large number of analyses, and a more complex system selects the best option. Context Free is a small language for design grammars and an application for developing these grammars. ... The Cocke-Younger-Kasami (CYK) algorithm (alternatively called CKY) determines whether a string can be generated by a given context-free grammar and, if so, how it can be generated. ... In computer science, besides the common use as rule of thumb (see heuristic), the term heuristic has two well-defined technical meanings. ... A chart parser is a type of parser commonly used for natural languages that uses a data-driven approach based on a set of grammatical rules and a dictionary with each of the possible grammatical senses of each word indicated. ...


Programming languages

The most common use of a parser is as a component of a compiler. This parses the source code of a computer programming language to create some form of internal representation. Programming languages tend to be specified in terms of a context-free grammar because fast and efficient parsers can be written for them. Parsers are usually not written by hand but are generated by parser generators. A diagram of the operation of a typical multi-language, multi-target compiler. ... Source code (commonly just source or code) is any series of statements written in some human-readable computer programming language. ... An alternate rewrite has been has been proposed. ... In linguistics and computer science, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form V → w where V is a nonterminal symbol and w is a string consisting of terminals and/or non-terminals. ... A compiler-compiler or parser generator is a utility for generating the source code of a parser, interpreter or compiler from an annotated language description in the form of a grammar (usually in BNF) plus code that is associated with each of the rules of the grammar that should be...


Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a language is limited. The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out.


Overview of process

The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.


The first stage is the token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^ and 2, each of which is a meaningful symbol in the context of an arithmetic expression. The parser would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated. Lexical analysis is the processing of an input sequence of characters (such as the source code of a computer program) to produce, as output, a sequence of symbols called lexical tokens, or just tokens. For example, lexers for many programming languages convert the character sequence 123 abc into two tokens... In computing, a regular expression is a string that is used to describe or match a set of strings, according to certain syntax rules. ...


The next stage is syntactic parsing or syntactic analysis, which is checking that the tokens form an allowable expression. This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with attribute grammars. In linguistics and computer science, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form V → w where V is a nonterminal symbol and w is a string consisting of terminals and/or non-terminals. ... This article is in need of attention from an expert on the subject. ...


The final phase is semantic parsing or analysis, which is working out the implications of the expression just validated and taking the appropriate action. In the case of a calculator, the action is to evaluate the expression; a compiler, on the other hand, would generate code. Attribute grammars can also be used to define these actions. In computer science, semantic analysis is a pass by a compiler that adds semantical information to the parse tree and performs certain checks based on this information. ...


Types of parsers

The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways:

  • Top-down parsing - A parser can start with the start symbol and try to transform it to the input. Intuitively, the parser starts from the largest elements and breaks them down into incrementally smaller parts. LL parsers are examples of top-down parsers.
  • Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is Shift-Reduce parsing.

Another important distinction is whether the parser generates a leftmost derivation or a rightmost derivation (see context-free grammar). LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation (although usually in reverse). This article or section does not cite its references or sources. ... An LL parser is a top-down parser for a subset of the context-free grammars. ... Bottom-up parsing is a strategy for analyzing unknown data relationships that attempts to identify the most fundamental units first, and then to infer higher-order structures from them. ... In computer science, an LR parser is a parser for context-free grammars that reads input from Left to right and produces a Rightmost derivation. ... In linguistics and computer science, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form V → w where V is a nonterminal symbol and w is a string consisting of terminals and/or non-terminals. ... There are several meanings of derivation: A derivation in abstract algebra is a linear map that satisfies Leibniz law. ...


Examples of parsers

Top-down parsers

Some of the parsers that use top-down parsing include: This article or section does not cite its references or sources. ...

A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar. ... An LL parser is a top-down parser for a subset of the context-free grammars. ... In computing, a packrat parser is a form of parser similar to a recursive descent parser in construction, except that during the parsing process it memoizes the intermediate results of all invocations of the mutually recursive parsing functions. ... There are very few or no other articles that link to this one. ... The Earley parser is a type of chart parser mainly used for parsing in computational linguistics, named after its inventor, Jay Earley. ...

Bottom-up parsers

Some of the parsers that use bottom-up parsing include: Bottom-up parsing is a strategy for analyzing unknown data relationships that attempts to identify the most fundamental units first, and then to infer higher-order structures from them. ...

An operator precedence parser is a computer program that interprets an operator-precedence grammar. ... The introduction of this article does not provide enough context for readers unfamiliar with the subject. ... In computer science, an LR parser is a parser for context-free grammars that reads input from Left to right and produces a Rightmost derivation. ... A Simple LR parser or SLR parser is an LR parser for which the parsing tables are generated as for an LR(0) parser except that it only performs a reduction with a grammar rule A → w if the next symbol on the input stream is in the follow... Look-Ahead LR parsers or LALR parsers are a specialized form of LR parsers that can deal with more context-free grammars than Simple LR parsers but less than LR(1) parsers can. ... A canonical LR parser or LR(1) parser is an LR parser whose parsing tables are constructed in a similar way as with LR(0) parsers except that the items in the item sets also contain a follow, i. ... In computer science, a GLR parser (Generalized LR parser) is an extension of an LR parser algorithm to handle nondeterministic and ambiguous grammars. ... The Cocke-Younger-Kasami (CYK) algorithm (alternatively called CKY) determines whether a string can be generated by a given context-free grammar and, if so, how it can be generated. ...

See also

Parsing concepts

A chart parser is a type of parser commonly used for natural languages that uses a data-driven approach based on a set of grammatical rules and a dictionary with each of the possible grammatical senses of each word indicated. ... A compiler-compiler or parser generator is a utility for generating the source code of a parser, interpreter or compiler from an annotated language description in the form of a grammar (usually in BNF) plus code that is associated with each of the rules of the grammar that should be... In natural language processing, deterministic parsing refers to parsing algorithms that are deterministic in the sense that given the input, the parser always knows which action to take. ... Lexical analysis is the process of taking an input string of characters (such as the source code of a computer program) and producing a sequence of symbols called lexical tokens, or just tokens, which may be handled more easily by a parser. ... Shallow parsing (light parsing) is an analysis of a sentence which identifies the constituents (noun groups, verbs,...), but does not specify their internal structure, nor their role in the main sentence. ...

Parser Development Software

See also: List of Parsers comparison table. This is a list of notable parsing systems. ...


Free Software

The introduction to this article provides insufficient context for those unfamiliar with the subject matter. ... GNU bison is a free parser generator computer program written for the GNU project, and available for virtually all common operating systems. ... Coco/R is a compiler generator (Compiler-compiler), which takes an attributed grammar of a source language and generates a scanner and a parser for this language. ... GOLD is a freeware parsing system that was designed to support multiple programming languages. ... JavaCC (Java Compiler Compiler) is a parser generator for the Java programming language. ... Lemon is an LALR parser generator for C or C++. It does the same job as GNU bison and yacc. ... lex is a program that generates lexical analyzers (scanners or lexers). Lex is commonly used with the yacc parser generator. ... REBOL, the Relative Expression Based Object Language (pronounced [rebl]), is a data exchange and programming language designed specifically for network communications and distributed computing. ... SableCC is a framework for creating compilers and interpreters in Java. ... The Spirit Parser Framework is an object oriented recursive descent parser generator framework implemented using template metaprogramming techniques. ... yacc is a computer program that serves as the standard parser generator on Unix systems. ...

Wikimedia

Commercial

  • LRGen
  • Visual BNF

References


  Results from FactBites:
 
Parsing Techniques (2009 words)
Parsing a sentence means to compute the structural description (descriptions) of the sentence assigned by a grammar, assuming, of course, that the sentence is well-formed.
Thus in a TAG the real parse of a sentence is the so-called derivation tree, which is a record of how the elementary trees of a TAG are put together by the operations of substitution and adjoining in order to obtain the derived tree whose yield is the string being parsed.
For a CCG the parse of a sentence is the proof tree of the derivation.
Parsing - Wikipedia (1214 words)
In computer science, parsing is the process of analyzing an input sequence (read from a file or a keyboard, for example) in order to determine its grammatical structure with respect to a given formal grammar.
Parsing transforms input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input.
Parsing is also an earlier term for diagraming of sentences in grammar of the English language, and is still used to digram the grammar of inflected languages, such as the Romance languages or Latin.
  More results at FactBites »

 

COMMENTARY     


Share your thoughts, questions and commentary here
Your name
Your location
Your comments
Please enter the 5-letter protection code


Lesson Plans | Student Area | Student FAQ | Reviews | Press Releases |  Feeds | Contact
The Wikipedia article included on this page is licensed under the GFDL.
Images may be subject to relevant owners' copyright.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.
Usage implies agreement with terms.