Parsing


Theory

Noam Chomsky (whose politics I strongly disagree with) did brilliant work in helping us understand, mathematically, the nature of language.

There are really two problems with languages, though ultimately they are the same problem:

  1. Determining whether the expression is a part of the language.
  2. Deciphering the meaning of the expression.

The first step is called "parsing". The end result is something that describes how the rules of the language map onto the expression. You might be familiar with this from grammar classes in school. Parsing only identifies structure: the paragraphs and sentences; the subject, object, and verb of a sentence; the nouns, pronouns, adjectives, verbs, and adverbs; the letters within a word; the prefixes, suffixes, and conjugations of a word; and so on.

The second step is called "interpretation". The end result is something meaningful. Really, you are translating one expression to another.

NOTE: In general, you can consider all of computer programming to be "interpretation"---writing down rules to transform one thing in one form to another thing in another form.
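
To make the parse/interpret split concrete, here is a small sketch using Python's standard ast module: ast.parse performs the parsing step and yields a tree that describes structure, while compile and eval perform the interpretation step and yield a meaning (here, a number).

  import ast

  source = "1 + 2 * 3"

  # Step 1: parsing. The result describes structure, not meaning.
  tree = ast.parse(source, mode="eval")
  print(ast.dump(tree))
  # Expression(body=BinOp(left=Constant(value=1), op=Add(),
  #     right=BinOp(left=Constant(value=2), op=Mult(),
  #         right=Constant(value=3))))

  # Step 2: interpretation. The structure is translated into a value.
  print(eval(compile(tree, "<string>", "eval")))  # 7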

Definitions

Now, a few definitions for the words I used above.

Language
A language is a set of expressions. If you imagined every possible expression, a language, ultimately, would be the set of all expressions that belong to that language.
Grammar
A grammar is the set of rules that describe a language. Since listing out every possible expression in a language may be impossible, we write rules instead, and those rules become the grammar: the definition of what is and is not part of the language. Grammars can be reduced to a set of simple rules that say things like "A → B". These rules say "the thing on the left (A) can be understood as being equivalent to the thing on the right (B)." In general, we parse an expression by trying to reduce it, working backwards, until we reach a starting symbol. (See the sketch after these definitions.)
Expression
An expression is something---anything---that can be expressed. In reality, everything is an expression of some kind: the sounds we hear; the things we see on paper, on billboards, or on our monitors; even the motions and behaviors of people. Even genetic code is an expression, as are the shapes, sounds, colors, smells, and everything else we can observe or even think about. In the realm of computer science, we reduce all expressions to a sequence of numbers. Each number represents a letter of the language or something else useful. Images are reduced to bitmaps. Sounds are recorded digitally. Ultimately, everything can be reduced to a sequence of integers, and those integers can be further reduced to a sequence of bits.
Terminal Symbols (Terminals)
In grammars, a terminal symbol is a symbol that is only found on the right-hand side of rules. That is, it cannot be further reduced.
Non-Terminal Symbols (Non-Terminals)
In grammars, a non-terminal symbol is a symbol that is found on the left-hand side of at least one rule. That is, it can, and must, be further reduced.
Starting Symbol
In grammars, a starting symbol is a non-terminal symbol that is never found on the right. All expressions in a language can be reduced to one of the starting symbols. Usually, there is only one starting symbol. If there is not, just invent a new one to which each of the previous starting symbols can be reduced.
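
As a concrete sketch of these definitions, here is a tiny grammar for signed integers written as Python data (the grammar and names are invented for illustration); the terminals, non-terminals, and starting symbol fall out mechanically:

  # A tiny grammar for signed integers such as "-42" or "7".
  # Each rule maps a non-terminal to the alternatives it can produce.
  RULES = {
      "Number": [["Sign", "Digits"], ["Digits"]],
      "Sign":   [["+"], ["-"]],
      "Digits": [["Digit", "Digits"], ["Digit"]],
      "Digit":  [[str(d)] for d in range(10)],
  }

  # Every symbol that appears on the right-hand side of some rule.
  rhs = {s for alts in RULES.values() for alt in alts for s in alt}

  non_terminals = set(RULES)       # appear on the left-hand side
  terminals = rhs - non_terminals  # appear only on the right
  starting = non_terminals - rhs   # never appear on the right

  print(sorted(terminals))  # ['+', '-', '0', ..., '9']
  print(starting)           # {'Number'}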

Chomsky Types

Chomsky classified grammars into different types (the Chomsky hierarchy) based on how hard they are to describe and parse.

  • Type-0 grammars contain any language that can be described by a grammar. Note that we may never be able to fully discover the entire grammar of such a language, nor may we ever be able to parse an expression according to those rules and determine whether it is or is not an expression of the language. (Turing's halting problem comes into play here.) Every language we can imagine belongs in this category, because no language outside of it can be described. (The description itself is a grammar.)
  • Type-1 (Context-Sensitive) grammars are a subset of Type-0 grammars. They have the special property that all their rules look like:
A B C → A b C
where B is a non-terminal and b is not empty. In other words, B can be rewritten as b, but only in the context of A and C.
  • Type-2 (Context-Free) grammars describe the vast majority of programming languages. They are Type-1 grammars with the special property that all their rules look like:
A → b
where A is a single non-terminal and b is any string of terminals and non-terminals.
  • Type-3 (Regular) grammars can be expressed with simple regular expressions. (PCREs have features, such as backreferences, that make them non-regular.) They have rules that look like:
A → b C
where b is a terminal and C is a non-terminal that may be absent. (See the sketch after this list.)
  • Type-4 grammars simply list what is allowed. They have rules like:
A → b
where b is a terminal. There are no non-terminals on the right-hand side of any rule. These are enums: lists of possible expressions.
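
To illustrate the Type-3 case, here is a small sketch: the regular grammar below (invented for illustration) generates binary numerals, and an equivalent regular expression, run through Python's standard re module, recognizes exactly the same language.

  import re

  # Regular (Type-3) grammar for binary numerals:
  #   S → 0 S
  #   S → 1 S
  #   S → 0
  #   S → 1
  # Every rule is "non-terminal → terminal non-terminal" or
  # "non-terminal → terminal", so a regular expression suffices.
  BINARY = re.compile(r"[01]+")

  for text in ["1010", "2", ""]:
      ok = BINARY.fullmatch(text) is not None
      print(repr(text), "is in the language" if ok else "is not in the language")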

Types of Parsers

  • RE: regular-expression matchers, which recognize Type-3 (regular) languages.
  • LR: bottom-up parsers that scan Left-to-right and produce a Rightmost derivation; they handle a large class of context-free (Type-2) languages.
  • LALR: Look-Ahead LR parsers, a more compact variant of LR used by generators such as yacc and bison.

Implementations

Python
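
This section is otherwise a stub, so here is a minimal sketch of a hand-written recursive-descent parser in pure Python (all names are invented for illustration). The toy grammar is Expr → Digits "+" Expr | Digits, and the code both parses (checks membership in the language) and interprets (computes the sum).

  class ParseError(Exception):
      pass

  def parse(text):
      """Parse and interpret `text` as a sum of integers."""
      value, rest = parse_expr(text)
      if rest:  # the whole expression must be consumed
          raise ParseError("unexpected trailing input: " + repr(rest))
      return value

  def parse_expr(text):
      # Expr → Digits "+" Expr | Digits
      left, rest = parse_digits(text)
      if rest.startswith("+"):
          right, rest = parse_expr(rest[1:])
          return left + right, rest  # interpretation: "+" means addition
      return left, rest

  def parse_digits(text):
      # Digits → digit Digits | digit
      i = 0
      while i < len(text) and text[i].isdigit():
          i += 1
      if i == 0:
          raise ParseError("expected a digit at: " + repr(text))
      return int(text[:i]), text[i:]  # interpretation: digits are a number

  print(parse("12+34+5"))  # 51

Generator tools such as PLY (a Python implementation of lex/yacc) can build LALR parsers from a grammar, but the hand-written version above makes the grammar-to-code mapping explicit.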

See Also