Formal Grammar (Draft)


Notes:

According to Microsoft, a grammar is a structured list of words and phrases governed by rules, and a vocabulary is the set of words used in the grammar.  A speech grammar is a grammar used to recognize speech input.

Grammar as we know it is a vague abstraction of the form that governs language.  In computational linguistics, however, grammars are things: encoded templates that attempt, in myriad ways, to dissect natural language.  In effect, grammars represent different techniques, or algorithms, for the dissection of natural language.  Grammars as things are known as formal grammars.  Computational linguistics is an interdisciplinary field dealing with the computational modeling of natural language.  Psycholinguistics is the psychological study of natural language, in terms of cognitive processes and grammatical structures.  Cognitive linguistics attempts to interpret natural language in terms of concepts, sometimes universal, sometimes specific to a particular tongue.  There is probably no perfect natural language grammar, because natural language is dynamic and relative; so the study of grammars appears chaotic, akin to the mythical Tower of Babel, which perhaps explains why *natural* is always emphasized in the study of “natural language”.  By contrast, controlled language is a subset of natural language that restricts grammar and vocabulary in order to reduce or eliminate ambiguity and complexity.

So, there is an imperfect knowledge of natural language.  There is an imperfect knowledge of psychology.  And there is an incomplete knowledge of cognition, in particular of how cognition may arise from language and the psyche.  Somehow, somewhere, grammar fits into this picture.  The encoding of formal grammar templates is therefore, to date, a very imprecise science.  Traditionally there have been two main ways of attacking natural language: manually constructed rule-based models and automated statistical language models.  Today there are also combinations of the two, as in IBM Watson.  Watson actually uses many different models, or algorithms, and then scores them on probability, perhaps more like how the human mind works.  IBM Watson uses logical form (LF) alignment to score grammatical relationships, deep semantic relationships, or both.  The LF alignment algorithm first establishes tentative lexical correspondences between nodes in the source and target LFs, using translation pairs from a bilingual lexicon.  This is in contrast to the 1954 precursor of IBM Watson, the “Georgetown-IBM experiment”, which used six grammar rules and had 250 items in its vocabulary.

In fact, the very meaning of words lies in their relationships.  This is called semantics: the study of meaning, or significance, in the relationships of words.  Computational linguistics is largely concerned with the analysis and modeling of natural language.  In natural language processing, modeling generally consists of rule-based and/or statistical language models.  Generally, but not always, rule-based language models are manually constructed, or coded by hand, while statistical language models are built automatically.  Rule-based language modeling of grammar is one of the major components of natural language processing.  However, rule-based language models learned from a collection of transcriptions have, in some domains, been shown to outperform traditional n-gram language models.  Statistical language modeling provides a much easier way of dealing with the complexities of natural language in a computer.  The n-gram language model currently dominates statistical language modeling.  Because the complexity of a rule-based language model, and thus the effort to construct it manually, increases significantly with vocabulary size, statistical language models are used in large-vocabulary systems.
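
For reference, a minimal statement of the n-gram idea mentioned above: the probability of a word sequence is factored by the chain rule and then approximated by conditioning each word only on the preceding n-1 words (in LaTeX notation):

    P(w_1, \ldots, w_m) \;=\; \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})
                        \;\approx\; \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})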

Units of Language

In linguistics, the lexicon (or wordstock) of a language is its vocabulary, including its words and expressions; the word lexicon is also sometimes used as a synonym of thesaurus.  More formally, a lexicon is a language’s inventory of lexemes.  Grammar is the set of structural rules that govern the composition of clauses, phrases, and words.  Morphology is the identification, analysis, and description of the structure of a given language’s morphemes and other linguistic units.  A morpheme is the smallest semantically meaningful unit in a language.  In 1961, the logician Haskell Curry made a distinction between two levels of grammar, which he called “tectogrammatics” and “phenogrammatics”: the first concerns the study of grammatical structure in itself, and the second concerns how grammatical structure is represented in terms of expressions.

Rule-based Language Modeling (constituency relation?)

In natural language processing, an unordered collection of words, disregarding grammar, is referred to as the “bag of words” model.  Grammatical rules need to perform non-trivial semantic operations.  For instance, a “precision grammar” is a formal grammar designed to distinguish ungrammatical from grammatical sentences.  AIML is an example of a grammar in XML form.  Limitations of current human-computer interactive applications are mostly due to the use of constrained grammars.  Grammatical Framework is a programming language for writing grammars of natural languages; Multimodality Grammatical Framework can be used to write multi-modal grammars.  Grammatical Framework is a type-theoretical grammar formalism; constructive type theory derives from constructive mathematics and is based on the concepts of proof object and context.
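
As a concrete illustration of the bag-of-words idea (not any particular system’s implementation), a minimal Python sketch that discards order and grammar and keeps only word counts:

    from collections import Counter

    def bag_of_words(text):
        """Reduce a text to a 'bag of words': word counts, order and grammar discarded."""
        tokens = text.lower().split()   # naive whitespace tokenization
        return Counter(tokens)

    print(bag_of_words("the cat sat on the mat"))
    # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})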

Context-free grammars provide a mechanism for describing the methods by which phrases are built.  Context-free phrase-structure grammars (PSGs) have long been popular.  Mixed-initiative grammars allow the caller to fill in multiple voice recognition fields with a single utterance.  The Speech Recognition Grammar Specification (SRGS) is a standard for how speech recognition grammars are specified.  Speech recognition grammar formats include JSGF, the Java Speech Grammar Format (aka JSpeech Grammar Format), and GSL, Nuance’s Grammar Specification Language.  Grammars in Regulus are specified with features requiring unification, but are compiled into context-free grammars for use with the Nuance speech recognizer.  The Nuance Adaptive Grammar Engine (AGE) uses SmartListener technology to convert SRGS grammars into adaptive grammars that improve out-of-grammar recognition capabilities.  So-called template grammars are used for “defined grammar recognition”; the template grammar offers several advantages over the n-gram approach, including fewer inference steps (which results in fewer database queries).
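
To make the context-free idea concrete, here is a toy grammar and parse, a sketch assuming the NLTK toolkit rather than any of the speech grammar formats named above (SRGS and JSGF are surface formats, but the underlying phrase-structure rules look much the same):

    import nltk

    # A toy context-free grammar: each production rewrites one phrase type.
    grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the' | 'a'
    N  -> 'caller' | 'flight'
    V  -> 'books'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("the caller books a flight".split()):
        print(tree)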

Statistical Language Modeling (dependency relation?)

A statistical language model assigns probabilities to sequences of words, for instance sequential bi-grams or tri-grams, called n-grams.  N-grams are probably the most common statistical language models.  There are also concgrams, which capture the relations among any n words in a given corpus, whether or not they are sequential; concgrams make it possible to find all the words associated grammatically and semantically with a keyword.  Construction grammar models string construction, which is simpler than conventional syntax.  Semantic grammar contrasts with conventional grammars in that it relies predominantly on semantic rather than syntactic categories.  Dependency grammars are distinct from phrase structure grammars in their lack of phrasal nodes.  Link grammar is a theory of syntax which builds relations between pairs of words.
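
A minimal sketch of how bigram probabilities are estimated from counts (maximum likelihood, no smoothing; the tiny corpus is made up for illustration):

    from collections import Counter

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Count single words and adjacent word pairs (bigrams).
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def bigram_prob(w1, w2):
        """Maximum-likelihood estimate of P(w2 | w1) = count(w1 w2) / count(w1)."""
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

    print(bigram_prob("sat", "on"))   # 1.0 in this tiny corpus
    print(bigram_prob("the", "cat"))  # 0.25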

[ ConcGrams | N-gram Dialog Systems | N-gram Grammars | N-gram Transducers (NGT) | N-grams & Tag Clouds ]

Combined Rule-based & Statistical Language Models

Grammar induction is the machine learning process of learning a formal grammar from a set of observations.  A stochastic context-free grammar (SCFG), also called a probabilistic context-free grammar (PCFG), is a context-free grammar in which each production is augmented with a probability.
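
A sketch of a stochastic (probabilistic) context-free grammar, assuming the NLTK toolkit; each production carries a probability, and the probabilities of productions sharing a left-hand side sum to one:

    import nltk

    # Toy PCFG: the bracketed numbers are production probabilities.
    pcfg = nltk.PCFG.fromstring("""
    S   -> NP VP   [1.0]
    NP  -> Det N   [0.7] | N [0.3]
    VP  -> V NP    [1.0]
    Det -> 'the'   [1.0]
    N   -> 'dogs'  [0.5] | 'mailman' [0.5]
    V   -> 'chase' [1.0]
    """)

    # The Viterbi parser returns the single most probable parse tree.
    parser = nltk.ViterbiParser(pcfg)
    for tree in parser.parse("dogs chase the mailman".split()):
        print(tree)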

Parsing

Parsing, otherwise known as syntactic analysis, comes before annotation.  Syntactic parsers are based on grammatical rules, or grammars.  Parsing expression grammar is closer to how string recognition tends to be done in practice.  Grammar splitting makes partial parsing possible because each constituent can be parsed independently.  GLR (generalized LR) parsers are an extension of the LR parser algorithm for handling nondeterministic and ambiguous grammars.  Serialization is the process of converting a data structure or object state into a format that can be stored.
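
On the serialization point: a parse result is just a data structure, so it can be stored or transmitted in any serializable form.  A minimal sketch, assuming a parse tree represented as nested (label, children) pairs and JSON as the storage format:

    import json

    # A parse of "the caller books a flight" as nested (label, children) pairs.
    tree = ("S",
            [("NP", [("Det", ["the"]), ("N", ["caller"])]),
             ("VP", [("V", ["books"]),
                     ("NP", [("Det", ["a"]), ("N", ["flight"])])])])

    serialized = json.dumps(tree)       # convert to a storable string
    restored = json.loads(serialized)   # tuples come back as JSON arrays/lists
    print(serialized)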

[ APP (Apple Pie Parser) | CCG Parsers 2011 | Grammar Parsers & Dialog Systems | HPSG Parsers | LALR Parser | MaltParser | Ontology Parsers | Sentence Parsers & Dialog Systems ]

Annotation

Annotation may consist of part-of-speech (POS) tagging or tokenization.  The first step in generating the domain-specific part of a grammar is to enrich the underlying ontology with linguistic information.  Grammatical, syntactic, and semantic tags are forms of annotation, or metadata.  Named-entity recognition (NER) begins with entity identification and annotation (named entity annotation) and culminates with entity extraction.  Part-of-speech tagging is often seen as the first stage of a more comprehensive syntactic annotation; the resulting parsed corpora are known as treebanks.  There are two main types of treebanks: those that annotate phrase structure, and those that annotate dependency structure, called dependency treebanks.
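
A minimal tokenization-plus-POS-tagging sketch, assuming the NLTK toolkit and its default English tagger (the named data packages must be downloaded once):

    import nltk

    # One-time downloads (exact names may vary slightly between NLTK versions):
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("The parser annotates each word with a tag.")
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('parser', 'NN'), ('annotates', 'VBZ'), ...]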

Grammars in Dialog Systems

Finite-state transducers (FSTs) use grammars (for instance FSTGrammar) to control applications in computational linguistics.  A dialog system is, most basically, a transducer or interpreter.  The grammars in a directed dialog system will normally be small, with little complexity.  The differences between spoken and written media make most available written corpora inappropriate for speech recognition in spoken dialogue systems.  Grammar-based linguistic realizers are systems like SRI’s Gemini parser and generator, Exemplars and RealPro from CoGenTex, or the OpenCCG realizer.  The NASA Clarissa voice-operated procedure browser was built using the Regulus Grammar Compiler; the Clarissa system uses a rule-based language modeling approach for speech recognition.  Experience with rule-based spoken dialogue systems shows that there is usually substantial overlap between the structures of grammars for different domains.  Multilingual grammars formalize the idea that the same grammatical categories (such as noun phrases and verb phrases) and syntax rules (such as predication) can appear in different languages.  The multimodal dialog definition system is based on a hybrid multi-modal grammar that puts formal grammars and logical calculus together.  In 2000, Hagen and Popowich described a dialogue system based on a grammar of dialogue acts.  A dialog tree is one form of rule-based language modeling used to derive utterances in dialog systems.  Tree-adjoining grammars have rules for rewriting the nodes of trees as other trees.
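
A minimal sketch of a directed dialog driven by a dialog tree; the states, prompts, and the tiny per-state “grammars” of accepted answers are all hypothetical:

    # Each state has a prompt and a small "grammar" of accepted answers mapping
    # to the next state; out-of-grammar input leaves the state unchanged.
    DIALOG_TREE = {
        "start":   {"prompt": "Flights or hotels?",
                    "grammar": {"flights": "flights", "hotels": "hotels"}},
        "flights": {"prompt": "Which city are you flying to?", "grammar": {}},
        "hotels":  {"prompt": "Which city are you staying in?", "grammar": {}},
    }

    def step(state, utterance):
        """Match the caller's utterance against the current state's grammar."""
        options = DIALOG_TREE[state]["grammar"]
        return options.get(utterance.strip().lower(), state)

    state = "start"
    print(DIALOG_TREE[state]["prompt"])
    state = step(state, "Flights")
    print(DIALOG_TREE[state]["prompt"])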

Monrai Cypher is a dialog system which translates natural language input into RDF and SeRQL (Sesame RDF Query Language) according to grammars.  Matching grammatical structures with OWL individuals is done by translating the natural language utterance into a SPARQL query.  The real meaning of Web 3.0 is: words in XML .. grammar in RDF Schema and OWL .. rules in SWRL (Semantic Web Rule Language) .. search with SPARQL.
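
The utterance-to-SPARQL step can be pictured as template filling over a parsed question; this is only a hypothetical sketch (the ontology IRIs and the question pattern are made up, and it is not Monrai Cypher’s actual pipeline):

    # Hypothetical: map a parsed "who <verb> <object>" question onto a SPARQL query.
    def question_to_sparql(verb, obj):
        return f"""
        SELECT ?person WHERE {{
            ?person <http://example.org/onto#{verb}> <http://example.org/onto#{obj}> .
        }}"""

    print(question_to_sparql("wrote", "Hamlet"))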

[ Conversation Trees | Dialog Trees 2011 | Grammar Trees & Dialog Systems ]

Conclusion

Firstly, I’m looking for a simple filter: a yes/no grammar filter for feeds, stop for incorrect and go for correct.  Secondly, I’d like a sentence extractor for feeds, simply to extract correct English sentences.  The problem is that most simple sentence extractors work off punctuation; what is needed here is a robust grammar-checking sentence extractor (which may not exist yet).
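
One way such a yes/no filter could be approximated, sketched with the NLTK toolkit and a toy grammar: a sentence goes through only if it parses.  A real feed filter would need a wide-coverage precision grammar, which is exactly the hard part:

    import nltk

    # Toy "go/stop" grammaticality filter: pass only sentences the grammar accepts.
    toy = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> 'alice' | 'bob'
    VP -> V NP
    V -> 'likes'
    """)
    parser = nltk.ChartParser(toy)

    def is_grammatical(sentence):
        tokens = sentence.lower().split()
        try:
            return any(True for _ in parser.parse(tokens))
        except ValueError:              # a word the grammar does not cover
            return False

    print(is_grammatical("Alice likes Bob"))   # True  (go)
    print(is_grammatical("likes Alice Bob"))   # False (stop)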

[ Sentence Boundary Disambiguation & Dialog Systems | Sentence Extractor | Sentence Grammaticality | Sentence Parsers & Dialog Systems | Sentence Recognizer | Sentence Splitter 2011 ]


Resources:

  • ABL (Alignment-Based Learning) .. a grammatical inference system that learns structure from plain sequences by comparing them
  • AGG (Automated grammar generator) .. Patent US20050154580 .. a phrase chunking module operable to generate automatically at least one phrase
  • ALE (Attribute-Logic Engine) .. freeware logic programming and grammar parsing and generation system, written in Prolog
  • AMALGAM .. Automatic Mapping Among Lexico-Grammatical Annotation Models, part-of-speech (POS) tagger
  • Apoema LangBot .. defunct grammar checking API .. by @brnsantanna
  • ssw.jku.at/coco .. a compiler generator, which takes an attributed grammar of a source language and generates a scanner and a parser
  • eeggi.com .. (engineered, encyclopedic, global and grammatical identities) is “the world’s first information engine” .. (private Beta)
  • gingersoftware.com .. grammar and spell checker .. by @gingersoftware
  • ghotit.com .. grammar & spell checker .. by @ghotit
  • Grammar analyzer based on a part-of-speech tagged (POST) parser .. Patent US20030154066 .. the final grammar tree is then used to select the grammar tree with largest probability
  • wintertree-software.com/app/gramxp .. Grammar Expert Plus grammar checker
  • GRM Library .. software library for creating and modifying statistical language models
  • KPML .. system offers a robust, mature platform for large-scale grammar engineering for natural language generation
  • Link Grammar Parser .. assigns a set of labelled links connecting pairs of words, now maintained by AbiWord (formerly the Carnegie Mellon link grammar parser) .. (see RelEx extension)
  • Machinese Syntax .. provides a full analysis of texts by showing how words and concepts relate to each other in sentences (Functional Dependency Grammar parser)
  • nugram.nuecho.com .. @nugram platform is “the first – and only – complete speech grammar solution”
  • openccg.sourceforge.net .. open source natural language processing library written in Java, based on Combinatory Categorial Grammar
  • paperrater.com .. (wrote asking for grammar checker API)
  • Pâté .. a visual and interactive tool for parsing and transforming grammars
  • proofreadbot.com/api .. free online grammar and style checker with a REST server that returns reports in several formats
  • QTag .. a freely available, language independent POS-Tagger, implemented in Java
  • Regulus Grammar Compiler .. an open source platform for compiling unification grammars into Nuance compatible context free GSL grammars
  • whitesmoke.com .. English grammar checker software .. by @whitesmokeapp
  • ucrel.lancs.ac.uk/wmatrix .. a software tool for corpus analysis and comparison, providing a web interface to the UCREL USAS and CLAWS corpus annotation tools
  • XLE-Web .. interface for parsing and generating Lexical Functional Grammars, a rich GUI for writing and debugging
  • XTAG .. on-going project to develop a wide-coverage grammar for English using a lexicalized Tree Adjoining Grammar (TAG) formalism

Wikipedia:

  • Adaptive grammar .. a formal grammar that explicitly provides mechanisms within the formalism to allow its own production rules to be manipulated
  • Attribute grammar .. a formal way to define attributes for the productions of a formal grammar, associating these attributes with values
  • Bottom-up parsing .. attempts to identify the most fundamental units first, and then to infer higher-order structures
  • Case grammar .. a system of linguistic analysis, focusing on the link between the valence, or number of subjects, objects, etc., of a verb and the grammatical context it requires
  • Category:Grammar .. listing of all topics under grammar
  • Category:Grammar frameworks .. theories of grammar used to describe natural languages
  • Category:Parsing algorithms .. list of known parser types
  • Chomsky hierarchy .. a containment hierarchy of classes of formal grammars
  • Cognitive linguistics .. interprets language in terms of the concepts, sometimes universal, sometimes specific to a particular tongue
  • Combinatory categorial grammar .. generates constituency-based structures (as opposed to dependency-based ones) and is therefore a type of phrase structure grammar
  • Computational linguistics .. an interdisciplinary field dealing with the statistical or rule-based modeling of natural language
  • Constraint Grammar .. context dependent rules are compiled into a grammar that assigns grammatical tags
  • Construction grammar .. based on the idea that the primary unit of grammar is the grammatical construction rather than the atomic syntactic unit
  • Context-free grammar .. important in linguistics for describing the structure of sentences and words in natural language
  • Context-free grammars and parsers .. addresses difference between LR parsing and LL parsing within the history of compiler construction
  • Context-sensitive grammar .. a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context
  • Controlled natural language .. subsets of natural languages, obtained by restricting the grammar and vocabulary in order to reduce or eliminate ambiguity
  • Definite clause grammar .. way of expressing grammar, either for natural or formal languages, in a logic programming language such as Prolog
  • Dependency grammar .. distinct from phrase structure grammars, since dependency grammars lack phrasal nodes
  • Formal grammar .. a set of formation rules for strings in a formal language
  • Functional discourse grammar .. top-level unit of analysis in functional discourse grammar is the discourse move, not the sentence or the clause
  • Generalized phrase structure grammar .. augments syntactic descriptions with semantic annotations that can be used to compute the compositional meaning
  • GLR parser .. an extension of an LR parser algorithm to handle nondeterministic and ambiguous grammars
  • Grammar checker .. a program, or part of a program, that attempts to verify written text for grammatical correctness
  • Grammar induction .. also known as grammatical inference or syntactic pattern recognition, refers to the process in machine learning of learning a formal grammar
  • Grammatical Framework .. a programming language for writing grammars of natural languages
  • Head-driven phrase structure grammar .. the basic type HPSG deals with is the “sign” .. words and phrases are two different subtypes of “sign”
  • Higher order grammar .. HOG maintains Haskell Curry’s distinction between tectogrammatical structure (deep or abstract syntax) and phenogrammatical structure (concrete syntax)
  • JSGF .. a textual representation of grammars for use in speech recognition
  • Language model .. a statistical language model assigns a probability to a sequence of m words
  • Lexical functional grammar .. mainly focuses on syntax, including its relation with morphology and semantics
  • Lexicon .. the vocabulary of a language, including its words and expressions
  • Link grammar .. builds relations between pairs of words, rather than constructing constituents in a tree-like hierarchy
  • Mental space .. building mental spaces and mappings between those mental spaces are two main processes involved in construction of meaning
  • Operator-precedence grammar .. allows precedence relations to be defined between the terminals of the grammar
  • Optimality theory .. models grammars as systems that provide mappings from inputs to outputs
  • Parsing .. the process of analyzing a text to determine its grammatical structure with respect to a given formal grammar
  • Parsing expression grammar .. a set of rules for recognizing strings in the language
  • Phrase structure grammar .. adherence to the constituency relation, as opposed to the dependency relation
  • Production rule .. a rewrite rule specifying a symbol substitution that can be recursively performed to generate new symbol sequences
  • Psycholinguistics .. makes use of biology, neuroscience, cognitive science, linguistics, and information theory to study how the brain processes language
  • Serialization .. the process of converting a data structure or object state into a format that can be stored
  • Simple precedence grammar .. a context-free formal grammar that can be parsed with a simple precedence parser
  • Speech Recognition Grammar Specification .. standard for how speech recognition grammars are specified
  • Stochastic context-free grammar .. a context-free grammar in which each production is augmented with a probability
  • Syntax .. the study of the principles and rules for constructing phrases and sentences
  • Systemic functional grammar .. refers to language as a network of systems for making meaning, and functional refers to the view that language is as it is because of what it evolved to do
  • Tokenization .. process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements
  • Tree-adjoining grammar .. rules for rewriting the nodes of trees as other trees

References:

2008

  • Reusable grammatical resources for spatial language processing (2008) .. by Robert J. Ross

2002

  • Embedded grammar tags: advancing natural language interaction on the Web (2002) .. by Yaser Yacoob et al.

See also:

AIML & Grammars | CFG (Context-free Grammar) Parsers | Construction Grammar & Dialog Systems | FSG (Finite State Grammar) & Dialog Systems | Grammar Checkers & Dialog Systems | Grammar Compilers & Dialog Systems | Grammar Parsers & Dialog Systems | Grammar Parsing & Natural Language Generation | Grammar Trees & Dialog Systems | HPSG (Head-Driven Phrase Structure Grammar) & Dialog Systems | Link Grammar & Dialog Systems | Modular Grammar & Dialog Systems | PCFG (Probabilistic Context Free Grammar) & Dialog Systems | Semantic Grammars & Dialog Systems | Speech Grammars 2011 | SpeechBuilder Grammars | Syntactic Grammars & Dialog Systems | Template Grammars & Dialog Systems | XML Grammar & Dialog Systems