Notes:
According to Microsoft, a grammar is a structured list of words and phrases that are governed by rules, and a vocabulary is the set of words used in the grammar. A speech grammar is a grammar that recognizes speech inputs.
Grammar as we know it is a vague abstraction of the form that governs language. However, in computational linguistics, grammars are things, encoded templates attempting, in myriad ways, to dissect natural language. In effect, grammars represent different techniques, or algorithms, for the dissection of natural language. These grammars as things are know as formal grammars. Computational linguistics is considered an interdisciplinary field dealing with the computational modeling of natural language. Psycholinguistics is the psychological study of natural language, in terms of cognitive processes and grammatical structures. Cognitive linguistics attempts to interpret natural language in terms of the concepts, sometimes universal, sometimes specific to a particular tongue. Probably there is no perfect natural language grammar, because natural language is dynamic and relative; so, the study of grammars appears chaotic, akin to the mythical Tower of Babel, perhaps explaining why *natural* is always emphasized in the study of “natural language”. By contrast, controlled language is a subset of natural language, which restricts grammar and vocabulary in order to reduce or eliminate ambiguity and complexity.
So, there is an imperfect knowledge of natural language. There is an imperfect knowledge of psychology. And, there is an incomplete knowledge of cognition, in particular how cognition may arise from language and the psyche. Somehow, somewhere, grammar fits into this picture. And so, the encoding of formal grammar templates is to date a very imprecise science. Traditionally there have been two main ways of attacking natural language, manually constructed rule-based models and automated statistical language models. Today, there are also combinations of the two, such as in IBM Watson. Watson actually uses many different models, or algorithms, and then scores them on probability, maybe more like how the human mind works. IBM Watson uses logical form alignment to score on grammatical relationships, deep semantic relationships or both. The LF alignment algorithm first establishes tentative lexical correspondences between nodes in the source and target LFs using translation pairs from a bilingual lexicon. This is in contrast to the 1954 precursor of IBM Watson, the “Georgetown-IBM experiment” used six grammar rules and had 250 items in its vocabulary.
In fact, the very meaning of words lies in their relationship. This is called semantics, the study of meaning, or significance, in the relationship of words. Computational linguistics is largely concerned with the analysis and modeling of natural language. In natural language programming, modeling is generally comprised of rule-based and/or statistical language models. Generally, but not always, rule-based language models are manually constructed, or coded by hand, and statistical language models are automated. Rule-based language modeling of grammar is one of the major components of natural language processing. However, rule-based language models learned from a collection of transcriptions have been shown to outperform traditional n-gram language models. Statistical language modeling provides a much easier way of dealing with the complexities of natural language in computer. The N-gram language model currently dominates statistical language modeling. As the complexity of a rule-based language model, and thus the effort to manually create such a model, increases significantly with the vocabulary size, statistical language models are used in large-vocabulary systems.
Units of Language
In linguistics, the lexicon (or wordstock) of a language is its vocabulary, including its words and expressions. A lexicon is also a synonym of the word thesaurus. More formally, it is a language’s inventory of lexemes. Grammar is the set of structural rules that govern the composition of clauses, phrases, and words. Morphology is the identification, analysis and description of the structure of a given language’s morphemes and other linguistic units. A morpheme is the smallest semantically meaningful unit in a language. In 1961, the logician Haskell Curry made a distinction between two levels of grammar, which he called “tectogrammatics” and “phenogrammatics”, where the first concerns “the study of grammatical structure in itself” and the latter – how grammatical structure is represented in terms of expressions.
Rule-based Language Modeling (constituency relation?)
In natural language programming, an unordered collection of words, disregarding grammar, is referred to as the “bag of words” model. Grammatical rules need to perform non-trivial semantic operations. For instance, a “precision grammar” is a formal grammar designed to distinguish ungrammatical from grammatical sentences. AIML is an example of a grammar in XML form. Limitations of current human-computer interactive applications are mostly due to the use of constrained grammars. Grammatical Framework is a programming language for writing grammars of natural languages. Multimodality Grammatical Framework can be used to write multi-modal grammars. Grammatical Framework is a type-theoretical grammar. Constructive type theory is derived from constructive mathematics, based on concepts of proof object and context.
Context-free grammars provide a mechanism for describing methods by which phrases are built. Context-free phrase-structure grammars (PSGs) have long been popular. Mixed-initiative grammars allow the caller to fill in multiple voice recognition fields with a single utterance. The Speech Recognition Grammar Specification (SRGS) is a standard for how speech recognition grammars are specified. Speech recognition grammars include JSGF, the Java Speech Grammar Format (aka JSpeech Grammar Format), and GSL, Nuance’s Grammar Specification Language. Grammars in Regulus are specified with features requiring unification, but are compiled into context-free grammars for use with the Nuance speech recognition. The Nuance Adaptive Grammar Engine (AGE) uses SmartListener technology to convert SRGS grammars into adaptive grammars that improve out-of-grammar recognition capabilities. Template grammar offers several other advantages over the n-gram approach, including fewer inference steps (which results in fewer database queries). So-called template grammars are used for “defined grammar recognition”.
Statistical Language Modeling (dependency relation?)
A statistical language model assigns probabilities to sequences of words, for instance sequential bi-grams or tri-grams, called n-grams. N-grams are probably the common statistical language models. There are also concgrams, which represent the relation of any non-sequential n-words in a given corpus. Concgrams allow us to find all the words associated grammatically and semantically with the keyword by association. Construction grammar models string construction, which is simpler than conventional syntax. Semantic grammar is contrasted with conventional grammars as it relies predominantly on semantic rather than syntactic categories. Dependency grammars are distinct from phrase structure grammars, in the lack of phrasal nodes. Link grammar is a theory of syntax which builds relations between pairs of words.
[ ConcGrams | N-gram Dialog Systems | N-gram Grammars | N-gram Transducers (NGT) | N-grams & Tag Clouds ]
Combined Rule-based & Statistical Language Models
Grammar induction is the machine learning process of learning a formal grammar from a set of observations. A stochastic context-free grammar (SCFG), also called a probabilistic context-free grammar (PCFG), is a context-free grammar in which each production is augmented with a probability.
Parsing
Parsing, otherwise known as syntactic analysis, comes before annotation. Syntactic parsers are based on grammatical rules, or grammars. Parsing expression grammar is closer to how string recognition tends to be done in practice. Grammar splitting makes partial parsing possible because each constituent can be parsed independently. GLR, or generalized left to right, parsers are an extension of the LR, or left to right, parser algorithm for handling non-deterministic and more ambiguous grammars. Serialization is the process of converting a data structure or object state into a format that can be stored.
[ APP (Apple Pie Parser) | CCG Parsers 2011 | Grammar Parsers & Dialog Systems | HPSG Parsers | LALR Parser | MaltParser | Ontology Parsers | Sentence Parsers & Dialog Systems ]
Annotation
Annotation may consist of part-of-speech (POS) tagging or tokenization. The first step in generating the domain-specific part of a grammar is to enrich the underlying ontology with linguistic information. Grammatical, syntactic or semantic tags are a form of semantic annotation, or metadata. Named-entity recognition (NER) begins with entity identification and annotation (Named Entity Annotation) and culminates with entity extraction. Part-of-speech tagging is often seen as the first stage of a more comprehensive syntactic annotation. The resulting parsed corpora are known as treebanks. There are two main types of treebanks: treebanks that annotate phrase structure, and those that annotate dependency structure, called dependency treebanks.
Grammars in Dialog Systems
Finite-state transducers (FST) use grammars (for instance FSTGrammar) to control applications in computational linguistics. A dialog system is most basically a transducer or interpreter. The grammars in a directed dialog system will normally be small with little complexity. The differences between spoken and written media make most available written corpora inappropriate for speech recognition in spoken dialogue systems. Grammar-based linguistic realizers are systems like SRI ‘s Gemini Parser and Generator, Exemplars and Real Pro from CogenTex or the openCCG Realizer. The NASA Clarissa voice-operated procedure browser was built using the Regulus Grammar Compiler. The Clarissa system uses a rule-based language modeling approach for speech recognition. Experience with rule-based spoken dialogue systems shows that there is usually a substantial overlap between the structures of grammars for different domains. Multilingual grammars formalize the idea that the same grammatical categories (such as noun phrases and verb phrases) and syntax rules (such as predication) can appear in different languages. The multimodal dialog definition system is based on a hybrid multi-modal grammar that puts formal grammars and logical calculus together. In 2000, Hagen and Popowich described a dialogue system based on a grammar of dialogue acts. A dialog tree is one form of rule-based language modeling used to derive utterances in dialog systems. Tree-adjoining grammars have rules for rewriting the nodes of trees as other trees.
Monrai Cypher is a dialog system which translates natural language input into RDF and SeRQL (Sesame RDF Query Language) according to grammars. Matching grammatical structures with OWL individuals is done by translating the natural language utterance into a SPARQL query. The real meaning of Web 3.0 is: Words in XML .. Grammar in RDF (scheme) and OWL .. Rules in SWRL (Semantic Web Rule Language) .. Search with SPARQL.
[ Conversation Trees | Dialog Trees 2011 | Grammar Trees & Dialog Systems ]
Conclusion
Firstly, I’m looking for a simple filter, a yes/no grammar filter for feeds, stop for incorrect and go for correct. Secondly, I’d like a sentence extractor for feeds, simply to extract correct English sentences. The problem is that most simple sentence extractors work off of punctuation; what is needed here is a robust grammar checking sentence extractor (which may not exist yet).
[ Sentence Boundary Disambiguation & Dialog Systems | Sentence Extractor | Sentence Grammaticality | Sentence Parsers & Dialog Systems | Sentence Recognizer | Sentence Splitter 2011 ]
Resources:
- ABL (Alignment-Based Learning) .. a grammatical inference system that learns structure from plain sequences by comparing them
- AGG (Automated grammar generator) .. Patent US20050154580 .. a phrase chunking module operable to generate automatically at least one phrase
- ALE (Attribute-Logic Engine) .. freeware logic programming and grammar parsing and generation system, written in Prolog
- AMALGAM .. Automatic Mapping Among Lexico-Grammatical Annotation Models, part-of-speech (POS) tagger
- Apoema LangBot .. defunct grammar checking API .. by @brnsantanna
- ssw.jku.at/coco .. a compiler generator, which takes an attributed grammar of a source language and generates a scanner and a parser
- eeggi.com .. (engineered, encyclopedic, global and grammatical identities) is “the world’s first information engine” .. (private Beta)
- gingersoftware.com .. grammar and spell checker .. by @gingersoftware
- ghotit.com .. grammar & spell checker .. by @ghotit
- Grammar analyzer based on a part-of-speech tagged (POST) parser .. Patent US20030154066 .. the final grammar tree is then used to select the grammar tree with largest probability
- wintertree-software.com/app/gramxp .. Grammar Expert Plus grammar checker
- GRM Library .. software library for creating and modifying statistical language models
- KPML .. system offers a robust, mature platform for large-scale grammar engineering for natural language generation
- Link Grammar Parser .. assigns a set of labelled links connecting pairs of words, now maintained by AbiWord (formerly Carnergie Mellon link grammar parser) .. (see RelEx extension)
- Machinese Syntax .. provides a full analysis of texts by showing how words and concepts relate to each other in sentences (Functional Dependency Grammar parser)
- nugram.nuecho.com .. @nugram platform is “the first – and only – complete speech grammar solution”
- openccg.sourceforge.net .. open source natural language processing library written in Java, based on Combinatory Categorial Grammar
- paperrater.com .. (wrote asking for grammar checker API)
- Pâté .. a visual and interactive tool for parsing and transforming grammars
- proofreadbot.com/api .. free online grammar and style checker with rest server that returns reports in several formats
- QTag .. a freely available, language independent POS-Tagger, implemented in Java
- Regulus Grammar Compiler .. an open source platform for compiling unification grammars into Nuance compatible context free GSL grammars
- whitesmoke.com .. English grammar checker software .. by @whitesmokeapp
- ucrel.lancs.ac.uk/wmatrix .. a software tool for corpus analysis and comparison, providing a web interface to the UCRELUSAS and CLAWS corpus annotation tools
- XLE-Web .. interface for parsing and generating Lexical Functional Grammars, a rich GUI for writing and debugging
- XTAG .. on-going project to develop a wide-coverage grammar for English using a lexicalized Tree Adjoining Grammar (TAG) formalism
Wikipedia:
- Adaptive grammar .. a formal grammar that explicitly provides mechanisms within the formalism to allow its own production rules
- Attribute grammar .. a formal way to define attributes for the production of a formal grammar, by associating attributes to values
- Bottom-up parsing .. attempts to identify the most fundamental units first, and then to infer higher-order structures
- Case grammar .. a system of linguistic analysis, focusing on the link between the valence, or number of subjects, objects, etc., of a verb
- Category:Grammar .. listing of all topics under grammar
- Category:Grammar frameworks .. theories of grammar used to describe natural languages
- Category:Parsing algorithms .. list of known parser types
- Chomsky hierarchy .. a containment hierarchy of classes of formal grammars
- Cognitive linguistics .. interprets language in terms of the concepts, sometimes universal, sometimes specific to a particular tongue
- Combinatory categorial grammar .. generates constituency-based structures (as opposed to dependency-based ones) and is therefore a type of phrase structure grammar
- Computational linguistics .. an interdisciplinary field dealing with the statistical or rule-based modeling of natural language
- Constraint Grammar .. context dependent rules are compiled into a grammar that assigns grammatical tags
- Construction grammar .. based on the idea that the primary unit of grammar is the grammatical construction rather than the atomic syntactic unit
- Context-free grammar .. important in linguistics for describing the structure of sentences and words in natural language
- Context-free grammars and parsers .. addresses difference between LR parsing and LL parsing within the history of compiler construction
- Context-sensitive grammar .. a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context
- Controlled natural language .. subsets of natural languages, obtained by restricting the grammar and vocabulary in order to reduce or eliminate ambiguity
- Definite clause grammar .. way of expressing grammar, either for natural or formal languages, in a logic programming language such as Prolog
- Dependency grammar .. distinct from phrase structure grammars, since dependency grammars lack phrasal nodes
- Formal grammar .. a set of formation rules for strings in a formal language
- Functional discourse grammar .. top-level unit of analysis in functional discourse grammar is the discourse move, not the sentence or the clause
- Generalized phrase structure grammar .. augments syntactic descriptions with semantic annotations that can be used to compute the compositional meaning
- GLR parser ..an extension of an LR parser algorithm to handle nondeterministic and ambiguous grammars
- Grammar checker .. a program, or part of a program, that attempts to verify written text for grammatical correctness
- Grammar induction .. also known as grammatical inference or syntactic pattern recognition, refers to the process in machine learning of learning a formal grammar
- Grammatical Framework .. a programming language for writing grammars of natural languages
- Head-driven phrase structure grammar .. the basic type HPSG deals with is the “sign” .. words and phrases are two different subtypes of “sign”
- Higher order grammar .. HOG maintains Haskell Curry’s distinction between tectogrammatical structure (deep or abstract syntax) and phenogrammatical structure (concrete syntax)
- JSGF .. a textual representation of grammars for use in speech recognition
- Language model .. a statistical language model assigns a probability to a sequence of m words
- Lexical functional grammar .. mainly focuses on syntax, including its relation with morphology and semantics
- Lexicon .. the vocabulary of a language, including its words and expressions
- Link grammar .. builds relations between pairs of words, rather than constructing constituents in a tree-like hierarchy
- Mental space .. building mental spaces and mappings between those mental spaces are two main processes involved in construction of meaning
- Operator-precedence grammar .. allows precedence relations to be defined between the terminals of the grammar
- Optimality theory .. models grammars as systems that provide mappings from inputs to outputs
- Parsing .. the process of analyzing a text to determine its grammatical structure with respect to a given formal grammar
- Parsing expression grammar .. a set of rules for recognizing strings in the language
- Phrase structure grammar .. adherence to the constituency relation, as opposed to the dependency relation
- Production rule .. a rewrite rule specifying a symbol substitution that can be recursively performed to generate new symbol sequences
- Psycholinguistics .. makes use of biology, neuroscience, cognitive science, linguistics, and information theory to study how the brain processes language
- Serialization .. the process of converting a data structure or object state into a format that can be stored
- Simple precedence grammar .. a context-free formal grammar that can be parsed with a simple precedence parser
- Speech Recognition Grammar Specification .. standard for how speech recognition grammars are specified
- Stochastic context-free grammar .. a context-free grammar in which each production is augmented with a probability
- Syntax .. the study of the principles and rules for constructing phrases and sentences
- Systemic functional grammar .. refers to language as a network of systems for making meaning, and functional refers the view that language is as it is because of what it evolved to do
- Tokenization .. process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements
- Tree-adjoining grammar .. rules for rewriting the nodes of trees as other trees
References:
2012
- Detecting Grammatical Errors with Treebank-Induced, Probabilistic Parsers (2012) [PDF] .. thesis by Joachim Wagner
2011
- Grammatical error simulation for computer-assisted language learning (2011) .. by Sungjin Lee etc [Google Scholar]
- Grammatical framework: programming with multilingual grammars
- (2011) .. book by Aarne Ranta
- GRASP: grammar-and syntax-based pattern-finder in CALL (2011) [PDF] .. by Chung-Chi Huang etc
- Judging grammaticality with tree substitution grammar derivations (2011) [PDF] .. by Matt Post
- Quasi-Synchronous Phrase Dependency Grammars for Machine Translation (2011) [PDF] .. by Kevin Gimpel etc
- Speech Translation with Grammar Driven Probabilistic Phrasal Bilexica Extraction (2011) [PDF] .. by Markus Saers etc
2010
- A Knowledge-based Method for Grammatical Knowledge Extraction Process (2010) [PDF] .. by Jesús Cardeñosa etc
- Automated Grammatical Error Detection for Language Learners (2010) .. book by Martin Chodorow etc
- Generating LTAG grammars from a lexicon-ontology interface (2010) [PDF] .. by Christina Unger etc
- Hierarchical phrase-based translation with weighted finite-state transducers and shallow-n grammars (2010) [PDF] .. by Adrià de Gispert etc
- Modeling perspective using adaptor grammars (2010) [PDF] .. by Eric Hardisty etc
- Modular Grammars for Speech Recognition in Ontology-Based Dialogue Systems (2010) [PDF] .. by Hanne Marie Kosinowski
- The NuGram approach to dynamic grammars (2010) .. {slideshare} .. by @NuEcho .. (for both IVR and text parsing)
2009
- Are ontologies involved in natural language processing? (2009) [PDF] .. words and the grammar are given as pictograms .. by Maryvonne Abraham
- Automatic correction of grammatical errors in non-native English text (2009) [PDF] .. thesis by John Sie Yuen Lee
- Creating a Semantic Wiki using a Link Grammar-based Algorithm for Relation Extraction (2009) [PDF] .. thesis by Alan Akbik
- Grammar Engineering for CCG using Ant and XSLT (2009) [PDF] .. by Rajakrishnan Rajkumar etc
- GREAT: a finite-state machine translation toolkit implementing a Grammatical Inference Approach for Transducer Inference (GIATI) (2009) [PDF] .. by Jorge Gonzalez etc
- Improving data-driven dependency parsing using large-scale LFG grammars (2009) [PDF] .. by Lilja Øvrelid etc
- Interactive pedagogical programs based on constraint grammar (2009) [PDF] .. by Lene Antonsen etc
- Realistic Grammar Error Simulation using Markov Logic (2009) [PDF] .. by Sungjin Lee etc
- Using construction grammar in conversational systems (2009) .. {slideshare} .. PhD thesis high level overview by Marie-Claire Jenkins @missmcj
- Wanderlust: Extracting Semantic Relations from Natural Language Text Using Dependency Grammar Patterns (2009) .. {videolectures} .. by Alan Akbik
- Words, Grammar, Text: Revisiting the Work of John Sinclair (2009) .. book edited by Rosamund Moon
2008
- Reusable grammatical resources for spatial language processing (2008) .. by Robert J. Ross
2007
- Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models (2007) [PDF] .. by Mark Johnson etc
- Combining Supertagging and Lexicalized Tree-Adjoining Grammar Parsing (2007) [PDF] .. by Anoop Sarkar @anpsrkr
- Finding Metaphor in Grammar and Usage: A Methodological Analysis of Theory and Research (2007) .. book by Gerard Steen
- Rapid development of dialogue systems by grammar compilation (2007) [PDF] .. by Björn Bringert
2006
- A graph grammar based framework for automated concept generation (2006) .. by Matthew I. Campbell etc [Google Scholar]
- Grammatical metaphor and lexical metaphor: Different perspectives on semantic variation (2006) [PDF] .. by Miriam Taverniers
- Putting linguistics into speech recognition: the Regulus grammar compiler (2006) .. book by Manny Rayner etc
2005
- Comparison of Grammar-Based and Statistical Language Models Trained on the Same Data (2005) [PDF] .. by Beth Ann Hockey etc
- Mental spaces in grammar: conditional constructions (2005) .. book by Barbara Dancygier & Eve Sweetser
2004
- Arboretum: Using a precision grammar for grammar checking in CALL (2004) [PDF] .. by Emily M. Bender etc
- Translingual Grammar Induction for Conversational Systems (2004) [PDF] .. an induction algorithm to semi-automate grammar authoring .. thesis by John Sie Yuen Lee
2003
- Treebanks: Building and Using Parsed Corpora (2003) .. book edited by Anne Abeille
- Intrinsic versus Extrinsic Evaluations of Parsing Systems (2003) [PDF] .. by Diego Mollá Aliod etc
2002
- Embedded grammar tags: advancing natural language interaction on the Web (2002) .. by Yaser Yacoob etc
2000
- A Unified Context-Free Grammar And N-Gram Model For Spoken Language (2000) [PDF] .. by Ye-Yi Wang etc
- Flexible speech act based dialogue management (2000) [PDF] .. by Eli Hagen etc
- Object-oriented techniques in grammar and ontology specification (2000) [PDF] .. by Matthias Denecke
1990s
- A language model combining trigrams and stochastic context-free grammars (1998) [PDF] .. by Wayne Ward etc
- Robust grammatical analysis for spoken dialogue systems (1995) .. {attached below} .. by Gertjan van Noord etc [Google Scholar]
- A Cognitive Model of Sentence Interpretation: the Construction Grammar approach (1993) [PDF] .. by Daniel Jurafsky
- Hybrid grammar-bigram speech recognition system with first-order dependence model (1992) .. by Jeremy H. Wright etc [Google Scholar]
- XTAG – A Graphical Workbench for Developing Tree-Adjoining Grammars (1992) [PDF] .. by Patrick Paroubek etc
See also:
AIML & Grammars | CFG (Context-free Grammar) Parsers | Construction Grammar & Dialog Systems | FSG (Finite State Grammar) & Dialog Systems | Grammar Checkers & Dialog Systems | Grammar Compilers & Dialog Systems | Grammar Parsers & Dialog Systems | Grammar Parsing & Natural Language Generation | Grammar Trees & Dialog Systems | HPSG (Head-Driven Phrase Structure Grammar) & Dialog Systems | Link Grammar & Dialog Systems | Modular Grammar & Dialog Systems | PCFG (Probabilistic Context Free Grammar) & Dialog Systems | Semantic Grammars & Dialog Systems | Speech Grammars 2011 | SpeechBuilder Grammars | Syntactic Grammars & Dialog Systems | Template Grammars & Dialog Systems | XML Grammar & Dialog Systems
- CSG (Constraint-Based Synchronous Grammar)
- Dialog Grammars
- Grammar Engines
- Grammar Recognition
- Grammar Recognizers
- Grammar Transformers
- GRM Library (Grammar Library)
- GSL (Grammar Specification Language)
- Metagrammar
- N-gram Grammars
- Precision Grammars