Notes:
SRILM (the SRI Language Modeling Toolkit) is an open-source, extensible, C++-based toolkit for building, interpolating, and applying statistical language models, primarily n-gram models. It reads and writes the standard ARPA (Advanced Research Projects Agency) file format for n-gram models, provides an API for computing word probabilities, and ships with utilities such as the disambig module and the standard compute-best-mix script for finding optimal interpolation weights. Typical uses include estimating n-gram LMs (for example with interpolated modified Kneser-Ney smoothing), building models for all required n-gram orders or corpus combinations, computing perplexity, weighting and interpolating component LMs, and pruning models with an entropy criterion, which is useful even for relatively small LMs. An SRILM extension also allows efficient estimation of maximum entropy language models with n-gram features. The toolkit scales from small local models to large corpora: for example, bigram LMs have been built from the English Gigaword corpus and from a monolingual training corpus of 48,000,000 sentences.
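The ARPA layout that SRILM reads and writes can be illustrated with a toy maximum-likelihood bigram trainer. This is only a sketch of the file format: the function name is invented, and real SRILM output (from ngram-count) additionally contains smoothed probabilities and back-off weights.

```python
import math
from collections import Counter

def bigram_arpa(sentences):
    """Render a toy maximum-likelihood bigram LM in ARPA format.

    SRILM applies smoothing (e.g. modified Kneser-Ney) and emits
    back-off weights; this sketch uses raw MLE probabilities purely
    to show the file layout."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    total = sum(unigrams.values())
    lines = ["\\data\\",
             f"ngram 1={len(unigrams)}",
             f"ngram 2={len(bigrams)}",
             "",
             "\\1-grams:"]
    for w, c in unigrams.items():
        # ARPA stores base-10 log probabilities
        lines.append(f"{math.log10(c / total):.6f}\t{w}")
    lines += ["", "\\2-grams:"]
    for (w1, w2), c in bigrams.items():
        lines.append(f"{math.log10(c / unigrams[w1]):.6f}\t{w1} {w2}")
    lines += ["", "\\end\\"]
    return "\n".join(lines)

arpa = bigram_arpa(["the cat sat", "the dog sat"])
```

With SRILM itself the equivalent step would be an ngram-count invocation that writes the ARPA file directly from a text corpus.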
Reported applications span many model orders and discounting schemes. Bigram LMs with modified Kneser-Ney back-off discounting have been generated for recognition systems; trigram LMs have been estimated with the default Good-Turing discounting, including a capitalization-invariant trigram LM acquired from a training corpus; modified Kneser-Ney models have been estimated on training-set count files and applied to a test set; a 4-gram target LM was generated with unmodified Kneser-Ney back-off discounting; a 5-gram LM was trained on the English sentences of the FBIS (Foreign Broadcast Information Service) corpus, and SRILM-generated 5-gram LMs have been used in the cube-pruning process; 7-gram models have been trained as well. Variable memory modeling (VMM) can be implemented within SRILM and compared to the default n-gram models, and SRILM has been used to estimate separate LMs for truthful and deceptive opinions. Related tools: the Moses toolkit trains translation and generation models, and IRSTLM is a similar language modeling toolkit. N-gram LMs can also be scored with z-scores; for example, z-scores have been used to compare documents by examining how many standard deviations each n-gram differs from its mean occurrence in a large background collection of documents.
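The z-score comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name and toy background counts are invented, each background document is pre-counted into a Counter of n-grams, and a real system would also normalize counts for document length.

```python
import statistics
from collections import Counter

def ngram_zscores(doc_tokens, background_docs, n=2):
    """Score each n-gram of a document by how many standard deviations
    its count lies from that n-gram's mean count across a background
    collection (each background doc given as a Counter of n-grams)."""
    doc = Counter(zip(*(doc_tokens[i:] for i in range(n))))
    scores = {}
    for gram, count in doc.items():
        obs = [bg.get(gram, 0) for bg in background_docs]
        mu = statistics.mean(obs)
        sigma = statistics.pstdev(obs)
        # n-grams with zero variance in the background get score 0.0
        scores[gram] = (count - mu) / sigma if sigma else 0.0
    return scores

background = [Counter({("the", "cat"): 3, ("cat", "sat"): 1}),
              Counter({("the", "cat"): 1, ("cat", "sat"): 1})]
z = ngram_zscores("the cat the cat the cat the cat".split(), background)
```

Large positive scores flag n-grams that are unusually frequent in the document relative to the background corpus; large negative scores flag unusually rare ones.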
- ZMERT
Resources:
Wikipedia:
- Code-switching
- Grammar induction (aka grammatical inference)
- Language model
- Minimal recursion semantics (aka MRS)
- Moses (machine translation)
- Syntactic pattern recognition (aka structural pattern recognition)
References:
- Language and Computers (2012)
- Spoken Language Understanding: Systems for Extracting Semantic Information from Speech (2011)
See also:
IRSTLM (IRST Language Modeling) Toolkit 2018 | Kaldi ASR
TF-LM: TensorFlow-based Language Modeling Toolkit
L Verwimp, P Wambacq – … of the Eleventh International Conference on …, 2018 – aclweb.org
… If we generate a debugging file for a 5-gram LM with interpolated modified Kneser-Ney smoothing with the SRILM toolkit, we can automatically define the optimal interpolation weights on the validation set, which are 0.24 for the n-gram model and 0.76 for the LSTM model …
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
J Barker, S Watanabe, E Vincent, J Trmal – arXiv preprint arXiv …, 2018 – arxiv.org
… The language model is selected automatically, based on perplexity on training data, but at the time of the writing, the selected LM is 3-gram trained by the MaxEnt modeling method as implemented in the SRILM toolkit [35–37] …
Continuous Punjabi speech recognition model based on Kaldi ASR toolkit
J Guglani, AN Mishra – International Journal of Speech Technology, 2018 – Springer
… One can easily implement N-gram model using the IRSTLM or SRILM toolkit which is included in their recipe (Lee et al. 2001) … Since Kaldi uses a FST-based framework to build LM model from the raw text, the SRILM toolkit is used …
Improved training of end-to-end attention models for speech recognition
A Zeyer, K Irie, R Schlüter, H Ney – arXiv preprint arXiv:1805.03294, 2018 – arxiv.org
… found 0.23 and 0.36 to be optimal respectively for Switchboard and LibriSpeech (the weight on the attention model is 1). For LibriSpeech, we also train Kneser-Ney smoothed n-gram count based language models [53] on the same BPE vocabulary set using SRILM toolkit [54] …
PADIC: extension and new experiments
K Meftouh, S Harrat, K Smaïli – 2018 – hal.archives-ouvertes.fr
… We used GIZA++ [15] for alignment and SRILM toolkit [16] to compute trigram language models using Kneser-Ney smoothing technique … Language modeling software such as the SRILM toolkit we used [16] allows the interpolation of these language models …
A comparison of language model training techniques in a continuous speech recognition system for Serbian
B Popović, E Pakoci, D Pekar – International Conference on Speech and …, 2018 – Springer
… The baseline language model is a 3-gram model trained on the training part of the database transcriptions and the Serbian journalistic corpus (about 600000 utterances), using the SRILM toolkit and the Kneser-Ney smoothing method, with a pruning value of 10^-7 (previous …
Integration of machine translation in on-line multilingual applications: Domain adaptation
M? Duma, C Vertan – Language technologies for a multilingual Europe, 2018 – oapen.org
… scores are computed. For the language model training, we chose the SRILM toolkit, which is also open-source. It builds statistical language models and it also offers the possibility of interpolating language models. As for the …
Asr performance prediction on unseen broadcast programs using convolutional neural networks
Z Elloumi, L Besacier, O Galibert… – … , Speech and Signal …, 2018 – ieeexplore.ieee.org
… 3323M words in total – from EUbookshop, TED2013, Wit3, GlobalVoices, Gigaword, Europarl-v7, MultiUN, OpenSubtitles2016, DGT, News Commentary, News WMT, LeMonde, Trames, Wikipedia and transcriptions of our TrainAcoustic dataset) using SRILM toolkit [15] …
Enhancing recurrent neural network-based language models by word tokenization
HM Noaman, SS Sarhan… – … -centric Computing and …, 2018 – biomedcentral.com
Different approaches have been used to estimate language models from a given corpus. Recently, researchers have used different neural network architectures to estimate the language models from a given corpus using unsupervised learning neural networks capabilities. Generally …
Varying Background Corpora for SMT-Based Text Normalization
CM Veliz, O De Clercq, V Hoste – of the 6th Conference on … – repository.uantwerpen.be
… For building the SMT model we used Moses (Koehn et al., 2007). All LMs were built using the SRILM toolkit (Stolcke, 2002) with Witten-Bell discounting which has proven to work well on small data sets (Tiedemann, 2012). We …
A Fast-Converged Acoustic Modeling for Korean Speech Recognition: A Preliminary Study on Time Delay Neural Network
H Park, D Lee, M Lim, Y Kang, J Oh, JH Kim – arXiv preprint arXiv …, 2018 – arxiv.org
… windows of 25ms length. In addition, 100 dimensional iVectors were added to the MFCC input. The iVector presents speaker characteristics. The SRILM toolkit [8] was used to generate tri-gram language model using frequency cut …
Intelligent Voice ASR system for Iberspeech 2018 Speech to Text Transcription Challenge.
N Dugan, C Glackin, G Chollet… – …, 2018 – pdfs.semanticscholar.org
… transcription challenge, training and development. These transcriptions are also used for the language model (LM) adaptation of the previous ASR model using the SRILM toolkit [6]. 2. Data preparation It was observed that the …
Robust Network Structures for Acoustic Model on CHiME5 Challenge Dataset
A Misbullah – Proc. CHiME 2018 Workshop on Speech Processing …, 2018 – isca-speech.org
… approximately. 3.3. Experimental Result In our experiment, we first evaluated the acoustic model using 3-gram language model. The language model is trained by using SRILM toolkit from training corpus transcription. In …
Survey on Statistical and Semantic Language Modelling Based on PolEval
K Wołk – Proceedings of the PolEval 2018 Workshop – 2018.poleval.pl
… for other languages. 3. Toolkits Used in the Research For language model training we firstly used the most common SRILM toolkit (Stolcke 2002). The fundamental challenge that language models handle is sparse data. It is …
IIT (BHU) Varanasi at MSR-SRST 2018: A Language Model Based Approach for Natural Language Generation
A Chawla, A Sharma, S Singh, AK Singh – researchgate.net
… sentence. For this, we make use of the SRILM Toolkit (Stolcke, 2002). Before … line. 2. After we have the vocab file with us, we make use of this and the ordered sentence data to generate a .lm file using the SRILM toolkit. This …
The AFRL IWSLT 2018 Systems: What Worked, What Didn’t
B Ore, E Hansen, K Young, G Erdmann, J Gwinnup – 2018 – apps.dtic.mil
… network (RNN) LM. Interpolated bigram, trigram, and 4-gram LMs were estimated using the SRILM Toolkit,1 and a RNN maximum entropy LM was trained using the RNNLM Toolkit.2 The RNN included 160 hidden units, 1http://www …
TDNN-based Multilingual Speech Recognition System for Low Resource Indian Languages.
N Fathima, T Patel, C Mahima, A Iyengar – Interspeech, 2018 – isca-speech.org
… was provided by the challenge organizers. The SRI Language Modeling (SRILM) toolkit [9] was used to train Kneser-Ney smoothed trigram LMs on the training text data of each language. The Lexicon for each language uses …
GTM-IRLab Systems for Albayzin 2018 Search on Speech Evaluation.
P Lopez-Otero, LD Fernández – IberSPEECH, 2018 – isca-speech.org
… Specifically, two fourgram-based language models were trained following the Kneser-Ney discounting strategy using the SRILM toolkit [15], and the final LM was obtained by mixing both LMs using the SRILM static n-gram interpolation functionality …
Translation of Biomedical Documents with Focus on Spanish-English
MS Duma, W Menzel – Proceedings of the Third Conference on Machine …, 2018 – aclweb.org
… We used the SRILM toolkit (Stolcke, 2002) and Kneser-Ney discounting (Kneser and Ney, 1995) for estimating 5-gram LMs. All the experiments benefited from the interpolated language model, including the strong baseline and the MML experiment …
Extended Language Modeling Experiments for Kazakh
B Myrzakhmetov, Z Kozhirbayev – ???????? ???? ?????????? …, 2018 – en.telconf.tatar
… experiments. In Kazakh, n-gram based language models still used in Speech Processing [15] and Machine translation [16] tasks. We trained n-gram models with the SRILM toolkit [17] with adding 0 smoothing technique. For …
Exploiting Parts-of-Speech for Improved Textual Modeling of Code-Switching Data
G Sreeram, R Sinha – 2018 Twenty Fourth National Conference …, 2018 – ieeexplore.ieee.org
… of the proposed approach. Also, the performances for the 5-gram LMs trained on Hindi and Hinglish data using SRILM toolkit [23] are computed for the reference purpose. IV-B. Parameter tuning RNN-based language models …
The NECTEC 2015 Thai Open-Domain Automatic Speech Recognition System
P Sertsi, S Kasuriya, P Chootrakool… – Advances in Natural …, 2018 – Springer
… 4] plus a silence. N-gram language models with Chen and Goodman’s modified Kneser-Ney discounting were constructed from the overall text data presented in the Table 1 using the SRILM toolkit [15]. The number of unique …
Style Transfer Through Multilingual and Feedback-Based Back-Translation
S Prabhumoye, Y Tsvetkov, AW Black… – arXiv preprint arXiv …, 2018 – arxiv.org
… tasks. 3.4 Experimental Setup We used data from Workshop in Statistical Machine Translation 2015 (WMT15) (Bojar et al., 2015) and sequence-sequence framework. We use the SRILM toolkit (Stolcke, 2002). Model …
Unsupervised domain adaptation by adversarial learning for robust speech recognition
P Denisov, NT Vu, MF Font – Speech Communication; 13th ITG …, 2018 – ieeexplore.ieee.org
… For decoding we also trained two 3-gram language models on the transcripts from the training data and on the CommonCrawl subset and interpolated them with SRILM toolkit [32]. The perplexity of the language model on our testing data set is 209.47 …
New baseline in automatic speech recognition for Northern Sámi
J Leinonen, P Smit, S Virpioja, M Kurimo – Proceedings of the Fourth …, 2018 – aclweb.org
… A similar process was used to train the BLSTM and Chain model to generate networks with seven and six layers respectively. For a word-based system, we trained a Kneser-Ney smoothed 3-gram model with the SRILM toolkit (Stolcke, 2002) …
A dataset for document grounded conversations
K Zhou, S Prabhumoye, AW Black – arXiv preprint arXiv:1809.07358, 2018 – arxiv.org
… responses. The test presents the chat history (1 utterance) and then, in random order, its … (The total number of tokens is 46000, and we limit the vocabulary to be 10000 tokens.) We use the SRILM toolkit (Stolcke, 2002) …
Signal Processing Cues to Improve Automatic Speech Recognition for Low Resource Indian Languages.
A Baby, K Pandia, HA Murthy – SLTU, 2018 – isca-speech.org
… The syllables are combined later to obtain word boundaries. 3.3. Language modeling The SRILM toolkit is used to train the language model [32]. 4-gram models are learned to build the language model. 3.4. GMM based training …
Slovak broadcast news speech recognition and transcription system
M Lojka, P Viszlay, J Staš, D Hládek, J Juhár – International Conference on …, 2018 – Springer
… decision. 4.2 Language Modeling. The background language model was created using the SRILM toolkit [13]. It was restricted to the vocabulary size of about 500 thousand unique words and smoothed by the Witten-Bell algorithm …
Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced English-isiZulu Code-Switched Speech.
A Biswas, F de Wet, E van der Westhuizen, E Yilmaz… – Interspeech, 2018 – dsp.sun.ac.za
… to the development and test sets. The SRILM toolkit [33] was used to train and evaluate a bilingual English-isiZulu language model (LM) using the English-isiZulu training set transcriptions. This model was further interpolated …
Lexical Networks in !Xung and Ju
SA Hussain – 2018 – kb.osu.edu
… We train the trigram models using the KenLM Language Model Toolkit before using the SRI Language Modeling (SRILM) Toolkit to generate the simulated words (Heafield, 2011; Stolcke, 2002; Stolcke et al., 2011). We generate pseudolexicons of size …
Development Of High-Performance And Large-Scale Vietnamese Automatic Speech Recognition Systems
DQ Truong, PN Phuong, TH Tung, LC Mai – Journal of Computer Science …, 2018 – vjs.ac.vn
… We used SRILM toolkit for model training and used perplexity for evaluating the performance. Various n-gram models have been conducted including 3-gram, 4-gram, and their pruned versions. The …
Impact of ASR Performance on Free Speaking Language Assessment.
K Knill, MJF Gales, K Kyriakopoulos, A Malinin… – …, 2018 – apc38.user.srcf.net
… as described in [10]. A Kneser-Ney trigram LM is trained on 186k words from the System 1 training data, and interpolated with a general LM trained on Broadcast News English [34], using the SRILM toolkit [35]. A 334 hours …
The CSU-K Rule-Based System for the 2nd Edition Spoken CALL Shared Task.
D Jülg, M Kunstek, CP Freimoser, K Berkling… – Interspeech, 2018 – researchgate.net
… The language model used in the baseline ASR is a trigram language model trained on all the text of ST1 train using the SRILM toolkit [13]. The described system is provided with the task and is based on the best system for the Shared Task Ed.1 competition ASR component [14] …
A Novel Approach for Effective Recognition of the Code-Switched Data on Monolingual Language Model.
G Sreeram, R Sinha – Interspeech, 2018 – isca-speech.org
… By conducting tuning experiments on Hindi test data, the parameter corresponding to the number of classes is set to be 50 and the variable corresponding to backpropagation through time (BPTT) is set as 5. Also, the 5-gram LM is trained using the SRILM toolkit [30] by setting …
DA-IICT/IIITV System for Low Resource Speech Recognition Challenge 2018.
HB Sailor, MVS Krishna, D Chhabra, AT Patil… – Interspeech, 2018 – isca-speech.org
… D cepstral features. The LDA-MLLT is applied to reduce the dimension and decorrelate the context-based cepstral features. The 3-gram LM is built using the SRILM toolkit [28] from the training corpus. The alignments obtained …
Investigation on Estimation of Sentence Probability by Combining Forward, Backward and Bi-directional LSTM-RNNs.
K Irie, Z Lei, L Deng, R Schlüter, H Ney – Interspeech, 2018 – isca-speech.org
… word Switchboard training data mentioned in Sec. 4.1. using SRILM toolkit [22]. We use this model for decoding and apply the neural language model in the second pass rescoring. The application of forward and backward LSTM …
Automatic Speech Recognition for Humanitarian Applications in Somali
R Menon, A Biswas, A Saeb, J Quinn… – arXiv preprint arXiv …, 2018 – arxiv.org
… The incorporation of such artificial data has successfully improved speech recognition performance for other authors [13]. The language model for Somali was generated using the SRILM toolkit [14]. When training only on the language model …
Improving Cross-Lingual Knowledge Transferability Using Multilingual TDNN-BLSTM with Language-Dependent Pre-Final Layer.
S Feng, T Lee – Interspeech, 2018 – isca-speech.org
… validation set. Syllable error rates (SERs) of CA test set is chosen for evaluation. A syllable trigram language model trained with transcriptions of CA training data is used during decoding, using SRILM toolkit [34]. SERs of baseline …
On Continuous Speech Recognition of Indian English
X Jin, K Zhang, X Huang, M Miao – Proceedings of the 2018 International …, 2018 – dl.acm.org
… ACAI’18, December, 2018, Sanya, China The experiment is mainly built with Kaldi toolkit [13][15]. The language model is constructed with SRILM toolkit [14] and a 3-gram language model is obtained from the training of tagged text corpus …
Classification of closely related sub-dialects of Arabic using support-vector machines
S Wray – Proceedings of the Eleventh International Conference …, 2018 – aclweb.org
… Classification was performed on each tweet individually. To generate n-gram probabilities used as features for the SVM, I used the SRI Language Modeling (SRILM) toolkit (Stolcke and others, 2002) encompassing the following …
Progress and tradeoffs in neural language models
R Tang, J Lin – arXiv preprint arXiv:1811.00942, 2018 – arxiv.org
… Both QRNN models have window sizes of r = 2 for the first layer and r = 1 for the rest. For the KN-5 model, we trained an off-the-shelf five-gram model using the popular SRILM toolkit (Stolcke, 2002). We did not specify any special hyperparameters. 3.2 Infrastructure …
Automatic Identification of Moroccan Colloquial Arabic
SL Aouragh, H Jaafa – … Processing: From Theory to Practice: 6th …, 2018 – books.google.com
… The dataset was split into train and test sets. Then, they used SRILM toolkit (Andreas 2002) to build a language model where the goal is to find the best sequence of tags for a given sentence. By using MADAMIRA morphological analyzer (Pasha et al …
Role-specific Language Models for Processing Recorded Neuropsychological Exams
T Al Hanai, R Au, J Glass – Proceedings of the 2018 Conference of the …, 2018 – aclweb.org
… Language Model: A tri-gram language model was trained for each of the speaker and tester using the SRILM toolkit (Stolcke et al., 2002). • Lexicon: We generated the word pronunciations using the LOGIOS lexical tool. We decoded the audio in three ways …
Investigating the Use of Mixed-Units Based Modeling for Improving Uyghur Speech Recognition.
P Hu, S Huang, Z Lv – SLTU, 2018 – isca-speech.org
… this corpus is used for vocabulary selection for mixed units and to estimate back-off N-gram LMs using modified Kneser-Ney smoothing by the SRILM toolkit [27]. To evaluate the performance, two test sets are prepared for the Uyghur speech recognition task …
Factorised Hidden Layer Based Domain Adaptation for Recurrent Neural Network Language Models
M Hentschel, M Delcroix, A Ogawa… – 2018 Asia-Pacific …, 2018 – ieeexplore.ieee.org
… 27]. D. Penn Treebank Results First, we will show PPL results for the validation and test sets of PTB. As a baseline N-gram LM, we estimated a trigram LM with Kneser-Ney [28] smoothing using the SRILM toolkit [29]. However …
MLLP-UPV and RWTH Aachen Spanish ASR Systems for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge.
J Jorge, AA Martinez-Villaronga, P Golik, A Giménez… – …, 2018 – isca-speech.org
… Third, we trained two standard Kneser-Ney smoothed 4-gram LMs on the train and subs-C24H sets using the SRILM toolkit [9]. Rows (a) and (b) of Table 3 show the perplexities obtained with these models on the dev1-dev and dev2 sets …
Combined Speaker Clustering and Role Recognition in Conversational Speech.
N Flemotomos, P Papadopoulos, J Gibson… – Interspeech, 2018 – isca-speech.org
… The train set is only used to build the LMs and AMs described in Section 2.3 corresponding to the different roles. The LMs are 3-gram models trained (and later evaluated) using the SRILM toolkit [28] with manually derived transcriptions of the recordings …
Improving ASR for Code-Switched Speech in Under-Resourced Languages Using Out-of-Domain Data.
A Biswas, E van der Westhuizen, T Niesler, F de Wet – SLTU, 2018 – dsp.sun.ac.za
… code-switched adaptation. 5. Language modelling The SRILM toolkit [23] was used to train and evaluate a bilin- gual 3-gram language model trained on the English-isiZulu training data transcriptions. This language model was …
CLMAD: A Chinese Language Model Adaptation Dataset
Y Bai, J Tao, J Yi, Z Wen, C Fan – 2018 11th International …, 2018 – ieeexplore.ieee.org
… Standard trigram backoff language models with Kneser-Ney discount methods are trained on training set of each domain, and perplexities on each testing set are computed with these models. All models are trained with SRILM toolkit [19] …
LSTM language model adaptation with images and titles for multimedia automatic speech recognition
Y Moriya, GJF Jones – 2018 IEEE Spoken Language …, 2018 – ieeexplore.ieee.org
… Kaldi [21]. The acoustic model was trained with the nnet3 module, and the n-gram language model for decoding word graphs was 3-gram with modified Kneser-Ney interpolation using the SRILM toolkit [22, 23, 24]. The validation …
Follow-up Question Generation Using Pattern-based Seq2seq with a Small Corpus for Interview Coaching.
MH Su, CH Wu, KY Huang, QB Hong, HH Huang – Interspeech, 2018 – isca-speech.org
… question sentence pattern according to the constructed word class table. As there are many candidate questions after word filling, this study uses the n-gram SRILM toolkit to choose the best question as the interviewer’s question …
Rapid Collection of Spontaneous Speech Corpora Using Telephonic Community Forums.
AA Raza, A Athar, S Randhawa, Z Tariq, MB Saleem… – Interspeech, 2018 – zaintq.com
… 5.2. Language Model and Pronunciation Lexicon We use a trigram language model with Kneser-Ney discounting, based on training transcripts, built using SRILM toolkit [46]. Our LM has 74K tokens (5K types), an OOV rate of 3.64% and perplexity of 37.04 on test data …
The University of Birmingham 2018 Spoken CALL Shared Task Systems.
M Qian, X Wei, P Jancovic, MJ Russell – Interspeech, 2018 – research.birmingham.ac.uk
… We used a trigram language model (LM) trained on the reference transcriptions of the ST data using the SRILM toolkit [9]. The LM1 denotes model obtained based on the reference transcriptions of ST12 train and used during the ASR development …
Disfluency detection using a noisy channel model and a deep neural language model
PJ Lou, M Johnson – arXiv preprint arXiv:1808.09091, 2018 – arxiv.org
… smoothed 4-gram language models with the LSTM corresponding on the reranking process of the noisy channel model. We estimate the 4-gram models and assign probabilities to the fluent parts of disfluency analyses using the SRILM toolkit (Stolcke, 2002) …
Decipherment for Adversarial Offensive Language Detection
Z Wu, N Kambhatla, A Sarkar – Proceedings of the 2nd Workshop on …, 2018 – aclweb.org
… 4 Experimental Setup 4.1 Language Model Character Language Model: We used the SRILM toolkit (Stolcke et al., 2002) to train a character language model (LM) from Wiktionary and Europarl data. We trained two LMs and interpolated them using a mixture model …
Automatic speech recognition system for people with speech disorders
ME Ramaboka – 2018 – ulspace.ul.ac.za
… stage. The cepstral mean combined variance normalization (CMVN) was applied to normalise the features. A third-order language model was trained using the SRI Language Modelling (SRILM) toolkit. A recognition accuracy of 65.58% was obtained …
Mixing Textual Data Selection Methods for Improved In-Domain Data Adaptation
K Wołk – World Conference on Information Systems and …, 2018 – Springer
… The sizes of the perplexity-based quasi-in-domain subsets must be equal. In practice, we work with the SRI Language Modeling (SRILM) toolkit to train 5-gram LMs with interpolated modified Kneser–Ney discounting [17, 18]. 3.3 Levenshtein Distance …
Probabilistic Indexing and Search for Information Extraction on Handwritten German Parish Records
E Lang, J Puigcerver, AH Toselli… – 2018 16th International …, 2018 – ieeexplore.ieee.org
… geometric character boundaries. As discussed in [9] (see also [10]), it provides a good approximation to the joint probability distribution p(c, x). With the SRILM Toolkit: http://www.speech.sri.com/projects/srilm Finally, following the …
Automatic machine translation for arabic tweets
F Mallek, NT Le, F Sadat – Intelligent Natural Language Processing …, 2018 – Springer
… 36]. Once the lexical normalization step was done, we obtained a corpus of tweets in English ready for building the LM using the SRILM toolkit [45]. Pre … corpus). The LMs are 3-gram LM, generated with the SRILM toolkit [45]. The …
Comparing different feedback modalities in assisted transcription of manuscripts
CD Martínez-Hinarejos… – 2018 13th IAPR …, 2018 – ieeexplore.ieee.org
… manually revised. B. System setup The different recognition systems were implemented by using the iATROS recogniser [18], and the SRILM toolkit [19] was used to transform the WG recognition outputs into CN. 1) Features …
Exploiting Speaker and Phonetic Diversity of Mismatched Language Resources for Unsupervised Subword Modeling.
S Feng, T Lee – Interspeech, 2018 – isca-speech.org
… bottleneck layer. A syllable trigram language model trained with transcriptions of CUSENT training data is used during decoding. The language model is trained with SRILM toolkit [41]. 6.3. Speaker adaptation of target speech The …
Role Annotated Speech Recognition for Conversational Interactions
N Flemotomos, Z Chen, DC Atkins… – 2018 IEEE Spoken …, 2018 – ieeexplore.ieee.org
… 6: Distribution of the duration of the intervals between speaker change points in the MI dataset. The LMs are 3-gram models with Kneser-Ney smoothing, trained with the SRILM toolkit [29]. The training corpus can be created either by concatenating consecutive turns as in Fig …
Disambiguation of verbal shifters
M Wiegand, S Loda, J Ruppenhofer – Proceedings of the Eleventh …, 2018 – aclweb.org
… This is a standard configuration proven to yield good results (Turian et al., 2010). We induce the clusters using the SRILM-toolkit (Stolcke, 2002). Word Embeddings. A more recent alternative to Brown clustering is the usage of word embeddings …
Cross-Lingual Content Scoring
A Horbach, S Stennmanns, T Zesch – … on Innovative Use of NLP for …, 2018 – aclweb.org
… language models. We build a trigram language model per prompt for the English data using the SRILM toolkit (Stolcke, 2002) and measure the perplexity of translated German answers under that language model. We find …
Improved Spoken Uyghur Segmentation for Neural Machine Translation
C Mi, Y Yang, X Zhou, L Wang… – 2018 IEEE 30th …, 2018 – ieeexplore.ieee.org
… morphological features are extracted with an in-house CRF based Uyghur morphological analyzer; we derive the bilingual features based on the widely used word alignment tool GIZA++ 3; Uyghur monolingual language model features are extracted by the SRILM toolkit [18] …
A comparison of different punctuation prediction approaches in a translation context
V Vandeghinste, L Verwimp, J Pelemans… – Proceedings …, 2018 – lirias.kuleuven.be
… on English. The n-gram models are 4-gram LMs (5-grams did not improve the performance) with interpolated modified Kneser-Ney smoothing (Chen and Goodman, 1999), trained with the SRILM toolkit (Stolcke, 2002). We …
Automatic Transcription and Subtitling of Slovak Multi-genre Audiovisual Recordings
J Juhár – … Technology. Challenges for Computer Science and …, 2018 – books.google.com
… 4.4 Language Modeling for Speech Recognition The background LM was created by using the SRILM Toolkit [21]. It was restricted to the vocabulary size of 500 thousand unique words and smoothed with the Witten-Bell back-off algorithm …
Athena: Automated tuning of genomic error correction algorithms using language models
M Abdallah, A Mahgoub, S Bagchi… – arXiv preprint arXiv …, 2018 – arxiv.org
… 3 Evaluation with Real Datasets In this section, we evaluate Athena variants separately by correcting errors in 5 real datasets and evaluating the quality of the resultant assembly. We implement the N-Gram model using the SRILM toolkit [16] …
Parallel Corpora for bi-Directional Statistical Machine Translation for Seven Ethiopian Language Pairs
ST Abate, M Melese, MY Tachbelie… – Proceedings of the First …, 2018 – aclweb.org
… SRILM toolkit (Stolcke, 2002) has been used to develop the language models using target language sentences from the training and tuning sets of parallel corpora. Bilingual Evaluation Under Study (BLEU) is used for automatic scoring. 5.2 Experimental Results …
Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data
KD Chowdhury, M Hasanuzzaman, Q Liu – Proceedings of the Workshop …, 2018 – aclweb.org
… We use the SRILM toolkit (Stolcke, 2002) for building a language model and GIZA++ (Och and Ney, 2000) with the grow-diag-final-and heuristic for extracting phrases from Hic – Enc .The trained system is tuned using Minimum Error Rate Training (Och, 2003) …
Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data
K Dutta Chowdhury, M Hasanuzzaman, Q Liu – 2018 – doras.dcu.ie
… We use the SRILM toolkit (Stolcke, 2002) for building a language model and GIZA++ (Och and Ney, 2000) with the grow-diag-final-and heuristic for extracting phrases from Hic – Enc .The trained system is tuned using Minimum Error Rate Training (Och, 2003) …
Integrating pronunciation into Chinese-Vietnamese statistical machine translation
AT Huu, H Huang, Y Guo, S Shi… – Tsinghua Science and …, 2018 – ieeexplore.ieee.org
… A 5-gram language model is estimated using the SRILM toolkit[8]. The rest of the parameters are the default settings provided by Moses. A conversation corpus is used as the dataset for the experiments. This corpus includes 550000 sentence pairs …
Vocalic, Lexical and Prosodic Cues for the INTERSPEECH 2018 Self-Assessed Affect Challenge.
C Montacié, MJ Caraty – Interspeech, 2018 – isca-speech.org
… Each language model has been trained on the intonation contours of the Train set corresponding to its class. The three language models (LM_low, LM_medium and LM_high) have been computed using SRILM toolkit [44]. This …
Thank “Goodness”! A Way to Measure Style in Student Essays
S Mathias, P Bhattacharyya – Proceedings of the 5th Workshop on …, 2018 – aclweb.org
… 4.4 Language modeling features These are language modeling features of the essay using the English Wikipedia from the Leipzig corpus (Goldhahn et al., 2012). These features are the output from the SRILM toolkit (Stolcke et al., 2002). We use the following features …
Joint Word-and Character-level Embedding CNN-RNN Models for Punctuation Restoration
MÁ Tündik, G Szaszák – 2018 9th IEEE International …, 2018 – ieeexplore.ieee.org
… The language model for the ASR was trained on the corpus used for the punctuation model with the SRILM toolkit [19]. The deep neural network based acoustic models were trained on 500+ hours of transcribed speech using the Kaldi ASR toolkit [20] …
Offline Arabic handwriting recognition using BLSTMs combination
SK Jemni, Y Kessentini, S Kanoun… – 2018 13th IAPR …, 2018 – ieeexplore.ieee.org
… This is given by: P(w_i | h_i) ≈ P(w_i | w_{i−N+1}, …, w_{i−1}) (2). In this work, n-gram (n=3) LMs are estimated on the training corpus of the KHATT database using the discounting method. The SRILM toolkit [20] was used for this purpose …
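Entries like this one estimate n-gram LMs that condition each word on its previous n−1 words. A minimal maximum-likelihood n-gram estimator can be sketched as follows (illustrative only; SRILM's `ngram-count` additionally applies smoothing, and the toy corpus is invented):

```python
from collections import Counter

def ngram_probs(tokens, n):
    """Maximum-likelihood n-gram probabilities P(w | h) = c(h, w) / c(h)."""
    hist = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    gram = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {g: c / hist[g[:-1]] for g, c in gram.items()}

toks = "the cat sat on the mat the cat ran".split()
p = ngram_probs(toks, 3)
# "the cat" occurs twice as a history, followed once by "sat":
# P(sat | the cat) = 1/2 = 0.5
prob = p[("the", "cat", "sat")]
```

Unseen n-grams get no probability under pure MLE, which is exactly why the discounting methods named throughout these entries (Good-Turing, Katz, Kneser-Ney) exist.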
Boosting the deep multidimensional long-short-term memory network for handwritten recognition systems
D Castro, BLD Bezerra… – 2018 16th International …, 2018 – ieeexplore.ieee.org
… HMM scheme, as outlined in Section III. We used a standard language model setup [6], [22], [5], [7], [8]. The SRILM toolkit [23] is used for training interpolated n-gram language models. A smoothed word tri-gram language model …
Lightly supervised alignment of subtitles on multi-genre broadcasts
O Saz, S Deena, M Doulaty, M Hasan, B Khaliq… – Multimedia Tools and …, 2018 – Springer
… the subtitles. In this work, this is achieved using the SRILM toolkit [40] and biases the decoder and the lattice rescoring towards producing hypotheses which are closer to the words and language used in the subtitles. Such interpolation …
Lexical Networks in !Xung
SA Hussain, M Elsner, A Miller – Proceedings of the Fifteenth Workshop …, 2018 – aclweb.org
… simulating “words” similar to those in the actual language. We train the trigram models using the SRI Language Modeling (SRILM) Toolkit (Stolcke, 2002; Stolcke et al., 2011). We generate pseudolexicons of size 2^10, 2^11, 2^12 …
Sequence Teacher-Student Training of Acoustic Models for Automatic Free Speaking Language Assessment
Y Wang, JHM Wong, MJF Gales… – 2018 IEEE Spoken …, 2018 – ieeexplore.ieee.org
… The models were trained on the combined crowd-sourced transcriptions. An in-domain LM was trained on 1.83M words from the combined crowd-sourced transcriptions of the training data, using the SRILM toolkit [38]. This …
Statistical Approach to Noisy-Parallel and Comparable Corpora Filtering for the Extraction of Bi-lingual Equivalent Data at Sentence-Level
K Wołk, E Zawadzka, A Wołk – World Conference on Information Systems …, 2018 – Springer
… target specific domain. The size of the perplexity-based quasi in-domain subsets must be equal. In practice, we use the SRILM toolkit to train 5-gram LMs using interpolated modified Kneser-Ney discounting [38, 39]. In the realm …
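Interpolated Kneser-Ney discounting, used in this entry and several others, subtracts a discount from each observed bigram count and redistributes the mass to a continuation distribution. A toy interpolated Kneser-Ney bigram with a single fixed discount might look like this (SRILM's modified variant uses three count-dependent discounts; the corpus is invented):

```python
from collections import Counter, defaultdict

def kn_bigram(tokens, discount=0.75):
    """Interpolated Kneser-Ney bigram model with one fixed discount."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigram = Counter(tokens[:-1])          # history counts c(v)
    followers = defaultdict(set)            # distinct continuations of v
    preceders = defaultdict(set)            # distinct histories of w
    for v, w in bigrams:
        followers[v].add(w)
        preceders[w].add(v)
    n_types = len(bigrams)                  # number of distinct bigram types

    def prob(w, v):
        p_cont = len(preceders[w]) / n_types          # continuation prob.
        if unigram[v] == 0:
            return p_cont
        disc = max(bigrams[(v, w)] - discount, 0) / unigram[v]
        backoff = discount * len(followers[v]) / unigram[v]
        return disc + backoff * p_cont
    return prob

toks = "a rose is a rose is a flower".split()
p = kn_bigram(toks)
# the distribution over the vocabulary sums to 1 for a seen history
total = sum(p(w, "is") for w in set(toks))
```

The continuation probability counts how many distinct histories a word follows, which is what makes Kneser-Ney favour genuinely versatile words over ones frequent only in a single context.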
Distilling GRU with Data Augmentation for Unconstrained Handwritten Text Recognition
M Liu, Z Xie, Y Huang, L Jin… – 2018 16th International …, 2018 – ieeexplore.ieee.org
… It is noteworthy that, for general-purpose recognition and fair comparison with previous works [2]–[4], our system had 7356 classes. For language modeling, we constructed a 3-gram statistical language model using the SRILM toolkit [25] …
English-Wolaytta Machine Translation Using Statistical Approach
M MARA – 2018 – repository.smuc.edu.et
… translation approach, we have used the most popular and freely available SMT tools, such as the SRILM toolkit for the language model and MGIZA++ to align the corpus at word level using IBM models (1-5). Decoding has been done using Moses, which is a statistical machine translation …
Stacked Neural Networks With Parameter Sharing For Multilingual Language Modeling
BK Khonglah, S Madikeri, N Rekabsaz, N Pappas… – navid-rekabsaz.com
… the ASR systems. SRILM toolkit is used to create N-gram language models. The neural language models are implemented in PyTorch. A modified version of TDNN, released in [13], is used in our experiments. We also implement …
Uniform Information Density Effects on Syntactic Choice in Hindi
A Jain, V Singh, S Ranjan, R Rajkumar… – Proceedings of the …, 2018 – aclweb.org
… In total, our dataset consisted of 8736 reference sentences and 175801 variants. We estimated lexical surprisal using trigram models trained on 1 million Hindi sentences from EMILLE Corpus (Baker et al., 2002) using the SRILM toolkit (Stolcke, 2002) …
Experimenting with lipreading for large vocabulary continuous speech recognition
K Paleček – Journal on Multimodal User Interfaces, 2018 – Springer
… Inclusion of the words from the TULAVD corpus only ensures that the test data will not contain any previously unseen words. We employed the SRILM toolkit [26] with Kneser-Ney smoothing for the language model training …
A neural reordering model based on phrasal dependency tree for statistical machine translation
S Farzi, H Faili, S Kianian – Intelligent Data Analysis, 2018 – content.iospress.com
Machine translation is an important field of research and development. Word reordering is one of the main problems in machine translation. It is an important factor of quality and efficiency of machine translations and becomes more difficult when it.
Morphology In Statistical Machine Translation From English To Highly Inflectional Language
MS Maučec, G Donaj – Information Technology and Control, 2018 – itc.ktu.lt
… A 3-gram language model with modified Kneser-Ney discounting was built on the training corpus by the SRILM toolkit [26]. Singletons were excluded. The perplexity of the Slovenian language model was 131, and that of English was 62 …
Discriminative ridge regression algorithm for adaptation in statistical machine translation
M Chinea-Rios, G Sanchis-Trilles… – Pattern Analysis and …, 2018 – Springer
… The language model used was a 5-gram LM with modified Kneser–Ney smoothing [11], built with the SRILM toolkit [28]. The translation quality would ideally be measured by humans. However, this is a very expensive resource, not commonly available in research tasks …
A novel rule based machine translation scheme from Greek to Greek Sign Language: Production of different types of large corpora and Language Models evaluation
D Kouremenos, K Ntalianis, S Kollias – Computer Speech & Language, 2018 – Elsevier
Targeted syntactic evaluation of language models
R Marvin, T Linzen – arXiv preprint arXiv:1808.09031, 2018 – arxiv.org
… tated data (CCG supertags). N-gram model: We trained a 5-gram model on the same 90M word corpus using the SRILM toolkit (Stolcke, 2002) which backs off to smaller n-grams using Kneser-Ney smoothing. Single-task RNN …
A comprehensive study of hybrid neural network hidden Markov model for offline handwritten Chinese text recognition
ZR Wang, J Du, WC Wang, JF Zhai, JS Hu – International Journal on …, 2018 – Springer
… In this work, we adopt the Katz smoothing [42]. The SRILM toolkit [43] is employed to generate Katz N-gram LMs with different orders. Without an N-gram LM (N = 0), the recognition accuracy declines sharply, as shown in the experiments …
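Katz smoothing, adopted in the entry above, rests on Good-Turing re-estimated counts c* = (c+1)·N_{c+1}/N_c, where N_c is the number of n-grams seen exactly c times. A small illustrative implementation (real toolkits such as SRILM apply the adjustment only below a count cutoff and smooth the count-of-count statistics; the corpus is invented):

```python
from collections import Counter

def good_turing_counts(counts):
    """Good-Turing re-estimated counts c* = (c+1) * N_{c+1} / N_c."""
    freq_of_freq = Counter(counts.values())   # N_c: how many items occur c times

    def adjusted(c):
        if freq_of_freq.get(c + 1, 0) == 0:
            return float(c)                   # no higher count class: keep raw c
        return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

    return {g: adjusted(c) for g, c in counts.items()}

counts = Counter("the cat sat on the mat".split())   # the:2, others:1
adj = good_turing_counts(counts)
# four singletons (N1=4) and one doubleton (N2=1), so c*=1 items get
# (1+1) * 1/4 = 0.5; the freed mass goes to unseen events in Katz backoff
```

The discounted mass taken from low-count n-grams is what Katz backoff redistributes to lower-order estimates for unseen events.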
Transfer Learning for British Sign Language Modelling
B Mocialov, H Hastie, G Turner – Proceedings of the Fifth Workshop on …, 2018 – aclweb.org
… As a result, the model trained on the English language and applied to the BSL scored 1051.91 in perplexity using the SRILM toolkit (Stolcke, 2002). Conversely, a model trained on the BSL has been applied to the English language and scored 1447.23 in perplexity …
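Perplexity figures like those reported above are exponentiated average negative log-likelihoods over a test text. A minimal sketch of the computation behind SRILM's `ngram -ppl` output, assuming per-word model probabilities are already available (the probabilities below are invented):

```python
import math

def perplexity(word_probs):
    """PPL = exp(-(1/N) * sum_i log p(w_i)) for per-word probabilities."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# a model that is uniform over a 1000-word vocabulary has perplexity 1000,
# regardless of the text length
ppl = perplexity([1 / 1000] * 20)
```

Lower perplexity means the model found the text more predictable, which is why the cross-language scores quoted in the entry (1051.91 vs. 1447.23) indicate a poor but asymmetric fit.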
Ameliorated language modelling for lecture speech recognition of Indian English
DK Phull, GB Kumar – Sādhanā, 2018 – Springer
… clustering to be 1000. The LM has been built using the variKN toolkit [26] considering the 64k most frequent words for trigram LMs. The interpolation of the LM has been performed using the SRILM toolkit [27]. We have used WER (%), perplexity …
An Arabic Morphological Analyzer and Generator with Copious Features
D Taji, S Khalifa, O Obeid, F Eryani… – Proceedings of the …, 2018 – aclweb.org
… each stem entry in the database. The scores were generated from the train set (Diab et al., 2013) of the PATB. We used the SRILM toolkit (Stolcke, 2002) to generate the scores with no smoothing. In Section 6, we show that these …
Arabic Speech Recognition: Challenges and State of the Art
SM Abdou, AM Moussa – … Speech And Image Processing For Arabic …, 2018 – World Scientific
… One of the effective tools for training language models is the SRILM toolkit, which includes most of the state-of-the-art alternatives.11 … FLMs have been implemented as an add-on to the widely-used SRILM toolkit. Further details can be found in Ref. 43 …
Feature Optimization for Predicting Readability of Arabic L1 and L2
H Saddiki, N Habash, V Cavalli-Sforza… – Proceedings of the 5th …, 2018 – aclweb.org
… e.g. Fig. 1) in preparation for feature extraction. Then, both raw text and annotations from the training set are used to build LMs for each of the 4 levels of readability (Table 3) with the SRILM toolkit (Stolcke et al., 2002). At this …
Reassessing the proper place of man and machine in translation: a pre-translation scenario
J Ive, A Max, F Yvon – Machine Translation, 2018 – Springer
… 2013). We trained a 6-gram language model with modified Kneser-Ney smoothing (Kneser and Ney 1995) on the French part of the MT training data using the SRILM toolkit (Stolcke 2002). The MT system was tuned with kb-mira and 300-best lists (Cherry and Foster 2012) …
An online English-Khmer hybrid machine translation system.
S Jabin, N Chatterjee, S Samak, K Sokphyrum, J Sola – IJISTA, 2018 – researchgate.net
… dictionary. The SRI language modelling (SRILM) toolkit has been used for training a 3-gram LM for experimentation. In the present work all the translated parallel corpora are based on Samdech Sangha Raja Chuon Nath's dictionary …
Digital Automatic Speech Recognition using Kaldi
S Alyousefi – 2018 – repository.lib.fit.edu
Page 1. Digital Automatic Speech Recognition using Kaldi By Sarah Habeeb Alyousefi Bachelor of Science Computer and software Engineering Al-Mustansiriya University College of Engineering A thesis submitted to the College of Engineering at Florida Institute of Technology …
Building and Exploiting Domain-Specific Comparable Corpora for Statistical Machine Translation
R Sellami, F Sadat, LH Beluith – Intelligent Natural Language Processing …, 2018 – Springer
… Word alignment is done with GIZA++ [19]. We implemented a 5-gram language model using the SRILM toolkit [36]. We tokenized the Arabic side of the training, development and test data using the MADA + TOKAN morphological disambiguation system [26] …
Transcription of spanish historical handwritten documents with deep neural networks
E Granell, E Chammas, L Likforman-Sulem… – Journal of …, 2018 – mdpi.com
… corresponding symbol. Language Models (LM) were estimated as n-grams with Kneser–Ney back-off smoothing [26] by using the SRILM toolkit [27]. Different LMs were used in the experiments at word, sub-word and character levels. For …
Improvement in monaural speech separation using sparse non-negative tucker decomposition
YV Varshney, P Upadhyaya, ZA Abbasi… – International Journal of …, 2018 – Springer
… For better acoustic modelling, re-alignment is performed after each transform stage (e.g., LDA-MLLT and SAT). 4.3 Language model. Any language model that has an FST representation can be used in Kaldi. Here, the SRILM toolkit is used for building and applying statistical language models …
Exploring Implicit Semantic Constraints for Bilingual Word Embeddings
J Su, Z Song, Y Lu, M Xu, C Wu, Y Chen – Neural Processing Letters, 2018 – Springer
… and test sets, respectively. Table 2 shows the statistics of the various data sets. We applied the SRILM toolkit to train a 4-gram language model on the Xinhua portion of the Gigaword corpus (306 million words). We chose MOSES …
Language Modelling for Code-Switched Text
T Parekh – 2018 – saurabhgarg1996.github.io
Page 1. Language Modelling for Code-Switched Text Bachelors Thesis Project Bachelors of Technology in Computer Science and Engineering by Tanmay Parekh (140100011) in co-ordination with Saurabh Garg (140070003) under the guidance of Prof. Preethi Jyothi …
Advanced Quality Measures for Speech Translation
NT Le – 2018 – tel.archives-ouvertes.fr
HAL Id: tel-01891892 (https://tel.archives-ouvertes.fr/tel-01891892), submitted on 10 Oct 2018 …
Fusing Recency into Neural Machine Translation with an Inter-Sentence Gate Model
S Kuang, D Xiong – arXiv preprint arXiv:1806.04466, 2018 – arxiv.org
… We trained a 5-gram language model on the Xinhua portion of the Gigaword corpus using the SRILM toolkit with modified Kneser-Ney smoothing. For RNNSearch, we used the parallel corpus to train the attention-based NMT model …
Affordance-based multi-contact whole-body pose sequence planning for humanoid robots in unknown environments
P Kaiser, C Mandery, A Boltres… – 2018 IEEE International …, 2018 – ieeexplore.ieee.org
… set of representative training motions. Training the n-gram model is based on textual representations of the observed motions as configuration pose sequences, facilitated using the SRILM Toolkit [29]. In addition, we are learning …
Lattice-to-sequence attentional Neural Machine Translation models
Z Tan, J Su, B Wang, Y Chen, X Shi – Neurocomputing, 2018 – Elsevier
Parallel Corpora for bi-lingual English-Ethiopian Languages Statistical Machine Translation
ST Abate, M Melese, MY Tachbelie… – Proceedings of the 27th …, 2018 – aclweb.org
… 2003) for aligning words and phrases. SRILM toolkit was used to develop language models using semi-automatically prepared corpora from the training and tuning corpora of target languages. Table 7 shows the sentence length …
Neural Speech Translation at AppTek
E Matusov, P Wilken, P Bahar… – International …, 2018 – workshop2018.iwslt.org
Neural Speech Translation at AppTek. Evgeny Matusov, Patrick Wilken, Parnia Bahar, Julian Schamper, Pavel Golik, Albert Zeyer, Joan Albert Silvestre-Cerdà, Adrià Martínez-Villaronga, Hendrik Pesch, and …
Arabic corpus linguistics: major progress, but still a long way to go
I Zeroual, A Lakhouaja – Intelligent Natural Language Processing: Trends …, 2018 – Springer
… The corpus contains over 6,000 texts, totalling around 1 billion words, of which 800 million words are from dated texts and the rest are automatically dated by building a 5-gram language model with Kneser-Ney smoothing, using the SRILM toolkit (Stolcke et al. 2011) …
ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation
J Tejedor, DT Toledano, P Lopez-Otero… – EURASIP Journal on …, 2018 – biomedcentral.com
Query-by-example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given an acoustic (spoken) query containing the term of interest as the input. This paper presents the systems submitted to the ALBAYZIN QbE STD 2016 Evaluation held as a part …
Syntax-Based Context Representation for Statistical Machine Translation
K Chen, T Zhao, M Yang – IEICE TRANSACTIONS on Information …, 2018 – search.ieice.org
Page 1. 3226 IEICE TRANS. INF. & SYST., VOL.E101–D, NO.12 DECEMBER 2018 PAPER Syntax-Based Context Representation for Statistical Machine Translation Kehai CHEN † , Student Member, Tiejun ZHAO †a) , Nonmember, and Muyun YANG † , Member …
Improvements in Serbian speech recognition using sequence-trained deep neural networks
E Pakoci, B Popović, DJ Pekar – Труды СПИИРАН, 2018 – mathnet.ru
… The Kneser-Ney smoothing method [16] with a pruning value of 10^-7 was applied to obtain the previously mentioned numbers, as it was proven to be optimal [3]. The language model was trained using the SRILM toolkit [17] …
Deep Sign: Enabling Robust Statistical Continuous Sign Language Recognition via Hybrid CNN-HMMs
O Koller, S Zargaran, H Ney, R Bowden – International Journal of …, 2018 – Springer
… In the following experiments the prior-scaling-factor β is set to 0.3 if not stated otherwise. The LM is estimated as an n-gram using the SRILM toolkit by Stolcke (2002). The HMM is employed in Bakis structure (Bakis 1976) …
Semi-supervised acoustic model training for speech with code-switching
E Y?lmaz, M McLaren, H van den Heuvel… – Speech …, 2018 – Elsevier
Optimizing Automatic Evaluation of Machine Translation with the ListMLE Approach
M Li, M Wang – ACM Transactions on Asian and Low-Resource …, 2018 – dl.acm.org
… target side of the training corpus for machine translation with human references to form the monolingual training data, where we train a 4-gram language model based on the data, and compute the language model probability of the translation output using the SRILM toolkit [31] …
Understanding Reading Attention Distribution during Relevance Judgement
X Li, Y Liu, J Mao, Z He, M Zhang, S Ma – Proceedings of the 27th ACM …, 2018 – dl.acm.org
… We applied surprisal, which is the negative log-likelihood of a word in the context, to describe how unfamiliar a text is to users. By using the SRILM Toolkit [39], we built a bi-gram language model based on large-scale online news data [40] …
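Surprisal, as used in this entry and the Hindi syntactic-choice study above, is the negative log-probability of a word given its context. An illustrative computation under an add-alpha smoothed bigram model, a stand-in for the SRILM-trained model (the toy corpus is invented):

```python
import math
from collections import Counter

def bigram_surprisal(tokens, context, word, alpha=1.0):
    """Surprisal -log2 P(word | context) in bits, under an add-alpha
    smoothed bigram model estimated from the given token list."""
    vocab = set(tokens)
    big = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens[:-1])                       # history counts
    p = (big[(context, word)] + alpha) / (uni[context] + alpha * len(vocab))
    return -math.log2(p)

toks = "a b a b a c".split()
# P(b | a) = (2 + 1) / (3 + 3) = 0.5, i.e. 1 bit of surprisal
s = bigram_surprisal(toks, "a", "b")
```

Higher surprisal marks words that are less predictable in context, which is the quantity these reading and production studies correlate with behaviour.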
A preordering model based on phrasal dependency tree
S Farzi, H Faili, S Kianian – Digital Scholarship in the …, 2018 – academic.oup.com
Abstract. Intelligent machine translation (MT) is becoming an important field of research and development as the need for translations grows. Currently, the wo.
Alignment-consistent recursive neural networks for bilingual phrase embeddings
J Su, B Zhang, D Xiong, Y Liu, M Zhang – Knowledge-Based Systems, 2018 – Elsevier
Turkish speech recognition
E Ar?soy, M Saraçlar – Turkish Natural Language Processing, 2018 – Springer
… A vocabulary size of 50K words, yielding 11.8% OOV rate, was used to train a 3-gram word-based language model. The language model was built using the SRILM toolkit (Stolcke 2002) with interpolated Kneser-Ney smoothing …
Towards automatic assessment of spontaneous spoken English
Y Wang, MJF Gales, KM Knill, K Kyriakopoulos… – Speech …, 2018 – Elsevier
… This combined acoustic score is then used in Viterbi decoding. A Kneser-Ney trigram LM is trained on 186K words of BULATS test data and interpolated with a general English LM trained on a large broadcast news corpus, using the SRILM toolkit (Stolcke, 2002) …
Semantics in Shallow Models
O Bojar, R Sennrich, P Williams, I Skadi?a, D Deksne – 2018 – qt21.eu
Page 1. This document is part of the Research and Innovation Action “Quality Translation 21 (QT21)”. This project has received funding from the European Union’s Horizon 2020 program for ICT under grant agreement no. 645452. Deliverable D1.4 Semantics in Shallow Models …
A survey of diacritic restoration in abjad and alphabet writing systems
FỌ ASAHIAH, ỌÀ ỌDẸ́JỌBÍ… – Natural Language …, 2018 – cambridge.org
Natural Language Engineering 24 (1): 123–154. © Cambridge University Press 2017. doi:10.1017/S1351324917000407. A survey of diacritic restoration in abjad and alphabet writing systems …
Morpheme-Based Bi-Directional Ge’ez-Amharic Machine Translation
T Kassa – 2018 – 213.55.95.56
Page 1. Addis Ababa University College of Natural and Computational Sciences School of Information Science Morpheme-Based Bi-directional Ge’ez -Amharic Machine Translation A Thesis Submitted in Partial Fulfillment of the Requirement for the …
Automatic language identification in texts: A survey
T Jauhiainen, M Lui, M Zampieri, T Baldwin… – arXiv preprint arXiv …, 2018 – arxiv.org
Page 1. arXiv:1804.08186v2 [cs.CL] 21 Nov 2018 Journal of Artificial Intelligence Research () Submitted 10/2018; published – Under Review Automatic Language Identification in Texts: A Survey Tommi Jauhiainen tommi.jauhiainen@helsinki.fi …
Multimodality, interactivity, and crowdsourcing for document transcription
E Granell, V Romero… – Computational …, 2018 – Wiley Online Library
Robust Spoken Term Detection using partial search and re-scoring hypothesized detections techniques
P VAN TUNG – 2018 – ntu.edu.sg
Page 1. Robust Spoken Term Detection using partial search and re-scoring hypothesized detections techniques A thesis submitted to the School of Computer Science and Engineering of the Nanyang Technological University by PHAM VAN TUNG …
Machine Translation of Arabic Dialects
WS Salloum – 2018 – academiccommons.columbia.edu
Page 1. Machine Translation of Arabic Dialects Wael Salloum Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2018 Page 2. © 2018 Wael Salloum All rights reserved …
Resource2Vec: Linked Data distributed representations for term discovery in automatic speech recognition
A Coucheiro-Limeres, J Ferreiros-Lopez… – Expert Systems with …, 2018 – Elsevier
Syntactic and semantic features for statistical and neural machine translation
M N?dejde – 2018 – era.lib.ed.ac.uk
Page 1. This thesis has been submitted in fulfilment of the requirements for a postgraduate degree (eg PhD, MPhil, DClinPsychol) at the University of Edinburgh. Please note the following terms and conditions of use: This work …
Neural Networks for Language Modeling and Related Tasks in Low-Resourced Domains and Languages
O TILK – 2018 – digi.lib.ttu.ee
Page 1. TALLINN UNIVERSITY OF TECHNOLOGY DOCTORAL THESIS 55/2018 Neural Networks for Language Modeling and Related Tasks in Low-Resourced Domains and Languages OTTOKAR TILK Page 2. TALLINN UNIVERSITY …
Neural Creative Language Generation
M Ghazvininejad – 2018 – search.proquest.com
Page 1. NEURAL CREATIVE LANGUAGE GENERATION by Marjan Ghazvininejad A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree …
Towards effective cross-lingual search of user-generated internet speech
A Khwileh – 2018 – doras.dcu.ie
Page 1. Towards Effective Cross-Lingual Search of User-Generated Internet Speech Ahmad Khwileh B.Tech., M.Tech. A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.) to the Dublin City University School of Computing …
Driver Behavior and Environment Interaction Modeling for Intelligent Vehicle Advancements
Y Zheng – 2018 – utd-ir.tdl.org
Page 1. DRIVER BEHAVIOR AND ENVIRONMENT INTERACTION MODELING FOR INTELLIGENT VEHICLE ADVANCEMENTS by Yang Zheng APPROVED BY SUPERVISORY COMMITTEE: _____ Dr. John HL Hansen, Chair …
Understanding stories via event sequence modeling
H Peng – 2018 – ideals.illinois.edu
Page 1. c 2018 Haoruo Peng Page 2. UNDERSTANDING STORIES VIA EVENT SEQUENCE MODELING BY HAORUO PENG DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy …
Compressive Cross-Language Text Summarization
EL Pontes – 2018 – hal.archives-ouvertes.fr
HAL Id: tel-02003886 (https://hal.archives-ouvertes.fr/tel-02003886), submitted on 1 Feb 2019 …