IRSTLM (IRST Language Modeling) Toolkit 2016


Notes:

IRSTLM is a tool for the estimation, representation, and computation of statistical language models.  A statistical language model is a probability distribution over sequences of words.
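As a concrete illustration of that definition, a maximum-likelihood bigram model assigns a sentence the product of its conditional word probabilities via the chain rule. The following is a minimal Python sketch with hypothetical toy data, not IRSTLM's actual implementation:

```python
from collections import Counter

def train_bigram(corpus):
    """Maximum-likelihood bigram model: P(w | prev) = c(prev, w) / c(prev)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(p, sent):
    """Chain rule: P(w1..wn) = product of P(wi | wi-1)."""
    toks = ["<s>"] + sent.split() + ["</s>"]
    prob = 1.0
    for prev, w in zip(toks[:-1], toks[1:]):
        prob *= p(prev, w)
    return prob

p = train_bigram(["the cat sat", "the cat ran"])
print(sentence_prob(p, "the cat sat"))  # 0.5
```

Real toolkits such as IRSTLM add smoothing so that unseen n-grams do not receive zero probability.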

The former Istituto per la Ricerca Scientifica e Tecnologica (IRST) is now the Bruno Kessler Foundation (FBK), sometimes referred to as FBK-IRST; accordingly, the toolkit is also known as the FBK IRSTLM Toolkit.

Resources:

Wikipedia:

See also:

Language Modeling & Dialog Systems 2016 | Rule-based Language Modeling | SRILM (SRI Language Modeling Toolkit) 2016


Improving the performance of translation process in a statistical machine translator using sequence IRSTLM translation parameters and pruning
T Mantoro, J Asian, MA Ayu – Informatics and Computing (ICIC) …, 2016 – ieeexplore.ieee.org
Abstract: A translation process, as a critical part of a machine translation, can be put simply as a process of decoding the meaning of the source text and re-encoding the meaning into the target language. Unfortunately, most of the translation process requires complex

PROMT translation systems for WMT 2016 translation tasks
A Molchanov, F Bykov – Proceedings of the First Conference on Machine …, 2016 – aclweb.org
… The IRSTLM toolkit (Federico et al., 2008) is used to build language models, which are scored using KenLM (Heafield, 2011) in the decoding process. … 2008. Irstlm: an open source toolkit for handling large scale language models. …

Normalized Log-Linear Interpolation of Backoff Language Models is Efficient
K Heafield, C Geigle, S Massung, L Schwartz – Urbana, 2016 – research.ed.ac.uk
… IRSTLM (Federico et al., 2008) asks the user to specify a common large vocabulary size. … Nonetheless, in our experiments we extend IRSTLM’s approach by training models with a common vocabulary size, rather than retrofitting it at query time. 3.2 Offline Linear Interpolation …
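Linear interpolation of language models, as discussed in this paper, mixes several models with weights that sum to one; a shared vocabulary (the point of IRSTLM's common vocabulary size) keeps the component distributions comparable. A minimal sketch with hand-picked toy distributions, not the paper's method:

```python
def interpolate(models, weights):
    """Linear interpolation: P(w | h) = sum_i lambda_i * P_i(w | h).
    If each P_i is a distribution over the same vocabulary and the
    weights sum to one, the mixture is itself a distribution."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return lambda w, h: sum(lam * m(w, h) for lam, m in zip(weights, models))

# Two toy conditional distributions over a shared vocabulary {a, b}
m1 = lambda w, h: {"a": 0.9, "b": 0.1}[w]
m2 = lambda w, h: {"a": 0.2, "b": 0.8}[w]
mix = interpolate([m1, m2], [0.7, 0.3])
print(mix("a", ()))                 # ~0.69
print(mix("a", ()) + mix("b", ()))  # ~1.0
```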

Persian-Spanish Low-resourced Statistical Machine Translation System Through English as Pivot Language
B Ahmadnia, J Serrano, G Haffari – academia.edu
… Our language models for all systems are 4- grams and they are built using the IRSTLM toolkit (Federico et al., 2008). … Federico M., Bertoldi, N., Cettolo, M. (2008). Irstlm: An open source toolkit for handling large scale language models. In Proceedings of ISCA, pages 1618–1621. …

Domain adaptation for statistical machine translation
X Wang, C Zhu, S Li, T Zhao… – … , Fuzzy Systems and …, 2016 – ieeexplore.ieee.org
… We take BTEC (1.5M) and HIT (5M) as two different domains. Testing data is additional HIT (50K). IRSTLM toolkit1 is used to train a 3-gram model. … of Model_i; λi is the weight of Model_i. 1http://sourceforge.net/projects/irstlm/ 2http://www.phontron.com/pialign/ …

A Novel Approach by Injecting CCG Supertags into an Arabic–English Factored Translation Machine
HA Rajeh, Z Li, AM Ayedh – Arabian Journal for Science and Engineering, 2016 – Springer
… Arab J Sci Eng (2016) 41:3071–3080 [Figure: phrase-based pipeline with GIZA++ word alignment, phrase extraction and scoring, IRST-LM n-gram training of the target language model, and Moses decoding combining the translation model and target language model, built from parallel and target-language corpora] …

An Arabic-Hebrew parallel corpus of TED talks
M Cettolo – arXiv preprint arXiv:1610.00572, 2016 – arxiv.org
… SMT systems were developed with the MMT toolkit,10 which builds engines on the Moses decoder (Koehn et al., 2007), IRSTLM (Federico et al., 2008) and fast align (Dyer et al., 2013). … 2008. IRSTLM: an Open Source Toolkit for Handling Large Scale Language Models. …

Machine translation based data augmentation for Cantonese keyword spotting
G Huang, A Gorin, JL Gauvain… – Acoustics, Speech and …, 2016 – ieeexplore.ieee.org
… We use the Moses toolkit in connection with GIZA++ for word alignment and IRSTLM [17] for target language modelling. … 177–180. [17] Marcello Federico, Nicola Bertoldi, and Mauro Cettolo, “IRSTLM: an open source toolkit for handling large scale language models,” 2008. …

Grammatical error correction using neural machine translation.
Z Yuan, T Briscoe – HLT-NAACL, 2016 – aclweb.org
… The IRSTLM Toolkit (Federico et al., 2008) is used to build a 5-gram language model with modified Kneser-Ney smoothing (Kneser and Ney, 1995). 4.4 NMT training details … 2008. IRSTLM: an open source toolkit for handling large scale language models. …

A Continuous Space Rule Selection Model for Syntax-based Statistical Machine Translation.
J Zhang, M Utiyama, E Sumita, G Neubig… – ACL (1 …, 2016 – pdfs.semanticscholar.org
… GIZA++ (Och and Ney, 2003) was used for word alignment. A 5-gram language model was trained on the target side of the training corpus using the IRST-LM Toolkit7 with modified Kneser-Ney smoothing. … 6http://sourceforge.net/projects/mecab/files/ 7http://hlt.fbk.eu/en/irstlm …

Machine Translation Based Data Augmentation for Cantonese Keyword Spotting (Author’s Manuscript)
G Huang, A Gorin, JL Gauvain, L Lamel – 2016 – dtic.mil
… We use the Moses toolkit in connection with GIZA++ for word alignment and IRSTLM [17] for target language modelling. … 177–180. [17] Marcello Federico, Nicola Bertoldi, and Mauro Cettolo, “IRSTLM: an open source toolkit for handling large scale language models,” 2008. …

Online Automatic Post-Editing across Domains
R Chatterjee, G Gebremelak, M Negri, M Turchi – CLiC it, 2016 – researchgate.net
… 2 Only those training instances that have similarity score above a certain threshold (decided over a held-out development set) are used to build: i) a tri-gram local language model over the target side of the training corpus with the IRSTLM toolkit (Federico et al., 2008); ii) the …

English-Dogri Translation System using MOSES
A Singh, A Kour, SS Jamwal – 2016 – pdfs.semanticscholar.org
… Nayan Jyoti Kalita and Baharul Islam [6] used Moses for Bengali to Assamese machine translation. Other translation tools like IRSTLM for the language model and GIZA for the translation model are utilized within this framework, which is accessible in Linux environments. …

Deriving Phonetic Transcriptions and Discovering Word Segmentations for Speech-to-Speech Translation in Low-Resource Settings.
A Wilkinson, T Zhao, AW Black – INTERSPEECH, 2016 – pdfs.semanticscholar.org
… All sentences used were over 10 tokens in length. All MT models were trained using Moses, in a standard phrase-based approach incorporating IRSTLM and mgiza. Where conventional orthography was present, sentences were truecased and tokenized. 3.1. …

Statistical Machine Translation System for Indian Languages
BNVN Raju, MB Raju – Advanced Computing (IACC), 2016 …, 2016 – ieeexplore.ieee.org
… In this, the Language Model will make use of IRSTLM or SRILM, the Translation Model makes use of GIZA++, and the decoder will make use of Moses for designing a Translation Model [6]. Long sentences are filtered because GIZA++ requires a long time to train on them. …

FPGA-based low-power speech recognition with recurrent neural networks
M Lee, K Hwang, J Park, S Choi… – … Systems (SiPS), 2016 …, 2016 – ieeexplore.ieee.org
… The statistical tri-gram LM is generated with the IRSTLM [27] toolkit included in the KALDI speech recognition tool [28]. build-lm.sh and compile-lm in the IRSTLM toolkit are used to generate a standard advanced research project …
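The text format that tools such as build-lm.sh and compile-lm ultimately target is the standard ARPA n-gram file layout. The following is a minimal, unigram-only sketch of that layout (an assumption-labelled illustration; real IRSTLM output adds higher orders, back-off weights, and its own intermediate formats):

```python
import math
import os
import tempfile

def write_arpa(path, unigrams):
    """Write a unigram-only model in the standard ARPA text layout:
    a \\data\\ header with n-gram counts, a \\1-grams: section of
    log10-probability <tab> word lines, and a closing \\end\\."""
    with open(path, "w") as f:
        f.write("\\data\\\nngram 1=%d\n\n\\1-grams:\n" % len(unigrams))
        for w, p in sorted(unigrams.items()):
            f.write("%.6f\t%s\n" % (math.log10(p), w))
        f.write("\\end\\\n")

path = os.path.join(tempfile.gettempdir(), "toy.lm")
write_arpa(path, {"the": 0.5, "cat": 0.25, "sat": 0.25})
lines = open(path).read().splitlines()
print(lines[0])  # \data\
```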

Using Kazakh Morphology Information to Improve Word Alignment for SMT
A Kartbayev – Proceedings of the Second International Afro …, 2016 – Springer
… All 5-gram language models were trained with the IRSTLM toolkit [22] and then were converted to binary form using KenLM for a faster execution [23]. … Federico, M., Bertoldi, N., Cettolo, M.: IRSTLM: An open source toolkit for handling large scale language models. …

UGENT-LT3 SCATE Submission for WMT16 Shared Task on Quality Estimation
A Tezcan, V Hoste, L Macken – First Conference on Machine …, 2016 – biblio.ugent.be
… For building the PoS LM, we used TreeTagger (Schmid, 1995) to obtain the PoS tags on the target (DE) data and IRSTLM (Federico et al., 2008) for building the LM. … 2008. Irstlm: an open source toolkit for handling large scale language models. …

A hybrid strategy method for patent machine translation
Z Liu, K Zhao – Asian Language Processing (IALP), 2016 …, 2016 – ieeexplore.ieee.org
… performance. We train the 5-gram language model with IRSTLM tools, align the words and train the translation model with GIZA++ tools, ensure the best translation quality with Kneser-Ney smoothing method. IV. IMPLEMENTATION …

Post Editing System for Statistical Machine Translation
H Singh, V Goyal, A Kumar – ijiset.com
… corpus in the required language pair. (b) Find the probability of p(e) using a language model toolkit like SRILM or IRSTLM. (c) Find the probability of p(f|e) using a translation modeling toolkit like GIZA or GIZA++. (d) Use the decoder to …
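The noisy-channel pipeline quoted above (a language model for p(e), a translation model for p(f|e), then a decoder) amounts to an argmax over candidate translations. A toy sketch with purely hypothetical numbers and candidates:

```python
def decode(f, candidates, lm, tm):
    """Noisy-channel decoding: choose e maximizing P(e) * P(f|e)."""
    return max(candidates, key=lambda e: lm[e] * tm[(f, e)])

# Hand-picked toy probabilities, purely illustrative
lm = {"cat": 0.6, "chat": 0.4}                      # P(e), language model
tm = {("chat", "cat"): 0.7, ("chat", "chat"): 0.3}  # P(f|e), translation model
print(decode("chat", ["cat", "chat"], lm, tm))  # cat
```

Real decoders such as Moses search over exponentially many candidate segmentations and reorderings rather than a fixed list.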

On the evaluation of adaptive machine translation for human post-editing
L Bentivogli, N Bertoldi, M Cettolo, M Federico… – IEEE/ACM Transactions …, 2016 – dl.acm.org
… from monolingual texts, translation models rely on word-aligned [27] parallel corpora from which word-pair and phrase-pair statistics are extracted.3 Open source software is available to cover the whole processing pipeline: MGIZA++ [29] for word alignment, IRSTLM [30] for …

Developing a unit selection voice given audio without corresponding text
T Godambe, SK Rallabandi… – EURASIP …, 2016 – asmp-eurasipjournals.springeropen …
… We used scripts provided with the Kaldi toolkit [19] for training DNN-based ASR systems and the IRSTLM tool [20] for building language models. … 3.1.3 Language modeling. We used the IRSTLM toolkit [20] for training language models. …

Hierarchical pre-reordering model for patent machine translation
R Hu, K Zhao, H Li, Y Zhu, Y Jin – Asian Language Processing …, 2016 – ieeexplore.ieee.org
… About the configuration, we use grow-diag- final-and strategy for word alignment, IRSTLM to build a 5-gram language model to ensure fluent outputs, and improved Kneser-Ney for smoothing. We configure the minimum error rate when training the translation system. …

Direct translation vs. Pivot language translation for Persian-Spanish low-resourced Statistical Machine Translation System
B Ahmadnia, J Serrano – waset.org
… Our language models for each system are 4-grams and they are built using the IRSTLM toolkit [11]. We use a maximum phrase length of 6 to account. … [11] M. Federico, N. Bertoldi and M. Cettolo, Irstlm: an open source toolkit for handling large scale language models, 2008. …

Online Adaptation of Statistical Machine Translation with Sparse Features
P Mathur – Fondazione Bruno Kessler (FBK) – hlt-mt.fbk.eu
… reordering models. On the target side, we built 5-gram LMs for WAT and Legal tasks and a 6-gram LM for the IT task, using IRSTLM (Federico et al., 2008) with improved Kneser-Ney smoothing (Chen and Goodman, 1998). The …

Candidate re-ranking for SMT-based grammatical error correction.
Z Yuan, T Briscoe, M Felice – BEA@ NAACL-HLT, 2016 – aclweb.org
… outputs fluent English sentences. The IRSTLM Toolkit (Federico et al., 2008) is used to build n-gram language models (up to 5-grams) with modified Kneser-Ney smoothing (Kneser and Ney, 1995). Previous work has shown …
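Modified Kneser-Ney, cited repeatedly in these entries, refines interpolated Kneser-Ney with count-dependent discounts. A minimal sketch of the plain interpolated variant for bigrams (single discount D, toy corpus; an illustration of the idea, not IRSTLM's implementation) shows the two ingredients, absolute discounting and the continuation probability:

```python
from collections import Counter

def kneser_ney_bigram(corpus, D=0.75):
    """Interpolated Kneser-Ney for bigrams with a single discount D:
    P(w|u) = max(c(u,w) - D, 0)/c(u) + (D * |follow(u)| / c(u)) * Pcont(w),
    where Pcont(w) = |distinct left contexts of w| / |bigram types|."""
    bigrams = Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        bigrams.update(zip(toks[:-1], toks[1:]))
    ctx_total, follow, left = Counter(), {}, {}
    for (u, w), c in bigrams.items():
        ctx_total[u] += c                     # c(u)
        follow.setdefault(u, set()).add(w)    # distinct words after u
        left.setdefault(w, set()).add(u)      # distinct contexts before w
    n_types = len(bigrams)
    def p(w, u):  # u must be an observed context
        pcont = len(left.get(w, ())) / n_types
        lam = D * len(follow[u]) / ctx_total[u]
        return max(bigrams[(u, w)] - D, 0) / ctx_total[u] + lam * pcont
    return p

p = kneser_ney_bigram(["the cat sat", "the cat ran"])
print(p("sat", "cat"))  # ~0.25; the distribution over the vocabulary sums to 1
```

The discounted mass taken from seen bigrams is exactly the mass redistributed through Pcont, so each conditional distribution still sums to one.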

SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task.
R Del Gaudio, G Labaka, E Agirre, P Osenova… – WMT, 2016 – aclweb.org
… For the creation of the language models, IRSTLM was used to train a 5-gram language model with Kneser-Ney smoothing on the monolingual part of the training corpora. For the TectoMT system, the analysis of Dutch input uses the Alpino system (Noord, 2006), a …

Building a Bidirectional English-Vietnamese Statistical Machine Translation System by Using MOSES
NQ Phuoc, Y Quan, CY Ock – International Journal of …, 2016 – search.proquest.com
… It is written in C++ and Perl and released under the LGPL license with both source code and binaries available. The system implements GIZA++ used for word alignment and IRSTLM used to build 3-gram and 4-gram language model. …

Learning local word reorderings for hierarchical phrase-based statistical machine translation
J Zhang, M Utiyama, E Sumita, H Zhao, G Neubig… – Machine …, 2016 – Springer
… We used the default parameters for Moses. A 5-gram language model was trained on the target side of the training corpus using the IRST LM Toolkit 6 with improved Kneser-Ney smoothing (Chen and Goodman 1999). Since …

Comparative study of factored SMT with baseline SMT for English to Kannada
KM Shivakumar, N Shivaraju… – Inventive …, 2016 – ieeexplore.ieee.org
… a Language Model for Kannada factored sentences. To create a Language model we can use any one LM tool among IRSTLM, SRILM or KENLM. For a factored SMT training the LM tool creates two files namely surface.lm and PoS.lm. …

Human-centric point of view for a robot partner: a cooperative project between France and Japan
M Jacquemont, J Woo, J Botzheim… – … on Research and …, 2016 – ieeexplore.ieee.org
… Fig. 5. Performance of the tested language models. We noticed that the 4-gram language model created with the IRSTLM toolkit gives the best result, better than the language model available on the CMU Sphinx website, and close to the 5-gram one. …

Improving DNN-Based Automatic Recognition of Non-native Children’s Speech with Adult Speech
Y Qian, X Wang, K Evanini, D Suendermann-Oeft – suendermann.com
… the CMU dictionary to create a new pronunciation dictionary. A trigram LM is trained from the transcriptions of the AsrTrain set using the IRSTLM toolkit [25]. 4.2. Improving Speech Recognition for Children using Adult Data Despite …

Efficient n-gram, skipgram and flexgram modelling with Colibri Core
M van Gompel… – Journal of …, 2016 – openresearchsoftware.metajnl.com
… Software that springs from such studies is widespread in the field. Examples, by no means exhaustive, are SRILM [13], IRSTLM [2], and KenLM [5]. Focus on efficient modelling with regards to memory consumption and look-up speed is an important component in such studies. …

Enhancing the Performance of Audio Visual Speech Recognition Using Deep Learning Techniques
A Dutta, GR SharadaValiveti – csjournals.com
… V. TOOLS AND LIBRARIES: the tools and libraries used for the implementation include NVIDIA CUDA 7.5, OpenFst (used for most of the compilation), and IRSTLM [20]. …

Machine Translation Development for Indian Languages and its Approaches
A Godase, S Govilkar – Date accessed, 2016 – academia.edu
… 2. Model for English- Urdu Statistical Machine Translation [31] 2013 English-Urdu General Statistical Approach The model is trained on TrainSet using Moses with language modeling toolkit IRSTLM. TestSet gives the BLEU score of 32.11. …

Addis Ababa University College of Natural sciences
KH AMARE – 2016 – etd.aau.edu.et
… Generally, this system was developed after reviewing the literature and related work, and selecting the appropriate tools and data sources such as Moses, GIZA++ and IRSTLM … needed such as Moses, GIZA++ and IRSTLM. Data Sources …

Efficient n-gram, Skipgram and Flexgram Modelling with Colibri Core
M Gompel, APJ van den Bosch – 2016 – repository.ubn.ru.nl
… field. Examples, by no means exhaustive, are SRILM [13], IRSTLM [2], and KenLM [5]. Focus on efficient modelling with regards to memory consumption and look-up speed is an important component in such studies. Others …

The Denoised Web Treebank: Evaluating Dependency Parsing under Noisy Input Conditions.
J Daiber, R van der Goot – LREC, 2016 – let.rug.nl
… An n-gram language model is built on the English side of the news-commentary data set using IRSTLM (Federico and Cettolo, 2007). Model weights are estimated using MERT (Och, 2003). All experiments are performed on the development part of our dataset. …

Instance Selection for Online Automatic Post-Editing in a Multi-domain Scenario
R Chatterjee, M Arcan, M Negri, M Turchi – AMTA 2016, Vol., 2016 – amtaweb.org
… The first is the language model: A tri-gram local language model is built over the target side of the training corpus with the IRSTLM toolkit (Federico et al., 2008). … Irstlm: an open source toolkit for handling large scale language models. …

Self-Adaptive DNN for Improving Spoken Language Proficiency Assessment.
Y Qian, X Wang, K Evanini… – …, 2016 – pdfs.semanticscholar.org
… Two trigram LMs are trained from the transcriptions of the AsrTrain set in the monolog corpus and the AsrAdapt set in the dialog corpus by the IRSTLM toolkit [23], separately. Linear interpolation is used to combine these two LMs. …

DNN adaptation for recognition of children speech through automatic utterance selection
M Matassoni, D Falavigna… – … Workshop (SLT), 2016 – ieeexplore.ieee.org
… around 2 Gigawords. Training texts were collected from different sources, mainly in the domain of news, such as journals and news websites. The LM was estimated with the IRSTLM open source toolkit [58]. Before training, texts …

Enriching Phrase Tables for Statistical Machine Translation Using Mixed Embeddings.
P Passban, Q Liu, A Way – COLING, 2016 – aclweb.org
… All sentences are randomly selected from the En–Fr part of the Europarl (Koehn, 2005) collection. In our models we use 5-gram language models trained using the IRSTLM toolkit (Stolcke, 2002) and we tune models via MERT (Och, 2003). …

Statistical Morphological Disambiguation for Kazakh Language
D Azamat – 2016 – nur.nu.edu.kz
Page 1. NAZARBAYEV UNIVERSITY SCHOOL OF SCIENCE AND TECHNOLOGY Daiana Azamat Statistical Morphological Disambiguation for Kazakh Language Mathematics Major Capstone project Advisor: Zh. Assylbekov, PhD Second reader: A. Makazhanov, MSc …

A fast and compact language model implementation using double-array structures
JY Norimatsu, M Yasuhara, T Tanaka… – ACM Transactions on …, 2016 – dl.acm.org
… Since the bit length of each fingerprint in their experiments is short, the query sometimes fails. This means that it is a “lossy” language model. Federico et al. [2008] have proposed a language model implementation, IRSTLM, that is well suited to large-scale language models. …
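The "lossy" behaviour described in this entry can be reproduced with a toy fingerprint table: probabilities are keyed only by a short hash of the n-gram, so distinct n-grams that share a fingerprint become indistinguishable and a query can silently return a wrong value. This is an illustrative sketch, not the paper's actual data structure:

```python
import hashlib

def fingerprint(ngram, bits=16):
    """Map an n-gram to a short hash fingerprint. The fewer the bits,
    the more likely two distinct n-grams collide."""
    digest = hashlib.md5(" ".join(ngram).encode()).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)

# Store probabilities keyed only by fingerprint: the n-gram itself is
# discarded, so a collision cannot be detected at query time (lossy).
table = {fingerprint(("the", "cat")): 0.5,
         fingerprint(("a", "dog")): 0.3}

def lookup(ngram):
    return table.get(fingerprint(ngram))

print(lookup(("the", "cat")))
```

With very short fingerprints, an n-gram that was never stored can hash onto an existing entry and falsely receive its probability, which is exactly the failure mode the paper discusses.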

Modernising historical Slovene words
Y Scherrer, T Erjavec – Natural Language Engineering, 2016 – cambridge.org
… 4 https://code.google.com/p/giza-pp/. 5 http://www.statmt.org/moses/. 6 http://hlt.fbk.eu/technologies/irstlm-irst-language-modelling-toolkit. …

Selection of correction candidates for the normalization of Spanish user-generated content
M Melero, MR Costa-Jussà, P Lambert… – Natural Language …, 2016 – cambridge.org
Page 1. Natural Language Engineering 22 (1): 135–161. © Cambridge University Press 2014 doi:10.1017/S1351324914000011 Selection of correction candidates for the normalization of Spanish user-generated content …

Language modeling for automatic speech recognition of inflective languages: an applications-oriented approach using lexical data
G Donaj, Ž Kačič – 2016 – books.google.com
… Examples are Good-Turing, Witten-Bell, Kneser-Ney and modified Kneser-Ney [7]. Language modeling toolkits like SRILM and IRSTLM often have several of these smoothing techniques implemented. When …

Investigate more robust features for Speech Recognition using Deep Learning
T Deniaux – 2016 – diva-portal.org
Page 1. IN DEGREE PROJECT ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS, STOCKHOLM, SWEDEN 2016 Investigate more robust features for Speech Recognition using Deep Learning TIPHANIE DENIAUX …

N-gram language models for massively parallel devices.
N Bogoychev, A Lopez – ACL (1), 2016 – homepages.inf.ed.ac.uk
… The MIT Press, 3rd edition. M. Federico, N. Bertoldi, and M. Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of Inter- speech, pages 1618–1621. ISCA. S. Green, D. Cer, and C. Manning. 2014. …

Similar Word Model for Unfrequent Word Enhancement in Speech Recognition
X Ma, D Wang, J Tejedor, X Ma, D Wang… – IEEE/ACM Transactions …, 2016 – dl.acm.org
… particular short history [3]. Due to their simplicity and efficiency, n-gram LMs have been widely used in modern large-scale ASR systems [4], [5]. There are many freely available tools that can be used to build and manipulate n-gram LMs, e.g., SRILM [6], MITLM [7], IRSTLM [8], and …

Improving Semantic Parsing with Enriched Synchronous Context-Free Grammars in Statistical Machine Translation
J Li, M Zhu, W Lu, G Zhou – ACM Transactions on Asian and Low …, 2016 – dl.acm.org
Page 1. 6 Improving Semantic Parsing with Enriched Synchronous Context-Free Grammars in Statistical Machine Translation JUNHUI LI, Soochow University, China MUHUA ZHU, Alibaba Inc., China WEI LU, Singapore University …

Achieving Automatic Speech Recognition for Swedish using the Kaldi toolkit
Z Mossberg – 2016 – diva-portal.org
Page 1. IN DEGREE PROJECT COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS, STOCKHOLM, SWEDEN 2016 Achieving Automatic Speech Recognition for Swedish using the Kaldi toolkit ZIMON MOSSBERG …

Scalable Machine Translation in Memory Constrained Environments
P Baltescu – arXiv preprint arXiv:1610.02003, 2016 – arxiv.org
Page 1. Scalable Machine Translation in Memory Constrained Environments Paul-Dan Baltescu St. Hugh’s College University of Oxford A thesis submitted for the degree of Master by Research Trinity 2016 arXiv:1610.02003v1 [cs.CL] 6 Oct 2016 Page 2. Abstract …

A study of similar word model for unfrequent word enhancement in speech recognition
X Ma, D Wang, J Tejedor – Center for Speech and Language …, 2016 – cslt.riit.tsinghua.edu.cn
… a particular short history [3]. Due to their simplicity and efficiency, n-gram LMs have been widely used in modern large-scale ASR systems [4,5]. There are many freely available tools that can be used to build and manipulate n-gram LMs, e.g., SRILM [6], MITLM [7], IRSTLM [8] and …

Spelling normalisation and linguistic analysis of historical text for information extraction
E Pettersson – 2016 – diva-portal.org
Page 1. ACTA UNIVERSITATIS UPSALIENSIS Studia Linguistica Upsaliensia 17 Page 2. Page 3. Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction Eva Pettersson Page 4. Dissertation …

Artificial error generation for translation-based grammatical error correction
M Felice – 2016 – cl.cam.ac.uk
Page 1. Technical Report Number 895 Computer Laboratory UCAM-CL-TR-895 ISSN 1476-2986 Artificial error generation for translation-based grammatical error correction Mariano Felice October 2016 15 JJ Thomson Avenue …