Corpus Building


Corpus building is the process of creating a corpus: a collection of linguistic data, such as text or speech, assembled for research or analysis. Corpora are used to study language and linguistic phenomena, and they support applications such as language learning, translation, and language modeling.

Corpora vary in size, language, genre, and other characteristics. Common types include written corpora, which consist of books, articles, and other written documents; spoken corpora, which consist of conversations, lectures, and other spoken interactions; and specialized corpora, which are designed to study specific languages, domains, or linguistic phenomena.

Building a corpus involves collecting, transcribing, and annotating the data it will contain. This can be time-consuming and resource-intensive, since it may require finding, selecting, and organizing large amounts of data from a variety of sources. The effort can pay off, however, by providing a rich and diverse dataset for a wide range of linguistic research and analysis.
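As a rough illustration of the collect, clean, and annotate steps, the following Python sketch builds a tiny in-memory corpus. Everything in it is an assumption made for illustration: the Document class, the build_corpus and corpus_stats helpers, the metadata keys ("mode", "genre"), and the very crude regular-expression tokenizer are hypothetical and do not come from any particular corpus toolkit. A real project would more likely use a dedicated tokenizer and annotation tool.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """One corpus item: raw text plus descriptive metadata (hypothetical schema)."""
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)
    tokens: list = field(default_factory=list)


def tokenize(text):
    # Crude tokenizer: lowercase, keep runs of ASCII letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_corpus(raw_items):
    """Collect (doc_id, text, metadata) triples into a list of Documents.

    Whitespace is normalized, and empty texts and exact duplicates are skipped.
    """
    corpus = []
    seen = set()
    for doc_id, text, metadata in raw_items:
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if not cleaned or cleaned in seen:
            continue
        seen.add(cleaned)
        corpus.append(Document(doc_id, cleaned, dict(metadata), tokenize(cleaned)))
    return corpus


def corpus_stats(corpus):
    # Basic size figures: documents, running words (tokens), distinct words (types).
    counts = Counter(tok for doc in corpus for tok in doc.tokens)
    return {
        "documents": len(corpus),
        "tokens": sum(counts.values()),
        "types": len(counts),
    }


# Toy input: two written items (one a near-verbatim duplicate) and one spoken item.
raw = [
    ("d1", "The cat sat on the mat.", {"mode": "written", "genre": "fiction"}),
    ("d2", "The  cat sat on the mat.", {"mode": "written", "genre": "fiction"}),
    ("d3", "Well, the dog barked.", {"mode": "spoken", "genre": "conversation"}),
]
corpus = build_corpus(raw)
stats = corpus_stats(corpus)
print(stats)
```

Keeping metadata alongside each text is what later allows a corpus to be filtered by type, for example selecting only the spoken items.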

See also:

Best Corpus Linguistics Videos | Corpus Annotation Tools | Corpus Creation | Corpus Workbench | UAM CorpusTool

[BOOK] Building and Using Comparable Corpora S Sharoff, R Rapp, P Zweigenbaum, P Fung – 2013 – Springer This book came from the experience of a series of annual BUCC workshops. The first workshop of this kind was held in 2008 at LREC in Marrakech organised by Pierre Zweigenbaum, Éric Gaussier and Pascale Fung. Since then, the workshops changed the … Cited by 3 Related articles All 3 versions

‘Proper vocabulary and juicy collocations’: EAP students evaluate do-it-yourself corpus-building M Charles – English for Specific Purposes, 2012 – Elsevier This paper reports on the feasibility and value of an approach to teaching EAP writing in which students construct and examine their own individual, discipline-specific corpora. The approach was trialed in multidisciplinary classes of advanced-level students (mostly … Cited by 23 Related articles All 3 versions

The AMARA Corpus: Building parallel language resources for the educational domain A Abdelali, F Guzman, H Sajjad, S Vogel – Proceedings of the Ninth …, 2014 – Abstract This paper presents the AMARA corpus of on-line educational content: a new parallel corpus of educational video subtitles, multilingually aligned for 20 languages, ie 20 monolingual corpora and 190 parallel corpora. This corpus includes both resource-rich … Cited by 2 Related articles

From Semi-Automatic to Automatic Affix Extraction in Middle English Corpora: Building a Sustainable Database for Analyzing Derivational Morphology over … H Peukert – Empirical Methods in Natural Language Processing, 2012 – Abstract The annotation of large corpora is usually restricted to syntactic structure and word class. Pure lexical information and information on the structure of words are stored in specialized dictionaries (Baayen et al., 1995). Both data structures–dictionary and text … Related articles All 2 versions

The AMARA Corpus: Building Resources for Translating the Web’s Educational Content F Guzman, H Sajjad, A Abdelali… – Proceedings of the …, 2013 – Abstract In this paper, we introduce a new parallel corpus of subtitles of educational videos: the AMARA corpus for online educational content. We crawl a multilingual collection community generated subtitles, and present the results of processing the Arabic–English … Cited by 2 Related articles

Beyond SoNaR: towards the facilitation of large corpus building efforts M Reynaert, I Schuurman, V Hoste… – Proceedings of the …, 2012 – In this paper we report on the experiences gained in the recent construction of the SoNaR corpus, a 500 MW reference corpus of contemporary, written Dutch. It shows what can realistically be done within the confines of a project setting where there are limitations to … Cited by 2 Related articles All 10 versions

An automated method to build a corpus of rhetorically-classified sentences in biomedical texts H Houngbo, RE Mercer – ACL 2014, 2014 – Abstract The rhetorical classification of sentences in biomedical texts is an important task in the recognition of the components of a scientific argument. Generating supervised machine learned models to do this recognition requires corpora annotated for the rhetorical …

Building Arabic corpora from Wikisource. I Bensalem, S Chikhi, P Rosso – AICCSA, 2013 – Abstract—This paper describes a new tool that helps extracting clean text from the Arabic Wikisource dump in order to build corpora. The tool purpose is illustrated by the generation of a subcorpus from Wikisource, which is a step towards the building of an evaluation … Cited by 1 Related articles All 4 versions

Motivating College Students’ Learning English for Specific Purposes Courses through Corpus Building LF Wu – English Language Teaching, 2014 – Abstract This study was conducted to determine how to motivate technical college students to learn English for specific purposes (ESP) courses through corpus building and enhance their language proficiency during the coursework for their majors. This study explores … Related articles

Japanese Corpus Build Based on the Technology of Computer to Work and Applications N Pan, X Yu – Applied Mechanics and Materials, 2014 – Trans Tech Publ Computer work is a combination of human and computer networks, software and hardware, and other related technologies for collaborative work methods. Technical support group members working computer system work together to share resources and information in …

Opinion mining in an informative corpus: Building lexicons P Enjalbert, L Zhang, S Ferrari – Empirical Methods in Natural …, 2012 – Abstract This paper presents first steps of an ongoing work aiming at the constitution of lexicons for opinion mining. Our work is corpus-oriented, the corpus being of informative nature (related to avionic manufacturers) rather than opinion-oriented (as in current works … Cited by 4 Related articles All 3 versions

Romanian Translational Corpora: Building Comparable Corpora for Translation Studies I Ilisei, D Inkpen, G Corpas, R Mitkov – The 5th Workshop on Building …, 2012 – Abstract Building comparable corpora for the investigation of translational hypotheses is an important task within the translation studies domain. This paper describes the compilation of a translational comparable corpus for the Romanian language. The resource comprises … Related articles All 10 versions

SIGN MOTION: An Innovative Creation and Annotation Platform for Sign Language 3D-Content Corpora Building Relying on Low Cost Motion Sensors M Boulares, M Jemni – Computers Helping People with Special Needs, 2014 – Springer Abstract The manual transcription process of Sign Language is a work-intensive step which requires considerable effort to create Signs. Even, often the result of this step misses the natural aspect of motion to be conform to the natural human interpretation. In other words, …

Paraphrase Corpus Building M Vila, H Rodríguez, MA Martí – Paraphrase Scope and Typology. A Data- …, 2013 – Abstract Paraphrase corpora are an essential but scarce resource in Natural Language Processing. In this paper, we present the WRPA method, which extracts relational paraphrases from Wikipedia, and the derived WRPA paraphrase corpus. The WRPA …

Research on Web Application Technology for Building a Chinese-French Parallel Corpus of the Four Great Chinese Classical Novels CP Liu – Applied Mechanics and Materials, 2014 – Trans Tech Publ Abstract: As masterpieces in Chinese classical literature, the Four Great Chinese Classical Novels with their multilingual translations have exerted a profound influence in literature and translation studies both home and abroad. Building a Chinese-French bilingual parallel corpus of the … Related articles

Building Chinese Interlanguage Corpus: The Case of Character Error-tagged Chinese Interlanguage Corpus of Sun Yat-Sen University [J] Z Ruipeng – Applied Linguistics, 2012 – The paper reports the preliminary findings of character error-coded Chinese Interlanguage Corpus of Sun Yat-Sen University. The corpus is used as an illustration on some theoretical issues in interlanguage corpus building. The first one is the authenticity and continuity of … Cited by 2

An Italian Multimodal Corpus: The Building Process MC Caschera, A D’Ulizia, F Ferri, P Grifoni – On the Move to Meaningful …, 2014 – Springer Abstract During the design of multimodal interaction environments, the use of a corpus of multimodal sentences is very important in order to achieve various tasks of multimodal interaction. In last decade, several researchers addressed the creation of multimodal … Cited by 1

Corpus Building of Literary Lesser Rich Language-Bodo: Insights and Challenges SKSB Boro – 24th International Conference on Computational …, 2012 – ABSTRACT Collection of natural language texts in to a machine readable format for investigating various linguistic phenomenons is call a corpus. A well structured corpus can help to know how people used that language in day-to-day life and to build an intelligent … Related articles All 6 versions

On building a reusable Twitter corpus R McCreadie, I Soboroff, J Lin, C Macdonald… – Proceedings of the 35th …, 2012 – Abstract The Twitter real-time information network is the subject of research for information retrieval tasks such as real-time search. However, so far, reproducible experimentation on Twitter data has been impeded by restrictions imposed by the Twitter terms of service. In … Cited by 29 Related articles All 8 versions

Building a Corpus of Spatial Relational Expressions Extracted from Web Documents JO Wallgrün, A Klippel, T Baldwin – 2014 – ABSTRACT Spatial language, despite decades of research, still poses substantial challenges for automated systems, for instance in geographic information retrieval or human-robot interaction. We describe an approach to building a corpus of natural language …

Building a 70 billion word corpus of English from ClueWeb. J Pomikálek, M Jakubícek, P Rychlý – LREC, 2012 – Abstract This work describes the process of creation of a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as source for the corpus data. Processing such a vast amount of data presented several challenges, … Cited by 12 Related articles All 8 versions

Building a large corpus based on newspapers from the web G Andersen, K Hofland – G. Andersen (ed.), 2012 – The Norwegian Newspaper Corpus (NNC) is an initiative to create a large monitor corpus representing contemporary Norwegian language in both its written varieties, Bokmål and Nynorsk. The corpus is compiled through daily harvesting and processing of published … Cited by 11 Related articles All 2 versions

Building Large Corpora from the Web Using a New Efficient Tool Chain. R Schäfer, F Bildhauer – LREC, 2012 – Abstract Over the last decade, methods of web corpus construction and the evaluation of web corpora have been actively researched. Prominently, the WaCky initiative has provided both theoretical results and a set of web corpora for selected European languages. We … Cited by 36 Related articles All 4 versions

Towards Building a Corpus for Palestinian Dialect D Akra, M Jarrar – 2014 – … list • Developing tools to automatically annotate corpus. • build a small application that would take Palestinian Text as Input; return it in Standard Arabic as Output • analyze the results … Morphological lexicons from Morphologically Annotated Corpora. In … Related articles

Building an efficient curation workflow for the Arabidopsis literature corpus D Li, TZ Berardini, RJ Muller, E Huala – Database, 2012 – Abstract TAIR (The Arabidopsis Information Resource) is the model organism database (MOD) for Arabidopsis thaliana, a model plant with a literature corpus of about 39 000 articles in PubMed, with over 4300 new articles added in 2011. We have developed a … Cited by 8 Related articles All 9 versions

How do children acquire early grammar and build multiword utterances? a Corpus study of French children aged 2 to 4 MT Normand, I Moreno-Torres, C Parisse… – Child …, 2013 – Wiley Online Library In the last 50 years, researchers have debated over the lexical or grammatical nature of children’s early multiword utterances. Due to methodological limitations, the issue remains controversial. This corpus study explores the effect of grammatical, lexical, and pragmatic … Cited by 9 Related articles All 10 versions

Building the British Sign Language Corpus A Schembri, J Fenlon, R Rentelis, S Reynolds… – 2013 – This paper presents an overview of the British Sign Language Corpus Project—the first endeavor to create a machine-readable digital corpus of British Sign Language (BSL) collected from deaf signers across the United Kingdom. In the field of sign language … Cited by 6 Related articles All 4 versions

Building parallel corpora through social network gaming N Green – … of the Collaborative Resource Development and …, 2012 – Abstract Building training data is labor-intensive and presents a major obstacle to the advancement of Natural Language Processing (NLP) systems. A prime use of NLP technologies has been toward the construction machine translation systems. The most … Related articles All 2 versions

RA-SR: Using a ranking algorithm to automatically building resources for subjectivity analysis over annotated corpora Y Gutiérrez, A González, AF Orquín, A Montoyo… – WASSA 2013, 2013 – Abstract In this paper we propose a method that uses corpora where phrases are annotated as Positive, Negative, Objective and Neutral, to achieve new sentiment resources involving words dictionaries with their associated polarity. Our method was created to build … Cited by 2 Related articles All 6 versions

Optimizing annotation efforts to build reliable annotated corpora for training statistical models C Grouin, T Lavergne, A Névéol – LAW VIII, 2014 – Abstract Creating high-quality manual annotations on text corpus is time-consuming and often requires the work of experts. In order to explore methods for optimizing annotation efforts, we study three key time burdens of the annotation process: (i) multiple annotations, ( …

Building a Corpus for Palestinian Arabic: a Preliminary Study M Jarrar, N Habash, D Akra, N Zalmout – ANLP 2014, 2014 – Abstract This paper presents preliminary results in building an annotated corpus of the Palestinian Arabic dialect. The corpus consists of about 43K words, stemming from diverse resources. The paper discusses some linguistic facts about the Palestinian dialect, …

Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. D Goldhahn, T Eckart, U Quasthoff – LREC, 2012 – Abstract The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched with statistical information. In this paper we describe current advances of the project in collecting and processing text data automatically for a large number of … Cited by 21 Related articles All 3 versions

Building bilingual terminologies from comparable corpora: The TTC TermSuite B Daille – … on Building and Using Comparable Corpora with …, 2012 – Abstract In this paper, we exploit domain-specific comparable corpora to build bilingual terminologies. We present the monolingual term extraction and the bilingual alignment that will allow us to identify and translate high specialised terminology. We stress the huge … Cited by 5 Related articles All 13 versions

Building gold standard corpora for medical natural language processing tasks L Deleger, Q Li, T Lingren, M Kaiser… – AMIA Annual …, 2012 – Abstract We present the construction of three annotated corpora to serve as gold standards for medical natural language processing (NLP) tasks. Clinical notes from the medical record, clinical trial announcements, and FDA drug labels are annotated. We report high inter- … Cited by 13 Related articles All 5 versions

Building a german/simple german parallel corpus for automatic text simplification D Klaper, S Ebling, M Volk – Proc. of the Second Workshop on Predicting …, 2013 – Abstract In this paper we report our experiments in creating a parallel corpus using German/Simple German documents from the web. We require parallel data to build a statistical machine translation (SMT) system that translates from German into Simple … Cited by 2 Related articles All 4 versions

Building a semantically annotated corpus for congestive heart and renal failure from clinical records and the literature N Alnazzawi, P Thompson, S Ananiadou – Proceedings of the 5th …, 2014 – Abstract Narrative information in Electronic Health Records (EHRs) and literature articles contains a wealth of clinical information about treatment, diagnosis, medication and family history. This often includes detailed phenotype information for specific diseases, which in … Cited by 1 Related articles All 2 versions

Dicta-Sign–building a multilingual sign language corpus S Matthes, T Hanke, A Regen, J Storz… – … between Corpus and …, 2012 – Abstract This paper presents the multilingual corpus of four European sign languages compiled in the framework of the Dicta-Sign project. Dicta-Sign researched ways to enable communication between Deaf individuals through the development of human-computer … Cited by 4 Related articles All 2 versions

Building a diverse document leads corpus annotated with semantic relations M Hangyo, D Kawahara, S Kurohashi – Proceedings of the 26th Pacific …, 2012 – Abstract In these days, semantic analysis has been actively studied in natural language processing. For the study of semantic analysis, corpora with semantic annotations are essential. Although there are such corpora annotated on newspaper articles, there are … Cited by 6 Related articles All 5 versions

Twitter as a Comparable Corpus to build Multilingual Affective Lexicons A Fraisse, P Paroubek – The 7th Workshop on Building and …, 2014 – Abstract The main issue of any lexicon-based sentiment analysis system is the lack of affective lexicons. Such lexicons contain lists of words annotated with their affective classes. There exist some number of such resources but only for few languages and often for a … Cited by 1 Related articles

Building an English-Vietnamese Bilingual Corpus for Machine Translation QH Ngo, W Winiwarter – Asian Language Processing (IALP), …, 2012 – Abstract—Bilingual corpora are critical resources for machine translation research and development since parallel corpora contain translation equivalences of various granularities. Manual annotation of word alignments is of significance to provide a gold- … Cited by 5 Related articles All 5 versions

Building a multilingual parallel corpus for human users. A Rosen, M Vavrín – LREC, 2012 – Abstract We present the architecture and the current state of InterCorp, a multilingual parallel corpus centered around Czech, intended primarily for human users and consisting of written texts with a focus on fiction. Following an outline of its recent development and a … Cited by 5 Related articles All 3 versions

Building a bilingual dictionary from a Japanese-Chinese patent corpus K Yasuda, E Sumita – Computational Linguistics and Intelligent Text …, 2013 – Springer Abstract In this paper, we propose an automatic method to build a bilingual dictionary from a Japanese-Chinese parallel corpus. The proposed method uses character similarity between Japanese and Chinese, and a statistical machine translation (SMT) framework in a … Cited by 4 Related articles All 2 versions

Building a corpus of secondary school texts: First you have to catch the rabbit A Coxhead, R White – New Zealand Studies in Applied …, 2012 – Abstract: In an old joke about how to make rabbit stew, the punch line says, “First, you have to catch the rabbit.” In other words, making the stew is easy once this rather difficult first task is successfully completed. Catching the rabbit, in our case, should be worth the effort since … Related articles All 2 versions

Automatic Building and Using Parallel Resources for SMT from Comparable Corpora S Pal, P Pakray, SK Naskar – Proceedings of the 3rd Workshop on …, 2014 – Abstract Building parallel resources for corpus based machine translation, especially Statistical Machine Translation (SMT), from comparable corpora has recently received wide attention in the field Machine Translation research. In this paper, we propose an automatic … Related articles All 4 versions

Building Large Resources for Text Mining: The Leipzig Corpora Collection U Quasthoff, D Goldhahn, T Eckart – Text Mining, 2014 – Springer Abstract Many text mining algorithms and applications require the availability of large text corpora and certain statistics-based annotations. To ensure comparability of results a standardized corpus building process is required. Particularly noteworthy are all pre- …

Reuse of termino-ontological resources and text corpora for building a multilingual domain ontology: An application to Alzheimer’s disease K Dramé, G Diallo, F Delva, JF Dartigues… – Journal of biomedical …, 2014 – Elsevier Abstract Ontologies are useful tools for sharing and exchanging knowledge. However ontology construction is complex and often time consuming. In this paper, we present a method for building a bilingual domain ontology from textual and termino-ontological … Related articles All 4 versions

SeedLing: Building and using a seed corpus for the Human Language Project G Emerson, L Tan, S Fertmann, A Palmer… – ACL 2014, 2014 – Abstract A broad-coverage corpus such as the Human Language Project envisioned by Abney and Bird (2010) would be a powerful resource for the study of endangered languages. Existing corpora are limited in the range of languages covered, in …

Building and exploiting a corpus of dialog interactions between french speaking virtual and human agents LMR Barahona, A Lorenzo, C Gardent – The eighth international …, 2012 – Abstract: We describe the acquisition of a dialog corpus for French based on multi-task human-machine interactions in a serious game setting. We present a tool for data collection that is configurable for multiple games; describe the data collected using this tool and the … Cited by 8 Related articles All 11 versions

Building English-Vietnamese Named Entity Corpus with Aligned Bilingual News Articles QH Ngo, D Dien, W Winiwarter – COLING 2014, 2014 – Abstract Named entity recognition aims to classify words in a document into pre-defined target entity classes. It is now considered to be fundamental for many natural language processing tasks such as information retrieval, machine translation, information extraction …

Building Chinese Discourse Corpus with Connective-driven Dependency Tree Structure Y Li, W Feng, J Sun, F Kong, G Zhou – Proceedings of the 2014 …, 2014 – Abstract In this paper, we propose a Connective-driven Dependency Tree (CDT) scheme to represent the discourse rhetorical structure in Chinese language, with elementary discourse units as leaf nodes and connectives as non-leaf nodes, largely motivated by the Penn … Cited by 1

Building a Japanese Corpus of Temporal-Causal-Discourse Structures Based on SDRT for Extracting Causal Relations K Kaneko, D Bekki – EACL 2014, 2014 – Abstract This paper proposes a methodology for generating specialized Japanese data sets for the extraction of causal relations, in which temporal, causal and discourse relations at both the fact level and the epistemic level, are annotated. We applied our methodology to … Cited by 1 Related articles All 2 versions

Building a learner corpus. J Hana, A Rosen, B Stindlová, P Jäger – LREC, 2012 – Abstract The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked levels to cope with a wide range of error types present in the input. Each level corrects different types of errors; links … Cited by 6 Related articles All 8 versions

Building Very Large Corpus Containing Useful Rich Materials for Language Learning from Closed Caption TV H Mochizuki, K Shibano – World Conference on E-Learning in Corporate, …, 2014 – Abstract This paper describes the specific details of a very large spoken language corpus constructed from closed caption TV data. We collected the closed caption data from over 70,000 TV programs from January 2013 to June 2014. The total number of words in our …

Automatic Building of an Ontology from a Corpus of Text Documents Using Data Mining Tools JI Toledo-Alvarado, A Guzmán-Arenas… – Journal of applied …, 2012 – ABSTRACT In this paper we show a procedure to build automatically an ontology from a corpus of text documents without external help such as dictionaries or thesauri. The method proposed finds relevant concepts in the form of multi-words in the corpus and non- … Cited by 1 Related articles All 5 versions

Building and Modelling Multilingual Subjective Corpora M Saad, D Langlois, K Smaili – … of the Ninth International Conference on …, 2014 – Abstract Building multilingual opinionated models requires multilingual corpora annotated with opinion labels. Unfortunately, such kind of corpora are rare. We consider opinions in this work as subjective or objective. In this paper, we introduce an annotation method that … Related articles All 5 versions

Building Corpora for Figurative Language Processing: The Case of Irony Detection A Reyes, P Rosso – … the 4th International Workshop on Corpora …, 2012 – Abstract Figurative language is one of the most arduous topics that natural language processing (NLP) has to face. Unlike literal language, the former takes advantage of linguistic devices, such as metaphor, analogy, ambiguity, irony, sarcasm, and so on, in … Cited by 2 Related articles All 2 versions

Building a fine-grained subjectivity lexicon from a web corpus. I Maks, P Vossen – LREC, 2012 – Abstract In this paper we propose a method to build fine-grained subjectivity lexicons including nouns, verbs and adjectives. The method, which is applied for Dutch, is based on the comparison of word frequencies of three corpora: Wikipedia, News and News … Cited by 5 Related articles All 5 versions

Building Japanese Predicate-argument Structure Corpus using Lexical Conceptual Structure. Y Matsubayashi, Y Miyao, A Aizawa – LREC, 2012 – Abstract This paper introduces our study on creating a Japanese corpus that is annotated using semantically-motivated predicate-argument structures. We propose an annotation framework based on Lexical Conceptual Structure (LCS), where semantic roles of … Related articles All 2 versions

Building readability lexicons with unannotated corpora J Brooke, V Tsang, D Jacob, F Shein… – Proceedings of the First …, 2012 – Abstract Lexicons of word difficulty are useful for various educational applications, including readability classification and text simplification. In this work, we explore automatic creation of these lexicons using methods which go beyond simple term frequency, but without relying … Cited by 4 Related articles All 46 versions

Building a large-scale corpus for evaluating event detection on twitter AJ McMinn, Y Moshfeghi, JM Jose – Proceedings of the 22nd ACM …, 2013 – Abstract Despite the popularity of Twitter for research, there are very few publicly available corpora, and those which are available are either too small or unsuitable for tasks such as event detection. This is partially due to a number of issues associated with the creation of … Cited by 7 Related articles All 3 versions

Building a standard Amazigh corpus S Boulaknadel, FA Allah – … of the Third International Conference on …, 2013 – Springer Abstract Natural language processing is showing more interest in the Amazigh language in recent years. Suitable resources for Amazighe are becoming a vital necessity for the progress of this research. Corpora are an important resource but Amazighe lacks sufficient … Cited by 5 Related articles All 4 versions

Building a representative corpus of classical music J London – Music Perception: An Interdisciplinary Journal, 2013 – JSTOR This paper presents an object lesson in the challenges and considerations involved in assembling a musical corpus for empirical research. It develops a model for the construction of a representative corpus of classical music of the “common practice period” (1700-1900), … Cited by 1 Related articles

Building and analyzing a corpus of contextualized traces collected during a Technology Enhanced teaching module C Hajer, C Courtin, JJ Girardot – … Conference on Advanced …, 2014 – Sharing and analyzing data collected within Technology Enhanced Learning environments is an interesting issue for researchers to validate their models and systems. In this paper we present a corpus we built and analyzed in order to validate our proposed “Proxy approach …

A System for Building FrameNet-like Corpus for the Biomedical Domain H Tan – Proceedings of the 5th International Workshop on …, 2014 – Abstract Semantic Role Labeling (SRL) plays an important role in different text mining tasks. The development of SRL systems for the biomedical area is frustrated by the lack of large-scale domain specific corpora that are annotated with semantic roles. In our previous work … Related articles All 3 versions

A New Approach for Building Domain-Specific Corpus with Wikipedia XY Zhang, X Li, ZJ Ruan – Applied Mechanics and Materials, 2013 – Trans Tech Publ … Keywords: Domain-specific … Related articles

Towards Building a Corpus of Turkish Referring Expressions C Acartürk, MP Çakır – Proc. 1st Workshop on Language Resources …, 2012 – Abstract In this paper we report on the preliminary findings of our ongoing study on Turkish referring expressions used in situated dialogs. Situated dialogs of pairs of Turkish speakers were collected while they were engaged with a collaborative Tangram puzzle solving task, … Cited by 2 Related articles All 2 versions

Towards Building Arabic Corpus For Drug Information H Al-Ibrahim, HS Al-Khalifa, AM Al-Salman – Proceedings of the 6th …, 2014 – Abstract Corpora have opened up many new areas of research in the linguistic domain, which would never been possible without them. Moreover, corpora have proved their usefulness not only in the linguistic domain but also in other domains, such as medical, …

A joint approach for building a large Tibetan corpus with syntactic parsing and semantic role labeling L Qiu, C Long, X Zhao – Intelligent Networks and Intelligent …, 2012 – Abstract—Syntactic parsing and semantic role labeling have been studied in natural language processing for many years and many good research results have been obtained. According to the characteristics of Tibetan, a joint approach for syntactic parsing and … Cited by 2 Related articles All 4 versions

Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words K Almeman, M Lee – … and their Applications (ICCSPA), 2013 1st …, 2013 – Abstract—The principle objective of this work is to build multi dialect Arabic texts corpora using a web corpus as a resource. A survey has been conducted to categorise distinct words and phrases that are common to a specific dialect only, and not used in other dialects, the … Cited by 5 Related articles All 4 versions

A rule-based approach for building an artificial English-ASL corpus Z Tmar, A Othman, M Jemni – Electrical Engineering and …, 2013 – Abstract—A serious problem facing the Community for researchers in the field of sign language is the absence of a large parallel corpus for signs language. The ASLG-PC12 project proposes a rule-based approach for building big parallel corpus between English … Cited by 2 Related articles All 3 versions

Building a hierarchical annotated corpus of urdu: the URDU.KON-TB treebank Q Abbas – Computational Linguistics and Intelligent Text …, 2012 – Springer Abstract This work aims at the development of a representative treebank for the South Asian language Urdu. Urdu is a comparatively under resourced language and the development of a reliable treebank for Urdu will have significant impact on the state-of-the-art for Urdu … Cited by 5 Related articles All 5 versions

A novel approach to build Kannada web Corpus S Parameswarappa, VN Narayana… – … and Informatics (ICCCI …, 2012 – A corpus is a collection of documents in electronic, computer processable form. Open, freely and publicly available corpora can be used by all researchers as standard data sets to develop and test their systems. In particular, a text corpus is a collection of text documents. … Cited by 2 Related articles

Issues in building English-Chinese parallel corpora with WordNets F Bond, S Wang – … WordNet Conference (GWC-7), ed. by …, 2014 – Abstract We discuss some of the issues in producing sense-tagged parallel corpora: including pre-processing, adding new entries and linking. We have preliminary results for three genres: stories, essays and tourism web pages, in both Chinese and English. Cited by 2 Related articles All 3 versions

Building a Microblog Corpus for Search Result Diversification K Tao, C Hauff, GJ Houben – Information Retrieval Technology, 2013 – Springer Abstract Queries that users pose to search engines are often ambiguous-either because different users express different query intents with the same query terms or because the query is underspecified and it is unclear which aspect of a particular query the user is … Cited by 2 Related articles All 2 versions

Automated Building of Sentence-Level Parallel Corpus and Chinese-Hungarian Dictionary Z Liu – 2013 – Abstract Decades of work have been conducted on automated building of parallel corpus and bilingual dictionary in the field of natural language processing. However, rarely have any studies been done between high-density character-based languages and medium- … Related articles All 3 versions

Method of building BFS-CTC: A Chinese tagged corpus of sentential semantic structure SL Luo, YY Liu, Y Feng, L Han, G Chen… – Transactions of Beijing …, 2012 – Based on the modern Chinese semantics, a Chinese sentential semantic mode is built, and then a Chinese tagged corpus, BFS-CTC (Beijing forest studio-Chinese tagged corpus), is built according to the Chinese sentential semantic mode. There are more than ten … Related articles All 2 versions

Focused crawling for building Web comment corpora M Neunerdt, M Niermann, R Mathar… – … (CCNC), 2013 IEEE, 2013 – Abstract—Web 2.0 provides various types of social media applications, e.g., blogs, forums and news sites that allow users to post Web comments. This kind of communication plays an important role in acceptance research. To extract different opinions from such data, it is … Cited by 2 Related articles All 2 versions

Building a Corpus of Indefinite Uses Annotated with Fine-grained Semantic Functions. M Aloni, A van Cranenburgh, R Fernández, M Sznajder – LREC, 2012 – Abstract Natural languages possess a wealth of indefinite forms that typically differ in distribution and interpretation. Although formal semanticists have strived to develop precise meaning representations for different indefinite functions, to date there has hardly been … Cited by 1 Related articles All 5 versions

Mapping Rules for Building a Tunisian Dialect Lexicon and Generating Corpora R Boujelbane, ME Khemekhem… – Proceedings of the Sixth …, 2013 – Abstract Nowadays in tunisia, the arabic Tunisian Dialect (TD) has become progressively used in interviews, news and debate programs instead of Modern Standard Arabic (MSA). Thus, this gave birth to a new kind of language. Indeed, the majority of speech is no longer … Cited by 7 Related articles All 2 versions

Building wordnets by machine translation of sense tagged corpora A Oliver, S Climent – GWC 2012 6th International Global Wordnet …, 2012 – Abstract This paper describes a methodology for the construction of WordNets based on machine translation of an English sense tagged corpus. For the construction of such a corpus we used two freely available resources: the Semcor Corpus and the Princeton … Cited by 2 Related articles All 4 versions

Building, Profiling, Analysing and Publishing an Arabic News Corpus Based on Google News RSS Feeds SM Alzahrani – Information Retrieval Technology, 2013 – Springer Abstract The aim of this paper is to give a detailed and explicit design, composition and documentation of a new Arabic News Corpus (ArNeCo). We used RSS feeds from Google news as a big container of article titles, and crawled the web to extract the text. About … Related articles All 2 versions

Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs K Wołk, K Marasek – Procedia Technology, 2014 – Elsevier Abstract Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from previously obtained …

Building scholar e-communities using a semantically aware framework: Archaia Kypriaki Grammateia Digital Corpus D Pitzalis, E Christophorou… – VAST12: The 13th …, 2012 – Abstract Web-based learning communities have developed into a very popular vehicle for sharing information amongst students, researchers and enthusiastic users and are slowly gaining importance in the humanities field. Unfortunately, as data organization and … Cited by 3 Related articles All 5 versions

Gathering Public Concerns from Web Towards Building Corpus of Japanese Regional Concerns S Shiramatsu, N Hirata, RME Swezey… – … (IIAIAAI), 2012 IIAI …, 2012 – Abstract—Importance of concern assessment has been increased in Japanese regional communities. We have developed an e-Participation web platform based on a Linked Open Data set called SOCIA (Social Opinions and Concerns for Ideal Argumentation). To … Related articles All 2 versions

The Pro. Bio. Dic. (Prototype of a Bioethics Dictionary) project: Building a corpus of popular and specialized bioethics texts A Vicentini, K Grego, D Russo – JAHR, 2013 – Abstract This paper reports on an ongoing, long-term research project in the field of medical ethics and bioethics conducted by a multidisciplinary team combining medical, linguistic, IT and philosophical research interests: the Prototype of a Bioethics Dictionary (Pro. bio. dic). Related articles All 4 versions

Building and Evaluating Somali Language Corpora N Abdillahi – ACL 2014, 2014 – Abstract In this paper we outline our work to build Somali language Corpora. A read-speech corpus named Asaas and containing about 10 hours and 26 minutes of good quality signal fully transcribed and well corrected with a well-balanced phonetic distribution is presented …

The Building Builds the Church: Corpus Christi University Parish Four Years After JJ Bacik – New Theology Review, 2013 – As the pastor of Corpus Christi University Parish, I was involved in building a new church to serve the faculty and students of the University of Toledo. Our church, which received several honors, including a merit award in the Eugene Potente Liturgical Design … All 8 versions

A System for Building Corpus Annotated With Semantic Roles S Rahimi Rastgar, N Razavi – 2013 – Abstract Semantic role labelling (SRL) is a natural language processing (NLP) technique that maps sentences to semantic representations. This can be used in different NLP tasks. The goal of this master thesis is to investigate how to support the novel method proposed … Cited by 1 Related articles

The Building and Application of the Parallel Corpus of Guizhou External Publicity J ZHOU, J CHEN – Journal of Guizhou University (Social Sciences), 2013 – A Parallel Corpus of Guizhou External Publicity has been primarily built by taking use of the current resources and technology. The materials of both the Chinese and the English versions of Guizhou external publicity, such as texts of ethnic minority customs, tourism … Related articles

Building bilingual lexicon to create Dialect Tunisian corpora and adapt language model R Boujelbane, S BenAyed, LH Belguith – ACL 2013, 2013 – Abstract Since the Tunisian revolution, Tunisian Dialect (TD), used in daily life, has become progressively used and represented in interviews, news and debate programs instead of Modern Standard Arabic (MSA). This situation has important negative consequences for … Cited by 2 Related articles All 5 versions

Issues in Khinalug syntax: building on corpus evidence M Daniel – Higher School of Economics Research Paper No. WP …, 2013 – Abstract: The paper treats several issues in the syntax of Khinalug, an East Caucasian language of Northern Azerbaijan, presenting some results of the author’s ongoing research of Khinalug syntax. The analysis is based on corpus data and covers three issues that can … Related articles All 6 versions

Building a Corpus of South African English G Dwyer – 2014 – 1 Background The Dictionary Unit for South African English (DSAE) produces the South African Concise Oxford Dictionary, the authoritative reference for South African English. In order to track and analyze how South African English is being used, and to identify new …

Building semantic corpus from wordNet L Stanchev – … and Biomedicine Workshops (BIBMW), 2012 IEEE …, 2012 – Abstract—We propose a novel methodology for extracting semantic similarity knowledge from semi-structured sources, such as WordNet. Unlike existing approaches that only explore the structured information (e.g., the hypernym relationship in WordNet), we present a … Cited by 3 Related articles All 6 versions

Building diachronical reference corpora for the French language A Lavrentiev – … of the International Conference’Corpus …, 2013 –

Building a Corpus of South African English: Literature Review G Dwyer – 2014 – A corpus is a large body of classified text, from which knowledge about how language is being used can be extracted. Corpora have become essential for lexicographers, who use them to write accurate entries for dictionaries. For example, Oxford University Press (OUP, …

Splitting Complex Sentences for Natural Language Processing Applications: Building a Simplified Spanish Corpus JC Collados – Procedia-Social and Behavioral Sciences, 2013 – Elsevier Abstract This paper presents a new Spanish parallel corpus of original and syntactically simplified texts. The simplification carried out basically consists of opportunistically splitting a complex original sentence into several simple ones. This parallel corpus is envisioned as … Related articles

Corpora as a Source of Biomedical Information: Building a Technological Knowledge Base TV Vila – Procedia-Social and Behavioral Sciences, 2013 – Elsevier Abstract The versatility of linguistic corpora makes them an excellent conceptual and terminological resource. For their part, knowledge bases are rapidly gathering momentum in the field of terminology, since they allow for the creation of multidimensional conceptual … Related articles

Building An Old Occitan Corpus via Cross-Language Transfer O Scrivner – Empirical Methods in Natural Language Processing, 2012 – Abstract This paper describes the implementation of a resource-light approach, cross-language transfer, to build and annotate a historical corpus for Old Occitan. Our approach transfers morpho-syntactic and syntactic annotation from resource-rich source languages, … Cited by 2 Related articles All 4 versions

Building of Networks of Natural Hierarchies of Terms Based on Analysis of Texts Corpora D Lande – arXiv preprint arXiv:1405.6068, 2014 – Abstract: The technique of building of networks of hierarchies of terms based on the analysis of chosen text corpora is offered. The technique is based on the methodology of horizontal visibility graphs. Constructed and investigated language network, formed on the basis of … Cited by 3 Related articles All 2 versions

Building an Annotated Corpus of Late Egyptian. The Ramses Project: Review and Perspectives S Polis, AC Honnay, J Winand – … . Selected papers from the meeting of …, 2013 – Abstract:[en] This paper reviews the experience of the Ramses Project in constructing a richly annotated corpus of Late Egyptian that consists of 300 000 words in 2011 (and is expected to grow up to more than 1 million words in coming years). During the first five … Cited by 1 Related articles

Building a Lexical Corpus: A Glossary of Film Terminology M Blažek – 2012 – In my bachelor's thesis I explain the concept of a corpus and describe the process of its creation. My main focus is on the specifics of corpus building and its other properties in present-day language. I begin the theoretical part by explaining the terms corpus and corpus linguistics. I continue with their … Related articles

Electronic target-language specialised corpora in translator education: Building and searching strategies P Rodríguez-Inés – Babel, 2013 – Abstract Over the last ten years, the use of electronic corpora, both monolingual and bilingual, has become increasingly frequent in translator training, notably because they offer numerous advantages over … Cited by 1 Related articles All 2 versions

The Construction of the Building English Corpus Thought, Method and Application ??? – ????, 2012 – ???????? Abstract: With the increasing international communication, is a lot of construction enterprise began to foreign companies in development, so no matter from academic Angle or from the point of view of daily use it we are, it is necessary to establish a construction English … All 3 versions

Building Local Learner Corpora To Improve Foreign Language Teaching R Kawecki – EDULEARN12 Proceedings, 2012 – This research project was originated due to the difficulties students of the Centre for Language Learning at the University of the West Indies in Trinidad and Tobago had when writing French as a foreign language. This frustration led to the building of a Caribbean …

Building Corpus-Informed Word Lists for L2 Vocabulary Learning in Nine Languages F Charalabopouloua, M Gavrilidoua… – … , Sweden, 22-25 …, 2012 – Abstract. Lexical competence constitutes a crucial aspect in L2 learning, since building a rich repository of words is considered indispensable for successful communication. CALL practitioners have experimented with various kinds of computer-mediated glosses to … Related articles All 4 versions

Building Linguistic Corpora from Wikipedia Articles and Discussions E Margaretha, H Lüngen – … von/Edited by Michael Beißwenger, Nelleke …, 2014 – Abstract Wikipedia is a valuable resource, useful as a linguistic corpus or a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus- …

Arabic language processing for text classification: contributions to Arabic root extraction techniques, building an Arabic corpus, and to Arabic text classification … MYA Al-Nashashibi – 2012 – The impact and dynamics of Internet-based resources for Arabic-speaking users is increasing in significance, depth and breadth at highest pace than ever, and thus requires updated mechanisms for computational processing of Arabic texts. Arabic is a complex …

Building a corpus for comparative analysis of language attrition L de Almeida FERRARI – … Conference. Speech and Corpora, 2012 – Abstract The aim of this research is the study of first language attrition of Italian L1 in contact with Brazilian Portuguese. Language attrition is the gradual decline or the loss of a first or second language by an individual. This is a corpus-based study: a corpus of oral … Related articles