Corpus Creation


A corpus is a collection of written or spoken texts assembled for language research and analysis. Corpora are widely used in linguistics, computer science, and other fields to study the structure and use of language, and to develop natural language processing (NLP) systems such as chatbots and question-answering systems.

There are several steps involved in corpus creation:

  1. Selecting the texts: The first step in creating a corpus is to select the texts that will be included in the collection. The texts should be chosen based on the research questions or goals of the study, and should be representative of the language or language use being studied.
  2. Preprocessing the texts: Once the texts have been selected, they need to be preprocessed to prepare them for analysis. This may involve cleaning the texts to remove formatting, punctuation, or other elements that are not relevant to the study. It may also involve tokenizing the texts, which means dividing them into individual words or smaller units (tokens).
  3. Annotating the texts: Annotating the texts means adding information or labels that help with analysis. This may involve adding a part-of-speech tag to each word to indicate its role in the sentence, or adding semantic tags to indicate the meaning of the word.
  4. Compiling the corpus: After the texts have been preprocessed and annotated, they are compiled into a single corpus. The corpus may be stored in a database or other electronic format, and it may be made available to researchers or other users through a corpus management system.
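
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a standard implementation: the function names, the length-based selection criterion, the tiny part-of-speech lexicon, and the JSON Lines output format are all hypothetical stand-ins. A real project would use explicit sampling criteria and a trained tagger.

```python
import json
import re

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_POS_LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "barked": "VERB",
    "on": "ADP",
}


def select_texts(candidates, min_words=3):
    """Step 1: keep only texts that meet an (assumed) minimum-length criterion."""
    return [text for text in candidates if len(text.split()) >= min_words]


def preprocess(text):
    """Step 2: remove HTML-style tags, lowercase, and tokenize into words."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    # Punctuation is dropped because it never matches the word pattern.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", no_tags.lower())


def annotate(tokens):
    """Step 3: attach a part-of-speech label to each token ("X" if unknown)."""
    return [{"token": tok, "pos": TOY_POS_LEXICON.get(tok, "X")} for tok in tokens]


def compile_corpus(raw_texts):
    """Step 4: gather the processed documents into one list of records."""
    return [
        {"id": doc_id, "tokens": annotate(preprocess(text))}
        for doc_id, text in enumerate(select_texts(raw_texts))
    ]


def to_jsonl(corpus):
    """Serialize the corpus as JSON Lines, one document per line."""
    return "\n".join(json.dumps(doc) for doc in corpus)


raw = ["<p>The cat sat on the mat.</p>", "Hi!", "A dog barked."]
corpus = compile_corpus(raw)
print(to_jsonl(corpus))
```

Here the second text is rejected at the selection step because it is too short, so the compiled corpus holds two documents. Storing one JSON record per line is only one possible choice; a database or a dedicated corpus management system would serve the same purpose.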

Corpus creation is a multi-step process that combines text selection, preprocessing, annotation, and compilation. It is an important part of language research and NLP development: a well-built corpus provides a large, representative sample of language data that can be used to study language structure and use, and to build and evaluate NLP systems.


See also:

Best Corpus Linguistics Videos | Corpus Annotation Tools | Corpus Building | Corpus Workbench | UAM CorpusTool

Corpus creation for new genres: A crowdsourced approach to PP attachment M Jha, J Andreas, K Thadani, S Rosenthal… – Proceedings of the …, 2010 – Abstract This paper explores the task of building an accurate prepositional phrase attachment corpus for new genres while avoiding a large investment in terms of time and money by crowd-sourcing judgments. We develop and present a system to extract … Cited by 9 Related articles All 13 versions

ANC2Go: A Web Application for Customized Corpus Creation. N Ide, K Suderman, B Simms – LREC, 2010 – Abstract We describe a web application called “ANC2Go” that enables the user to select data from the Open American National Corpus (OANC) and the Manually Annotated Sub- corpus (MASC) together with some or all of the annotations available. The user also may … Cited by 8 Related articles All 8 versions

The Italian Sign Language sign bank: Using wordnet for sign language corpus creation P Prinetto, U Shoaib, G Tiotto – … and Information Technology ( …, 2011 – Abstract—Sign languages are visual-gestural languages used by deaf people to communicate with others. As with spoken languages, Sign languages vary among countries and have their own vocabulary and grammar. Since they suffer of an extreme variability … Cited by 6 Related articles All 2 versions

A web based platform for sign language corpus creation D Barberis, N Garazzino, E Piccolo, P Prinetto… – … helping people with …, 2010 – Springer Abstract This paper presents the design and implementation issues of a tool for the annotation of sign language based on speech recognition. It is at his first version of stable development and it was designed within the Automatic Translation into Sign Languages ( … Cited by 4 Related articles All 4 versions

From archive to corpus: transcription and annotation in the creation of signed language corpora T Johnston – International journal of corpus linguistics, 2010 – Abstract: Annotations are an important resource in corpus-based linguistic research. In fact, the most important feature of a modern signed language corpus should be that it has been annotated rather than simply transcribed. Digital multi-media annotation software can now … Cited by 36 Related articles All 7 versions

Divide and conquer: Crowdsourcing the creation of cross-lingual textual entailment corpora M Negri, L Bentivogli, Y Mehdad… – Proceedings of the …, 2011 – Abstract We address the creation of cross-lingual textual entailment corpora by means of crowd-sourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to … Cited by 44 Related articles All 9 versions

Automatic creation of a reference corpus for political opinion mining in user-generated content L Sarmento, P Carvalho, MJ Silva… – Proceedings of the 1st …, 2009 – Abstract We propose and evaluate a method for automatically creating a reference corpus for training text classification procedures for mining political opinions in user-generated content. The process starts by compiling a collection of highly opinionated comments … Cited by 41 Related articles All 9 versions

Using Mechanical Turk to create a corpus of Arabic summaries M El-Haj, U Kruschwitz, C Fox – Proceedings of the Seventh …, 2010 – Abstract This paper describes the creation of a human-generated corpus of extractive Arabic summaries of a selection of Wikipedia and Arabic newspaper articles using Mechanical Turk— an online workforce. The purpose of this exercise was two-fold. First, it addresses a … Cited by 19 Related articles All 15 versions

Emotional speech corpus creation, structure, distribution and re-use B Vaughan, C Cullen – 2009 – Abstract This paper details the on-going creation of a natural emotional speech corpus, its structure, distribution, and re-use. Using Mood Induction Procedures (MIPs), high quality emotional speech assets are obtained, analysed, tagged (for acoustic features), annotated … Cited by 2 Related articles All 3 versions

Time-efficient creation of an accurate sentence fusion corpus K McKeown, S Rosenthal, K Thadani… – … Technologies: The 2010 …, 2010 – Abstract Sentence fusion enables summarization and question-answering systems to produce output by combining fully formed phrases from different sentences. Yet there is little data that can be used to develop and evaluate fusion techniques. In this paper, we … Cited by 10 Related articles All 9 versions

Creation and analysis of a reading comprehension exercise corpus: Towards evaluating meaning in context N Ott, R Ziai, D Meurers – Multilingual Corpora and Multilingual Corpus …, 2012 – Abstract We discuss the collection and analysis of a cross-sectional and longitudinal learner corpus consisting of answers to reading comprehension questions written by adult second language learners of German. We motivate the need for such task-based learner corpora … Cited by 12 Related articles All 6 versions

The ORD Speech Corpus of Russian Everyday Communication “One Speaker’s Day”: Creation Principles and Annotation A Asinovsky, N Bogdanova, M Rusakova… – Text, Speech and …, 2009 – Springer Abstract The main aim of the ORD speech corpus is to fix Russian spontaneous speech in natural communicative situations. The corpus presents the unique linguistic material, allowing to perform fundamental research in many scientific aspects and to solve different … Cited by 6 Related articles All 6 versions

Topic based creation of a persian-english comparable corpus Z Rahimi, A Shakery – Information Retrieval Technology, 2011 – Springer Abstract One of the most important issues in cross language information retrieval (CLIR) is where to obtain the translation knowledge. Multilingual corpora are valuable resources for this purpose, but few studies have been done on constructing multilingual corpora in … Cited by 6 Related articles All 5 versions

Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text AJ Yepes, É Prieur-Gaston, A Névéol – BMC bioinformatics, 2013 – Background Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information … Cited by 7 Related articles All 11 versions

Festvox: Tools for Creation and Analyses of Large Speech Corpora GK Anumanchipalli, K Prahallad… – Workshop on Very Large …, 2011 – Abstract This paper summarises the tools provided within Festvox [1], a freely available software suite for creation and analyses of large scale speech corpora for enabling research, development and instruction in speech technologies. Index Terms: Speech … Cited by 5 Related articles All 9 versions

Combining confidence score and mal-rule filters for automatic creation of bangla error corpus: grammar checker perspective B Kundu, S Chakraborti, SK Choudhury – Computational Linguistics and …, 2012 – Springer Abstract This paper describes a novel approach for automatic creation of Bangla error corpus for training and evaluation of grammar checker systems. The procedure begins with automatic creation of large number of erroneous sentences from a set of grammatically … Cited by 1 Related articles All 3 versions

A first approach to the creation of a Spanish corpus of dyslexic texts L Rello, R Baeza-Yates, H Saggion… – LREC Workshop Natural …, 2012 – Abstract Corpora of dyslexic texts are valuable for studying dyslexia and addressing accessibility practices, among others. However, due to the difficulty of finding texts written by dyslexics, these kind of resources are scarce. In this paper, we introduce a small Spanish … Cited by 4 Related articles All 6 versions

Creation and Analysis of a Corpus of Text Rich Indian TV Videos T Chattopadhyay, S Sengupta, A Sinha… – Document Analysis …, 2011 – Abstract—A lot of research is now going on to extract the context of the show to provide additional information related to the TV show. One major method to extract the context from TV is to recognize the texts from the videos which is also known as video Optical … Cited by 4 Related articles All 5 versions

[BOOK] Exploring Newspaper Language: Using the web to create and investigate a large corpus of modern Norwegian G Andersen – 2012 – This book describes new methodological and technological approaches to corpus building and presents recent research based on the Norwegian Newspaper Corpus. This is a large monitor corpus of contemporary Norwegian language, compiled through daily harvesting … Cited by 4 Related articles All 2 versions

The creation of a corpus of English metalanguage S Wilson – Proceedings of the 50th Annual Meeting of the …, 2012 – Abstract Metalanguage is an essential linguistic mechanism which allows us to communicate explicit information about language itself. However, it has been underexamined in research in language technologies, to the detriment of the performance … Cited by 4 Related articles All 6 versions

Corpus-based approaches for the creation of a frequency based vocabulary list in the EU project KELLY–issues on reliability, validity and coverage S Johansson Kokkinakis, E Volodina – eLex, 10-12 November 2011, …, 2011 – At present there are relatively few vocabulary lists for Swedish describing modern vocabulary as well as being adapted to language learners’ needs. In Europe including Sweden there exist approaches to unify ways of working consistently with language … Cited by 3 Related articles

The creation of free linguistic corpora from the web M Brunello – Web as Corpus Workshop (WAC5), 2009 – The creation of free linguistic corpora from the web Marco Brunello Università degli Studi di Padova Palazzo Maldura, via Beato Pellegrino, 1, Padova brunez@ email. it Abstract This paper shows how it’s possible to build free corpora from the web using documents re- leased under … Cited by 3 Related articles All 3 versions

A corpus-based analysis of two crucial steps in Business Management research articles: The creation of a research space and the statement of limitations P Mur-Dueñas – Nordic Journal of English Studies, 2012 – Abstract English has been established as the Language for Research Publication Purposes in many disciplinary fields. Many scholars worldwide, therefore, face linguistic and rhetorical difficulties when writing their academic texts for publication in an L2. This paper focuses … Cited by 4 Related articles All 7 versions

Rapid creation of large-scale corpora and frequency dictionaries. A Zséder, G Recski, D Varga, A Kornai – LREC, 2012 – Abstract We describe, and make public, large-scale language resources and the toolchain used in their creation, for fifteen medium density European languages: Catalan, Czech, Croatian, Danish, Dutch, Finnish, Lithuanian, Norwegian, Polish, Portuguese, Romanian, … Cited by 3 Related articles All 9 versions

Towards creation of a corpus for argumentation mining the biomedical genetics research literature NL Green – ACL 2014, 2014 – Abstract Argumentation mining involves automatically identifying the premises, conclusion, and type of each argument as well as relationships between pairs of arguments in a document. We describe our plan to create a corpus from the biomedical genetics research … Cited by 2

Creation of a corpus for evidence based medicine summarisation D Mollá, ME Santiago-Martínez – The Australasian medical journal, 2012 – Background Automated text summarisers that find the best clinical evidence reported in collections of medical literature are of potential benefit for the practice of Evidence Based Medicine (EBM). Research and development of text summarisers for EBM, however, is … Cited by 1 Related articles All 8 versions

From corpus to lexicon: the creation of ID-glosses for the Corpus NGT O Crasborn, A de Meijer – … : Interactions between Corpus and …, 2012 – Abstract When glossing of the Corpus NGT started in 2007, there was no lexicon at our disposal to base ID-glosses on. Semantic labels were used without ensuring a constant relationship between sign form and gloss. This is currently being repaired by creating a … Cited by 2 Related articles All 6 versions

Building bilingual lexicon to create Dialect Tunisian corpora and adapt language model R Boujelbane, S BenAyed, LH Belguith – ACL 2013, 2013 – Abstract Since the Tunisian revolution, Tunisian Dialect (TD) used in daily life, has became progressively used and represented in interviews, news and debate programs instead of Modern Standard Arabic (MSA). This situation has important negative consequences for … Cited by 2 Related articles All 5 versions

Typing Race Games as a Method to Create Spelling Error Corpora. P Rodrigues, CA Rytting – LREC, 2012 – Abstract This paper presents a method to elicit spelling error corpora using an online typing race game. After being tested for their native language, English-native participants were instructed to retype stimuli as quickly and as accurately as they could. The participants … Cited by 1 Related articles All 2 versions

On the creation of a learner corpus for the purpose of error analysis JM Rogers – Journal oflnquiry and Research, 2012 – Abstract Learners with similar backgrounds have a tendency to make the same types of errors in L2 production. Such errors can be viewed as having the potential to inform pedagogical methodologies, in that they shed light onto which features of the L2 are the … Cited by 1 Related articles All 2 versions

Using Corpus and Web Language Data to Create EAP Teaching Materials D Oakey – … Literacies and Language: Multimodality and Literacy …, 2011 – The use of corpus data to promote English language learning has a distinguished pedigree at the University of Birmingham (UK). In the 1980s, the COBUILD project (Sinclair, 1987), the latejohn Sinclair’s outstanding contribution to the ?eld of lexicography, designed and … Cited by 3 Related articles

Design, creation, and analysis of Czech corpora for structural metadata extraction from speech J Kolá? – Language resources and evaluation, 2011 – Springer Abstract Structural metadata extraction (MDE) research aims to develop techniques for automatic conversion of raw speech recognition output to forms that are more useful to humans and downstream automatic processes. The MDE annotation includes inserting … Cited by 1 Related articles All 8 versions

A framework for creation of telephone, cellular and VoIP speech corpus S Das, A Mandal, KR Prasanna Kumar… – … held jointly with …, 2013 – Abstract—State of the art techniques used to automate speech processing for different applications are data-driven. The statistical methods derive robust models from large speech data characterizing variations of speech, channels and background. The increased … Related articles

Effective Corpora Creation For Sentiment Analysis P KONCZ, J PARALI? – Abstract. Sentiment analysis is currently a popular research area which methods are usually divided into two main types. Both of them, methods based on machine learning as well as dictionary based methods, are dependent on manually annotated corpora. These corpora … Related articles

Technologies and tools for corpus creation, normalization and annotation P Prokopidis, V Papavassiliou, P Pavel, L Rimel… – 2014 – The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of monolingual and bilingual language resources (LRs) required in the PANACEA context. Therefore, the CAA subsystem includes: i) a Corpus Acquisition … Related articles All 7 versions

Research on corpus creation and development of oracle bone inscriptions F Gao, Y Liu, J Xiong – Artificial Intelligence, Management …, 2011 – Abstract—Based on the research background analysis, this paper clarifies the importance of natural language processing for domain documents. After analyzing the specialty of domain corpus, this paper discusses the idea and principle of domain corpus creation in a deep … Related articles

Corpus creation using blogs: investigating [be done X] in Canadian English JAJ Hinnell – WebCorp by its use of the Google Blogs API, gathering a robust, representative corpus was challenging. In particular to tease apart semantic differences inherent in Canadian [bdX] and Canadian and American [bdwX], a very fine-grained set of searches was essential. This … Related articles

Testing and its implications in lexicography: A case study of corpus creation in Dhopadhola P Oketcho – 2011 – “Testing and its implications in lexicography: a case study of corpus creation in Dhopadhola” is a lexicographical study that demonstrates how a generated language corpus can be used to create more reliable, verifiable and linguistically justifiable dictionaries in largely …

Automatic Face Corpus Creation. L Lenc, P Král – ICAART (2), 2013 – Abstract: This paper deals with the automatic real-world face corpus creation. The main contribution consists in proposition and evaluation of the automatic face corpus creation algorithm. Next, we statistically analysed the structure of the created face corpus when the … Related articles

AnnoTool: crowdsourcing for natural language corpus creation KKM Hayden – 2013 – This thesis explores the extent to which untrained annotators can create annotated corpora of scientific texts. Currently the variety and quantity of annotated corpora are limited by the expense of hiring or training annotators. The expense for finding and hiring professionals …

Corpus Creation and Perceptual Evaluation of Expressive Theatrical Gestures P Carreno-Medrano, S Gibet, C Larboulette… – Intelligent Virtual …, 2014 – Springer Abstract While human communication involves rich, complex and expressive gestures, available corpora of captured motions used for the animation of virtual characters contain actions ranging from locomotion to everyday life motions. We aim at creating a novel …

The Basic Methods in Sign Language Corpus Creation [J] LIHWU Ling – Chinese Journal of Special Education, 2013 – Sign language corpus, a practice and a research result of the study of sign language linguistics, has become an independent discipline. It is characterized by representative samples and machine-read language materials. Based on this, this article introduces the …

Semi-automatic bilingual corpus creation with zero entropy alignments A Laukaitis, O Vasilecas, R Laukaitis, D Plikynas – Informatica, 2011 – IOS Press In this paper, we describe a model for aligning books and documents from bilingual corpus with a goal to create “perfectly” aligned bilingual corpus on word-to-word level. Presented algorithms differ from existing algorithms in consideration of the presence of human … Related articles All 5 versions

An Empirical Assessment of Contemporary Online Media in Ad-Hoc Corpus Creation for Social Events K Narang, S Nagar, S Mehta, LV Subramaniam, K Dey – Abstract Social networking sites such as Facebook and Twitter have become favorite portals for users to discuss and express opinions. Research shows that topical discussions around events tend to evolve socially on microblogs. However, sources like Twitter have no … Related articles All 2 versions

… Speaking Activities Provide Useful Data for a Learner Corpus? An Evaluation of Reflective Transcription Tasks for Learners as a Means of Corpus Creation. S JEACO – Large numbers of Chinese learners study university degrees in English and they form a substantial proportion of the international student body in English-speaking countries. Within China, there has also been an increase in international cooperative universities and … Related articles

A High Performance Optoelectronic Machine for Automated Arabic-English Parallel Corpus Creation and for Text Mining Processing. SSA Ghoniemy, OH Karam – International Journal of Computer & Electrical …, 2014 – Abstract—In this paper a parallel optoelectronic computer architecture is proposed for large- scale parallel corpus, full text search and text mining applications while achieving high speed and high performance and utilizing the parallel processing nature of optics. An … Related articles All 2 versions

Automatic Dialog Act Corpus Creation from Web Pages. P Král, C Cerisara – ICEIS (5), 2010 – Abstract: This work presents two complementary tools dedicated to the task of textual corpus creation for linguistic researches. The chosen application domain is automatic dialog acts recognition, but the proposed tools might also be applied to any other research area that … Related articles All 5 versions

Research on the Creation of Small-Scale English-Chinese Parallel Corpus for Manufacturing Systems SB Bao – Applied Mechanics and Materials, 2013 – Trans Tech Publ Abstract. English, which is specially used in the field of manufacturing systems, belongs to ESP (English for specific purposes). In order to improve the effect of ESP education in China, it is very necessary to create an English-Chinese parallel corpus for aiding ESP teaching and … Related articles

Semi-automatic creation of a reference news corpus for fine-grained multi-label scenarios J Teixeira, L Sarmento, E Oliveira – Information Systems and …, 2011 – Abstract-In this paper we tackle the problem of creating a reference corpus for the classification of news items in fine grained multi-label scenarios. These scenarios are particularly challenging for text classification techniques, and the availability of reference … Related articles All 3 versions

Creation and Analysis of a Reading Comprehension Exercise Corpus: Towards Evaluating Meaning in Context D Meurers, N Ott, R Ziai – Abstract We discuss the collection and analysis of a longitudinal learner corpus consisting of answers to reading comprehension questions written by adult second language learners of German. We motivate the need for such task-based learner corpora and identify the … Related articles

Creation of L3 Multilingual Corpora of Taiwanese Learners HC Lu, S Huang, SC Hsieh, GZ Liu – The studies of learner corpus have recently drawn special attention, both in the fields of language learning and corpus linguistics. Following the trend and under the project entitled “The Construction & Research of Multilingual Corpora” of NCKU project of Promoting … Related articles

Creation Of The Linguistic Corpus Of The Ukrainian Terminology In Library Science NV Veretennikova, NE Kunanets – SCIENCE AND WORLD, 2013 –

A Corpus-based Analysis of “Create” and “Produce” SF Chung – ????????, 2011 – Abstract This paper examines the synonymous pair ‘create’and ‘produce’in English and suggests that their similarities and differences can be elucidated based on the types of products denoted by their objects. PRODUCTS, as part of the eventive information of … Related articles All 2 versions

The Compilation and Creation of San Guo Yan Yi Chinese-English Parallel Corpus ???? ??? – ??????, 2013 – Abstract:@@@@ This paper, based …

Creation of a bottom-up corpus-based ontology for Italian Linguistics. E Bianchi, M Tavosanis, E Giovannetti – LREC, 2012 – Abstract This paper describes the steps of construction of a shallow lexical ontology of Italian Linguistics in Italian, set to be used by a metasearch engine for query refinement. The ontology was constructed with the software Protégé 4.0. 2 and encoded in OWL format; its … Related articles All 2 versions

Software for Creation of Sintactico-Statistical Russian Language Model Based on the Text Corpus IS Kipyatkova – Trudy SPIIRAN, 2013 – Abstract: Creation of the language model is one of the stages of training of a continuous speech recognition system. In the paper, the developed software for creation of syntactic- statistical Russian language model based on a text corpus is described. The main stages … Cited by 1

The Tenth-Century Cyrillic Manuscript Codex Suprasliensis: the creation of an electronic corpus UNESCO project (2010–2011) HM Eckhoff, DJ Birnbaum, A Miltenova… – … Technologies for Digital …, 2011 – Abstract This paper presents an overview of principles and problems connected with the preparation of an electronic edition of the largest Old Church Slavonic manuscript, the Codex Suprasliensis, in the context of a project funded by UNESCO. Specifications of the … Related articles All 4 versions

A Corpus-based Analysis of “Create” and “Produce” ??? – 2011 – This paper examines the synonymous pair ‘create’and ‘produce’in English and suggests that their similarities and differences can be elucidated based on the types of products denoted by their objects. PRODUCTS, as part of the eventive information of ACTIVITY, are … All 3 versions

Exploring Newspaper Language. Using the web to create and investigate a large corpus of modern Norwegian T Nordgård – Norsk Lingvistisk Tidsskrift, 2013 – Utgangspunktet er prosjektet Norsk Aviskorpus (Norwegian Newspaper Corpus, NNC). Som navnet tilsier, består korpuset av norske avistekster som er samlet inn fra 1998 til i dag. Boken inneholder artikler som beskriver korpuset, og eksempler på hvordan det kan … All 2 versions

Creation of Marathi speech corpus for automatic speech recognition S Gaikwad, B Gawali, S Mehrotra – Oriental COCOSDA held …, 2013 – Abstract This paper describes the collection of audio corpus for Marathi language. Marathi is one of the regional Indian languages. There are 12 vowels and 36 consonants present in Marathi languages. The objective of the research is to create the speech corpus which can … Related articles

Creation of a Public Corpus of Contact-Less Acquired Latent Fingerprints without Privacy Implications M Hildebrandt, J Sturm, J Dittmann… – … and Multimedia Security, 2013 – Springer Abstract Data sets of biometric or forensic samples are an important basis for evaluations and research. Especially biometric data is considered as personal data, which is protected by privacy regulations. Since the data cannot be altered or revoked, at least in some … Related articles All 2 versions

The Creation of the Estonian Emotional Speech Corpus and the Perception of Emotions R Altrov – 2014 – Väitekirja eesmärk oli luua Eesti emotsionaalse kõne korpuse teoreetiline alus ja kontrollida loodud korpuse materjali põhjal teoreetiliste seisukohtade õigsust. Uurimus näitas, kui oluline on korpust enne selle loomist hoolikalt planeerida ja tulemust analüüsida. Saadud … All 2 versions

A corpus based approach for the automatic creation of arabic broken plural dictionaries SR El-Beltagy, A Rafea – Computational Linguistics and Intelligent Text …, 2013 – Springer Abstract Research has shown that Arabic broken plurals constitute approximately 10% of the content of Arabic texts. Detecting Arabic broken plurals and mapping them to their singular forms is a task that can greatly affect the performance of information retrieval, annotation … Related articles All 4 versions

Characterization of Corpora from Enterprise Technology Creation for Retrieval and Mining V Deolalikar – Data Mining Workshops (ICDMW), 2013 IEEE …, 2013 – Abstract—Enterprise information management (EIM) deals with the demands upon enterprise unstructured information placed by applications such as eDiscovery, compliance, information lifecycle management, etc. Each of these applications poses a unique … Related articles All 4 versions

The creation of large-scaled annotated corpora of minority languages using UniParser and the EANC platform T Arkhangelskiy, O Belyaev, A Vydrin – Foresight, 2013 – This paper is devoted to the use of two tools for creating morphologically annotated linguistic corpora: UniParser and the EANC platform. The EANC platform is the database and search framework originally developed for the Eastern Armenian National Corpus (www. eanc. …

Creation and analysis of a reading comprehension exercise corpus N Ott, R Ziai, D Meurers – … Corpora and Multilingual Corpus …, 2012 – We discuss the collection and analysis of a cross-sectional and longitudinal learner corpus consisting of answers to reading comprehension questions written by adult second language learners of German. We motivate the need for such task-based learner corpora … Related articles All 2 versions

Evaluation of a gambling-related wordlist by a population of casino gamblers and a non-gambler control group to create a standardized corpus A Gay, T Sigaud, A Grosselin, C Massoubre – European Psychiatry, 2014 – Elsevier Methods 247 casino gamblers, recruited at the casino, and 127 control subjects, matched for age, sex and educational level, scored 118 gambling-related words based on 3 criteria: gambling association’s level, emotional valence and familiarity. Gambling behaviour was …

Using Collections and Worksets in Large-Scale Corpora: Preliminary Findings from the Workset Creation for Scholarly Analysis Project HE Green, K Fenlon, M Senseney, S Bhattacharyya… – 2014 – Abstract: Scholars from numerous disciplines rely on collections of texts to support research activities. On this diverse and interdisciplinary frontier of digital scholarship, libraries and information institutions must 1) prepare to support research using large collections of … Related articles

Time-Efficient Creation of an Accurate Sentence Fusion Corpus K Thadani, S Rosenthal, K McKeown, C Moore – 2010 – Abstract: Sentence fusion enables summarization and question-answering systems to produce output by combining fully formed phrases from different sentences. Yet there is little data that can be used to develop and evaluate fusion techniques. In this paper, we … Related articles

Term-creation strategies used by Ndebele translators in Zimbabwe in the health sector: a corpus-based approach K Ndhlovu – Stellenbosch Papers in Linguistics Plus, 2014 – In the scientific arena, many African languages face the challenge of a lack of terminology. That is, translators who translate from developed Western languages into African languages often encounter a lack of adequate terminology in their efforts to communicate between …

Generic Tagging Strategy Using a Semio-Contextual Approach to the Corpus for the Creation of Controlled Databases L Verlaet – Competitive Intelligence and Decision Problems, 2011 – Wiley Online Library Research carried out in competitive intelligence covers a wide variety of fields of investigation: strategic watching, mastery of information, data and system protection, and so on. We are particularly interested in information mastery, that is, knowledge management …

SIGN MOTION: An Innovative Creation and Annotation Platform for Sign Language 3D-Content Corpora Building Relying on Low Cost Motion Sensors M Boulares, M Jemni – Computers Helping People with Special Needs, 2014 – Springer Abstract The manual transcription process of Sign Language is a work-intensive step which requires considerable effort to create Signs. Even, often the result of this step misses the natural aspect of motion to be conform to the natural human interpretation. In other words, …

Creation of Texts in all Functional Styles for the Macedonian GRALIS-Corpora (GRALIS-MAK) and theirs copyrights D Poposki, E Bojkovska, B Tosovic, A Wonisch – 2011 – The linguistic corpus GRALIS-MAK does not tackle the problem of authorship which is adjusted for the e-publishing and therefore its content is closed for the global research community. This article gives a short overview of the Creative Commons licenses which …

TWORPUS–An Easy-to-Use Tool for the Creation of Tailored Twitter Corpora A Bazo, M Burghardt, C Wolff – … Processing and Knowledge in the Web, 2013 – Springer Abstract In this paper we present Tworpus, an easy-to-use tool for the creation of tailored Twitter corpora. Tworpus allows scholars to create corpora without having to know about the Twitter Application Programming Interface (API) and related technical aspects. At the …

Information technologies for corpus studies: underpinnings for cross-linguistic database creation NV Buntman, AA Zalizniak, IM Zatsman… – Informatika i Ee …, 2014 – Abstract: Information technology for creation of cross-linguistic databases of Russian texts with French translations (also known as parallel texts) is considered. The underlying principles of the developed database provide a unique combination of three types of …

Coding for Demographic Categories in the Creation of Legacy Corpora: Asian American Ethnic Identities L Hall-Lew, AW Wong – Language and Linguistics Compass, 2014 – Wiley Online Library Abstract A set of shared coding conventions for speaker ethnicity is necessary for open-source data sharing and cross-study compatibility between linguistic corpora. However, ethnicity, like many other aspects of speaker identity, is continually negotiated and …

Automatic creation of WordNets from parallel corpora A Oliver, S Climent – Abstract In this paper we present the evaluation results for the creation of WordNets for five languages (Spanish, French, German, Italian and Portuguese) using an approach based on parallel corpora. We have used three very large parallel corpora for our experiments: DGT …

Automatic Creation of Arabic Named Entity Annotated Corpus Using Wikipedia M Althobaiti, U Kruschwitz, M Poesio – EACL 2014, 2014 – Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 106–115, Gothenburg, Sweden, April 26-30 2014. ©2014 Association for Computational Linguistics …

Corpus-driven Creation of a Reliable Learner’s Vocabulary for Classical Chinese T Schalmey – ????????, 2013 – The quality of teaching materials for Classical Chinese often lags behind that of teaching materials available for modern vernacular Chinese, more widely studied ancient languages like Latin and also behind current developments in didactics. At the same time, the …

Speech Corpora as Facilities of Creation and Storage of Exemplary Speech Signals AM Prodeus – Research Bulletin of NTUU “Kyiv Polytechnic Institute”, 2013 – Speech corpora are an important constituent of modern investigators’ toolkit in such areas as speech correction, designing and testing elements of telecommunication systems and systems of automatic speech recognition. In this paper, we search for elements of …

Creation of annotated Tamil handwritten word corpus for OHR B Nethravathi, CP Archana, K Shashikiran, R AG – Abstract: Annotated datasets form a critical aspect in the development of robust technology for handwriting recognition and can be used for comparing results of different techniques used by various research groups. This paper describes the efforts at MILE lab, IISc, to …

A corpus-based study of the discursive creation of a child consumer identity in official tourist information websites vs. opinion forums R Dolón – Dialogicity in Written Specialised Genres, 2014 – Starting from an understanding of forums in terms of dialogic action games, as put forward by Weigand (see eg Weigand 2008, 2009, 2010), I look into the dialogic behaviour that unfolds in forums as an action game that conforms to a specific cultural unit. Largely …

Problems of Creation of the All-Turkic National Corpus G Doszhan – Proceedings of the 2013 International Conference …, 2013 – The paper presents the results of research on the theoretical and practical issues of the creation of national corpus of the Turkic world. This paper consists of four parts and conclusion. The first section is devoted to a theoretical analysis of a problem, the second section describes a …

Best practices in the design, creation and dissemination of speech corpora at The Language Archive S Drude, D Broeder, P Wittenburg… – … for Speech Corpora in …, 2012 – Abstract In the last 15 years, the Technical Group (now: “The Language Archive”, TLA) at the Max Planck Institute for Psycholinguistics (MPI) has been engaged in building corpora of natural speech and making them available for further research. The MPI has set standards …

Best practices in the creation, archiving and dissemination of speech corpora at the Language Archive S Drude, P Trilsbeek, H Sloetjes… – … for Spoken Corpora in …, 2014 – The amount of (digital) data created and used every day worldwide is increasing exponentially. Most of it is transient and of value only for very few people; such data, trillions of individual files, are mostly located on hard drives in personal computers or in places in …

Create a Retirement Corpus to be Proud Of – If you live 20-25 years after retirement, this means so many years of expenses and no income, unless you are a professional like doctor or lawyer. After retirement, your expenses would come down only by about 20%-30% because, while some expenses would be cut, …

How Grammaticalization Processes Create Grammar: From Historical Corpus Data To Agent-Based Models L Steels, F Van de Velde, R van Trijp… – Workshops – Program 09:00 Welcome & Introduction (Luc Steels); Session 1: Finding …

Merging data, the essence of creation of multi-layer corpora F Zipser, I HU-Berlin, J Schmolling – … 2. Compare data and find their common base 3. Keep common base and merge different layers 4. Search / display / store multi-layer corpus. Florian Zipser, HU-Berlin IDSL; Mario Frank, University of Potsdam IICS …