Language Technology for Cultural Heritage (2011) Sporleder & Zervanou eds.
Contents
Foreword by Willard McCarty . . . v
References . . . xii
Language Technology for Cultural Heritage, Social Sciences and Humanities: Chances and Challenges . . . xxi
Caroline Sporleder, Antal van den Bosch and Kalliopi Zervanou
1 From Quill and Paper to Digital Knowledge Access and Discovery . . . xxi
2 Mutual Benefits . . . xxii
3 Challenges . . . xxv
4 This Volume . . . xxvii
References . . . xxxi
Part I Pre-Processing
Strategies for Reducing and Correcting OCR Errors . . . 3
Martin Volk, Lenz Furrer and Rico Sennrich
1 Introduction . . . 4
2 The Text+Berg Project . . . 5
2.1 Language Identification . . . 7
2.2 Further Annotation . . . 8
2.3 Aims and Current Status . . . 8
3 Scanning and OCR . . . 9
3.1 Enlarging the OCR Lexicon . . . 9
3.2 Post-correcting OCR Errors . . . 10
4 Evaluation . . . 15
4.1 Evaluation Setup . . . 15
4.2 Evaluation Results . . . 16
5 Related Work . . . 19
6 Conclusion . . . 20
References . . . 21
Alignment between Text Images and their Transcripts for Handwritten Documents . . . 23
Alejandro H. Toselli, Verónica Romero and Enrique Vidal
1 Introduction . . . 24
2 HMM-based HTR and Viterbi Alignment . . . 26
2.1 HMM HTR Basics . . . 26
2.2 Viterbi Alignment . . . 28
2.3 Word and Line Alignments . . . 29
3 Overview of the Alignment Prototype . . . 29
4 Alignment Evaluation Metrics. . . 30
5 Experiments. . . 32
5.1 Corpus Description . . . 32
5.2 Experiments and Results . . . 33
6 Remarks, Conclusions and Future Work . . . 35
References . . . 36
Part II Adapting NLP Tools to Older Language Varieties
A Diachronic Computational Lexical Resource for 800 Years of Swedish . . . 41
Lars Borin and Markus Forsberg
1 Introduction . . . 42
2 Lexical Resources for Present-Day Swedish . . . 44
2.1 SALDO . . . 44
2.2 Swedish FrameNet++ . . . 46
3 A Lexical Resource for 19th Century Swedish . . . 47
4 A Lexical Resource for Old Swedish . . . 48
4.1 Developing a Computational Morphology for Old Swedish . . . 51
4.2 The Computational Treatment of Variation in Old Swedish . . . 56
4.3 Linking the Old Swedish Lexical Resource to SALDO . . . 58
5 Summary and Conclusions . . . 58
References . . . 59
Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change . . . 63
Eiríkur Rögnvaldsson and Sigrún Helgadóttir
1 Introduction . . . 63
2 Tagging Modern Icelandic . . . 64
2.1 The Tagset . . . 64
2.2 Training the Tagger . . . 65
3 Tagging Old Icelandic Texts . . . 66
3.1 Old vs. Modern Icelandic . . . 67
3.2 The Old Icelandic Corpus . . . 67
3.3 Training the Tagger on the Old Icelandic Corpus . . . 68
4 Tagged Texts in Syntactic Research . . . 70
4.1 Object Shift . . . 71
4.2 Passive . . . 73
5 Conclusion . . . 74
References . . . 75
Part III Linguistic Resources for CH/SSH
The Ancient Greek and Latin Dependency Treebanks . . . 79
David Bamman and Gregory Crane
1 Introduction . . . 79
2 Treebanks . . . 80
3 Building the Ancient Greek and Latin Dependency Treebanks . . . 81
4 Ancient Greek Dependency Treebank . . . 83
5 Latin Dependency Treebank . . . 84
6 The Influence of a Digital Library . . . 84
6.1 Structure . . . 86
6.2 Reading Support . . . 88
7 The Impact of Historical Treebanks . . . 90
7.1 Lemmatized Searching . . . 91
7.2 Morphosyntactic Searching . . . 91
7.3 Lexicography . . . 92
7.4 Discovering Textual Similarity . . . 94
8 Conclusion . . . 95
References . . . 96
A Parallel Greek-Bulgarian Corpus: A Digital Resource of the Shared Cultural Heritage . . . 99
Voula Giouli, Kiril Simov and Petya Osenova
1 Introduction . . . 100
2 Background . . . 100
3 The Bilingual Greek–Bulgarian Literary and Folklore Corpus: Selection and Description . . . 101
3.1 Corpus Specifications . . . 101
3.2 Collection Description . . . 102
3.3 Metadata Descriptions . . . 103
4 Text Annotation and Processing . . . 104
4.1 The Greek Pipeline . . . 105
4.2 NLP Suite for Bulgarian . . . 106
4.3 Sentence Alignment . . . 108
5 Tools Customization and Metadata Harmonization . . . 108
6 Bilingual Glossaries . . . 109
7 Content Management . . . 110
8 Conclusions . . . 111
References . . . 111
Part IV Personalisation
Authoring Semantic and Linguistic Knowledge for the Dynamic Generation of Personalized Descriptions . . . 115
Stasinos Konstantopoulos, Vangelis Karkaletsis, Dimitrios Vogiatzis and Dimitris Bilidas
1 Introduction . . . 115
2 Authoring Domain Ontologies . . . 117
3 Description Adaptation . . . 119
3.1 Personalization and Personality . . . 119
3.2 Representation and Interoperability. . . 121
4 Adaptive Natural Language Generation . . . 122
4.1 Document Planning . . . 122
4.2 Micro-Planning . . . 123
4.3 Surface Realization . . . 125
5 Intelligent Authoring Support . . . 126
5.1 Profile Completion . . . 126
5.2 Interaction Log Mining . . . 128
6 Related Work . . . 129
7 Conclusion . . . 129
References . . . 131
Part V Structural and Narrative Analysis
Automatic Pragmatic Text Segmentation of Historical Letters . . . 135
Iris Hendrickx, Michel Généreux and Rita Marquilhas
1 Introduction . . . 135
2 Corpus of Historical Letters. . . 137
2.1 Annotated Data Set . . . 139
3 Experimental Setup . . . 141
4 Text Segmentation . . . 143
4.1 Classifying EachWord . . . 145
4.2 Segment Production (Smoothing) . . . 146
5 Semantic Tagging . . . 148
6 Conclusions . . . 150
References . . . 152
Proppian Content Descriptors in an Integrated Annotation Schema for Fairy Tales . . . 155
Thierry Declerck, Antonia Scheidel and Piroska Lendvai
1 Introduction . . . 156
2 Summary of Propp’s Analysis . . . 156
3 Preprocessing Propp . . . 159
3.1 Relaxing the “Fairy Tale Grammar” . . . 159
3.2 Functions and Moves . . . 160
4 Functions and Frames. . . 160
4.1 Proppian “Frames” and FrameNet . . . 160
4.2 APftML Frame Elements . . . 161
4.3 Functional Annotation . . . 163
5 Fairy Tale Characters . . . 165
5.1 Characters vs. Dramatis Personae . . . 166
6 Temporal and Spatial Structure . . . 167
7 Dialogue and Narration . . . 168
8 Conclusion . . . 169
References . . . 169
Adapting NLP Tools and Frame-Semantic Resources for the Semantic Analysis of Ritual Descriptions . . . 171
Nils Reiter, Oliver Hellwig, Anette Frank, Irina Gossmann, Borayin Maitreya Larios, Julio Rodrigues and Britta Zeller
1 Introduction . . . 171
2 Computational Linguistics for Ritual Structure Research . . . 173
2.1 Project Research Plan . . . 173
2.2 Related Work . . . 174
3 Ritual Descriptions . . . 174
3.1 Textual Sources . . . 175
3.2 Text Characteristics . . . 175
4 Automatic Linguistic Processing . . . 177
4.1 Tokenizing . . . 177
4.2 Part of Speech Tagging and Chunking . . . 177
4.3 Anaphora and Coreference Resolution . . . 180
5 Semantic Annotation of Ritual Descriptions . . . 184
5.1 Adaptation of Existing Resources . . . 185
6 Detecting Ritual Structure . . . 188
7 Future Work and Conclusions . . . 190
7.1 FutureWork . . . 190
7.2 Conclusions . . . 190
References . . . 191
Part VI Data Management, Visualisation and Retrieval
Information Retrieval and Visualization for the Historical Domain . . . 197
Yevgeni Berzak, Michal Richter, Carsten Ehrler and Todd Shore
1 Introduction . . . 197
2 Background . . . 198
3 Information Extraction from a Historical Collection . . . 199
3.1 Dataset . . . 199
3.2 Extraction of Named Entities . . . 200
3.3 Aliasing . . . 200
4 Visualization of Document Similarities . . . 202
4.1 Similarity measurement . . . 202
4.2 Visualization of similarities . . . 203
5 Graphical User Interface . . . 204
6 The Benefit for Historical Research . . . 207
7 Conclusion and Outlook. . . 209
7.1 Topic Models . . . 209
7.2 Clustering and Layouting . . . 210
7.3 Evaluation . . . 210
7.4 Adaptation to Other Domains . . . 211
References . . . 211
IntegratingWiki Systems, Natural Language Processing, and Semantic Technologies for Cultural Heritage Data Management . . . 213
René Witte, Thomas Kappler, Ralf Krestel, and Peter C. Lockemann
1 Introduction . . . 213
2 User Groups and Requirements . . . 214
2.1 User Groups . . . 214
2.2 Detected Requirements . . . 215
3 Related Work . . . 216
4 Semantic Heritage Data Management . . . 217
4.1 Architectural Overview. . . 217
4.2 Source Material . . . 219
4.3 Digitization and Error Correction . . . 219
4.4 Format Transformation and Wiki Upload . . . 220
4.5 Integrating Natural Language Processing . . . 223
4.6 Semantic Extensions . . . 225
5 Summary and Conclusions . . . 229
References . . . 229