The Corpus Thread - Det Danske Sprog- og Litteraturselskab

More documents

Recommendations

Info

1.1. Aim of the project 4 Outline 1.1 Aim of the project . . . . . . . . . . . . . . . . . . . . . . 4 1.1.1 Reference corpus . . . . . . . . . . . . . . . . . . 4 1.1.2 CMRS framework . . . . . . . . . . . . . . . . . . 5 1.2 Project tasks and documentation outline . . . . . . . . . 7 1.3 Text collection, text bank, corpus . . . . . . . . . . . . . . 12 1.3.1 Text collection/archive/repository . . . . . . . . 12 1.3.2 Text bank . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.3 Corpus . . . . . . . . . . . . . . . . . . . . . . . . 13 This chapter gives an overview of DK-CLARIN work package 2.1 Reference corpus of general language as a whole and sketches major concepts of this work package and corpus projects carried out by DSL (and DSN) in general. 1.1 Aim of the project The aim of the project 1 is twofold: Corpus: Gather a reference corpus of general Danish according to certain design principles. To achieve this goal, it is necessary to develop a framework for managing the construction process of corpora. This leads to the second aim: Framework: Establish a Corpus Management and Retrieval System (CMRS) as a framework for building and analyzing corpora. The result of this is an in-house (DSL/DSN) collection of methods, tools, and means of structured data storage, together with this comprehensive documentation. The CMRS concept established in WP 2.1 aims at being transferable to other corpus management and retrieval scenarios. 1.1.1 Reference corpus The reference corpus of general Danish that is gathered as part of this DK- CLARIN project (WP 2.1) comprises 45 million running tokens and is textually mixed, although – as a matter of rather limited resources – texts from periodicals like newspapers and magazines are preferred as they are more straightforward to process especially if the come in some kind of XML format as is the case for texts from one of DSL’s main sources, Infomedia. 2 1 Throughout the documentation the project refers to DK-CLARIN work package 2.1, i.e. building a reference corpus of general language. 2 http://infomedia.dk/
1.1. Aim of the project 5 Metadata The texts of the corpus are provided with metadata given as XML-formatted headers. These ensure that texts can be filtered according to different textual parameters. However, the level of detail is restricted to the general information attached to the material when it is received from the text supplier. Markup The tokens of the material are tagged with information on lemma forms, part-of-speech, and inflectional markers. Accessibility A copy of the corpus is included in the DK-CLARIN repository of resources and tools. It is made publicly accessible through a webbased concordancer as well. 1.1.2 CMRS framework As the Corpus Management and Retrieval System is an essential prerequisite for the accomplishment of the project – its character being that of a “corpus factory” ––, we will start with an overview of this system and give a brief description of its components. Figure 1.1 shows the compositional structure or “architecture” of the CMRS. The figure follows largely the Fundamental Modeling Concepts framework, FMC. However, some minor modifications have been made to a few symbols of the notation framework, in particular, the symbol for read/write access to a storage (two rounded unidirectional arrows in FMC) has been replaced by a strong white bidirectional arrow that consumes less space and can be applied more flexibly. FMC’s channel concept (usually denoted by a small circle) has been replaced by a somewhat broader concept of interfaces through which communication between compound components working together as a system and other systems or users outside take place. Figure 1.2 shows the types of interfaces used in this notation. 3 As illustrated in Figure 1.1, the major data repository of the CMRS is the text bank that holds four different types of data: a collection of the texts themselves, a collection of metadata describing the general characteristics of each individual text, a collection of annotations providing various linguistic information on text token level, and finally supplier details providing information on contacts, text deliverances, and agreements. The text bank is fed with textual material from text suppliers through import handlers, that is, transducers that convert the various text formats into the specific one(s) needed in the CMRS. Texts may be manually complemented with metadata through a metadata editor, the same goes for supplier details. Annotating the text material with linguistic information is performed by the corpus processing unit that utilizes linguistic data drawn from external resources and processed by lexical transducers. The corpus processing 3 See www.fmc-modeling.org for a quick introduction and pointers to further reading.
Page 1 and 2: The Corpus Thread “Reference corp
Page 3: Chapter 1 Aim and concepts Sketchin
Page 7 and 8: 1.2. Project tasks and documentatio
Page 13 and 14: 1.3. Text collection, text bank, co
Page 15 and 16: Chapter 2 The text bank A platform
Page 17 and 18: 2.2. Implementation 17 and metadata
Page 19 and 20: Chapter 3 Text metadata What the he
Page 21 and 22: 3.1. Concepts 21 Section 3.3 starts
Page 23 and 24: 3.2. Header structure 23 (file des
Page 25 and 26: 3.2. Header structure 25 great verb
Page 27 and 28: 3.2. Header structure 27 The follow
Page 29 and 30: 3.2. Header structure 29 deprecated
Page 31 and 32: 3.2. Header structure 31 the text i
Page 33 and 34: 3.2. Header structure 33 Another el
Page 35 and 36: 3.2. Header structure 35 appDesc
Page 37 and 38: 3.2. Header structure 37 informal p
Page 39 and 40: 3.2. Header structure 39 3.2.3.4 Te
Page 41 and 42: 3.3. Filling in the header 41 revi
Page 43 and 44: 3.3. Filling in the header 43 samp
Page 45 and 46: 3.3. Filling in the header 45 and 9
Page 47 and 48: 3.3. Filling in the header 47 descr
Page 49 and 50: 3.3. Filling in the header 49 ⊲ a
Page 51 and 52: 3.3. Filling in the header 51 Legal
Page 53 and 54: 3.3. Filling in the header 53 ⊲ a
Page 55 and 56:
3.3. Filling in the header 55 Legal
Page 57 and 58:
Page 59 and 60:
Page 61 and 62:
3.3. Filling in the header 61 Value
Page 63 and 64:
Page 65 and 66:
3.3. Filling in the header 65 ⊲ p
Page 67 and 68:
3.3. Filling in the header 67 Prope
Page 69 and 70:
Page 71 and 72:
Page 73 and 74:
Page 75 and 76:
Page 77 and 78:
Page 79 and 80:
Page 81 and 82:
Page 83 and 84:
3.3. Filling in the header 83 3.3.3
Page 85 and 86:
4.1. Basic considerations 85 Outlin
Page 87 and 88:
4.2. Formatting text 87 russiske hi
Page 89 and 90:
4.2. Formatting text 89 Mirganjan
Page 91 and 92:
4.2. Formatting text 91 udvikle ude
Page 93 and 94:
4.2. Formatting text 93 4.2.3.4 Add
Page 95 and 96:
4.2. Formatting text 95 i dag . T
Page 97 and 98:
Part III Collecting 97
Page 99 and 100:
5.1. Implementation 99 5.5.2 Implem
Page 101 and 102:
5.1. Implementation 101 [request c
Page 103 and 104:
5.2. Header constructor:make-header
Page 105 and 106:
Page 107 and 108:
Page 109 and 110:
Page 111 and 112:
Page 113 and 114:
Page 115 and 116:
5.3. Pre-tokenizer:pretokenize 115
Page 117 and 118:
5.3. Pre-tokenizer:pretokenize 117
Page 119 and 120:
5.4. Text id registry:register-text
Page 121 and 122:
5.4. Text id registry:register-text
Page 123 and 124:
5.5. id dispatcher:make-id 123 0 11
Page 125 and 126:
Part IV Markup 125
Page 127 and 128:
6.1. Requirements 127 Outline of th
Page 129 and 130:
6.2. Survey 129 1. IMS’s TreeTagg
Page 131 and 132:
6.2. Survey 131 Code: Java. Assessm
Page 133 and 134:
6.2. Survey 133 9. alias-i’s Ling
Page 135 and 136:
6.3. Case study 135 lemmatization c
Page 137 and 138:
6.4. Further reading 137 disambigua
Page 139 and 140:
7.1. Modifications of the PAROLE Co
Page 141 and 142:
7.2. The ePOS tag set for Danish 14
Page 143 and 144:
Page 145 and 146:
Page 147 and 148:
Page 149 and 150:
Page 151 and 152:
Page 153 and 154:
Chapter 8 The full-form lexicon Sam
Page 155 and 156:
8.1. Enhancing existing material 15
Page 157 and 158:
8.1. Enhancing existing material 15
Page 159 and 160:
Part V Deployment 159
Page 161 and 162:
9.2. Text material 161 Type Source
Page 163 and 164:
10.2. Funktionaliteter 163 2. The u
Page 165 and 166:
10.2. Funktionaliteter 165 fortolke
Page 167 and 168:
10.2. Funktionaliteter 167 teknolog
Page 169 and 170:
10.2. Funktionaliteter 169 Implemen
Page 171 and 172:
10.2. Funktionaliteter 171 forbinde
Page 173 and 174:
10.2. Funktionaliteter 173 10.2.9 K
Page 175 and 176:
Bibliography Andersen, M. S., Asmus
show all

The Corpus Thread - Det Danske Sprog- og Litteraturselskab

Create successful ePaper yourself

Delete template?

Save as template?