The Corpus Thread - Det Danske Sprog- og Litteraturselskab
Text metadata<br />
Outline of this chapter<br />
This chapter describes how the metadata part of text items can be expressed<br />
by means of a TEI P5 header, whereas Chapter 4 describes the text<br />
part proper. One major aim of the header design described in this chapter<br />
is to integrate header information from text items in existing corpora of<br />
Danish, i.e. the Corpus of the Danish Dictionary, PAROLE-DK,<br />
KORPUS 2000 and other corpus-relevant material from DOT/DSL, as well as<br />
the LGP and LSP corpora of written Danish that are being compiled as part of<br />
DK-CLARIN.<br />
3.1 Concepts<br />
3.2 Header structure<br />
3.2.1 The file description<br />
3.2.2 The encoding description<br />
3.2.3 The profile description<br />
3.2.4 The revision description<br />
3.3 Filling in the header<br />
3.3.1 Full header template<br />
3.3.2 Value sets for header standard information<br />
3.3.3 Additional value sets for text classification<br />
Guide to reading this chapter<br />
The structure of the header is modelled on the one used by the BNC
(Burnard 2007) and PAROLE-DK (Keson 1998b), but tries to avoid both
idiosyncrasies not covered by TEI P5 and modifications of the TEI header
schema.
Section 3.1 summarizes some corpus linguistic concepts used throughout<br />
the DK-CLARIN project, which are described in further detail in Chapter<br />
1.<br />
Section 3.2 gives a general account of the structure of the headers
of text items to be included in the Corpus Text Bank (CTB, see footnote 2). The
description of the CTB header structure takes as its starting point the one
given in Burnard (2007), by which it is strongly inspired. This section
constitutes the major part of this chapter.
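As a rough orientation for Section 3.2, the four header parts named in the outline correspond to the TEI P5 elements `fileDesc`, `encodingDesc`, `profileDesc` and `revisionDesc` inside `teiHeader`. The sketch below builds an empty skeleton of that shape with Python's standard library. It is an illustration only: the function name `build_header_skeleton`, the placeholder `<p>` elements and the sample title are assumptions made here, and the skeleton is neither the CTB header template of Section 3.3.1 nor validated against the TEI schema.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_header_skeleton(title):
    """Return a bare teiHeader element with its four main parts.

    Hypothetical helper for illustration; it does not reproduce the
    CTB header template described in this chapter.
    """
    ET.register_namespace("", TEI_NS)

    def q(name):
        # Qualify a local element name with the TEI namespace.
        return "{%s}%s" % (TEI_NS, name)

    header = ET.Element(q("teiHeader"))

    # 3.2.1 The file description: title, publication and source statements.
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    for stmt in ("publicationStmt", "sourceDesc"):
        # Placeholder paragraphs (an assumption); real headers carry details.
        ET.SubElement(ET.SubElement(file_desc, q(stmt)), q("p"))

    # 3.2.2 to 3.2.4: encoding, profile and revision descriptions, left empty.
    for part in ("encodingDesc", "profileDesc", "revisionDesc"):
        ET.SubElement(header, q(part))

    return header


if __name__ == "__main__":
    skeleton = build_header_skeleton("Sample text item")
    print(ET.tostring(skeleton, encoding="unicode"))
```

Running the script prints the skeleton as a single line of XML; the three empty descriptions are where the information covered in Sections 3.2.2 to 3.2.4 would go.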
2 The CTB is a repository of written texts that are candidates for inclusion in a
linguistic corpus. The CTB has been developed by WP 2.1 in order to better process and
organize potential corpus text material, see III. It should not be confused with the general
DK-CLARIN repository developed in WP 5, which is intended to support various data types
(e.g. texts, images, lexicons) and various formats.