18.07.2013 Views

The Corpus Thread - Det Danske Sprog- og Litteraturselskab


Text metadata

Outline of this chapter

This chapter describes how the metadata part of text items can be expressed by means of a TEI P5 header, whereas Chapter 4 describes the text part proper. One major aim of the header design described in this chapter is to integrate header information from text items in existing corpora of the Danish language, i.e. the Corpus of the Danish Dictionary and PAROLE-DK, KORPUS 2000, other corpus-relevant material from DOT/DSL, as well as the LGP and LSP corpora of written Danish, which are compiled as part of DK-CLARIN.

3.1 Concepts
3.2 Header structure
    3.2.1 The file description
    3.2.2 The encoding description
    3.2.3 The profile description
    3.2.4 The revision description
3.3 Filling in the header
    3.3.1 Full header template
    3.3.2 Value sets for header standard information
    3.3.3 Additional value sets for text classification

Guide to reading this chapter<br />

The structure of the header is modelled on the headers used by the BNC (Burnard 2007) and PAROLE-DK (Keson 1998b), but it tries to avoid idiosyncrasies not covered by TEI P5 as well as modifications of the TEI header schema.

Section 3.1 summarizes some corpus linguistic concepts used throughout the DK-CLARIN project, which are described in further detail in Chapter 1.

Section 3.2 gives a general account of the header structure of text items to be included in the Corpus Text Bank, CTB (see footnote 2). The description of the CTB header structure takes as its starting point the one given in Burnard (2007), by which it is strongly inspired. This section constitutes the major part of this chapter.
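The four header parts named in the outline for Section 3.2 correspond to the standard TEI P5 elements `fileDesc`, `encodingDesc`, `profileDesc` and `revisionDesc`. As a rough illustration only, the following Python sketch builds an empty header skeleton in that order. The function name `build_header_skeleton` and the placeholder texts are assumptions made for this example; the sketch does not reproduce the actual CTB header template, and the result is not validated against the TEI schema.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def q(tag: str) -> str:
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_header_skeleton(title: str) -> ET.Element:
    """Build a bare teiHeader containing the four main parts (hypothetical helper)."""
    header = ET.Element(q("teiHeader"))

    # File description: TEI P5 expects a title statement, a publication
    # statement and a source description inside it.
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    publication_stmt = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication_stmt, q("p")).text = "placeholder"  # assumption
    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source_desc, q("p")).text = "placeholder"  # assumption

    # The remaining parts are left empty here; a real header fills them in.
    ET.SubElement(header, q("encodingDesc"))
    ET.SubElement(header, q("profileDesc"))
    ET.SubElement(header, q("revisionDesc"))
    return header


if __name__ == "__main__":
    # Serialize with TEI as the default namespace.
    ET.register_namespace("", TEI_NS)
    skeleton = build_header_skeleton("Example text item")
    print(ET.tostring(skeleton, encoding="unicode"))
```

Building the tree programmatically rather than from a string template keeps the order of the four parts explicit, which matters because TEI P5 expects the file description first and the revision description last.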

2 The CTB is a repository of written texts that are candidates for inclusion in a linguistic corpus. The CTB has been developed by WP 2.1 in order to better process and organize potential corpus text material, see III. It must not be confused with the general DK-CLARIN repository developed in WP 5, which is intended to support various data types (e.g. texts, images, lexicons) and various formats.
