18.07.2013 Views

The Corpus Thread - Det Danske Sprog- og Litteraturselskab

The Corpus Thread - Det Danske Sprog- og Litteraturselskab

The Corpus Thread - Det Danske Sprog- og Litteraturselskab

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

5.3. Pre-tokenizer:pretokenize 115<br />

groups) on the basis of TEI-P5-WP2 in case they need more sophisticated<br />

or customized tokenization, see .<br />

A key issue when implementing a tokenizer is which characters should<br />

be considered punctuation (i.e. non-word characters). Punctuation characters<br />

should be characterized by having no semantic impact on the words<br />

in the text. In other words, if a punctuation character is removed from the<br />

text, this should in no way change the meaning of the words in the text (denotational,<br />

connotational, or otherwise). If, for example, currency symbols,<br />

degree symbols and so on were to be deleted, it would cause a loss of meaning.<br />

For this reason such characters are not considered punctuation.<br />

5.3.1.1 List of punctuation characters<br />

On the basis of discussions in the WP2 project group, which agreed upon using<br />

the punctuation characters listed in Wikipedia, cf.http://da.wikipedia.org/wiki/Tegns<br />

andhttp://en.wikipedia.org/wiki/Punctuation, the list of punctuation<br />

characters used in the pre-tokenizer is limited to the following:<br />

Table 5.1: Punctuation characters<br />

Character Description Code Variants<br />

’ apostrophe 0027<br />

( bracket, round,<br />

opening<br />

) bracket, round,<br />

closing<br />

[ bracket, square,<br />

opening<br />

] bracket, square,<br />

closing<br />

{ bracket, curly,<br />

opening<br />

0028<br />

0029<br />

005B<br />

005D<br />

007B<br />

} bracket, curly, closing 007D<br />

: colon 003A<br />

, comma 002C<br />

- dash/hyphen 002D 2013 en dash: –<br />

2014 em dash: —<br />

/ slash 002F 005C backslash: \<br />

Table continues on next page. . .

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!