25.08.2013 Views

PDF (Online Text) - EURAC


2.1 Unicode and Characters: Uppercasing, Lowercasing and Sorting

Contrary to a common-sense understanding of Unicode, the conceptual design of Unicode avoids the notion of character, since this is a language-specific notion, and languages are not covered by Unicode. Unicode refers instead to code elements, which frequently coincide with characters, but also include combining characters such as diacritics. Characters and code elements differ further when ligatures or digraphs (Dutch ‘ij’, Spanish ‘ll’, ‘ch’, Belorussian Lacinka ‘dz’) are to be treated as one character in a language. Uppercasing of such units is thus essentially undefined: without knowing the requirements of the writing system, an implementation will uniformly turn ‘xy’ into either ‘Xy’ or ‘XY’. Specifying the character set of a writing system and describing the mappings between its characters (e.g., for uppercasing and lowercasing) is therefore one principal task in XNLRDF, just as lowercasing, for example, is an important step in the normalisation of a string (e.g., for a lexicon lookup or information retrieval).
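The uppercasing problem can be illustrated with a minimal Python sketch. The table `UNIT_DIGRAPHS`, the function `capitalize_word` and the writing-system codes are hypothetical names chosen for this illustration; they are not part of XNLRDF or of Unicode, and the table holds only a single assumed entry for Dutch.

```python
# Hypothetical table (an assumption for this sketch): digraphs that a
# writing system capitalises as one unit at the start of a word.
UNIT_DIGRAPHS = {
    "nl": ("ij",),  # Dutch: 'ijs' becomes 'IJs', not 'Ijs'
}


def capitalize_word(word: str, writing_system: str) -> str:
    """Capitalise the start of `word`, treating listed digraphs as one character."""
    lowered = word.lower()
    for digraph in UNIT_DIGRAPHS.get(writing_system, ()):
        if lowered.startswith(digraph):
            # The whole digraph is uppercased ('xy' -> 'XY').
            return digraph.upper() + lowered[len(digraph):]
    # Default behaviour: only the first code element is uppercased ('xy' -> 'Xy').
    return lowered[:1].upper() + lowered[1:]


print(capitalize_word("ijsselmeer", "nl"))  # IJsselmeer
print(capitalize_word("ijsselmeer", "en"))  # Ijsselmeer
print(capitalize_word("llama", "es"))       # Llama
```

The same input string yields different results depending on the writing system passed in, which is exactly the information a script-level operation lacks.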

Similarly, the sorting of characters, the second operation defined in Unicode (e.g., for the purpose of presenting dictionary entries or creating indices), depends on the writing system, and can only be approximately defined on the basis of the script. Thus, Unicode might successfully sort ‘a’ before ‘b’, but whether ‘á’ is placed after ‘a’ or after ‘z’ is already specific to each writing system. Another example is the Spanish ‘ll’. Although it is no longer considered a character, it keeps its specific position between ‘l’ and ‘m’ in a traditionally sorted list. Thus, sorting requires basic writing system-specific information, which XNLRDF sets out to describe. What this example also shows is that the definition of collating sequences for the characters of a writing system is independent of the status of the character (base character, composed character, contracted character, contracted non-character, context-sensitive character, foreign character, swap character, etc.).
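A writing system-specific collating sequence can be sketched in Python as follows. The list `TRADITIONAL_SPANISH` and the helper `make_sort_key` are assumptions made for this illustration, not data or functions taken from XNLRDF; the list models the traditional Spanish ordering in which ‘ch’ and ‘ll’ are contracted units, and it omits accented vowels for brevity.

```python
# Assumed collating sequence for traditional Spanish ordering:
# 'ch' sorts after 'c', and 'll' sorts between 'l' and 'm'.
TRADITIONAL_SPANISH = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]


def make_sort_key(sequence):
    """Build a sort key that ranks words by the given collating sequence."""
    rank = {unit: position for position, unit in enumerate(sequence)}
    longest = max(len(unit) for unit in sequence)

    def key(word):
        word = word.lower()
        ranks = []
        i = 0
        while i < len(word):
            # Prefer the longest matching unit, so 'll' wins over 'l' + 'l'.
            for size in range(longest, 0, -1):
                unit = word[i:i + size]
                if unit in rank:
                    ranks.append(rank[unit])
                    i += len(unit)
                    break
            else:
                # Unknown code elements sort after all known units.
                ranks.append(len(sequence) + ord(word[i]))
                i += 1
        return ranks

    return key


words = ["luz", "llama", "lobo", "mano"]
print(sorted(words))                                          # ['llama', 'lobo', 'luz', 'mano']
print(sorted(words, key=make_sort_key(TRADITIONAL_SPANISH)))  # ['lobo', 'luz', 'llama', 'mano']
```

The first call sorts by code point only and places ‘llama’ before ‘lobo’; the second applies the assumed collating sequence and places it after ‘luz’. The contracted unit is ranked in the sequence regardless of whether the writing system counts it as a character.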

2.2 Linguistic Information: What Else?

The operations covered by Unicode are limited, and most NLP applications require much more linguistic knowledge when processing documents in potentially unknown writing systems. First, an application might need to identify the encoding (e.g., KOI8-R), the script (Cyrillic), the language (Russian), the standard (civil script), and the orthography (after 1917) of a document. Part of this information might be drawn from the metadata available in the document, from the Unicode range, or from the URL of a document (in our example, http://xyz.xyz.ru), but filling in the remaining gaps (e.g., mapping from the encoding KOI8-R to the language Russian, from the language to potential scripts, or from a script to a language) requires background information about the legacy encodings and writing systems. This background information is
