PDF (Online Text) - EURAC
2.1 Unicode and Characters: Uppercasing, Lowercasing and Sorting<br />
Contrary to a common-sense understanding of Unicode, the conceptual design of<br />
Unicode avoids the notion of character, since this is a language-specific notion, and<br />
languages are not covered by Unicode. Unicode refers instead to code elements (),<br />
which frequently coincide with characters, but also contain combining characters<br />
such as diacritics. Characters and code elements further differ if ligatures (Dutch ‘ij’,<br />
Spanish ‘ll’, ‘ch’, Belorussian Lacinka ‘dz’) are to be treated as one character in a<br />
language. Uppercasing of ligatures is thus essentially undefined: without knowledge of<br />
the requirements of the writing system, it will uniformly produce either ‘Xy’ or ‘XY’<br />
from ‘xy’. It is thus obvious that specifying the character set of writing systems and<br />
describing the mapping between the characters (e.g., for uppercasing and lowercasing)<br />
is one principal task in XNLRDF, just as lowercasing, for example, is an important step<br />
in the normalisation of a string (e.g., for a lexicon lookup or information retrieval).<br />
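The dependence of uppercasing on the writing system can be illustrated with a minimal Python sketch. The table `TITLECASE_UNITS`, the function `capitalize`, and the identifiers `"nl"` and `"es"` are assumptions made for this illustration only; they are not taken from XNLRDF or from Unicode.

```python
# Illustrative table (an assumption, not XNLRDF data): for each writing
# system, the multi-letter units treated as one character, mapped to
# their capitalised form.
TITLECASE_UNITS = {
    "nl": {"ij": "IJ"},              # Dutch follows the 'XY' pattern
    "es": {"ll": "Ll", "ch": "Ch"},  # Spanish follows the 'Xy' pattern
}


def capitalize(word: str, writing_system: str) -> str:
    """Capitalise the first character of `word`, where a 'character'
    may be a multi-letter unit listed for the given writing system."""
    lowered = word.lower()
    for unit, capitalised in TITLECASE_UNITS.get(writing_system, {}).items():
        if lowered.startswith(unit):
            return capitalised + lowered[len(unit):]
    # No writing-system-specific unit applies: capitalise one code element.
    return lowered[:1].upper() + lowered[1:]


print("ijsland".capitalize())       # built-in, writing-system-blind: Ijsland
print(capitalize("ijsland", "nl"))  # IJsland
print(capitalize("chile", "es"))    # Chile
```

The built-in method applies a single uniform rule, whereas the sketch needs one table entry per writing system, which is the kind of mapping the text argues has to be specified explicitly.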
Similarly, the sorting of characters, the second operation defined in Unicode (e.g.,<br />
for the purpose of presenting dictionary entries or creating indices), depends on the<br />
writing system, and can only be approximately defined on the basis of the script.<br />
Thus, Unicode might successfully sort ‘a’ before ‘b’, but already the position of ‘á’<br />
after ‘a’ or after ‘z’ is specific to each writing system. Another example is the Spanish<br />
‘ll’. Although it is no longer considered a character, it maintains its specific position<br />
between ‘l’ and ‘m’ in a sorted list. Thus, sorting requires basic writing system-specific<br />
information, which XNLRDF sets out to describe. What this example also shows is<br />
that the definition of collating sequences for the characters of a writing system is<br />
independent of the status of the character (base character, composed character,<br />
contracted character, contracted non-character, context-sensitive character, foreign<br />
character, swap character, etc.).<br />
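The ‘ll’ example can be sketched in Python as follows. The collating sequence below is an assumption built only from the description above (the basic Latin letters with ‘ll’ placed between ‘l’ and ‘m’); the names `SEQUENCE`, `RANK` and `collation_key` are hypothetical, and a real implementation would draw the sequence from a writing system description.

```python
import string

# Assumed collating sequence: the basic Latin letters with the
# contraction 'll' placed between 'l' and 'm', as described above.
SEQUENCE = list(string.ascii_lowercase)
SEQUENCE.insert(SEQUENCE.index("l") + 1, "ll")
RANK = {unit: position for position, unit in enumerate(SEQUENCE)}


def collation_key(word: str) -> list[int]:
    """Split `word` into collation units, preferring the two-letter
    contraction, and return the rank of each unit."""
    lowered = word.lower()
    key, i = [], 0
    while i < len(lowered):
        pair = lowered[i:i + 2]
        unit = pair if len(pair) == 2 and pair in RANK else lowered[i]
        key.append(RANK.get(unit, len(RANK)))  # unknown units sort last
        i += len(unit)
    return key


words = ["luz", "llama", "lobo", "mano"]
print(sorted(words))                     # ['llama', 'lobo', 'luz', 'mano']
print(sorted(words, key=collation_key))  # ['lobo', 'luz', 'llama', 'mano']
```

Plain code-point sorting places ‘llama’ before ‘lobo’, while the sketch places it after ‘luz’, even though ‘ll’ is given no status as a character anywhere in the code. The collating sequence alone carries the information.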
2.2 Linguistic Information: What Else?<br />
The operations covered by Unicode are limited, and most NLP applications<br />
require much more linguistic knowledge when processing documents in potentially<br />
unknown writing systems. First, an application might need to identify the encoding<br />
(e.g., KOI8-R), the script (Cyrillic), the language (Russian), the standard (civil script),<br />
and the orthography (after 1917) of a document. Part of this information might be drawn<br />
from the metadata available in the document, from the Unicode range, or from the URL<br />
of a document (in our example, http://xyz.xyz.ru), but filling in the remaining gaps<br />
(e.g., mapping from the encoding KOI8-R to the language Russian, from the language<br />
to potential scripts, or from a script to a language) requires background information<br />
about the legacy encodings and writing systems. This background information is<br />