17.05.2014 Views

PDFlib TET PDF IFilter 4.0 Manual

PDFlib TET PDF IFilter 4.0 Manual

PDFlib TET PDF IFilter 4.0 Manual

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

2.4.3 Unicode Normalization<br />

The Unicode standard defines four normalization forms which are based on the notions<br />

of canonical equivalence and compatibility equivalence. All normalization forms put<br />

combining marks in a specific order and apply decomposition and composition in different<br />

ways:<br />

> Normalization Form C (NFC) applies canonical decomposition followed by canonical<br />

composition.<br />

> Normalization Form D (NFD) applies canonical decomposition.<br />

> Normalization Form KC (NFKC) applies compatibility decomposition followed by canonical<br />

composition.<br />

> Normalization Form KD (NFKD) applies compatibility decomposition.<br />

The normalization forms are specified in Unicode Standard Annex #15 »Unicode Normalization<br />

Forms« (see www.unicode.org/versions/Unicode5.2.0/ch03.pdf#G21796 and<br />

www.unicode.org/reports/tr15/).<br />

<strong>TET</strong> <strong>PDF</strong> <strong>IFilter</strong> supports all four Unicode normalization forms. Unicode normalization<br />

can be controlled via the normalize document option, e.g.<br />

normalize=nfc<br />

<strong>TET</strong> <strong>PDF</strong> <strong>IFilter</strong> does not apply normalization by default. Because of the possible interaction<br />

between the decompose and normalize options, setting the normalize option to a value<br />

different from none disables the default decompositions.<br />

The choice of normalization form depends on the application’s requirements. For<br />

example, some databases expect text in NFC which also the preferred format for Unicode<br />

text on the Web. Table 2.6 demonstrates the effect of Normalization on various<br />

characters.<br />

Table 2.6 Unicode normalization forms: examples<br />

before<br />

normalization NFC NFD NFKC NFKD<br />

<br />

U+00C4<br />

<br />

U+00C4<br />

<br />

U+0041 U+0308<br />

<br />

U+00C4<br />

<br />

U+0041 U+0308<br />

<br />

U+0041 U+0308<br />

<br />

U+00C4<br />

<br />

U+0041 U+0308<br />

<br />

U+00C4<br />

<br />

U+0041 U+0308<br />

<br />

U+0308 U+0041<br />

<br />

U+0308 U+0041<br />

<br />

U+0308 U+0041<br />

<br />

U+0308 U+0041<br />

<br />

U+0308 U+0041<br />

<br />

U+FB01<br />

<br />

U+FB01<br />

<br />

U+FB01<br />

<br />

<br />

U+0066 U+0069<br />

<br />

<br />

U+0066 U+0069<br />

<br />

<br />

U+0033 U+2075<br />

<br />

<br />

U+0033 U+2075<br />

<br />

<br />

U+0033 U+2075<br />

<br />

<br />

U+0033 U+0035<br />

<br />

<br />

U+0033 U+0035<br />

<br />

U+212B<br />

<br />

U+00C5<br />

<br />

U+0041 U+030A<br />

<br />

U+00C5<br />

<br />

U+0041 U+030A<br />

<br />

U+2122<br />

<br />

U+2122<br />

<br />

U+2122<br />

<br />

<br />

U+0054 U+004D<br />

<br />

<br />

U+0054 U+004D<br />

2.4 Unicode Postprocessing 33

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!