17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Compatibility decomposition. Characters which are compatibility equivalent represent the same abstract character, but may differ in appearance or behavior. Examples include isolated forms of Arabic characters (e.g. U+0633) vs. context-specific shaped forms (e.g. U+FEB2, U+FEB3, U+FEB4). Compatibility equivalent characters differ in formatting. Removing this formatting information implies loss of information, but may simplify processing for certain types of applications (e.g. searching). Compatibility decompositions remove the formatting information.
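The mapping from shaped forms back to the plain letter can be tried outside TET with Python's standard unicodedata module. This is only a sketch of the Unicode mappings themselves, not of TET's implementation; note that NFKD applies canonical and all compatibility decompositions at once, whereas the text here concerns compatibility decompositions only. The function name compat_decompose is chosen for this illustration.

```python
import unicodedata

def compat_decompose(text: str) -> str:
    """Apply NFKD (canonical plus compatibility decomposition) to text."""
    return unicodedata.normalize("NFKD", text)

# Contextual presentation forms of ARABIC LETTER SEEN: final, initial, medial
for ch in ("\uFEB2", "\uFEB3", "\uFEB4"):
    plain = compat_decompose(ch)
    codes = " ".join(f"U+{ord(c):04X}" for c in plain)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {codes}")
```

All three shaped forms decompose to U+0633, which is why a search for the plain letter can match text that was extracted as presentation forms.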

In the Unicode code charts, compatibility mappings are marked with the symbol ≈ (U+2248 ALMOST EQUAL TO), followed by the decomposition name (or »tag«) in angle brackets, e.g. <circle>. If no tag name is provided, <compat> is assumed. The tag names are identical to the option names in Table 7.5. As can be seen in some of the examples, a decomposition may convert a single character to a sequence of multiple characters.
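The tags can also be inspected programmatically. The following Python sketch reads the decomposition mapping from the standard unicodedata module; the helper name decomposition_tag is hypothetical. Unlike the code charts, the underlying data file spells out the compat tag explicitly, and canonical mappings carry no tag at all.

```python
import unicodedata

def decomposition_tag(ch: str):
    """Return (tag, code points) for the decomposition mapping of ch.

    tag is None for a canonical mapping or when ch has no mapping.
    """
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0].startswith("<"):
        return fields[0].strip("<>"), fields[1:]
    return None, fields

# Circled digit one, fi ligature, Arabic seen final form, A with diaeresis
for ch in ("\u2460", "\uFB01", "\uFEB2", "\u00C4"):
    tag, mapping = decomposition_tag(ch)
    print(f"U+{ord(ch):04X}: tag={tag} mapping={mapping}")
```

The ligature U+FB01 shows a single character mapping to a sequence of two characters (U+0066 U+0069).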

Note While all entries in Table 7.5 describe compatibility decompositions, the »compat« tag includes only »other« compatibility decompositions, i.e. those without a specific name.

Note Keep in mind that PDF documents may already map glyphs to the decomposed sequence instead of the non-decomposed Unicode value. In this situation the decompose option will not affect the output.

Decomposition examples. Decompositions in TET can be controlled with the document option decompose. A decomposition can be restricted to operate only on some, but not all Unicode characters. The subset on which a decomposition operates is called its domain. Table 7.5 lists the suboptions for all Unicode decompositions along with examples.

The following examples for the decompose option must be supplied in the option list for TET_open_document( ). The decomposition names in the decompose option list are taken from Table 7.5.

Disable all decompositions:

decompose={none}


Preserve wide (double-byte or zenkaku) and narrow (hankaku) characters:

decompose={wide=_none narrow=_none}

Map all canonical equivalents to their counterparts:

decompose={canonical=_all}

The following option list enables the circle decomposition, but disables all other decompositions:

decompose={none circle=_all}
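When such option lists are assembled programmatically, a small helper can keep the syntax consistent. The following Python sketch only builds the option string in the form shown in the examples above; the helper decompose_optlist and its parameters are hypothetical and not part of the TET API. The resulting string would then be supplied in the option list for TET_open_document( ).

```python
def decompose_optlist(domains=None, reset=False):
    """Build a decompose option list string (hypothetical helper, not a TET API).

    domains maps decomposition names from Table 7.5 to domain values
    such as "_all" or "_none"; reset=True prepends "none" so that all
    decompositions which are not listed are disabled.
    """
    parts = ["none"] if reset else []
    for name, domain in (domains or {}).items():
        parts.append(f"{name}={domain}")
    return "decompose={" + " ".join(parts) + "}"

# Reproduce the four option lists from the examples above
print(decompose_optlist(reset=True))
print(decompose_optlist({"wide": "_none", "narrow": "_none"}))
print(decompose_optlist({"canonical": "_all"}))
print(decompose_optlist({"circle": "_all"}, reset=True))
```

The helper does not validate decomposition names or domain values; checking them against Table 7.5 is left to the caller.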
