
PDFlib Text Extraction Toolkit (TET) Manual


Since the PDF document may map presentation forms either to the isolated Unicode character or one of the presentation forms (e.g. in the document’s ToUnicode CMap), TET cannot guarantee that the output contains presentation forms even when decompositions are disabled.

Table 6.2 Processing Arabic presentation forms with the decompose option

| description and option list | before decomposition | after decomposition (in logical order) |
|---|---|---|
| Decompose final, initial, isolated, and medial presentation forms: no decompose option (default) or decompose=none or decompose={final=_all medial=_all initial=_all isolated=_all} | U+FEB2, U+FEB3, U+FD0E, U+FEB4 | U+0633, U+0633, U+0633 U+0631, U+0633 |
| Preserve final, initial, isolated, and medial presentation forms: decompose={final=_none medial=_none initial=_none isolated=_none} | U+FEB2, U+FEB3, U+FD0E, U+FEB4 | U+FEB2, U+FEB3, U+FD0E, U+FEB4 |
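The mapping in Table 6.2 can be reproduced outside TET as a rough illustration. The following sketch uses Unicode compatibility decomposition (NFKD) from Python's standard `unicodedata` module as a stand-in; it is not TET's implementation, and NFKD is broader than the per-tag `decompose` suboptions shown in the table. The helper names `decompose_presentation_form` and `code_points` are hypothetical.

```python
import unicodedata

# The four presentation forms listed in the "before decomposition" column.
PRESENTATION_FORMS = ["\uFEB2", "\uFEB3", "\uFD0E", "\uFEB4"]


def decompose_presentation_form(text: str) -> str:
    """Apply Unicode compatibility decomposition (NFKD).

    Assumption: NFKD only approximates what TET's decompose option does
    for final, initial, isolated, and medial presentation forms.
    """
    return unicodedata.normalize("NFKD", text)


def code_points(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for form in PRESENTATION_FORMS:
    decomposed = decompose_presentation_form(form)
    print(code_points(form), "->", code_points(decomposed))
```

Each presentation form of Seen maps to U+0633, and the ligature U+FD0E maps to the two-character sequence U+0633 U+0631, matching the "after decomposition" column.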

Remove Arabic Tatweel character. The Tatweel character U+0640 (also called kashida) is often used in Arabic text to stretch words so that they completely fill the line. Since the Tatweel doesn’t carry any text information itself, it is usually not required in the extracted text. By default, TET removes Tatweel characters from the extracted text. As shown in Table 6.3, the fold option can be used to preserve Tatweel characters (see Section 7.3.1, »Unicode Folding«, page 97).

Table 6.3 Processing the Tatweel character U+0640 with the fold option

| description and option list | before folding | after folding |
|---|---|---|
| Remove Arabic Tatweel characters: no fold option (default) or fold={{[U+0640] remove}} or fold={default} | U+0640 | n/a |
| Preserve Arabic Tatweel characters (which are removed by default): fold={{[U+0640] preserve}} | U+0640 | U+0640 |
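The effect of the two rows in Table 6.3 can be sketched in plain Python. This is only an illustration of what removing or preserving U+0640 means for the extracted text; it does not call TET, and the function name `fold_tatweel` and its `preserve` parameter are hypothetical.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)


def fold_tatweel(text: str, preserve: bool = False) -> str:
    """Remove Tatweel characters unless preserve is True.

    Assumption: this mimics the default removal and the
    fold={{[U+0640] preserve}} setting on an already extracted string.
    """
    if preserve:
        return text
    return text.replace(TATWEEL, "")


# A word stretched with two Tatweel characters after the first letter.
stretched = "\u0633\u0640\u0640\u0644\u0627\u0645"

removed = fold_tatweel(stretched)
kept = fold_tatweel(stretched, preserve=True)

print(len(stretched), len(removed), len(kept))
```

Removing the Tatweel leaves the letters themselves untouched, which is why the stretched and unstretched spellings of a word become identical for searching and indexing.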

6.4 Bidirectional Arabic and Hebrew Text 83
