PDFlib Text Extraction Toolkit (TET) Manual
Since the PDF document may map presentation forms either to the isolated Unicode
character or one of the presentation forms (e.g. in the document’s ToUnicode CMap),
TET cannot guarantee that the output contains presentation forms even when decompositions
are disabled.
Table 6.2 Processing Arabic presentation forms with the decompose option

| description and option list | before decomposition | after decomposition (in logical order) |
|---|---|---|
| Decompose final, initial, isolated, and medial presentation forms: no decompose option (default) or decompose=none or decompose={final=_all medial=_all initial=_all isolated=_all} | U+FEB2, U+FEB3, U+FD0E, U+FEB4 | U+0633, U+0633, U+0633 U+0631, U+0633 |
| Preserve final, initial, isolated, and medial presentation forms: decompose={final=_none medial=_none initial=_none isolated=_none} | U+FEB2, U+FEB3, U+FD0E, U+FEB4 | U+FEB2, U+FEB3, U+FD0E, U+FEB4 |
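
The mapping in Table 6.2 can be reproduced outside of TET with Python's standard unicodedata module. The following is only an illustrative sketch, not TET's implementation: the helper names decompose_presentation_forms and code_points are hypothetical, and the code applies a single level of the Unicode compatibility decomposition for the four tags named in the table.

```python
import unicodedata

# Compatibility decomposition tags used for Arabic presentation forms
# in the Unicode Character Database.
PRESENTATION_TAGS = frozenset({"<final>", "<medial>", "<initial>", "<isolated>"})


def decompose_presentation_forms(text, tags=PRESENTATION_TAGS):
    """Replace presentation forms by their one-level compatibility mapping.

    Hypothetical helper for illustration only. Characters whose
    decomposition tag is not in `tags` are copied unchanged, so passing
    an empty set preserves all presentation forms.
    """
    result = []
    for ch in text:
        # unicodedata.decomposition() returns e.g. "<final> 0633",
        # or an empty string if the character has no decomposition.
        fields = unicodedata.decomposition(ch).split()
        if fields and fields[0] in tags:
            result.append("".join(chr(int(cp, 16)) for cp in fields[1:]))
        else:
            result.append(ch)
    return "".join(result)


def code_points(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    for form in ("\uFEB2", "\uFEB3", "\uFD0E", "\uFEB4"):
        decomposed = decompose_presentation_forms(form)
        print(code_points(form), "->", code_points(decomposed))
```

Passing an empty set as tags leaves the input untouched, which corresponds to the second row of the table. As noted above, whether presentation forms reach the output at all also depends on how the PDF maps its glyphs to Unicode.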
Remove Arabic Tatweel character. The Tatweel character U+0640 (also called kashida)
is often used in Arabic text to stretch words so that they completely fill the line. Since
the Tatweel doesn’t carry any text information itself, it is usually not required in the extracted
text. By default, TET removes Tatweel characters from the extracted text. As
shown in Table 6.3, the fold option can be used to preserve Tatweel characters (see Section
7.3.1, »Unicode Folding«, page 97).
Table 6.3 Processing the Tatweel character U+0640 with the fold option

| description and option list | before folding | after folding |
|---|---|---|
| Remove Arabic Tatweel characters: no fold option (default) or fold={{[U+0640] remove}} or fold={default} | U+0640 | n/a |
| Preserve Arabic Tatweel characters (which are removed by default): fold={{[U+0640] preserve}} | U+0640 | U+0640 |
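
The effect of the two rows in Table 6.3 on a string can be sketched in plain Python. This is an illustration under stated assumptions, not TET code: the function fold_tatweel and its preserve parameter are hypothetical names, whereas in TET the behavior is controlled through the fold option.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)


def fold_tatweel(text, preserve=False):
    """Remove Tatweel characters unless `preserve` is set.

    Hypothetical helper for illustration only. preserve=False mimics the
    default removal, preserve=True mimics fold={{[U+0640] preserve}}.
    """
    if preserve:
        return text
    return text.replace(TATWEEL, "")


if __name__ == "__main__":
    # Seen and Reh, stretched with two Tatweel characters in between.
    stretched = "\u0633\u0640\u0640\u0631"
    removed = fold_tatweel(stretched)
    kept = fold_tatweel(stretched, preserve=True)
    print(len(stretched), len(removed), len(kept))
```

Because the Tatweel only stretches the word visually, removing it leaves the underlying letters unchanged, which is usually what is wanted for searching or indexing extracted text.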
6.4 Bidirectional Arabic and Hebrew Text 83