PDFlib Text Extraction Toolkit (TET) Manual
Table 10.8 Document options for TET_open_document( ) and TET_open_document_callback( )

option
    description

checkglyphlists
    (Boolean) If true, TET will check all builtin glyphmapping rules with condition=allfonts before
    text extraction starts. Otherwise the global glyphmapping rules will not be applied. This option
    slows down processing, but is useful for certain kinds of TeX documents with glyph names which
    cannot be mapped to Unicode by default. Default: false

decompose
    (Keyword or option list) Unicode decompositions which will be applied to all characters which
    have a specified Unicode decomposition tag and are part of the specified Unicode set. These
    conditions are provided in the suboption name and value. Decompositions can be used to either
    remove or preserve the distinction between equivalent Unicode characters (see Section 7.3,
    »Unicode Postprocessing«, page 97).
    Default: see »Default decompositions«, page 103. However, if the normalize option has a value
    other than none, all default decompositions are disabled, i.e. setting the normalize option sets
    the default to decompose=none. User-specified decompositions can still be applied in that case.

    The following keywords can be supplied instead of a list:
        none      No decompositions will be applied.
        default   The default decompositions (see »Default decompositions«, page 103) will be
                  applied before other specified decompositions.

    The following suboptions for decompositions are supported:
        canonical, circle, compat, final, font, fraction, initial, isolated, medial, narrow,
        nobreak, small, square, sub, super, vertical, wide

    Each of these suboptions accepts a string or keyword which specifies the decomposition’s domain,
    i.e. the set of Unicode characters to which the decomposition will be applied. A string specifies
    a Unicode set for the domain. This can be used to restrict decompositions to subsets of the
    characters with the specified decomposition tag. Characters outside the domain will not be
    modified.

    As an alternative to a string for a Unicode set the following keywords can be supplied:
        _all      The set of all Unicode characters, i.e. the decomposition will be applied to all
                  characters with the specified decomposition tag.
        _none     The empty set, i.e. the decomposition will not be applied at all.

encodinghint
    (String) The name of an encoding which will be used to determine Unicode mappings for glyph
    names which cannot be mapped by standard rules, but only by a predefined internal glyph mapping
    rule. The keyword none can be used to disable all predefined rules. Default: winansi
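
The three options in Table 10.8 are combined into a single option list string which is supplied when the document is opened. The following Python sketch only assembles such a string; it does not call TET. The helper names decompose_value and document_optlist are hypothetical and not part of the TET API, and the name={...} notation for a list of suboptions is assumed from the general PDFlib option list conventions. The suboption names and keywords are the ones listed in the table.

```python
# Hypothetical helpers for illustration only; they are not part of the TET API.
# Suboption names and keywords are taken from Table 10.8.
DECOMPOSE_TAGS = frozenset({
    "canonical", "circle", "compat", "final", "font", "fraction", "initial",
    "isolated", "medial", "narrow", "nobreak", "small", "square",
    "sub", "super", "vertical", "wide",
})
DECOMPOSE_KEYWORDS = frozenset({"none", "default"})


def decompose_value(spec):
    """Return the value part of a decompose option.

    spec is either one of the keywords "none" / "default", or a dict which
    maps a decomposition tag to its domain ("_all", "_none", or a Unicode
    set string).
    """
    if isinstance(spec, str):
        if spec not in DECOMPOSE_KEYWORDS:
            raise ValueError(f"unknown decompose keyword: {spec}")
        return spec
    unknown = sorted(set(spec) - DECOMPOSE_TAGS)
    if unknown:
        raise ValueError(f"unknown decomposition suboptions: {unknown}")
    # Assumption: suboptions are written as a brace-enclosed list.
    pairs = " ".join(f"{tag}={domain}" for tag, domain in spec.items())
    return "{" + pairs + "}"


def document_optlist(checkglyphlists=False, decompose=None, encodinghint=None):
    """Assemble a document option list; omitted options keep their defaults."""
    parts = []
    if checkglyphlists:
        parts.append("checkglyphlists=true")
    if decompose is not None:
        parts.append("decompose=" + decompose_value(decompose))
    if encodinghint is not None:
        parts.append("encodinghint=" + encodinghint)
    return " ".join(parts)


# Apply fraction decompositions to all characters, leave wide characters
# untouched, and check the builtin glyph lists (useful for some TeX documents).
optlist = document_optlist(
    checkglyphlists=True,
    decompose={"fraction": "_all", "wide": "_none"},
    encodinghint="winansi",
)
print(optlist)
```

The resulting string would be passed as the option list argument of TET_open_document( ). Rejecting unknown suboption names early is a design choice of this sketch, so that a misspelled decomposition tag is caught before the document is opened.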
162 Chapter 10: TET Library API Reference