
PDFlib Text Extraction Toolkit (TET) Manual


Table 10.8 Document options for TET_open_document( ) and TET_open_document_callback( )

option: description

checkglyphlists
(Boolean) If true, TET will check all builtin glyphmapping rules with condition=allfonts before text extraction starts. Otherwise the global glyphmapping rules will not be applied. This option slows down processing, but is useful for certain kinds of TeX documents with glyph names which cannot be mapped to Unicode by default. Default: false

decompose
(Keyword or option list) Unicode decompositions which will be applied to all characters which have a specified Unicode decomposition tag and are part of the specified Unicode set. These conditions are provided in the suboption name and value. Decompositions can be used to either remove or preserve the distinction between equivalent Unicode characters (see Section 7.3, »Unicode Postprocessing«, page 97).
Default: see »Default decompositions«, page 103. However, if the normalize option has a value other than none, all default decompositions are disabled, i.e. setting the normalize option sets the default to decompose=none. However, user-specified decompositions can still be applied.
The following keywords can be supplied instead of a list:
    none       No decompositions will be applied.
    default    The default decompositions (see »Default decompositions«, page 103) will be applied before other specified decompositions.
The following suboptions for decompositions are supported:
    canonical, circle, compat, final, font, fraction, initial, isolated, medial, narrow, nobreak, small, square, sub, super, vertical, wide
Each of these suboptions accepts a string or keyword which specifies the decomposition’s domain, i.e. the set of Unicode characters to which the decomposition will be applied. A string specifies a Unicode set for the domain. This can be used to restrict decompositions to subsets of the characters with the specified decomposition tag. Characters outside the domain will not be modified.
As an alternative to a string for a Unicode set the following keywords can be supplied:
    _all       The set of all Unicode characters, i.e. the decomposition will be applied to all characters with the specified decomposition tag.
    _none      The empty set, i.e. the decomposition will not be applied at all.

encodinghint
(String) The name of an encoding which will be used to determine Unicode mappings for glyph names which cannot be mapped by standard rules, but only by a predefined internal glyph mapping rule. The keyword none can be used to disable all predefined rules. Default: winansi
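As an illustration of how the three document options in this table might be combined, the following Python sketch assembles an option list string of the kind that would be supplied as the option list argument of TET_open_document( ). The helper build_optlist is hypothetical and not part of TET; the key=value and {...} notation is assumed to follow the general option list conventions described elsewhere in the manual, and the chosen option values are examples only.

```python
def build_optlist(options):
    """Render a dict as an option list string (hypothetical helper, not a TET API).

    Booleans become true/false, nested dicts become {...} sublists,
    and all other values are written as given.
    """
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            parts.append(f"{key}={'true' if value else 'false'}")
        elif isinstance(value, dict):
            # Suboptions such as those of decompose are wrapped in braces.
            parts.append(f"{key}={{{build_optlist(value)}}}")
        else:
            parts.append(f"{key}={value}")
    return " ".join(parts)


# Example values only: check builtin glyph lists, disable the predefined
# encoding-based rules, decompose all fractions, and leave wide characters alone.
doc_optlist = build_optlist({
    "checkglyphlists": True,
    "encodinghint": "none",
    "decompose": {"fraction": "_all", "wide": "_none"},
})
print(doc_optlist)
```

The resulting string is plain text, so it can be logged or inspected before it is handed to the library.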
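The interplay of decomposition tag and domain described for the decompose option can be modeled with the Unicode Character Database shipped in Python's standard unicodedata module. This is a rough sketch of the concept only, not TET's implementation: the function names are hypothetical, the rules dictionary merely mimics the suboption/domain pairs, tags are lowercased to resemble the suboption names, and only a single decomposition step is applied.

```python
import unicodedata


def decomposition_tag(ch):
    """Return the lowercased decomposition tag of ch, 'canonical' for
    untagged decompositions, or None if ch has no decomposition."""
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None
    first = raw.split()[0]
    if first.startswith("<"):
        return first.strip("<>").lower()
    return "canonical"


def decompose(text, rules):
    """Apply one decomposition step to characters whose tag has a rule
    and which lie inside the rule's domain.

    rules maps a tag name to a domain: '_all', '_none', or a set of characters.
    Tags without a rule are treated like '_none' in this sketch.
    """
    out = []
    for ch in text:
        tag = decomposition_tag(ch)
        domain = rules.get(tag, "_none")
        in_domain = domain == "_all" or (domain != "_none" and ch in domain)
        if tag is not None and in_domain:
            fields = unicodedata.decomposition(ch).split()
            codepoints = [f for f in fields if not f.startswith("<")]
            out.append("".join(chr(int(cp, 16)) for cp in codepoints))
        else:
            # Characters outside the domain are not modified.
            out.append(ch)
    return "".join(out)


# U+00BD VULGAR FRACTION ONE HALF, a space, U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
sample = "\u00bd \uff21"
result = decompose(sample, {"fraction": "_all", "wide": "_none"})
# The fraction is replaced by 1, U+2044 FRACTION SLASH, 2; the wide letter is preserved.
print(result.encode("unicode_escape").decode("ascii"))
```

Restricting a domain to a set of characters, for example {"\u00bd"} for the fraction tag, leaves other fractions such as U+00BC untouched, which corresponds to the statement that characters outside the domain are not modified.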

Chapter 10: TET Library API Reference
