PDFlib Text Extraction Toolkit (TET) Manual


Table 10.8 Document options for TET_open_document( ) and TET_open_document_callback( )

Options covered: fold, glyphmapping, lineseparator, normalize, inmemory. Descriptions follow in the same order.

fold (Keyword or list of lists; the first element of each inner list is a Unicode set or keyword, the second element is a Unichar or a keyword) Apply a folding (equivalence mapping) to all characters in a folding domain specified as a Unicode set. The foldings are applied to all text except separator characters added by the lineseparator or wordseparator options (see Section 7.3, »Unicode Postprocessing«, page 97). Default: see Table 7.3, page 99.

The following keyword can be supplied instead of a list:<br />

none No foldings will be applied.<br />

The following keyword can be supplied instead of a sublist:<br />

default The default foldings will be applied before other specified foldings.<br />

The first element of each list specifies the folding’s domain, i.e. the set of Unicode characters to which the folding will be applied. A string specifies a Unicode set for the domain. If a character is included in multiple sets specified within the fold option, the first matching set definition has priority over all others. To avoid problems it is recommended to use disjoint sets.

As an alternative to a Unicode set the following keyword can be supplied:<br />

_dehyphenation The folding will be applied to hyphen characters which have been found within hyphenated words at line breaks. These characters will be flagged in the attributes member returned by TET_get_char_info( ) and the @dehyphenation attribute in TETML.

The second element in each list contains the target character or action for the folding. It is specified with one of the following variants:

(Unichar) Replace all characters in the domain with the specified Unicode character.<br />

remove All characters in the domain will be removed.<br />

preserve The characters in the domain will not be modified.<br />

unknownchar Replace all characters in the domain with the character specified in the unknownchar option.
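The matching rules described above (first matching domain wins; the action is a target character, remove, preserve, or unknownchar) can be modeled in a few lines of Python. This is only an illustrative sketch, not TET code: the function name apply_foldings, the use of Python sets in place of Unicode set strings, and the U+FFFD stand-in for the unknownchar option are all assumptions made for the example.

```python
# Illustrative model of the fold matching rules; this is not TET code.
UNKNOWNCHAR = "\ufffd"  # assumed stand-in for the value of the unknownchar option


def apply_foldings(text, foldings, unknownchar=UNKNOWNCHAR):
    """Apply (domain, action) pairs to text; the first matching domain wins.

    domain: a Python set of characters (stands in for a Unicode set)
    action: "remove", "preserve", "unknownchar", or a single target character
    """
    out = []
    for ch in text:
        for domain, action in foldings:
            if ch in domain:
                if action == "remove":
                    pass                      # drop the character
                elif action == "preserve":
                    out.append(ch)            # keep it unmodified
                elif action == "unknownchar":
                    out.append(unknownchar)   # substitute the unknown character
                else:
                    out.append(action)        # replace with the target character
                break                         # later domains are ignored
        else:
            out.append(ch)                    # no domain matched
    return "".join(out)


foldings = [
    ({"\u00ad"}, "remove"),            # soft hyphen: remove
    ({"\u00a0", "\u2009"}, " "),       # no-break and thin space: fold to U+0020
    ({"\u00a0"}, "unknownchar"),       # overlaps the previous set, so never used
]
print(apply_foldings("co\u00adop\u00a0x", foldings))  # prints "coop x"
```

The third entry shows why disjoint sets are recommended: because U+00A0 already matches the second domain, the overlapping entry has no effect.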

glyphmapping (List of option lists) A list of option lists where each option list describes a glyph mapping method for one or more font/encoding combinations which cannot reliably be mapped with standard methods. The mappings will be used in least-recently-set order. If the last option list contains the font name wildcard »*«, preceding mappings will no longer be used. Each rule consists of an option list according to Table 10.9. All glyph mappings which match a particular font name will be applied to this font (default: predefined internal glyph rules will be applied).

Note that glyph mapping rules can also be specified as an external resource in the UPR file (see Section 5.2, »Resource Configuration and File Searching«, page 61).

lineseparator (Unichar; only for granularity=page) Character to be inserted between lines. Default: U+000A
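As a rough picture of what the separator does to page-level text, the following sketch joins already extracted lines with a configurable character. It does not call TET; join_lines and its parameter are hypothetical names used only for illustration.

```python
def join_lines(lines, lineseparator="\n"):
    """Join extracted lines with the separator (U+000A mirrors the default)."""
    return lineseparator.join(lines)


lines = ["first line", "second line"]
print(repr(join_lines(lines)))            # separated by U+000A
print(repr(join_lines(lines, "\u2028")))  # separated by U+2028 LINE SEPARATOR
```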

normalize (Keyword) Normalize the text output to one of the Unicode normalization forms (default: none):

none Do not apply any normalization.<br />

nfc Normalization Form C (NFC): canonical decomposition followed by canonical composition<br />

nfd Normalization Form D (NFD): canonical decomposition<br />

nfkc Normalization Form KC (NFKC): compatibility decomposition followed by canonical composition<br />

nfkd Normalization Form KD (NFKD): compatibility decomposition<br />

Since the Unicode normalization forms involve canonical and compatibility decompositions, combinations of the options decompose and normalize must be constructed carefully. Setting the normalize option to a value different from none sets the decomposition default to decompose=none. The normalize option is processed after the decompose option.
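The difference between the four normalization forms can be seen with Python's standard unicodedata module. This sketch illustrates the Unicode forms themselves, not TET's processing: normalize_text is a hypothetical helper, and actual TET output also depends on the fold and decompose options.

```python
import unicodedata


def normalize_text(text, form="none"):
    """Apply one of the keywords none, nfc, nfd, nfkc, nfkd to text."""
    if form == "none":
        return text
    return unicodedata.normalize(form.upper(), text)


# U+FB01 LATIN SMALL LIGATURE FI, a space, then "e" + U+0301 COMBINING ACUTE ACCENT
sample = "\ufb01 e\u0301"

for form in ("none", "nfc", "nfd", "nfkc", "nfkd"):
    result = normalize_text(sample, form)
    print(form, [f"U+{ord(c):04X}" for c in result])
```

The canonical forms (nfc, nfd) leave the ligature U+FB01 intact because it only has a compatibility decomposition; the compatibility forms (nfkc, nfkd) split it into "f" and "i". The accented letter is composed to U+00E9 by nfc and nfkc and stays decomposed under nfd and nfkd.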

inmemory (Boolean; only for TET_open_document( )) If true, TET will load the complete file into memory and process it from there. This can result in a tremendous performance gain on some systems (especially MVS) at the expense of memory usage. If false, individual parts of the document will be read from disk as needed. Default: false
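The trade-off behind inmemory can be sketched generically: either read the whole file once and serve all later accesses from RAM, or keep the file open and seek to each part on demand. The helpers open_source and read_part below are hypothetical and do not reflect how TET is implemented.

```python
import io


def open_source(path, inmemory=False):
    """Return a seekable binary stream for the file at path."""
    if inmemory:
        with open(path, "rb") as f:
            return io.BytesIO(f.read())  # one large read; later accesses hit RAM
    return open(path, "rb")              # each access seeks and reads from disk


def read_part(stream, offset, length):
    """Read length bytes starting at offset, as a parser would for one object."""
    stream.seek(offset)
    return stream.read(length)
```

Both variants return the same bytes for any offset; they differ only in where the cost falls (memory footprint versus repeated disk access).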

