PDFlib Text Extraction Toolkit (TET) Manual
Table 10.8 Document options for TET_open_document( ) and TET_open_document_callback( )
Options (descriptions follow in the same order): fold, glyphmapping, lineseparator, normalize, inmemory
fold
(Keyword or list of lists; the first element of each inner list is a Unicode set or keyword, the second element
is a Unichar or a keyword) Apply a folding (equivalence mapping) to all characters in a folding domain
specified as a Unicode set. The foldings will be applied to all text except separator characters added
by the lineseparator or wordseparator options (see Section 7.3, »Unicode Postprocessing«, page 97).
Default: see Table 7.3, page 99.
The following keyword can be supplied instead of a list:
    none  No foldings will be applied.
The following keyword can be supplied instead of a sublist:
    default  The default foldings will be applied before other specified foldings.
The first element of each list specifies the folding’s domain, i.e. the set of Unicode characters to which the
folding will be applied. A string specifies a Unicode set for the domain. If a character is included in multiple
sets specified within the fold option, the first matching set definition has priority over all others. In
order to avoid problems it is recommended to use disjoint sets.
As an alternative to a Unicode set the following keyword can be supplied:
    _dehyphenation  The folding will be applied to hyphen characters which have been found within hyphenated
    words at line breaks. These characters will be flagged in the attributes member returned by
    TET_get_char_info( ) and the @dehyphenation attribute in TETML.
The second element in each list contains the target character or action for the folding. It is specified with
one of the following variants:
    (Unichar)  Replace all characters in the domain with the specified Unicode character.
    remove  All characters in the domain will be removed.
    preserve  The characters in the domain will not be modified.
    unknownchar  Replace all characters in the domain with the character specified in the unknownchar option.
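The folding rules for the fold option (first matching domain wins; the target is a replacement character, remove, preserve, or unknownchar) can be illustrated with a small model. This is only a sketch of the described semantics and not the TET API: the function apply_foldings, the use of Python sets in place of Unicode sets, and the U+FFFD value for the unknownchar parameter are assumptions made for illustration.

```python
# Illustrative model of the fold rules; this is NOT the TET API.
REMOVE = "remove"
PRESERVE = "preserve"
UNKNOWNCHAR = "unknownchar"


def apply_foldings(text, foldings, unknownchar="\ufffd"):
    """Apply (domain, action) pairs; the first domain containing a character wins.

    `foldings` is a list of (set_of_characters, action) pairs, where the action
    is a single replacement character or one of the keywords above.
    `unknownchar` stands in for the value of the unknownchar option (assumed).
    """
    out = []
    for ch in text:
        for domain, action in foldings:
            if ch in domain:
                if action == REMOVE:
                    pass                     # drop the character
                elif action == PRESERVE:
                    out.append(ch)           # keep it unchanged
                elif action == UNKNOWNCHAR:
                    out.append(unknownchar)  # substitute the unknownchar value
                else:
                    out.append(action)       # replace with the target character
                break                        # later domains are ignored
        else:
            out.append(ch)                   # no folding domain matched
    return "".join(out)


foldings = [
    ({"\u00a0", "\u2003"}, " "),  # fold two space variants to U+0020
    ({"\u00ad"}, REMOVE),         # remove soft hyphens
    ({"\u00a0"}, PRESERVE),       # never applies: U+00A0 already matched the first set
]

folded = apply_foldings("a\u00a0b\u00adc\u2003d", foldings)
print(repr(folded))
```

The third rule shows why disjoint sets are recommended: because U+00A0 is already covered by the first domain, the later preserve rule has no effect.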
glyphmapping
(List of option lists) A list of option lists where each option list describes a glyph mapping method for one
or more font/encoding combinations which cannot reliably be mapped with standard methods. The
mappings will be used in least-recently-set order. If the last option list contains the font name wildcard
»*«, preceding mappings will no longer be used. Each rule consists of an option list according to Table
10.9. All glyph mappings which match a particular font name will be applied to this font. Default: predefined
internal glyph rules will be applied.
Note that glyph mapping rules can also be specified as an external resource in the UPR file (see Section
5.2, »Resource Configuration and File Searching«, page 61).
lineseparator
(Unichar; Only for granularity=page) Character to be inserted between lines. Default: U+000A
normalize
(Keyword) Normalize the text output to one of the Unicode normalization forms (default: none):
    none  Do not apply any normalization.
    nfc  Normalization Form C (NFC): canonical decomposition followed by canonical composition
    nfd  Normalization Form D (NFD): canonical decomposition
    nfkc  Normalization Form KC (NFKC): compatibility decomposition followed by canonical composition
    nfkd  Normalization Form KD (NFKD): compatibility decomposition
Since the Unicode normalization forms involve canonical and compatibility decompositions, combinations
of the options decompose and normalize must be constructed carefully. Setting the normalize option
to a value different from none sets the decomposition default to decompose=none. The normalize
option is processed after the decompose option.
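The difference between the four normalization forms can be seen with Python's standard unicodedata module. This sketch only demonstrates the Unicode forms themselves and does not call TET; the sample string is chosen for illustration.

```python
import unicodedata

# U+00E9 (e with acute) has a canonical decomposition;
# U+FB01 (fi ligature) has only a compatibility decomposition.
sample = "\u00e9 \ufb01"

forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

for name, text in forms.items():
    # Show the code points so that composed and decomposed forms are visible.
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

The canonical forms (NFC, NFD) leave the ligature intact, while the compatibility forms (NFKC, NFKD) split it into the letters f and i.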
inmemory
(Boolean; Only for TET_open_document( )) If true, TET will load the complete file into memory and process
it from there. This can result in a tremendous performance gain on some systems (especially MVS) at
the expense of memory usage. If false, individual parts of the document will be read from disk as needed.
Default: false
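Document options such as those in this table are passed to TET_open_document( ) as an option list string. The following sketch assembles such a string from Python values. The helper build_docoptlist is hypothetical and not part of TET, and the name=value formatting and the commented call to the Python binding are assumptions that should be checked against the option list syntax and binding documentation.

```python
def build_docoptlist(**options):
    """Format keyword arguments as a space-separated 'name=value' option list.

    Hypothetical helper for illustration; it is not part of TET.
    """
    parts = []
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # Boolean options
        parts.append(f"{name}={value}")
    return " ".join(parts)


# Load the whole file into memory and request NFC output.
# Per the table, normalize=nfc also sets the default decompose=none.
docoptlist = build_docoptlist(inmemory=True, normalize="nfc")
print(docoptlist)

# Assumed usage with the TET Python binding (not executed here):
#   tet = TET()
#   doc = tet.open_document("input.pdf", docoptlist)
```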