PDFlib Text Extraction Toolkit (TET) Manual
Table 10.3 Document options for TET_open_document( ) and TET_open_document_callback( )
option
    description

checkglyphlists
    (Boolean) If true, TET will check all builtin glyphmapping rules with condition=allfonts before
    text extraction starts. Otherwise the global glyphmapping rules will not be applied. This option
    slows down processing, but is useful for certain kinds of TeX documents with glyph names which
    cannot be mapped to Unicode by default. Default: false

encodinghint
    (String) The name of an encoding which will be used to determine Unicode mappings for glyph names
    which cannot be mapped by standard rules, but only by a predefined internal glyph mapping rule.
    The keyword none can be used to disable all predefined rules. Default: winansi

glyphmapping
    (List of option lists) A list of option lists where each option list describes a glyph mapping
    method for one or more font/encoding combinations which cannot reliably be mapped with standard
    methods. The mappings will be used in least-recently-set order. If the last option list contains
    the fontname wildcard »*«, preceding mappings will no longer be used. Each rule consists of an
    option list according to Table 10.4. All glyph mappings which match a particular font name will
    be applied to this font. Default: predefined internal glyph rules will be applied.
    Note that glyph mapping rules can also be specified as an external resource in the UPR file (see
    Section 5.2, »Resource Configuration and File Searching«, page 51).

infomode
    (Boolean) Deprecated, use requiredmode

keeppua
    (Boolean) If true, PUA (Private Use Area) values will be returned as such; otherwise they will be
    mapped to the Unicode replacement character (see option unknownchar). Default: false

inmemory
    (Boolean; only for TET_open_document( )) If true, TET will load the complete file into memory and
    process it from there. This can result in a tremendous performance gain on some systems
    (especially MVS) at the expense of memory usage. If false, individual parts of the document will
    be read from disk as needed. Default: false

password
    (String; maximum string length: 32 characters) The user or master password for encrypted
    documents. If the document’s permission settings allow text copying then the user password is
    sufficient, otherwise the master password must be supplied.
    See Section 9.6, »Encrypted PDF Documents«, page 119, to find out how to query a document’s
    encryption status, and for pCOS operations which can be applied even without knowing the user or
    master password.
    The shrug option can be used to enable content extraction from protected documents under certain
    conditions (see Section 5.1, »Indexing protected PDF Documents«, page 49).

repair
    (Keyword) Specifies how to treat damaged PDF documents. Repairing a document takes more time than
    normal parsing, but may allow processing of certain damaged PDFs. Note that some documents may be
    damaged beyond repair. Default: auto
    force   Unconditionally try to repair the document, regardless of whether or not it has problems.
    auto    Repair the document only if problems are detected while opening the PDF.
    none    No attempt will be made at repairing the document. If there are problems in the PDF the
            function call will fail.

requiredmode
    (Keyword) The minimum pcosmode (minimum/restricted/full) which is acceptable when opening the
    document. The call will fail if the resulting pcosmode (see Section 9.6, »Encrypted PDF
    Documents«, page 119) would be lower than the required mode. If the call succeeds it is
    guaranteed that the resulting pcosmode is at least the one specified in this option. However, it
    may be higher; e.g. requiredmode=minimum for an unencrypted document will result in full mode.
    Default: full

shrug
    (Boolean) If true, the shrug feature will be activated to enable content extraction from
    protected documents under certain conditions (see Section 5.1, »Indexing protected PDF
    Documents«, page 49). By using the shrug option you assert that you will honor the PDF document
    author’s rights. Default: false
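
The options in Table 10.3 are supplied as an option list string of the form key=value, as in the requiredmode=minimum example above. The following Python sketch shows one way such a string could be assembled, and how the requiredmode comparison described in the table can be read. Both build_optlist and mode_is_acceptable are hypothetical helpers written for illustration; they are not part of the TET API, and the actual call to TET_open_document( ) is deliberately left out. Wrapping values that contain whitespace in braces is an assumption about option list syntax.

```python
from typing import Mapping, Union

OptValue = Union[bool, str]

# pcosmode keywords named in the requiredmode description, lowest to highest.
PCOS_MODES = ("minimum", "restricted", "full")


def build_optlist(options: Mapping[str, OptValue]) -> str:
    """Join key/value pairs into a 'key=value key=value' option list string.

    Hypothetical helper, not part of TET.
    """
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Boolean options are written as the keywords true/false.
            text = "true" if value else "false"
        else:
            text = str(value)
            # Assumption: empty values or values with whitespace go in braces.
            if text == "" or any(ch.isspace() for ch in text):
                text = "{" + text + "}"
        parts.append(f"{key}={text}")
    return " ".join(parts)


def mode_is_acceptable(resulting: str, required: str = "full") -> bool:
    """Return True if the resulting pcosmode is at least the required mode.

    Mirrors the rule in the table: opening fails when the resulting
    pcosmode would be lower than requiredmode (default: full).
    """
    return PCOS_MODES.index(resulting) >= PCOS_MODES.index(required)


optlist = build_optlist(
    {"repair": "force", "requiredmode": "minimum", "keeppua": True}
)
print(optlist)  # repair=force requiredmode=minimum keeppua=true
print(mode_is_acceptable("restricted", required="full"))  # False
```

The resulting string would then be passed as the option list argument when opening the document. Keeping the mode ordering in a single tuple makes the "at least the one specified" guarantee easy to check in application code.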
130 Chapter 10: TET Library API Reference