17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Table 10.3 Document options for TET_open_document( ) and TET_open_document_callback( )

option, followed by its description

checkglyphlists
(Boolean) If true, TET will check all builtin glyphmapping rules with condition=allfonts before text extraction starts. Otherwise the global glyphmapping rules will not be applied. This option slows down processing, but is useful for certain kinds of TeX documents with glyph names which cannot be mapped to Unicode by default. Default: false

encodinghint
(String) The name of an encoding which will be used to determine Unicode mappings for glyph names which cannot be mapped by standard rules, but only by a predefined internal glyph mapping rule. The keyword none can be used to disable all predefined rules. Default: winansi

glyphmapping
(List of option lists) A list of option lists where each option list describes a glyph mapping method for one or more font/encoding combinations which cannot reliably be mapped with standard methods. The mappings will be used in least-recently-set order. If the last option list contains the fontname wildcard »*«, preceding mappings will no longer be used. Each rule consists of an option list according to Table 10.4. All glyph mappings which match a particular font name will be applied to this font. Default: predefined internal glyph rules will be applied.
Note that glyph mapping rules can also be specified as an external resource in the UPR file (see Section 5.2, »Resource Configuration and File Searching«, page 51).

infomode
(Boolean) Deprecated, use requiredmode.

keeppua
(Boolean) If true, PUA (Private Use Area) values will be returned as such; otherwise they will be mapped to the Unicode replacement character (see option unknownchar). Default: false

inmemory
(Boolean; only for TET_open_document( )) If true, TET will load the complete file into memory and process it from there. This can result in a tremendous performance gain on some systems (especially MVS) at the expense of memory usage. If false, individual parts of the document will be read from disk as needed. Default: false

password
(String; maximum string length: 32 characters) The user or master password for encrypted documents. If the document’s permission settings allow text copying then the user password is sufficient, otherwise the master password must be supplied.
See Section 9.6, »Encrypted PDF Documents«, page 119, to find out how to query a document’s encryption status, and for pCOS operations which can be applied even without knowing the user or master password.
The shrug option can be used to enable content extraction from protected documents under certain conditions (see Section 5.1, »Indexing protected PDF Documents«, page 49).

repair
(Keyword) Specifies how to treat damaged PDF documents. Repairing a document takes more time than normal parsing, but may allow processing of certain damaged PDFs. Note that some documents may be damaged beyond repair. Default: auto. Supported keywords:
force: Unconditionally try to repair the document, regardless of whether or not it has problems.
auto: Repair the document only if problems are detected while opening the PDF.
none: No attempt will be made at repairing the document. If there are problems in the PDF the function call will fail.

requiredmode
(Keyword) The minimum pcosmode (minimum/restricted/full) which is acceptable when opening the document. The call will fail if the resulting pcosmode (see Section 9.6, »Encrypted PDF Documents«, page 119) would be lower than the required mode. If the call succeeds it is guaranteed that the resulting pcosmode is at least the one specified in this option. However, it may be higher; e.g. requiredmode=minimum for an unencrypted document will result in full mode. Default: full

shrug
(Boolean) If true, the shrug feature will be activated to enable content extraction from protected documents under certain conditions (see Section 5.1, »Indexing protected PDF Documents«, page 49). By using the shrug option you assert that you will honor the PDF document author’s rights. Default: false
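
As an illustration of how these options combine, the following Python sketch assembles a document option list string and models the requiredmode rule. It does not call TET itself. The helper names build_document_optlist and open_succeeds are hypothetical, and the space-separated key=value form with braces around the password value is an assumption based on the usual PDFlib option list conventions; only the key=value notation (as in requiredmode=minimum) appears in this table.

```python
# Allowed keyword values, taken from the option table above.
REPAIR_KEYWORDS = ("force", "auto", "none")
PCOS_MODES = ("minimum", "restricted", "full")  # ordered from lowest to highest
MAX_PASSWORD_LENGTH = 32


def build_document_optlist(password=None, repair=None, requiredmode=None,
                           inmemory=None, keeppua=None):
    """Assemble a document option list string (hypothetical helper).

    Options left as None are omitted, so the documented defaults apply.
    """
    parts = []
    if password is not None:
        if len(password) > MAX_PASSWORD_LENGTH:
            raise ValueError("password is limited to 32 characters")
        if "{" in password or "}" in password:
            # This sketch does not attempt to escape braces.
            raise ValueError("braces in the password are not handled here")
        # Assumption: braces keep a value that contains spaces together.
        parts.append("password={%s}" % password)
    if repair is not None:
        if repair not in REPAIR_KEYWORDS:
            raise ValueError("repair must be one of force, auto, none")
        parts.append("repair=%s" % repair)
    if requiredmode is not None:
        if requiredmode not in PCOS_MODES:
            raise ValueError("requiredmode must be minimum, restricted or full")
        parts.append("requiredmode=%s" % requiredmode)
    # Boolean options are written as true/false.
    for name, value in (("inmemory", inmemory), ("keeppua", keeppua)):
        if value is not None:
            parts.append("%s=%s" % (name, "true" if value else "false"))
    return " ".join(parts)


def open_succeeds(resulting_mode, requiredmode="full"):
    """Model the requiredmode rule: the call fails if the resulting
    pcosmode would be lower than the required mode."""
    return PCOS_MODES.index(resulting_mode) >= PCOS_MODES.index(requiredmode)


if __name__ == "__main__":
    print(build_document_optlist(password="secret", repair="force",
                                 requiredmode="minimum"))
    print(open_succeeds("restricted", "full"))
```

The resulting string would be passed as the option list argument of TET_open_document( ); that call is outside the scope of this sketch.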

130 Chapter 10: TET Library API Reference
