17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Table 10.3 Document options for TET_open_document( ) and TET_open_document_callback( )

option          description

tetml           (Option list) TETML output will be initiated, and can be created page by page with
                TET_process_page( ). The following suboptions are supported:

                elements        (List of Boolean) Specify whether certain TETML elements will be
                                included in the output (default: all true):
                                docinfo     The /TET/Document/DocInfo element
                                docxmp      The /TET/Document/Metadata element
                                options     The elements /TET/Document/Options and
                                            /TET/Document/Pages/Page/Options

                encodingname    (Keyword) The name to use in the XML encoding declaration of the text
                                declaration of the generated TETML. The output will always be created
                                in UTF-8 (default: UTF-8):
                                _none       No encoding declaration will be created; the output will
                                            still be in UTF-8 format.
                                UTF-8       The declaration encoding="UTF-8" will be created.
                                Any other encoding name will be used literally in the encoding
                                declaration. The client is responsible for supplying a suitable
                                encoding name and converting the generated TETML (which is UTF-8) to
                                the specified encoding after TET has finished TETML output.

                filename        (String) The name of the TETML file. If no filename is supplied,
                                output will be created in memory, and can be retrieved with
                                TET_get_xml_data( ). If the function call fails (i.e. the PDF input
                                document could not successfully be opened), no TETML output will be
                                created.

                version         (Integer) Version number of the DTD or schema for the generated TETML
                                output (default: 3):
                                2           Use the DTD for TET 2.x (which uses version 1 internally)
                                3           Use the schema for TET 3.0

unknownchar     (Unichar) The character to be used as a replacement for unknown characters which
                cannot be mapped to Unicode (see Section 6.5, »Unicode Pipeline«, page 68).
                Default: U+FFFD (Replacement Character)

usehostfonts    (Boolean) If true, data for fonts which are not embedded, but are required for
                determining Unicode mappings will be searched on the Mac or Windows host operating
                system. Default: true

1. See footnote 2 in Table 10.4<br />
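
The tetml suboptions in Table 10.3 are passed together as one nested option list. The following Python sketch only assembles such a string and does not call TET. The helper name tetml_optlist is hypothetical, and the brace-nesting syntax (key={...}) is assumed to follow the usual PDFlib option list conventions; check the option list grammar for your TET version before relying on it.

```python
def tetml_optlist(filename=None, elements=None, encodingname=None, version=None):
    """Assemble a 'tetml={...}' option string from the suboptions in Table 10.3.

    Hypothetical helper: it only formats text, it does not validate the
    values against TET and does not call any TET function.
    """
    parts = []
    if filename is not None:
        # Braces keep file names with spaces together (assumed optlist syntax).
        parts.append("filename={%s}" % filename)
    if elements:
        # elements maps docinfo/docxmp/options to booleans.
        flags = " ".join(
            "%s=%s" % (name, "true" if enabled else "false")
            for name, enabled in elements.items()
        )
        parts.append("elements={%s}" % flags)
    if encodingname is not None:
        parts.append("encodingname=%s" % encodingname)
    if version is not None:
        parts.append("version=%d" % version)
    return "tetml={%s}" % " ".join(parts)


optlist = tetml_optlist(
    filename="out.tetml",
    elements={"docinfo": True, "docxmp": False, "options": False},
    encodingname="UTF-8",
    version=3,
)
print(optlist)
```

The resulting string would be supplied as the option list argument of TET_open_document( ). Omitting filename corresponds to the in-memory case described in the table, where the output is retrieved with TET_get_xml_data( ).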

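
According to the encodingname description in Table 10.3, TET always writes UTF-8, and the client must convert the output itself if it requests any other encoding name. A minimal sketch of that conversion step in Python follows. The function name reencode_tetml and the sample XML fragment are made up for illustration, and the sketch assumes the encoding name given to TET is also a name that Python's codec registry recognizes.

```python
import codecs


def reencode_tetml(utf8_bytes, target_encoding):
    """Convert UTF-8 bytes (as written by TET) to the declared encoding.

    Hypothetical helper. Raises LookupError for an unknown encoding name and
    UnicodeEncodeError if a character cannot be represented in the target.
    """
    codecs.lookup(target_encoding)      # fail early on unknown encoding names
    text = utf8_bytes.decode("utf-8")   # the generated output is always UTF-8
    return text.encode(target_encoding)


# Made-up stand-in for output created with encodingname=ISO-8859-1;
# this is not real TETML.
sample_text = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<Sample>Grüße</Sample>'
sample_utf8 = sample_text.encode("utf-8")

converted = reencode_tetml(sample_utf8, "iso-8859-1")
print(len(sample_utf8), len(converted))
```

Because strict encoding is used, text containing characters outside the target encoding makes the conversion fail rather than silently lose data, which is one reason to prefer the default UTF-8 where possible.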
