PDFlib Text Extraction Toolkit (TET) Manual
Table 10.3 Document options for TET_open_document( ) and TET_open_document_callback( )
option    description
tetml    (Option list) TETML output will be initiated, and can be created page by page with TET_process_page( ). The following suboptions are supported:
    elements    (List of Boolean) Specify whether certain TETML elements will be included in the output (default: all true):
        docinfo    The /TET/Document/DocInfo element
        docxmp     The /TET/Document/Metadata element
        options    The elements /TET/Document/Options and /TET/Document/Pages/Page/Options
    encodingname    (Keyword) The name to use in the XML encoding declaration of the text declaration of the generated TETML. The output itself will always be created in UTF-8, regardless of this name (default: UTF-8):
        _none    No encoding declaration will be created; the output will still be in UTF-8 format.
        UTF-8    The declaration encoding="UTF-8" will be created.
        Any other encoding name will be used literally in the encoding declaration. The client is responsible for supplying a suitable encoding name and converting the generated TETML (which is UTF-8) to the specified encoding after TET has finished writing the TETML output.
    filename    (String) The name of the TETML file. If no filename is supplied, output will be created in memory, and can be retrieved with TET_get_xml_data( ). If the function call fails (i.e. the PDF input document could not successfully be opened), no TETML output will be created.
    version    (Integer) Version number of the DTD or schema for the generated TETML output (default: 3):
        2    Use the DTD for TET 2.x (which uses version 1 internally)
        3    Use the schema for TET 3.0
unknownchar    (Unichar) The character to be used as a replacement for unknown characters which cannot be mapped to Unicode (see Section 6.5, »Unicode Pipeline«, page 68). Default: U+FFFD (Replacement Character)
usehostfonts    (Boolean) If true, data for fonts which are not embedded, but are required for determining Unicode mappings will be searched on the Mac or Windows host operating system. Default: true
1. See footnote 2 in Table 10.4<br />
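
The tetml suboptions above can be combined into one option list string, and the note on encodingname leaves the final conversion to the client. The following Python sketch illustrates both steps without calling TET itself. The helper names build_optlist and reencode_tetml are hypothetical, and the option list syntax (name=value pairs, nested lists in braces, values with whitespace wrapped in braces) is an assumption based on common PDFlib conventions; check it against the option list grammar in the manual before relying on it.

```python
def build_optlist(options):
    """Serialize a nested dict into an option list string (assumed syntax)."""
    parts = []
    for name, value in options.items():
        if isinstance(value, dict):
            # Nested option lists are wrapped in braces.
            parts.append(f"{name}={{{build_optlist(value)}}}")
        elif isinstance(value, bool):
            # Check bool before other scalars: bool is a subclass of int.
            parts.append(f"{name}={'true' if value else 'false'}")
        else:
            text = str(value)
            # Brace values that are empty or contain whitespace or '='.
            if text == "" or "=" in text or any(c.isspace() for c in text):
                text = "{" + text + "}"
            parts.append(f"{name}={text}")
    return " ".join(parts)


def reencode_tetml(utf8_bytes, encodingname):
    """Convert UTF-8 TETML bytes to the encoding named in the declaration.

    A leading UTF-8 byte order mark, if any, is dropped. Characters that the
    target encoding cannot represent become numeric character references.
    """
    text = utf8_bytes.decode("utf-8-sig")
    return text.encode(encodingname, errors="xmlcharrefreplace")


# Hypothetical document option list using the suboptions from Table 10.3.
docopts = build_optlist({
    "tetml": {
        "filename": "out.tetml",
        "version": 3,
        "encodingname": "ISO-8859-1",
        "elements": {"docinfo": True, "docxmp": False, "options": False},
    }
})
print(docopts)

# Stand-in for TETML content; real data would come from the TETML file or
# from memory after TET has finished its output.
sample = "<Text>caf\u00e9 \u20ac</Text>".encode("utf-8")
print(reencode_tetml(sample, "ISO-8859-1"))
```

The resulting string is meant to be passed as the option list argument of TET_open_document( ). The conversion step applies only when encodingname is something other than UTF-8 or _none, since TET always writes UTF-8 regardless of the declared name.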