17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Table 10.5 Options for TET_convert_to_unicode( )

option
    description

charref
    (Boolean) If true, enable substitution of numeric and character entity references and glyph name references. Default: false

bom
    (Keyword; ignored for outputformat=utf32; for Unicode-aware language bindings only none is allowed) Policy for adding a byte order mark (BOM) to the output string. Supported keywords (default: none):
        add       Add a BOM.
        keep      Add a BOM if the input string has a BOM.
        none      Don’t add a BOM.
        optimize  Add a BOM except if outputformat=utf8 or ebcdicutf8 and the output string contains only characters in the range < U+007F.

errorpolicy
    (Keyword) Behavior in case of conversion errors (default: exception):
        return     The replacement character U+FFFD will be used if a character reference cannot be resolved or a builtin code or glyph ID doesn’t exist in the specified font. An empty string will be returned in case of conversion errors.
        exception  An exception will be thrown in case of conversion errors.

escapesequence
    (Boolean) If true, enable substitution of escape sequences in strings. Default: false

inflate
    (Boolean; only for inputformat=utf8; will be ignored if outputformat=utf8) If true, an invalid UTF-8 input string will not trigger an exception, but rather an inflated byte string in the specified output format will be generated. This may be useful for debugging. Default: false

outputformat
    (Keyword) Unicode text format of the generated string: utf8, ebcdicutf8, utf16, utf16le, utf16be, utf32. An empty string is equivalent to utf16. Default: utf16
    Unicode-aware language bindings: the output format will be forced to utf16.
    C++ language binding: only the following output formats are allowed: ebcdicutf8, utf8, utf16, utf32.
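
The options in Table 10.5 are passed to TET_convert_to_unicode( ) as an option list. As a rough illustration, the following Python sketch checks a set of options against the keywords and Boolean options listed in the table and joins them into one string. The helper name build_optlist and the space-separated key=value layout of the result are assumptions made for this example; the helper is not part of TET, and the sketch does not call the library.

```python
# Keyword options and their supported values, as listed in Table 10.5.
KEYWORD_OPTIONS = {
    "bom": {"add", "keep", "none", "optimize"},
    "errorpolicy": {"return", "exception"},
    "outputformat": {"utf8", "ebcdicutf8", "utf16", "utf16le", "utf16be", "utf32"},
}

# Boolean options listed in Table 10.5.
BOOLEAN_OPTIONS = {"charref", "escapesequence", "inflate"}


def build_optlist(**options):
    """Hypothetical helper (not part of TET): validate options against
    Table 10.5 and return them as a 'key=value key=value' string.

    The space-separated key=value layout is an assumption of this sketch.
    """
    parts = []
    for name, value in options.items():
        if name in BOOLEAN_OPTIONS:
            if not isinstance(value, bool):
                raise ValueError(f"{name} expects a Boolean, got {value!r}")
            parts.append(f"{name}={'true' if value else 'false'}")
        elif name in KEYWORD_OPTIONS:
            if value not in KEYWORD_OPTIONS[name]:
                raise ValueError(f"unsupported keyword for {name}: {value!r}")
            parts.append(f"{name}={value}")
        else:
            raise ValueError(f"unknown option: {name!r}")
    return " ".join(parts)


if __name__ == "__main__":
    # Request UTF-8 output without a BOM, using U+FFFD instead of exceptions.
    print(build_optlist(outputformat="utf8", bom="none", errorpolicy="return"))
```

Rejecting unknown names and keywords before the string is built catches typos such as bom=always early. The sketch does not model the interactions noted in the table, for example that bom is ignored for outputformat=utf32 or that inflate applies only to inputformat=utf8.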

Chapter 10: TET Library API Reference
