
PDFlib Text Extraction Toolkit (TET) Manual



handles must be valid for the corresponding type of object, etc. Some examples for simple values (the first line shows a password string containing a blank character):

TET_open_document( ): password {secret string}

TET_open_document( ): lineseparator={ CRLF }
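The bracing rule for simple values can be sketched in a few lines of Python. This is only an illustration of the notation shown above: `format_option` is a hypothetical helper, not part of the TET API, and it does not handle values that themselves contain braces.

```python
def format_option(key: str, value: str) -> str:
    """Build one 'key=value' pair for an option list (hypothetical helper)."""
    # A value containing blanks must be enclosed in braces so that it is
    # read as a single simple value rather than as several tokens.
    if any(ch.isspace() for ch in value):
        value = "{" + value + "}"
    return f"{key}={value}"


print(format_option("password", "secret string"))  # password={secret string}
print(format_option("lineseparator", "CRLF"))      # lineseparator=CRLF
```

The resulting string would be passed as the option list argument of the function named in the examples above.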

List values. List values consist of multiple values, which may be simple values or list values in turn. Lists are bracketed with { and }. Example:

TET_set_option( ): searchpath={/usr/lib/tet d:\tet}

Note: The backslash \ character requires special handling in many programming languages.
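As an illustration of the backslash note, the following Python sketch builds the searchpath list from the example. The helper `format_list` is hypothetical, and it assumes that no list item contains blanks (such items would need their own braces).

```python
def format_list(key: str, items: list[str]) -> str:
    """Join items into a braced list value (hypothetical helper)."""
    return key + "={" + " ".join(items) + "}"


# In an ordinary Python string literal "\t" is a tab character, so the
# Windows path must be written as a raw string or with a doubled backslash.
wrong = "d:\tet"      # contains a tab, not a backslash
right = r"d:\tet"     # raw string: backslash preserved
assert right == "d:\\tet" and wrong != right

print(format_list("searchpath", ["/usr/lib/tet", right]))
# searchpath={/usr/lib/tet d:\tet}
```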

Rectangles. A rectangle is a list of four float values specifying the coordinates of the lower left and upper right corners of a rectangle. Rectangle coordinates will be interpreted in the standard or user coordinate system (see Section 6.3, »Page and Text Geometry«, page 64). Example:

TET_open_page( ): includebox = {{0 0 500 100} {0 500 500 600}}
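A rectangle list such as the includebox value above is a list of lists, which can be generated from coordinate tuples. The sketch below assumes the order (lower left x, lower left y, upper right x, upper right y); the helper names are hypothetical and only produce the option list string.

```python
from typing import Iterable, Tuple

# (llx, lly, urx, ury): lower left and upper right corners
Rect = Tuple[float, float, float, float]


def format_number(value: float) -> str:
    """Render a float compactly, e.g. 500.0 -> '500', 12.5 -> '12.5'."""
    return f"{value:.4f}".rstrip("0").rstrip(".")


def format_rect(rect: Rect) -> str:
    """One rectangle is a braced list of four float values."""
    return "{" + " ".join(format_number(v) for v in rect) + "}"


def format_includebox(rects: Iterable[Rect]) -> str:
    """Several rectangles form an outer list of rectangle lists."""
    return "includebox={" + " ".join(format_rect(r) for r in rects) + "}"


print(format_includebox([(0, 0, 500, 100), (0, 500, 500, 600)]))
# includebox={{0 0 500 100} {0 500 500 600}}
```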

Character references in option lists. Some environments require the programmer to write source code in 8-bit encodings. This makes it cumbersome to include isolated Unicode characters in 8-bit encoded text without changing all characters in the text to multi-byte encoding. In order to aid developers in this situation, TET supports character references, a method known from markup languages such as SGML and HTML.

TET supports all numeric character references and character entity references defined in HTML 4.0, but in option lists they must be used without the ’&’ and ’;’ decoration. Numeric character references can be supplied in decimal or hexadecimal notation for the character’s Unicode value. The following are examples for valid character references along with a description of the resulting character:

#173    soft hyphen

#xAD    soft hyphen

shy     soft hyphen
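To make the three notations concrete, the following Python sketch resolves an undecorated reference to its Unicode character. It only mirrors the notation described here and is not TET's own parser; `resolve_reference` is a hypothetical name, and the entity names are taken from Python's standard `html.entities` table of HTML 4 entities.

```python
from html.entities import name2codepoint  # HTML 4 entity names -> code points


def resolve_reference(ref: str) -> str:
    """Resolve a reference written without the '&' and ';' decoration."""
    if ref.startswith("#x"):
        return chr(int(ref[2:], 16))   # hexadecimal numeric reference
    if ref.startswith("#"):
        return chr(int(ref[1:], 10))   # decimal numeric reference
    return chr(name2codepoint[ref])    # character entity name, e.g. "shy"


for ref in ("#173", "#xAD", "shy"):
    print(ref, f"U+{ord(resolve_reference(ref)):04X}")
```

All three references resolve to U+00AD, the soft hyphen.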

In addition to the HTML-style references above, TET supports custom character entity names for control characters (see Table 10.1).

Table 10.1 Custom character entity names for control characters

Unicode character (VB equivalents)    custom entity name
U+0020                                SP, space
U+00A0                                NBSP, nbsp
U+0009 (VbTab)                        HT, hortab
U+002D                                HY, hyphen
U+00AD                                SHY, shy
U+000B                                VT, verttab
U+2028                                LS, linesep
U+000A (VbLf)                         LF, linefeed
U+000D (VbCr)                         CR, return
U+000D and U+000A (VbCrLf)            CRLF
U+0085                                NEL, newline
U+2029                                PS, parasep
U+000C (VbFormFeed)                   FF, formfeed
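For reference, the entity names in Table 10.1 can be written down as a lookup table. The Python dictionary below is a plain transcription of the table for illustration; `CUSTOM_ENTITIES` and `describe` are hypothetical names, not part of TET.

```python
# Custom entity names from Table 10.1 mapped to the characters they denote.
CUSTOM_ENTITIES = {
    "SP": "\u0020", "space": "\u0020",
    "NBSP": "\u00a0", "nbsp": "\u00a0",
    "HT": "\u0009", "hortab": "\u0009",
    "HY": "\u002d", "hyphen": "\u002d",
    "SHY": "\u00ad", "shy": "\u00ad",
    "VT": "\u000b", "verttab": "\u000b",
    "LS": "\u2028", "linesep": "\u2028",
    "LF": "\u000a", "linefeed": "\u000a",
    "CR": "\u000d", "return": "\u000d",
    "CRLF": "\u000d\u000a",
    "NEL": "\u0085", "newline": "\u0085",
    "PS": "\u2029", "parasep": "\u2029",
    "FF": "\u000c", "formfeed": "\u000c",
}


def describe(name: str) -> str:
    """Return the code points behind an entity name, e.g. 'U+000D U+000A'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in CUSTOM_ENTITIES[name])


print(describe("CRLF"))     # U+000D U+000A
print(describe("linesep"))  # U+2028
```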

Chapter 10: TET Library API Reference
