17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

7.4 Supplementary Characters and Surrogates

Supplementary characters outside Unicode's Basic Multilingual Plane (BMP), i.e. those with Unicode values above U+FFFF, cannot be expressed as a single UTF-16 value, but require a pair of UTF-16 values called a surrogate pair. Examples of supplementary characters include certain mathematical and musical symbols at U+1DXXX as well as thousands of CJK extension characters starting at U+20000.
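
The mapping from a supplementary character to its surrogate pair is plain arithmetic defined by the Unicode standard. The following minimal Python sketch illustrates it; the function name `to_surrogate_pair` is hypothetical and is not part of the TET API.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (leading, trailing) UTF-16 values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary character: U+%04X" % code_point)
    offset = code_point - 0x10000            # 20-bit value
    leading = 0xD800 + (offset >> 10)        # high surrogate: upper 10 bits
    trailing = 0xDC00 + (offset & 0x3FF)     # low surrogate: lower 10 bits
    return leading, trailing


# U+1D11E MUSICAL SYMBOL G CLEF
leading, trailing = to_surrogate_pair(0x1D11E)
print("%04X %04X" % (leading, trailing))  # prints "D834 DD1E"
```

Values in the BMP are rejected here because they fit into a single UTF-16 value and need no surrogates.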

TET interprets and maintains supplementary characters and provides access to the corresponding UTF-32 value even in language bindings where native Unicode strings support only UTF-16. The uv field returned by TET_get_char_info() for the leading surrogate value contains the corresponding UTF-32 value. This allows direct access to the UTF-32 value of a supplementary character even if you are working in a UTF-16 environment without any support for UTF-32.

Both leading (high) and trailing (low) surrogates are maintained: for a supplementary character, the string returned by TET_get_text() contains two UTF-16 values.
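
To illustrate how the two UTF-16 values of a surrogate pair relate to the single UTF-32 value, the following sketch recombines pairs by hand. It does not call TET; `utf16_units` and `code_points` are hypothetical helper names, and the input is an ordinary Python string standing in for extracted text.

```python
import struct
from typing import List


def utf16_units(text: str) -> List[int]:
    """Return the UTF-16 values that make up a string."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_points(units: List[int]) -> List[int]:
    """Combine surrogate pairs in a UTF-16 sequence into UTF-32 values."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_leading = 0xD800 <= unit <= 0xDBFF
        if is_leading and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Leading and trailing surrogate together form one character
            result.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP character (or an unpaired surrogate, passed through unchanged)
            result.append(unit)
            i += 1
    return result


units = utf16_units("A\U0001D11E")
print(["%04X" % u for u in units])                  # ['0041', 'D834', 'DD1E']
print(["U+%04X" % cp for cp in code_points(units)])  # ['U+0041', 'U+1D11E']
```

With the uv field described above this manual decoding should not be necessary; the sketch only shows what the UTF-32 value corresponds to.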

106 Chapter 7: Advanced Unicode Handling
