17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


U+0020) for unmappable glyphs with the unknownchar option of TET_open_document( ). However, if the keeppua option of TET_open_document( ) is true, unknown characters will be mapped to increasing values in the Private Use Area (PUA), starting at U+F200. The same glyph name used in different fonts will end up with the same PUA value, while TrueType or OpenType glyph ids from different fonts will have different PUA values assigned.
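The assignment rule described above can be modeled in a few lines of Python. This is only a sketch of the described behavior, not TET's implementation: the class PuaAllocator, its method names, and the font and glyph names are hypothetical, and the sketch does not check for running past the end of the PUA.

```python
# Illustrative model only: NOT TET's implementation, just the allocation
# behavior described in the text, using a hypothetical class.
PUA_START = 0xF200  # first PUA value mentioned in the text


class PuaAllocator:
    """Hands out increasing PUA code points for unmappable glyphs."""

    def __init__(self, start=PUA_START):
        self._next = start
        self._assigned = {}

    def _allocate(self, key):
        # Reuse an existing assignment, otherwise take the next free value.
        if key not in self._assigned:
            self._assigned[key] = self._next
            self._next += 1
        return self._assigned[key]

    def for_glyph_name(self, font, glyph_name):
        # Glyph names are shared across fonts: the font is not part of the key.
        return self._allocate(("name", glyph_name))

    def for_glyph_id(self, font, glyph_id):
        # Glyph ids are font-specific: the font is part of the key.
        return self._allocate(("gid", font, glyph_id))


alloc = PuaAllocator()
a = alloc.for_glyph_name("FontA", "mylogo")  # hypothetical font/glyph names
b = alloc.for_glyph_name("FontB", "mylogo")
c = alloc.for_glyph_id("FontA", 42)
d = alloc.for_glyph_id("FontB", 42)
print(hex(a), hex(b), hex(c), hex(d))  # 0xf200 0xf200 0xf201 0xf202
```

The same glyph name yields the same value in both fonts, while the same glyph id in two fonts yields two different values.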

> If the font or PDF provides Unicode values, these may be contained in the Private Use Area (PUA). Since PUA characters are generally not very useful, TET will replace them with unknownchar (by default: U+FFFD). However, if the keeppua option of TET_open_document( ) is true, PUA values will be returned without any modification. This may be useful if you can deal with PUA values, e.g. for a specific font, or for all fonts from a specific font vendor.
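If PUA values are kept, client code has to interpret them itself. The following Python sketch shows one possible way to post-process extracted text. It assumes that only the Private Use Area of the Basic Multilingual Plane (U+E000 to U+F8FF) is of interest; the table VENDOR_PUA_MAP and its single entry are invented for illustration and do not describe any real font.

```python
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # BMP Private Use Area


def is_pua(ch):
    """True if ch lies in the BMP Private Use Area."""
    return PUA_FIRST <= ord(ch) <= PUA_LAST


# Hypothetical vendor table: PUA code point -> meaningful replacement.
VENDOR_PUA_MAP = {0xF8E5: "\u2713"}  # made-up entry for illustration


def resolve_pua(text, table=VENDOR_PUA_MAP, fallback="\uFFFD"):
    """Replace known PUA values via the table, all others by the fallback."""
    out = []
    for ch in text:
        if is_pua(ch):
            out.append(table.get(ord(ch), fallback))
        else:
            out.append(ch)
    return "".join(out)


print(resolve_pua("a\uF8E5b\uE000"))
```

In practice such a table would probably be selected per font or per font vendor, since PUA values carry no meaning outside that context.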

Since not all glyphs in a document may have proper Unicode values (e.g. custom symbols), TET may have to map some glyphs to unknownchar. Your code should be prepared for this character. If you don’t care about Unicode mapping problems you can simply ignore it, or use the unknownchar option of TET_open_document( ) to set a different character as a replacement for unmappable glyphs (e.g. the space character).
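Being prepared for the replacement character can be as simple as counting or removing it in the extracted text. The Python sketch below assumes the default replacement U+FFFD is in effect; the helper names unknown_stats and drop_unknown and the sample string are hypothetical.

```python
UNKNOWN_CHAR = "\uFFFD"  # default replacement named in the text


def unknown_stats(text, unknown=UNKNOWN_CHAR):
    """Return (count, ratio) of replacement characters in text."""
    count = text.count(unknown)
    ratio = count / len(text) if text else 0.0
    return count, ratio


def drop_unknown(text, unknown=UNKNOWN_CHAR):
    """Ignore unmappable glyphs by removing the replacement character."""
    return text.replace(unknown, "")


sample = "Price: 10 \uFFFD per unit"  # made-up extraction result
count, ratio = unknown_stats(sample)
print(count, ratio)
print(drop_unknown(sample))
```

A high ratio of replacement characters may indicate that a font lacks usable Unicode mapping information, in which case simply ignoring the character would lose a noticeable part of the text.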

In order to check for unmappable glyphs you can use the unknown field returned by TET_get_char_info( ).
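The following Python sketch illustrates the idea of checking a per-glyph flag instead of comparing characters. It does not call the TET library: the CharInfo record is a hypothetical stand-in that models only an unknown flag and an assumed uv field for the Unicode value, and the glyph list is made up.

```python
from dataclasses import dataclass


@dataclass
class CharInfo:
    """Hypothetical stand-in for per-glyph data; only two fields modeled."""
    uv: int        # Unicode value reported for the glyph (assumed field)
    unknown: bool  # True if the glyph could not be mapped


def unmappable_positions(infos):
    """Indices of glyphs flagged as unmappable."""
    return [i for i, ci in enumerate(infos) if ci.unknown]


# Made-up sequence: 'A', an unmappable glyph, 'B'
glyphs = [CharInfo(0x41, False), CharInfo(0xFFFD, True), CharInfo(0x42, False)]
print(unmappable_positions(glyphs))  # [1]
```

Relying on a flag rather than on the character value keeps the check valid when unknownchar has been set to an ordinary character such as the space.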

70 Chapter 6: Text Extraction
