17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Composite characters and sequences. Some glyphs map to a sequence of multiple characters. For example, ligatures will be mapped to multiple characters according to their constituent characters. However, composite characters (such as the Roman numeral in Figure 7.1) may or may not be split, subject to information in the font and PDF as well as the decompose document option (see Section 7.3, »Unicode Postprocessing«, page 97).
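As a rough illustration of what splitting a composite character means at the Unicode level, the sketch below uses Python's standard `unicodedata` module to apply compatibility decomposition (NFKD) to a ligature and a Roman numeral. This is only an analogy: the helper name `split_composite` is invented here, and whether TET splits a given glyph depends on the font, the PDF, and the decompose option rather than on this normalization call.

```python
import unicodedata


def split_composite(ch: str) -> str:
    """Return the Unicode compatibility decomposition (NFKD) of a character.

    Hypothetical helper for illustration; it does not call TET.
    """
    return unicodedata.normalize("NFKD", ch)


ligature_fi = "\uFB01"   # LATIN SMALL LIGATURE FI
roman_eight = "\u2167"   # ROMAN NUMERAL EIGHT

print(split_composite(ligature_fi))   # fi
print(split_composite(roman_eight))   # VIII
```

A single code point on input becomes two or four characters on output, which is the kind of one-to-many mapping described above.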

If appropriate, TET will split composite characters into a sequence of constituent characters. The corresponding sequence will be part of the text returned by TET_get_text(). For each character, details of the underlying glyph(s) can be obtained via TET_get_char_info(), including whether the character is the start or continuation of a sequence. Position information will only be returned for the first character of a sequence. Subsequent characters of a sequence will not have any associated position or width information, but must be processed in combination with the first character.
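The rule that continuation characters must be processed together with the first character of their sequence can be sketched as follows. The example does not call TET: `CharRecord`, the `kind` labels, and `group_sequences` are hypothetical stand-ins for the per-character details that TET_get_char_info() reports, and the coordinates are made-up sample values.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical labels for this sketch; they are not TET's own type codes.
NORMAL, SEQ_START, SEQ_CONT = "normal", "start", "continuation"


@dataclass
class CharRecord:
    text: str
    kind: str
    x: Optional[float] = None      # position is absent for continuations
    y: Optional[float] = None
    width: Optional[float] = None  # width is absent for continuations


Group = Tuple[str, Optional[float], Optional[float], Optional[float]]


def group_sequences(chars: List[CharRecord]) -> List[Group]:
    """Attach each continuation character to the preceding start character.

    Every resulting group carries the position and width of its first
    character, since continuations have none of their own.
    """
    groups: List[Group] = []
    for c in chars:
        if c.kind == SEQ_CONT:
            if not groups:
                raise ValueError("continuation without a preceding character")
            text, x, y, width = groups[-1]
            groups[-1] = (text + c.text, x, y, width)
        else:
            groups.append((c.text, c.x, c.y, c.width))
    return groups


# "office" set with an "ffi" ligature glyph (sample data).
chars = [
    CharRecord("o", NORMAL, 10.0, 700.0, 5.0),
    CharRecord("f", SEQ_START, 15.0, 700.0, 8.0),
    CharRecord("f", SEQ_CONT),
    CharRecord("i", SEQ_CONT),
    CharRecord("c", NORMAL, 23.0, 700.0, 4.5),
    CharRecord("e", NORMAL, 27.5, 700.0, 4.5),
]

for text, x, y, width in group_sequences(chars):
    print(text, x, y, width)
```

Grouping before any geometric processing (highlighting, bounding boxes) avoids treating the position-less continuation characters as if they sat at the origin.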

Characters without any corresponding glyph. Although every glyph on the page will be mapped to one or more corresponding Unicode characters, not all characters delivered by TET actually correspond to a glyph. Characters which correspond to a glyph are called real characters; all others are called artificial characters. There are several classes of artificial characters which will be delivered although a directly corresponding glyph is not available:

> A composite character (see above) will map to a sequence of multiple Unicode characters. While the first character in the sequence corresponds to the actual glyph, the remaining characters do not correspond to any glyph.

> Separator characters inserted via the lineseparator/wordseparator options are artifacts without any corresponding glyph.
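The distinction between real and artificial characters can be sketched as a simple classification. As before, this is an illustration under stated assumptions rather than TET code: the `kind` labels and the function names are hypothetical, and only the two classes of artificial characters listed above are modelled.

```python
from typing import List, Tuple

# Hypothetical labels for this sketch; they are not TET's own type codes.
NORMAL, SEQ_START, SEQ_CONT, SEPARATOR = (
    "normal", "start", "continuation", "separator")


def is_real(kind: str) -> bool:
    """A character is real if it directly corresponds to a glyph."""
    return kind in (NORMAL, SEQ_START)


def split_real_artificial(chars: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return (real characters, artificial characters) as two strings."""
    real = "".join(text for text, kind in chars if is_real(kind))
    artificial = "".join(text for text, kind in chars if not is_real(kind))
    return real, artificial


# "fi" ligature glyph, a "t" glyph, an inserted word separator, an "a" glyph.
chars = [
    ("f", SEQ_START),   # real: corresponds to the ligature glyph
    ("i", SEQ_CONT),    # artificial: remainder of the sequence
    ("t", NORMAL),      # real
    (" ", SEPARATOR),   # artificial: inserted separator
    ("a", NORMAL),      # real
]

real, artificial = split_real_artificial(chars)
print(repr(real), repr(artificial))   # 'fta' 'i '
print(len(real))                      # 3 glyphs on the page
```

Counting only the real characters gives the number of glyphs, while the full character list gives the extracted text.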

7.1 Important Unicode Concepts 93
