17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Characters and glyphs. When dealing with text it is important to clearly distinguish<br />

the following concepts:<br />

> Characters are the smallest units which convey information in a language. Common<br />

examples are the letters in the Latin alphabet, Chinese ideographs, and Japanese syllables.<br />

Characters have a meaning: they are semantic entities.<br />

> Glyphs are different graphical variants which represent one or more particular characters.<br />

Glyphs have an appearance: they are representational entities.<br />

There is no one-to-one relationship between characters and glyphs. For example, a ligature<br />

is a single glyph which is represented by two or more separate characters. On the<br />

other hand, a specific glyph may be used to represent different characters depending on<br />

the context (some characters look identical, see Figure 6.2).<br />

Composite characters and sequences. Some glyphs map to a sequence of multiple<br />

characters. For example, some ligatures will be mapped to multiple characters according<br />

to their constituent characters. However, composite characters (such as the Roman<br />

numeral in Figure 6.2) may or may not be split, subject to information in the font and<br />

PDF. See Table 6.2, page 68, for a list of characters which will be post-processed by <strong>TET</strong>.<br />

If appropriate, <strong>TET</strong> will split composite characters into a sequence of constituent<br />

characters. The corresponding sequence will be part of the text returned by <strong>TET</strong>_get_<br />

text( ). For each character details of the underlying glyph can be obtained via <strong>TET</strong>_get_<br />

char_info( ), including the information whether the character is the start or continuation<br />

of a sequence. Position information will only be returned for the first character of a<br />

sequence. Subsequent characters of a sequence will not have any associated position or<br />

width information, but must be processed in combination with the first character.<br />

Characters without any corresponding glyph. Although every glyph on the page will<br />

be mapped to one or more corresponding Unicode characters, not all characters delivered<br />

by <strong>TET</strong> actually correspond to a glyph. Characters which correspond to a glyph are<br />

called real characters, others are called artificial characters. There are several classes of<br />

Characters<br />

Glyphs<br />

U+0067 LATIN SMALL LETTER G<br />

U+0066 LATIN SMALL LETTER F +<br />

U+0069 LATIN SMALL LETTER I<br />

U+2126 OHM SIGN or<br />

U+03A9 GREEK CAPITAL LETTER OMEGA<br />

U+2167 ROMAN NUMERAL EIGHT or<br />

U+0056 V U+0049 I U+0049 I U+0049 I<br />

Fig. 6.2<br />

Relationship of glyphs<br />

and characters<br />

62 Chapter 6: <strong>Text</strong> <strong>Extraction</strong>

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!