17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

6.5 Unicode Pipeline<br />

<strong>TET</strong> is completely based on the Unicode standard which is only concerned about characters,<br />

but not about glyphs. While text in PDF can be represented with a variety of font<br />

and encoding schemes, <strong>TET</strong> will abstract from glyphs and normalize all text to Unicode<br />

characters, regardless of the original text representation in the PDF. Converting the information<br />

found in the PDF to the corresponding Unicode values is called Unicode<br />

mapping, and is crucial for understanding the semantics of the text (as opposed to rendering<br />

a visual representation of the text on screen or paper). In order to provide proper<br />

Unicode mapping <strong>TET</strong> consults various data structures which are found in the PDF document,<br />

embedded or external font files, as well as builtin and user-supplied tables. In<br />

addition, it applies several methods to determine the Unicode mapping for non-standard<br />

glyph names.<br />

However, despite all efforts there are still PDF documents where some text cannot be<br />

mapped to Unicode. In order to deal with these cases <strong>TET</strong> offers a number of configuration<br />

features which can be used to control Unicode mapping for problematic PDF files.<br />

These features are discussed in Section 6.7, »Layout Analysis«, page 74.<br />

Post-processing for certain Unicode values. In some cases the Unicode values which<br />

are determined as a result of font and encoding processing will be modified by a postprocessing<br />

step, e.g. to split ligatures. Table 6.2 lists all Unicode values which are affected<br />

by post-processing.<br />

Table 6.2 Post-processing for various Unicode values<br />

UTF-16 values<br />

U+0000-U+001F,<br />

U+007F-U+009F<br />

U+0020, U+00A0,<br />

U+3000, U+2000-U+200B<br />

U+D800 - U+DBFF (high)<br />

U+DC00 - U+DFFF (low)<br />

surrogates<br />

U+E000-U+F8FF<br />

(Private Use Area, PUA)<br />

Processing<br />

Control characters will be removed. 1 Mapping to U+0000 may be useful for eliminating<br />

unwanted characters.<br />

Space characters will be mapped to U+0020.<br />

Leading (high) surrogates and trailing (low) surrogates will be maintained in the UTF-16<br />

output, and the corresponding UTF-32 value will be available in the uv field (see »Characters<br />

outside the BMP and surrogate handling«, page 69).<br />

PUA characters will be kept or replaced according to the keeppua option (see Section »Unmappable<br />

glyphs«).<br />

U+F600-U+F8FF<br />

(Adobe CUS)<br />

Will be mapped to the corresponding characters outside the CUS.<br />

U+FB00-U+FB17 Latin and Armenian ligatures will be decomposed into their constituent characters. 2<br />

U+FF01-U+FF5E,<br />

U+FFE0-U+FFE6<br />

U+FE30-U+FE6F<br />

undefined Unicode values<br />

Fullwidth ASCII and symbol variants will be mapped to the corresponding non-fullwidth<br />

characters. 3<br />

CJK compatibility forms (prerotated glyphs for vertical text) and small form variants<br />

(U+FE30-U+FE6F) will be mapped to the corresponding normal variants<br />

no modification<br />

1. Characters inserted via the wordseparator, lineseparator options are not subject to this removal.<br />

2. Ligatures in the Arabic and Hebrew presentation forms will not be decomposed.<br />

3. The following characters will be left unchanged: halfwidth CJK punctuation (U+FF61-U+FF64), Katakana variants (U+FF65-U+FF9F),<br />

Hangul variants (U+FFA0-U+FFDC), and symbol variants (U+FFE8-U+FFEE).<br />

68 Chapter 6: <strong>Text</strong> <strong>Extraction</strong>

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!