17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Options for text filtering. There are several situations where <strong>TET</strong> will modify the actual<br />

character values found on the page in order to make the results more useful. Most of<br />

these steps can be controlled via options. The following list gives an overview of all operations<br />

which may modify the text:<br />

> Dehyphenation will remove hyphen characters and combine the parts of a hyphenated<br />

word. This can be disabled with the dehyphenate suboption of the<br />

contentanalysis option for <strong>TET</strong>_open_page( ).<br />

> Redundant text which creates only visual artifacts such as shadow effects or artificial<br />

bold text will be removed. This can be disabled with the shadowdetect suboption of<br />

the contentanalysis option for <strong>TET</strong>_open_page( ).<br />

> Very small or very large text can be ignored. The limits can be controlled with the<br />

fontsizerange option of <strong>TET</strong>_open_page( ).<br />

> Unicode post-processing will replace certain Unicode characters with more familiar<br />

ones. For example, Latin ligatures will be replaced with their constituent characters,<br />

and fullwidth ASCII variants in CJK fonts will be replaced with the corresponding<br />

non-fullwidth characters. For details see Table 6.2, page 68.<br />

> Invisible text (text with textrendering=3) will be extracted by default, but this can be<br />

changed with the ignoreinvisibletext option of <strong>TET</strong>_open_page( ).<br />

> Glyphs which cannot be mapped to Unicode will be replaced with the Unicode character<br />

defined in the unknownchar option of <strong>TET</strong>_open_document( ). See section »Unmappable<br />

glyphs«, page 69.<br />

Characters outside the BMP and surrogate handling. Characters outside Unicode’s Basic<br />

Multilingual Plane (BMP), i.e. those with Unicode values above 0xFFFF, cannot be expressed<br />

as a single UTF-16 value, but require a pair of UTF-16 values called a surrogate<br />

pair. Examples of characters outside the BMP include certain mathematical and musical<br />

symbols at U+1DXXX as well as thousands of CJK extension characters starting at<br />

U+20000.<br />

<strong>TET</strong> interprets and maintains surrogates, and allows access to the corresponding<br />

UTF-32 value even in programming languages where native Unicode strings support<br />

only UTF-16. Leading (high) surrogates and trailing (low) surrogates will be maintained.<br />

The string returned by <strong>TET</strong>_get_text( ) will return two UTF-16 values where the leading<br />

value will be treated as a real character which carries the glyph properties (size, font,<br />

etc.). The trailing value will be treated as an artificial character without any corresponding<br />

glyph.<br />

The uv field returned by <strong>TET</strong>_get_char_info( ) for the leading surrogate value will contain<br />

the corresponding UTF-32 value. This allows direct access to the UTF-32 value of a<br />

non-BMP character even if you are working in an UTF-16 environment without any support<br />

for UTF-32. The uv field for the trailing surrogate value will be 0.<br />

Unmappable glyphs. There are several reasons why text in a PDF cannot reliably be<br />

mapped to Unicode. If the document does not contain a ToUnicode CMap with Unicode<br />

mapping information for the codes used on the page, Unicode values can be missing for<br />

various reasons:<br />

> Type 1 fonts may contain unknown glyph names, and TrueType, OpenType, or CID<br />

fonts may be addressed with glyph ids without any Unicode values in the font or<br />

PDF. By default, <strong>TET</strong> will assign the Unicode Replacement Character U+FFFD to these<br />

characters. You can select a different replacement character (e.g. the space character<br />

6.5 Unicode Pipeline 69

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!