PDFlib Text Extraction Toolkit (TET) Manual
Options for text filtering. There are several situations where TET will modify the actual
character values found on the page in order to make the results more useful. Most of
these steps can be controlled via options. The following list gives an overview of all operations
which may modify the text:
> Dehyphenation will remove hyphen characters and combine the parts of a hyphenated
word. This can be disabled with the dehyphenate suboption of the
contentanalysis option for TET_open_page( ).
> Redundant text which creates only visual artifacts such as shadow effects or artificial
bold text will be removed. This can be disabled with the shadowdetect suboption of
the contentanalysis option for TET_open_page( ).
> Very small or very large text can be ignored. The limits can be controlled with the
fontsizerange option of TET_open_page( ).
> Unicode post-processing will replace certain Unicode characters with more familiar
ones. For example, Latin ligatures will be replaced with their constituent characters,
and fullwidth ASCII variants in CJK fonts will be replaced with the corresponding
non-fullwidth characters. For details see Table 6.2, page 68.
> Invisible text (text with textrendering=3) will be extracted by default, but this can be
changed with the ignoreinvisibletext option of TET_open_page( ).
> Glyphs which cannot be mapped to Unicode will be replaced with the Unicode character
defined in the unknownchar option of TET_open_document( ). See section »Unmappable
glyphs«, page 69.
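The page-level options named in the list above are supplied as an option list string. The following Python sketch only assembles such a string and does not call TET. The helper name build_optlist is hypothetical, and the example values (font size limits of 4 and 72, and the true/false settings) are assumptions chosen for illustration; the exact value syntax should be checked against the option tables for TET_open_page( ).

```python
def build_optlist(options):
    """Format a nested dict as an option list string.

    Hypothetical helper for illustration only; it is not part of the TET API.
    """
    parts = []
    for key, value in options.items():
        if isinstance(value, dict):
            # Suboptions are grouped in braces: key={sub1=... sub2=...}
            parts.append(f"{key}={{{build_optlist(value)}}}")
        elif isinstance(value, bool):
            # Booleans are written as the keywords true/false
            parts.append(f"{key}={'true' if value else 'false'}")
        elif isinstance(value, (list, tuple)):
            # Lists are written as space-separated values in braces
            parts.append(f"{key}={{{' '.join(str(v) for v in value)}}}")
        else:
            parts.append(f"{key}={value}")
    return " ".join(parts)


# Assumed example: switch off dehyphenation and shadow detection,
# restrict the font size range, and skip invisible text.
page_options = build_optlist({
    "contentanalysis": {"dehyphenate": False, "shadowdetect": False},
    "fontsizerange": [4, 72],
    "ignoreinvisibletext": True,
})
print(page_options)
```

The resulting string would be passed as the option list argument when opening a page.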
Characters outside the BMP and surrogate handling. Characters outside Unicode’s Basic<br />
Multilingual Plane (BMP), i.e. those with Unicode values above 0xFFFF, cannot be expressed<br />
as a single UTF-16 value, but require a pair of UTF-16 values called a surrogate<br />
pair. Examples of characters outside the BMP include certain mathematical and musical<br />
symbols at U+1DXXX as well as thousands of CJK extension characters starting at<br />
U+20000.<br />
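The relationship between a non-BMP character and its surrogate pair is fixed by the UTF-16 encoding form. As a minimal sketch, independent of TET, the following Python functions convert between a code point and its pair; the function names are chosen for illustration only.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (leading, trailing) UTF-16 values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    leading = 0xD800 + (offset >> 10)      # high surrogate carries the upper 10 bits
    trailing = 0xDC00 + (offset & 0x3FF)   # low surrogate carries the lower 10 bits
    return leading, trailing


def from_surrogate_pair(leading, trailing):
    """Combine a leading and a trailing surrogate into the UTF-32 value."""
    if not (0xD800 <= leading <= 0xDBFF and 0xDC00 <= trailing <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((leading - 0xD800) << 10) + (trailing - 0xDC00)


# U+1D11E (a musical symbol) and U+20000 (first CJK extension B character)
clef_pair = to_surrogate_pair(0x1D11E)
cjk_pair = to_surrogate_pair(0x20000)
print([hex(v) for v in clef_pair])  # ['0xd834', '0xdd1e']
print([hex(v) for v in cjk_pair])   # ['0xd840', '0xdc00']
```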
TET interprets and maintains surrogates, and allows access to the corresponding
UTF-32 value even in programming languages where native Unicode strings support
only UTF-16. Both leading (high) and trailing (low) surrogates will be maintained.
The string returned by TET_get_text( ) will contain two UTF-16 values for each such
character: the leading value will be treated as a real character which carries the glyph
properties (size, font, etc.), and the trailing value will be treated as an artificial
character without any corresponding glyph.
The uv field returned by TET_get_char_info( ) for the leading surrogate value will contain
the corresponding UTF-32 value. This allows direct access to the UTF-32 value of a
non-BMP character even if you are working in a UTF-16 environment without any support
for UTF-32. The uv field for the trailing surrogate value will be 0.
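The convention for the uv field can be modeled without the library. The following Python sketch is a simulation, not TET code: the function char_records and its tuple layout are assumptions for illustration. It walks a sequence of UTF-16 values and pairs each one with the uv value that the description above leads one to expect.

```python
def char_records(utf16_units):
    """Return (utf16_value, uv) pairs following the convention described above.

    Simulation for illustration only; this does not call TET.
    """
    records = []
    i = 0
    while i < len(utf16_units):
        unit = utf16_units[i]
        is_leading = 0xD800 <= unit <= 0xDBFF
        has_trailing = (
            i + 1 < len(utf16_units)
            and 0xDC00 <= utf16_units[i + 1] <= 0xDFFF
        )
        if is_leading and has_trailing:
            trailing = utf16_units[i + 1]
            # The leading surrogate carries the full UTF-32 value ...
            uv = 0x10000 + ((unit - 0xD800) << 10) + (trailing - 0xDC00)
            records.append((unit, uv))
            # ... and the trailing surrogate is an artificial character with uv 0.
            records.append((trailing, 0))
            i += 2
        else:
            # BMP characters: the UTF-16 value and the UTF-32 value coincide.
            records.append((unit, unit))
            i += 1
    return records


# "A" followed by U+1D11E encoded as the surrogate pair D834 DD1E
records = char_records([0x0041, 0xD834, 0xDD1E])
for unit, uv in records:
    print(f"UTF-16 {unit:04X}  uv {uv:X}")
```

An application which counts characters rather than UTF-16 values could skip every record whose uv value is 0.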
Unmappable glyphs. Text in a PDF cannot always be mapped reliably to Unicode.
If the document does not contain a ToUnicode CMap with Unicode
mapping information for the codes used on the page, Unicode values can be missing for
several reasons:
> Type 1 fonts may contain unknown glyph names, and TrueType, OpenType, or CID
fonts may be addressed with glyph ids without any Unicode values in the font or
PDF. By default, TET will assign the Unicode Replacement Character U+FFFD to these
characters. You can select a different replacement character (e.g. the space character