PDFlib Text Extraction Toolkit (TET) Manual
7.2 Unicode Preprocessing (Filtering)
TET applies several filters to remove text which is unlikely to be useful. These filters
modify the text before applying any Unicode postprocessing steps. While some filters
are always active, others require the Wordfinder and are therefore active only for
granularity=word or above.
7.2.1 Filters for all Granularities
The following filters can be used with all granularities.
Text in unwieldy font sizes. Very small or very large text can optionally be ignored, e.g.
large characters in the background of the page. The limits can be controlled with the
fontsizerange page option. By default, text in all font sizes is extracted.
The following page option restricts extraction to text with font sizes from 10 to
50 points; text in other font sizes is ignored:
fontsizerange={10 50}
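The option above can also be assembled programmatically. The following is a minimal sketch, not part of the TET API: the helper name fontsizerange_option is hypothetical, and it only builds the option string shown above so that it can be appended to the page option list used when opening a page.

```python
def fontsizerange_option(minimum: float, maximum: float) -> str:
    """Build a fontsizerange page option string (hypothetical helper, not a TET call)."""
    if minimum < 0 or maximum < minimum:
        raise ValueError("expected 0 <= minimum <= maximum")
    # %g drops a trailing ".0", so 10.0 is written as 10
    return "fontsizerange={%g %g}" % (minimum, maximum)


# Combine the filter with a granularity setting in a single page option list.
page_optlist = "granularity=word " + fontsizerange_option(10, 50)
print(page_optlist)
```

Keeping the limits as numbers until the last moment makes it easy to validate them before they reach the option list.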
Invisible text. Invisible text (i.e. text with textrendering=3) is extracted by default. Note
that text in a PDF may be invisible for reasons other than the textrendering property,
e.g. the text color is identical to the background color, or the text is obscured by
other objects on the page. The behavior described here relates only to text with
textrendering=3. This PDF technique is commonly used for OCR results, where the
text sits invisibly »behind« the scanned raster image.
Invisible text can be identified with the textrendering member of the TET_char_info
structure returned by TET_get_char_info() (see Table 10.15, page 179), or with the
Glyph/@textrendering attribute in TETML.
Use the following page option if you want to ignore invisible text:
ignoreinvisibletext=true
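If invisible text should be kept but handled separately (for example to compare an OCR layer with the visible text), the Glyph/@textrendering attribute can be evaluated after extraction. The following is a rough sketch under several assumptions: the XML fragment is hand-written and heavily simplified, not real TETML output; a missing textrendering attribute is treated as mode 0; and element names are matched without regard to XML namespace.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for glyph-level TETML (NOT real TET output).
SAMPLE = """<Page>
  <Glyph textrendering="3">s</Glyph>
  <Glyph textrendering="3">c</Glyph>
  <Glyph>A</Glyph>
  <Glyph textrendering="0">B</Glyph>
</Page>"""


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def split_by_visibility(xml_text: str):
    """Return (visible, invisible) text, where invisible means textrendering=3."""
    visible, invisible = [], []
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "Glyph":
            continue
        # Assumption: an absent attribute means rendering mode 0 (normal fill).
        mode = int(elem.get("textrendering", "0"))
        (invisible if mode == 3 else visible).append(elem.text or "")
    return "".join(visible), "".join(invisible)


visible, invisible = split_by_visibility(SAMPLE)
print(repr(visible), repr(invisible))
```

As noted above, this only detects text with textrendering=3; text hidden by color or by overlapping objects is not recognized this way.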
Completely ignore text with certain font names or font types. In some situations it
may be useful to completely ignore text in one or more fonts specified by name, e.g. a
symbolic font which does not contribute any meaningful text. Alternatively, the
problematic fonts can be specified by font type. This is mainly useful for Type 3
fonts, which are sometimes used for ornaments. This filter is controlled via the
remove suboption of the glyphmapping document option.
For example, ignore all text in Type 3 fonts:
glyphmapping={{fonttype={Type3} remove}}
Ignore all text in the Webdings, Wingdings, Wingdings 2, and Wingdings 3 fonts:
glyphmapping={{fontname=Webdings remove} {fontname=Wingdings* remove}}
The conditions for font name and font type can also be combined, e.g. ignore text in all
Type 3 fonts whose names start with the letter A:
glyphmapping={{fonttype={Type3} fontname=A* remove}}
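The nested braces in glyphmapping option lists are easy to mistype when several rules are combined. The following sketch builds the three option lists shown above from small pieces. The helper names remove_rule and glyphmapping_option are hypothetical and not part of TET, and the sketch assumes font names without spaces, as in the examples above.

```python
def remove_rule(fontname=None, fonttype=None) -> str:
    """Build one glyphmapping rule with the remove suboption (hypothetical helper)."""
    conditions = []
    if fonttype is not None:
        conditions.append("fonttype={%s}" % fonttype)
    if fontname is not None:
        # Assumption: the name contains no spaces; wildcards such as A* pass through.
        conditions.append("fontname=%s" % fontname)
    if not conditions:
        raise ValueError("a rule needs a font name, a font type, or both")
    return "{" + " ".join(conditions) + " remove}"


def glyphmapping_option(*rules: str) -> str:
    """Wrap one or more rules in a glyphmapping document option."""
    return "glyphmapping={" + " ".join(rules) + "}"


type3_only = glyphmapping_option(remove_rule(fonttype="Type3"))
dingbats = glyphmapping_option(
    remove_rule(fontname="Webdings"),
    remove_rule(fontname="Wingdings*"),
)
combined = glyphmapping_option(remove_rule(fontname="A*", fonttype="Type3"))

for optlist in (type3_only, dingbats, combined):
    print(optlist)
```

Since glyphmapping is a document option, the resulting string belongs in the option list supplied when the document is opened, not in the page option list.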