17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


7.2 Unicode Preprocessing (Filtering)

TET applies several filters to remove text which is unlikely to be useful. These filters modify the text before applying any Unicode postprocessing steps. While some filters are always active, others require the Wordfinder and are therefore active only for granularity=word or above.

7.2.1 Filters for all Granularities

The following filters can be used with all granularities.

Text in unwieldy font sizes. Very small or very large text can optionally be ignored, e.g. large characters in the background of the page. The limits can be controlled with the fontsizerange page option. By default, text in all font sizes will be extracted.

The following page option limits the range of font sizes for extracted text from 10 to 50 points; text in other font sizes will be ignored:

fontsizerange={10 50}
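The effect of this option can be illustrated with a small, self-contained Python sketch. It does not call the TET API; the Glyph record and both helper functions are hypothetical names introduced only for illustration, and treating both limits as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical stand-in for one piece of extracted text."""
    text: str
    fontsize: float  # font size in points


def fontsizerange_option(minimum, maximum):
    """Build the page option string, e.g. 'fontsizerange={10 50}'."""
    return f"fontsizerange={{{minimum:g} {maximum:g}}}"


def filter_by_fontsize(glyphs, minimum, maximum):
    """Keep only glyphs whose size lies within the range.

    Assumption: both limits are inclusive.
    """
    return [g for g in glyphs if minimum <= g.fontsize <= maximum]


glyphs = [
    Glyph("DRAFT", 120.0),      # large background text
    Glyph("Body text", 11.0),
    Glyph("fine print", 4.0),
]
kept = filter_by_fontsize(glyphs, 10, 50)
option = fontsizerange_option(10, 50)
```

In a real application only the option string would be needed; it would be supplied as a page option when opening the page, and the filtering itself is done by TET.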

Invisible text. Invisible text (i.e. text with textrendering=3) is extracted by default. Note that text in PDF may be invisible for various reasons other than the textrendering property, e.g. the text color is identical to the background color, or the text is obscured by other objects on the page. The behavior described here relates only to text with textrendering=3. This PDF technique is commonly used for the results of OCR where the text sits invisibly »behind« the scanned raster image.

Invisible text can be identified with the textrendering member of the TET_char_info structure returned by TET_get_char_info() (see Table 10.15, page 179), or with the Glyph/@textrendering attribute in TETML.

Use the following page option if you want to ignore invisible text:<br />

ignoreinvisibletext=true<br />
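The following Python sketch mimics what this option does, under the assumption that each glyph is available as a record carrying its textrendering value. The dictionaries and function names are hypothetical and are not part of the TET API; the value 3 is the PDF text rendering mode for text that is neither filled nor stroked.

```python
INVISIBLE = 3  # PDF text rendering mode 3: neither fill nor stroke


def page_optlist(ignore_invisible):
    """Build the page option string for the invisible text filter."""
    return f"ignoreinvisibletext={'true' if ignore_invisible else 'false'}"


def drop_invisible(glyphs):
    """Remove glyphs drawn with textrendering=3.

    Text hidden for other reasons (same color as the background,
    covered by other objects) is deliberately left untouched.
    """
    return [g for g in glyphs if g["textrendering"] != INVISIBLE]


# Hypothetical glyph records: an OCR layer behind a scan plus a visible label
glyphs = [
    {"text": "scanned", "textrendering": 3},
    {"text": "words", "textrendering": 3},
    {"text": "Page 1", "textrendering": 0},
]
visible = drop_invisible(glyphs)
```

Note that applying such a filter to a scanned document with an OCR layer would discard all of the OCR text, which is usually the only text on the page.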

Completely ignore text with certain font names or font types. In some situations it may be useful to completely ignore text in one or more fonts specified by name, e.g. a symbolic font which does not contribute any meaningful text. As an alternative, the problematic fonts can also be specified by font type. This is mainly useful for Type 3 fonts which are sometimes used for ornaments. This filter can be controlled via the remove suboption of the glyphmapping document option.

For example, ignore all text in Type 3 fonts:

glyphmapping={{fonttype={Type3} remove}}

Ignore all text in the Webdings, Wingdings, Wingdings 2, and Wingdings 3 fonts:

glyphmapping={{fontname=Webdings remove} {fontname=Wingdings* remove}}

The conditions for font name and font type can also be combined, e.g. ignore text in all Type 3 fonts starting with the letter A:

glyphmapping={{fonttype={Type3} fontname=A* remove}}
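How the conditions in the examples above combine can be sketched in Python. This is only an illustration of the matching logic, not TET's implementation: the rule dictionaries and function names are hypothetical, and the trailing * wildcard is approximated with Python's fnmatchcase, which is an assumption about the exact wildcard semantics. Within one rule all conditions must hold; a font is removed if any rule matches.

```python
from fnmatch import fnmatchcase


def rule_matches(rule, fontname, fonttype):
    """Return True if every condition present in the rule holds."""
    if "fonttype" in rule and fonttype not in rule["fonttype"]:
        return False
    if "fontname" in rule and not fnmatchcase(fontname, rule["fontname"]):
        return False
    return True


def is_removed(rules, fontname, fonttype):
    """A font's text is dropped if at least one rule matches it."""
    return any(rule_matches(r, fontname, fonttype) for r in rules)


# Mirrors: {fontname=Webdings remove} {fontname=Wingdings* remove}
symbol_rules = [{"fontname": "Webdings"}, {"fontname": "Wingdings*"}]

# Mirrors: {fonttype={Type3} fontname=A* remove}
combined_rules = [{"fonttype": {"Type3"}, "fontname": "A*"}]

removed_wingdings2 = is_removed(symbol_rules, "Wingdings 2", "TrueType")
removed_helvetica = is_removed(symbol_rules, "Helvetica", "Type1")
```

The single pattern Wingdings* covers Wingdings, Wingdings 2, and Wingdings 3, which is why the example above needs only two rules for four fonts.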

