17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

5.3 Recommendations for common Scenarios<br />

<strong>TET</strong> offers a variety of options which you can use to control various aspects of operation.<br />

In this section we provide some recommendations for typical <strong>TET</strong> application scenarios.<br />

Please refer to Chapter 10, »<strong>TET</strong> Library API Reference«, page 121, for details on the<br />

functions and options mentioned below.<br />

Optimizing performance. In some situations, particularly when indexing PDF for<br />

search engines, text extraction speed is crucial, and may play a more important role<br />

than optimal output. The default settings of <strong>TET</strong> have been selected to achieve the best<br />

possible output, but can be adjusted to speed up processing. Some tips for choosing options<br />

in <strong>TET</strong>_open_page( ) to maximize text extraction throughput:<br />

> docstyle=searchengine<br />

Several internal parameters will be set to speed up operation by reducing the output<br />

quality in a way which does not affect the indexing process for search engines.<br />

> skipengines={image}<br />

If image extraction is not required internal image processing can be skipped in order<br />

to speed up operation.<br />

> contentanalysis={merge=0}<br />

This will disable the expensive strip and zone merging step, and reduces processing<br />

times for typical files to ca. 60% compared to default settings. However, documents<br />

where the contents are scattered across the pages in arbitrary order may result in<br />

some text which is not extracted in logical order.<br />

> contentanalysis={shadowdetect=false}<br />

This will disable detection of redundant shadow and fake bold text, which can also<br />

reduce processing times.<br />

Words vs. line layout vs. reflowable text. Different applications will prefer different<br />

kinds of output (hyphenated words will always be dehyphenated with these settings):<br />

> Individual words (ignore layout): a search engine may not be interested in any layout-related<br />

aspects, but only the words comprising the text. In this situation use<br />

granularity=word in <strong>TET</strong>_open_page( ) to retrieve one word per call to <strong>TET</strong>_get_text( ).<br />

> Keep line layout: use granularity=page in <strong>TET</strong>_open_page( ) for extracting the full text<br />

contents of a page in a single call to <strong>TET</strong>_get_text( ). <strong>Text</strong> lines will be separated with a<br />

linefeed character to retain the existing line structure.<br />

> Reflowable text: in order to avoid line breaks and facilitate reflowing of the extracted<br />

text use contentanalysis={lineseparator=U+0020} and granularity=page in <strong>TET</strong>_open_<br />

page( ). The full page contents can be fetched with a single call to <strong>TET</strong>_get_text( ).<br />

Zones will be separated with a linefeed character, and a space character will be inserted<br />

between the lines in a zone.<br />

Writing a search engine or indexer. Indexers are usually not interested in the position<br />

of text on the page (unless they provide search term highlighting). In many cases they<br />

will tolerate errors which occur in Unicode mapping, and process whatever text contents<br />

they can get. Recommendations:<br />

> Use granularity=word in <strong>TET</strong>_open_page( ).<br />

> If the application knows how to process punctuation characters you can keep them<br />

with the adjacent text by setting the following page option:<br />

contentanalysis={punctuationbreaks=false}<br />

54 Chapter 5: Configuration

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!