PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
5.3 Recommendations for common Scenarios<br />
<strong>TET</strong> offers a variety of options which you can use to control various aspects of operation.<br />
In this section we provide some recommendations for typical <strong>TET</strong> application scenarios.<br />
Please refer to Chapter 10, »<strong>TET</strong> Library API Reference«, page 121, for details on the<br />
functions and options mentioned below.<br />
Optimizing performance. In some situations, particularly when indexing PDF for<br />
search engines, text extraction speed is crucial, and may play a more important role<br />
than optimal output. The default settings of <strong>TET</strong> have been selected to achieve the best<br />
possible output, but can be adjusted to speed up processing. Some tips for choosing options<br />
in <strong>TET</strong>_open_page( ) to maximize text extraction throughput:<br />
> docstyle=searchengine<br />
Several internal parameters will be set to speed up operation by reducing the output<br />
quality in a way which does not affect the indexing process for search engines.<br />
> skipengines={image}<br />
If image extraction is not required internal image processing can be skipped in order<br />
to speed up operation.<br />
> contentanalysis={merge=0}<br />
This will disable the expensive strip and zone merging step, and reduces processing<br />
times for typical files to ca. 60% compared to default settings. However, documents<br />
where the contents are scattered across the pages in arbitrary order may result in<br />
some text which is not extracted in logical order.<br />
> contentanalysis={shadowdetect=false}<br />
This will disable detection of redundant shadow and fake bold text, which can also<br />
reduce processing times.<br />
Words vs. line layout vs. reflowable text. Different applications will prefer different<br />
kinds of output (hyphenated words will always be dehyphenated with these settings):<br />
> Individual words (ignore layout): a search engine may not be interested in any layout-related<br />
aspects, but only the words comprising the text. In this situation use<br />
granularity=word in <strong>TET</strong>_open_page( ) to retrieve one word per call to <strong>TET</strong>_get_text( ).<br />
> Keep line layout: use granularity=page in <strong>TET</strong>_open_page( ) for extracting the full text<br />
contents of a page in a single call to <strong>TET</strong>_get_text( ). <strong>Text</strong> lines will be separated with a<br />
linefeed character to retain the existing line structure.<br />
> Reflowable text: in order to avoid line breaks and facilitate reflowing of the extracted<br />
text use contentanalysis={lineseparator=U+0020} and granularity=page in <strong>TET</strong>_open_<br />
page( ). The full page contents can be fetched with a single call to <strong>TET</strong>_get_text( ).<br />
Zones will be separated with a linefeed character, and a space character will be inserted<br />
between the lines in a zone.<br />
Writing a search engine or indexer. Indexers are usually not interested in the position<br />
of text on the page (unless they provide search term highlighting). In many cases they<br />
will tolerate errors which occur in Unicode mapping, and process whatever text contents<br />
they can get. Recommendations:<br />
> Use granularity=word in <strong>TET</strong>_open_page( ).<br />
> If the application knows how to process punctuation characters you can keep them<br />
with the adjacent text by setting the following page option:<br />
contentanalysis={punctuationbreaks=false}<br />
54 Chapter 5: Configuration