17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual


6.6 Layout Analysis

TET analyses the layout of text on the page in order to determine the best possible order of text extraction. This automatic process can be assisted by several options. If you have advance knowledge of the nature of the processed documents, you can improve the text extraction results by supplying suitable options.

Document styles. Several internal parameters are available for processing documents of different layout and style. For example, newspaper pages tend to contain lots of text in multiple columns, while business reports often contain comments in the margins, etc. TET contains predefined settings for several types of document. These settings can be activated with an option list for TET_open_page( ) which looks similar to the following:

docstyle=papers<br />

If the type of input documents is known, it is strongly recommended to supply suitable values of the docstyle page option and (if applicable) also the layouthint page option. Supplying the docstyle option activates an advanced layout recognition algorithm. However, supplying an unsuitable value for this option may actually produce worse results than supplying none.

The following types are available for the docstyle option (Table 6.4 contains typical examples for some document styles):

> Book: typical book layouts with regular pages
> Business: business documents
> Cad: technical or architectural drawings which are typically heavily fragmented
> Fancy: fancy pages with complex and sometimes irregular layout
> Forms: structured forms
> Generic: the most general document class without any further qualification
> Magazines: magazine articles, usually with three or more columns and interspersed images and graphics
> Papers: newspapers with many columns, large pages and small type
> Science: scientific articles, usually with two or more columns and interspersed images, formulae, tables, etc.
> Search engine: this class does not refer to a specific type of input document, but rather optimizes TET for the typical requirements of indexers for search engines. Some layout detection features are disabled to deliver only the raw text and speed up processing. For example, table and page structure recognition are disabled.
> Space grid: this class is targeted at list-oriented reports which are often generated on mainframe systems. The characteristic of this document class is that the visual layout is generated with space characters instead of explicit positioning of text. When processing this kind of document, text extraction can be accelerated since some processing steps (e.g. shadow detection) can be skipped.

Choosing the most appropriate document style can speed up processing and enhance text extraction results.
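The selection of a document style can be sketched in code. The following Python fragment is a minimal illustration only: docstyle_optlist is a hypothetical helper that is not part of TET, and it merely builds the option list string that would be handed to TET_open_page( ). The keyword spellings are assumed to be the lowercase, single-word forms of the names listed above (as in docstyle=papers); in particular, searchengine and spacegrid are assumptions that should be checked against the option reference.

```python
# Docstyle keywords derived from the list in this section.
# Assumption: option values are the lowercase, single-word forms of the
# listed names (e.g. "Search engine" -> "searchengine").
DOCSTYLES = frozenset({
    "book", "business", "cad", "fancy", "forms", "generic",
    "magazines", "papers", "science", "searchengine", "spacegrid",
})


def docstyle_optlist(style: str) -> str:
    """Return a page option list such as 'docstyle=papers'.

    Hypothetical helper, not part of TET: it only validates the name and
    builds the string intended for the option list of TET_open_page().
    """
    # Normalize user input: ignore case, surrounding and inner spaces.
    keyword = style.strip().lower().replace(" ", "")
    if keyword not in DOCSTYLES:
        # Reject unknown names early, since an unsuitable docstyle
        # may make extraction results worse rather than better.
        raise ValueError(f"unknown docstyle: {style!r}")
    return f"docstyle={keyword}"


if __name__ == "__main__":
    print(docstyle_optlist("papers"))      # prints docstyle=papers
    print(docstyle_optlist("Space grid"))  # prints docstyle=spacegrid
```

Rejecting unknown names before the page is opened keeps a typing error from silently selecting no style at all; whether a given style actually suits the input still has to be judged from the documents themselves.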

Complex layouts. Some classes of documents often use very elaborate page layouts. For example, with magazines and periodicals TET may not be able to properly determine the relationship of columns on the page. In such situations it is possible to en-
