17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


6.7 Layout Analysis

TET analyses the layout of text on the page in order to determine the best possible order of text extraction. This automatic process can be assisted by several options. If you have advance knowledge of the nature of the processed documents, you can improve the text extraction results by supplying suitable options.

Document styles. Several internal parameters are available for processing documents of different layout and style. For example, newspaper pages tend to contain lots of text in multiple columns, while business reports often contain comments in the margins, etc. TET contains predefined settings for several types of document. These settings can be activated with an option list for TET_open_page() which looks similar to the following:

docstyle=papers

The following types are available for the docstyle option (Table 6.3 contains typical examples for some document styles):

> Book: typical book layouts with regular pages
> Business: business documents
> Fancy: fancy pages with complex and sometimes irregular layout
> Forms: structured forms
> Generic: the most general document class without any further qualification
> Magazines: magazine articles, usually with three or more columns and interspersed images and graphics
> Papers: newspapers with many columns, large pages and small type
> Science: scientific articles, usually with two or more columns and interspersed images, formulae, tables, etc.
> Search engine: this class does not refer to a specific type of input document, but rather optimizes TET for the typical requirements of indexers for search engines. Some layout detection features will be disabled to deliver only the raw text and speed up processing. For example, table and page structure recognition will be disabled.

Choosing the most appropriate document style for your documents can speed up processing and enhance text extraction results.
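
As an illustration of how a caller might assemble the docstyle option list, here is a minimal Python sketch. The helper `page_optlist` is hypothetical and not part of TET. The keyword spellings are an assumption derived from the list above (lowercased, spaces removed, in the manner of `docstyle=papers`) and should be checked against the option reference.

```python
# Hypothetical helper, not part of the TET API.
# Keyword spellings are ASSUMED from the list above (lowercased, spaces removed).
DOCSTYLES = {
    "book", "business", "fancy", "forms", "generic",
    "magazines", "papers", "science", "searchengine",
}


def page_optlist(docstyle: str) -> str:
    """Return an option list string such as 'docstyle=papers'."""
    key = docstyle.strip().lower().replace(" ", "")
    if key not in DOCSTYLES:
        raise ValueError(f"unknown document style: {docstyle!r}")
    return f"docstyle={key}"


if __name__ == "__main__":
    print(page_optlist("Papers"))
```

The resulting string would then be supplied as the option list for TET_open_page(). Rejecting unknown names early keeps a misspelled style from being passed on unnoticed.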

Table detection. TET detects tabular structures on the page and structures the table contents in rows, columns and cells. Information about tables detected on the page is not provided directly by the API, but is only available in TETML output as in the following example:

[TETML example not recoverable: the XML markup was stripped during extraction]
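
Because table information is available only in TETML, a client would read rows and cells from the XML output. The following Python sketch shows one way to do that. It assumes that tables are expressed with nested `Table`, `Row`, `Cell` and `Text` elements; the sample fragment is invented for illustration and is much simpler than real TETML output, which also carries namespaces and geometry.

```python
import xml.etree.ElementTree as ET

# Invented, simplified fragment; element names are an ASSUMPTION about TETML.
SAMPLE = """<Table>
  <Row>
    <Cell><Para><Word><Text>Item</Text></Word></Para></Cell>
    <Cell><Para><Word><Text>Qty</Text></Word></Para></Cell>
  </Row>
  <Row>
    <Cell><Para><Word><Text>Widget</Text></Word></Para></Cell>
    <Cell><Para><Word><Text>5</Text></Word></Para></Cell>
  </Row>
</Table>"""


def local(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def table_rows(xml_text: str) -> list:
    """Return one list of rows per table; each row is a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in (e for e in root.iter() if local(e.tag) == "Table"):
        rows = []
        for row in (e for e in table if local(e.tag) == "Row"):
            cells = []
            for cell in (e for e in row if local(e.tag) == "Cell"):
                words = [t.text or "" for t in cell.iter() if local(t.tag) == "Text"]
                cells.append(" ".join(words))
            rows.append(cells)
        tables.append(rows)
    return tables


if __name__ == "__main__":
    for row in table_rows(SAMPLE)[0]:
        print(row)
```

Matching on local names keeps the sketch independent of the exact namespace URI, which varies between schema versions.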
