17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual

6.5 Content Analysis

PDF documents provide the semantics (Unicode mapping) of individual text characters as well as their position on the page. However, they generally do not convey information about words, lines, columns or other high-level text units. The fragments comprising text on a page may contain individual characters, syllables, words, lines, or an arbitrary mixture thereof, without any explicit marks designating the start or end of a word, line, or column.

To make matters worse, the ordering of text fragments on the page may be different from the logical (reading) order. There are no rules for the order in which portions of text are placed on the page. For example, a page containing two columns of text could be produced by creating the first line in the left column, followed by the first line of the right column, the second line of the left column, the second line of the right column, and so on. However, logical order requires all text in the left column to be processed before the text in the right column. Extracting text from such documents by simply replaying the instructions on the PDF page generally yields undesirable results since the logical structure of the text is lost.
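The two-column example above can be made concrete with a minimal sketch. It is not TET's algorithm: the fragment coordinates, the `Fragment` type, and the known gutter position `gutter_x` are assumptions made for illustration, and real pages require the column boundaries to be detected first.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the fragment
    y: float   # baseline; PDF user space has its origin at the bottom left
    text: str


# Hypothetical page: the producer wrote line 1 of both columns,
# then line 2 of both columns.
content_stream = [
    Fragment(50, 700, "Left column, line 1."),
    Fragment(320, 700, "Right column, line 1."),
    Fragment(50, 685, "Left column, line 2."),
    Fragment(320, 685, "Right column, line 2."),
]


def replay_order(fragments):
    """Return the text in the order it was placed on the page."""
    return [f.text for f in fragments]


def reading_order(fragments, gutter_x):
    """Sort by column first (left of the gutter, then right), then top to bottom.

    Assumes exactly two columns separated at the known position gutter_x.
    """
    def key(f):
        column = 0 if f.x < gutter_x else 1
        return (column, -f.y)  # larger y is higher on the page

    return [f.text for f in sorted(fragments, key=key)]


print(replay_order(content_stream))
print(reading_order(content_stream, gutter_x=300))
```

Replaying the content stream interleaves the two columns, while sorting by column and then by vertical position restores the reading order for this simple layout.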

TET’s content analysis engine analyzes the contents, position, and relationship of text fragments in order to achieve the following goals:

> create words from characters, and insert separator characters between words if desired

> remove redundant text, such as duplicates which are only present to create a shadow effect

> recombine the parts of hyphenated words which span more than one line

> identify text columns (zones)

> sort text fragments within a zone, as well as zones within a page

These operations, along with the options which provide some control over content processing, are discussed in more detail below.
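Two of the goals listed above, removing shadow duplicates and recombining hyphenated words, can be sketched with naive heuristics. The functions `remove_shadow_duplicates` and `dehyphenate`, the `(x, y, text)` tuples, and the `tolerance` value are hypothetical and only illustrate the idea; they should not be read as a description of how TET implements these steps.

```python
def remove_shadow_duplicates(fragments, tolerance=1.5):
    """Drop fragments that repeat an earlier fragment's text at almost the same position.

    fragments is a list of (x, y, text) tuples; tolerance is an assumed
    maximum shadow offset in page units.
    """
    kept = []
    for x, y, text in fragments:
        is_shadow = any(
            text == kept_text
            and abs(x - kept_x) <= tolerance
            and abs(y - kept_y) <= tolerance
            for kept_x, kept_y, kept_text in kept
        )
        if not is_shadow:
            kept.append((x, y, text))
    return kept


def dehyphenate(lines):
    """Join a word split by a trailing hyphen with the first word of the next line.

    Naive: every line-final hyphen is treated as a soft hyphen, so a genuine
    hyphen at the end of a line ("well-" / "known") is joined incorrectly.
    """
    words = []
    pending = ""  # first half of a word that was split at a line end
    for line in lines:
        line_words = line.split()
        if not line_words:
            continue
        if pending:
            line_words[0] = pending + line_words[0]
            pending = ""
        last = line_words[-1]
        if last.endswith("-") and len(last) > 1:
            pending = line_words.pop()[:-1]
        words.extend(line_words)
    if pending:
        words.append(pending + "-")  # nothing followed; keep the hyphen
    return words


print(remove_shadow_duplicates([(100, 500, "Title"), (101, 499, "Title"), (100, 480, "Body")]))
print(dehyphenate(["Content ana-", "lysis is en-", "abled."]))
```

A production implementation would need more evidence than a trailing hyphen or a small offset, for example font information or a dictionary, before merging or discarding text.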

Text granularity. The granularity option of TET_open_page() specifies the amount of text which is returned by a single call to TET_get_text():

> With granularity=glyph each fragment contains the result of mapping one glyph, which may be more than one character (e.g. for ligatures). In this mode content analysis is disabled. TET will return the original text fragments on the page in their original order. Although this is the fastest mode, it is only useful if the TET client intends to do sophisticated postprocessing (or is only interested in the text position, but not in its logical structure) since the text may be scattered all over the page.

> With granularity=word the Wordfinder algorithm will group characters into logical words. Each fragment contains a word. Isolated punctuation characters (comma, colon, question mark, quotes, etc.) are returned as separate fragments by default, while multiple sequential punctuation characters are grouped as a single word (e.g. a series of period characters which simulates a dotted line). However, punctuation treatment can be changed (see »Word boundary detection for Western text« below).

> With granularity=line the words identified by the Wordfinder are grouped into lines. If dehyphenation is enabled (which is the default) the parts of hyphenated words at the end of a line are combined, and the full dehyphenated word is part of the line.

> With granularity=page all words on the page are returned in a single fragment.
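The effect of the four granularity levels on the returned fragments can be pictured with a toy model. It does not call TET: `split_fragments` is a hypothetical helper that works on lines of text already in reading order, treats one character as one glyph, joins lines with a space in page mode (an assumption), and does not model dehyphenation or the original fragment order of glyph mode.

```python
import re


def split_fragments(page_lines, granularity):
    """Return the fragments a client would receive under this toy model."""
    if granularity == "glyph":
        # Simplification: one character per glyph, whitespace omitted.
        return [ch for line in page_lines for ch in line if not ch.isspace()]
    if granularity == "word":
        # A run of word characters is one fragment; a run of punctuation is
        # one fragment, so "," stands alone and "....." stays together.
        return [
            token
            for line in page_lines
            for token in re.findall(r"\w+|[^\w\s]+", line)
        ]
    if granularity == "line":
        return list(page_lines)
    if granularity == "page":
        return [" ".join(page_lines)]
    raise ValueError(f"unknown granularity: {granularity}")


page = ["Chapter 1 ..... 5", "Well, done"]
for level in ("glyph", "word", "line", "page"):
    print(level, split_fragments(page, level))
```

In the word case the isolated comma becomes its own fragment, while the dotted leader is kept as a single fragment, mirroring the default punctuation behavior described above.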
