17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual


Separator characters are inserted between multiple words, lines, or zones if the chosen granularity is larger than the respective unit. For example, with granularity=word there’s no need to insert separator characters since each call to TET_get_text( ) will return exactly one word.

The separator characters can be specified with the wordseparator and lineseparator options of TET_open_page( ) (use U+0000 to disable a separator), for example:

lineseparator=U+000A
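The effect of separators can be illustrated with a small toy model in Python. This is only a sketch of the behavior described above, not TET code: the function join_units and its default separator values (U+0020 between words, U+000A between lines) are assumptions made for the example. It joins already extracted words and lines, and treats U+0000 as "separator disabled".

```python
def join_units(lines, wordseparator="\u0020", lineseparator="\u000A"):
    """Toy model: join words into lines and lines into one text chunk.

    `lines` is a list of lines, each given as a list of words.
    A separator equal to U+0000 is treated as disabled, following the
    convention described in the text. Defaults are assumptions.
    """
    def effective(separator):
        # U+0000 means "insert nothing" between the units
        return "" if separator == "\u0000" else separator

    return effective(lineseparator).join(
        effective(wordseparator).join(words) for words in lines
    )


page = [["Hello", "world"], ["second", "line"]]
print(repr(join_units(page)))                          # default separators
print(repr(join_units(page, lineseparator="\u0000")))  # line separator disabled
```

With word granularity each returned chunk would correspond to a single entry of the inner lists, so no separator would be needed at all.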

By default, all content processing operations are disabled for granularity=glyph, and enabled for all other granularity settings. However, more fine-grained control is possible via separate options (see below).

Word boundary detection for Western text. The Wordfinder, which is enabled for all granularity modes except glyph, creates logical words from multiple glyphs which may be scattered all over the page in no particular order. Word boundaries for Western text are identified by two criteria:

> A sophisticated algorithm analyzes the geometric relationship among glyphs to find character groups which together form a word. The algorithm takes into account a variety of properties and special cases in order to accurately identify words even in complicated layouts and for arbitrary text ordering on the page.

> Some characters, such as space and punctuation characters (e.g. colon, comma, full stop, parentheses), are considered a word boundary, regardless of their width and position.

If the punctuationbreaks option in TET_open_page( ) is set to false, the Wordfinder will no longer treat punctuation characters as word boundaries:

contentanalysis={punctuationbreaks=false}

Ignoring punctuation characters for word boundary detection can, for example, be useful for maintaining Web URLs where period and slash characters are usually considered part of a word (see Figure 6.5).
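The difference between the two settings can be sketched at the string level in Python. This is not the Wordfinder algorithm, which works on glyph geometry; the function toy_words is a hypothetical helper that only mimics the second criterion (punctuation as a word boundary) on plain text. Emitting each punctuation character as its own token is an assumption of this sketch.

```python
import re


def toy_words(text, punctuationbreaks=True):
    """Split plain text into words, optionally breaking at punctuation.

    String-level illustration only: whitespace always separates words.
    With punctuationbreaks=True every punctuation character also ends a
    word and is emitted as its own token (an assumption of this sketch).
    """
    if punctuationbreaks:
        # runs of word characters, or single non-space punctuation characters
        return re.findall(r"\w+|[^\w\s]", text)
    # only whitespace separates words, so URLs stay in one piece
    return text.split()


sample = "see www.example.com/docs now"
print(toy_words(sample, punctuationbreaks=True))
print(toy_words(sample, punctuationbreaks=False))
```

In the second call the URL survives as a single word, which is the behavior the text describes for punctuationbreaks=false.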

Note: Word boundary detection for text with ideographic characters works differently; see Section 6.3.2, »Word Boundaries for CJK Text«, page 79, for more information.

Fig. 6.5 The default setting punctuationbreaks=true will separate the parts of URLs (top), while punctuationbreaks=false will keep the parts together (bottom).

