17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

Dehyphenation. Hyphenated words at the end of a line are usually not desired for applications which process the extracted text on a logical level. TET will therefore dehyphenate, that is, recombine the parts of a hyphenated word. More precisely, if a word at the end of a line ends with a hyphen character and the first word on the next line starts with a lowercase character, the hyphen will be removed and the first part of the word will be combined with the part on the next line, provided there is at least one more line in the same zone. Dash characters (as opposed to hyphens) will be left unmodified. The parts of a hyphenated word will not be modified; only the hyphen will be removed. Dehyphenation can be disabled with the following option list for TET_open_page():

contentanalysis={dehyphenate=false}
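
The dehyphenation rule described above can be illustrated with a small standalone sketch. This is not TET's implementation, which the manual does not disclose. It only mirrors the stated conditions, assuming that a zone is given as a list of text lines and that words are separated by whitespace. The names `dehyphenate`, `zone_lines`, and `HYPHEN` are hypothetical.

```python
# Illustrative sketch only: mirrors the rule stated in the text,
# not the actual TET algorithm.
HYPHEN = "-"  # U+002D; dashes such as U+2013 are intentionally not matched


def dehyphenate(zone_lines):
    """Return the words of one zone, joining words split by a line-end hyphen."""
    words = []
    pending = ""  # first part of a word whose hyphen has been removed
    for index, line in enumerate(zone_lines):
        tokens = line.split()
        if not tokens:
            continue
        if pending:
            # Combine the first part with the part on this line, unmodified.
            tokens[0] = pending + tokens[0]
            pending = ""
        has_next = index + 1 < len(zone_lines)
        next_tokens = zone_lines[index + 1].split() if has_next else []
        last = tokens[-1]
        if (len(last) > 1 and last.endswith(HYPHEN)
                and next_tokens and next_tokens[0][0].islower()):
            # Remove only the hyphen; keep the word part for the next line.
            pending = tokens.pop()[:-1]
        words.extend(tokens)
    return words


if __name__ == "__main__":
    print(dehyphenate(["Text extrac-", "tion of hyphen-", "ated words"]))
    # A following uppercase word or a trailing dash leaves the text unchanged.
    print(dehyphenate(["well-", "Known pages 1\u2013", "two"]))
```

A hyphen on the last line of the zone is also left in place, since there is no following line to combine with.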

Shadow and fake bold text removal. PDF documents sometimes include redundant text which does not contribute to the semantics of a page, but only creates certain visual effects. Shadow text effects are usually achieved by placing two or more copies of the actual text on top of each other, where a small displacement is applied. Applying opaque coloring to each layer of text provides a visual appearance where the majority of the text in lower layers is obscured, while the visible portions create a shadow effect. Similarly, word processing applications sometimes support a feature for creating artificial bold text. In order to create the appearance of bold text even if a bold font is not available, the text is placed repeatedly on the page in the same color. Using a very small displacement, the appearance of bold text is simulated.

Shadow simulation, artificial bold text, and similar visual artifacts create severe problems when reusing the extracted text, since redundant text which contributes only to the visual appearance will be processed although it does not contribute to the page contents.

If the wordfinder is enabled, TET will identify and remove such redundant visual artifacts by default. Shadow removal can be disabled with the following option list for TET_open_page():

contentanalysis={shadowdetect=false}
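
To make the idea of redundant text copies concrete, the following sketch drops any text run that repeats an earlier run with the same content at nearly the same position. It is a simplified illustration under stated assumptions, not the detection method used by TET: the `TextRun` type, the `remove_shadow_runs` function, and the `max_offset` tolerance of one unit are all hypothetical.

```python
# Illustrative sketch only: a naive duplicate filter, not the TET algorithm.
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRun:
    """Hypothetical text fragment with the position of its origin on the page."""
    text: str
    x: float
    y: float


def remove_shadow_runs(runs, max_offset=1.0):
    """Keep the first copy of each run; drop copies displaced by at most max_offset."""
    kept = []
    for run in runs:
        is_copy = any(
            run.text == other.text
            and abs(run.x - other.x) <= max_offset
            and abs(run.y - other.y) <= max_offset
            for other in kept
        )
        if not is_copy:
            kept.append(run)
    return kept


if __name__ == "__main__":
    page = [
        TextRun("Title", 100.0, 700.0),
        TextRun("Title", 100.4, 699.6),  # shadow or fake bold copy
        TextRun("Title", 100.0, 650.0),  # same text elsewhere: kept
    ]
    for run in remove_shadow_runs(page):
        print(run)
```

The same text placed far away from the first copy is kept, because only copies within the small displacement tolerance are treated as visual artifacts.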

6.6 Content Analysis
