PDFlib Text Extraction Toolkit (TET) Manual
Dehyphenation. Hyphenated words at the end of a line are usually undesirable for
applications which process the extracted text on a logical level. TET will therefore
dehyphenate, that is, recombine the parts of a hyphenated word. More precisely, if a
word at the end of a line ends with a hyphen character and the first word on the next
line starts with a lowercase character, the hyphen will be removed and the first part of
the word will be combined with the part on the next line, provided there is at least one
more line in the same zone. Dash characters (as opposed to hyphens) will be left unmodified.
The parts of a hyphenated word will not be modified; only the hyphen will be removed.
Dehyphenation can be disabled with the following option list for TET_open_page( ):
contentanalysis={dehyphenate=false}<br />
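
The dehyphenation rule described above can be illustrated with a small, self-contained Python sketch. It is not TET's implementation and does not use the TET API. The function name `dehyphenate_lines`, the representation of a zone as a list of line strings, and the choice of which characters count as hyphens (here only U+002D and U+2010) are assumptions made for illustration.

```python
# Characters treated as hyphens in this sketch (an assumption, not TET's list).
# Dash characters such as U+2013 and U+2014 are deliberately excluded.
HYPHENS = {"-", "\u2010"}


def dehyphenate_lines(lines):
    """Recombine words hyphenated across line ends within one zone.

    lines: list of strings, one per text line of a single zone.
    Returns a new list of strings with the same number of lines.
    """
    result = []
    carry = ""  # first part of a hyphenated word, waiting for the next line
    for index, line in enumerate(lines):
        words = line.split()
        if carry and words:
            # Attach the carried-over first part to the first word of this line.
            words[0] = carry + words[0]
            carry = ""
        has_next_line = index + 1 < len(lines)
        if words and has_next_line:
            last = words[-1]
            next_words = lines[index + 1].split()
            if (
                len(last) > 1
                and last[-1] in HYPHENS
                and next_words
                and next_words[0][0].islower()
            ):
                # Remove only the hyphen; the word parts stay unmodified.
                carry = words.pop()[:-1]
        result.append(" ".join(words))
    return result
```

For example, the lines `"Text extrac-"` and `"tion works"` become `"Text"` and `"extraction works"`, while a line ending in `"Well-"` followed by `"Known"` is left alone because the next word starts with an uppercase character.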
Shadow and fake bold text removal. PDF documents sometimes include redundant<br />
text which does not contribute to the semantics of a page, but creates certain visual effects<br />
only. Shadow text effects are usually achieved by placing two or more copies of the<br />
actual text on top of each other, where a small displacement is applied. Applying<br />
opaque coloring to each layer of text provides a visual appearance where the majority<br />
of the text in lower layers is obscured, while the visible portions create a shadow effect.<br />
Similarly, word processing applications sometimes support a feature for creating
artificial bold text. In order to create bold text appearance even if a bold font is
not available, the text is placed repeatedly on the page in the same color. Using a very
small displacement the appearance of bold text is simulated.
Shadow simulation, artificial bold text, and similar visual artifacts create severe<br />
problems when reusing the extracted text since redundant text contents which contribute<br />
only to the visual appearance will be processed although the text does not contribute<br />
to the page contents.<br />
If the wordfinder is enabled, TET will identify and remove such redundant visual artifacts
by default. Shadow removal can be disabled with the following option list for
TET_open_page( ):
contentanalysis={shadowdetect=false}<br />
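
To make the idea of redundant shadow or fake bold text concrete, the following Python sketch drops glyphs that repeat an already seen glyph at nearly the same position. It is only an illustration of the concept and says nothing about how TET detects such artifacts internally. The function name `remove_shadow_glyphs`, the `(char, x, y)` glyph tuples, and the `tolerance` value are hypothetical.

```python
def remove_shadow_glyphs(glyphs, tolerance=1.0):
    """Drop glyphs that duplicate an earlier glyph with a small displacement.

    glyphs: iterable of (char, x, y) tuples in page coordinates.
    tolerance: maximum displacement (same units as x and y) for two
        identical characters to be considered copies of each other.
    Returns the list of glyphs that were kept, in input order.
    """
    kept = []
    for char, x, y in glyphs:
        is_copy = any(
            char == kept_char
            and abs(x - kept_x) <= tolerance
            and abs(y - kept_y) <= tolerance
            for kept_char, kept_x, kept_y in kept
        )
        # The first copy is kept; which copy survives does not matter
        # for the extracted text, only that each character appears once.
        if not is_copy:
            kept.append((char, x, y))
    return kept
```

With a tolerance of 1.0, a glyph `("A", 10.4, 9.6)` that follows `("A", 10, 10)` is treated as a copy and dropped, whereas a second `"A"` at `(30, 10)` is kept as genuine text. The pairwise comparison is quadratic in the number of glyphs, which is acceptable for a sketch but would call for a spatial index on large pages.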