PDFlib Text Extraction Toolkit (TET) Manual
The tetml option enables or disables coordinate- and font-related information in the
Glyph element. The following page option list enables font details in the Glyph element,
but suppresses coordinate details:
tetml={ glyphdetails={nogeometry font} }<br />
The following page option list instructs TET to combine punctuation characters with
the adjacent words, i.e. punctuation characters are no longer treated as individual
words:
contentanalysis={nopunctuationbreaks}<br />
The following page option makes sense only for page mode. It changes the line separator
character from the default linefeed to a space:
contentanalysis={lineseparator=U+0020}<br />
All page options supplied when creating TETML are recorded in the
/TET/Document/Pages/Page/Options elements (individually for each page) unless disabled
with the following document option:
tetml={ elements={nooptions} }<br />
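As an illustration of how the recorded page options could be read back, here is a minimal Python sketch that collects the text of each Page's Options element. The TETML fragment is a hypothetical, heavily abbreviated sample: the namespace URI is a placeholder, and the `number` attribute on Page is an assumption made for this example. The code matches elements by local name only, so it does not depend on the actual TETML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated TETML fragment for illustration only.
# The namespace URI is a placeholder and the "number" attribute is assumed.
SAMPLE = """\
<TET xmlns="urn:example:tetml">
  <Document>
    <Pages>
      <Page number="1">
        <Options>contentanalysis={nopunctuationbreaks}</Options>
      </Page>
      <Page number="2">
        <Options>contentanalysis={lineseparator=U+0020}</Options>
      </Page>
    </Pages>
  </Document>
</TET>
"""


def local_name(tag):
    """Strip a leading '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def page_options(root):
    """Map each Page's 'number' attribute to its recorded option list text."""
    result = {}
    for page in root.iter():
        if local_name(page.tag) != "Page":
            continue
        # Only direct children of Page are inspected, mirroring the
        # /TET/Document/Pages/Page/Options path.
        for child in page:
            if local_name(child.tag) == "Options":
                result[page.get("number")] = (child.text or "").strip()
    return result


options = page_options(ET.fromstring(SAMPLE))
for number, optlist in options.items():
    print(number, optlist)
```

If the document was created with tetml={ elements={nooptions} }, no Options elements are present and the resulting dictionary is simply empty.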
Exception handling. If an error occurs during PDF parsing, TET will generally try to
repair or ignore the problem if possible, and throw an exception otherwise. However,
when generating TETML output with TET, PDF parsing problems will usually be reported
as an Exception element in the TETML:
Object ’objects[49]/Subtype’ does not exist<br />
Applications should be prepared to deal with Exception elements instead of the expected
elements when processing TETML output.
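One way an application might guard against this is to scan the TETML for Exception elements before relying on the expected content. The following Python sketch shows the idea on a hypothetical, abbreviated fragment: the namespace URI is a placeholder, and placing the Exception element inside a Page is an assumption made for this example. The search matches on the local element name at any depth, so it does not depend on where the element actually appears.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated TETML fragment for illustration only.
# The namespace URI is a placeholder; the position of the Exception
# element inside Page is assumed.
SAMPLE = """\
<TET xmlns="urn:example:tetml">
  <Document>
    <Pages>
      <Page number="1">
        <Exception>Object 'objects[49]/Subtype' does not exist</Exception>
      </Page>
    </Pages>
  </Document>
</TET>
"""


def find_exceptions(root):
    """Return the message text of every Exception element, at any depth."""
    return [
        (elem.text or "").strip()
        for elem in root.iter()
        # Compare the local name only, ignoring any '{namespace}' prefix.
        if elem.tag.rsplit("}", 1)[-1] == "Exception"
    ]


messages = find_exceptions(ET.fromstring(SAMPLE))
if messages:
    for message in messages:
        print("TETML exception:", message)
else:
    print("no exceptions reported")
```

An application could run such a check first and then skip, log, or flag the affected document or page instead of failing later on missing elements.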
Problems which prevent the generation of the TETML output file (e.g. no write permission
for the output file) will still trigger an exception, and no valid TETML output
will be created.
94 Chapter 8: TET Markup Language (TETML)