17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

The tetml option enables or disables coordinate- and font-related information in the Glyph element. The following page option list enables font details in the Glyph element, but suppresses coordinate details:

tetml={ glyphdetails={nogeometry font} }

The following page option list instructs TET to combine punctuation characters with the adjacent words, i.e. punctuation characters are no longer treated as individual words:

contentanalysis={nopunctuationbreaks}

The following page option makes sense only for page mode. It changes the default separator character from linefeed to space:

contentanalysis={lineseparator=U+0020}
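
The three page options above can be assembled programmatically before they are handed to TET. The following Python sketch only builds the option list string. The helper name build_page_optlist and its parameters are hypothetical and not part of TET, and merging both contentanalysis suboptions into a single list is an assumption about the option list syntax that should be checked against the option list reference.

```python
def build_page_optlist(mode, glyph_font_only=False, keep_punctuation=False):
    """Assemble a page option list string from the options discussed above.

    Hypothetical helper: the name and parameters are illustrative only.
    The option strings themselves are taken from the manual text.
    """
    parts = []
    if glyph_font_only:
        # Font details in the Glyph element, no coordinate details
        parts.append("tetml={ glyphdetails={nogeometry font} }")

    content = []
    if keep_punctuation:
        # Punctuation characters stay attached to the adjacent words
        content.append("nopunctuationbreaks")
    if mode == "page":
        # The separator change is only meaningful in page mode
        content.append("lineseparator=U+0020")
    if content:
        # Assumption: suboptions are separated by spaces within one list
        parts.append("contentanalysis={" + " ".join(content) + "}")

    return " ".join(parts)


if __name__ == "__main__":
    print(build_page_optlist("page", keep_punctuation=True))
```

The mode check keeps the lineseparator suboption out of option lists for the other modes, where it would have no effect.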

All page options which have been supplied when creating TETML will be recorded in the /TET/Document/Pages/Page/Options elements (individually for each page) unless disabled with the following document option:

tetml={ elements={nooptions} }
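
As a rough illustration of how the recorded options could be read back, the following Python sketch collects the text of the Options element for each Page element. The SAMPLE document is hand-written and heavily simplified, not real TET output. It only mirrors the /TET/Document/Pages/Page/Options path named above. The function names are hypothetical, and element names are matched without their namespace so that no particular namespace URI has to be assumed.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample (assumption: real TETML carries a
# namespace, attributes and much more content than shown here).
SAMPLE = """<TET><Document><Pages>
<Page><Options>contentanalysis={nopunctuationbreaks}</Options></Page>
<Page></Page>
</Pages></Document></TET>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def page_options(tetml_text):
    """Return one entry per Page: its Options text, or None if absent."""
    root = ET.fromstring(tetml_text)
    result = []
    for elem in root.iter():
        if local_name(elem.tag) != "Page":
            continue
        opts = [(child.text or "").strip()
                for child in elem
                if local_name(child.tag) == "Options"]
        result.append(opts[0] if opts else None)
    return result


if __name__ == "__main__":
    for index, options in enumerate(page_options(SAMPLE), start=1):
        print(index, options)
```

A page without an Options element yields None, which is what a consumer would see for every page if the nooptions setting had been used.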

Exception handling. If an error happens during PDF parsing, TET will generally try to repair or ignore the problem if possible, or throw an exception otherwise. However, when generating TETML output with TET, PDF parsing problems will usually be reported as an Exception element in the TETML:

Object ’objects[49]/Subtype’ does not exist

Applications should be prepared to deal with Exception elements instead of the expected elements when processing TETML output.
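
One way to prepare for this is to scan the TETML for Exception elements before processing the regular content. The following Python sketch is a minimal example of that check. The SAMPLE document is hand-written and simplified, and the position of the Exception element inside Page is an assumption made for illustration. The function names are hypothetical, and element names are matched without their namespace.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample (assumption: where the Exception element
# appears in real output depends on where the parsing problem occurred).
SAMPLE = """<TET><Document><Pages><Page>
<Exception>Object 'objects[49]/Subtype' does not exist</Exception>
</Page></Pages></Document></TET>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def find_exceptions(tetml_text):
    """Return (parent_path, message) for every Exception element found."""
    root = ET.fromstring(tetml_text)
    found = []

    def walk(elem, parent_path):
        name = local_name(elem.tag)
        if name == "Exception":
            found.append((parent_path, (elem.text or "").strip()))
        for child in elem:
            walk(child, parent_path + "/" + name)

    walk(root, "")
    return found


if __name__ == "__main__":
    for parent_path, message in find_exceptions(SAMPLE):
        print("Problem reported under", parent_path + ":", message)
```

Recording the parent path tells the application which part of the document is missing its expected elements, so it can skip or flag just that part.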

Problems which prevent the generation of the TETML output file (e.g. no write permission for the output file) will still trigger an exception, and no valid TETML output will be created.

Chapter 8: TET Markup Language (TETML)
