17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Alphabetical list of words in the document along with their page number:

A
about 2 7 8
access 8 12
accessible 11
achieving 9 12
Acrobat 2 5 7 8 9 10 11 14 15 17
ActiveX 2
actual 9 12
actually 11 12 14
addition 9
Additional 12
additions 17
address 9 12
addressed 9
addressing 9
Adobe 2 5 8 12 14
...
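
The listing above pairs each word with the pages it occurs on, sorted without regard to case. A minimal Python sketch of that idea follows. It is not the stylesheet itself: the sample input is a hypothetical, heavily simplified TETML-like fragment, and the `number` attribute on `Page` as well as the helper names `local`, `build_index`, and `format_index` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified TETML-like input (assumption: real TETML is
# namespaced and carries far more structure and attributes).
SAMPLE_TETML = """<TET><Document><Pages>
<Page number="2"><Word><Text>about</Text></Word><Word><Text>Acrobat</Text></Word></Page>
<Page number="7"><Word><Text>about</Text></Word></Page>
<Page number="8"><Word><Text>access</Text></Word><Word><Text>about</Text></Word></Page>
</Pages></Document></TET>"""


def local(tag):
    """Strip an XML namespace prefix of the form {uri} from a tag name."""
    return tag.rsplit("}", 1)[-1]


def build_index(xml_text):
    """Map each word to the sorted list of page numbers it occurs on."""
    root = ET.fromstring(xml_text)
    pages_by_word = defaultdict(set)
    for page in root.iter():
        if local(page.tag) != "Page":
            continue
        number = int(page.get("number"))
        for elem in page.iter():
            if local(elem.tag) == "Text" and elem.text:
                pages_by_word[elem.text].add(number)
    # Sort case-insensitively, so "access" comes before "Acrobat".
    ordered = sorted(pages_by_word, key=lambda w: (w.lower(), w))
    return {word: sorted(pages_by_word[word]) for word in ordered}


def format_index(index):
    """Render the index with a heading line for each initial letter."""
    lines = []
    current_initial = None
    for word, pages in index.items():
        initial = word[0].upper()
        if initial != current_initial:
            lines.append(initial)
            current_initial = initial
        lines.append(word + " " + " ".join(str(p) for p in pages))
    return "\n".join(lines)
```

Calling `format_index(build_index(SAMPLE_TETML))` yields a listing in the same shape as the one above, restricted to the three words in the sample.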

Extract XMP metadata. The metadata.xsl stylesheet expects TETML input in any mode. It targets XMP metadata on the document level, and extracts some metadata properties from the XMP. PDF attachments (including PDF packages and portfolios) in the document will be processed recursively:

dc:creator = PDFlib GmbH
xmp:CreatorTool = FrameMaker 7.0
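
To make the two output lines above concrete, the following Python sketch pulls the same two properties out of an XMP packet. It is an illustration, not the logic of metadata.xsl: the sample packet is hand-written, the function name `extract_properties` is hypothetical, and only the element form of `xmp:CreatorTool` is handled (XMP also permits simple properties as attributes on `rdf:Description`). How the XMP packet is embedded in TETML is not shown here.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF, Dublin Core, and the XMP basic schema.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}

# Hand-written sample packet (assumption: values taken from the output above).
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xmp="http://ns.adobe.com/xap/1.0/">
   <dc:creator><rdf:Seq><rdf:li>PDFlib GmbH</rdf:li></rdf:Seq></dc:creator>
   <xmp:CreatorTool>FrameMaker 7.0</xmp:CreatorTool>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def extract_properties(xmp_text):
    """Return selected XMP properties as a dict of name -> string value."""
    root = ET.fromstring(xmp_text)
    props = {}
    # dc:creator is an ordered array (rdf:Seq) of names.
    creators = [
        li.text
        for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)
        if li.text
    ]
    if creators:
        props["dc:creator"] = ", ".join(creators)
    # xmp:CreatorTool is a simple text property (element form only here).
    tool = root.find(".//xmp:CreatorTool", NS)
    if tool is not None and tool.text:
        props["xmp:CreatorTool"] = tool.text
    return props


def format_properties(props):
    """Render properties as 'name = value' lines."""
    return "\n".join(name + " = " + value for name, value in props.items())
```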

Extract table contents in CSV format. The table.xsl stylesheet expects TETML input in word, wordplus, or page mode. It extracts the contents of a specified table and creates a CSV (comma-separated values) file which contains the table contents. CSV files can be opened with all spreadsheet applications, which makes this useful for repurposing the contents of tables in PDF documents.
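
The table-to-CSV step can be sketched in Python as follows. This is a simplified illustration rather than a port of table.xsl: the input fragment is hypothetical, the `Table`/`Row`/`Cell` nesting is assumed to be direct parent-child, and the `table_index` parameter is an invented stand-in for however the stylesheet selects "a specified table".

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified TETML-like fragment containing one table.
SAMPLE_TETML = """<Page number="1">
  <Table>
    <Row>
      <Cell><Para><Word><Text>Item</Text></Word></Para></Cell>
      <Cell><Para><Word><Text>Note</Text></Word></Para></Cell>
    </Row>
    <Row>
      <Cell><Para><Word><Text>Widget</Text></Word></Para></Cell>
      <Cell><Para><Word><Text>small,</Text></Word><Word><Text>red</Text></Word></Para></Cell>
    </Row>
  </Table>
</Page>"""


def local(tag):
    """Strip an XML namespace prefix of the form {uri} from a tag name."""
    return tag.rsplit("}", 1)[-1]


def cell_text(cell):
    """Join all Text descendants of a cell with single spaces."""
    parts = [e.text for e in cell.iter() if local(e.tag) == "Text" and e.text]
    return " ".join(parts)


def table_to_csv(xml_text, table_index=0):
    """Return the contents of the selected table as CSV text."""
    root = ET.fromstring(xml_text)
    tables = [e for e in root.iter() if local(e.tag) == "Table"]
    table = tables[table_index]
    buffer = io.StringIO()
    # The csv module quotes fields that contain commas or quotes.
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        if local(row.tag) != "Row":
            continue
        writer.writerow(
            [cell_text(c) for c in row if local(c.tag) == "Cell"]
        )
    return buffer.getvalue()
```

Leaving the quoting to the `csv` module, instead of joining cells with commas by hand, keeps a cell such as `small, red` in one column when a spreadsheet application opens the file.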

Convert TETML to HTML. The tetml2html.xsl stylesheet expects TETML input in wordplus mode. It converts the TETML to HTML which can be displayed in a browser. The converter does not attempt to generate an identical visual representation of the PDF document, but creates heading elements (H1, H2, etc.) based on configurable font sizes. It also maps table elements in TETML to the corresponding HTML table constructs so that tables can be viewed in the browser. In addition, the converter creates a table of contents at the beginning of the HTML page, where each entry corresponds to a heading in the document and contains an active link which jumps to that heading.
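
The heading and table-of-contents logic described above can be illustrated with a short Python sketch. It does not parse TETML at all: the input is assumed to be a list of `(font_size, text)` pairs already derived from the document, and the size thresholds, the anchor naming scheme, and the function names are hypothetical choices, not those of tetml2html.xsl.

```python
from html import escape

# Hypothetical configuration: minimum font size for each heading level,
# listed from largest to smallest.
HEADING_SIZES = [(24.0, "h1"), (18.0, "h2")]


def tag_for(size):
    """Pick a heading tag for the font size, or 'p' for body text."""
    for min_size, tag in HEADING_SIZES:
        if size >= min_size:
            return tag
    return "p"


def to_html(paragraphs):
    """Build an HTML fragment: a linked table of contents, then the body."""
    toc = []
    body = []
    for size, text in paragraphs:
        tag = tag_for(size)
        safe = escape(text)
        if tag == "p":
            body.append("<p>" + safe + "</p>")
        else:
            # Each heading gets an id so that its TOC entry can link to it.
            anchor = "heading-" + str(len(toc) + 1)
            toc.append('<li><a href="#' + anchor + '">' + safe + "</a></li>")
            body.append("<" + tag + ' id="' + anchor + '">' + safe + "</" + tag + ">")
    return "\n".join(["<ul>"] + toc + ["</ul>"] + body)
```

For example, `to_html([(24.0, "Introduction"), (10.0, "Body text."), (18.0, "Details")])` produces a two-entry list of links followed by an `h1`, a `p`, and an `h2`.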

Extract raw text from TETML. The textonly.xsl stylesheet expects TETML input in any mode. It extracts the raw text contents by fetching all Text elements while ignoring all other elements. PDF attachments (including PDF packages and portfolios) in the document will be processed recursively.
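
Fetching every Text element while ignoring everything else amounts to a full tree walk, which also reaches documents nested as attachments. The Python sketch below shows this on a hypothetical, simplified input; the namespace URI `urn:example:tetml`, the nesting of `Attachments`, and the function name `extract_text` are placeholders for illustration, not the real TETML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TETML-like input with one nested attachment.
SAMPLE_TETML = """<TET xmlns="urn:example:tetml">
 <Document>
  <Pages>
   <Page number="1">
    <Para>
     <Word><Text>Hello</Text><Box/></Word>
     <Word><Text>world</Text><Box/></Word>
    </Para>
   </Page>
  </Pages>
  <Attachments>
   <Attachment>
    <Document>
     <Pages>
      <Page number="1">
       <Para><Word><Text>attached</Text><Box/></Word></Para>
      </Page>
     </Pages>
    </Document>
   </Attachment>
  </Attachments>
 </Document>
</TET>"""


def local(tag):
    """Strip an XML namespace prefix of the form {uri} from a tag name."""
    return tag.rsplit("}", 1)[-1]


def extract_text(xml_text):
    """Return the contents of all Text elements, one per line."""
    root = ET.fromstring(xml_text)
    # iter() visits every descendant in document order, so Text elements
    # inside nested attachment documents are included as well.
    texts = [e.text for e in root.iter() if local(e.tag) == "Text" and e.text]
    return "\n".join(texts)
```

Matching on the local tag name rather than a fixed namespace keeps the sketch independent of the schema version, at the cost of also matching a same-named element from any other namespace.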

8.5 XSLT Samples
