17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


XSLT samples. The TET distribution contains several XSLT stylesheets. They demonstrate how to process TETML to achieve various goals:

> concordance.xsl: create a list of unique words in a document, sorted by descending frequency.
> fontfilter.xsl: list all words in a document which use a particular font in a size larger than a specified value.
> fontfinder.xsl: for all fonts in a document, list all occurrences along with page number and position information.
> fontstat.xsl: generate font and glyph statistics.
> index.xsl: create an alphabetically sorted »back-of-the-book« index.
> metadata.xsl: extract selected fields from document-level XMP metadata included in TETML.
> solr.xsl: generate input for the Solr enterprise search server.
> table.xsl: extract a table to a CSV file (comma-separated values).
> tetml2html.xsl: convert TETML to simple HTML.
> textonly.xsl: extract the raw text from TETML input.
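
To give a feel for the kind of processing the stylesheets above perform, here is a minimal sketch of a concordance (unique words sorted by descending frequency). It is written in Python rather than XSLT and is not the code of concordance.xsl. The sample input is a simplified stand-in for TETML: the element names, the missing namespace, and the lower-casing of words are assumptions made for illustration, and the function name concordance() is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified stand-in for word-level TETML output. The element names and the
# absence of an XML namespace are assumptions, not the actual TETML schema.
SAMPLE = """\
<Document>
  <Page number="1">
    <Word><Text>PDF</Text></Word>
    <Word><Text>text</Text></Word>
    <Word><Text>extraction</Text></Word>
  </Page>
  <Page number="2">
    <Word><Text>text</Text></Word>
    <Word><Text>PDF</Text></Word>
    <Word><Text>text</Text></Word>
  </Page>
</Document>
"""


def concordance(xml_source):
    """Return (word, count) pairs sorted by descending frequency."""
    root = ET.fromstring(xml_source)
    # Count every non-empty Text element; folding to lower case is a choice
    # made for this sketch, not documented behavior of concordance.xsl.
    counts = Counter(
        element.text.strip().lower()
        for element in root.iter("Text")
        if element.text and element.text.strip()
    )
    # Ties are broken alphabetically so that the output order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for word, count in concordance(SAMPLE):
        print(f"{word}\t{count}")
```

For real TETML input the element lookup would have to take the TETML namespace into account, which is omitted here to keep the sketch short.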

TET Cookbook. The TET Cookbook is a collection of source code examples for solving specific application problems with the TET library. The Cookbook examples are written in the Java language, but can easily be adjusted to other programming languages since the TET API is almost identical for all supported language bindings. Some Cookbook samples are written in the XSLT language. The TET Cookbook is organized in the following groups:

> Text: samples related to text extraction
> Font: samples related to text with a focus on font properties
> Image: samples related to image extraction
> TET & PDFlib+PDI: samples which extract information from a PDF with TET and construct a new PDF based on the original PDF and the extracted information. These samples require the PDFlib+PDI product in addition to TET.
> TETML: XSLT samples for processing TETML
> Special: other samples

The TET Cookbook is available at the following URL:
www.pdflib.com/tet-cookbook.

pCOS Cookbook. The pCOS Cookbook is a collection of code fragments for the pCOS interface which is integrated in TET. It is available at the following URL:
www.pdflib.com/pcos-cookbook.

Details of the pCOS interface are documented in Chapter 9, »The pCOS Interface«, page 105.
