PDFlib Text Extraction Toolkit (TET) Manual
XSLT samples. The TET distribution contains several XSLT stylesheets. They demonstrate
how to process TETML to achieve various goals:
> concordance.xsl: create a list of unique words in a document sorted by descending frequency.
> fontfilter.xsl: list all words in a document which use a particular font in a size larger
than a specified value.
> fontfinder.xsl: for all fonts in a document, list all occurrences along with page number
and position information.
> fontstat.xsl: generate font and glyph statistics.
> index.xsl: create an alphabetically sorted »back-of-the-book« index.
> metadata.xsl: extract selected fields from document-level XMP metadata included in
TETML.
> solr.xsl: generate input for the Solr enterprise search server.
> table.xsl: extract a table to a CSV file (comma-separated values).
> tetml2html.xsl: convert TETML to simple HTML.
> textonly.xsl: extract the raw text from TETML input.
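
To illustrate the kind of processing that concordance.xsl performs, the following is a minimal sketch in Python rather than XSLT. It is not the stylesheet shipped with TET. It assumes that each word appears as a Word element containing a Text child; the sample input, its placeholder namespace, and the function names local_name and concordance are hypothetical and heavily simplified. Real TETML carries its own namespace and many more elements and attributes, so the code matches element names without regard to namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily simplified TETML-like input. The namespace is a
# placeholder; real TETML uses its own namespace and far more structure.
SAMPLE = """\
<TET xmlns="urn:example:tetml-sketch">
  <Document>
    <Pages>
      <Page number="1">
        <Content>
          <Para>
            <Word><Text>PDF</Text></Word>
            <Word><Text>text</Text></Word>
            <Word><Text>PDF</Text></Word>
          </Para>
        </Content>
      </Page>
    </Pages>
  </Document>
</TET>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def concordance(xml_text):
    """Return (word, count) pairs sorted by descending frequency.

    Assumption: every word is a Word element with a Text child.
    Ties are broken alphabetically so the result is deterministic.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        if local_name(element.tag) != "Word":
            continue
        for child in element:
            if local_name(child.tag) == "Text" and child.text:
                counts[child.text] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for word, count in concordance(SAMPLE):
        print(f"{word}\t{count}")
```

Matching on local names keeps the sketch independent of the TETML version in use. A production script would more likely bind the actual namespace and might normalize case before counting.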
TET Cookbook. The TET Cookbook is a collection of source code examples for solving
specific application problems with the TET library. The Cookbook examples are written
in the Java language, but can easily be adjusted to other programming languages since
the TET API is almost identical for all supported language bindings. Some Cookbook
samples are written in the XSLT language. The TET Cookbook is organized in the following
groups:
> Text: samples related to text extraction
> Font: samples related to text with a focus on font properties
> Image: samples related to image extraction
> TET & PDFlib+PDI: samples which extract information from a PDF with TET and construct
a new PDF based on the original PDF and the extracted information. These
samples require the PDFlib+PDI product in addition to TET.
> TETML: XSLT samples for processing TETML
> Special: other samples
The TET Cookbook is available at the following URL:
www.pdflib.com/tet-cookbook.
pCOS Cookbook. The pCOS Cookbook is a collection of code fragments for the pCOS interface
which is integrated in TET. It is available at the following URL:
www.pdflib.com/pcos-cookbook.
Details of the pCOS interface are documented in Chapter 9, »The pCOS Interface«,<br />
page 105.