17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


8.5 XSLT Samples

The TET distribution includes several XSLT stylesheets which demonstrate the power of XSLT applied to TETML, and can be used as a starting point for TETML applications. This section provides an overview of the XSLT samples and presents sample output. Section 8.4, »Transforming TETML with XSLT«, page 97, discusses many options for deploying the XSLT stylesheets. More details regarding the functionality and inner workings of the stylesheets can be found in comments in the XSLT code. Some general aspects of the stylesheet samples:

> Most XSLT samples support parameters which can be used to control various processing details. These parameters can be set within the XSLT code or overridden from the environment (e.g. ant).

> Most XSLT samples require TETML input in a certain text mode (e.g. word mode, see »TETML text modes«, page 92, for details). In order to protect themselves from wrong input, they check whether the supplied TETML input conforms to the requirement, and report an error otherwise.

> Some XSLT samples recursively process PDF attachments in the document (this is mentioned in the descriptions below). Most samples ignore PDF attachments, though. They are written so that they can easily be extended to process attachments as well: it is sufficient to select the relevant elements within the Attachments element; the relevant xsl:template elements themselves don't have to be modified.

> All XSLT samples work with XSLT 1. While some samples could be simplified using features from XSLT 2, we wanted to stick to XSLT 1 for better usability.
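The first two points above (parameters with overridable defaults, and the check on the text mode of the input) can be illustrated with a small sketch. It is written in Python rather than XSLT so that it can be run without an XSLT processor, and it is not taken from the TET distribution. The XML snippet is a heavily simplified, hypothetical stand-in for TETML: the `granularity` attribute, the `min_size` parameter, and the function names `check_text_mode` and `run` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TETML-like input. Element and attribute names
# are assumptions for illustration, not the authoritative TETML schema.
SAMPLE = """<TET>
  <Document>
    <Pages>
      <Page number="1">
        <Content granularity="word">
          <Para><Word><Text>Hello</Text></Word></Para>
        </Content>
      </Page>
    </Pages>
  </Document>
</TET>"""


def check_text_mode(root, accepted):
    """Raise ValueError unless every Content element uses an accepted mode."""
    for content in root.iter("Content"):
        mode = content.get("granularity")
        if mode not in accepted:
            raise ValueError(
                f"input has text mode {mode!r}, expected one of {sorted(accepted)}"
            )


def run(xml_text, accepted=("word",), **overrides):
    """Validate the input and return the effective parameter set."""
    # The defaults play the role of parameters set inside the stylesheet;
    # keyword arguments play the role of values overridden from outside.
    params = {"min_size": 10.0}
    params.update(overrides)
    root = ET.fromstring(xml_text)
    check_text_mode(root, set(accepted))
    return params


print(run(SAMPLE))                  # defaults only
print(run(SAMPLE, min_size=12.0))   # one parameter overridden
```

Calling `run(SAMPLE, accepted=("glyph",))` raises a `ValueError`, which mirrors the idea of a stylesheet reporting an error for input in the wrong text mode.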

Create a concordance. The concordance.xsl stylesheet expects TETML input in word or wordplus mode. It creates a concordance, i.e. a list of unique words in a document sorted by descending frequency. This may be useful for linguistic analysis, cross-references for translators, consistency checks, etc.

List of words in the document along with the number of occurrences:

the 207
font 107
of 100
a 92
in 83
and 75
fonts 64
PDF 60
FontReporter 58
...
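The counting logic behind such a concordance can be sketched as follows. This is a Python illustration of the idea, not the code of concordance.xsl, which may work differently (for example regarding case handling or tie-breaking). The input is a simplified, hypothetical TETML-like snippet; the element names and the function name `concordance` are assumptions.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical, simplified word-mode input; not the authoritative schema.
SAMPLE = """<TET><Document><Pages><Page number="1"><Content granularity="word">
<Para>
  <Word><Text>the</Text></Word>
  <Word><Text>font</Text></Word>
  <Word><Text>the</Text></Word>
  <Word><Text>PDF</Text></Word>
  <Word><Text>font</Text></Word>
  <Word><Text>the</Text></Word>
</Para>
</Content></Page></Pages></Document></TET>"""


def concordance(xml_text):
    """Return (word, count) pairs sorted by descending frequency."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for word in root.iter("Word"):
        text = word.findtext("Text")
        if text:
            counts[text] += 1  # case-sensitive: "PDF" and "pdf" would differ
    # Ties are broken alphabetically so that the output is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


for word, count in concordance(SAMPLE):
    print(word, count)
# prints:
# the 3
# font 2
# PDF 1
```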

Font filtering. The fontfilter.xsl stylesheet expects TETML input in glyph or wordplus mode. It lists all words in a document which use a particular font in a size larger than a specified value. This may be useful to detect certain font/size combinations or for quality control. The same concept can be used to create a table of contents based on text portions which use a large font size.

Text containing font 'TheSansBold-Plain' with size greater than 10:

[TheSansBold-Plain/24] Contents<br />
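The filtering idea can be sketched in the same way. Again this is a Python illustration under stated assumptions, not the code of fontfilter.xsl: the snippet assumes that each word carries per-glyph `font` and `size` attributes and that font ids are resolved to names through a `Font` resource list. These names, as well as the function `filter_by_font`, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified input with glyph details; not the real schema.
SAMPLE = """<TET><Document><Pages>
<Page number="1"><Content granularity="word">
<Para>
  <Word><Text>Contents</Text>
    <Box><Glyph font="F0" size="24">C</Glyph><Glyph font="F0" size="24">o</Glyph></Box>
  </Word>
  <Word><Text>small</Text>
    <Box><Glyph font="F0" size="9">s</Glyph></Box>
  </Word>
  <Word><Text>body</Text>
    <Box><Glyph font="F1" size="12">b</Glyph></Box>
  </Word>
</Para>
</Content></Page>
<Resources><Fonts>
  <Font id="F0" name="TheSansBold-Plain"/>
  <Font id="F1" name="TheSerif-Plain"/>
</Fonts></Resources>
</Pages></Document></TET>"""


def filter_by_font(xml_text, font_name, min_size):
    """List words that use font_name in a size strictly greater than min_size."""
    root = ET.fromstring(xml_text)
    # Resolve the font name to the ids used on the glyphs.
    font_ids = {f.get("id") for f in root.iter("Font") if f.get("name") == font_name}
    hits = []
    for word in root.iter("Word"):
        sizes = [
            float(g.get("size"))
            for g in word.iter("Glyph")
            if g.get("font") in font_ids
        ]
        if sizes and max(sizes) > min_size:
            hits.append(f"[{font_name}/{max(sizes):g}] {word.findtext('Text')}")
    return hits


for line in filter_by_font(SAMPLE, "TheSansBold-Plain", 10):
    print(line)
# prints: [TheSansBold-Plain/24] Contents
```

Raising `min_size` and collecting the hits in page order is one way to approximate the table-of-contents use mentioned above.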
