PDFlib Text Extraction Toolkit (TET) Manual
9.3 TETML Elements and the TETML Schema
A formal XML schema description (XSD) for all TETML elements and attributes as well as their relationships is contained in the TET distribution. The TETML namespace is the following:
http://www.pdflib.com/XML/TET3/TET-3.0
The schema can be downloaded from the following URL:
http://www.pdflib.com/XML/TET3/TET-3.0.xsd
Both the TETML namespace and the schema location are present in the root element of each TETML document.
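As an illustration, the namespace and schema location can be read back from the root element with Python's standard library. This is only a sketch: the sample document is a hypothetical, heavily shortened fragment, and both the root element name `TET` and the use of the standard `xsi:schemaLocation` attribute are assumptions, not statements taken from this section.

```python
import xml.etree.ElementTree as ET

TETML_NS = "http://www.pdflib.com/XML/TET3/TET-3.0"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical, shortened TETML document for illustration only.
# The root element name "TET" is an assumption.
SAMPLE = f"""<TET xmlns="{TETML_NS}"
     xmlns:xsi="{XSI_NS}"
     xsi:schemaLocation="{TETML_NS} {TETML_NS}.xsd">
  <Document filename="sample.pdf" pageCount="1"/>
</TET>"""


def namespace_and_schema(xml_text):
    """Return (namespace URI, schema URL) found on the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree stores qualified names as "{namespace}localname".
    namespace = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else None
    # xsi:schemaLocation holds whitespace-separated (namespace, URL) pairs.
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    locations = dict(zip(tokens[0::2], tokens[1::2]))
    return namespace, locations.get(namespace)


namespace, schema_url = namespace_and_schema(SAMPLE)
print(namespace)
print(schema_url)
```

Because every TETML element lives in this namespace, later lookups with ElementTree need the qualified form, for example `root.find(f"{{{TETML_NS}}}Document")`.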
Table 9.3 describes the role of all TETML elements. Attributes which were introduced with TET 4.0 or TET 4.1 are marked accordingly. Figure 9.1 visualizes the XML hierarchy of TETML elements.
Table 9.3 TETML elements and attributes
TETML element: description and attributes
Attachment: For PDF attachments, describes the contents in a nested Document element. For non-PDF attachments only the name is listed, without contents. Attributes: name, level, pagenumber
Attachments: Container of Attachment elements
Box: Describes the coordinates of a Word in standard PDF coordinates. The attributes llx and lly describe the lower left corner, urx and ury the upper right corner of the Box. If the Box represents a rectangle with edges parallel to the page edges, only these four values are present; otherwise the coordinates of all four corners are present. A word may contain multiple Box elements, e.g. a hyphenated word which spans multiple lines of text, or a word which starts with a large character. Attributes: llx, lly¹, urx, ury¹, ulx, uly¹, lrx, lry¹
Cell: Describes the contents of a single table cell. Attribute: colSpan
ColorSpace: Describes a PDF colorspace. Attributes: alternate, base, components, id, name
ColorSpaces: Container of ColorSpace elements
Content: Describes the page contents as a hierarchical structure. Attributes: granularity, dehyphenation (TET 4.0), dropcap (TET 4.0), font, geometry, shadow (TET 4.0), sub (TET 4.0), sup (TET 4.0)
Creation: Describes the date and operating system platform for the TET execution, plus the version number of TET. Attributes: platform, tetVersion, date
DocInfo: Predefined and custom document info entries
Document: Describes general document information, including the PDF file name, file size, and PDF version number. Attributes: filename, pageCount, filesize, linearized, pdfVersion, pdfa (TET 4.0: new values for PDF/A-2; TET 4.1: new values for PDF/A-3), pdfe (TET 4.0; TET 4.1: new values for PDF/E-2), pdfua (TET 4.1), pdfvt (TET 4.1), pdfx (TET 4.1: enumerated values), tagged
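The Box entry in Table 9.3 can be made concrete with a short sketch that turns the Box attributes into corner points and merges several Box elements of one word into a single enclosing rectangle. The sample fragment is hypothetical and omits the TETML namespace for brevity; the helper names `box_corners` and `enclosing_rectangle` are illustrative and not part of TET. The sketch assumes that ulx, uly, lrx and lry are absent whenever the Box edges are parallel to the page edges.

```python
import xml.etree.ElementTree as ET


def box_corners(box):
    """Return Box corners as [lower left, lower right, upper right, upper left]."""
    llx, lly = float(box.get("llx")), float(box.get("lly"))
    urx, ury = float(box.get("urx")), float(box.get("ury"))
    if box.get("ulx") is None:
        # Axis-parallel rectangle: derive the two missing corners.
        ulx, uly, lrx, lry = llx, ury, urx, lly
    else:
        # Rotated or skewed box: all four corners are given explicitly.
        ulx, uly = float(box.get("ulx")), float(box.get("uly"))
        lrx, lry = float(box.get("lrx")), float(box.get("lry"))
    return [(llx, lly), (lrx, lry), (urx, ury), (ulx, uly)]


def enclosing_rectangle(boxes):
    """Smallest axis-parallel rectangle (llx, lly, urx, ury) around all boxes."""
    points = [point for box in boxes for point in box_corners(box)]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical fragment: a hyphenated word that spans two lines of text.
SAMPLE_WORD = """<Word>
  <Box llx="500.0" lly="700.0" urx="540.0" ury="712.0"/>
  <Box llx="72.0" lly="686.0" urx="110.0" ury="698.0"/>
</Word>"""

word = ET.fromstring(SAMPLE_WORD)
rect = enclosing_rectangle(word.findall("Box"))
print(rect)
```

For the sample word the result is `(72.0, 686.0, 540.0, 712.0)`, which spans both text lines. Whether such a merged rectangle is useful depends on the application; highlighting, for instance, usually works better with the individual boxes.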