
PDFlib Text Extraction Toolkit (TET) Manual


9.3 TETML Elements and the TETML Schema

A formal XML schema description (XSD) for all TETML elements and attributes as well as their relationships is contained in the TET distribution. The TETML namespace is the following:

http://www.pdflib.com/XML/TET3/TET-3.0

The schema can be downloaded from the following URL:

http://www.pdflib.com/XML/TET3/TET-3.0.xsd

Both the TETML namespace and the schema location are present in the root element of each TETML document.
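As a minimal sketch of how a consumer might read both values from the root element, the following Python snippet uses the standard xml.etree.ElementTree module. The sample document is hand-written for illustration: the root element name TET and the use of an xsi:schemaLocation attribute are assumptions, not taken from the text above, and the helper name root_namespace_and_schema is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard XML Schema instance namespace, which defines schemaLocation.
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Hand-written sample; the root element name "TET" is an assumption.
SAMPLE = """<TET xmlns="http://www.pdflib.com/XML/TET3/TET-3.0"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.pdflib.com/XML/TET3/TET-3.0
                         http://www.pdflib.com/XML/TET3/TET-3.0.xsd">
</TET>"""


def root_namespace_and_schema(xml_text):
    """Return (namespace, schema URL) declared on the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree expands qualified names to "{namespace}localname".
    namespace = None
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
    # schemaLocation holds whitespace-separated (namespace, URL) pairs.
    tokens = root.get("{%s}schemaLocation" % XSI_NS, "").split()
    locations = dict(zip(tokens[0::2], tokens[1::2]))
    return namespace, locations.get(namespace)


namespace, schema_url = root_namespace_and_schema(SAMPLE)
print(namespace)
print(schema_url)
```

Looking the schema URL up by namespace, rather than taking the second token blindly, keeps the helper correct if the attribute lists more than one namespace/location pair.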

Table 9.3 describes the role of all TETML elements. Attributes which have been introduced with TET 4.1 and TET 4.0 are marked. Figure 9.1 visualizes the XML hierarchy of TETML elements.

Table 9.3 TETML elements and attributes

TETML element
    description and attributes

Attachment
    For PDF attachments describes the contents in a nested Document element. For non-PDF attachments only the name will be listed, but no contents.
    Attributes: name, level, pagenumber

Attachments
    Container of Attachment elements

Box
    Describes the coordinates of a Word. The attributes llx and lly describe the lower left corner, urx and ury describe the upper right corner of the Box in standard PDF coordinates. If the Box represents a rectangle with edges parallel to the page edges, the four values llx, lly, urx, ury describe the lower left and upper right corners; otherwise the coordinates of all four corners are present. A word may contain multiple Box elements, e.g. a hyphenated word which spans multiple lines of text, or a word which starts with a large character.
    Attributes: llx, lly¹, urx, ury¹, ulx, uly¹, lrx, lry¹

Cell
    Describes the contents of a single table cell.
    Attribute: colSpan

ColorSpace
    Describes a PDF colorspace.
    Attributes: alternate, base, components, id, name

ColorSpaces
    Container of ColorSpace elements

Content
    Describes the page contents as a hierarchical structure.
    Attributes: granularity, dehyphenation (TET 4.0), dropcap (TET 4.0), font, geometry, shadow (TET 4.0), sub (TET 4.0), sup (TET 4.0)

Creation
    Describes the date and operating system platform for the TET execution, plus the version number of TET.
    Attributes: platform, tetVersion, date

DocInfo
    Predefined and custom document info entries

Document
    Describes general document information including PDF file name and size, PDF version number.
    Attributes: filename, pageCount, filesize, linearized, pdfVersion, pdfa (TET 4.0: new values for PDF/A-2; TET 4.1: new values for PDF/A-3), pdfe (TET 4.0; TET 4.1: new values for PDF/E-2), pdfua (TET 4.1), pdfvt (TET 4.1), pdfx (TET 4.1: enumerated values), tagged
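
The Box entry in Table 9.3 notes that a word may carry several Box elements, and that a Box which is not parallel to the page edges lists all four corners. The following Python sketch shows one way a consumer might merge these into a single enclosing rectangle. The sample fragment and its coordinate values are invented for illustration, the helper names box_corners and word_bounds are hypothetical, and the assumption that Box elements are direct children of Word is not confirmed by the table.

```python
import xml.etree.ElementTree as ET

# Prefix "t" is arbitrary; the URI is the TETML namespace.
NS = {"t": "http://www.pdflib.com/XML/TET3/TET-3.0"}

# Invented sample: a hyphenated word split across two lines of text.
SAMPLE_WORD = """<Word xmlns="http://www.pdflib.com/XML/TET3/TET-3.0">
  <Box llx="500.00" lly="700.00" urx="540.00" ury="712.00"/>
  <Box llx="72.00" lly="686.00" urx="110.00" ury="698.00"/>
</Word>"""

CORNER_ATTRS = [("llx", "lly"), ("urx", "ury"), ("ulx", "uly"), ("lrx", "lry")]


def box_corners(box):
    """Return the (x, y) corner points that are present on one Box."""
    return [
        (float(box.get(x)), float(box.get(y)))
        for x, y in CORNER_ATTRS
        if box.get(x) is not None and box.get(y) is not None
    ]


def word_bounds(word):
    """Return (min x, min y, max x, max y) over all Boxes of a Word."""
    points = [p for box in word.findall("t:Box", NS) for p in box_corners(box)]
    if not points:
        raise ValueError("Word has no Box coordinates")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


word = ET.fromstring(SAMPLE_WORD)
print(word_bounds(word))
```

Taking the minimum and maximum over every corner that is present means the same function covers both the two-corner and the four-corner form of Box, at the cost of returning an axis-parallel rectangle even for rotated text.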

