17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual



9 The pCOS Interface

The pCOS (PDFlib Comprehensive Object Syntax) interface provides a simple and elegant facility for retrieving arbitrary information from all sections of a PDF document that do not describe page contents, such as page dimensions, metadata, and interactive elements. pCOS users are assumed to have some basic knowledge of internal PDF structures and dictionary keys, but do not have to deal with PDF syntax and parsing details.

We strongly recommend that pCOS users obtain a copy of the PDF Reference, which is available as follows:

Adobe Systems Incorporated: PDF Reference, Sixth Edition: Version 1.7. Downloadable PDF from www.adobe.com/devnet/pdf/pdf_reference.html.

9.1 Simple pCOS Examples

Cookbook. A collection of pCOS coding fragments for solving specific problems can be found in the pCOS Cookbook.

Assuming a valid PDF document handle is available, the pCOS functions TET_pcos_get_number(), TET_pcos_get_string(), and TET_pcos_get_stream() can be used to retrieve information from a PDF using the pCOS path syntax. Table 9.1 lists some common pCOS paths and their meanings.

Table 9.1 pCOS paths for commonly used PDF objects

pCOS path             type      explanation
length:pages          number    number of pages in the document
/Info/Title           string    document info field Title
/Root/Metadata        stream    XMP stream with the document’s metadata
fonts[...]/name       string    name of a font; the number of entries can be retrieved with length:fonts
fonts[...]/vertical   boolean   check a font for vertical writing mode
fonts[...]/embedded   boolean   embedding status of a font
pages[...]/width      number    width of the visible area of the page
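The indexed entries in Table 9.1 are typically combined with a `length:` query: first ask how many entries the array has, then substitute each index for the `[...]` placeholder. The following Java sketch illustrates that pattern for font names. It is not TET code: the `PcosReader` interface and the in-memory `sampleReader()` with its two made-up font names are hypothetical stand-ins so that the fragment runs without the library, and it assumes that pCOS array indices start at 0. With the real library, the two interface methods would correspond to `tet.pcos_get_number(doc, path)` and `tet.pcos_get_string(doc, path)`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PcosFontListSketch {
    /** Hypothetical stand-in for the two pCOS calls used here; not part of TET. */
    interface PcosReader {
        double getNumber(String path);
        String getString(String path);
    }

    /** Builds the pCOS path for the name of the font at the given index (assumed zero-based). */
    static String fontNamePath(int index) {
        return "fonts[" + index + "]/name";
    }

    /** Queries length:fonts first, then reads fonts[i]/name for every index. */
    static List<String> fontNames(PcosReader pcos) {
        int count = (int) pcos.getNumber("length:fonts");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(pcos.getString(fontNamePath(i)));
        }
        return names;
    }

    /** In-memory fake with made-up sample values, for illustration only. */
    static PcosReader sampleReader() {
        final Map<String, String> strings = new HashMap<>();
        strings.put("fonts[0]/name", "Helvetica");
        strings.put("fonts[1]/name", "Times-Roman");
        return new PcosReader() {
            public double getNumber(String path) {
                if (path.equals("length:fonts")) {
                    return strings.size();
                }
                throw new IllegalArgumentException("unknown path: " + path);
            }

            public String getString(String path) {
                String value = strings.get(path);
                if (value == null) {
                    throw new IllegalArgumentException("unknown path: " + path);
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        for (String name : fontNames(sampleReader())) {
            System.out.println(name);
        }
    }
}
```

The same loop shape applies to the other indexed paths in the table, for example `length:pages` together with `pages[...]/width`.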

Number of pages. The total number of pages in a document can be queried as follows:

pagecount = tet.pcos_get_number(doc, "length:pages");

Document info fields. Document information fields can be retrieved with the following code sequence:

objtype = tet.pcos_get_string(doc, "type:/Info/Title");
if (objtype.equals("string"))
{
    /* Document info key found */
    title = tet.pcos_get_string(doc, "/Info/Title");
}
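The check-then-fetch sequence above can be wrapped in a small helper so that a missing info key yields an empty result instead of an error. The Java sketch below shows one way to do this. It is an illustration under stated assumptions, not TET code: the `PcosStrings` interface and the in-memory `sampleReader()` are hypothetical, and the fake reports the type `"null"` for keys that do not exist, which is assumed here to match what the `type:` prefix returns for missing objects.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class PcosInfoFieldSketch {
    /** Hypothetical stand-in for tet.pcos_get_string(doc, path); not part of TET. */
    interface PcosStrings {
        String getString(String path);
    }

    /** Returns the info field only if its pCOS type is "string", as in the sequence above. */
    static Optional<String> infoField(PcosStrings pcos, String key) {
        String path = "/Info/" + key;
        String type = pcos.getString("type:" + path);
        if (!"string".equals(type)) {
            return Optional.empty();
        }
        return Optional.of(pcos.getString(path));
    }

    /** In-memory fake with a made-up title, for illustration only. */
    static PcosStrings sampleReader() {
        final Map<String, String> values = new HashMap<>();
        values.put("/Info/Title", "Sample Title");
        return path -> {
            if (path.startsWith("type:")) {
                // Assumption: missing objects are reported with the type "null".
                String target = path.substring("type:".length());
                return values.containsKey(target) ? "string" : "null";
            }
            String value = values.get(path);
            if (value == null) {
                throw new IllegalStateException("object not found: " + path);
            }
            return value;
        };
    }

    public static void main(String[] args) {
        PcosStrings pcos = sampleReader();
        System.out.println(infoField(pcos, "Title").orElse("(no title)"));
        System.out.println(infoField(pcos, "Subject").orElse("(no subject)"));
    }
}
```

Querying the type first matters because, in this sketch as in the sequence above, the value is fetched only once the key is known to exist with the expected type.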

