17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


5 Configuration

5.1 Indexing protected PDF Documents

PDF documents may use the following means of protection:

> A user password may be required for opening the document.
> Permission settings may allow or prevent certain actions, including extracting page contents.
> A master password may be required for changing permission settings or one of the passwords.

TET honors PDF permission settings. The password and permission status can be queried with the pCOS paths encrypt/master, encrypt/user, encrypt/nocopy, etc., as demonstrated in the dumper sample. pCOS also offers the pcosmode pseudo object, which can be used to determine which operations are allowed for a particular document.
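As a rough illustration of querying these paths, the following sketch collects the values into a one-line status summary. It does not call TET directly: the `pcos` parameter is a hypothetical stand-in for a lookup such as `path -> tet.pcos_get_number(doc, path)`, and the class name `ProtectionStatus` is invented for this example. The mapping of pcosmode 2 to "full" and 1 to "restricted" follows the conditional in this section; treating every other value as "minimum" is an assumption.

```java
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Sketch: summarize password and permission status from pCOS values. */
class ProtectionStatus {

    // 'pcos' is a hypothetical lookup; with TET it might be
    // path -> tet.pcos_get_number(doc, path) for an opened document.
    static String describe(ToDoubleFunction<String> pcos) {
        int mode = (int) pcos.applyAsDouble("pcosmode");
        boolean user = (int) pcos.applyAsDouble("encrypt/user") != 0;
        boolean master = (int) pcos.applyAsDouble("encrypt/master") != 0;
        boolean nocopy = (int) pcos.applyAsDouble("encrypt/nocopy") != 0;

        String modeName;
        switch (mode) {
            case 2:  modeName = "full"; break;
            case 1:  modeName = "restricted"; break;
            default: modeName = "minimum"; break; // assumption: any other value
        }
        return "pcosmode=" + modeName
            + ", user password=" + (user ? "yes" : "no")
            + ", master password=" + (master ? "yes" : "no")
            + ", nocopy=" + (nocopy ? "yes" : "no");
    }

    public static void main(String[] args) {
        // Stubbed values instead of a real document, for demonstration only.
        Map<String, Double> values = Map.of(
            "pcosmode", 1.0,
            "encrypt/user", 0.0,
            "encrypt/master", 1.0,
            "encrypt/nocopy", 1.0);
        System.out.println(describe(values::get));
    }
}
```

Passing the lookup as a function keeps the sketch independent of the TET binding, so it can be exercised with stubbed values before wiring it to a real document handle.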

Content extraction status. By default, text and image extraction is possible with TET if the document can be opened successfully (this no longer holds if the requiredmode option of TET_open_document( ) was supplied). Depending on the nocopy permission setting, content extraction may or may not be allowed in restricted pCOS mode; it is always allowed in full pCOS mode. The following conditional can be used to check whether content extraction is allowed:

if ((int) tet.pcos_get_number(doc, "pcosmode") == 2 ||
    ((int) tet.pcos_get_number(doc, "pcosmode") == 1 &&
     (int) tet.pcos_get_number(doc, "encrypt/nocopy") == 0))
{
    /* content extraction allowed */
}
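The conditional above can be wrapped in a small reusable predicate. The following is a minimal sketch under stated assumptions: the class `ExtractionCheck`, its constants, and the `pcos` lookup parameter are hypothetical names introduced for illustration, with `pcos` standing in for a call such as `path -> tet.pcos_get_number(doc, path)`. It reads pcosmode only once and queries encrypt/nocopy only when the document is in restricted mode.

```java
import java.util.function.ToDoubleFunction;

/** Sketch: the content extraction check as a reusable predicate. */
class ExtractionCheck {
    // Mode values as used in the conditional above.
    static final int PCOS_MODE_RESTRICTED = 1;
    static final int PCOS_MODE_FULL = 2;

    // 'pcos' is a hypothetical lookup; with TET it might be
    // path -> tet.pcos_get_number(doc, path) for an opened document.
    static boolean isExtractionAllowed(ToDoubleFunction<String> pcos) {
        int mode = (int) pcos.applyAsDouble("pcosmode");
        if (mode == PCOS_MODE_FULL) {
            return true; // full mode: extraction is always allowed
        }
        // restricted mode: allowed only if the nocopy restriction is not set
        return mode == PCOS_MODE_RESTRICTED
            && (int) pcos.applyAsDouble("encrypt/nocopy") == 0;
    }

    public static void main(String[] args) {
        // Stubbed lookups instead of real documents, for demonstration only.
        ToDoubleFunction<String> fullMode =
            path -> path.equals("pcosmode") ? 2.0 : 1.0;
        ToDoubleFunction<String> restrictedNoCopy = path -> 1.0;
        System.out.println(isExtractionAllowed(fullMode));
        System.out.println(isExtractionAllowed(restrictedNoCopy));
    }
}
```

Any pcosmode value other than 1 or 2 yields false here, which matches the original conditional.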

The need for processing protected documents. PDF permission settings help document authors enforce their rights as creators of content, and users of PDF documents must respect the rights of the document author when extracting text or image contents. By default, TET will operate in restricted mode and refuse to extract any contents from such protected documents. However, content extraction does not automatically constitute a violation of the author’s rights in all cases. Situations where content extraction may be acceptable include the following:

> Small amounts of content are extracted for quoting (»fair use«).
> Organizations may want to check incoming or outgoing documents for certain keywords (document screening).
> The document author himself may have lost the master password.
> Search engines index protected documents without making the document contents available to the user directly (only indirectly by providing a link to the original PDF).

The last example is particularly important: even if users are not allowed to extract the contents of a protected PDF, they should be able to locate the document in an enterprise or Web-based search. It may be acceptable to extract the contents if the extracted text is not directly made available to the user, but only used to feed the search engine’s index so that the document can be found. Since the user only gets access to the original pro-
