PDFlib Text Extraction Toolkit (TET) Manual
5 Configuration<br />
5.1 Indexing protected PDF Documents<br />
PDF documents may use the following means of protection:<br />
> A user password may be required for opening the document.<br />
> Permission settings may allow or prevent certain actions, including extracting page<br />
contents.<br />
> A master password may be required for changing permission settings or one of the<br />
passwords.<br />
<strong>TET</strong> honors PDF permission settings. The password and permission status can be queried<br />
with the pCOS paths encrypt/master, encrypt/user, encrypt/nocopy, etc. as demonstrated<br />
in the dumper sample. pCOS also offers the pcosmode pseudo object which can<br />
be used to determine which operations are allowed for a particular document.<br />
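As a minimal sketch of such a status query, the following Java fragment reads the paths named above through a generic lookup function. The class ProtectionStatus and its lookup parameter are hypothetical stand-ins introduced for illustration and are not part of TET; in a real program the lookup would presumably delegate to tet.pcos_get_number(doc, path). The map in main only simulates a document, and the interpretation of the flags (nonzero means the password or restriction is present) is an assumption.<br />

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper, not part of TET: collects the protection status of a document. */
final class ProtectionStatus {
    final int pcosmode;    // value of the "pcosmode" pseudo object
    final boolean user;    // "encrypt/user", assumed nonzero if a user password is set
    final boolean master;  // "encrypt/master", assumed nonzero if a master password is set
    final boolean nocopy;  // "encrypt/nocopy", assumed nonzero if extraction is not permitted

    /** The lookup stands in for a call such as tet.pcos_get_number(doc, path). */
    ProtectionStatus(ToDoubleFunction<String> pcos) {
        pcosmode = (int) pcos.applyAsDouble("pcosmode");
        user = (int) pcos.applyAsDouble("encrypt/user") != 0;
        master = (int) pcos.applyAsDouble("encrypt/master") != 0;
        nocopy = (int) pcos.applyAsDouble("encrypt/nocopy") != 0;
    }

    String describe() {
        return "pcosmode=" + pcosmode
            + ", user password=" + user
            + ", master password=" + master
            + ", nocopy=" + nocopy;
    }

    public static void main(String[] args) {
        // Simulated pCOS values; a real program would query an opened document instead.
        Map<String, Double> fake = new HashMap<>();
        fake.put("pcosmode", 1.0);
        fake.put("encrypt/master", 1.0);
        fake.put("encrypt/nocopy", 1.0);

        ProtectionStatus status =
            new ProtectionStatus(path -> fake.getOrDefault(path, 0.0));
        System.out.println(status.describe());
    }
}
```

Passing the lookup as a function keeps the sketch independent of the TET library, so the same class can be exercised with simulated values.<br />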
Content extraction status. By default, text and image extraction is possible with <strong>TET</strong> if<br />
the document can successfully be opened (this is no longer true if the requiredmode option<br />
of TET_open_document() was supplied). Depending on the nocopy permission setting,<br />
content extraction may or may not be allowed in restricted pCOS mode (content<br />
extraction is always allowed in full pCOS mode). The following conditional can be used<br />
to check whether content extraction is allowed:<br />
int pcosmode = (int) tet.pcos_get_number(doc, "pcosmode");<br />
/* full mode (2), or restricted mode (1) without the nocopy restriction */<br />
if (pcosmode == 2 ||<br />
(pcosmode == 1 && (int) tet.pcos_get_number(doc, "encrypt/nocopy") == 0))<br />
{<br />
/* content extraction allowed */<br />
}<br />
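The same decision can be written as a pure function of the two queried values, which makes it easy to check all combinations without opening a document. This is only a sketch: the class ExtractionCheck and its method are hypothetical names introduced here, not part of the TET API, and the value 0 for the remaining pCOS mode is an assumption that the manual text above does not state.<br />

```java
/** Hypothetical helper, not part of TET: the extraction check as a pure function. */
final class ExtractionCheck {
    static final int RESTRICTED = 1; // pcosmode value for restricted pCOS mode
    static final int FULL = 2;       // pcosmode value for full pCOS mode

    /**
     * @param pcosmode value of the "pcosmode" pseudo object
     * @param nocopy   true if "encrypt/nocopy" is nonzero
     * @return true if content extraction is allowed
     */
    static boolean extractionAllowed(int pcosmode, boolean nocopy) {
        if (pcosmode == FULL) {
            return true;             // always allowed in full pCOS mode
        }
        // In restricted mode the nocopy permission decides; other modes are refused.
        return pcosmode == RESTRICTED && !nocopy;
    }

    public static void main(String[] args) {
        // Enumerate every combination; mode 0 is assumed to be the remaining pCOS mode.
        for (int mode = 0; mode <= 2; mode++) {
            for (boolean nocopy : new boolean[] {false, true}) {
                System.out.println("pcosmode=" + mode + ", nocopy=" + nocopy
                    + " -> allowed=" + extractionAllowed(mode, nocopy));
            }
        }
    }
}
```

In an application the two arguments would come from pcos_get_number() calls like those in the conditional above.<br />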
The need for processing protected documents. PDF permission settings help document<br />
authors to enforce their rights as creators of content, and users of PDF documents<br />
must respect the rights of the document author when extracting text or image contents.<br />
By default, <strong>TET</strong> will operate in restricted mode and refuse to extract any contents<br />
from such protected documents. However, content extraction does not in all cases automatically<br />
constitute a violation of the author’s rights. Situations where content extraction<br />
may be acceptable include the following:<br />
> Small amounts of content are extracted for quoting (»fair use«).<br />
> Organizations may want to check incoming or outgoing documents for certain keywords<br />
(document screening).<br />
> The document author may have lost the master password.<br />
> Search engines index protected documents without making the document contents<br />
available to the user directly (only indirectly by providing a link to the original PDF).<br />
The last example is particularly important: even if users are not allowed to extract the<br />
contents of a protected PDF, they should be able to locate the document in an enterprise<br />
or Web-based search. It may be acceptable to extract the contents if the extracted text is<br />
not directly made available to the user, but only used to feed the search engine’s index<br />
so that the document can be found. Since the user only gets access to the original pro-<br />