17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


10.4 Document Functions

C++ int open_document(string filename, string optlist)

C# Java int open_document(String filename, String optlist)

Perl PHP long TET_open_document(resource tet, string filename, string optlist)

VB Function open_document(filename As String, optlist As String) As Long

C int TET_open_document(TET *tet, const char *filename, int len, const char *optlist)

Open a disk-based or virtual PDF document for content extraction.

filename (Name string, but Unicode file names are only supported on Windows) Absolute or relative name of the PDF input file to be processed. The file will be searched in all directories specified in the searchpath resource category. On Windows it is OK to use UNC paths or mapped network drives. In PHP Unicode filenames must be UTF-8.

In non-Unicode language bindings file names with len = 0 will be interpreted in the current system codepage unless they are preceded by a UTF-8 BOM, in which case they will be interpreted as UTF-8 or EBCDIC-UTF-8.
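The BOM rule above can be illustrated with a minimal sketch in C. It assumes the usual three-byte UTF-8 BOM (EF BB BF) and does not cover the EBCDIC-UTF-8 case. The helper names has_utf8_bom and prepend_utf8_bom are hypothetical and are not part of the TET API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the null-terminated byte string starts
 * with the UTF-8 BOM (EF BB BF), else 0. The short-circuit evaluation
 * guarantees that no byte past the terminating null is read. */
int has_utf8_bom(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;
    return p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF;
}

/* Hypothetical helper: copies utf8_name into dst with a UTF-8 BOM in front.
 * Returns 0 on success, -1 if dst (size bytes) is too small. */
int prepend_utf8_bom(char *dst, size_t size, const char *utf8_name)
{
    size_t n = strlen(utf8_name);
    if (size < n + 4)                  /* 3 BOM bytes + name + null byte */
        return -1;
    memcpy(dst, "\xEF\xBB\xBF", 3);
    memcpy(dst + 3, utf8_name, n + 1); /* includes the null terminator */
    return 0;
}
```

A caller in a non-Unicode binding would presumably pass the prefixed buffer as filename with len = 0, so that the name is read as UTF-8 rather than in the system codepage.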

len (Only C language binding) Length of filename (in bytes) for UTF-16 strings. If len = 0 a null-terminated string must be provided.

optlist An option list specifying document options according to Table 10.3. The following options can be used: encodinghint, glyphmapping, keeppua, inmemory, password, repair, requiredmode, shrug, tetml, unknownchar, usehostfonts.
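As an illustration of how an optlist string might be assembled in C, here is a minimal sketch. It assumes the general PDFlib option list convention of whitespace-separated name=value pairs, with braces around string values, and it assumes that inmemory takes a boolean value. Table 10.3 is not reproduced here, so the value types should be checked against it. The function name build_open_optlist is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes an option list such as
 * "password={secret} inmemory=true" into buf.
 * Returns 0 on success, -1 if the password contains a brace (not handled
 * in this sketch) or if buf is too small for the result. */
int build_open_optlist(char *buf, size_t size, const char *password, int inmemory)
{
    const char *flag = inmemory ? "true" : "false";
    int n;

    if (password != NULL && password[0] != '\0') {
        if (strpbrk(password, "{}") != NULL)
            return -1;                 /* brace escaping is out of scope here */
        n = snprintf(buf, size, "password={%s} inmemory=%s", password, flag);
    } else {
        n = snprintf(buf, size, "inmemory=%s", flag);
    }
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The resulting string would then be passed as the optlist argument of TET_open_document( ).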

Returns -1 on error, or a document handle otherwise. For example, it is an error if the input document or the TETML output file cannot be opened. If -1 is returned it is recommended to call TET_get_errmsg( ) to find out more details about the error.

Details Within a single TET object an arbitrary number of documents may be kept open at the same time. However, a single TET object must not be used in multiple threads simultaneously without any locking mechanism for synchronizing the access.

Encryption: if the document is encrypted its user password must be supplied in the password option if the permission settings allow content extraction. The document’s master password must be supplied if the permission settings do not allow content extraction. If the requiredmode option has been specified, documents can be opened even without the appropriate password, but operations are restricted. The shrug option can be used to enable content extraction from protected documents under certain conditions (see Section 5.1, »Indexing protected PDF Documents«, page 49).
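The password rule in the paragraph above can be restated as a small decision function. This is only a sketch of the two basic cases: it does not model the requiredmode or shrug options, and the names pw_kind and required_password are hypothetical, not part of the TET API.

```c
/* Which password, if any, would have to go into the password option. */
typedef enum { PW_NONE, PW_USER, PW_MASTER } pw_kind;

/* encrypted:          nonzero if the document is encrypted
 * extraction_allowed: nonzero if the permission settings allow
 *                     content extraction */
pw_kind required_password(int encrypted, int extraction_allowed)
{
    if (!encrypted)
        return PW_NONE;                /* unprotected: no password needed */
    return extraction_allowed ? PW_USER : PW_MASTER;
}
```

For an encrypted document whose permissions forbid extraction, required_password(1, 0) yields PW_MASTER, matching the second sentence of the paragraph.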

Supported file systems on iSeries: TET has been tested with PC type file systems only. Therefore input and output files should reside in PC type files in the IFS (Integrated File System). The QSYS.lib file system for input files has not been tested and is not supported. Since QSYS.lib files are mostly used for record-based or database objects, unpredictable behavior may be the result if you use TET with QSYS.lib objects. TET file I/O is always stream-based, not record-based.
