PDFlib Text Extraction Toolkit (TET) Manual
4.4 <strong>TET</strong> Connector for Oracle<br />
The <strong>TET</strong> connector for Oracle attaches <strong>TET</strong> to an Oracle database so that PDF documents<br />
can be indexed and queried with Oracle <strong>Text</strong>. The PDF documents can be referenced via<br />
their path name in the database, or directly stored in the database as BLOBs.<br />
Note: Protected documents can be indexed with the shrug option under certain conditions (see<br />
Chapter 5.1, »Extracting Content from protected PDF«, page 59, for details). This is prepared in<br />
the Connector files, but you must manually enable this option.<br />
Requirements and installation. The <strong>TET</strong> connector has been tested with Oracle 10g and<br />
Oracle 11g. In order to use the <strong>TET</strong> connector you must specify the AL32UTF8 database character<br />
set when creating the database. This is always the case for the Universal edition of<br />
Oracle Express (but not for the Western European edition). AL32UTF8 is the database<br />
character set recommended by Oracle, and also works best with <strong>TET</strong> for indexing PDF<br />
documents. However, it is also possible to connect <strong>TET</strong> to Oracle <strong>Text</strong> with other character<br />
sets according to one of the following methods:<br />
> Starting with Oracle <strong>Text</strong> 11.1.0.7 the database can perform the required character set<br />
conversion. Please refer to the section »Using USER_FILTER with Charset and Format<br />
Columns« in the Oracle <strong>Text</strong> 11.1.0.7 documentation, available at<br />
download.oracle.com/docs/cd/B28359_01/text.111/b28304/cdatadic.htm#sthref497.<br />
> With Oracle <strong>Text</strong> 11.1.0.6 or earlier the UTF-8 text generated by the <strong>TET</strong> filter script<br />
must be converted to the database character set. This can be achieved by adding a<br />
character set conversion command to tetfilter.sh:<br />
Unix: call iconv (open-source software) or uconv (part of the free ICU Unicode library)<br />
Windows: call a suitable code page converter in tetfilter.bat.<br />
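For Oracle <strong>Text</strong> 11.1.0.6 or earlier on Unix, the conversion step could look like the following minimal sketch. The function name convert_to_db_charset, its arguments, and the ISO-8859-1 default are assumptions for illustration only; they are not part of the shipped tetfilter.sh, and the target must match your actual database character set.<br />

```shell
# Hypothetical helper, not part of the shipped tetfilter.sh:
# convert the UTF-8 text written by TET to the database character set.
#   $1 = UTF-8 input file
#   $2 = output file in the target character set
#   $3 = target character set (assumed default: ISO-8859-1)
convert_to_db_charset() {
    iconv -f UTF-8 -t "${3:-ISO-8859-1}" "$1" > "$2"
}
```

By default iconv exits with a non-zero status when a character cannot be represented in the target character set, so a filter script that uses such a step should check the return value.<br />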
In order to take advantage of the <strong>TET</strong> Connector for Oracle you must make the <strong>TET</strong> filter<br />
script available to Oracle as follows:<br />
> Copy the <strong>TET</strong> filter script to a directory where Oracle can find it:<br />
Unix: copy connectors/Oracle/tetfilter.sh to $ORACLE_HOME/ctx/bin<br />
Windows: copy connectors/Oracle/tetfilter.bat to %ORACLE_HOME%\bin<br />
> Make sure that the TETDIR variable in the <strong>TET</strong> filter script (tetfilter.sh or tetfilter.bat, respectively)<br />
points to the <strong>TET</strong> installation directory.<br />
> If required you can supply more <strong>TET</strong> options for the global, document, or page level<br />
in the TETOPT, DOCOPT, and PAGEOPT variables (see Chapter 10, »<strong>TET</strong> Library API Reference«,<br />
page 141, for option list details). This is especially useful for supplying the <strong>TET</strong><br />
license key, e.g.:<br />
TETOPT="license=aaaaaaa-bbbbbb-cccccc-dddddd-eeeeee"<br />
See Section 0.2, »Applying the <strong>TET</strong> License Key«, page 8, for more options for supplying<br />
the <strong>TET</strong> license key.<br />
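On Unix, the installation steps above can be sketched as follows. The function name install_tet_filter and the example path /opt/tet are assumptions for illustration; the directory names and the variables TETDIR and TETOPT are taken from the description above.<br />

```shell
# Hypothetical helper: copy the TET filter script to the directory
# where Oracle looks for it.
#   $1 = TET installation directory (contains connectors/Oracle/tetfilter.sh)
#   $2 = Oracle home directory (the value of $ORACLE_HOME)
install_tet_filter() {
    cp "$1/connectors/Oracle/tetfilter.sh" "$2/ctx/bin/"
}

# Settings to adjust inside the copied tetfilter.sh (example values only):
# TETDIR="/opt/tet"    # assumed installation path, replace with your own
# TETOPT="license=aaaaaaa-bbbbbb-cccccc-dddddd-eeeeee"
```

A call such as install_tet_filter /opt/tet "$ORACLE_HOME" would then place the script in $ORACLE_HOME/ctx/bin, provided the current user may write to that directory.<br />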
Granting privileges to the Oracle user. The examples below assume an Oracle user<br />
with appropriate privileges to create and query an index. The following commands<br />
grant appropriate privileges to the user HR (these commands must be issued as system<br />
and must be adjusted as appropriate):<br />
SQL> GRANT CTXAPP TO HR;<br />
SQL> GRANT EXECUTE ON CTX_CLS TO HR;<br />
SQL> GRANT EXECUTE ON CTX_DDL TO HR;<br />
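Because the grants must be adjusted to the actual user, a small shell wrapper can generate them. This is a sketch only: the function name print_tet_grants and the file name grants.sql are hypothetical, and running the result assumes that the Oracle command-line client sqlplus is installed and that you can connect as system.<br />

```shell
# Hypothetical helper: print the GRANT statements shown above
# for a given Oracle user.
#   $1 = Oracle user name, e.g. HR
print_tet_grants() {
    printf 'GRANT CTXAPP TO %s;\n' "$1"
    printf 'GRANT EXECUTE ON CTX_CLS TO %s;\n' "$1"
    printf 'GRANT EXECUTE ON CTX_DDL TO %s;\n' "$1"
}

# Example (requires an Oracle client; sqlplus prompts for the password):
# print_tet_grants HR > grants.sql
# sqlplus system @grants.sql
```

Writing the statements to a file first, instead of piping them into sqlplus, keeps standard input free for the password prompt.<br />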