17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual


SQL> GRANT EXECUTE ON CTX_DOC TO HR;
SQL> GRANT EXECUTE ON CTX_OUTPUT TO HR;
SQL> GRANT EXECUTE ON CTX_QUERY TO HR;
SQL> GRANT EXECUTE ON CTX_REPORT TO HR;
SQL> GRANT EXECUTE ON CTX_THES TO HR;

Example A: Store path names of PDF documents in the database. This example stores file name references to the indexed PDF documents in the database. Proceed as follows:

> Change to the following directory in a command prompt:

/connectors/Oracle

> Adjust the tetpath variable in the tetsetup_a.sql script so that it points to the directory where TET is installed.

> Prepare the database: using Oracle’s sqlplus program, create the table pdftable_a, fill this table with path names of PDF documents, and create the index tetindex_a (note that the contents of the tetsetup_a.sql script are slightly platform-dependent because of different path syntax):

SQL> @tetsetup_a.sql

> Query the database using the index:

SQL> select * from pdftable_a where CONTAINS(pdffile, 'Whitepaper', 1) > 0;

> Update the index (required after adding more documents):

SQL> execute ctx_ddl.sync_index('tetindex_a')

> Optionally clean up the database (remove the index and table):

SQL> @tetcleanup_a.sql
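The query and index update steps above can also be issued from an application rather than from sqlplus. The following is a minimal sketch, assuming a JDBC connection to the schema that owns pdftable_a, that the pdffile column holds character data, and that the connected user may execute ctx_ddl.sync_index. The class and method names are hypothetical and not part of the TET package.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of the TET distribution.
class ContainsQuery {
    // JDBC escape syntax for calling the ctx_ddl.sync_index procedure.
    static final String SYNC_CALL = "{call ctx_ddl.sync_index(?)}";

    // Builds the same query as the sqlplus step above, with the search
    // term replaced by a bind variable instead of a literal.
    static String buildQuery(String table, String column) {
        return "SELECT " + column + " FROM " + table
             + " WHERE CONTAINS(" + column + ", ?, 1) > 0";
    }

    // Returns the stored path names of all documents matching the term.
    // Assumes pdffile can be read as a string (path names in Example A).
    static List<String> findDocuments(Connection conn, String term)
            throws SQLException {
        List<String> hits = new ArrayList<>();
        String sql = buildQuery("pdftable_a", "pdffile");
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, term);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    hits.add(rs.getString(1));
                }
            }
        }
        return hits;
    }

    // Synchronizes the index after new rows have been added.
    static void syncIndex(Connection conn, String indexName)
            throws SQLException {
        try (CallableStatement call = conn.prepareCall(SYNC_CALL)) {
            call.setString(1, indexName);
            call.execute();
        }
    }
}
```

A caller would obtain the Connection from the Oracle JDBC driver and then invoke, for example, findDocuments(conn, "Whitepaper") followed by syncIndex(conn, "tetindex_a") after inserting new rows.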

Example B: Store PDF documents as BLOBs in the database and add metadata. This example stores the actual PDF documents as BLOBs in the database. In addition to the PDF data some metadata is extracted with the pCOS interface and stored in dedicated database columns. The tet_pdf_loader Java program stores the PDF documents as BLOBs in the database. In order to demonstrate metadata handling the program uses the pCOS interface to extract the document title (via the pCOS path /Info/Title) and the number of pages in the document (via the pCOS path length:pages). The document title and the page count will be stored in separate columns in the database. Proceed as follows to run this example:

> Change to the following directory in a command prompt:

/connectors/Oracle

> Prepare the database: using Oracle’s sqlplus program, create the table pdftable_b and the corresponding index tetindex_b:

SQL> @tetsetup_b.sql

> Populate the database: fill the table with PDF documents and metadata via JDBC (note that this is not possible with stored procedures). The ant build file supplied with the TET package expects the ojdbc14.jar file for the Oracle JDBC driver in the same directory as the tet_pdf_loader.java source code. Specify a suitable JDBC connection string with the ant command. The build file contains a description of all properties that can be used to specify options for the Ant build. You can supply values for

42 Chapter 4: TET Connectors
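To illustrate the kind of JDBC insert that Example B describes (PDF data as a BLOB plus title and page count in separate columns), here is a minimal sketch. It is not the tet_pdf_loader source: the class name and the column names title, pagecount, and pdffile are assumptions for illustration, and the actual columns are defined in tetsetup_b.sql. The title and page count are passed in as parameters; in the real program they would come from the pCOS paths /Info/Title and length:pages.

```java
import java.io.ByteArrayInputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical loader sketch; not the tet_pdf_loader program itself.
class PdfBlobLoader {
    // Column names are assumptions; check tetsetup_b.sql for the real ones.
    static final String INSERT_SQL =
        "INSERT INTO pdftable_b (title, pagecount, pdffile) VALUES (?, ?, ?)";

    // Stores one PDF document together with its metadata in a single row.
    static void storeDocument(Connection conn, String title,
                              int pageCount, byte[] pdfData)
            throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(INSERT_SQL)) {
            stmt.setString(1, title);
            stmt.setInt(2, pageCount);
            // Stream the PDF bytes into the BLOB column.
            stmt.setBinaryStream(3, new ByteArrayInputStream(pdfData),
                                 pdfData.length);
            stmt.executeUpdate();
        }
    }
}
```

Passing the metadata as plain parameters keeps the database code independent of how the values are extracted, so the same method works whether they come from pCOS or from another source.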
