17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

6 <strong>Text</strong> <strong>Extraction</strong><br />

6.1 Document Domains<br />

PDF documents may contain text in many other places than only the page contents.<br />

While most applications will deal with the page contents only, in many situations other<br />

document domains may be relevant as well.<br />

While the page contents can be retrieved with the workhorse functions <strong>TET</strong>_get_<br />

text( ) and <strong>TET</strong>_get_image( ), the integrated pCOS interface plays a crucial role for retrieving<br />

text from other document domains.<br />

In the remaining section we provide information on domain searching with the <strong>TET</strong><br />

library and <strong>TET</strong>ML. In addition, we will summarize how to search these document domains<br />

with Acrobat 8/9. This is important to locate search hits in Acrobat.<br />

<strong>Text</strong> on the page. Page contents are the main source of text in PDF. <strong>Text</strong> on a page is<br />

rendered with fonts and encoded using one of the many encoding techniques available<br />

in PDF.<br />

> How to display with Acrobat 8/9: page contents are always visible<br />

> How to search a single PDF with Acrobat 8/9: Edit, Find or Edit, Search. <strong>TET</strong> may be able<br />

to process the text in documents where Acrobat does not correctly map glyphs to<br />

Unicode values. In this situation you can use the <strong>TET</strong> Plugin which is based on <strong>TET</strong>.<br />

(see Section 4.1, »Free <strong>TET</strong> Plugin for Adobe Acrobat«, page 35). The <strong>TET</strong> Plugin offers<br />

its own search dialog via Plug-Ins, <strong>PDFlib</strong> <strong>TET</strong> Plugin... <strong>TET</strong> Find. However, this is not intended<br />

as a full-blown search facility.<br />

> How to search multiple PDFs with Acrobat 8/9: Edit Search and in Where would you like<br />

to search? select All PDF Documents in, and browse to a folder with PDF documents.<br />

> Sample code for the <strong>TET</strong> library: extractor mini sample<br />

> <strong>TET</strong>ML element: /<strong>TET</strong>/Document/Pages/Page<br />

Predefined document info entries. Classical document info entries are key/value<br />

pairs.<br />

> How to display with Acrobat 8/9: File, Properties...<br />

> How to search a single PDF with Acrobat 8/9: not available<br />

> How to search multiple PDFs with Acrobat 8/9: click Edit, Search and Use Advanced<br />

Search Options. In the Look In: pull-down select a folder of PDF documents and in the<br />

pull-down menu Use these additional criteria select one of Date Created, Date Modified,<br />

Author, Title, Subject, Keywords.<br />

> Sample code for the <strong>TET</strong> library: dumper mini sample<br />

> <strong>TET</strong>ML element: /<strong>TET</strong>/Document/DocInfo<br />

Custom document info entries. Custom document info entries can be defined in addition<br />

to the standard entries.<br />

> How to display with Acrobat 8/9: File, Properties..., Custom (not available in the free<br />

Adobe Reader)<br />

> How to search with Acrobat 8/9: not available<br />

> Sample code for the <strong>TET</strong> library: dumper mini sample<br />

> <strong>TET</strong>ML element: /<strong>TET</strong>/Document/DocInfo/Custom<br />

6.1 Document Domains 57

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!