PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
6 <strong>Text</strong> <strong>Extraction</strong><br />
6.1 Document Domains<br />
PDF documents may contain text in many other places than only the page contents.<br />
While most applications will deal with the page contents only, in many situations other<br />
document domains may be relevant as well.<br />
While the page contents can be retrieved with the workhorse functions <strong>TET</strong>_get_<br />
text( ) and <strong>TET</strong>_get_image( ), the integrated pCOS interface plays a crucial role for retrieving<br />
text from other document domains.<br />
In the remaining section we provide information on domain searching with the <strong>TET</strong><br />
library and <strong>TET</strong>ML. In addition, we will summarize how to search these document domains<br />
with Acrobat 8/9. This is important to locate search hits in Acrobat.<br />
<strong>Text</strong> on the page. Page contents are the main source of text in PDF. <strong>Text</strong> on a page is<br />
rendered with fonts and encoded using one of the many encoding techniques available<br />
in PDF.<br />
> How to display with Acrobat 8/9: page contents are always visible<br />
> How to search a single PDF with Acrobat 8/9: Edit, Find or Edit, Search. <strong>TET</strong> may be able<br />
to process the text in documents where Acrobat does not correctly map glyphs to<br />
Unicode values. In this situation you can use the <strong>TET</strong> Plugin which is based on <strong>TET</strong>.<br />
(see Section 4.1, »Free <strong>TET</strong> Plugin for Adobe Acrobat«, page 35). The <strong>TET</strong> Plugin offers<br />
its own search dialog via Plug-Ins, <strong>PDFlib</strong> <strong>TET</strong> Plugin... <strong>TET</strong> Find. However, this is not intended<br />
as a full-blown search facility.<br />
> How to search multiple PDFs with Acrobat 8/9: Edit Search and in Where would you like<br />
to search? select All PDF Documents in, and browse to a folder with PDF documents.<br />
> Sample code for the <strong>TET</strong> library: extractor mini sample<br />
> <strong>TET</strong>ML element: /<strong>TET</strong>/Document/Pages/Page<br />
Predefined document info entries. Classical document info entries are key/value<br />
pairs.<br />
> How to display with Acrobat 8/9: File, Properties...<br />
> How to search a single PDF with Acrobat 8/9: not available<br />
> How to search multiple PDFs with Acrobat 8/9: click Edit, Search and Use Advanced<br />
Search Options. In the Look In: pull-down select a folder of PDF documents and in the<br />
pull-down menu Use these additional criteria select one of Date Created, Date Modified,<br />
Author, Title, Subject, Keywords.<br />
> Sample code for the <strong>TET</strong> library: dumper mini sample<br />
> <strong>TET</strong>ML element: /<strong>TET</strong>/Document/DocInfo<br />
Custom document info entries. Custom document info entries can be defined in addition<br />
to the standard entries.<br />
> How to display with Acrobat 8/9: File, Properties..., Custom (not available in the free<br />
Adobe Reader)<br />
> How to search with Acrobat 8/9: not available<br />
> Sample code for the <strong>TET</strong> library: dumper mini sample<br />
> <strong>TET</strong>ML element: /<strong>TET</strong>/Document/DocInfo/Custom<br />
6.1 Document Domains 57