PDFlib Text Extraction Toolkit (TET) Manual
6 Text Extraction
6.1 PDF Document Domains
PDF documents may contain text in many places other than the page contents.
While most applications deal only with the page contents, other document
domains may be relevant in many situations.

While the page contents can be retrieved with the workhorse functions
TET_get_text( ) and TET_get_image( ), the integrated pCOS interface plays a
crucial role in retrieving text from other document domains.

The remainder of this section provides information on searching these domains
with the TET library and TETML. In addition, we summarize how to search these
document domains with Acrobat X/XI, which is important for locating search hits
in Acrobat.
Text on the page. Page contents are the main source of text in PDF. Text on a page is
rendered with fonts and encoded using one of the many encoding techniques available
in PDF.
> How to display with Acrobat: page contents are always visible<br />
> How to search a single PDF with Acrobat X/XI: Edit, Find or Edit, [Advanced] Search. TET
may be able to process the text in documents where Acrobat does not correctly map
glyphs to Unicode values. In this situation you can use the TET Plugin, which is based
on TET (see Section 4.1, »Free TET Plugin for Adobe Acrobat«, page 43). The TET Plugin
offers its own search dialog via Plug-Ins, PDFlib TET Plugin..., TET Find. However, it is not
intended as a full-blown search facility.
> How to search multiple PDFs with Acrobat X/XI: Edit, [Advanced] Search; under Where
would you like to search? select All PDF Documents in and browse to a folder with PDF
documents.
> Sample code for the TET library: extractor mini sample
> TETML element: /TET/Document/Pages/Page
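As a rough illustration of the TETML route, the following Python sketch collects the text below each /TET/Document/Pages/Page element using only the standard library. The sample fragment is hand-written and heavily simplified: the Content, Para, Word, and Text element names and the number attribute are assumptions made for illustration, and real TETML output carries an XML namespace plus further attributes and nesting. The helper names local, select, and page_texts are hypothetical and not part of TET.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified TETML-like fragment (assumption for illustration only).
SAMPLE_TETML = """<TET>
  <Document filename="sample.pdf">
    <Pages>
      <Page number="1">
        <Content><Para>
          <Word><Text>Hello</Text></Word>
          <Word><Text>world</Text></Word>
        </Para></Content>
      </Page>
      <Page number="2">
        <Content><Para>
          <Word><Text>Second</Text></Word>
          <Word><Text>page</Text></Word>
        </Para></Content>
      </Page>
    </Pages>
  </Document>
</TET>"""


def local(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def select(root, path):
    """Return the elements matching an absolute path, comparing local names only."""
    steps = path.strip("/").split("/")
    if local(root.tag) != steps[0]:
        return []
    current = [root]
    for step in steps[1:]:
        current = [child for elem in current for child in elem
                   if local(child.tag) == step]
    return current


def page_texts(xml_string):
    """Map each page number to the space-joined contents of its Text elements."""
    root = ET.fromstring(xml_string)
    texts = {}
    for page in select(root, "/TET/Document/Pages/Page"):
        words = [node.text or "" for node in page.iter()
                 if local(node.tag) == "Text"]
        texts[int(page.get("number"))] = " ".join(words)
    return texts


if __name__ == "__main__":
    for number, text in page_texts(SAMPLE_TETML).items():
        print(number, text)
```

Comparing local names rather than fully qualified tags keeps the sketch independent of the namespace URI, which may differ between TETML versions.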
Predefined document info entries. Traditional document info entries are key/value<br />
pairs.<br />
> How to display with Acrobat X/XI: File, Properties...<br />
> How to search a single PDF with Acrobat X/XI: not available<br />
> How to search multiple PDFs with Acrobat X/XI: click Edit, [Advanced] Search and Show<br />
More Options near the bottom of the dialog. In the Look In: pull-down select a folder of<br />
PDF documents and in the pull-down menu Use these additional criteria select one of<br />
Date Created, Date Modified, Author, Title, Subject, Keywords.<br />
> Sample code for the TET library: dumper mini sample
> TETML element: /TET/Document/DocInfo
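The key/value nature of document info entries can be sketched by reading the children of /TET/Document/DocInfo into a Python dictionary. This is only a minimal sketch: the sample fragment is hand-written, its namespace URI is a placeholder, and the assumption that each entry appears as a child element named after its key (Author, Title) should be checked against actual TETML output. The function name doc_info is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; the namespace URI is a placeholder, not the real one.
SAMPLE_TETML = """<TET xmlns="http://example.com/hypothetical-tetml">
  <Document filename="sample.pdf">
    <DocInfo>
      <Author>Jane Doe</Author>
      <Title>Annual Report</Title>
    </DocInfo>
  </Document>
</TET>"""


def local(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def doc_info(xml_string):
    """Return the entries below /TET/Document/DocInfo as a key/value dict."""
    root = ET.fromstring(xml_string)
    info = {}
    if local(root.tag) != "TET":
        return info
    for document in root:
        if local(document.tag) != "Document":
            continue
        for section in document:
            if local(section.tag) != "DocInfo":
                continue
            # Assumed layout: one child element per entry, named after its key.
            for entry in section:
                info[local(entry.tag)] = (entry.text or "").strip()
    return info


if __name__ == "__main__":
    for key, value in doc_info(SAMPLE_TETML).items():
        print(f"{key}: {value}")
```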
Custom document info entries. Custom document info entries can be defined in addition<br />
to the standard entries.<br />
> How to display with Acrobat X/XI: File, Properties..., Custom (not available in the free<br />
Adobe Reader)<br />
> How to search with Acrobat X/XI: not available<br />