PDFlib Text Extraction Toolkit (TET) Manual
Geometry. The geometry features may be useful for some applications:
> The TET_get_char_info() interface is only required if you need the position of text on
the page, the respective font name, or other details. If you are not interested in text
coordinates, calling TET_get_text() will be sufficient.
> If you have advance information about the layout of pages, you can use the includebox
and/or excludebox options in TET_open_page() to get rid of headers, footers, or
similar items which are not part of the main text.
Unknown characters. If TET is unable to determine the appropriate Unicode mapping
for a character, it represents that character with the Unicode replacement character
U+FFFD. If your application is not concerned about unmappable characters, you can
simply discard all occurrences of this character. Applications which require more
fine-grained results could take the corresponding font into account and use it to decide
how to process unmappable characters. Use the following document option to replace all
unmapped characters with a question mark:
unknownchar=?
The special character U+0000 means »no character«. Use the following document option
to remove all unmapped characters from the output:
unknownchar=U+0000
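As an alternative to the unknownchar option, the replacement character can be discarded after extraction. The following is a minimal Python sketch that assumes the text has already been obtained (for example from TET_get_text()) as an ordinary string; the helper name strip_unmapped is hypothetical and not part of TET.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, used for characters without a Unicode mapping


def strip_unmapped(text):
    """Return (text without U+FFFD, number of characters dropped).

    Hypothetical helper for illustration; it works on any string and
    does not call TET itself.
    """
    dropped = text.count(REPLACEMENT_CHAR)
    return text.replace(REPLACEMENT_CHAR, ""), dropped


if __name__ == "__main__":
    cleaned, dropped = strip_unmapped("Total: 42\ufffd EUR")
    print(cleaned)
    print(dropped)
```

Returning the number of dropped characters lets the caller log how much text was affected instead of discarding it silently.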
Legal documents. When dealing with legal documents there is usually zero tolerance
for wrong Unicode mappings, since they might alter the content or interpretation of a
document. In many cases the text position is not required, and the text must be extracted
word by word. Recommendations:
> Use the granularity=word option in TET_open_page().
> Use the password option with the appropriate document password in
TET_open_document() if you must process documents which require a password for opening, or
if content extraction is not allowed in the permission settings.
> For absolute text fidelity: stop processing as soon as the unknown field in the character
info structure returned by TET_get_char_info() is 1, or if the Unicode replacement
character U+FFFD is part of the string returned by TET_get_text(). In TETML with one
of the text modes glyph or wordplus you can identify this situation by the following
attribute in the Glyph element:
unknown="true"
Do not set the unknownchar option to any common character, since you would then be
unable to distinguish it from correctly mapped characters without checking the
unknown field.
> To further ensure text fidelity, you may also want to disable text extraction for text
which is not visible on the page:
ignoreinvisibletext=true
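The "stop processing" recommendation above can be sketched as a strict check on the extracted words. This Python example assumes the words have already been collected as strings (for example from repeated TET_get_text() calls with granularity=word) and that unknownchar was left at its default, so unmapped characters appear as U+FFFD. The names UnmappedCharacterError and require_full_fidelity are hypothetical and not part of TET.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD marks a character without a Unicode mapping


class UnmappedCharacterError(ValueError):
    """Raised when extracted text contains an unmapped character."""


def require_full_fidelity(words):
    """Return the words unchanged, or raise at the first one containing U+FFFD.

    Hypothetical helper for illustration; it inspects plain strings only.
    """
    checked = []
    for index, word in enumerate(words):
        if REPLACEMENT_CHAR in word:
            raise UnmappedCharacterError(
                "unmapped character in word %d: %r" % (index, word)
            )
        checked.append(word)
    return checked


if __name__ == "__main__":
    try:
        require_full_fidelity(["The", "party", "owes", "1\ufffd000", "EUR"])
    except UnmappedCharacterError as exc:
        print("stopped:", exc)
```

Raising an exception rather than returning a flag makes it harder for a caller to continue with partially trustworthy text by accident.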
Processing documents with PDFlib+PDI. When using PDFlib+PDI to process PDF documents
on a per-page basis, you can integrate TET for controlling the splitting or merging
process. For example, you could split a PDF document based on the contents of a page. If
you have control over the creation process, you can insert separator pages with suitable
processing instructions in the text. The TET Cookbook contains examples for analyzing
documents with TET and then processing them with PDFlib+PDI.
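The separator-page idea can be illustrated with a small planning step that decides which pages belong to which output document. This Python sketch assumes the text of each page has already been extracted into a list of strings. The marker text ##SPLIT## and the function group_pages are hypothetical, and the actual page import with PDFlib+PDI is not shown.

```python
SEPARATOR_MARKER = "##SPLIT##"  # hypothetical marker text printed on separator pages


def group_pages(page_texts, marker=SEPARATOR_MARKER):
    """Return lists of 1-based page numbers, one list per output document.

    A page whose text contains the marker starts a new group and is
    itself omitted from the output.
    """
    groups = []
    current = []
    for pageno, text in enumerate(page_texts, start=1):
        if marker in text:
            if current:
                groups.append(current)
            current = []
        else:
            current.append(pageno)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    pages = ["Invoice A", "Invoice A, page 2", "##SPLIT##", "Invoice B"]
    print(group_pages(pages))  # [[1, 2], [4]]
```

Each resulting list of page numbers could then drive the per-page processing that writes one output document.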
5.3 Recommendations for common Scenarios