17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Geometry. The geometry features may be useful for some applications:

> The TET_get_char_info( ) interface is only required if you need the position of text on the page, the respective font name, or other details. If you are not interested in text coordinates, calling TET_get_text( ) is sufficient.

> If you have advance information about the page layout, you can use the includebox and/or excludebox options in TET_open_page( ) to exclude headers, footers, or similar items which are not part of the main text.

Unknown characters. If TET is unable to determine the appropriate Unicode mapping for one or more characters, it represents each of them with the Unicode replacement character U+FFFD. If your application is not concerned about unmappable characters, you can simply discard all occurrences of this character. Applications which require more fine-grained results could take the corresponding font into account and use it to decide how to process unmappable characters. Use the following document option to replace all unmapped characters with a question mark:

unknownchar=?

The special character U+0000 means »no character«. Use the following document option to remove all unmapped characters from the output:

unknownchar=U+0000
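
If you would rather keep the default U+FFFD in the TET output and discard it in your own code, the filtering step could look like the following minimal sketch. It assumes that the extracted text is available as a NUL-terminated UTF-8 string; the function name strip_replacement_char is hypothetical and not part of the TET API.

```c
#include <stddef.h>
#include <string.h>

/* UTF-8 encoding of U+FFFD, the Unicode replacement character. */
static const char REPLACEMENT_UTF8[] = "\xEF\xBF\xBD";

/*
 * Hypothetical helper (not a TET function): remove every U+FFFD from a
 * NUL-terminated UTF-8 string in place.
 * Returns the number of characters removed.
 */
size_t strip_replacement_char(char *utf8)
{
    const size_t seqlen = sizeof(REPLACEMENT_UTF8) - 1;  /* 3 bytes */
    char *src = utf8;
    char *dst = utf8;
    size_t removed = 0;

    while (*src != '\0') {
        if (strncmp(src, REPLACEMENT_UTF8, seqlen) == 0) {
            src += seqlen;       /* skip the unmapped character */
            removed++;
        } else {
            *dst++ = *src++;     /* keep every other byte */
        }
    }
    *dst = '\0';
    return removed;
}
```

The return value tells the caller how many characters were dropped, which can be logged if the application wants to track extraction quality.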

Legal documents. When dealing with legal documents there is usually zero tolerance for wrong Unicode mappings since they might alter the content or interpretation of a document. In many cases the text position is not required, and the text must be extracted word by word. Recommendations:

> Use the granularity=word option in TET_open_page( ).

> Use the password option with the appropriate document password in TET_open_document( ) if you must process documents which require a password for opening, or if content extraction is not allowed in the permission settings.

> For absolute text fidelity: stop processing as soon as the unknown field in the character info structure returned by TET_get_char_info( ) is 1, or if the Unicode replacement character U+FFFD is part of the string returned by TET_get_text( ). In TETML with one of the text modes glyph or wordplus you can identify this situation by the following attribute in the Glyph element:

unknown="true"

Do not set the unknownchar option to any common character, since without checking the unknown field you may be unable to distinguish it from correctly mapped characters.

> To further ensure text fidelity, you may want to disable text extraction for text which is not visible on the page:

ignoreinvisibletext=true
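
The "stop processing" rule for word-by-word extraction can be sketched as follows. This is an illustration under stated assumptions, not TET code: the words are assumed to have been collected already as NUL-terminated UTF-8 strings, and the names has_unmapped_char and first_unmapped_word are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a TET function): true if the UTF-8 string
 * contains U+FFFD, which is encoded as the bytes EF BF BD.
 */
bool has_unmapped_char(const char *utf8)
{
    return strstr(utf8, "\xEF\xBF\xBD") != NULL;
}

/*
 * Return the index of the first word containing U+FFFD, or `count`
 * if all words are clean. A caller that needs absolute text fidelity
 * would stop processing the document at the returned index.
 */
size_t first_unmapped_word(const char *const words[], size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (has_unmapped_char(words[i]))
            return i;
    }
    return count;
}
```

A result equal to count means that no word had to be rejected. Note that this check only works if the unknownchar option is left at its default, so that unmapped characters actually appear as U+FFFD.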

Processing documents with PDFlib+PDI. When using PDFlib+PDI to process PDF documents on a per-page basis you can integrate TET for controlling the splitting or merging process. For example, you could split a PDF document based on the contents of a page. If you have control over the creation process you can insert separator pages with suitable processing instructions in the text. The TET Cookbook contains examples for analyzing documents with TET and then processing them with PDFlib+PDI.
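
The separator-page idea can be sketched as a small planning step between extraction and splitting. The code below is not taken from the TET Cookbook. It assumes that the text of each page has already been extracted into an array of strings, and that separator pages carry a marker string chosen by the creator of the document (for example "##SPLIT##", a made-up value). The type page_range and the function split_at_separators are hypothetical names; importing the resulting page ranges with PDFlib+PDI is outside the scope of the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Consecutive pages (1-based, inclusive) which form one output document. */
typedef struct {
    int first;
    int last;
} page_range;

/*
 * Hypothetical helper: plan a split at separator pages.
 * page_text[i] holds the extracted text of page i+1. A page whose text
 * contains `marker` is treated as a separator and left out of the output.
 * Writes at most max_ranges entries and returns the number written.
 */
size_t split_at_separators(const char *const page_text[], int page_count,
                           const char *marker,
                           page_range ranges[], size_t max_ranges)
{
    size_t n = 0;
    int start = 1;   /* first page of the range being collected */
    int page;

    /* page_count + 1 acts as a virtual separator after the last page */
    for (page = 1; page <= page_count + 1; page++) {
        int is_boundary = (page > page_count) ||
                          strstr(page_text[page - 1], marker) != NULL;
        if (!is_boundary)
            continue;
        if (page > start && n < max_ranges) {   /* ignore empty ranges */
            ranges[n].first = start;
            ranges[n].last = page - 1;
            n++;
        }
        start = page + 1;
    }
    return n;
}
```

Each returned range corresponds to one output document. Consecutive separator pages, or a separator on the first or last page, do not produce empty ranges.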

5.3 Recommendations for common Scenarios
