17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual



Geometry. The geometry features may be useful for some applications:

> The TET_get_char_info() interface is only required if you need the position of text on the page, the respective font name, or other details. If you are not interested in text coordinates, calling TET_get_text() will be sufficient.

> If you have advance information about the layout of pages, you can use the includebox and/or excludebox options in TET_open_page() to get rid of headers, footers, or similar items which are not part of the main text.
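When the header and footer regions are known in advance, the page option list can be assembled in application code. The following Python sketch only builds the option string and does not call TET. The helper name `page_optlist` is hypothetical, and the rectangle syntax (four numbers per rectangle, rectangles grouped in nested braces) is an assumption that should be checked against the option reference before use.

```python
def page_optlist(granularity=None, includebox=None, excludebox=None):
    """Build an option-list string intended for TET_open_page().

    Rectangles are (llx, lly, urx, ury) tuples. The nested-brace rectangle
    syntax used here is an assumption, not taken from this section.
    """
    def rects(boxes):
        # Format each rectangle as "{llx lly urx ury}" and wrap the list in braces.
        return "{" + " ".join(
            "{" + " ".join(f"{v:g}" for v in box) + "}" for box in boxes
        ) + "}"

    parts = []
    if granularity:
        parts.append(f"granularity={granularity}")
    if includebox:
        parts.append("includebox=" + rects(includebox))
    if excludebox:
        parts.append("excludebox=" + rects(excludebox))
    return " ".join(parts)


# Exclude a 50-unit footer band and a 50-unit header band on a 595 x 842 page.
optlist = page_optlist(excludebox=[(0, 0, 595, 50), (0, 792, 595, 842)])
print(optlist)
```

The resulting string would be passed as the option list of TET_open_page(); the page size and band heights are illustrative values only.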

Unknown characters. If TET is unable to determine the appropriate Unicode mapping for one or more characters, it will represent each of them with the Unicode replacement character U+FFFD. If your application is not concerned about unmappable characters, you can simply discard all occurrences of this character. Applications which require more fine-grained results could take the corresponding font into account and use it to decide how to process unmappable characters. Use the following document option to replace all unmapped characters with a question mark:

unknownchar=?

Use the following document option to remove all unmapped characters from the output:

fold={{[:Private_Use:] remove} {[U+FFFD] remove} default}
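If the text has already been extracted without these options, a comparable cleanup can be done in application code. The following Python sketch is a post-processing approximation, not TET's own implementation; the function names `drop_unmapped` and `mark_unmapped` are hypothetical.

```python
import unicodedata

REPLACEMENT = "\uFFFD"  # Unicode replacement character


def drop_unmapped(text):
    """Remove U+FFFD and private-use code points (category Co) from text."""
    return "".join(
        ch for ch in text
        if ch != REPLACEMENT and unicodedata.category(ch) != "Co"
    )


def mark_unmapped(text, marker="?"):
    """Replace every U+FFFD with a visible marker character."""
    return text.replace(REPLACEMENT, marker)


# Sample string with one replacement character and one private-use character.
sample = "Total: 42\uFFFD EUR \uE000"
print(repr(drop_unmapped(sample)))
print(repr(mark_unmapped(sample)))
```

Note that replacing U+FFFD with a common character such as a question mark makes unmapped characters indistinguishable from genuine question marks in the output.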

Complex layouts. Some classes of documents often use very elaborate page layouts. For example, with magazines and periodicals TET may not be able to properly determine the relationship of columns on the page. In such situations it is possible to enhance the extracted text at the expense of processing time. Suitable options for this purpose are summarized in Section 6.6, »Layout Analysis«, page 88. See Table 10.12, page 174, for more details on relevant options.

Legal documents. When dealing with legal documents there is usually zero tolerance for wrong Unicode mappings since they might alter the content or interpretation of a document. In many cases the text position is not required, and the text must be extracted word by word. Recommendations:

> Use the granularity=word option in TET_open_page().

> Use the password option with the appropriate document password in TET_open_document() if you must process documents which require a password for opening, or the shrug option if content extraction is not allowed in the permission settings and you are in a legal position to extract text from the document (see »The »shrug« feature for protected documents«, page 60).

> For absolute text fidelity: stop processing as soon as the unknown field in the character info structure returned by TET_get_char_info() is 1, or if the Unicode replacement character U+FFFD is part of the string returned by TET_get_text(). In TETML with one of the text modes glyph or wordplus you can identify this situation by the following attribute in the Glyph element:

unknown="true"

Do not set the unknownchar option to any common character since you may be unable to distinguish it from correctly mapped characters without checking the unknown field.
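The fail-fast rule for absolute text fidelity can be sketched in Python as follows. The sketch works on strings that have already been extracted and on a TETML string; it does not call TET. The names `UnmappedCharacterError`, `check_text`, and `count_unknown_glyphs` are hypothetical, and the sample XML is a heavily simplified stand-in for real TETML, which has more structure and a namespace.

```python
import xml.etree.ElementTree as ET

REPLACEMENT = "\uFFFD"  # Unicode replacement character


class UnmappedCharacterError(Exception):
    """Raised when extracted text contains a character without a Unicode mapping."""


def check_text(chunks):
    """Return the joined text, or raise on the first chunk containing U+FFFD."""
    accepted = []
    for index, chunk in enumerate(chunks):
        if REPLACEMENT in chunk:
            raise UnmappedCharacterError(f"unmapped character in chunk {index}")
        accepted.append(chunk)
    return " ".join(accepted)


def count_unknown_glyphs(tetml):
    """Count Glyph elements carrying unknown="true" in a TETML string."""
    root = ET.fromstring(tetml)
    return sum(
        1 for el in root.iter()
        # Compare the local name only, so any namespace prefix is ignored.
        if el.tag.rsplit("}", 1)[-1] == "Glyph" and el.get("unknown") == "true"
    )


# Simplified stand-in for TETML output in glyph or wordplus mode.
sample = '<Word><Glyph>a</Glyph><Glyph unknown="true">?</Glyph></Word>'
print(count_unknown_glyphs(sample))
```

A document would be rejected as soon as `check_text` raises or `count_unknown_glyphs` returns a nonzero value.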

66 Chapter 5: Configuration
