PDFlib Text Extraction Toolkit (TET) Manual
Geometry. The geometry features may be useful for some applications:
> The TET_get_char_info( ) interface is only required if you need the position of text on
the page, the respective font name, or other details. If you are not interested in text
coordinates, calling TET_get_text( ) is sufficient.
> If you know the page layout in advance, you can use the includebox
and/or excludebox options in TET_open_page( ) to get rid of headers, footers, or
similar items which are not part of the main text.
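As an illustration of the second item, the following Python sketch assembles an option list string for excluding a header and a footer band. It does not call TET itself. The helper name box_option is hypothetical, and the rectangle syntax (a list of rectangles, each given as lower-left and upper-right corner) as well as the A4 page size and band heights are assumptions which should be checked against the option descriptions for TET_open_page( ).

```python
def box_option(name, rectangles):
    """Build an includebox/excludebox option string.

    Assumption: each rectangle is written as {llx lly urx ury} in
    default PDF coordinates (points, origin in the lower left corner),
    and the option value is a list of such rectangles.
    """
    if name not in ("includebox", "excludebox"):
        raise ValueError("unsupported option name: %r" % name)

    parts = []
    for llx, lly, urx, ury in rectangles:
        # Reject empty or inverted rectangles early.
        if llx >= urx or lly >= ury:
            raise ValueError("rectangle must have positive width and height")
        parts.append("{%g %g %g %g}" % (llx, lly, urx, ury))

    return "%s={%s}" % (name, " ".join(parts))


# Hypothetical A4 page (595 x 842 points): drop a 50 point footer band
# at the bottom and a 50 point header band at the top of the page.
page_options = box_option("excludebox", [(0, 0, 595, 50), (0, 792, 595, 842)])
print(page_options)
```

The resulting string would be passed as the page option list, possibly combined with other options such as granularity.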
Unknown characters. If TET is unable to determine the appropriate Unicode mapping
for one or more characters, it represents each of them with the Unicode replacement character
U+FFFD. If your application is not concerned about unmappable characters, you can
simply discard all occurrences of this character. Applications which require more fine-grained
results could take the corresponding font into account, and use it to decide on
processing of unmappable characters. Use the following document option to replace all
unmapped characters with a question mark:
unknownchar=?
Use the following document option to remove all unmapped characters from the output:
fold={{[:Private_Use:] remove} {[U+FFFD] remove} default}
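If the text has already been extracted without these options, an application can apply the same policy afterwards. The following Python sketch works on an ordinary string and does not use TET. The function names are hypothetical, and treating [:Private_Use:] as Unicode general category Co is an assumption about how that set is defined.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # Unicode replacement character U+FFFD


def replace_unknown(text, substitute="?"):
    """Mimic the effect of unknownchar=? on already extracted text."""
    return text.replace(REPLACEMENT, substitute)


def remove_unknown(text):
    """Drop U+FFFD and private use characters from extracted text.

    Assumption: [:Private_Use:] corresponds to general category "Co".
    """
    return "".join(
        ch for ch in text
        if ch != REPLACEMENT and unicodedata.category(ch) != "Co"
    )


sample = "Price\ufffd 10\ue000 EUR"
print(replace_unknown(sample))
print(remove_unknown(sample))
```

Replacing with a question mark loses the distinction between unmapped characters and genuine question marks, so removal or an explicit check for U+FFFD is the safer choice where that distinction matters.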
Complex layouts. Some classes of documents often use very elaborate page layouts.
For example, with magazines and periodicals TET may not be able to properly determine
the relationship of columns on the page. In such situations it is possible to enhance
the extracted text at the expense of processing time. Suitable options for this
purpose are summarized in Section 6.6, »Layout Analysis«, page 88. See Table 10.12, page
174, for more details on relevant options.
Legal documents. When dealing with legal documents there is usually zero tolerance
for wrong Unicode mappings since they might alter the content or interpretation of a
document. In many cases the text position is not required, and the text must be extracted
word by word. Recommendations:
> Use the granularity=word option in TET_open_page( ).
> Use the password option with the appropriate document password in
TET_open_document( ) if you must process documents which require a password for opening, or
the shrug option if content extraction is not allowed in the permission settings and
you are in a legal position to extract text from the document (see »The »shrug« feature
for protected documents«, page 60).
> For absolute text fidelity: stop processing as soon as the unknown field in the character
info structure returned by TET_get_char_info( ) is 1, or if the Unicode replacement
character U+FFFD is part of the string returned by TET_get_text( ). In TETML with one
of the text modes glyph or wordplus you can identify this situation by the following
attribute in the Glyph element:
unknown="true"
Do not set the unknownchar option to any common character, since you would then be unable
to distinguish it from correctly mapped characters without checking the
unknown field.
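The fidelity check in the last item can be sketched in Python as follows. The code does not use TET: it checks a string of the kind TET_get_text( ) would return, and scans an XML string for Glyph elements carrying unknown="true". All function and class names are hypothetical, and the sample fragment is a heavily simplified stand-in rather than real TETML. Elements are matched by local name so that the sketch does not depend on a particular namespace.

```python
import xml.etree.ElementTree as ET

REPLACEMENT = "\ufffd"  # Unicode replacement character U+FFFD


class UnmappedCharacterError(Exception):
    """Raised when extracted text contains an unmapped character."""


def require_fully_mapped(text):
    """Return text unchanged, or stop if it contains U+FFFD."""
    pos = text.find(REPLACEMENT)
    if pos != -1:
        raise UnmappedCharacterError("U+FFFD found at offset %d" % pos)
    return text


def local_name(tag):
    """Strip an optional {namespace} prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def count_unknown_glyphs(xml_text):
    """Count Glyph elements which carry the attribute unknown="true"."""
    root = ET.fromstring(xml_text)
    return sum(
        1 for element in root.iter()
        if local_name(element.tag) == "Glyph"
        and element.get("unknown") == "true"
    )


# Simplified, hypothetical fragment; real TETML has more structure.
SAMPLE_TETML = (
    "<Page><Word>"
    "<Glyph>A</Glyph>"
    '<Glyph unknown="true">\ufffd</Glyph>'
    "</Word></Page>"
)

if count_unknown_glyphs(SAMPLE_TETML) > 0:
    print("document contains unmapped glyphs, stopping")
```

Raising an exception rather than silently repairing the text matches the zero-tolerance requirement: the caller has to decide explicitly what to do with a document that cannot be mapped completely.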
66 Chapter 5: Configuration