PDFlib Text Extraction Toolkit (TET) Manual
numerical glyph names created by various common applications and drivers. Since the<br />
same glyph names may be created for different encodings you can provide the<br />
encodinghint option to <strong>TET</strong>_open_document( ) in order to specify the target encoding for<br />
schematic glyph names encountered in the document. For example, if you know that<br />
the document contains Russian text, but the text cannot successfully be extracted for<br />
lack of information in the PDF, you can supply the option encodinghint=cp1251 to specify<br />
a Cyrillic codepage.<br />
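To see why the target encoding matters, consider what a single numerical code means under different codepages. The following is a minimal sketch, not TET code: the helper name cp1251_to_unicode is hypothetical, and it deliberately covers only ASCII, the letters at 0xC0 to 0xFF and the two letters Ё/ё of Windows codepage 1251. The same code 0xC0 would denote U+00C0 (À) in WinAnsi, but U+0410 (Cyrillic А) in cp1251.

```c
#include <stdint.h>

/* Illustrative only: a partial cp1251 decoder, not part of the TET API.
 * Covers ASCII, the letters at 0xC0..0xFF, and the letters at 0xA8/0xB8.
 * Every other byte yields U+FFFD because this sketch omits those entries. */
uint32_t cp1251_to_unicode(unsigned char byte)
{
    if (byte < 0x80)
        return byte;                       /* ASCII range is unchanged */
    if (byte >= 0xC0)
        return 0x0410u + (byte - 0xC0u);   /* U+0410 .. U+044F */
    if (byte == 0xA8)
        return 0x0401u;                    /* CYRILLIC CAPITAL LETTER IO */
    if (byte == 0xB8)
        return 0x0451u;                    /* CYRILLIC SMALL LETTER IO */
    return 0xFFFDu;                        /* not covered by this sketch */
}
```

A glyph named after the code C0 therefore yields a different Unicode value depending on the encoding hint, which is why the hint must match the script actually used in the document.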
In addition to the builtin rules for interpreting numerical glyph names you can define<br />
custom rules with the fontname and glyphrule suboptions of the glyphmapping option<br />
of <strong>TET</strong>_open_document( ). You must supply the following pieces of information:<br />
> The full or abbreviated name of the font to which the rule will be applied (fontname
suboption)<br />
> A prefix for the glyph names, i.e. the characters before the numerical part (prefix suboption)<br />
> The base (decimal or hexadecimal) in which the numbers will be interpreted (base<br />
suboption)<br />
> The encoding in which to interpret the resulting numerical codes (encoding suboption)<br />
For example, if you determined (e.g. using <strong>PDFlib</strong> FontReporter) that the glyphs in the<br />
fonts T1, T2, T3, etc. are named c00, c01, c02, ..., cFF where each glyph name corresponds to<br />
the WinAnsi character at the respective hexadecimal position (00, ..., FF), use the following<br />
option for <strong>TET</strong>_open_document( ):<br />
glyphmapping {{fontname=T* glyphrule={prefix=c base=hex encoding=winansi} }}<br />
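The effect of the prefix and base suboptions can be pictured with a small sketch. This is not how TET is implemented; the function glyph_name_to_code is a hypothetical helper which only illustrates the idea: strip the prefix, then read the remainder as a number in the given base. The resulting code would then be looked up in the encoding named by the encoding suboption.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the TET API.
 * Returns the numerical code encoded in a glyph name such as "c41",
 * or -1 if the name does not consist of the prefix followed by digits
 * in the requested base (10 or 16). */
long glyph_name_to_code(const char *name, const char *prefix, int base)
{
    size_t plen = strlen(prefix);
    const char *p;
    long code = 0;

    if (base != 10 && base != 16)
        return -1;                      /* only decimal and hexadecimal */
    if (strncmp(name, prefix, plen) != 0)
        return -1;                      /* prefix does not match */

    p = name + plen;
    if (*p == '\0')
        return -1;                      /* no numerical part */

    for (; *p != '\0'; p++) {
        unsigned char ch = (unsigned char) *p;
        int digit;

        if (isdigit(ch))
            digit = ch - '0';
        else if (base == 16 && isxdigit(ch))
            digit = tolower(ch) - 'a' + 10;
        else
            return -1;                  /* invalid digit for this base */

        code = code * base + digit;
        if (code > 0xFFFF)
            return -1;                  /* arbitrary limit chosen for this sketch */
    }
    return code;
}
```

With prefix=c and base=hex as in the option above, the glyph name c41 gives the code 0x41, which is the letter A in WinAnsi.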
External font files and system fonts. If a PDF does not contain sufficient information<br />
for Unicode mapping and the font is not embedded, you can configure additional font<br />
data which <strong>TET</strong> will use to derive Unicode mappings. Font data may come from a TrueType<br />
or OpenType font file on disk, which can be configured with the fontoutline resource<br />
category. As an alternative on Mac and Windows systems, <strong>TET</strong> can access fonts which<br />
are installed on the host operating system. Access to these host fonts can be disabled<br />
with the usehostfonts option in <strong>TET</strong>_open_document( ).<br />
In order to configure a disk file for the WarnockPro font, use the following call:<br />
<strong>TET</strong>_set_option(tet, "fontoutline {WarnockPro WarnockPro.otf}");<br />
See Section 5.2, »Resource Configuration and File Searching«, page 51, for more details<br />
on configuring external font files.<br />
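If font names and file names are only known at runtime, the option string can be assembled before it is passed on. The following is a minimal sketch under stated assumptions: build_fontoutline_option is a hypothetical helper, not a TET function, and it assumes that neither name contains spaces or braces, since it performs no quoting.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the TET API.
 * Writes "fontoutline {<fontname> <filename>}" into buf.
 * Returns the length of the string, or -1 if it did not fit.
 * Assumes both names are free of spaces and braces (no quoting is done). */
int build_fontoutline_option(char *buf, size_t size,
                             const char *fontname, const char *filename)
{
    int n = snprintf(buf, size, "fontoutline {%s %s}", fontname, filename);

    if (n < 0 || (size_t) n >= size)
        return -1;                      /* encoding error or truncated */
    return n;
}
```

The resulting string has the same form as the literal used in the TET_set_option( ) call above and could be supplied in its place.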
80 Chapter 6: <strong>Text</strong> <strong>Extraction</strong>