17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


numerical glyph names created by various common applications and drivers. Since the same glyph names may be created for different encodings you can provide the encodinghint option to TET_open_document( ) in order to specify the target encoding for schematic glyph names encountered in the document. For example, if you know that the document contains Russian text, but the text cannot successfully be extracted for lack of information in the PDF, you can supply the option encodinghint=cp1251 to specify a Cyrillic codepage.
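The following sketch illustrates why the target encoding matters: the same numerical code derived from a schematic glyph name yields different Unicode characters depending on the codepage. The helper names are hypothetical and not part of the TET API, and only the range 0xC0..0xFF is covered, where both mappings are simple offsets.

```c
#include <assert.h>
#include <stdio.h>

/* Illustration only: these helpers are hypothetical, not TET functions.
 * They cover just the codes 0xC0..0xFF and return -1 for anything else. */

/* Windows codepage 1251 (Cyrillic): 0xC0..0xFF map to U+0410..U+044F. */
long decode_cp1251_upper(int code)
{
    if (code < 0xC0 || code > 0xFF)
        return -1;
    return 0x0410L + (code - 0xC0);
}

/* WinAnsi (codepage 1252): 0xC0..0xFF coincide with U+00C0..U+00FF. */
long decode_winansi_upper(int code)
{
    if (code < 0xC0 || code > 0xFF)
        return -1;
    return (long) code;
}

/* Print both interpretations of one code side by side. */
void show_difference(int code)
{
    assert(code >= 0xC0 && code <= 0xFF);
    printf("code 0x%02X: cp1251 U+%04lX, WinAnsi U+%04lX\n",
           (unsigned) code,
           (unsigned long) decode_cp1251_upper(code),
           (unsigned long) decode_winansi_upper(code));
}
```

For instance, code 0xE0 is the Cyrillic small letter a (U+0430) under cp1251 but the Latin small letter a with grave (U+00E0) under WinAnsi, so a wrong or missing hint produces plausible-looking but incorrect text.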

In addition to the built-in rules for interpreting numerical glyph names you can define custom rules with the fontname and glyphrule suboptions of the glyphmapping option of TET_open_document( ). You must supply the following pieces of information:

> The full or abbreviated name of the font to which the rule will be applied (fontname suboption)
> A prefix for the glyph names, i.e. the characters before the numerical part (prefix suboption)
> The base (decimal or hexadecimal) in which the numbers will be interpreted (base suboption)
> The encoding in which to interpret the resulting numerical codes (encoding suboption)

For example, if you determined (e.g. using PDFlib FontReporter) that the glyphs in the fonts T1, T2, T3, etc. are named c00, c01, c02, ..., cFF where each glyph name corresponds to the WinAnsi character at the respective hexadecimal position (00, ..., FF), use the following option for TET_open_document( ):

glyphmapping {{fontname=T* glyphrule={prefix=c base=hex encoding=winansi} }}
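To make the effect of such a rule concrete, here is a minimal sketch of how a prefix and a base turn a glyph name into a numerical code. It is an illustration under stated assumptions, not TET's actual implementation: the function name apply_glyph_rule is hypothetical, font name matching (such as T*) is not modeled, and the subsequent lookup of the code in the chosen encoding is left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of the TET API: apply a "prefix + number"
 * rule to a glyph name. base must be 10 or 16. Returns the numerical code,
 * or -1 if the name does not match the rule or the code exceeds 0xFFFF. */
long apply_glyph_rule(const char *name, const char *prefix, int base)
{
    size_t plen;
    const char *p;
    long code = 0;

    assert(name != NULL && prefix != NULL);
    assert(base == 10 || base == 16);

    /* The glyph name must start with the configured prefix. */
    plen = strlen(prefix);
    if (strncmp(name, prefix, plen) != 0)
        return -1;

    /* The numerical part must not be empty. */
    p = name + plen;
    if (*p == '\0')
        return -1;

    /* Accumulate digits; any character invalid in the base rejects the name. */
    for (; *p != '\0'; p++) {
        int digit;
        if (*p >= '0' && *p <= '9')
            digit = *p - '0';
        else if (base == 16 && *p >= 'a' && *p <= 'f')
            digit = *p - 'a' + 10;
        else if (base == 16 && *p >= 'A' && *p <= 'F')
            digit = *p - 'A' + 10;
        else
            return -1;
        code = code * base + digit;
        if (code > 0xFFFF)
            return -1;
    }
    return code;
}
```

With prefix=c and base=hex, the glyph name c41 yields code 0x41, which the encoding suboption (winansi in the example) would then resolve to the letter A. Names that lack the prefix or contain non-digit characters are simply left to other rules.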

External font files and system fonts. If a PDF does not contain sufficient information for Unicode mapping and the font is not embedded, you can configure additional font data which TET will use to derive Unicode mappings. Font data may come from a TrueType or OpenType font file on disk, which can be configured with the fontoutline resource category. As an alternative on Mac and Windows systems, TET can access fonts which are installed on the host operating system. Access to these host fonts can be disabled with the usehostfonts option in TET_open_document( ).

In order to configure a disk file for the WarnockPro font use the following call:

TET_set_option(tet, "fontoutline {WarnockPro WarnockPro.otf}");

See Section 5.2, »Resource Configuration and File Searching«, page 51, for more details on configuring external font files.

80 Chapter 6: Text Extraction
