17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


5.2 Resource Configuration and File Searching

UPR files and resource categories. In some situations TET needs access to resources such as encoding definitions or glyph name mapping tables. To keep resource handling platform-independent and customizable, you can supply a configuration file that describes the available resources along with the names of their corresponding disk files. In addition to a static configuration file, resources can be added dynamically at runtime with TET_set_option(). The configuration file uses a simple text format called Unix PostScript Resource (UPR). The UPR file format as used by TET is described below. TET supports the resource categories listed in Table 5.1.

Table 5.1 Resource categories (all file names must be specified in UTF-8)

category       format (1)    explanation
cmap           key=value     Resource name and file name of a CMap
codelist       key=value     Resource name and file name of a code list
encoding       key=value     Resource name and file name of an encoding
glyphlist      key=value     Resource name and file name of a glyph list
glyphmapping   option list   An option list describing a glyph mapping method according to
                             Table 10.4, page 132. This resource will be evaluated in
                             TET_open_document(), and the result will be appended after the
                             mappings specified in the option glyphmapping of
                             TET_open_document().
hostfont       key=value     Name of a host font resource (key is the PDF font name; value is
                             the UTF-8 encoded host font name) to be used for an unembedded font
fontoutline    key=value     Font and file name of a TrueType or OpenType font to be used for
                             an unembedded font
searchpath     value         Relative or absolute path name of directories containing data files

(1) While the UPR syntax requires an equal character '=' between the name and value, this character is neither required nor allowed when specifying resources with TET_set_option().
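
To illustrate the formats in Table 5.1, the following Python sketch splits a single UPR resource entry into its name and value. It is not part of TET: the helper name split_upr_entry, the sample file names, and the trimming of whitespace around '=' are assumptions made for illustration only.

```python
# Categories and entry formats as listed in Table 5.1.
RESOURCE_FORMATS = {
    "cmap": "key=value",
    "codelist": "key=value",
    "encoding": "key=value",
    "glyphlist": "key=value",
    "glyphmapping": "option list",
    "hostfont": "key=value",
    "fontoutline": "key=value",
    "searchpath": "value",
}


def split_upr_entry(category, entry):
    """Split one UPR resource entry into (key, value).

    Hypothetical helper: for "key=value" categories the first '=' separates
    the resource name from the value; other categories carry no key.
    """
    fmt = RESOURCE_FORMATS[category]  # raises KeyError for unknown categories
    if fmt != "key=value":
        return None, entry.strip()
    key, sep, value = entry.partition("=")
    if not sep:
        raise ValueError(f"{category} entry needs '=' in a UPR file: {entry!r}")
    # Assumption: whitespace around '=' belongs to neither name nor value.
    return key.strip(), value.strip()
```

For example, split_upr_entry("encoding", "myenc=myenc.enc") yields the pair ("myenc", "myenc.enc"), while a searchpath entry is returned as a bare value without a key.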

The UPR file format. UPR files are text files with a very simple structure that can easily be written in a text editor or generated automatically. To start with, let's take a look at some syntactical issues:

> Lines can have a maximum of 255 characters.
> A backslash '\' escapes newline characters. This may be used to extend lines.
> An isolated period character '.' serves as a section terminator.
> Comment lines may be introduced with a percent '%' character, and terminated by the end of the line.
> Whitespace is ignored everywhere except in resource names and file names.
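
The syntax rules above can be sketched as a small Python reader that turns raw UPR text into logical lines. This is only an illustration, not TET's own parser. It assumes that the 255-character limit applies to each physical line, that a '%' only starts a comment at the beginning of a line that is not a continuation, and that leading and trailing whitespace can be dropped; the function name upr_logical_lines is hypothetical.

```python
MAX_LINE_LENGTH = 255  # limit stated in the syntax rules


def upr_logical_lines(text):
    """Yield the meaningful lines of UPR text as stripped strings."""
    pending = ""  # text collected from backslash-continued lines
    for raw in text.splitlines():
        if len(raw) > MAX_LINE_LENGTH:
            raise ValueError("UPR line exceeds 255 characters")
        if not pending and raw.lstrip().startswith("%"):
            continue  # comment line, ignored up to the end of the line
        if raw.endswith("\\"):
            pending += raw[:-1]  # backslash escapes the newline: keep joining
            continue
        line = (pending + raw).strip()
        pending = ""
        if line:
            yield line
    if pending.strip():
        yield pending.strip()  # file ended right after a continuation
```

A line consisting only of '.' comes through unchanged, so a caller can use it to detect the end of a section.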

UPR files consist of the following components:

> A magic line for identifying the file. It has the following form:

PS-Resources-1.0

> A section listing all resource categories described in the file. Each line describes one resource category. The list is terminated by a line with a single period character.
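
The two components described so far can be checked with a short Python sketch. It expects lines that have already been cleaned of comments and surrounding whitespace. The function name read_upr_header is hypothetical, and the category names in the sample are taken from Table 5.1; whether TET expects exactly this spelling inside a UPR file is an assumption here.

```python
MAGIC = "PS-Resources-1.0"


def read_upr_header(lines):
    """Return the resource categories announced at the top of a UPR file.

    `lines` is an iterable of cleaned lines (no comments, no blank lines).
    """
    it = iter(lines)
    if next(it, None) != MAGIC:
        raise ValueError("missing magic line " + MAGIC)
    categories = []
    for line in it:
        if line == ".":  # a single period terminates the category list
            return categories
        categories.append(line)
    raise ValueError("category list is not terminated by '.'")


# Illustrative header: magic line, three categories, terminating period.
sample = [
    "PS-Resources-1.0",
    "encoding",
    "glyphlist",
    "searchpath",
    ".",
]
print(read_upr_header(sample))  # prints ['encoding', 'glyphlist', 'searchpath']
```

Raising an error for a missing magic line or a missing terminator keeps malformed files from being silently accepted; a more lenient reader is equally possible.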
