PDFlib Text Extraction Toolkit (TET) Manual
glyphmapping {{fontname=Warnock* tounicodecmap=warnock}}<br />
Glyph list resources for simple fonts. Glyph lists (short for: glyph name lists) can be<br />
used to provide custom Unicode values for non-standard glyph names, or override the<br />
existing values for standard glyph names. A glyph list is a text file where each line describes<br />
a Unicode mapping for a single glyph name according to the following rules:<br />
> Text after a percent sign ’%’ will be ignored; this can be used for comments.<br />
> The first column contains the glyph name. Any glyph name used in a font can be<br />
used (i.e. even the Unicode values of standard glyph names can be overridden). In order<br />
to use the percent sign as part of a glyph name the sequence \% must be used<br />
(since the percent sign serves as the comment introducer).<br />
> At most one mapping for a particular glyph name is allowed; multiple mappings for<br />
the same glyph name will be treated as an error.<br />
> The remainder of the line contains up to 7 Unicode code points for the glyph name.<br />
The values can be supplied in decimal notation or (with the prefix x or 0x) in hexadecimal<br />
notation. UTF-32 is supported, i.e. surrogate pairs can be used.<br />
> Unprintable characters in glyph names can be inserted by using escape sequences<br />
for text files (see Section 5.2, »Resource Configuration and File Searching«, page 51).
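The line format described by the rules above can be sketched as a small C routine. This is only an illustration of one way to read the rules, not TET's own parser: the type glyph_entry and the function parse_glyph_line are hypothetical names, treating a glyph name without any code point as an error is an assumption, and neither the escape sequences for unprintable characters nor the check for duplicate glyph names (which needs the whole file) are covered.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CODEPOINTS 7        /* limit stated in the rules above */

/* Hypothetical record for one glyph list line (not a TET data type). */
typedef struct {
    char name[128];
    unsigned long cp[MAX_CODEPOINTS];
    int ncp;
} glyph_entry;

/*
 * Parse a single glyph list line.
 * Returns 1 if a mapping was read, 0 for an empty or comment-only line,
 * and -1 for a line which does not follow the rules.
 */
static int parse_glyph_line(const char *line, glyph_entry *out)
{
    char buf[512];
    size_t n = 0;
    const char *p;
    char *tok, *end;

    assert(line != NULL && out != NULL);

    /* Copy the line up to the first unescaped '%' (comment introducer);
     * the sequence \% becomes a literal percent sign. */
    for (p = line; *p != '\0' && *p != '\n' && n + 1 < sizeof buf; p++) {
        if (p[0] == '\\' && p[1] == '%') {
            buf[n++] = '%';
            p++;
        } else if (*p == '%') {
            break;
        } else {
            buf[n++] = *p;
        }
    }
    buf[n] = '\0';

    /* First column: the glyph name. */
    tok = strtok(buf, " \t\r");
    if (tok == NULL)
        return 0;
    if (strlen(tok) >= sizeof out->name)
        return -1;
    strcpy(out->name, tok);
    out->ncp = 0;

    /* Remaining columns: decimal values, or hexadecimal with x or 0x. */
    while ((tok = strtok(NULL, " \t\r")) != NULL) {
        int base = 10;

        if (out->ncp == MAX_CODEPOINTS)
            return -1;                      /* more than 7 values */
        if (tok[0] == '0' && (tok[1] == 'x' || tok[1] == 'X')) {
            tok += 2;
            base = 16;
        } else if (tok[0] == 'x' || tok[0] == 'X') {
            tok += 1;
            base = 16;
        }
        if (base == 16 ? !isxdigit((unsigned char) *tok)
                       : !isdigit((unsigned char) *tok))
            return -1;
        out->cp[out->ncp] = strtoul(tok, &end, base);
        if (*end != '\0' || out->cp[out->ncp] > 0x10FFFFUL)
            return -1;                      /* trailing junk or out of range */
        out->ncp++;
    }
    /* Assumption: a glyph name without any value is treated as an error. */
    return out->ncp > 0 ? 1 : -1;
}
```

Calling parse_glyph_line( ) for each line of a .gl file and collecting the results would yield the glyph name to Unicode table that such a file describes.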
By convention, glyph lists use the file name suffix .gl. Glyph lists can be configured with<br />
the glyphlist resource. If no glyph list resource has been specified explicitly, TET will<br />
search for a file named name.gl (where name is the resource name) in the<br />
searchpath hierarchy (see Section 5.2, »Resource Configuration and File Searching«, page<br />
51, for details). In other words: if the resource name and the file name (without the .gl<br />
suffix) are identical, you don’t have to configure the resource since TET will implicitly do<br />
the equivalent of the following call (where name is an arbitrary resource name):<br />
TET_set_option(tet, "glyphlist {name name.gl}");<br />
Due to the precedence rules for glyph mapping, glyph lists will not be consulted if the<br />
font contains a ToUnicode CMap. The following sample demonstrates the use of glyph<br />
lists:<br />
% Unicode values for glyph names used in TeX documents
precedesequal 0x227C
similarequal 0x2243
negationslash 0x2044
union 0x222A
prime 0x2032
In order to apply a glyph list to all font names starting with CMSY use the following option<br />
for TET_open_document( ):<br />
glyphmapping {{fontname=CMSY* glyphlist=tarski}}<br />
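If the font name pattern or the glyph list name is only known at runtime, the option list can be assembled before the document is opened. The following is a minimal sketch: build_glyphmapping is a hypothetical helper, not part of the TET API, and it only formats the string shown above. The result would then be supplied as the option list for TET_open_document( ).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a glyphmapping option list for a single font name pattern.
 * Hypothetical helper (not part of the TET API).
 * Returns 0 on success, or -1 if an argument is empty or the buffer
 * is too small for the complete option list.
 */
static int build_glyphmapping(char *buf, size_t size,
                              const char *fontpattern, const char *glyphlist)
{
    int n;

    assert(buf != NULL && fontpattern != NULL && glyphlist != NULL);

    if (strlen(fontpattern) == 0 || strlen(glyphlist) == 0)
        return -1;

    n = snprintf(buf, size, "glyphmapping {{fontname=%s glyphlist=%s}}",
                 fontpattern, glyphlist);

    /* snprintf reports the length it needed; reject truncated output */
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

With the arguments "CMSY*" and "tarski" the helper reproduces the option list shown above.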
Rules for interpreting numerical glyph names in simple fonts. Sometimes PDF documents<br />
contain glyphs with names which are not taken from some predefined list, but<br />
are generated algorithmically. This can be a »feature« of the application generating the<br />
PDF, or may be caused by a printer driver which converts fonts to another format: sometimes<br />
the original glyph names get lost in the process, and are replaced with schematic<br />
names such as G00, G01, G02, etc. TET contains built-in glyph name rules for processing<br />
6.8 Advanced Unicode Mapping Controls 79