17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

6.4 Support for Chinese, Japanese, and Korean Text

CJK support. TET supports Chinese, Japanese, and Korean (CJK) text, and will convert horizontal and vertical CJK text in arbitrary encodings (CMaps) to Unicode. TET supports all of Adobe’s CJK character collections, which cover all PDF CMaps used in PDF versions up to and including 1.6:

> Simplified Chinese: Adobe-GB1-4

> Traditional Chinese: Adobe-CNS1-4

> Japanese: Adobe-Japan1-6

> Korean: Adobe-Korea1-2

The PDF CMaps in turn cover all of the CJK character encodings which are in use today, such as Shift-JIS, EUC, Big-5, KSC, and many others.

Note In order to extract CJK text you must configure access to the CMap files which are shipped with TET according to Section 0.1, »Installing the Software«, page 7.

Several groups of CJK characters will be modified (see Table 6.2, page 68, for details):

> Fullwidth ASCII variants and fullwidth symbol variants will be mapped to the corresponding halfwidth characters.

> CJK compatibility forms (prerotated glyphs for vertical text) and small form variants will be mapped to the corresponding normal variants.
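The folding described above can be approximated with the compatibility decompositions of the Unicode Character Database. The following Python sketch is not TET's implementation: the function name `fold_cjk_variants` is hypothetical, and it assumes that the decomposition tags `<wide>`, `<vertical>`, and `<small>` roughly correspond to the character groups listed above. Table 6.2 remains the authoritative description of the mapping which TET applies.

```python
import unicodedata

# Decomposition tags assumed to match the groups listed above:
#   <wide>     fullwidth ASCII and fullwidth symbol variants
#   <vertical> CJK compatibility forms (prerotated glyphs)
#   <small>    small form variants
FOLDED_TAGS = ("<wide>", "<vertical>", "<small>")


def fold_cjk_variants(text: str) -> str:
    """Replace fullwidth, vertical and small variants by their base characters."""
    result = []
    for ch in text:
        decomposition = unicodedata.decomposition(ch)
        if decomposition.startswith(FOLDED_TAGS):
            # The entry looks like "<wide> 0041": skip the tag and
            # convert the remaining hexadecimal code points.
            code_points = decomposition.split()[1:]
            result.append("".join(chr(int(cp, 16)) for cp in code_points))
        else:
            # Ideographs, Kana and ordinary characters pass through unchanged.
            result.append(ch)
    return "".join(result)


if __name__ == "__main__":
    print(fold_cjk_variants("ＡＢＣ１２３"))
    print(fold_cjk_variants("\uFE35\u6F22\uFE36"))
```

Filtering on individual tags instead of applying full NFKC normalization keeps unrelated compatibility characters (for example halfwidth Katakana) untouched, which seems closer to the narrow list of groups given above.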

CJK font names which are encoded with locale-specific encodings (e.g. Japanese font names encoded in Shift-JIS) will also be normalized to Unicode. The wordfinder will treat all ideographic CJK characters as individual words, while Katakana characters will not be treated as word boundaries (a sequence of Katakana will be treated as a single word).
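The word boundary rule for ideographs and Katakana can be illustrated with a small Python sketch. It is not the wordfinder itself: the function names and the code point ranges are assumptions made for illustration (only the basic CJK Unified Ideographs block, Extension A, and the Katakana block are considered), and the treatment of all other characters as whitespace-separated runs is a simplification.

```python
def is_ideograph(ch: str) -> bool:
    """True for CJK Unified Ideographs and Extension A (assumed ranges)."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def is_katakana(ch: str) -> bool:
    """True for characters in the Katakana block U+30A0 to U+30FF."""
    return 0x30A0 <= ord(ch) <= 0x30FF


def split_words(text: str) -> list:
    """Split text so that each ideograph is a word and Katakana runs stay together."""
    words = []
    run = ""
    run_kind = None
    for ch in text:
        if ch.isspace():
            kind = None
        elif is_ideograph(ch):
            kind = "ideograph"
        elif is_katakana(ch):
            kind = "katakana"
        else:
            kind = "other"
        # Close the current run when the character class changes;
        # an ideograph always starts a new word of its own.
        if run and (kind != run_kind or kind == "ideograph"):
            words.append(run)
            run = ""
        if kind is not None:
            run += ch
        run_kind = kind
    if run:
        words.append(run)
    return words


if __name__ == "__main__":
    print(split_words("東京タワー"))
```

With this sketch the two ideographs of the sample string come out as separate words, while the Katakana sequence which follows them is kept as a single word.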

CJK text with vertical writing mode. TET supports both horizontal and vertical writing modes, and performs all metrics calculations as appropriate for the respective writing mode. Keep the following in mind when dealing with text in vertical writing mode:

> The glyph reference point in vertical writing mode is at the top center of the glyph box. The text position will advance downwards as determined by the font size and character spacing, regardless of the glyph width (see Figure 6.3).

> The angle alpha will be 0° for standard vertical text. In other words, text in a font with vertical writing mode and alpha=0° will progress downwards, i.e. in direction -90°.

> Because of the differences noted above, client code must take the writing mode into account by using the pCOS code shown in Section 9.1, »Simple pCOS Examples«, page 105 for determining the writing mode of a font. Note that not all text which appears vertically actually uses a font with vertical writing mode.

> Prerotated glyphs for Latin characters and punctuation will be mapped to the corresponding unrotated Unicode character (see Table 6.2).
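The geometry of vertical writing mode can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TET code: the function names are hypothetical, coordinates follow the PDF convention with y growing upwards, and the advance per glyph is assumed to be exactly the font size plus the character spacing (i.e. a default vertical advance of one em).

```python
def progression_direction(alpha_deg: float, vertical: bool) -> float:
    """Direction in degrees in which the text position advances.

    Horizontal text advances along alpha; text in a font with vertical
    writing mode advances perpendicular to it, so alpha=0 yields -90.
    """
    return alpha_deg - 90.0 if vertical else alpha_deg


def vertical_reference_points(x, y, fontsize, charspacing, count):
    """Reference points (top center of each glyph box) for unrotated vertical text.

    The step depends only on font size and character spacing,
    not on the width of the individual glyphs.
    """
    step = fontsize + charspacing
    return [(x, y - i * step) for i in range(count)]


if __name__ == "__main__":
    print(progression_direction(0, vertical=True))
    print(vertical_reference_points(100, 700, 12, 0, 3))
```

Client code which sorts or groups glyphs by position could use a helper like `progression_direction` to decide whether to order them by decreasing y (vertical fonts) or by increasing x (horizontal fonts).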
