PDFlib Text Extraction Toolkit (TET) Manual
6.4 Support for Chinese, Japanese, and Korean Text
CJK support. TET supports Chinese, Japanese, and Korean (CJK) text, and will convert
horizontal and vertical CJK text in arbitrary encodings (CMaps) to Unicode. TET supports
all of Adobe’s CJK character collections, which cover all PDF CMaps used in PDF
versions up to and including 1.6:
> Simplified Chinese: Adobe-GB1-4
> Traditional Chinese: Adobe-CNS1-4
> Japanese: Adobe-Japan1-6
> Korean: Adobe-Korea1-2
The PDF CMaps in turn cover all of the CJK character encodings which are in use today,
such as Shift-JIS, EUC, Big-5, KSC, and many others.
Note: To extract CJK text, you must configure access to the CMap files shipped with
TET as described in Section 0.1, »Installing the Software«, page 7.
Several groups of CJK characters will be modified (see Table 6.2, page 68, for details):
> Fullwidth ASCII variants and fullwidth symbol variants will be mapped to the corresponding
halfwidth characters.
> CJK compatibility forms (prerotated glyphs for vertical text) and small form variants
will be mapped to the corresponding normal variants.
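The folding described in the list above can be approximated outside of TET with Unicode compatibility decompositions. The following Python sketch is not TET's implementation: the function name `fold_cjk_variants` is hypothetical, and the choice of the `<wide>`, `<vertical>`, and `<small>` decomposition tags is an assumption about which Unicode groups correspond to the ones named here. Table 6.2 remains the authoritative list.

```python
import unicodedata

# Decomposition tags assumed (for this sketch only) to match the groups above:
# fullwidth variants, vertical presentation forms, and small form variants.
FOLDED_TAGS = {"<wide>", "<vertical>", "<small>"}


def fold_cjk_variants(text: str) -> str:
    """Replace tagged compatibility characters with their normal variants."""
    folded = []
    for ch in text:
        # decomposition() returns e.g. "<wide> 0041" for U+FF21, or "" if none.
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] in FOLDED_TAGS:
            folded.append("".join(chr(int(cp, 16)) for cp in parts[1:]))
        else:
            folded.append(ch)
    return "".join(folded)


if __name__ == "__main__":
    # Fullwidth letters and digits, followed by an ideograph that stays unchanged.
    print(fold_cjk_variants("\uff30\uff24\uff26\uff11\u6f22"))
```

Unlike a blanket NFKC normalization, this sketch leaves characters with other compatibility tags (for example halfwidth Katakana) untouched, which is closer to the selective mapping described above.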
CJK font names which are encoded with locale-specific encodings (e.g. Japanese font
names encoded in Shift-JIS) will also be normalized to Unicode. The wordfinder will
treat all ideographic CJK characters as individual words, while Katakana characters will
not be treated as word boundaries (a sequence of Katakana will be treated as a single
word).
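To illustrate the word boundary rule for ideographs and Katakana, here is a deliberately simplified Python sketch. It is not TET's wordfinder: the function names are hypothetical, only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and the Katakana block (U+30A0 to U+30FF) are recognized, and whitespace is the only other separator considered.

```python
def char_class(ch: str) -> str:
    """Classify a character for this sketch (assumed, simplified ranges)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "ideograph"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if ch.isspace():
        return "space"
    return "other"


def split_words(text: str) -> list[str]:
    """Split text so each ideograph is its own word and Katakana runs stay whole."""
    words = []
    current = ""
    current_class = None
    for ch in text:
        cls = char_class(ch)
        # Ideographs, whitespace, and any change of class end the running word.
        if cls in ("ideograph", "space") or cls != current_class:
            if current:
                words.append(current)
            current = ""
        if cls == "ideograph":
            words.append(ch)  # every ideograph is a word of its own
            current_class = None
        elif cls == "space":
            current_class = None
        else:
            current += ch  # Katakana (or other) characters extend the run
            current_class = cls
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    print(split_words("\u6771\u4eac\u30bf\u30ef\u30fc"))
```

For the five-character input in the example (two ideographs followed by three Katakana characters) the sketch yields three words: one per ideograph and one for the Katakana run.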
CJK text with vertical writing mode. TET supports both horizontal and vertical writing
modes, and performs all metrics calculations as appropriate for the respective writing
mode. Keep the following in mind when dealing with text in vertical writing mode:
> The glyph reference point in vertical writing mode is at the top center of the glyph
box. The text position will advance downwards as determined by the font size and
character spacing, regardless of the glyph width (see Figure 6.3).
> The angle alpha will be 0° for standard vertical text. In other words, text in a font with
vertical writing mode and alpha=0° will progress downwards, i.e. in direction -90°.
> Because of the differences noted above, client code must take the writing mode into
account by using the pCOS code shown in Section 9.1, »Simple pCOS Examples«, page
105, to determine the writing mode of a font. Note that not all text which appears
vertically actually uses a font with vertical writing mode.
> Prerotated glyphs for Latin characters and punctuation will be mapped to the corresponding
unrotated Unicode character (see Table 6.2).
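The geometry of vertical writing mode described in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TET code: both function names are hypothetical, the advance per glyph is taken to be exactly the font size plus an extra `char_spacing` gap (the sign convention of that gap is an assumption of this sketch), and the writing mode flag is expected to come from the pCOS query referenced above.

```python
def progression_angle(alpha_deg: float, vertical: bool) -> float:
    """Direction in which text advances, in degrees.

    For horizontal fonts the text advances in direction alpha; for fonts with
    vertical writing mode alpha=0 means the text advances downwards (-90).
    """
    return alpha_deg - 90.0 if vertical else alpha_deg


def vertical_reference_points(x, y_top, font_size, char_spacing, count):
    """Reference points (top center of each glyph box) for unrotated vertical text.

    The advance depends only on font size and character spacing, not on the
    glyph width, so x stays constant while y decreases.
    """
    points = []
    y = y_top
    for _ in range(count):
        points.append((x, y))
        y -= font_size + char_spacing  # assumed: one em plus an extra gap
    return points


if __name__ == "__main__":
    print(progression_angle(0.0, vertical=True))
    print(vertical_reference_points(100.0, 700.0, 12.0, 0.0, 3))
```

Client code that only handles horizontal text would instead expect x to grow by the glyph width, which is why the writing mode of the font has to be checked first.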