17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


6.3 Chinese, Japanese, and Korean Text

6.3.1 CJK Encodings and CMaps

TET supports Chinese, Japanese, and Korean (CJK) text, and converts horizontal and vertical CJK text in arbitrary legacy encodings (CMaps) to Unicode. TET supports all of Adobe’s CJK character collections:

> Simplified Chinese: Adobe-GB1-5
> Traditional Chinese: Adobe-CNS1-5
> Japanese: Adobe-Japan1-6
> Korean: Adobe-Korea1-2

The PDF CMaps in turn cover all of the CJK character encodings which are in use today, such as Shift-JIS, EUC, Big-5, KSC, and many others. CJK font names encoded with locale-specific encodings (e.g. Japanese font names encoded in Shift-JIS) are normalized to Unicode.

Note: To extract CJK text which is encoded with legacy encodings, you must configure access to the CMap files shipped with TET as described in Section 0.1, »Installing the Software«, page 7.
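To make the idea of "legacy encoding to Unicode" concrete, the following Python sketch encodes the same strings in several legacy CJK encodings and decodes them back. It is only an illustration: Python's built-in codecs stand in for the CMap-based conversion, no TET function is called, and the sample strings and the `roundtrip` helper are assumptions made for this example.

```python
# Illustration only: Python's built-in codecs stand in for the CMap-based
# conversion described above; no TET API is used here.
SAMPLES = {
    "shift_jis": "日本語",  # Japanese
    "euc_jp": "日本語",     # Japanese, different byte layout
    "big5": "中文",         # Traditional Chinese
    "gb2312": "中文",       # Simplified Chinese
    "euc_kr": "한국어",     # Korean
}


def roundtrip(encoding, text):
    """Encode text in a legacy encoding, then decode the bytes back to Unicode."""
    raw = text.encode(encoding)
    return raw, raw.decode(encoding)


for encoding, text in SAMPLES.items():
    raw, decoded = roundtrip(encoding, text)
    # The byte sequences differ per encoding; the decoded Unicode text does not.
    print(f"{encoding:10} {raw.hex(' ')} -> {decoded}")
```

The same Japanese string yields different byte sequences under Shift-JIS and EUC-JP, yet both decode to identical Unicode text, which is the property that makes Unicode a convenient common output format.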

6.3.2 Word Boundaries for CJK Text

Word boundary detection for CJK text can be controlled with the ideographic page option:

> With ideographic=split ideographic characters always constitute a word boundary, i.e. single ideographs are returned for granularity=word. While ideographic CJK characters are treated as word boundaries, Katakana characters are not.
> With ideographic=keep ideographic characters generally don’t constitute a word boundary. Punctuation and the transition between ideographic and non-ideographic characters still constitute a word boundary. For granularity=word ideographic comma U+3001 and ideographic full stop U+3002 also constitute word boundaries. For granularity=page no line separator is inserted at the end of a line.

For compatibility reasons the default value is ideographic=split, but it is strongly recommended to use ideographic=keep to improve text extraction for CJK text.

6.3.3 Vertical Writing Mode

TET supports both horizontal and vertical writing modes, and performs all metrics calculations as appropriate for the respective writing mode. Keep the following in mind when dealing with text in vertical writing mode:

> The glyph reference point in vertical writing mode is at the top center of the glyph box. The text position will advance downwards as determined by the font size and character spacing, regardless of the glyph width (see Figure 6.3).
> The angle alpha is 0° for standard vertical text. In other words, fonts with vertical writing mode and alpha=0° progress downwards, i.e. in direction -90°.
> Because of the differences noted above client code must take the writing mode into account by using the following pCOS code (note that not all text which appears vertically actually uses a font with vertical writing mode):
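The pCOS listing referred to in the last item is not part of this excerpt. Independently of it, the advance rule from the first two items can be sketched in Python. The function `vertical_glyph_positions` is hypothetical and not a TET call, and it assumes that each glyph advances by exactly the font size plus the character spacing, which corresponds to default vertical font metrics; fonts with other vertical metrics would advance differently.

```python
import math


def vertical_glyph_positions(x, y, count, fontsize, charspacing=0.0, alpha_deg=0.0):
    """Return the reference points (top center) of `count` glyphs in vertical mode.

    Assumption: the advance per glyph is fontsize + charspacing, independent of
    the glyph width. Coordinates follow the PDF convention with y pointing up,
    so alpha=0 means the text progresses in direction -90 degrees (downwards).
    """
    step = fontsize + charspacing
    direction = math.radians(alpha_deg - 90.0)
    dx = step * math.cos(direction)
    dy = step * math.sin(direction)
    return [(x + i * dx, y + i * dy) for i in range(count)]


# Three glyphs at 12 pt with 1 pt character spacing, starting at (100, 700).
for px, py in vertical_glyph_positions(100.0, 700.0, 3, fontsize=12.0, charspacing=1.0):
    print(f"({px:.2f}, {py:.2f})")
```

For alpha=0° the x coordinate stays constant and y decreases by 13 units per glyph, matching the downward progression described above.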
