17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

6.2 Unicode Concepts<br />

Unicode encoding forms (UTF formats). The Unicode standard assigns a number (code<br />

point) to each character. In order to use these numbers in computing, they must be represented<br />

in some way. In the Unicode standard this is called an encoding form (formerly:<br />

transformation format); this term should not be confused with font encodings. Unicode<br />

defines the following encoding forms:<br />

> UTF-8: This is a variable-width format where code points are represented by 1-4<br />

bytes. ASCII characters in the range U+0000...U+007F are represented by a single<br />

byte in the range 00...7F. Latin-1 characters in the range U+00A0...U+00FF are represented<br />

by two bytes, where the first byte is always 0xC2 or 0xC3 (these values represent<br />

 and à in Latin-1).<br />

> UTF-16: Code points in the Basic Multilingual Plane (BMP), i.e. characters in the range<br />

U+0000...U+FFFF are represented by a single 16-bit value. Code points in the supplementary<br />

planes, i.e. in the range U+10000...U+10FFFF, are represented by a pair of 16-<br />

bit values. Such pairs are called surrogate pairs. A surrogate pair consists of a highsurrogate<br />

value in the range D800...DBFF and a low-surrogate value in the range<br />

DC00...DFFF. High- and low-surrogate values can only appear as parts of surrogate<br />

pairs, but not in any other context.<br />

> UTF-32: Each code point is represented by a single 32-bit value.<br />

The Byte Order Mark (BOM). Computer architectures differ in the ordering of bytes,<br />

i.e. whether the bytes constituting a larger value (16- or 32-bit) are stored with the most<br />

significant byte first (big-endian) or the least significant byte first (little-endian). A common<br />

example for big-endian architectures is PowerPC, while the x86 architecture is little-endian.<br />

Since UTF-8 and UTF-16 are based on values which are larger than a single<br />

byte, the byte-ordering issue comes into play here. An encoding scheme (note the difference<br />

to encoding form above) specifies the encoding form plus the byte ordering. For<br />

example, UTF-16BE stands for UTF-16 with big-endian byte ordering. If the byte ordering<br />

is not known in advance it can be specified by means of the code point U+FEFF, which is<br />

called Byte Order Mark (BOM). Although a BOM is not required in UTF-8, it may be<br />

present as well, and can be used to identify a stream of bytes as UTF-8. Table 6.1 lists the<br />

representation of the BOM for various encoding forms.<br />

Table 6.1 Byte order marks for various Unicode encoding forms<br />

Encoding form Byte order mark (hex) graphical representation<br />

UTF-8 EF BB BF <br />

UTF-16 big-endian FE FF þÿ<br />

UTF-16 little-endian FF FE ÿþ<br />

UTF-32 big-endian 00 00 FE FF ? ? þÿ 1<br />

UTF-32 little-endian FF FE 00 00 ÿþ ? ? 1<br />

1. There is no standard graphical representation of null bytes.<br />

6.2 Unicode Concepts 61

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!