23.07.2013 Views

Java IO.pdf - Nguyen Dang Binh

Java I/O

Text editors that work with non-ASCII character sets like MacRoman, Arabic, or Big-5 Chinese can integrate with existing Java compilers by providing a preprocessing phase in which the natively encoded data is translated to Unicode-escaped ASCII before being passed to Sun's javac compiler. Alternatively, they can hand off the translation work to javac (1.1 and later) by using its -encoding flag. For example, to specify that the file MyClass.java is written in the ISO 8859-9 character set (essentially Latin-1 with the Turkish characters ğ, Ğ, ı, İ, ş, and Ş replacing the Icelandic characters þ, Þ, ý, Ý, ð, and Ð) you would type:

% javac -encoding 8859_9 MyClass.java

Table B.4 lists the encodings that Java 1.1 understands.
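The preprocessing phase described above can be sketched in a few lines. This is only an illustration, not the code any particular editor or tool uses: the class name UnicodeEscaper and the method escape are hypothetical, and the sketch assumes the text has already been decoded from its native encoding into a Java String.

```java
final class UnicodeEscaper {

    // Hypothetical helper: copies ASCII characters unchanged and replaces
    // every other character with a six-character Unicode escape
    // (backslash, 'u', four hex digits) that any Java compiler accepts.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Turkish letter s-cedilla is character 0x015F.
        System.out.println(escape("String s = \"ka\u015F\";"));
    }
}
```

The output contains only ASCII, so it can be handed to a compiler without any -encoding flag.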

14.4 UTF-8

Since each character in the 16-bit Unicode encoding discussed here occupies exactly two bytes, it is a fairly simple encoding to process. The first two bytes of a file are the first character. The next two bytes are the second character, and so on. This makes parsing Unicode data relatively simple compared to schemes that use variable-width characters. The downside is that two-byte Unicode is far from the most efficient encoding possible. In a file containing mostly English text, the high bytes of almost all the characters will be 0. These bytes can occupy as much as half of the file. If you're sending data across the network, Unicode data can take twice as long to transmit as the same text in a one-byte encoding.
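A small experiment makes the overhead concrete. The following is a minimal sketch, assuming the two-byte form is written big-endian (Java's UTF_16BE charset); the class name EncodingSizes and its methods are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizes {

    // Number of bytes the text occupies as raw two-byte, big-endian Unicode.
    static int twoByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Number of high-order bytes that are 0 in the two-byte form.
    static int zeroHighBytes(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        int zeros = 0;
        for (int i = 0; i < bytes.length; i += 2) {
            if (bytes[i] == 0) {
                zeros++;
            }
        }
        return zeros;
    }

    public static void main(String[] args) {
        String text = "Hello, world";
        System.out.println(twoByteLength(text)); // 24 bytes for 12 characters
        System.out.println(zeroHighBytes(text)); // 12 of those bytes are 0
    }
}
```

For pure ASCII text like this, exactly half of the bytes carry no information.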

A more efficient encoding can be achieved for files that are composed primarily of ASCII text by encoding the more common characters in fewer bytes. UTF-8 is one such format. As Java uses it, UTF-8 encodes the non-null ASCII characters in a single byte, characters between 128 and 2047 and ASCII null in two bytes, and the remaining characters in three bytes. (Standard UTF-8 encodes null as a single zero byte.) While theoretically this encoding might expand a file's size by 50% relative to two-byte Unicode, because most text files contain primarily ASCII, in practice it almost always yields a substantial savings. Therefore, Java uses UTF-8 in string literals, identifiers, and other text data in compiled byte code. UTF-8 is also a common encoding for XML files and the native encoding of Bell Labs' experimental Plan 9 operating system.
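The one-, two-, and three-byte ranges can be checked from Java itself. The sketch below relies on DataOutputStream.writeUTF(), which writes a two-byte length followed by the string in the same modified UTF-8 format that class files use; the class name ModifiedUtf8Sizes and the method encodedLength are hypothetical names chosen for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class ModifiedUtf8Sizes {

    // Returns how many bytes writeUTF() uses for the characters of s,
    // not counting the two-byte length prefix it writes first.
    static int encodedLength(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("A"));      // 1: non-null ASCII
        System.out.println(encodedLength("\0"));     // 2: ASCII null
        System.out.println(encodedLength("\u00e9")); // 2: between 128 and 2047
        System.out.println(encodedLength("\u20ac")); // 3: everything else
    }
}
```

A string made up entirely of three-byte characters shows the theoretical worst case: two such characters take six bytes here against four bytes in the two-byte encoding, an expansion of 50%.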

To better understand UTF-8, consider a typical Unicode character as a sequence of 16 bits:

x15 x14 x13 x12 x11 x10 x9 x8
x7 x6 x5 x4 x3 x2 x1 x0

Each ASCII character except the null character (each character between 1 and 127) has its upper nine bits equal to 0:

0 0 0 0 0 0 0 0
0 x6 x5 x4 x3 x2 x1 x0

Therefore, it's easy to encode an ASCII character as a single byte. Just drop the high-order byte:

0 x6 x5 x4 x3 x2 x1 x0
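In Java, dropping the high-order byte is a single cast. The following is a minimal sketch of the one-byte case only; the class name AsciiToUtf8 and the method encodeAscii are hypothetical, and the range check reflects the rule that only characters between 1 and 127 are encoded this way.

```java
final class AsciiToUtf8 {

    // Encodes a non-null ASCII character as one UTF-8 byte by
    // discarding the high-order byte, which is known to be 0.
    static byte encodeAscii(char c) {
        if (c < 1 || c > 127) {
            throw new IllegalArgumentException("not a non-null ASCII character: " + (int) c);
        }
        return (byte) c;
    }

    public static void main(String[] args) {
        char c = 'A';
        System.out.println(c >>> 7);        // 0: the upper nine bits are all 0
        System.out.println(encodeAscii(c)); // 65: the low-order byte, unchanged
    }
}
```

The top bit of the resulting byte is always 0, which is what lets a decoder recognize it as a complete one-byte character.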

Now consider characters between 128 and 2047. These all have their top five bits equal to 0, as shown here:
