23.07.2013 Views

Java IO.pdf - Nguyen Dang Binh

Java I/O

character set was invented. Unicode is a 2-byte, 16-bit character set with 2^16 or 65,536
different possible characters. (Only about 40,000 are used in practice, the rest being reserved
for future expansion.) Unicode can handle most of the world's living languages and a number
of dead ones as well.

The first 256 characters of Unicode (that is, the characters whose high-order byte is zero)
are identical to the characters of the ISO Latin-1 character set. Thus 65 is ASCII A and
Unicode A, 66 is ASCII B and Unicode B, and so on.
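The overlap between ISO Latin-1 and the first 256 Unicode values can be checked directly. The following is a minimal sketch, not part of the original text: the class and method names are hypothetical, and it relies on java.nio.charset.StandardCharsets, which was added to Java long after the Java 1.1 APIs discussed here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is an assumption made for illustration.
class Latin1Check {
    // Decodes a single byte value (0-255) as ISO Latin-1 and returns the resulting char.
    static char decodeLatin1(int b) {
        byte[] one = { (byte) b };
        return new String(one, StandardCharsets.ISO_8859_1).charAt(0);
    }

    public static void main(String[] args) {
        // Every Latin-1 byte value should decode to the Unicode char with the same number.
        for (int b = 0; b < 256; b++) {
            if (decodeLatin1(b) != (char) b) {
                throw new AssertionError("Mismatch at value " + b);
            }
        }
        System.out.println((int) 'A');        // prints 65
        System.out.println(decodeLatin1(66)); // prints B
    }
}
```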

Java streams do not do a good job of reading Unicode text. (This is why readers and writers
were added in Java 1.1.) Streams generally read a byte at a time, but each Unicode character
occupies two bytes. Thus, to read a Unicode character whose high-order byte comes first, you
multiply the first byte read by 256, add it to the second byte read, and cast the result to a
char. For example:

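// read() returns -1 at end of stream; this fragment assumes both bytes are present
int b1 = in.read();               // high-order byte
int b2 = in.read();               // low-order byte
char c = (char) (b1*256 + b2);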

You must be careful to ensure that you don't inadvertently read the last byte of one character
and the first byte of the next, instead. Thus, for the most part, when reading text encoded in
Unicode or any other format, you should use a reader rather than an input stream. Readers
handle the conversion of bytes in one character set to Java chars without any extra effort. For
similar reasons, you should use a writer rather than an output stream to write text.
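To make the comparison concrete, here is a small sketch that reads the same bytes twice: once by pairing bytes by hand, as described above, and once through an InputStreamReader. The class name, method names, and sample bytes are assumptions made for illustration. The sketch also assumes the data is big-endian two-byte Unicode, which is named UTF-16BE in current Java releases.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are assumptions made for illustration.
class ReadUnicode {
    // Pairs bytes by hand: high-order byte first, then low-order byte.
    static String readManually(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b1;
        while ((b1 = in.read()) != -1) {
            int b2 = in.read();
            if (b2 == -1) {
                throw new IOException("Stream ended in the middle of a character");
            }
            sb.append((char) (b1 * 256 + b2));
        }
        return sb.toString();
    }

    // Lets a reader do the byte-to-char conversion.
    static String readWithReader(InputStream in) throws IOException {
        Reader r = new InputStreamReader(in, StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'H' (0x0048), 'i' (0x0069), and Greek small letter pi (0x03C0)
        byte[] data = { 0x00, 0x48, 0x00, 0x69, 0x03, (byte) 0xC0 };
        String manual = readManually(new ByteArrayInputStream(data));
        String viaReader = readWithReader(new ByteArrayInputStream(data));
        System.out.println(manual.equals(viaReader)); // prints true
    }
}
```

The reader version contains no byte arithmetic at all, so there is no opportunity to fall out of step with the character boundaries.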

1.3.4 UTF-8

Unicode is a relatively inefficient encoding when most of your text consists of ASCII
characters. Every character requires the same number of bytes (two) even though some
characters are used much more frequently than others. A more efficient encoding would use
fewer bits for the more common characters. This is what UTF-8 does.

In UTF-8 the ASCII alphabet is encoded using a single byte, just as in ASCII. The next 1,920
characters (128 through 2,047) are encoded in two bytes. The remaining Unicode characters are
encoded in three bytes. However, since these three-byte characters are relatively uncommon, [1]
especially in English text, the savings achieved by encoding ASCII in a single byte more than
make up for the extra byte.
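The byte counts described above can be observed by encoding individual characters. This is a minimal sketch: the class and method names are hypothetical, and it uses String.getBytes with StandardCharsets.UTF_8 from current Java releases rather than anything available in Java 1.1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are assumptions made for illustration.
class Utf8Lengths {
    // Returns the number of bytes UTF-8 needs for one char.
    // Not meaningful for lone surrogate chars (0xD800-0xDFFF), which cannot be encoded alone.
    static int utf8Length(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length('A'));      // prints 1 (ASCII)
        System.out.println(utf8Length('\u00E9')); // prints 2 (e with acute accent)
        System.out.println(utf8Length('\u07FF')); // prints 2 (2047, last two-byte character)
        System.out.println(utf8Length('\u0800')); // prints 3 (2048, first three-byte character)
        System.out.println(utf8Length('\u4E2D')); // prints 3 (a Chinese ideograph)
    }
}
```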

Java's .class files use UTF-8 internally to store string literals. Data input streams and data
output streams also read and write strings in UTF-8. However, this is all hidden from direct
view of the programmer, unless perhaps you're trying to write a Java compiler or parse output
of a data stream without using the DataInputStream class.
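If you do need to look at what a data output stream produces, the following sketch writes a string with DataOutputStream.writeUTF and inspects the raw bytes. The class and method names are hypothetical. As the java.io.DataInput documentation describes, the bytes consist of a two-byte length followed by the characters in a slightly modified form of UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical demo class; the names are assumptions made for illustration.
class WriteUtfDemo {
    // Returns the raw bytes that writeUTF produces for a string.
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    // Reads the string back with the matching DataInputStream method.
    static String decode(byte[] bytes) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        // 'A' takes one byte and e-acute (0x00E9) takes two, after a two-byte length of 3.
        byte[] bytes = encode("A\u00E9");
        System.out.println(bytes.length);                    // prints 5
        System.out.println(decode(bytes).equals("A\u00E9")); // prints true
    }
}
```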

1.3.4.1 Other encodings

ASCII, ISO Latin-1, and Unicode are hardly the only character sets in common use, though
they are the ones handled most directly by Java. There are many other character sets, some
that encode different scripts and some that encode the same scripts in different ways. For
example, IBM mainframes have long used a non-ASCII eight-bit character set called EBCDIC.
EBCDIC has most of the same characters as ASCII but assigns them to different numbers.
Macintoshes

[1] The vast majority of the characters above 2047 are the pictograms used for Chinese, Japanese, and Korean.
