23.07.2013 Views

Java IO.pdf - Nguyen Dang Binh


Java I/O

world's alphabetic scripts and the most common characters from the ideographic scripts of
Chinese and Japanese. The current version of Unicode (2.1) defines 38,887 different
characters from many languages, including English, Russian, Arabic, Hebrew, Greek, Thai,
Korean, and Sanskrit. The most common ideographic characters from Japanese and Chinese
are also included. However, Chinese alone contains over 80,000 different ideograms, so it's
impossible to include them all in a two-byte set. A four-byte Universal Character Set (UCS)
that will include the full Chinese and Japanese scripts is under development. Java does not yet
support UCS.

The first 128 Unicode characters (characters 0 through 127) are identical to the ASCII character
set. 32 is the ASCII space; therefore, 32 is the Unicode space. 33 is the ASCII exclamation
point, so 33 is the Unicode exclamation point, and so on. Table B.1, in Appendix B, shows
this character set. The next 128 Unicode characters (characters 128 through 255) have the
same values as the equivalent characters in the Latin-1 character set defined by ISO standard
8859-1. Latin-1, a slight variation of which is used by Windows, adds the various accented
characters, umlauts, cedillas, upside-down question marks, and other characters needed to
write text in most Western European languages. Table B.2 shows these characters. The first
128 characters in Latin-1 are identical to the ASCII character set.
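
The correspondence between Latin-1 byte values and the first 256 Unicode characters can be checked directly. The following is a minimal sketch, assuming a Java 7 or later runtime (for java.nio.charset.StandardCharsets); the class name Latin1Check and its two helper methods are invented here for illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

class Latin1Check {

    // Decodes one byte value (0-255) as ISO 8859-1 and returns the
    // numeric value of the resulting Unicode character.
    static int decodeLatin1(int byteValue) {
        byte[] data = {(byte) byteValue};
        String s = new String(data, StandardCharsets.ISO_8859_1);
        return s.charAt(0);
    }

    // Returns true if every byte value in [from, to] decodes to the
    // Unicode character that has the same number.
    static boolean sameValues(int from, int to) {
        for (int i = from; i <= to; i++) {
            if (decodeLatin1(i) != i) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("0-127 match:   " + sameValues(0, 127));
        System.out.println("128-255 match: " + sameValues(128, 255));
    }
}
```

Because every Latin-1 byte decodes to the character with the same number, converting Latin-1 data to Java chars amounts to widening each byte to 16 bits.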

Values beyond 255 encode characters from various other character sets. Where possible,
character blocks describing a particular group of characters map onto established encodings
for that set of characters by simple transposition. For instance, Unicode characters 884
through 1011 encode the Greek alphabet and associated characters like the Greek question
mark (;). [1] The letters in this block are a direct transposition by 720 of the corresponding
characters in the upper half (128 through 255) of the ISO 8859-7 character set, which is in
turn based on the Greek national standard ELOT 928. For example, the small letter delta, δ,
ISO 8859-7 character 228, is Unicode character 948. A small epsilon, ε, ISO 8859-7
character 229, is Unicode character 949. In general, the Unicode value for a Greek letter
equals the ISO 8859-7 value for the character plus 720.
Other character sets are included in Unicode in a similar fashion whenever possible. [2]
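
The Greek transposition can be measured rather than taken on faith. The sketch below assumes the runtime ships an "ISO-8859-7" charset, which Java does not guarantee on every platform, although common JDKs include it. It decodes single ISO 8859-7 bytes and reports the difference between the Unicode value and the byte value. The class name GreekTransposition and its helper methods are hypothetical names chosen for this example.

```java
import java.nio.charset.Charset;

class GreekTransposition {

    // Decodes one ISO 8859-7 byte value (0-255) to its Unicode value.
    // Charset.forName throws UnsupportedCharsetException if the
    // runtime does not provide ISO-8859-7.
    static int toUnicode(int isoValue) {
        Charset greek = Charset.forName("ISO-8859-7");
        String decoded = new String(new byte[] {(byte) isoValue}, greek);
        return decoded.codePointAt(0);
    }

    // Difference between the Unicode value and the ISO 8859-7 value.
    static int offset(int isoValue) {
        return toUnicode(isoValue) - isoValue;
    }

    public static void main(String[] args) {
        int[] samples = {228, 229}; // small delta, small epsilon
        for (int value : samples) {
            System.out.println(value + " -> " + toUnicode(value)
                + " (offset " + offset(value) + ")");
        }
    }
}
```

Not every position in the upper half of ISO 8859-7 is a Greek letter. Punctuation such as the middle dot (183) is shared with Latin-1 and keeps its Latin-1 value in Unicode, so the fixed offset should only be expected for the letters themselves.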

NextStep, BeOS, MacOS X Server, Bell Labs' Plan 9, and Windows NT 4.0 all support
Unicode to some extent. Unicode support in MacOS and Windows 98 is more nascent, but it's
coming. Application software is a little slower to appear, but Microsoft Word 97 and 98,
Netscape Navigator 4.0, and Internet Explorer 4.0 all support Unicode. The big hold-up on
most systems is fonts and input methods. Windows NT 5.0 will include fonts covering most
of the defined Unicode characters as well as input methods for most major languages.

14.2 Displaying Unicode Text<br />

Although internally Java can handle full Unicode data (it's just numbers, after all), not all Java
environments can display all Unicode characters. In fact, I'll go so far as to say none of the
current Java environments, whether standalone virtual machines or web browsers, can display
all Unicode characters.

Unicode is divided into blocks. For example, characters 0 through 127 are the Basic Latin block
and contain ASCII. Characters 128 through 255 are the Latin-1 Supplement block and contain

[1] Indeed, the Greek question mark is nearly identical to a Latin semicolon; this is not a mistranslation of the character.

[2] As much as I'd like to include complete tables for all Unicode characters, if I did so, this book would be little more than that table. For complete lists
of all the Unicode characters and associated glyphs, the canonical reference is The Unicode Standard, Version 2.0, by the Unicode Consortium, ISBN
0-201-48345-9. Online versions of the character tables can be found at http://unicode.org/charts/.
