23.07.2013 Views

Java IO.pdf - Nguyen Dang Binh


14.3 Unicode Escapes


Currently, there isn't a large installed base of Unicode text editors, and there's an even smaller installed base of machines with full Unicode fonts. It's therefore essential that every valid Java program can be written using nothing more than ASCII characters. All Java keywords and operators, as well as the names of all the classes, methods, and fields in the core API, can be written in pure ASCII. This is by deliberate design on the part of JavaSoft. However, Unicode characters are explicitly allowed in comments, string and char literals, and identifiers. The following, the opening line from Homer's Odyssey, should be legal Java:

/* Οδυσσεια */
String αρχη =
    "Άνδρα μοι "
    + "έννεπε, "
    + "Μουσα, "
    + " ος μάλα πολλα";

To enable statements like that in Java source, non-ASCII characters are embedded through Unicode escape sequences. The escape sequence for a character is a backslash (\) followed by a lowercase u, followed by the four-digit hexadecimal code for the character. For example:

char tab = '\u0009';         // U+0009 CHARACTER TABULATION
char softHyphen = '\u00AD';  // U+00AD SOFT HYPHEN
char sigma = '\u03C3';       // U+03C3 GREEK SMALL LETTER SIGMA
char katakanaSu = '\u30B9';  // U+30B9 KATAKANA LETTER SU
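Because an escape is translated before the compiler looks at the program text, an escaped character should behave exactly like the character itself. The following is a minimal sketch of that idea, not code from the book: the class name `EscapeBasics` and the helper `toEscape` are invented for this illustration.

```java
class EscapeBasics {
    // Formats a char as a backslash, a lowercase u, and four uppercase hex digits.
    static String toEscape(char c) {
        return String.format("\\u%04X", (int) c);
    }

    public static void main(String[] args) {
        char tab = '\u0009';
        char sigma = '\u03C3';

        // The escaped tab and the conventional tab literal are the same char.
        System.out.println(tab == '\t');

        // The four hex digits are simply the numeric value of the char.
        System.out.println((int) sigma == 0x03C3);

        // One escape in a string literal contributes exactly one char.
        System.out.println("\u03C3".length());

        // Going the other way: turn the char back into its escape text.
        System.out.println(toEscape(sigma));
    }
}
```

The format string doubles the backslash so that the compiler does not try to read it as the start of an escape.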

Using Unicode escapes, the opening line from Homer's Odyssey would be rendered as:

/* \u039F\u03B4\u03C5\u03C3\u03C3\u03B5\u03B9\u03B1 */
String \u03B1\u03C1\u03C7\u03B7 =
    "\u0386\u03BD\u03B4\u03C1\u03B1 \u03BC\u03BF\u03B9 "
    + "\u03AD\u03BD\u03BD\u03B5\u03C0\u03B5, "
    + "\u039C\u03BF\u03C5\u03C3\u03B1, "
    + " \u03BF\u03C2 \u03BC\u03AC\u03BB\u03B1 \u03C0\u03BF\u03BB\u03BB\u03B1";

Obviously, this is horribly inconvenient for anything more than an occasional non-ASCII character.
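Since writing escapes by hand is tedious, the conversion is the kind of task one would normally hand to a small utility. The sketch below shows one possible approach; `AsciiEscaper` and `escapeNonAscii` are hypothetical names, and the method makes the simplifying assumption that the input contains no backslashes that already begin an escape.

```java
class AsciiEscaper {
    // Replaces every char outside the 7-bit ASCII range with its four-digit escape.
    // Works on UTF-16 code units, so a supplementary character comes out as two escapes.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);                                   // plain ASCII passes through
            } else {
                out.append(String.format("\\u%04X", (int) c));   // everything else is escaped
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Greek word is itself written with escapes so this file stays pure ASCII.
        String word = "\u03BC\u03AC\u03BB\u03B1";
        System.out.println(escapeNonAscii(word));
        System.out.println(escapeNonAscii("plain ASCII is left alone"));
    }
}
```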

Many Java compilers assume that source files are written in ASCII and that the only Unicode characters present are Unicode escapes. During a single-pass preprocessing phase, the compiler converts each raw ASCII character or Unicode escape sequence to a two-byte Unicode character that it stores in memory. Only after preprocessing is complete, and the ASCII file has been converted to in-memory Unicode, is the file actually compiled. Some compilers and runtimes also accept the upper 128 characters of the ISO Latin-1 character set; others do not. Worse yet, some Java environments can compile files containing non-ASCII ISO Latin-1 characters but can't run the files they've compiled. For safety's sake and maximum portability, you should escape all non-ASCII characters.
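The preprocessing pass described above can be approximated in a few lines. This is only a rough sketch under stated assumptions, not how any particular compiler is implemented: `EscapeTranslator` and `translate` are invented names, the input is assumed to be ASCII text already held in a `String`, and the rule that a backslash preceded by an odd number of backslashes does not start an escape follows the Java Language Specification.

```java
class EscapeTranslator {
    // Converts ASCII text containing Unicode escapes into the chars they stand for.
    static String translate(String ascii) {
        StringBuilder out = new StringBuilder(ascii.length());
        int backslashRun = 0; // contiguous backslashes immediately before position i
        int i = 0;
        while (i < ascii.length()) {
            char c = ascii.charAt(i);
            boolean startsEscape = c == '\\'
                    && backslashRun % 2 == 0
                    && i + 1 < ascii.length()
                    && ascii.charAt(i + 1) == 'u';
            if (!startsEscape) {
                out.append(c);
                backslashRun = (c == '\\') ? backslashRun + 1 : 0;
                i++;
                continue;
            }
            // Skip the u, or several of them, then read exactly four hex digits.
            int j = i + 1;
            while (j < ascii.length() && ascii.charAt(j) == 'u') {
                j++;
            }
            if (j + 4 > ascii.length()) {
                throw new IllegalArgumentException("Truncated escape at index " + i);
            }
            int value = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = Character.digit(ascii.charAt(k), 16);
                if (digit < 0) {
                    throw new IllegalArgumentException("Bad hex digit at index " + k);
                }
                value = value * 16 + digit;
            }
            out.append((char) value);
            backslashRun = 0;
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash puts a real backslash into the string, as in an ASCII file.
        String onDisk = "char sigma = '\\u03C3';";
        String inMemory = translate(onDisk);
        System.out.println(onDisk.length() + " chars on disk, " + inMemory.length() + " in memory");
    }
}
```

Each six-character escape collapses to a single in-memory char, which is why the translated text is shorter than the file.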

Versions 1.1 and later of Sun's javac compiler assume that a .java file is written in the platform's default encoding, which is Latin-1 on Solaris and Windows and MacRoman on the Mac. However, this produces incorrect results on Windows, because Windows does not use true Latin-1 but a modified version (code page 1252) that includes fewer control characters and more printing characters.
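The mismatch between true Latin-1 and the Windows variant is easy to see by decoding the same byte both ways. The following is an illustrative sketch only: the class name `Latin1VersusWindows` and the helper `decode` are made up, and it assumes the runtime provides a charset named `windows-1252`, which the Java platform does not strictly guarantee.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1VersusWindows {
    // Decodes a single byte under the given charset and returns the resulting char.
    static char decode(int b, Charset charset) {
        return new String(new byte[] { (byte) b }, charset).charAt(0);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available
        // 0x80 and 0x93 are control characters in Latin-1 but printing characters
        // in the Windows variant; 0xE9 is the same letter in both.
        for (int b : new int[] { 0x80, 0x93, 0xE9 }) {
            System.out.printf("byte 0x%02X: Latin-1 U+%04X, Windows U+%04X%n",
                    b,
                    (int) decode(b, StandardCharsets.ISO_8859_1),
                    (int) decode(b, cp1252));
        }
    }
}
```

When the encoding of a source file is known, javac's `-encoding` option can be used to state it explicitly instead of relying on the platform default.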

