23.07.2013 Views

Java IO.pdf - Nguyen Dang Binh


Java I/O

encoding into a new file called isodata.txt encoded with ISO Latin-1 with Unicode escapes, you would type:

% native2ascii -encoding MacRoman macdata.txt isodata.txt

You can convert it back with the -reverse option:

% native2ascii -encoding MacRoman -reverse isodata.txt macdata.txt
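
To make the phrase "Unicode escapes" concrete, here is a minimal sketch of the escaping step alone. It is not how native2ascii is implemented: the names AsciiEscaper and escape are hypothetical, and the sketch assumes the text has already been decoded into a Java String and that every character above 0x7F should be written as a backslash-u escape.

```java
public class AsciiEscaper {

    // Replaces every character above 0x7F with a backslash-u escape
    // (four lowercase hex digits) and leaves plain ASCII untouched.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The input is "cafe" with an acute accent on the e (U+00E9).
        System.out.println(escape("caf\u00e9"));
    }
}
```

Only the accented character is rewritten; the ASCII letters pass through unchanged, which is why an escaped file stays readable in any ASCII-compatible editor.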

If you don't specify a particular encoding, native2ascii makes its best guess as to the platform's native encoding. This best guess is read from the system property file.encoding. On American Macs, the default is MacRoman. On American Solaris, the default is 8859_1 (ISO Latin-1). On American Windows, the default is also 8859_1. However, you shouldn't rely on these values. Instead, check the property directly. Systems configured for other countries are likely to have different default encodings. Table B.4 lists the many encodings that Java, javac, and native2ascii understand. As extensive as this list is, there are a few missing pieces. In particular, ISO 8859-10, a.k.a. Latin-6, includes ASCII plus various characters used for Lappish, Nordic, and Inuit languages in the upper 128 places. Java cannot currently convert characters in this set to Unicode.
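
Checking the property directly, as recommended above, might look like the following sketch. The class name DefaultEncoding and the helper fileEncoding are hypothetical. The call to Charset.defaultCharset() assumes Java 5 or later, which is newer than the Java 1.1 APIs this chapter describes.

```java
import java.nio.charset.Charset;

public class DefaultEncoding {

    // Returns the value of the file.encoding system property,
    // or "unknown" if the property is not set.
    public static String fileEncoding() {
        return System.getProperty("file.encoding", "unknown");
    }

    public static void main(String[] args) {
        System.out.println("file.encoding = " + fileEncoding());
        // On Java 5 and later, this reports the charset the VM uses by default.
        System.out.println("default charset = " + Charset.defaultCharset().name());
    }
}
```

The output varies from one machine and locale to the next, so there is no single expected value.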

Work is continuing on both Unicode and other character sets. ISO 8859-11 provides a standard encoding for Thai. ISO 8859-13, also known as Latin-7, uses the upper 128 characters past ASCII for the Baltic Rim languages. ISO 8859-14, also known as Latin-8, uses them for the Celtic languages. ISO 8859-15, also known as Latin-9, revises Latin-1 to add the euro sign and a handful of other characters. Eventually, converters will be needed for these encodings as well.

14.7 Converting Between Byte Arrays and Strings

The java.lang.String class has several constructors that form strings from byte arrays and several methods that return a byte array corresponding to a given string. Anytime a Unicode string is converted to bytes or vice versa, that conversion happens according to one of the encodings listed in Table B.4. The same string can produce different byte arrays if different encodings are used. Six constructors form a new String object from a byte array:

public String(byte[] ascii, int highByte)
public String(byte[] ascii, int highByte, int offset, int length)
public String(byte[] data, String encoding)
    throws UnsupportedEncodingException
public String(byte[] data, int offset, int length, String encoding)
    throws UnsupportedEncodingException
public String(byte[] data)
public String(byte[] data, int offset, int length)
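
As a rough illustration of how the same string yields different byte arrays under different encodings, the sketch below encodes one short string twice and decodes it again with the encoding-aware constructor. The class EncodingRoundTrip and its encode and decode helpers are hypothetical wrappers around String.getBytes(String) and String(byte[], String). The names "ISO-8859-1" and "UTF-8" are used here as an assumption because every Java platform must support them; the names in Table B.4 may differ.

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class EncodingRoundTrip {

    // Encodes text with the named encoding; an unknown name becomes an unchecked exception.
    public static byte[] encode(String text, String encoding) {
        try {
            return text.getBytes(encoding);
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, ex);
        }
    }

    // Decodes bytes with the named encoding using the String(byte[], String) constructor.
    public static String decode(byte[] data, String encoding) {
        try {
            return new String(data, encoding);
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, ex);
        }
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "cafe" with an acute accent on the e (U+00E9)

        byte[] latin1 = encode(s, "ISO-8859-1");
        byte[] utf8 = encode(s, "UTF-8");
        System.out.println(Arrays.toString(latin1)); // prints [99, 97, 102, -23]
        System.out.println(Arrays.toString(utf8));   // prints [99, 97, 102, -61, -87]

        // Decoding with the encoding that produced the bytes restores the string.
        System.out.println(decode(latin1, "ISO-8859-1").equals(s)); // prints true
        // Decoding UTF-8 bytes as Latin-1 yields five characters instead of four.
        System.out.println(decode(utf8, "ISO-8859-1").length());    // prints 5
    }
}
```

The last line shows why the encoding must be known when bytes are turned back into a string: the wrong choice does not fail, it silently produces different characters.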

The first two constructors, the ones with the highByte argument, are leftovers from Java 1.0 that are deprecated in Java 1.1. These two constructors do not accurately translate non-Latin-1 character sets into Unicode. Instead, they read each byte in the ascii array as the low-order byte of a two-byte character, then fill in the high-order byte with the highByte argument. For example:

byte[] isoLatin1 = new byte[256];
for (int i = 0; i < 256; i++) isoLatin1[i] = (byte) i;
String s = new String(isoLatin1, 0);
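
The byte-widening rule described above can be sketched by hand as follows. The class HighByteDemo and the method widen are hypothetical names. The sketch assumes the rule is exactly "low-order byte from the array, high-order byte from highByte", and compares the result against the deprecated constructor itself.

```java
public class HighByteDemo {

    // Builds each char from the low eight bits of highByte (high-order byte)
    // and the unsigned value of the array byte (low-order byte).
    public static String widen(byte[] ascii, int highByte) {
        char[] chars = new char[ascii.length];
        for (int i = 0; i < ascii.length; i++) {
            chars[i] = (char) (((highByte & 0xff) << 8) | (ascii[i] & 0xff));
        }
        return new String(chars);
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        byte[] data = {(byte) 0x41, (byte) 0xE9};

        // With a high byte of 0, every byte maps to the Latin-1 character of the same value.
        String viaConstructor = new String(data, 0);
        System.out.println(viaConstructor.equals(widen(data, 0))); // prints true

        // With a high byte of 0x04, byte 0x41 becomes U+0441, not the letter A.
        System.out.println(Integer.toHexString(widen(data, 0x04).charAt(0))); // prints 441
    }
}
```

Because every character in the result shares one high-order byte, this scheme can only reach a single 256-character block of Unicode at a time, which is why it cannot translate other character sets faithfully.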
