23.07.2013 Views

Java IO.pdf - Nguyen Dang Binh



Java I/O

Chapter 14. Multilingual Character Sets and Unicode

We live on a planet on which many languages are spoken. I can walk out my front door in Brooklyn on any given day and hear people conversing in French, Creole, Hebrew, Arabic, Spanish, and languages I don't even recognize. And the Internet is even more diverse than Brooklyn. A local doctor's office that sets up a storefront on the Web to sell vitamins may soon find itself shipping to customers whose native language is Chinese, Gujarati, Turkish, German, Portuguese, or something else. There's no such thing as a local business on the Internet.

However, the first computers and the first programming languages were mostly designed by English-speaking programmers in countries where English was the native language. These programmers designed character sets that worked well for English text, though not much else. The preeminent such set is ASCII. Since ASCII is a seven-bit character set, each ASCII character can easily be represented as a single byte, signed or unsigned. Thus, it's natural for ASCII-based programming languages to equate the character data type with the byte data type. In these languages, such as C, the same operations that read and write bytes also read and write characters.
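To make the seven-bit point concrete, here is a minimal sketch in Java. The class name `AsciiBytes` and the helper `isAscii` are hypothetical names chosen for illustration, and the code assumes a JDK recent enough to provide `java.nio.charset.StandardCharsets`. It shows that pure ASCII text encodes to exactly one byte per character and that a character such as c-cedilla falls outside the ASCII range.

```java
import java.nio.charset.StandardCharsets;

public class AsciiBytes {
    // Hypothetical helper: true if every char is in the 7-bit ASCII range 0..127.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "Hello";
        // Each ASCII character becomes exactly one byte.
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println(text.length() + " chars, " + bytes.length + " bytes");

        // \u00e7 is c-cedilla, which ASCII cannot represent.
        System.out.println(isAscii("Hello"));
        System.out.println(isAscii("fa\u00e7ade"));
    }
}
```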

Unfortunately, ASCII is inadequate for almost all non-English languages. It contains no cedillas, umlauts, betas, thorns, or any of the other thousands of non-English characters that are used to read and write text around the world. Fairly shortly after the development of ASCII, there was an explosion of extended character sets around the world, each of which encoded the basic ASCII characters as well as the additional characters needed for another language like Greek, Turkish, Arabic, Chinese, Japanese, or Russian. Many of these character sets are still used today, and much existing data is encoded in them.

However, these character sets are still inadequate for many needs. For one thing, most assume that you only want to encode English plus one other language. This makes it difficult for a Russian classicist to write a commentary on an ancient Greek text, for example. Furthermore, documents are limited by their character sets. Email sent from Morocco may become illegible in India if the sender is using an Arabic character set but the recipient is using Devanagari.
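The mismatch described above can be reproduced in a few lines. The sketch below is an illustration only: it substitutes UTF-8 and ISO-8859-1 for the Arabic and Devanagari character sets mentioned in the text, because those two charsets are guaranteed to be present on every Java platform. The class name `CharsetMismatch` and the method `transcodeWrongly` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    // Hypothetical helper: encodes text with the sender's charset, then
    // decodes the resulting bytes with the recipient's charset.
    static String transcodeWrongly(String text, Charset senders, Charset recipients) {
        byte[] wire = text.getBytes(senders);
        return new String(wire, recipients);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Same charset on both ends: the text survives.
        System.out.println(transcodeWrongly(original,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // Different charsets: the two UTF-8 bytes of the accented e are
        // read as two separate ISO-8859-1 characters.
        System.out.println(transcodeWrongly(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

The bytes on the wire are identical in both calls. Only the recipient's assumption about the character set changes, which is why the sender cannot detect the problem.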

Unicode is an international effort to provide a single character set that everyone can use. Unicode supports the characters needed for English, Arabic, Cyrillic, Greek, Devanagari, and many others. Unicode isn't perfect. There are some omissions, especially in the ideographic character sets for Chinese and Japanese, but it is the most comprehensive character set yet devised for all the languages of planet Earth.

Java is one of the first programming languages to explicitly address the need for non-English text. It does this by adopting Unicode as its native character set. All Java chars and strings are given in Unicode. However, since there's also a lot of non-Unicode legacy text in the world, in a dizzying array of encodings, Java also provides the classes you need to read and write text in these encodings as well.
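As a rough sketch of what reading and writing text in a chosen encoding looks like, the example below wraps in-memory byte streams in `OutputStreamWriter` and `InputStreamReader`. The class name `EncodedText` and its `write` and `read` helpers are hypothetical, and the code assumes a modern JDK (it uses `StandardCharsets`, try-with-resources, and `UncheckedIOException`).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedText {
    // Hypothetical helper: writes Unicode text as bytes in the given encoding.
    static byte[] write(String text, Charset encoding) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, encoding)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Hypothetical helper: reads bytes in the given encoding back into a String.
    static String read(byte[] data, Charset encoding) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), encoding)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The same character occupies a different number of bytes per encoding.
        System.out.println(write("A", StandardCharsets.US_ASCII).length);
        System.out.println(write("A", StandardCharsets.UTF_16BE).length);

        // Round trip through a legacy single-byte encoding.
        byte[] latin1 = write("caf\u00e9", StandardCharsets.ISO_8859_1);
        System.out.println(read(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

Inside the program the text is always a Unicode `String`. The encoding matters only at the boundary where characters are converted to or from bytes.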

14.1 Unicode

Unicode is Java's native character set. Each Unicode character is a two-byte, unsigned number with a value between 0 and 65,535. This provides enough space for characters from all the
