10.07.2015

Download - Multivac!


Cookbook: A full code sample can be found in the Cookbook topic text_output/process_utf8.

Unicode encoding schemes and the Byte Order Mark (BOM). Computer architectures differ in the ordering of bytes, i.e. whether the bytes constituting a larger value (16- or 32-bit) are stored with the most significant byte first (big-endian) or the least significant byte first (little-endian). A common example of a big-endian architecture is PowerPC, while the x86 architecture is little-endian. Since UTF-16 and UTF-32 are based on values which are larger than a single byte, the byte-ordering issue comes into play here. An encoding scheme (note the difference from encoding form above) specifies the encoding form plus the byte ordering. For example, UTF-16BE stands for UTF-16 with big-endian byte ordering. If the byte ordering is not known in advance, it can be specified by means of the code point U+FEFF, which is called the Byte Order Mark (BOM). Although a BOM is not required in UTF-8, it may be present as well, and can be used to identify a stream of bytes as UTF-8. Table 4.1 lists the representation of the BOM for various encoding forms.

Table 4.1 Byte order marks for various Unicode encoding forms

Encoding form           Byte order mark (hex)   Graphical representation
UTF-8                   EF BB BF                ï»¿
UTF-16 big-endian       FE FF                   þÿ
UTF-16 little-endian    FF FE                   ÿþ
UTF-32 big-endian       00 00 FE FF             ? ? þÿ (1)
UTF-32 little-endian    FF FE 00 00             ÿþ ? ? (1)

(1) There is no standard graphical representation of null bytes.
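The BOM values in Table 4.1 can be checked programmatically. The following is a minimal Python sketch, not the Cookbook sample mentioned above: the function name detect_bom and the BOMS list are hypothetical names chosen for illustration, while the BOM constants come from Python's standard codecs module. It assumes that the data starts at the beginning of the byte stream.

```python
import codecs

# BOM byte sequences paired with the encoding scheme they indicate.
# The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks,
# because the UTF-32 little-endian BOM (FF FE 00 00) begins with the
# UTF-16 little-endian BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def detect_bom(data: bytes):
    """Return (scheme name, BOM length), or (None, 0) if no BOM is found.

    Hypothetical helper for illustration only. Note that FF FE 00 00 is
    reported as UTF-32LE, although it could also be a UTF-16LE BOM
    followed by the code point U+0000.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    # U+FEFF serialized big-endian yields the bytes FE FF from Table 4.1.
    sample = codecs.BOM_UTF16_BE + "Hi".encode("utf-16-be")
    name, size = detect_bom(sample)
    print(name, size)
    # Skip the BOM, then decode the rest with the detected byte ordering.
    print(sample[size:].decode("utf-16-be"))
```

If no BOM is present, the function reports nothing and the caller has to know the encoding scheme by other means, which matches the situation described above where the byte ordering must be known in advance.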
