DRAFT IEEE Standard for Binary Floating-Point Arithmetic - Sonic.net
DRAFT IEEE Standard for Floating-Point Arithmetic – 2003 August 12 10:20
The smallest positive normal number is $b^{emin}$ and the largest is
$b^{emax}(b - b^{1-p})$. The nonzero values with magnitude less than
$b^{emin}$ are called subnormal because their magnitudes lie between
zero and the smallest normal magnitude. One reason to
distinguish between normal and subnormal binary numbers is that
formats may encode these two kinds of numbers differently.
Every finite representable number is an integral multiple of the
smallest subnormal magnitude $b^{emin} \times b^{1-p}$.

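
The three boundary magnitudes above can be checked numerically. The following is a minimal sketch, assuming that Python's `float` is the binary64 format of Table 1a (true for CPython on common platforms, but an assumption nonetheless). The helper name `is_multiple_of_smallest_subnormal` is hypothetical and introduced only for illustration. Exact rational arithmetic is used so that no rounding hides a discrepancy.

```python
import sys
from fractions import Fraction

# binary64 parameters from Table 1a (b = 2, p = 53, emax = +1023, emin = -1022).
# Assumption: Python's float is binary64 on this platform.
b, p, emax, emin = 2, 53, 1023, -1022

# Exact rational values of the three boundary magnitudes named in the text.
smallest_normal = Fraction(b) ** emin                                # b^emin
largest_finite = Fraction(b) ** emax * (b - Fraction(b) ** (1 - p))  # b^emax (b - b^(1-p))
smallest_subnormal = Fraction(b) ** emin * Fraction(b) ** (1 - p)    # b^emin b^(1-p)


def is_multiple_of_smallest_subnormal(x: float) -> bool:
    """Return True if x is an integral multiple of the smallest subnormal."""
    # Fraction(x) converts the float exactly; the quotient is an integer
    # precisely when its reduced denominator is 1.
    return (Fraction(x) / smallest_subnormal).denominator == 1


if __name__ == "__main__":
    print(smallest_normal == Fraction(sys.float_info.min))
    print(largest_finite == Fraction(sys.float_info.max))
    print(is_multiple_of_smallest_subnormal(0.1))
```

Under the stated assumption, `sys.float_info.min` and `sys.float_info.max` correspond to the smallest normal and largest finite values, which gives an independent cross-check of the formulas.
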
For any variable that has the value zero, the sign bit s provides an
extra bit of information. Although all formats have distinct
representations for +0 and −0, the signs are significant in some
circumstances, such as division by zero, but not in others (see
section 6.3). In this standard, 0 and ∞ are written without a sign
when the sign is not important.

[In this section, it may be appropriate to note and explain the differences between
the real numbers (or real numbers extended with signed infinities) and a format's
approximation to real numbers with respect to zero. In other words, while the
real number system only has a single zero, floating-point numbers have both
positive and negative zero for pragmatic reasons to make certain computations
easier to formulate (branch cuts, etc.) and to allow more regular rules (e.g.
defining sign of 1/0). -JDD]

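
The role of the sign of zero can be illustrated with a short sketch. It assumes Python's `float` is a binary64 format with signed zeros, and it uses only the standard `math` module. Python raises `ZeroDivisionError` for float division by zero instead of returning an infinity, so the sketch shows the sign through `math.copysign` and through the branch cut of `math.atan2` rather than through 1/0.

```python
import math

pos_zero, neg_zero = 0.0, -0.0

# The two zeros compare equal, so == cannot see the sign.
zeros_compare_equal = (pos_zero == neg_zero)

# copysign reads the sign bit, so it does tell the zeros apart.
sign_of_pos_zero = math.copysign(1.0, pos_zero)
sign_of_neg_zero = math.copysign(1.0, neg_zero)

# atan2 has a branch cut along the negative real axis. The sign of a
# zero y-coordinate selects which side of the cut the point lies on.
angle_above_cut = math.atan2(pos_zero, -1.0)
angle_below_cut = math.atan2(neg_zero, -1.0)

if __name__ == "__main__":
    print(zeros_compare_equal)
    print(sign_of_pos_zero, sign_of_neg_zero)
    print(angle_above_cut, angle_below_cut)
```

This is one case where the sign is significant (the two angles differ by 2π) alongside one where it is not (the equality comparison).
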
Basic Format Parameters Defining Representable Values

                 Binary format (b = 2)              Decimal format (b = 10)
Parameter    binary32  binary64  binary128    decimal32  decimal64  decimal128
p (digits)         24        53        113            7         16          34
emax             +127     +1023     +16383          +96       +384       +6144
emin             −126     −1022     −16382          −95       −383       −6143

Table 1a: Representable Value Parameters

[Note: These choices for emin, emax, and p meet the constraints of section 3.1
of IEEE 854:
• $(emax - emin)/p$ shall exceed 4 and should exceed 10
• $b^{p-1} \ge 10^5$.
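
The two constraints quoted in this note can be checked mechanically against the table. The sketch below copies the parameters from Table 1a and uses the thresholds exactly as the note states them. The names `FORMATS`, `meets_required_constraints` and `meets_recommended_range` are hypothetical, chosen only for this illustration.

```python
# Parameters copied from Table 1a: format name -> (b, p, emax, emin).
FORMATS = {
    "binary32": (2, 24, 127, -126),
    "binary64": (2, 53, 1023, -1022),
    "binary128": (2, 113, 16383, -16382),
    "decimal32": (10, 7, 96, -95),
    "decimal64": (10, 16, 384, -383),
    "decimal128": (10, 34, 6144, -6143),
}


def meets_required_constraints(b: int, p: int, emax: int, emin: int) -> bool:
    """Check the 'shall' constraints as quoted in the note."""
    # (emax - emin)/p > 4, written with integers to avoid rounding.
    exponent_range_ok = (emax - emin) > 4 * p
    # b^(p-1) >= 10^5
    precision_ok = b ** (p - 1) >= 10 ** 5
    return exponent_range_ok and precision_ok


def meets_recommended_range(p: int, emax: int, emin: int) -> bool:
    """Check the 'should' form: (emax - emin)/p > 10."""
    return (emax - emin) > 10 * p


required = {name: meets_required_constraints(*params) for name, params in FORMATS.items()}
recommended = {
    name: meets_recommended_range(p, emax, emin)
    for name, (_, p, emax, emin) in FORMATS.items()
}

if __name__ == "__main__":
    for name in FORMATS:
        print(name, required[name], recommended[name])
```
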
Copyright © 2003 by the Institute of Electrical and Electronics Engineers, Inc. This document is an unapproved
draft of a proposed IEEE-SA Standard - USE AT YOUR OWN RISK. See statement on page 1.