U. Glaeser

The efficiency of this process depends on the availability of a fast multiplier, since each iteration of Eq. (9.32) requires two multiplications and a subtraction. The complete process for the initial estimate, three iterations, and the final quotient determination requires four subtraction operations and seven multiplication operations to produce a 16-bit quotient. This is faster than a conventional non-restoring divider if multiplication is roughly as fast as addition, a condition that may be satisfied for systems which include hardware multipliers.
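The iterative process described above can be sketched in software. The sketch below assumes Eq. (9.32) is the standard Newton-Raphson reciprocal iteration, x ← x(2 − b·x), with the divisor normalized to [1/2, 1); the linear initial estimate used here is a common choice and is an assumption, since the text's exact seed is not shown. Note that the factor-of-two multiply in the seed is a shift in hardware, which is how the count comes out to seven multiplications and four subtractions.

```python
def newton_raphson_divide(a, b, iterations=3):
    # Assumes the divisor b is normalized to [0.5, 1).
    # Initial estimate of 1/b: a linear approximation over [0.5, 1).
    # The 2.0 * b product is a left shift in hardware, so the seed
    # costs only one subtraction.
    x = 2.9142 - 2.0 * b
    # Each iteration: two multiplications and one subtraction.
    # Convergence is quadratic (the error roughly squares each pass).
    for _ in range(iterations):
        x = x * (2.0 - b * x)
    # Final quotient determination: one more multiplication.
    return a * x

q = newton_raphson_divide(0.6, 0.75)  # 0.6 / 0.75 = 0.8
```

Counting operations for three iterations: one subtraction for the seed, six multiplications and three subtractions in the loop, and one final multiplication, matching the seven multiplications and four subtractions cited above.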

Floating-Point Arithmetic

Recent advances in VLSI have increased the feasibility of hardware implementations of floating point arithmetic units. The main advantage of floating point arithmetic is that its wide dynamic range virtually eliminates overflow for most applications.

Floating-Point Number Systems

A floating point number, A, consists of a significand (or mantissa), Sa, and an exponent, Ea. The value of a number, A, is given by the equation:

A = Sa × r^Ea    (9.34)

where r is the radix (or base) of the number system. Use of the binary radix (i.e., r = 2) gives maximum accuracy, but may require more frequent normalization than higher radices.

© 2002 by CRC Press LLC

The IEEE Std. 754 single precision (32-bit) floating point format, which is widely implemented, has an 8-bit biased integer exponent which ranges between 1 and 254 [20]. The exponent is expressed in excess-127 code, so its effective value is determined by subtracting 127 from the stored value. Thus, the range of effective values of the exponent is −126 to 127, corresponding to stored values of 1 to 254, respectively. A stored exponent value of ZERO (Emin) serves as a flag for ZERO (if the significand is ZERO) and for denormalized numbers (if the significand is non-ZERO). A stored exponent value of 255 (Emax) serves as a flag for infinity (if the significand is ZERO) and for "not a number" (if the significand is non-ZERO). The significand is a 25-bit sign magnitude mixed number (the binary point is to the right of the most significant bit, which is always a ONE except for denormalized numbers). More detail on floating point formats and on the various considerations that arise in the implementation of floating point arithmetic units is given in [7,21]. The IEEE 754 standard for floating point numbers is discussed in [22,23].
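The field layout just described can be checked concretely by unpacking a single precision value. In this sketch, the function name and the returned tuple format are illustrative assumptions; the bit positions and the excess-127 decoding follow the format described above.

```python
import struct

def decode_single(x):
    # Reinterpret x's 32-bit IEEE 754 single precision encoding as an integer.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    stored_exp = (bits >> 23) & 0xFF   # 8-bit biased (excess-127) exponent
    fraction = bits & 0x7FFFFF         # 23 stored significand bits
    if stored_exp == 0:                # Emin: flags ZERO or a denormal
        kind = 'zero' if fraction == 0 else 'denormal'
        effective_exp = None
    elif stored_exp == 255:            # Emax: flags infinity or NaN
        kind = 'inf' if fraction == 0 else 'nan'
        effective_exp = None
    else:
        kind = 'normal'
        effective_exp = stored_exp - 127  # subtract the bias of 127
    return sign, stored_exp, effective_exp, kind

print(decode_single(1.0))  # (0, 127, 0, 'normal')
```

For example, 1.0 is stored with a biased exponent of 127, giving an effective exponent of 0, while 0.0 and infinity are flagged by the Emin and Emax codes respectively.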

Floating-Point Addition

A flow chart for floating point addition is shown in Fig. 9.18. For this flowchart, the operands are assumed to be "unpacked" and normalized with magnitudes in the range [1/2, 1). On the flow chart, the operands are (Ea, Sa) and (Eb, Sb), the result is (Es, Ss), and the radix is 2. In step 1 the operand exponents are compared; if they are unequal, the significand of the number with the smaller exponent is shifted right in step 3 or 4 by the difference in the exponents to properly align the significands. For example, to add the decimal operands 0.867 × 10^5 and 0.512 × 10^4, the latter would be shifted right by one digit and 0.867 added to 0.0512 to give a sum of 0.9182 × 10^5. The addition of the significands is performed in step 5. Steps 6–8 test for overflow and correct, if necessary, by shifting the significand one position to the right and incrementing the exponent. Step 9 tests for a ZERO significand. The loop of steps 10–11 scales unnormalized (but non-ZERO) significands upward to normalize the result. Step 12 tests for underflow.
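The steps of the flowchart can be sketched as follows. The step numbering in the comments follows the description above; the function name and the operand representation (a separate integer exponent and a real-valued significand in [1/2, 1)) are illustrative assumptions, not the hardware datapath of Fig. 9.18.

```python
def fp_add(ea, sa, eb, sb):
    # Operands (ea, sa) and (eb, sb): radix-2 exponent and a
    # normalized significand with magnitude in [1/2, 1).
    # Steps 1-4: compare exponents and shift the significand of the
    # number with the smaller exponent right by the difference.
    if ea < eb:
        ea, sa, eb, sb = eb, sb, ea, sa
    sb = sb / (2 ** (ea - eb))
    es = ea
    # Step 5: add the aligned significands.
    ss = sa + sb
    # Steps 6-8: on overflow (|ss| >= 1), shift right one position
    # and increment the exponent.
    if abs(ss) >= 1.0:
        ss /= 2.0
        es += 1
    # Step 9: a ZERO significand needs no normalization.
    if ss == 0.0:
        return es, 0.0
    # Steps 10-11: scale a non-ZERO unnormalized result upward
    # until its magnitude reaches [1/2, 1).
    while abs(ss) < 0.5:
        ss *= 2.0
        es -= 1
    # Step 12: an underflow test (es below Emin) would go here.
    return es, ss

# Radix-2 analogue of the decimal alignment example:
# 0.75 * 2^3 + 0.5 * 2^1 -> 0.5 aligns to 0.125, sum is 0.875 * 2^3.
print(fp_add(3, 0.75, 1, 0.5))  # (3, 0.875)
```

The overflow path (steps 6–8) fires for inputs such as 0.75 + 0.75 = 1.5, which is renormalized to 0.75 with the exponent incremented; the scaling loop (steps 10–11) fires when cancellation leaves a small result, such as 0.75 + (−0.5) = 0.25.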

Floating point subtraction is implemented with a similar algorithm. Many refinements are possible to improve the speed of the addition and subtraction algorithms, but floating point addition and subtraction will, in general, be much slower than fixed-point addition as a result of the need for operand alignment and result normalization.
