
MSE: Hardware Algorithms
Computer Arithmetic

Marcel Jacomet
Bern University of Applied Sciences
BFH-TI HuCE-microLab, Biel/Bienne
Marcel.Jacomet@bfh.ch

October 29, 2013

Contents

1 Introduction
2 Signed Digit Numbers
2.1 General SD Numbers
2.2 Binary SD Numbers
2.3 Canonic SD Numbers
2.4 Encoding/Converting SD Numbers
3 Fast Addition
3.1 Overview of Adders
3.2 SD Adder
4 Multiplication and Division
4.1 Sequential Algorithms
4.2 Sequential Mul
4.3 Sequential Div
4.4 Square Root Extraction
4.5 Division Through Mul
4.6 Division by Newton-Raphson
5 Elementary Functions
5.1 Basics
5.2 Additive Normalization
5.3 Multiplicative Normalization
References

Marcel Jacomet ii 2008


© Marcel Jacomet, 2009

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the author, except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, or computer software is forbidden.

1 Introduction

The slides "Hardware Algorithms: Computer Arithmetic" are based on [Kor02].

Two's complement is the most widely used number system in digital hardware. Nevertheless, other, less conventional number systems exist, such as the signed digit number system. The signed digit number system differs from traditional binary two's complement but can yield a significant speed improvement in arithmetic operators like adders, multipliers or dividers. This speed improvement is basically achieved by eliminating or shortening the delay-consuming carry chains. The binary signed digit (SD) and canonic signed digit (CSD) number systems are special cases that use the ternary digit set $\{-1, 0, 1\}$. More generally, signed digit number systems (GSD) with radix $r$ can use digits from $\{-(r-1), -(r-2), \ldots, -1, 0, 1, \ldots, (r-2), (r-1)\}$. In this chapter we look into the theory of the signed digit number system. Some additional literature is referenced as well.

Textbooks

• Computer Arithmetic Algorithms, 2nd edition. Israel Koren, 2002, A. K. Peters, hardcover, ISBN 1-56881-160-8, USD 62
• Many good texts can be found on the web.

2 Signed Digit Numbers

2.1 General Signed Digit Numbers

The Signed Digit Number System

• Signed digit (SD) number systems are used to eliminate carry propagation chains in addition and subtraction.
• SD is useful for developing fast algorithms for arithmetic operators like multiplication or division.
• In all classical fixed-radix systems the digits are restricted to $x_i \in \{0, \cdots, r-1\}$.
• However, we can extend the digit set to obtain the signed digit number system: $x_i \in \{\overline{r-1}, \overline{r-2}, \cdots, \bar{1}, 0, 1, \cdots, r-2, r-1\}$, where $\bar{i}$ equals $-i$.

The ceiling $\lceil x \rceil$ of a number $x$ is the smallest integer that is larger than or equal to $x$.

Redundancy in Signed Digit Numbers

• For $r = 10$ the digit set is $\{\bar{9}, \bar{8}, \cdots, \bar{1}, 0, 1, \cdots, 8, 9\}$ and, if $n = 2$, the range is $\overline{99} \le x \le 99$, which includes 199 numbers. However, with 2 digits, each having 19 possibilities, there are $19^2 = 361$ representations. Thus some numbers have more than one representation.
• The signed digit number system is therefore redundant: $(01) = (1\bar{9}) = 1$, $(0\bar{3}) = (\bar{1}7) = -3$, $\cdots$
• A high redundancy is too costly, thus we reduce it by restricting the digit set to $x_i \in \{\bar{a}, \overline{a-1}, \cdots, \bar{1}, 0, 1, \cdots, a\}$ with $\lceil \frac{r-1}{2} \rceil \le a \le r-1$.
• For $r = 10$ we get the range $5 \le a \le 9$ and select $a = 5$. For $n = 2$ we can still represent 1 by $(01)$, whereas $(1\bar{9})$ is no longer valid since $\bar{9}$ is not in the digit set. However, 5 still has two representations, $(05)$ and $(1\bar{5})$.

Note that the carry digits in the example are shifted one position to the left to simplify step two of the algorithm. The first carry $c_i$ and the first interim sum digit $u_i$ are calculated as follows: $c_i = 1$ since $5 + 5 \ge a = 5$, and thus $u_i = 5 + 5 - 10 \cdot 1 = 0$. This calculation can be done for all positions simultaneously. The second step is to calculate the sum digits $s_i$. Here again, all sum digits can be calculated simultaneously. Note that the carries therefore do not propagate.

Adding with SD numbers

step 1 Compute an interim sum $u_i$ and a carry $c_i$: $u_i = x_i + y_i - r c_i$, where

$$
c_i = \begin{cases} 1 & \text{if } (x_i + y_i) \ge a \\ \bar{1} & \text{if } (x_i + y_i) \le \bar{a} \\ 0 & \text{otherwise} \end{cases}
$$

step 2 Calculate the final sum $s_i = u_i + c_{i-1}$.

• Let's add two decimal numbers ($r = 10$, $a = 5$). No carry propagation is needed!

$$
\begin{array}{rccccccl}
  &   & 2 & 5 & 5 & 5 & 5 & \\
+ &   & 2 & 3 & 4 & 5 & 5 & \\ \hline
  & 0 & 1 & 1 & 1 & 1 &   & c_i \\
  &   & 4 & \bar{2} & \bar{1} & 0 & 0 & u_i \\ \hline
  &   & 5 & \bar{1} & 0 & 1 & 0 & s_i
\end{array}
$$
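The two steps above can be sketched in a few lines of Python. This is only an illustrative model, not the author's implementation: the function names `sd_add` and `sd_value` and the list-of-digits representation (most significant digit first, negative digits as negative integers) are assumptions made for this example.

```python
def sd_add(x, y, r=10, a=6):
    """Add two signed-digit numbers given as digit lists (most significant digit first)."""
    n = max(len(x), len(y))
    # Pad to equal length, then reverse so that index i has the weight r**i.
    xs = ([0] * (n - len(x)) + list(x))[::-1]
    ys = ([0] * (n - len(y)) + list(y))[::-1]

    # Step 1: every position computes its carry c_i and interim sum u_i independently.
    c, u = [], []
    for xi, yi in zip(xs, ys):
        t = xi + yi
        ci = 1 if t >= a else (-1 if t <= -a else 0)
        c.append(ci)
        u.append(t - r * ci)

    # Step 2: s_i = u_i + c_{i-1}; the last carry becomes the new leading digit.
    s = [u[i] + (c[i - 1] if i > 0 else 0) for i in range(n)]
    s.append(c[n - 1])
    return s[::-1]


def sd_value(digits, r=10):
    """Evaluate a signed-digit list (most significant digit first) as an integer."""
    value = 0
    for d in digits:
        value = value * r + d
    return value
```

Calling `sd_add([2, 5, 5, 5, 5], [2, 3, 4, 5, 5], r=10, a=5)` reproduces the digits of the worked example, and `sd_value` of the result is 49010. Because each position only looks at its own digits and the carry of its right neighbour, both loops could run fully in parallel in hardware.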


We want to guarantee that no new carry will be generated in step two. Thus the variable $a$ has to be defined differently. We need $|s_i| \le a$, which means $|s_i| = |u_i + c_{i-1}| \le a$. Since $|c_{i-1}| \le 1$ we have to satisfy $|u_i| \le a-1$ for all input values $x_i$ and $y_i$. For the largest input values ($x_i = y_i = a$) we have $u_i = 2a - r \le a-1$, resulting in the first limit $a \le r-1$, which is always the case. However, if $x_i + y_i = a$, the smallest sum for which $c_i$ is still 1, we get $u_i = a - r < 0$. Substituting it into $|u_i| \le a-1$ leads to $r - a \le a-1$ and finally to the second limit $\frac{r+1}{2} \le a$.

Preventing carry ripples in SD numbers

• Let's keep the definition $r = 10$ and $a = 5$ and calculate the sum of the two SD numbers $124\bar{4}$ and $131\bar{4}$:

$$
\begin{array}{rcccccl}
  &   & 1 & 2 & 4 & \bar{4} & \\
+ &   & 1 & 3 & 1 & \bar{4} & \\ \hline
  & 0 & 1 & 1 & \bar{1} &   & c_i \\
  &   & 2 & \bar{5} & \bar{5} & 2 & u_i \\ \hline
  &   & 3 & \bar{4} & \bar{6} & 2 & s_i
\end{array}
$$

• As the digit $\bar{6}$ does not exist, a carry would occur.
• It can be shown that a new, tighter range for $a$ is required to keep $s_i$ within the digit set: $\lceil \frac{r+1}{2} \rceil \le a \le r-1$
• Redefining $a = 6$, we again add two decimal SD numbers, $135\bar{4}$ and $131\bar{4}$. No carry propagation is needed!

$$
\begin{array}{rcccccl}
  &   & 1 & 3 & 5 & \bar{4} & \\
+ &   & 1 & 3 & 1 & \bar{4} & \\ \hline
  & 0 & 1 & 1 & \bar{1} &   & c_i \\
  &   & 2 & \bar{4} & \bar{4} & 2 & u_i \\ \hline
  &   & 3 & \bar{3} & \bar{5} & 2 & s_i
\end{array}
$$

2.2 Binary Signed Digit Numbers

Binary SD numbers

• The digit set for binary SD numbers ($r = 2$, thus $a = 1$) is $\{\bar{1}, 0, 1\}$.

step 1 Compute an interim sum $u_i$ and a carry $c_i$: $u_i = x_i + y_i - 2c_i$, where

$$
c_i = \begin{cases} 1 & \text{if } (x_i + y_i) \ge 1 \\ \bar{1} & \text{if } (x_i + y_i) \le \bar{1} \\ 0 & \text{otherwise} \end{cases}
$$

step 2 Calculate the final sum $s_i = u_i + c_{i-1}$.

$$
\begin{array}{rccccccl}
  &   & 1 & 1 & 1 & 1 & 1 & \\
+ &   & 0 & 0 & 0 & 0 & 1 & \\ \hline
  & 1 & 1 & 1 & 1 & 1 &   & c_i \\
  &   & \bar{1} & \bar{1} & \bar{1} & \bar{1} & 0 & u_i \\ \hline
  & 1 & 0 & 0 & 0 & 0 & 0 & s_i
\end{array}
$$

Carry ripples in binary SD numbers

• Note that the condition $\lceil \frac{r+1}{2} \rceil \le a \le r-1$ cannot be satisfied for $r = 2$, thus there is no guarantee that carries are prevented.
• Let's add two binary SD numbers. If we ignored the carries, the non-existing digits $2$ and $\bar{2}$ would occur:

$$
\begin{array}{rcccccl}
  &   & 0 & \bar{1} & 1 & \bar{1} & \\
+ &   & 1 & 0 & 0 & \bar{1} & \\ \hline
  & 1 & \bar{1} & 1 & \bar{1} &   & c_i \\
  &   & \bar{1} & 1 & \bar{1} & 0 & u_i \\ \hline
  & 1 & \bar{2} & 2 & \bar{2} & 0 & s_i
\end{array}
$$

Preventing carry ripples in binary SD numbers

• A problematic situation occurs when $x_i y_i = 01$ and $x_{i-1} y_{i-1}$ equals either $\bar{1}\bar{1}$ or $0\bar{1}$.

$$
\begin{array}{rcccl}
  &   & 1 & \bar{1} & \\
+ &   & 0 & 0 & \\ \hline
  & 1 & \bar{1} &   & c_i \\
  &   & \bar{1} & 1 & u_i \\ \hline
  & 1 & \bar{2} & 1 & s_i
\end{array}
$$

• We can avoid setting $u_i = \bar{1}$ in these cases by using the new combination of $c_i = 0$ and $u_i = 1$:

$$
\begin{array}{rcccl}
  &   & 1 & \bar{1} & \\
+ &   & 0 & 0 & \\ \hline
  & 0 & \bar{1} &   & c_i \\
  &   & 1 & 1 & u_i \\ \hline
  & 0 & 0 & 1 & s_i
\end{array}
$$

Rules to prevent carry ripples in binary SD numbers

• So far we used the following rules for adding binary SD numbers:

$$
\begin{array}{l|cccccc}
x_i y_i & 00 & 01 & 0\bar{1} & 11 & \bar{1}\bar{1} & 1\bar{1} \\ \hline
c_i & 0 & 1 & \bar{1} & 1 & \bar{1} & 0 \\
u_i & 0 & \bar{1} & 1 & 0 & 0 & 0
\end{array}
$$

• Evolving the previously described idea, we get the modified rules table for adding binary SD numbers:

$$
\begin{array}{l|cccccccc}
x_i y_i & 00 & 01 & 01 & 0\bar{1} & 0\bar{1} & 11 & \bar{1}\bar{1} & 1\bar{1} \\
x_{i-1} y_{i-1} & - & \text{neither is } \bar{1} & \text{at least one is } \bar{1} & \text{neither is } 1 & \text{at least one is } 1 & - & - & - \\ \hline
c_i & 0 & 1 & 0 & \bar{1} & 0 & 1 & \bar{1} & 0 \\
u_i & 0 & \bar{1} & 1 & 1 & \bar{1} & 0 & 0 & 0
\end{array}
$$

Examples for adding binary SD numbers

• Let's again add the two binary SD numbers, this time with the modified rules. Now no carry propagation occurs!

$$
\begin{array}{rcccccl}
  &   & 0 & \bar{1} & 1 & \bar{1} & \\
+ &   & 1 & 0 & 0 & \bar{1} & \\ \hline
  & 0 & 0 & 0 & \bar{1} &   & c_i \\
  &   & 1 & \bar{1} & 1 & 0 & u_i \\ \hline
  & 0 & 1 & \bar{1} & 0 & 0 & s_i
\end{array}
$$

• The above binary SD numbers represent the decimal numbers $-3_{10}$ and $7_{10}$, resulting in $4_{10}$ after the addition.
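The modified rules table can be checked with a small Python model. It is a sketch under stated assumptions: digits are the integers -1, 0, 1 in lists with the most significant digit first, both operands have the same length, and the names `bsd_add` and `bsd_value` are invented for this illustration.

```python
def bsd_add(x, y):
    """Carry-free addition of two binary SD numbers (digit lists, MSB first, equal length)."""
    n = len(x)
    xs, ys = list(x)[::-1], list(y)[::-1]   # index i now has the weight 2**i
    c = [0] * n
    u = [0] * n
    for i in range(n):
        t = xs[i] + ys[i]
        # Digits of the next lower position decide the ambiguous cases.
        lower = (xs[i - 1], ys[i - 1]) if i > 0 else (0, 0)
        if t == 2:
            c[i], u[i] = 1, 0
        elif t == -2:
            c[i], u[i] = -1, 0
        elif t == 1:
            # A negative carry may arrive from below: keep u_i = 1 to absorb it.
            c[i], u[i] = (0, 1) if -1 in lower else (1, -1)
        elif t == -1:
            # A positive carry may arrive from below: keep u_i = -1 to absorb it.
            c[i], u[i] = (0, -1) if 1 in lower else (-1, 1)
        # t == 0 (digit pairs 00 and 1,-1): c_i = u_i = 0
    s = [u[i] + (c[i - 1] if i > 0 else 0) for i in range(n)]
    s.append(c[n - 1])
    return s[::-1]


def bsd_value(digits):
    """Evaluate a binary SD digit list (MSB first) as an integer."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value
```

For the operands of the example, `bsd_add([0, -1, 1, -1], [1, 0, 0, -1])` returns `[0, 1, -1, 0, 0]`, i.e. the value 4, and every sum digit stays inside the digit set because each position depends only on its own digits and those of its right neighbour.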


2.3 Canonic Signed Digit Numbers

Canonical recoding of binary SD numbers

• Binary SD numbers are particularly useful in fast multiplication and division algorithms.
• Nonzero digits correspond to an active operation: add/subtract.
• Zero digits correspond to shift-only operations.
• To reduce power consumption, the number of nonzero digits should be reduced as much as possible.
• The number $7_{10}$ has the following representations:

$$
\begin{array}{ccccc}
16 & 8 & 4 & 2 & 1 \\ \hline
  & 0 & 1 & 1 & 1 \\
  & 1 & \bar{1} & 1 & 1 \\
  & 1 & 0 & \bar{1} & 1 \\
  & 1 & 0 & 0 & \bar{1} \\
1 & \bar{1} & \bar{1} & 1 & 1 \\
  &   & \vdots &   &
\end{array}
$$

• Out of these variants, $100\bar{1}$ is the minimal representation.

Converting SD to CSD numbers

• The SD number with the minimal number of nonzero digits is called the canonic signed digit (CSD) number.
• The following algorithm can be used to produce CSD code: starting with the LSB, substitute every sequence of two or more 1s with $10\cdots0\bar{1}$ (and similarly every sequence of $\bar{1}$s with $\bar{1}0\cdots01$).
• Replace a $\bar{1}1$ sequence by $0\bar{1}$ and a $1\bar{1}$ sequence by $01$, respectively.

Ex1: Convert the SD number $01111_2$ ($15_{10}$) to a CSD number: $1000\bar{1}$
Ex2: Convert the SD number $011011_2$ ($27_{10}$) to a CSD number: $011011_2 = 1110\bar{1}_{SD} = 100\bar{1}0\bar{1}_{CSD}$
Ex3: Convert the SD number 011100 2 to a CSD number: 010100 CSD
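As a cross-check for Ex1 and Ex2, CSD digits can also be generated arithmetically instead of by string substitution. The sketch below uses the well-known non-adjacent-form recoding of an integer, which is a different technique from the substitution rules above but yields the same minimal representation; the function name `to_csd` is an assumption for this example.

```python
def to_csd(value):
    """Return the CSD digits (MSB first, entries in {-1, 0, 1}) of an integer."""
    digits = []          # collected LSB first
    n = value
    while n != 0:
        if n % 2:        # odd: emit a nonzero digit
            # Choose +1 if n = 1 (mod 4), -1 if n = 3 (mod 4), so that the
            # remaining number is divisible by 4 and the next digit is 0.
            d = 2 - (n % 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits[::-1] or [0]
```

Here `to_csd(15)` gives `[1, 0, 0, 0, -1]` and `to_csd(27)` gives `[1, 0, 0, -1, 0, -1]`, matching Ex1 and Ex2. No two adjacent digits are nonzero, which is the defining property of the canonic form.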


2.4 Encoding and Converting Signed Digit Numbers

If we use 2 bits to encode each of the SD digits $\{\bar{1}, 0, 1\}$, we can select between $4! = 24$ different encodings.

Encoding SD numbers

• $4! = 24$ different encodings are possible for the SD digits $\{\bar{1}, 0, 1\}$; 2 of them have practical relevance:

$$
\begin{array}{c|cc|cc}
 & \multicolumn{2}{c|}{\text{Encoding 1}} & \multicolumn{2}{c}{\text{Encoding 2}} \\
x & x^h & x^l & x^h & x^l \\ \hline
0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 \\
\bar{1} & 1 & 0 & 1 & 1
\end{array}
$$

• Encoding 1 satisfies $x = x^l - x^h$; Encoding 2 can be viewed as 2's complement.

Converting from SD numbers (1)

• Using "Encoding 1", the conversion from SD to 2's complement can be done by applying the equation $x = x^l - x^h$ to the complete bit vectors.

Ex3: $25_{10}$

$$
\begin{array}{rccccccl}
  & 1 & 0 & \bar{1} & 0 & 1 & \bar{1} & x \\ \hline
  & 1 & 0 & 0 & 0 & 1 & 0 & x^l \\
- & 0 & 0 & 1 & 0 & 0 & 1 & x^h \\ \hline
  & 0 & 1 & 1 & 0 & 0 & 1 & \text{2's compl}
\end{array}
$$
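The subtraction $x = x^l - x^h$ can be sketched in Python as follows. The function name `sd_to_int` and the digit-list input are assumptions for this illustration; the two bit vectors are simply collected as integers and subtracted with an ordinary binary subtraction.

```python
def sd_to_int(digits):
    """Convert binary SD digits (MSB first) to an integer via x = x_l - x_h.

    Returns (value, x_l, x_h), where x_l marks the positions holding +1
    and x_h marks the positions holding -1 (Encoding 1).
    """
    x_l = 0
    x_h = 0
    for d in digits:
        x_l = (x_l << 1) | (1 if d == 1 else 0)
        x_h = (x_h << 1) | (1 if d == -1 else 0)
    return x_l - x_h, x_l, x_h
```

For the digits of Ex3, `sd_to_int([1, 0, -1, 0, 1, -1])` yields the value 25 with `x_l = 0b100010` and `x_h = 0b001001`. In hardware the same step costs one conventional subtractor, including its carry chain.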


Converting from SD numbers (2)

• An even simpler conversion to 2's complement is the following algorithm:
– Start at the LSB and move towards the MSB.
– Replace any $\bar{1}$ by 1 and the succeeding 0s by 1s.
– Forward the negative sign until a 1 consumes the negative sign and is replaced by a 0.
– If a second $\bar{1}$ is reached by the negative sign, it is replaced by a 0 and the negative sign continues.
– If a 1 is not reached, then the MSB will be set to 1.

Ex4: $9_{10}$

$$
\begin{array}{ccccccl}
0 & 1 & 1 & 0 & 0 & 1 & c_i \text{ (neg)} \\
1 & \bar{1} & \bar{1} & 0 & 1 & \bar{1} & \text{SD} \\ \hline
0 & 0 & 1 & 0 & 0 & 1 & z_i \text{ (2's compl.)}
\end{array}
$$

3 Fast Addition

3.1 Overview of Adder Architectures

The most frequently used arithmetic operation is addition. It is used either on its own or as part of more complex arithmetic operations like multiplication, division, hyperbolic and trigonometric operations and so on. Thus a really fast add operation is essential in hardware algorithms. The most straightforward implementation is based on cascading full adder blocks, which has the disadvantage of being very slow for large operands because the carry ripples through the whole result. Numerous adder architectures have been developed to cope with this limitation.

Revisiting Adder Architectures

• Ripple carry adders have the ripple delay problem.
• Carry-look-ahead adder
• Carry-skip adder
• Hierarchical adders, Wallace tree adder
• Carry-select adder
• Carry-save adder
• Pipelining of arithmetic operations

3.2 Signed Digit Adder

SD Adder

• Develop a cascadable hardware block for adding SD numbers using the "Encoding 2" representation.

[Figure: chain of identical adder blocks (+), one per digit position $i, \ldots, 2, 1, 0$. Block $k$ takes the encoded digits $x_k^{hl}$ and $y_k^{hl}$ and produces the sum digit $s_k^h s_k^l$; the rightmost block has its incoming signals tied to 0 0.]

• Now no carry chain exists anymore! Thus we have a very fast adder whose delay is independent of the operand length.

4 Multiplication and Division

4.1 Sequential Algorithms for Multiplication, Division and others

The current trend of moving signal processing from software implementations towards hardware-intensive implementations has uncovered a relative lack of understanding of hardware signal processing architectures. Many hardware-efficient algorithms exist, but they are generally not well known due to the dominance of software systems over the past quarter century. A selected set of elementary algorithms is presented in the subsequent sections.

Sequential Algorithms for Multiplication, Division and others

• Binary sequential multiplication is done in the same way as with decimal numbers.
• Binary sequential division, ditto.
• SQRT (square root), GCD (greatest common divisor) and other operations have simple sequential implementation architectures.
• Sequential algorithms are time consuming but profit from a low gate count.

4.2 Sequential Algorithm for Multiplication

Sequential Multiplication Algorithm

• Multiplication $Z = X \cdot Y$, with $X = x_{n-1} x_{n-2} \cdots x_1 x_0$ and $Y = y_{n-1} y_{n-2} \cdots y_1 y_0$, where $x_{n-1}$ and $y_{n-1}$ are the sign bits.
• For both operands being positive (i.e. $x_{n-1} = y_{n-1} = 0$) we obtain:

$$
Z = \left( \sum_{j=0}^{n-2} x_j 2^j \right) \cdot Y
$$

• The maximal number of bits needed for $Z$ is $1 + 2(n-1) = 2n-1$.
• With $P^{(j)}$ being the partial product, we can find the iteration: $P^{(j+1)} = (P^{(j)} + x_j \cdot Y) \cdot 2^{-1}$, with $P^{(0)} = 0$, for $j = 0, 1, \cdots, n-2$
• Unrolling the iteration, the last step is $P^{(n-1)} = (P^{(n-2)} + x_{n-2} \cdot Y) \cdot 2^{-1}$, which evaluates to $P^{(n-1)} = 2^{-(n-1)} \sum_{j=0}^{n-2} x_j 2^j \cdot Y = 2^{-(n-1)} Z$.

Example of Sequential Multiplication Algorithm

• The previous algorithm works for positive inputs as well as for a positive multiplier X and a negative multiplicand Y. Z = X · Y, with X = 3 and Y = −6 (decimal):

Y                          1 0 1 0            −6
X                      ×   0 0 1 1             3
P(0) = 0                   0 0 0 0
x0 = 1 ⇒ add Y         +   1 0 1 0
                           1 0 1 0
shift right                1 1 0 1 0
x1 = 1 ⇒ add Y         +   1 0 1 0
                         1 0 1 1 1 0
shift right                1 0 1 1 1 0
x2 = 0 ⇒ shift only        1 1 0 1 1 1 0      −18

Architecture of Sequential Multiplication Example

• Compact hardware design using one $n$-bit register, one $(2n-1)$-bit register and one $(n-1)$-bit adder.

[Figure: block diagram showing the register Reg (s, Y), an adder (+), and the shift register ShiftReg (s, P, X).]

Sequential Multiplication Algorithm for negative Operands

• Multiplication $Z = X \cdot Y$ in 2's complement.
• For both operands being negative (i.e. $x_{n-1} = y_{n-1} = 1$), the previous algorithm does not work anymore.
• A 2's complement number can be written as follows: $X = -x_{n-1} \cdot 2^{n-1} + \tilde{X}$, where $\tilde{X} = \sum_{j=0}^{n-2} x_j \cdot 2^j$
• If we ignored that $X$ is negative, we would obtain: $\tilde{Z} = \tilde{X} \cdot Y = (X + x_{n-1} \cdot 2^{n-1}) \cdot Y = X \cdot Y + Y \cdot x_{n-1} \cdot 2^{n-1}$
• The term $X \cdot Y$ is what is actually needed for the result, thus we get: $X \cdot Y = \tilde{Z} - Y \cdot x_{n-1} \cdot 2^{n-1}$
• This means we must subtract $Y$, weighted by $2^{n-1}$, from $\tilde{Z}$ in case $x_{n-1} = 1$.
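The add-and-shift iteration together with the final correction step can be modelled in Python. This is a behavioural sketch, not a register-accurate description of the architecture: the name `seq_mul` is an assumption, and the partial product is kept as an integer scaled by $2^{n-1}$ so that every right shift is exact.

```python
def seq_mul(x, y, n):
    """Multiply two n-bit two's complement integers by add-and-shift.

    The partial product p is stored scaled by 2**(n-1), which mimics the
    (2n-1)-bit shift register and keeps every right shift exact.
    """
    assert -(1 << (n - 1)) <= x < (1 << (n - 1))
    assert -(1 << (n - 1)) <= y < (1 << (n - 1))
    xbits = x & ((1 << n) - 1)        # bit pattern x_{n-1} ... x_0 of the multiplier
    p = 0
    for j in range(n - 1):            # magnitude bits x_0 ... x_{n-2}
        if (xbits >> j) & 1:
            p += y << (n - 1)         # add Y at the top of the register
        p >>= 1                       # arithmetic shift right (sign is preserved)
    if (xbits >> (n - 1)) & 1:        # x_{n-1} = 1: correction step
        p -= y << (n - 1)             # subtract Y weighted by 2**(n-1)
    return p
```

With `n = 4`, `seq_mul(3, -6, 4)` returns -18 and `seq_mul(-3, -6, 4)` returns 18. Python's `>>` on negative integers is an arithmetic shift, which plays the role of the sign extension in the shift register.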


Example of Multiplication with Two Negative Operands

• For two negative operands the adapted version of the previously developed algorithm is needed.

Ex 6: Z = X · Y, with X = −3 and Y = −6 (decimal)

Y                              1 0 1 0            −6
X                          ×   1 1 0 1            −3
x0 = 1 ⇒ add Y             +   1 0 1 0
shift right                    1 1 0 1 0
x1 = 0 ⇒ shift only            1 1 1 0 1 0
x2 = 1 ⇒ add Y             +   1 0 1 0
                               1 0 0 0 1 0
shift right                    1 1 0 0 0 1 0
x3 = 1 ⇒ correct (−Y)      +   0 1 1 0
                               0 0 1 0 0 1 0      18

4.3 Sequential Algorithm for Division

Sequential Restoring Division Algorithm

• Division $Q = X/D$ is the most complex of the four basic arithmetic operations.
• Given dividend $X$, divisor $D$, quotient $Q$ and remainder $R$ we get: $X = Q \cdot D + R$ with $R < D$
• We assume that $X$, $D$ and $Q$ are fractions, thus $X < D$, and $D \ne 0$.
• Assuming positive operands, we have fractional variables like: $Q = 0.q_1 \cdots q_m$ where $m = n-1$
• We perform the division as a sequence of subtractions and shifts: $r_i = 2r_{i-1} - q_i \cdot D$ with $r_0 = X$
• If the shifted remainder $2r_{i-1}$ is larger than or equal to the divisor $D$ then $q_i = 1$, else $q_i = 0$.

Example of Sequential Restoring Division

• The previous algorithm works for positive inputs, for fractions as well as for integers, as can be shown.

Q = X/D, with X = 5/8 (decimal) = 0.101 (binary) and D = 3/4 (decimal) = 0.11 (binary)

r0 = X               0 0 . 1 0 1
2r0                  0 1 . 0 1 0      2r0 ≥ D, set q1 = 1
add −D             + 1 1 . 0 1 0
r1 = 2r0 − D         0 0 . 1 0 0
2r1                  0 1 . 0 0 0      2r1 ≥ D, set q2 = 1
add −D             + 1 1 . 0 1 0
r2 = 2r1 − D         0 0 . 0 1 0
2r2                  0 0 . 1 0 0      2r2 < D, set q3 = 0
r3 = 2r2             0 0 . 1 0 0
2r3                  0 1 . 0 0 0      2r3 ≥ D, set q4 = 1

• The result is Q = 0.1101; continuing the calculation for higher precision we would obtain Q = 0.11010.
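The restoring iteration can be traced exactly with rational arithmetic. The following Python sketch is an illustration only; the function name `restoring_div` and the use of `fractions.Fraction` for the fractional operands are assumptions made for this example.

```python
from fractions import Fraction


def restoring_div(x, d, m):
    """Restoring division of fractions with 0 <= x < d.

    Returns the quotient bits [q_1, ..., q_m] and the scaled remainder r_m,
    so that x = Q * d + r_m * 2**(-m) with Q = 0.q_1 ... q_m.
    """
    r = Fraction(x)
    d = Fraction(d)
    bits = []
    for _ in range(m):
        trial = 2 * r - d          # shift, then trial subtraction
        if trial >= 0:
            bits.append(1)
            r = trial
        else:
            bits.append(0)
            r = 2 * r              # restore: keep the shifted old remainder
    return bits, r
```

For the example, `restoring_div(Fraction(5, 8), Fraction(3, 4), 5)` returns the bits `[1, 1, 0, 1, 0]`, i.e. Q = 0.11010.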


Robertson Diagram for Restoring Division

• With $r_{i-1} < D$, the quotient bit $q_i$ should be selected such that $r_i < D$ and $r_i \ge 0$, resulting in the condition $r_i = 2 \cdot r_{i-1} - D$

[Figure: Robertson diagram plotting $r_i$ against $2r_{i-1}$, with $D$ marked on the vertical axis, $D$ and $2D$ marked on the horizontal axis, and the regions $q_i = 0$ and $q_i = 1$.]

• In the restoring division algorithm, the old remainder $2r_{i-1}$ is restored if the subtraction $2r_{i-1} - D$ was negative. It is then shifted, and after a subsequent subtraction we obtain $4r_{i-1} - D$.

Sequential Non-Restoring Division Algorithm

• The corrections at negative remainders are postponed to later steps, thus accepting $2r_{i-1} - D < 0$.
• Shifting it and adding $D$ we get a new remainder $2(2r_{i-1} - D) + D = 4r_{i-1} - D$.
• These steps are repeated as long as the remainder remains negative.
• The quotient digit is determined as follows:

$$
q_i = \begin{cases} 1 & \text{if } 2r_{i-1} \ge 0 \\ \bar{1} & \text{if } 2r_{i-1} < 0 \end{cases}
$$

• With the above decision, the remainder is computed with $r_i = 2r_{i-1} - q_i \cdot D$
• The correction of a "wrong" selection of the quotient digit (remainder gets negative) is done by using $\bar{1}$ instead of 0. Consider $q_i q_{i-1} = 10$ (too large), which would be corrected by $q_i q_{i-1} = 1\bar{1}$.
• If the final remainder and the dividend have opposite signs, a final correction is needed: $Q_{cor} = Q - 2^{-(n-1)}$

Example of Sequential Non-Restoring Division

Q = X/D, with X = −3/8 (decimal) and D = 3/4 (decimal) = 0.11 (binary)

r0 = X            1 . 1 0 1
2r0             1 1 . 0 1 0      2r0 < 0, set q1 = $\bar{1}$
add D         + 0 0 . 1 1 0
r1              0 0 . 0 0 0
2r1             0 0 . 0 0 0      2r1 ≥ 0, set q2 = 1
add −D        + 1 1 . 0 1 0
r2              1 1 . 0 1 0
2r2             1 0 . 1 0 0      2r2 < 0, set q3 = $\bar{1}$
add D         + 0 0 . 1 1 0
r3              1 1 . 0 1 0
2r3             1 0 . 1 0 0      2r3 < 0, correct or set q4 = $\bar{1}$

• The result before correction is $Q = 0.\bar{1}1\bar{1}\bar{1} = 0.\bar{1}001$; after the correction we get $Q = 0.\bar{1}000$.
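The digit selection of the non-restoring algorithm is easy to replay in software. The Python sketch below is a behavioural illustration with exact rational arithmetic; the names `nonrestoring_div` and `sd_fraction_value` are assumptions, and the final correction step is deliberately left out so that the raw SD quotient is visible.

```python
from fractions import Fraction


def nonrestoring_div(x, d, m):
    """Non-restoring division: returns the SD quotient digits and the last remainder.

    Each digit q_i is +1 if the shifted remainder is non-negative and -1
    otherwise; the remainder update is r_i = 2 * r_{i-1} - q_i * d.
    """
    r = Fraction(x)
    d = Fraction(d)
    digits = []
    for _ in range(m):
        q = 1 if 2 * r >= 0 else -1
        digits.append(q)
        r = 2 * r - q * d
    return digits, r


def sd_fraction_value(digits):
    """Value of the SD fraction 0.q_1 q_2 ... q_m."""
    return sum(Fraction(q, 2 ** (i + 1)) for i, q in enumerate(digits))
```

For X = -3/8 and D = 3/4 the call `nonrestoring_div(Fraction(-3, 8), Fraction(3, 4), 4)` produces the digits `[-1, 1, -1, -1]`, whose value is -7/16; subtracting $2^{-4}$ as in the correction step gives -1/2.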


Robertson Diagram for Non-Restoring Division

• The non-restoring division can be represented graphically.
• Horizontal lines represent $\pm D$.
• Diagonal lines represent multiplication by 2.
• Vertical lines represent function selection ($1$ vs. $\bar{1}$).
• Example: $X = 0.5_{10}$, $D = 0.75_{10}$, thus $Q = 0.11\bar{1} = 0.101_2$

[Figure: Robertson diagram plotting $r_i$ against $2r_{i-1}$ with the regions $q_i = \bar{1}$ and $q_i = 1$ and the marks $D = 0.110$, $-D = 11.01$ and $2D = 01.10$. Trace of the example: $r_0 = X = 0.100_2$; $2r_0 = 01.00_2 \ge 0$, $q_1 = 1$; $r_1 = 2r_0 - D = 0.010_2$; $2r_1 = 0.100_2 \ge 0$, $q_2 = 1$; $r_2 = 11.110_2$; $2r_2 = 11.100_2 < 0$, $q_3 = \bar{1}$.]

Converting SD Quotient to Two's Complement

• The non-restoring division generates a quotient in SD representation, which is incompatible with 2's complement; thus a conversion is needed for further processing.
• SD conversion algorithms start with the LSB, thus needing the complete quotient before starting.
• A new on-the-fly algorithm is needed to reduce the total execution time.
• It can be shown that the following equations and results are valid: with $p_i = \frac{1}{2}(q_i + 1)$, the quotient in 2's complement is $(1-p_1).p_2 p_3 \cdots p_m 1$

Q = X/D, with X = −3/8 (decimal) and D = 3/4 (decimal), resulting in the non-corrected value $Q = 0.\bar{1}1\bar{1}\bar{1}$:

step 1: replace all $\bar{1}$ by 0: Q = 0.0100
step 2: shift the given number left by 1 bit position: Q = 0.100
step 3: complement the most significant bit: Q = 1.100
step 4: shift a 1 into the LSB bit position: Q = 1.1001
correct: subtract 0.0001 from the non-corrected result: Q = 1.1000 = −0.5 (decimal)
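The conversion rule $(1-p_1).p_2 p_3 \cdots p_m 1$ can be written down directly. The Python sketch below assumes that all quotient digits are +1 or -1, as produced by non-restoring division; the function names are invented for this illustration.

```python
from fractions import Fraction


def sd_quotient_to_twos(q):
    """Convert SD quotient digits q_1..q_m (each +1 or -1) to two's complement bits.

    Returns the bit list [sign, f_1, ..., f_m] representing sign.f_1 ... f_m.
    """
    p = [(qi + 1) // 2 for qi in q]        # p_i = (q_i + 1) / 2, i.e. -1 -> 0, 1 -> 1
    return [1 - p[0]] + p[1:] + [1]        # (1 - p_1) . p_2 ... p_m 1


def twos_fraction_value(bits):
    """Value of the two's complement fraction sign.f_1 f_2 ..."""
    value = Fraction(-bits[0])
    for i, b in enumerate(bits[1:], start=1):
        value += Fraction(b, 2 ** i)
    return value
```

For the digits $\bar{1}1\bar{1}\bar{1}$ the function returns `[1, 1, 0, 0, 1]`, i.e. 1.1001, which is -7/16 as in the step-by-step example. Each output bit depends only on the corresponding quotient digit, so the conversion can proceed while the division is still running.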


4.4 Square Root Extraction

Simple SQRT Algorithm

• Square root extraction is similar to restoring division.
• Assume $Q = (0.q_1 q_2 \cdots q_m)$ is a fraction that denotes the square root of $X$.
• $Q_i$ is the partially developed root at step $i$ with the remainder $r_i$: $Q_i = \sum_{k=1}^{i} q_k 2^{-k}$
• leading to $r_i 2^{-i} = X - Q_i^2$
• With some calculations we find $r_i = 2r_{i-1} - q_i \cdot (2Q_{i-1} + q_i 2^{-i})$
• This can be viewed as a division with the changing divisor $D = (2Q_{i-1} + q_i 2^{-i})$; see the sequential restoring division in Section 4.3.
• For $m \to \infty$ we get $X - Q^2 = \lim_{m \to \infty} 2^{-m} r_m = 0$

Example of Simple SQRT Algorithm

Q² = X, with X = 10/16 (decimal)

r0 = X                  0 0 . 1 0 1 0
2r0                     0 1 . 0 1 0 0
−(0 + 2^−1)           − 0 0 . 1 0 0 0
r1                      0 0 . 1 1 0 0      set q1 = 1, Q1 = 0.1
2r1                     0 1 . 1 0 0 0
−(2Q1 + 2^−2)         − 0 1 . 0 1 0 0
r2                      0 0 . 0 1 0 0      set q2 = 1, Q2 = 0.11
2r2                     0 0 . 1 0 0 0
−(2Q2 + 2^−3)         − 0 1 . 1 0 1 0
r3                      1 0 . 1 1 1 0      negative, set q3 = 0, Q3 = 0.110
r3 = 2r2                0 0 . 1 0 0 0
2r3                     0 1 . 0 0 0 0
−(2Q3 + 2^−4)         − 0 1 . 1 0 0 1
r4                      1 1 . 0 1 1 1      negative, set q4 = 0, Q4 = 0.1100
r4 = 2r3                0 1 . 0 0 0 0

• The result is Q = 0.1100 and the final remainder is 2^−4 · r4 = 16/256 = X − Q² = (160 − 144)/256
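The square root iteration follows the same trial-subtract-and-restore pattern as the restoring division, only with a divisor that changes in every step. A minimal Python sketch with exact fractions is given below; the function name `restoring_sqrt` is an assumption made for this example.

```python
from fractions import Fraction


def restoring_sqrt(x, m):
    """Digit-by-digit square root of a fraction 0 <= x < 1.

    Returns (bits, Q, r) with Q = 0.q_1 ... q_m and the scaled remainder r,
    so that x - Q**2 == r * 2**(-m).
    """
    r = Fraction(x)
    root = Fraction(0)
    bits = []
    for i in range(1, m + 1):
        step = Fraction(1, 2 ** i)
        trial = 2 * r - (2 * root + step)   # try q_i = 1
        if trial >= 0:
            bits.append(1)
            r = trial
            root += step
        else:
            bits.append(0)
            r = 2 * r                        # restore: q_i = 0
    return bits, root, r
```

For X = 10/16 and four steps, `restoring_sqrt(Fraction(10, 16), 4)` returns the bits `[1, 1, 0, 0]`, the root 3/4 and the scaled remainder 1, which corresponds to the final remainder 1/16 = 16/256 of the example.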


4.5 Division Through Multiplication

In the previously described division algorithms the number of iteration steps is linear in the number of result bits $n$. In the algorithms described next, the number of steps is proportional to $\log_2 n$. However, these algorithms rely on fast parallel multipliers.

Division by Convergence

• Numerator and denominator are both multiplied by the same factors:

$$
Q = \frac{N}{D} = \frac{N \cdot R_0 \cdot R_1 \cdots R_{m-1}}{D \cdot R_0 \cdot R_1 \cdots R_{m-1}} \to \frac{Q}{1}
$$

• Only the quotient needs to be calculated; the algorithm is suitable for floating-point computations.
• The essential step is to select the factors $R_i$ correctly.

prep: Let $D$ be a normalized binary fraction $D = 0.1xxxx$.
• Therefore $1/2 \le D < 1$ and $D = 1 - y$ where $y \le 1/2$.

iter 1: We select $R_0 = 1 + y$, then $D_1 = D \cdot R_0 = (1-y) \cdot (1+y) = 1 - y^2$
• Since $y^2 \le 1/4$, $D_1$ satisfies $D_1 \ge 3/4$ and is therefore closer to 1 than $D$; in binary $D_1 = 0.11xxxx$.

iter 2: We select $R_1 = 1 + y^2$ and obtain $D_2 = D_1 \cdot R_1 = (1-y^2) \cdot (1+y^2) = 1 - y^4$
• Now $y^4 \le 1/16$, $D_2$ satisfies $D_2 \ge 15/16$ and is therefore closer to 1 than $D_1$; in binary $D_2 = 0.1111xxxx$.

It can easily be shown that the denominator converges to 1.

Division by Convergence: Proof

• It can easily be shown that the denominator converges to 1.

$$
D \cdot R_0 \cdot R_1 \cdots R_{m-1} = (1-y)\left[(1+y)(1+y^2)(1+y^4)\cdots\right] = (1+y)\left[(1-y)(1+y^2)(1+y^4)\cdots\right]
$$

• The expression in brackets is the expansion of $\frac{1}{1+y}$ for $0 \le y \le 1/2$, so

$$
\lim_{i \to \infty} D_i = (1+y) \cdot \frac{1}{1+y} = 1
$$

step 1: $D_{i+1} = D_i \cdot R_i$ and $N_{i+1} = N_i \cdot R_i$
step 2: Two's complement operation $R_{i+1} = 2 - D_{i+1}$, where $D_i = 1 - y^{2^i}$

Example of Division by Convergence

Ex11: Q = N/D, with N = 0.101101 (binary) = 0.703125 (decimal) and D = 0.1101 (binary) = 0.8125 (decimal)

D0                  0.1101
R0 = 2 − D0         1.0011
N1 = N · R0         0.1101010111
D1 = D0 · R0        0.11110111
R1 = 2 − D1         1.00001001
N2 = N1 · R1        0.110111010100001...
D2 = D1 · R1        0.111111111010111...
R2 = 2 − D2         1.000000000101000...
N3 = N2 · R2        0.110111011000100...
D3 = D2 · R2        0.111111111111111...

• To speed up the algorithm further, look-up tables or ROMs are used for the initial value of R0.
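The convergence scheme can be replayed with exact fractions to see how quickly the denominator approaches 1. The Python sketch below is illustrative only; the function name `div_by_convergence` is an assumption, and $R_i = 2 - D_i$ is used in every step as in the example, without a look-up table for $R_0$.

```python
from fractions import Fraction


def div_by_convergence(n, d, steps):
    """Division by convergence for a normalized divisor 1/2 <= d < 1.

    Numerator and denominator are multiplied by R_i = 2 - D_i in every step.
    Returns (N_steps, D_steps); N_steps approximates n / d as D_steps -> 1.
    """
    n = Fraction(n)
    d = Fraction(d)
    assert Fraction(1, 2) <= d < 1
    for _ in range(steps):
        r = 2 - d        # two's complement of the current denominator
        n *= r
        d *= r
    return n, d
```

With N = 45/64 (0.703125) and D = 13/16 (0.8125), one step gives $D_1 = 247/256$, which is 0.11110111 in binary as in the table. After three steps the denominator equals $1 - (3/16)^8$, so the numerator already agrees with N/D to roughly 19 bits.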

4.6 Division by Newton-Raphson

Based on the Newton-Raphson algorithm, a division by reciprocation can be performed. In this case the reciprocal is first calculated using the Newton-Raphson iteration, followed by a multiplication. The Newton-Raphson algorithm is a method for finding the zero of a given function $f(x)$, where the zero is the solution of $f(x) = 0$.

Division by Reciprocation using Newton-Raphson Algorithm

• First the reciprocal $1/D$ is calculated using the Newton-Raphson iteration algorithm.
• In a second step a multiplication is performed to calculate $Q = N/D$.
• Newton-Raphson in general is used to calculate the zeros of a function $f(x)$, i.e. the values $x$ that solve $f(x) = 0$.

[Figure: plot of a function $f(x)$ over $x$ with its zero crossing marked on the $x$ axis.]

Newton-Raphson Algorithm for Reciprocation

• Let $x_0$ be the first approximation and $x_i$ the estimate at the $i$th step, then $x_{i+1}$ is:

$$
x_{i+1} = x_i + \frac{f(x_i)}{-f'(x_i)}
$$

• Substituting $f(x)$ by our reciprocal-finding function $f(x) = 1/x - D$, with $f'(x) = -1/x^2$, we find:

$$
x_{i+1} = x_i (2 - D x_i)
$$

[Figure: plot of $f(x) = \frac{1}{x} - D$ with the tangent at $x_i$ (slope $dy/dx$) intersecting the $x$ axis at $x_{i+1}$.]

Iterations: Newton-Raphson Algorithm for Reciprocation<br />

• Example: $D = 0.13$, with initial value $x_0 = 4.0$.<br />
• After iteration step 3 we obtain $x_3 = 7.67$.<br />
• The convergence is quadratic.<br />
• Again, initial values can be generated by ROMs or other methods to reduce the number of iterations.<br />
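The example can be reproduced with a short Python sketch of the recurrence $x_{i+1} = x_i (2 - D x_i)$. The function name `reciprocal_newton_raphson` is hypothetical and floating-point arithmetic stands in for the hardware datapath; the iteration converges when the initial value satisfies $0 < x_0 < 2/D$.<br />

```python
def reciprocal_newton_raphson(d, x0, steps):
    """Return the list of estimates [x_0, x_1, ..., x_steps] for 1/d."""
    estimates = [x0]
    for _ in range(steps):
        x = estimates[-1]
        # x_{i+1} = x_i * (2 - D * x_i): two multiplications, one subtraction
        estimates.append(x * (2.0 - d * x))
    return estimates


for i, x in enumerate(reciprocal_newton_raphson(0.13, 4.0, 3)):
    print(f"x_{i} = {x:.2f}")
```

Rounded to two decimals, the estimates are 4.00, 5.92, 7.28 and 7.67, approaching $1/0.13 \approx 7.69$.<br />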

[Figure: Newton-Raphson iterations $x_{i+1} = x_i \cdot (2 - D \cdot x_i)$ on $f(x) = 1/x - D$ for $D = 0.13$: $x_0 = 4$, $x_1 = 5.92$, $x_2 = 5.92 \cdot (2 - 0.13 \cdot 5.92) = 7.28$; the zero lies at $x = 7.69$.]<br />


5 Elementary Functions<br />

5.1 Basics<br />

In a seminar at MicroLab a new architecture has been found for reciprocal calculation based on the Newton-Raphson iteration and a second-order polynomial for initial value calculation. The architecture is efficient, as both steps, initial value calculation and Newton-Raphson iteration, can be executed on the identical hardware unit [HSGJ10].<br />

Elementary Functions<br />

• Elementary functions can be realized in hardware by different methods. Some elementary functions:<br />
$e^x$, $\ln(x)$, $\sin(x)$, $\cos(x)$, $\tanh(x)$, $\arcsin(x)$, etc.<br />
• ROM look-up tables might be used, but are large: for $n \ge 20$ the memory size is $\ge 2.6$ MB.<br />
• The CORDIC (coordinate rotation digital computer) algorithm can be used.<br />
• Taylor series expansion can be used, but sometimes shows bad convergence; e.g.:<br />
$e^x = \sum_{i=0}^{\infty} \frac{x^i}{i!}$<br />


• Polynomial approximations can be used, etc.<br />
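The remark on the convergence of the Taylor series can be made concrete with a small experiment. The sketch below counts how many terms of the series for $e^x$ are needed to reach a given relative accuracy; the function name `exp_taylor_terms`, the tolerance and the sample arguments are assumptions chosen for illustration.<br />

```python
import math


def exp_taylor_terms(x, rel_tol=1e-6, max_terms=200):
    """Count the Taylor terms needed until the partial sum of e**x
    is within rel_tol (relative) of the reference value math.exp(x)."""
    exact = math.exp(x)
    total = 0.0
    term = 1.0  # x**0 / 0!
    for n in range(max_terms):
        total += term
        if abs(total - exact) <= rel_tol * exact:
            return n + 1
        term *= x / (n + 1)  # next term x**(n+1) / (n+1)!
    return max_terms


print(exp_taylor_terms(0.5), exp_taylor_terms(8.0))
```

Small arguments need only a handful of terms, while larger arguments need several times as many, which is one reason why the argument is usually reduced to a small range first.<br />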


For different elementary functions there exist pre-calculated<br />

constants for the polynomial approximation. Tables with such<br />

constants can be found in [?].<br />


Polynomial Approximation<br />

• Approximation based on two degree-5 polynomials. Example: $e^x$<br />
• We can express the elementary function as:<br />
$e^x = 2^{x \log_2 e}$<br />
• Partitioning the exponent $x \log_2 e = I + f$ into its integer and fraction parts gives:<br />
$e^x = 2^I \cdot 2^f$<br />
• Implementing $2^I$ is straightforward; evaluating $2^f$ can be done by a rational approximation, such as the ratio of two degree-5 polynomials:<br />
$2^f = \frac{((((a_5 f + a_4) f + a_3) f + a_2) f + a_1) f + a_0}{((((b_5 f + b_4) f + b_3) f + b_2) f + b_1) f + 1}$<br />
• $a_i$ and $b_i$ are known constants that depend on the target elementary function.<br />
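The structure of this evaluation (range reduction to $I$ and $f$, then two Horner-form polynomials) can be sketched as follows. Note the assumption: the tabulated constants $a_i$, $b_i$ are not given here, so the sketch uses placeholder coefficients, namely the truncated Taylor series of $2^f$ for the numerator and a constant denominator of 1. They only illustrate the evaluation scheme and are less accurate than properly optimized constants. All function and variable names are hypothetical.<br />

```python
import math

LN2 = math.log(2.0)

# Placeholder coefficients (assumption, NOT the tabulated constants):
# numerator a_0..a_5 = Taylor coefficients of 2**f, denominator = 1.
A = [LN2 ** k / math.factorial(k) for k in range(6)]
B = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # b_0 = 1, b_1..b_5


def horner(coeffs, f):
    """Evaluate c_0 + c_1*f + ... + c_5*f**5 in nested (Horner) form."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * f + c
    return acc


def exp_via_pow2(x):
    """Approximate e**x as 2**I * 2**f with x*log2(e) = I + f, 0 <= f < 1."""
    t = x * math.log2(math.e)
    i = math.floor(t)            # integer part I
    f = t - i                    # fraction part f
    two_f = horner(A, f) / horner(B, f)
    return math.ldexp(two_f, i)  # multiply by 2**I (an exponent shift)


print(exp_via_pow2(1.0), math.exp(1.0))
```

The nested form needs five multiplications and five additions per polynomial, and the factor $2^I$ reduces to an adjustment of the exponent.<br />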

5.2 Additive Normalization<br />

There are alternative methods to calculate such elementary functions which are better suited to hardware implementation. Many of the known algorithms for evaluating elementary functions are based on the division-by-convergence algorithm discussed earlier.<br />


Additive Normalization: Exponential Algorithm<br />

• The algorithm is based on the "division-by-convergence" idea: when one formula is forced to a constant, the other yields the result (35).<br />
• To evaluate $y = e^{x_0}$ for a fractional argument $x_0$, we use:<br />
$x_{i+1} = x_i - \ln(b_i)$<br />
$y_{i+1} = y_i \cdot b_i$<br />
• The $b_i$ are selected in such a way that the sequence of $x_i$ approaches 0, i.e. $x_m \to 0$.<br />
• Note that the product $y_i \cdot e^{x_i}$ does not change from step to step, so for $m \to \infty$, where $x_m = 0$, we have:<br />
$y_{i+1} \cdot e^{x_{i+1}} = y_i \cdot b_i \cdot e^{x_i - \ln(b_i)} = y_i \cdot e^{x_i}$<br />
$y_m \cdot e^{x_m} = y_0 \cdot e^{x_0}$<br />
$y_m = y_0 \cdot e^{x_0}$<br />
• The similarity to "division-by-convergence" is now apparent: instead of keeping $N_i / D_i$ constant, we now keep $y_i \cdot e^{x_i}$ constant.<br />


Iteration in Exponential Algorithm<br />

• To simplify the multiplication, the $b_i$ are given the form:<br />
$b_i = 1 + s_i \cdot 2^{-i}$, where $s_i \in \{-1, 0, 1\}$<br />
• The terms $\ln(1 \pm 2^{-i})$ have to be pre-calculated and stored in a look-up table.<br />
• Substituting into the equations from the previous slide we get:<br />
$x_{i+1} = x_i - \ln(1 + s_i \cdot 2^{-i})$<br />
$y_{i+1} = y_i \cdot (1 + s_i \cdot 2^{-i})$<br />
• To calculate the exponential function $e^{x_0}$ we have to find the vector $s = \{s_0, s_1, \dots, s_{m-1}\}$.<br />
• Restricting $x_0$ to positive fractions we get a simpler scheme for selecting $s_i \in \{0, 1\}$.<br />
• In step $(i+1)$ we compute $D = x_i - \ln(1 + 2^{-i})$ and set:<br />
if $D \ge 0$ then $s_i = 1$, $x_{i+1} = D$<br />
else $s_i = 0$, $x_{i+1} = x_i$<br />
• The convergence is linear: with $n$ steps we get $n$ bits.<br />
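The selection scheme for positive fractions can be sketched in Python as shown below. The function name `exp_additive_normalization` is hypothetical, floating point replaces the fixed-point datapath, and the logarithm table is computed on the fly where hardware would read a ROM.<br />

```python
import math


def exp_additive_normalization(x0, m):
    """Approximate e**x0 for a positive fraction x0 in m iteration steps.

    Returns (y_m, x_m, s), where s is the list of selected digits s_i.
    """
    # Stand-in for the ROM holding ln(1 + 2^-i), i = 0 .. m-1.
    ln_table = [math.log(1.0 + 2.0 ** -i) for i in range(m)]
    x, y = x0, 1.0
    s = []
    for i in range(m):
        d = x - ln_table[i]     # D = x_i - ln(1 + 2^-i)
        if d >= 0.0:
            s.append(1)         # s_i = 1
            x = d               # x_{i+1} = D
            y += y * 2.0 ** -i  # y_{i+1} = y_i * (1 + 2^-i): shift and add
        else:
            s.append(0)         # s_i = 0: x and y stay unchanged
    return y, x, s


y6, x6, digits = exp_additive_normalization(0.375, 6)
print(digits, y6, x6)
```

Because every factor has the form $1 + 2^{-i}$, the multiplication in each step reduces to one shift and one addition. For $x_0 = 0.375$ and six steps the sketch selects the digits 0, 0, 1, 1, 0, 1 and yields $y_6 \approx 1.450$, while $y_6 \cdot e^{x_6}$ still equals $e^{0.375}$.<br />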


Example: Exponential Algorithm<br />

i   1+2^-i (binary)   ln(1+2^-i)   1-2^-i (binary)   ln(1-2^-i)<br />
0   10.0000000000     0.693147     0                 -<br />
1   1.1000000000      0.405465     0.1000000000      -0.693147<br />
2   1.0100000000      0.223144     0.1100000000      -0.287682<br />
3   1.0010000000      0.117783     0.1110000000      -0.133531<br />
4   1.0001000000      0.060625     0.1111000000      -0.064539<br />
5   1.0000100000      0.030772     0.1111100000      -0.031749<br />

Calculate $e^{0.375}$ with 6-bit precision:<br />

• Initial values: $i = 0$, $x_0 = 0.375$, $y_0 = 1$; result: $y_6 = 1.450$<br />

iter 1: $D = x_0 - \ln(1 + 2^{-0}) = -0.318$, $s_0 = 0$<br />
$b_0 = 1 + s_0 \cdot 2^{-0} = 1.000$<br />
$x_1 = x_0 = 0.375$, $y_1 = y_0 \cdot b_0 = 1.000$<br />

iter 2: $D = x_1 - \ln(1 + 2^{-1}) = -0.030$, $s_1 = 0$<br />
$b_1 = 1 + s_1 \cdot 2^{-1} = 1.000$<br />
$x_2 = x_1 = 0.375$, $y_2 = y_1 \cdot b_1 = 1.000$<br />

iter 3: $D = x_2 - \ln(1 + 2^{-2}) = +0.152$, $s_2 = 1$<br />
$b_2 = 1 + s_2 \cdot 2^{-2} = 1.250$<br />
$x_3 = D = 0.152$, $y_3 = y_2 \cdot b_2 = 1.250$<br />

iter 4: $D = x_3 - \ln(1 + 2^{-3}) = +0.034$, $s_3 = 1$<br />
$b_3 = 1 + s_3 \cdot 2^{-3} = 1.125$<br />
$x_4 = D = 0.034$, $y_4 = y_3 \cdot b_3 = 1.406$<br />

iter 5: $D = x_4 - \ln(1 + 2^{-4}) = -0.027$, $s_4 = 0$<br />
$b_4 = 1 + s_4 \cdot 2^{-4} = 1.000$<br />
$x_5 = x_4 = 0.034$, $y_5 = y_4 \cdot b_4 = 1.406$<br />


iter 6: $D = x_5 - \ln(1 + 2^{-5}) = +0.003$, $s_5 = 1$<br />
$b_5 = 1 + s_5 \cdot 2^{-5} = 1.031$<br />
$x_6 = D = 0.003$, $y_6 = y_5 \cdot b_5 = 1.450$<br />


5.3 Multiplicative Normalization<br />

Multiplicative Normalization: Exponential Algorithm<br />

• To calculate $e^x$ we performed a continued summation of the terms $\ln(1 + s_i \cdot 2^{-i})$.<br />
• This procedure is called additive normalization.<br />
• In a similar way we may define a multiplicative normalization, where $x_i$ is forced to 1 by continued multiplication with precalculated factors. We now approximate $y = \ln x$.<br />
• Selecting $b_i$ such that $x_{i+1}$ approaches 1, we thus have:<br />
$x_{i+1} = x_i \cdot b_i$<br />
$y_{i+1} = y_i - g(b_i)$<br />
$x_{i+1} = x_0 \prod_{l=0}^{i} b_l \to 1$<br />
• If we select $g(b_l) = \ln(b_l)$, then, since $\prod_l b_l \to 1/x_0$, we finally have:<br />
$y_m = y_0 - \ln\frac{1}{x_0} = y_0 + \ln x_0$<br />
• Similar approaches can be used for the other elementary functions, such as trigonometric, inverse trigonometric, hyperbolic, etc.<br />
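A minimal Python sketch of the multiplicative normalization for $\ln x_0$ follows. The selection rule is an assumption made for this example, since the text does not specify one: the factor $b_i = 1 + 2^{-i}$ is applied whenever it does not push $x_i$ above 1, otherwise $b_i = 1$. The sketch further assumes $0.5 \le x_0 \le 1$ and starts with $y_0 = 0$; the function name is hypothetical.<br />

```python
import math


def ln_multiplicative_normalization(x0, m=30):
    """Approximate ln(x0) for 0.5 <= x0 <= 1 in m iteration steps.

    Returns (y_m, x_m); x_m is driven towards 1 and y_m towards ln(x0).
    """
    # Stand-in for the ROM holding ln(1 + 2^-i), i = 0 .. m-1.
    ln_table = [math.log(1.0 + 2.0 ** -i) for i in range(m)]
    x, y = x0, 0.0
    for i in range(m):
        candidate = x + x * 2.0 ** -i  # x_i * (1 + 2^-i): shift and add
        if candidate <= 1.0:           # assumed rule: never overshoot 1
            x = candidate              # x_{i+1} = x_i * b_i
            y -= ln_table[i]           # y_{i+1} = y_i - ln(b_i)
    return y, x


y, x = ln_multiplicative_normalization(0.75)
print(y, math.log(0.75), x)
```

As in the additive case, one table look-up, one shift and one addition per step suffice, and each step contributes roughly one bit of the result.<br />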

References<br />

[HSGJ10] Andreas Habegger, Andreas Stahel, Josef Goette, and Marcel Jacomet. An efficient hardware implementation for a reciprocal unit. Submitted to: The 5th IEEE International Symposium on Electronic Design, Test and Applications, Ho Chi Minh City, January 13-15, 2010.<br />

[Kor02] Israel Koren. Computer Arithmetic Algorithms. A K Peters, Natick, Massachusetts, 2nd edition, 2002.<br />
