Mse: Hardware Algorithms Computer Arithmetic - microLab
<strong>Mse</strong>: <strong>Hardware</strong> <strong>Algorithms</strong><br />
<strong>Computer</strong> <strong>Arithmetic</strong><br />
Marcel Jacomet<br />
Bern University of Applied Sciences<br />
Bfh-Ti HuCE-<strong>microLab</strong>, Biel/Bienne<br />
Marcel.Jacomet@bfh.ch<br />
October 29, 2013<br />
Contents<br />
1 Introduction<br />
2 Signed Digit Numbers<br />
2.1 General SD Numbers<br />
2.2 Binary SD Numbers<br />
2.3 Canonic SD Numbers<br />
2.4 Encoding/Converting SD Numbers<br />
3 Fast Addition<br />
3.1 Overview of Adders<br />
3.2 SD Adder<br />
4 Multiplication and Division<br />
4.1 Sequential <strong>Algorithms</strong><br />
4.2 Sequential Mul<br />
4.3 Sequential Div<br />
4.4 Square Root Extraction<br />
<strong>Hardware</strong> <strong>Algorithms</strong><br />
4.5 Division Through Mul<br />
4.6 Division by Newton-Raphson<br />
5 Elementary Functions<br />
5.1 Basics<br />
5.2 Additive Normalization<br />
5.3 Multiplicative Normalization<br />
References<br />
Marcel Jacomet ii 2008
© Marcel Jacomet, 2009<br />
All rights reserved. This work may not be translated or copied in<br />
whole or in part without the written permission by the author, except<br />
for brief excerpts in connection with reviews or scholarly analysis.<br />
Use in connection with any form of information storage and retrieval,<br />
electronic adaptation, computer software is forbidden.<br />
1 Introduction<br />
The slides ”<strong>Hardware</strong> <strong>Algorithms</strong>: <strong>Computer</strong> <strong>Arithmetic</strong>” are<br />
based on [Kor02].<br />
Two’s complement is the most widely used number system in<br />
digital hardware. Nevertheless, other, less conventional number<br />
systems exist, such as the signed digit number system.<br />
The signed digit number system differs from traditional binary<br />
two’s complement and can yield a significant speed improvement<br />
in arithmetic operators such as adders, multipliers or dividers.<br />
This speed improvement is essentially achieved by eliminating or<br />
shortening the delay-consuming carry chains. Special cases are the<br />
binary signed digit (SD) and the canonic signed digit (CSD) number<br />
systems, which are ternary number systems with the digit values $\{-1, 0, 1\}$.<br />
More generally, a signed digit number system (GSD) of radix $r$ can use digit values like<br />
$\{-(r-1), -(r-2), \ldots, -1, 0, 1, \ldots, (r-2), (r-1)\}$. In this chapter<br />
we look into the theory of the signed digit number system. Some<br />
additional literature is referenced as well.<br />
Textbooks<br />
• <strong>Computer</strong> <strong>Arithmetic</strong>s, 2nd edition. Israel Koren, 2002<br />
A. K. Peters, hardcover, ISBN 1-56881-160-8, USD 62<br />
• many good texts can be found on the web<br />
2 Signed Digit Numbers<br />
2.1 General Signed Digit Numbers<br />
The Signed Digit Number System<br />
• Signed digit number systems are used to eliminate carry<br />
propagation chains in addition and subtraction.<br />
• SD is useful for developing fast algorithms for arithmetic<br />
operators like multiplication or division.<br />
• In all classical fixed-radix systems the digits are restricted<br />
to $x_i \in \{0, \ldots, r-1\}$.<br />
• However, we might expand the system to the signed digit<br />
number system: $x_i \in \{\overline{r-1}, \overline{r-2}, \ldots, \bar{1}, 0, 1, \ldots, r-2, r-1\}$,<br />
where $\bar{i}$ equals $-i$.<br />
The ceiling $\lceil x \rceil$ of a number $x$ is the smallest integer that is<br />
larger than or equal to $x$.<br />
Redundancy in Signed Digit Numbers<br />
• For $r = 10$ the digit set is $\{\bar{9}, \bar{8}, \ldots, \bar{1}, 0, 1, \ldots, 8, 9\}$ and, if<br />
$n = 2$, the range is $\bar{9}\bar{9} \le x \le 99$, which includes 199 numbers.<br />
However, with 2 digits, each having 19 possibilities,<br />
there are $19^2 = 361$ representations. Thus some numbers<br />
have more than one representation.<br />
• The signed digit number system has redundancies: $(01) = (1\bar{9}) = 1$,<br />
$(03) = (1\bar{7}) = 3$, ...<br />
• A high redundancy is too costly, thus we reduce it by restricting<br />
the digit set to $x_i \in \{\bar{a}, \overline{a-1}, \ldots, \bar{1}, 0, 1, \ldots, a\}$<br />
with $\lceil \frac{r-1}{2} \rceil \le a \le r-1$.<br />
• For $r = 10$ we get the range $5 \le a \le 9$ for $a$ and select<br />
$a = 5$. For $n = 2$ we can represent 1 only by $(01)$, since $(1\bar{9})$<br />
is no longer valid because $\bar{9}$ is illegal; however, 5 still has<br />
two representations, $(05)$ and $(1\bar{5})$.<br />
Note that the carry digits have been shifted to the left to<br />
simplify the execution of step two of the algorithm. The first<br />
carry digit $c_i$ and the first interim digit $u_i$ are calculated as<br />
follows: $c_i = 1$ as $5 + 5 \ge 5$ and thus $u_i = 5 + 5 - 10 \cdot 1 = 0$. This<br />
calculation can be done for all positions simultaneously. The<br />
second step is to calculate the sum digits $s_i$. Here again, all sum<br />
digits can be calculated simultaneously. Note that the carry<br />
digits do not propagate.<br />
Adding with SD numbers<br />
step 1 Compute an interim sum $u_i$ and a carry $c_i$: $u_i = x_i + y_i - r c_i$, where<br />
\[ c_i = \begin{cases} 1 & \text{if } (x_i + y_i) \ge a \\ \bar{1} & \text{if } (x_i + y_i) \le \bar{a} \\ 0 & \text{otherwise} \end{cases} \]<br />
step 2 Calculate the final sum $s_i = u_i + c_{i-1}$<br />
• Let’s do the addition of two decimal numbers. No<br />
carry propagation is needed!<br />
\[
\begin{array}{rrrrrrrl}
  &   & 2 & 5 & 5 & 5 & 5 & \\
+ &   & 2 & 3 & 4 & 5 & 5 & \\
\hline
  & 0 & 1 & 1 & 1 & 1 &   & c_i \\
  &   & 4 & \bar{2} & \bar{1} & 0 & 0 & u_i \\
  &   & 5 & \bar{1} & 0 & 1 & 0 & s_i
\end{array}
\]<br />
We want to guarantee that no new carry is generated in step 2. Thus<br />
the bound $a$ has to be defined differently. We need $|s_i| \le a$,<br />
which means $|s_i| = |u_i + c_{i-1}| \le a$. Since $|c_{i-1}| \le 1$, we have to<br />
satisfy $|u_i| \le a - 1$ for all input values $x_i$ and $y_i$. For the<br />
largest input values we have $u_i = 2a - r \le a - 1$, resulting in the first<br />
limit $a \le r - 1$, which is always the case. However, if $x_i + y_i = a$,<br />
the smallest sum for which $c_i$ is still 1, we get $u_i = a - r < 0$.<br />
Substituting it into $|u_i| \le a - 1$ leads to $r - a \le a - 1$ and finally<br />
to the second limit $\frac{r+1}{2} \le a$.<br />
Preventing carry ripples in SD numbers<br />
• Let’s keep the definition $r = 10$ and $a = 5$ and calculate<br />
the sum of the two SD numbers $124\bar{4}$ and $131\bar{4}$:<br />
\[
\begin{array}{rrrrrrl}
  &   & 1 & 2 & 4 & \bar{4} & \\
+ &   & 1 & 3 & 1 & \bar{4} & \\
\hline
  & 0 & 1 & 1 & \bar{1} &   & c_i \\
  &   & 2 & \bar{5} & \bar{5} & 2 & u_i \\
  &   & 3 & \bar{4} & \bar{6} & 2 & s_i
\end{array}
\]<br />
• As digit $\bar{6}$ does not exist, a carry would occur.<br />
• It can be shown that, to keep $s_i$ within the digit set, $a$ has to be restricted to the new range<br />
$\lceil \frac{r+1}{2} \rceil \le a \le r - 1$<br />
• Redefining $a = 6$ we again do an addition of two decimal<br />
SD numbers, $135\bar{4}$ and $131\bar{4}$. No carry propagation is needed!<br />
\[
\begin{array}{rrrrrrl}
  &   & 1 & 3 & 5 & \bar{4} & \\
+ &   & 1 & 3 & 1 & \bar{4} & \\
\hline
  & 0 & 1 & 1 & \bar{1} &   & c_i \\
  &   & 2 & \bar{4} & \bar{4} & 2 & u_i \\
  &   & 3 & \bar{3} & \bar{5} & 2 & s_i
\end{array}
\]<br />
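The two-step addition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sd_add` and `sd_value` are hypothetical, digits are plain integers (a negative integer stands for an overlined digit), and digit lists are written most significant digit first.

```python
def sd_add(x, y, r=10, a=6):
    """Add two signed-digit numbers given as digit lists (MSB first)."""
    n = max(len(x), len(y))
    x = [0] * (n - len(x)) + list(x)
    y = [0] * (n - len(y)) + list(y)
    # Step 1: each position computes its carry c_i and interim sum u_i
    # from its own two digits only, so all positions are independent.
    c, u = [], []
    for xi, yi in zip(x, y):
        t = xi + yi
        ci = 1 if t >= a else (-1 if t <= -a else 0)
        c.append(ci)
        u.append(t - r * ci)
    # Step 2: s_i = u_i + c_{i-1}. With MSB-first lists, shifting the
    # carries one position to the left means pairing [0] + u with c + [0].
    return [ui + ci for ui, ci in zip([0] + u, c + [0])]


def sd_value(digits, r=10):
    """Integer value of a signed-digit list (MSB first)."""
    value = 0
    for d in digits:
        value = value * r + d
    return value


# a = 5 produces the illegal digit -6, a = 6 keeps every digit in range
print(sd_add([1, 2, 4, -4], [1, 3, 1, -4], a=5))  # [0, 3, -4, -6, 2]
print(sd_add([1, 3, 5, -4], [1, 3, 1, -4], a=6))  # [0, 3, -3, -5, 2]
```

The sketch does not check that the result digits stay within the digit set; that is exactly the property the choice of $a$ has to guarantee.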
2.2 Binary Signed Digit Numbers<br />
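Binary SD numbers<br />
• Digit set for binary SD numbers ($r = 2$, thus $a = 1$) is:<br />
$\{\bar{1}, 0, 1\}$<br />
step 1 Compute an interim sum $u_i$ and a carry $c_i$: $u_i = x_i + y_i - 2 c_i$, where<br />
\[ c_i = \begin{cases} 1 & \text{if } (x_i + y_i) \ge 1 \\ \bar{1} & \text{if } (x_i + y_i) \le \bar{1} \\ 0 & \text{otherwise} \end{cases} \]<br />
step 2 Calculate the final sum $s_i = u_i + c_{i-1}$<br />
\[
\begin{array}{rrrrrrrl}
  &   & 1 & 1 & 1 & 1 & 1 & \\
+ &   & 0 & 0 & 0 & 0 & 1 & \\
\hline
  & 1 & 1 & 1 & 1 & 1 &   & c_i \\
  &   & \bar{1} & \bar{1} & \bar{1} & \bar{1} & 0 & u_i \\
  & 1 & 0 & 0 & 0 & 0 & 0 & s_i
\end{array}
\]<br />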
Carry ripples in binary SD numbers<br />
• Note that for $r = 2$ the condition $\lceil \frac{r+1}{2} \rceil \le a \le r - 1$ cannot be<br />
satisfied, thus no guarantee for preventing carries can be<br />
given.<br />
• Let’s do the addition of two binary SD numbers. The non-existing<br />
digits 2 and $\bar{2}$ occur if we ignore the carries:<br />
\[
\begin{array}{rrrrrrl}
  &   & 0 & \bar{1} & 1 & \bar{1} & \\
+ &   & 1 & 0 & 0 & \bar{1} & \\
\hline
  & 1 & \bar{1} & 1 & \bar{1} &   & c_i \\
  &   & \bar{1} & 1 & \bar{1} & 0 & u_i \\
  & 1 & \bar{2} & 2 & \bar{2} & 0 & s_i
\end{array}
\]<br />
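Preventing carry ripples in binary SD numbers<br />
• A problematic situation occurs when $x_i y_i = 01$ and $x_{i-1} y_{i-1}$<br />
equals either $\bar{1}\bar{1}$ or $0\bar{1}$.<br />
\[
\begin{array}{rrrrl}
  &   & 1 & \bar{1} & \\
+ &   & 0 & 0 & \\
\hline
  & 1 & \bar{1} &   & c_i \\
  &   & \bar{1} & 1 & u_i \\
  & 1 & \bar{2} & 1 & s_i
\end{array}
\]<br />
• We can avoid setting $u_i = \bar{1}$ in these cases by the new<br />
combination of $c_i = 0$ and $u_i = 1$:<br />
\[
\begin{array}{rrrrl}
  &   & 1 & \bar{1} & \\
+ &   & 0 & 0 & \\
\hline
  & 0 & \bar{1} &   & c_i \\
  &   & 1 & 1 & u_i \\
  & 0 & 0 & 1 & s_i
\end{array}
\]<br />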
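Rules to prevent carry ripples in binary SD numbers<br />
• We used the following rules for adding binary SD numbers:<br />
\[
\begin{array}{l|cccccc}
x_i y_i & 00 & 01 & 0\bar{1} & 11 & \bar{1}\bar{1} & 1\bar{1} \\
\hline
c_i & 0 & 1 & \bar{1} & 1 & \bar{1} & 0 \\
u_i & 0 & \bar{1} & 1 & 0 & 0 & 0
\end{array}
\]<br />
• Evolving the previously described idea, we get the modified<br />
rules table for adding binary SD numbers:<br />
\[
\begin{array}{l|cccccccc}
x_i y_i & 00 & 01 & 01 & 0\bar{1} & 0\bar{1} & 11 & \bar{1}\bar{1} & 1\bar{1} \\
x_{i-1} y_{i-1} & - & \text{neither is } \bar{1} & \text{at least one is } \bar{1} & \text{neither is } 1 & \text{at least one is } 1 & - & - & - \\
\hline
c_i & 0 & 1 & 0 & \bar{1} & 0 & 1 & \bar{1} & 0 \\
u_i & 0 & \bar{1} & 1 & 1 & \bar{1} & 0 & 0 & 0
\end{array}
\]<br />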
Examples for adding binary SD numbers<br />
• Let’s again do the addition of the two binary SD numbers.<br />
Now no carry propagation occurs!<br />
\[
\begin{array}{rrrrrrl}
  &   & 0 & \bar{1} & 1 & \bar{1} & \\
+ &   & 1 & 0 & 0 & \bar{1} & \\
\hline
  & 0 & 0 & 0 & \bar{1} &   & c_i \\
  &   & 1 & \bar{1} & 1 & 0 & u_i \\
  & 0 & 1 & \bar{1} & 0 & 0 & s_i
\end{array}
\]<br />
• The above binary SD numbers represent the decimal numbers<br />
$-3_{10}$ and $7_{10}$, resulting in $4_{10}$ after the addition.<br />
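The modified rules can be checked with a short Python sketch. It assumes that the rule for a digit sum of $+1$ looks for a $-1$ digit in the next lower position, and the rule for a digit sum of $-1$ looks for a $+1$ digit there; the names `bsd_add` and `bsd_value` are hypothetical, and digit lists are most significant digit first.

```python
def bsd_add(x, y):
    """Carry-free addition of binary SD numbers (digits -1, 0, 1), MSB first."""
    n = max(len(x), len(y))
    x = [0] * (n - len(x)) + list(x)
    y = [0] * (n - len(y)) + list(y)
    c, u = [0] * n, [0] * n
    for i in range(n):
        t = x[i] + y[i]
        # Digits of the next lower position (index i + 1 with MSB first).
        lower = (x[i + 1], y[i + 1]) if i + 1 < n else (0, 0)
        if t in (2, -2):
            c[i], u[i] = t // 2, 0
        elif t == 1:
            # An incoming carry of -1 is possible only if a lower digit is -1.
            c[i], u[i] = (0, 1) if -1 in lower else (1, -1)
        elif t == -1:
            # An incoming carry of +1 is possible only if a lower digit is +1.
            c[i], u[i] = (0, -1) if 1 in lower else (-1, 1)
        # t == 0 keeps c_i = u_i = 0
    # s_i = u_i + c_{i-1}: carries are shifted one position to the left.
    return [ui + ci for ui, ci in zip([0] + u, c + [0])]


def bsd_value(digits):
    """Integer value of a binary SD digit list (MSB first)."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


print(bsd_add([0, -1, 1, -1], [1, 0, 0, -1]))  # [0, 1, -1, 0, 0], i.e. -3 + 7 = 4
```

Each position looks only at its own digit pair and the pair directly below it, so the delay does not grow with the word length.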
2.3 Canonic Signed Digit Numbers<br />
Canonical recoding of binary SD numbers<br />
• Binary SD numbers are particularly useful in fast multiplication<br />
and division algorithms.<br />
• Nonzero digits correspond to an active operation: add/subtract<br />
• Zero digits correspond to shift-only operations.<br />
• To reduce power consumption, the number of nonzero digits should be<br />
reduced as much as possible.<br />
• The number $7_{10}$ has the following representations:<br />
\[
\begin{array}{rrrrr}
16 & 8 & 4 & 2 & 1 \\
\hline
  & 0 & 1 & 1 & 1 \\
  & 1 & \bar{1} & 1 & 1 \\
  & 1 & 0 & \bar{1} & 1 \\
  & 1 & 0 & 0 & \bar{1} \\
1 & \bar{1} & \bar{1} & 1 & 1 \\
  &   & \vdots &   &
\end{array}
\]<br />
• Out of these variants, $100\bar{1}$ is the minimal representation.<br />
Converting SD in CSD numbers<br />
• The SD number with the minimal number of non-zero<br />
digits is called canonic signed digit number (CSD).<br />
• The following algorithm can be used to produce CSD code:<br />
Starting with the LSB, substitute every sequence of two or more 1s,<br />
$01\cdots11$, with $10\cdots0\bar{1}$ (similarly for $\bar{1}$ sequences).<br />
• Replace a $\bar{1}1$ sequence by $0\bar{1}$ and replace a $1\bar{1}$ sequence by<br />
$01$, respectively.<br />
Ex1: Convert the SD number $01111_2$ ($15_{10}$) into a CSD number:<br />
$1000\bar{1}$<br />
Ex2: Convert the SD number $011011_2$ ($27_{10}$) into a CSD number:<br />
$011011_2 = 1110\bar{1}_{SD} = 100\bar{1}0\bar{1}_{CSD}$<br />
Ex3: Convert the SD number 011100 2 into a CSD number: 010100 CSD<br />
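For a plain binary input the recoding can also be computed arithmetically. The sketch below does not perform the string substitution described above; it uses the equivalent and widely used rule of inspecting the value modulo 4, and the function name `to_csd` is hypothetical. It assumes a non-negative integer input and returns the digits most significant digit first.

```python
def to_csd(n):
    """Canonic signed-digit form of a non-negative integer (MSB first)."""
    digits = []
    while n != 0:
        if n & 1:
            # n mod 4 == 1 -> digit +1, n mod 4 == 3 -> digit -1 (start of a run of 1s)
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits[::-1] or [0]


print(to_csd(15))  # [1, 0, 0, 0, -1]
print(to_csd(27))  # [1, 0, 0, -1, 0, -1]
```

The result never contains two adjacent non-zero digits, which is the property that makes the representation canonic.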
2.4 Encoding and Converting Signed Digit Numbers<br />
If we use 2 bits for encoding the SD digits $\{\bar{1}, 0, 1\}$, we can<br />
select between $4! = 24$ different encodings.<br />
Encoding SD numbers<br />
• $4! = 24$ different encodings are possible for the SD digits<br />
$\{\bar{1}, 0, 1\}$; 2 of them have practical relevance:<br />
\[
\begin{array}{c|cc|cc}
  & \multicolumn{2}{c|}{\text{Encoding 1}} & \multicolumn{2}{c}{\text{Encoding 2}} \\
x & x^h & x^l & x^h & x^l \\
\hline
0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 \\
\bar{1} & 1 & 0 & 1 & 1
\end{array}
\]<br />
• Encoding 1 satisfies $x = x^l - x^h$; Encoding 2 can be viewed<br />
as 2’s complement.<br />
Converting from SD numbers (1)<br />
• Using ”Encoding 1” the conversion from SD to 2’s complement<br />
can be done by using the equation $x = x^l - x^h$<br />
bit for bit.<br />
Ex3: $25_{10}$<br />
\[
\begin{array}{rrrrrrrl}
  & 1 & 0 & \bar{1} & 0 & 1 & \bar{1} & x \\
  & 1 & 0 & 0 & 0 & 1 & 0 & x^l \\
- & 0 & 0 & 1 & 0 & 0 & 1 & x^h \\
\hline
  & 0 & 1 & 1 & 0 & 0 & 1 & \text{2's compl}
\end{array}
\]<br />
Converting from SD numbers (2)<br />
• An even simpler conversion to 2’s complement is the following<br />
algorithm:<br />
– Start at the LSB and move towards the MSB position.<br />
– Replace any $\bar{1}$ by 1 and the succeeding 0’s by 1’s.<br />
– Forward the negative sign until a 1 consumes the negative<br />
sign and is replaced by a 0.<br />
– If a second $\bar{1}$ is reached by the negative sign, then it<br />
is replaced by a 0 and the negative sign continues.<br />
– If a 1 is not reached, then the MSB will be set to 1.<br />
Ex4: $9_{10}$<br />
\[
\begin{array}{rrrrrrl}
0 & 1 & 1 & 0 & 0 & 1 & c_i \text{ (neg)} \\
1 & \bar{1} & \bar{1} & 0 & 1 & \bar{1} & \text{SD} \\
0 & 0 & 1 & 0 & 0 & 1 & z_i \text{ (2's compl.)}
\end{array}
\]<br />
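The sign-forwarding rules above amount to a borrow that travels from the LSB to the MSB. A minimal Python sketch, assuming digits are given most significant digit first and using the hypothetical name `sd_to_twos_complement`:

```python
def sd_to_twos_complement(sd):
    """Convert binary SD digits (MSB first) to two's complement bits (MSB first)."""
    bits = []
    borrow = 0                          # the forwarded negative sign
    for d in reversed(sd):              # start at the LSB
        t = d - borrow                  # t is one of -2, -1, 0, 1
        borrow = 1 if t < 0 else 0      # the sign continues while t is negative
        bits.append(t + 2 * borrow)     # resulting bit is 0 or 1
    return bits[::-1]


print(sd_to_twos_complement([1, -1, -1, 0, 1, -1]))  # [0, 0, 1, 0, 0, 1], i.e. 9
```

If the borrow is still active after the MSB, the value is negative and the leading bits come out as 1, which matches the last rule of the list. The sketch assumes the word is long enough to hold the value in two's complement.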
3 Fast Addition<br />
3.1 Overview of Adder Architectures<br />
The most frequently used arithmetic operation is addition. It<br />
is used either on its own or as part of more complex arithmetic<br />
operations like multiplication, division, hyperbolic and trigonometric<br />
operations and so on. Thus a really fast add operation is<br />
essential in hardware algorithms. The most straightforward implementation<br />
is based on cascading full adder blocks, which<br />
has the disadvantage of being very slow for large operands because<br />
the carry ripples through the result. Numerous adder architectures<br />
have been developed to cope with this limitation.<br />
Revisiting Adder Architectures<br />
• Ripple carry adders have the ripple delay problem.<br />
• Carry-look-ahead adder<br />
• Carry-skip adder<br />
• Hierarchical adders, Wallace tree adder<br />
• Carry-select adder<br />
• Carry-save adder<br />
• Pipelining of arithmetic operations<br />
3.2 Signed Digit Adder<br />
SD Adder<br />
• Develop a cascadable hardware block for adding SD numbers<br />
using the ”Encoding 2” representation.<br />
[Figure: chain of identical adder blocks for the digit positions $0, 1, 2, \ldots, i$; each block takes the encoded digits $x_i^{hl}$ and $y_i^{hl}$ and produces the encoded sum digit $s_i^h s_i^l$.]<br />
• Now no carry chain exists anymore! Thus we have a very<br />
fast adder, independent of its bit length.<br />
4 Multiplication and Division<br />
4.1 Sequential <strong>Algorithms</strong> for Multiplication,<br />
Division and others<br />
The current trend of moving signal processing from software<br />
implementations towards hardware-intensive signal processing<br />
has uncovered a relative lack of understanding of hardware<br />
signal processing architectures. Many hardware-efficient algorithms<br />
exist, but these are generally not well known due to the<br />
dominance of software systems over the past quarter century. A<br />
selected set of elementary algorithms is presented in the subsequent<br />
sections.<br />
Sequential <strong>Algorithms</strong> for Multiplication, Division and<br />
others<br />
• Binary sequential multiplication is done in the same way as with decimal<br />
numbers.<br />
• Binary sequential division, ditto.<br />
• SQRT (square root), GCD (greatest common divisor) and<br />
other operations have simple sequential implementation<br />
architectures.<br />
• Sequential algorithms are time consuming but benefit from<br />
a low gate count.<br />
4.2 Sequential Algorithm for Multiplication<br />
Sequential Multiplication Algorithm<br />
• Multiplication $Z = X \cdot Y$, with $X = x_{n-1} x_{n-2} \cdots x_1 x_0$<br />
and $Y = y_{n-1} y_{n-2} \cdots y_1 y_0$, where $x_{n-1}$ and $y_{n-1}$ are the<br />
sign bits.<br />
• For both operands being positive (i.e. $x_{n-1} = y_{n-1} = 0$)<br />
we obtain:<br />
\[ Z = \left( \sum_{j=0}^{n-2} x_j 2^j \right) \cdot Y \]<br />
• The maximal number of bits needed for $Z$ is $1 + 2(n-1) = 2n - 1$.<br />
• With $P^{(j)}$ being the partial product, we can find the iteration:<br />
$P^{(j+1)} = (P^{(j)} + x_j \cdot Y) \cdot 2^{-1}$, with $P^{(0)} = 0$, for<br />
$j = 0, 1, \ldots, n-2$<br />
• Doing some calculations ...<br />
\[ P^{(n-1)} = (P^{(n-2)} + x_{n-2} \cdot Y) \cdot 2^{-1} = \left( \sum_{j=0}^{n-2} x_j 2^{j-(n-1)} \right) \cdot Y = 2^{-(n-1)} \cdot Z \]<br />
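Example of Sequential Multiplication Algorithm<br />
• The previous algorithm works for positive inputs as<br />
well as for a positive multiplier $X$ and a negative multiplicand<br />
$Y$. $Z = X \cdot Y$, with $X = 3_{10}$ and $Y = -6_{10}$<br />
\[
\begin{array}{lcrrrrrrrrl}
Y & & & 1 & 0 & 1 & 0 & & & & -6_{10} \\
X & \times & & 0 & 0 & 1 & 1 & & & & 3_{10} \\
\hline
P^{(0)} = 0 & & & 0 & 0 & 0 & 0 & & & & \\
x_0 = 1 \Rightarrow \text{add } Y & + & & 1 & 0 & 1 & 0 & & & & \\
 & & & 1 & 0 & 1 & 0 & & & & \\
\text{shift right} & & & 1 & 1 & 0 & 1 & 0 & & & \\
x_1 = 1 \Rightarrow \text{add } Y & + & & 1 & 0 & 1 & 0 & & & & \\
 & & 1 & 0 & 1 & 1 & 1 & 0 & & & \\
\text{shift right} & & & 1 & 0 & 1 & 1 & 1 & 0 & & \\
x_2 = 0 \Rightarrow \text{shift only} & & & 1 & 1 & 0 & 1 & 1 & 1 & 0 & -18_{10}
\end{array}
\]<br />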
Architecture of Sequential Multiplication Example<br />
• Compact hardware design using one n bit register, one<br />
2n−1 bit register and one n−1 bit adder.<br />
[Figure: datapath of the sequential multiplier with the blocks Reg ($Y$), adder ($+$) and ShiftReg ($P$, $X$).]<br />
Sequential Multiplication Algorithm for Negative Operands<br />
• Multiplication $Z = X \cdot Y$<br />
• For both operands being negative (i.e. $x_{n-1} = y_{n-1} = 1$),<br />
the previous algorithm does not work anymore.<br />
• The 2’s complement number $X$ can be written as follows:<br />
\[ X = -x_{n-1} \cdot 2^{n-1} + \tilde{X} \quad \text{where} \quad \tilde{X} = \sum_{j=0}^{n-2} x_j \cdot 2^j \]<br />
• If we ignored that $X$ is negative, we would<br />
obtain:<br />
\[ \tilde{Z} = \tilde{X} \cdot Y = (X + x_{n-1} \cdot 2^{n-1}) \cdot Y = X \cdot Y + Y \cdot x_{n-1} \cdot 2^{n-1} \]<br />
• Actually the term $X \cdot Y$ is needed for the result, thus we<br />
get:<br />
\[ X \cdot Y = \tilde{Z} - Y \cdot x_{n-1} \cdot 2^{n-1} \]<br />
• This means we must subtract $Y$, weighted by $2^{n-1}$, from $\tilde{Z}$ in case $x_{n-1} = 1$.<br />
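Example of Multiplication with Two Negative Operands<br />
• For two negative operands the previously developed adaptation<br />
of the algorithm is needed.<br />
Ex 6: $Z = X \cdot Y$, with $X = -3_{10}$ and $Y = -6_{10}$<br />
\[
\begin{array}{lcrrrrrrrl}
Y & & 1 & 0 & 1 & 0 & & & & -6_{10} \\
X & \times & 1 & 1 & 0 & 1 & & & & -3_{10} \\
\hline
x_0 = 1 \Rightarrow \text{add } Y & + & 1 & 0 & 1 & 0 & & & & \\
\text{shift right} & & 1 & 1 & 0 & 1 & 0 & & & \\
x_1 = 0 \Rightarrow \text{shift only} & & 1 & 1 & 1 & 0 & 1 & 0 & & \\
x_2 = 1 \Rightarrow \text{add } Y & + & 1 & 0 & 1 & 0 & & & & \\
 & & 1 & 0 & 0 & 0 & 1 & 0 & & \\
\text{shift right} & & 1 & 1 & 0 & 0 & 0 & 1 & 0 & \\
x_3 = 1 \Rightarrow \text{correct} & + & 0 & 1 & 1 & 0 & & & & \\
 & & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 18_{10}
\end{array}
\]<br />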
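The iteration and the final correction for a negative multiplier can be sketched in Python. This is an illustration, not the hardware architecture: the name `seq_multiply` is hypothetical, operands are ordinary Python integers interpreted as $n$-bit two's complement numbers, and the partial product is kept scaled by $2^{n-1}$ so that the right shift stays an exact integer operation.

```python
def seq_multiply(x, y, n):
    """Shift-and-add multiplication of two n-bit two's complement integers."""
    # x_0 ... x_{n-1}; x_{n-1} is the sign bit of the multiplier
    x_bits = [(x >> j) & 1 for j in range(n)]
    scale = 2 ** (n - 1)
    p = 0
    for j in range(n - 1):
        # P(j+1) = (P(j) + x_j * Y) / 2, with P stored as P * 2^(n-1);
        # Python's >> is an arithmetic shift, so the sign is preserved.
        p = (p + x_bits[j] * y * scale) >> 1
    if x_bits[n - 1]:
        # Negative multiplier: subtract Y weighted by 2^(n-1).
        p -= y * scale
    return p


print(seq_multiply(3, -6, 4))   # -18
print(seq_multiply(-3, -6, 4))  # 18
```

The loop covers the $n-1$ magnitude bits; only the sign bit of the multiplier triggers the subtraction.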
4.3 Sequential Algorithm for Division<br />
Sequential Restoring Division Algorithm<br />
• Division $Q = X/D$ is the most complex of the four basic<br />
arithmetic operations.<br />
• Given dividend $X$, divisor $D$, quotient $Q$ and remainder $R$<br />
we get:<br />
$X = Q \cdot D + R$ with $R < D$<br />
• We assume that $X$, $D$ and $Q$ are fractions, with $X < D$<br />
and $D \ne 0$.<br />
• Assuming positive operands, we have fractional variables<br />
like:<br />
$Q = 0.q_1 \cdots q_m$ where $m = n - 1$<br />
• We perform the division as a sequence of subtractions and<br />
shifts.<br />
$r_i = 2 r_{i-1} - q_i \cdot D$ with $r_0 = X$<br />
• If the shifted remainder $2 r_{i-1}$ is larger than or equal to the divisor $D$, then $q_i = 1$,<br />
else $q_i = 0$.<br />
Example of Sequential Restoring Division<br />
• The previous algorithm works for positive inputs, for<br />
fractions as well as for integers, as can be shown.<br />
$Q = X/D$, with $X = 5/8_{10} = 0.101_2$ and $D = 3/4_{10} = 0.11_2$<br />
\[
\begin{array}{lcll}
r_0 = X & & 0.101000 & 2r_0 \ge D \\
2r_0 & & 01.01000 & \text{set } q_1 = 1 \\
\text{add } -D & + & 11.010 & \\
r_1 = 2r_0 - D & & 00.10000 & 2r_1 \ge D \\
2r_1 & & 01.0000 & \text{set } q_2 = 1 \\
\text{add } -D & + & 11.010 & \\
r_2 = 2r_1 - D & & 00.0100 & 2r_2 < D \\
2r_2 & & 00.100 & \text{set } q_3 = 0 \\
r_3 = 2r_2 & & 00.100 & 2r_3 \ge D \\
2r_3 & & 01.00 & \text{set } q_4 = 1
\end{array}
\]<br />
• The result is $Q = 0.1101$; continuing the calculation for higher<br />
precision would give $Q = 0.11010$<br />
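The restoring recurrence is short enough to try out directly. The following Python sketch uses exact fractions so that the remainders match the binary values of the example; the function name `restoring_divide` is hypothetical and the operands are assumed to satisfy $0 \le X < D$.

```python
from fractions import Fraction


def restoring_divide(x, d, m):
    """Return the quotient bits q_1..q_m and the final remainder r_m."""
    r = Fraction(x)
    d = Fraction(d)
    q = []
    for _ in range(m):
        r = 2 * r                # shift: 2 * r_{i-1}
        if r >= d:
            q.append(1)          # subtraction succeeds, keep the difference
            r -= d
        else:
            q.append(0)          # keep (restore) the shifted remainder
    return q, r


q, r = restoring_divide(Fraction(5, 8), Fraction(3, 4), 4)
print(q, r)  # [1, 1, 0, 1] 1/4
```

The returned remainder is scaled: the true remainder is $r_m \cdot 2^{-m}$, so that $X = Q \cdot D + r_m \cdot 2^{-m}$.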
Robertson Diagram for Restoring Division<br />
• With $r_{i-1} < D$, the quotient bit $q_i$ should be selected such<br />
that $r_i < D$ and $r_i \ge 0$, resulting in the condition<br />
$r_i = 2 \cdot r_{i-1} - q_i \cdot D$<br />
[Figure: Robertson diagram with $r_i$ on the vertical axis and $2r_{i-1}$ on the horizontal axis; the region $q_i = 0$ lies below $2r_{i-1} = D$, the region $q_i = 1$ between $D$ and $2D$.]<br />
• In the restoring division algorithm, the old remainder $2r_{i-1}$<br />
is restored if the subtraction $2r_{i-1} - D$ was negative; it is then shifted,<br />
and after a subsequent subtraction we obtain $4r_{i-1} - D$.<br />
Sequential Non-Restoring Division Algorithm<br />
• The corrections of negative remainders are postponed to<br />
later steps, thus we keep $2r_{i-1} - D < 0$.<br />
• Shifting it and adding $D$ we get a new remainder $2(2r_{i-1} -<br />
D) + D = 4r_{i-1} - D$.<br />
• These steps are repeated as long as the remainder remains<br />
negative.<br />
• The quotient digit is determined as follows:<br />
\[ q_i = \begin{cases} 1 & \text{if } 2r_{i-1} \ge 0 \\ \bar{1} & \text{if } 2r_{i-1} < 0 \end{cases} \]<br />
• With the above decision, the remainder is computed with<br />
$r_i = 2r_{i-1} - q_i \cdot D$<br />
• The correction of a ”wrong” selection of the quotient digit (the remainder<br />
becomes negative) is done by using $\bar{1}$ instead of 0.<br />
For example, $q_{i-1} q_i = 10$ (too large) is corrected to<br />
$q_{i-1} q_i = 1\bar{1}$.<br />
• If the final remainder and the dividend have opposite signs, a<br />
final correction is needed: $Q_{cor} = Q - 2^{-(n-1)}$<br />
Example of Sequential Non-Restoring Division<br />
$Q = X/D$, with $X = -3/8_{10}$ and $D = 3/4_{10} = 0.11_2$<br />
\[
\begin{array}{lcll}
r_0 = X & & 1.101 & 2r_0 < 0 \\
2r_0 & & 11.010 & \text{set } q_1 = \bar{1} \\
\text{add } D & + & 00.110 & \\
r_1 & & 00.000 & 2r_1 \ge 0 \\
2r_1 & & 00.000 & \text{set } q_2 = 1 \\
\text{add } -D & + & 11.010 & \\
r_2 & & 11.010 & 2r_2 < 0 \\
2r_2 & & 10.100 & \text{set } q_3 = \bar{1} \\
\text{add } D & + & 00.110 & \\
r_3 & & 11.010 & 2r_3 < 0 \\
2r_3 & & 10.100 & \text{correct or set } q_4 = \bar{1}
\end{array}
\]<br />
• The result before correction is $Q = 0.\bar{1}1\bar{1}\bar{1} = 1.1001_2$; after the<br />
correction we get $Q = 1.1000_2$<br />
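The non-restoring recurrence can be sketched the same way. The Python code below only generates the signed quotient digits and the scaled final remainder; it leaves out the final correction step. The function name `nonrestoring_divide` is hypothetical, and exact fractions are used in place of fixed-point registers.

```python
from fractions import Fraction


def nonrestoring_divide(x, d, m):
    """Return the SD quotient digits q_1..q_m (+1 or -1) and the remainder r_m."""
    r = Fraction(x)
    d = Fraction(d)
    q = []
    for _ in range(m):
        r = 2 * r                    # shift: 2 * r_{i-1}
        qi = 1 if r >= 0 else -1     # digit selection by the sign only
        q.append(qi)
        r -= qi * d                  # subtract D or add D, never restore
    return q, r


print(nonrestoring_divide(Fraction(-3, 8), Fraction(3, 4), 4))
# digits [-1, 1, -1, -1], remainder -3/4
```

Since only the sign of the shifted remainder is inspected, the digit selection needs no comparison against $D$.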
Robertson Diagram for Non-Restoring Division<br />
• The non-restoring division can be represented graphically.<br />
• Horizontal lines represent $\pm D$<br />
• Diagonal lines represent multiplication by 2<br />
• Vertical lines represent function selection ($1$ vs. $\bar{1}$)<br />
• Example: $X = 0.5_{10}$, $D = 0.75_{10}$, thus $Q = 0.11\bar{1} = 0.101_2$<br />
[Figure: Robertson diagram with $r_i$ on the vertical axis and $2r_{i-1}$ on the horizontal axis, marked with $-D = 11.01_2$, $D = 0.110_2$ and $2D = 01.10_2$. The trace starts at $r_0 = X = 0.100_2$; $2r_0 = 01.00_2 \ge 0$ gives $q_1 = 1$ and $r_1 = 0.010_2$; $2r_1 = 0.100_2 \ge 0$ gives $q_2 = 1$ and $r_2 = 11.110_2$; $2r_2 = 11.100_2 < 0$ gives $q_3 = \bar{1}$.]<br />
Converting SD Quotient to Two’s Complement<br />
• The non-restoring division generates a quotient representation that is incompatible<br />
with 2’s complement, thus a conversion is needed for further<br />
processing.<br />
• SD conversion algorithms start with the LSB, thus needing<br />
the complete quotient before starting.<br />
• A new on-the-fly algorithm is needed to reduce the total execution<br />
time.<br />
• It can be shown that the following equations and results<br />
are valid:<br />
$p_i = \frac{1}{2}(q_i + 1)$<br />
$Q = (1 - p_1).p_2 p_3 \cdots p_m 1$<br />
Example: $Q = X/D$, with $X = -3/8_{10}$ and $D = 3/4_{10}$, resulting<br />
in the non-corrected value $Q = 0.\bar{1}1\bar{1}\bar{1}$<br />
step 1: replace all $\bar{1}$ by 0: $Q = 0.0100$<br />
step 2: shift the given number left by 1 bit position: $Q = 0.100$<br />
step 3: complement the most significant bit: $Q = 1.100$<br />
step 4: shift a 1 into the LSB position: $Q = 1.1001$<br />
correct: subtract $0.0001$ from the non-corrected result: $Q = 1.1000 = -0.5_{10}$<br />
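The mapping from the signed quotient digits to two's complement bits can be sketched as follows. The names `sd_quotient_to_twos_complement` and `twos_complement_value` are hypothetical, and the sketch covers only the conversion of the non-corrected quotient, not the final correction.

```python
def sd_quotient_to_twos_complement(q):
    """Map digits q_1..q_m (+1 / -1) to the bits (1 - p_1) . p_2 ... p_m 1."""
    p = [(qi + 1) // 2 for qi in q]      # p_i = (q_i + 1) / 2
    return [1 - p[0]] + p[1:] + [1]      # sign bit first, then fraction bits


def twos_complement_value(bits):
    """Value of a sign bit followed by fraction bits."""
    value = -float(bits[0])
    for k, b in enumerate(bits[1:], start=1):
        value += b * 2.0 ** -k
    return value


bits = sd_quotient_to_twos_complement([-1, 1, -1, -1])
print(bits, twos_complement_value(bits))  # [1, 1, 0, 0, 1] -0.4375
```

Every output bit depends on a single quotient digit, so each bit can be emitted as soon as its digit has been selected.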
4.4 Square Root Extraction<br />
Simple SQRT Algorithm<br />
• square root extraction is similar to restoring division<br />
• assume $Q = (0.q_1 q_2 \cdots q_m)$ is a fraction that denotes the<br />
square root of $X$<br />
• $Q_i$ is the partially developed root at step $i$ with the remainder<br />
$r_i$:<br />
\[ Q_i = \sum_{k=1}^{i} q_k 2^{-k} \]<br />
• leading to<br />
$r_i 2^{-i} = X - Q_i^2$<br />
• we find with some calculations<br />
$r_i = 2 r_{i-1} - q_i \cdot (2 Q_{i-1} + q_i 2^{-i})$<br />
• this can be viewed as a division with the changing divisor $D =<br />
(2 Q_{i-1} + q_i 2^{-i})$, see the sequential restoring division in Section 4.3<br />
• for $m \to \infty$ we get<br />
$X - Q^2 = \lim_{m \to \infty} 2^{-m} r_m = 0$<br />
Example of Simple SQRT Algorithm<br />
$Q^2 = X$, with $X = 10/16_{10}$<br />
\[
\begin{array}{lcll}
r_0 = X & & 0.1010 & \\
2r_0 & & 01.0100 & \\
(0 + 2^{-1}) & - & 00.1000 & \\
r_1 & & 00.1100 & \text{set } q_1 = 1,\ Q_1 = 0.1 \\
2r_1 & & 01.1000 & \\
(2Q_1 + 2^{-2}) & - & 01.0100 & \\
r_2 & & 00.0100 & \text{set } q_2 = 1,\ Q_2 = 0.11 \\
2r_2 & & 00.1000 & \\
(2Q_2 + 2^{-3}) & - & 01.1010 & \\
r_3 & & 10.1110 & \text{set } q_3 = 0,\ Q_3 = 0.110 \\
r_3 = 2r_2 & & 00.1000 & \\
2r_3 & & 01.0000 & \\
(2Q_3 + 2^{-4}) & - & 01.1001 & \\
r_4 & & 11.0111 & \text{set } q_4 = 0,\ Q_4 = 0.1100 \\
r_4 = 2r_3 & & 01.0000 &
\end{array}
\]<br />
• The result is $Q = 0.1100$ and the final remainder is $2^{-4} r_4 =<br />
16/256 = X - Q^2 = (160 - 144)/256$<br />
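The square root recurrence can be sketched with exact fractions as well. The function name `sqrt_digits` is hypothetical; each step tentatively assumes $q_i = 1$ and restores the shifted remainder if the trial subtraction becomes negative.

```python
from fractions import Fraction


def sqrt_digits(x, m):
    """Return the root bits q_1..q_m, the root Q_m and the remainder r_m."""
    r = Fraction(x)
    root = Fraction(0)
    q = []
    for i in range(1, m + 1):
        weight = Fraction(1, 2 ** i)
        trial = 2 * r - (2 * root + weight)   # assume q_i = 1
        if trial >= 0:
            q.append(1)
            r = trial
            root += weight
        else:
            q.append(0)
            r = 2 * r                          # restore: r_i = 2 * r_{i-1}
    return q, root, r


print(sqrt_digits(Fraction(10, 16), 4))
# bits [1, 1, 0, 0], root 3/4, remainder 1
```

As in the example above, the scaled remainder satisfies $r_m \cdot 2^{-m} = X - Q_m^2$.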
4.5 Division Through Multiplication<br />
In the previously described division algorithms the number of<br />
iteration steps is linear in the number of bits of precision<br />
$n$ of the result. In the algorithms described next the number<br />
of steps is proportional to $\log_2 n$. However, these algorithms<br />
rely on fast parallel multipliers.<br />
Division by Convergence<br />
• numerator and denominator are both multiplied by the<br />
same factors<br />
\[ Q = \frac{N}{D} = \frac{N \cdot R_0 \cdot R_1 \cdots R_{m-1}}{D \cdot R_0 \cdot R_1 \cdots R_{m-1}} \to \frac{Q}{1} \]<br />
• only the quotient needs to be calculated; the algorithm is<br />
suitable for floating-point computations<br />
• the essential step is to select the factors $R_i$ correctly<br />
prep: let $D$ be a normalized binary fraction $D = 0.1xxxx$<br />
• therefore $1/2 \le D < 1$ and $D = 1 - y$ where $y \le 1/2$<br />
iter 1: we select $R_0 = 1 + y$, then<br />
$D_1 = D \cdot R_0 = (1 - y) \cdot (1 + y) = 1 - y^2$<br />
• since $y^2 \le 1/4$, $D_1$ satisfies $D_1 \ge 3/4$ and is therefore<br />
closer to 1 than $D$, in binary $D_1 = 0.11xxxx$<br />
iter 2: we select $R_1 = 1 + y^2$ and obtain<br />
$D_2 = D_1 \cdot R_1 = (1 - y^2) \cdot (1 + y^2) = 1 - y^4$<br />
• now $y^4 \le 1/16$, $D_2$ satisfies $D_2 \ge 15/16$, and is therefore<br />
closer to 1 than $D_1$, in binary $D_2 = 0.1111xxxx$<br />
It can easily be shown that the denominator converges to 1.<br />
Division by Convergence: Proof<br />
• It can easily be shown that the denominator converges to<br />
1.<br />
\[ D \cdot R_0 \cdot R_1 \cdots R_{m-1} = (1-y)\,[(1+y)(1+y^2)(1+y^4)\cdots] = (1+y)\,[(1-y)(1+y^2)(1+y^4)\cdots] \]<br />
• The expression in the last brackets is the series expansion of<br />
$\frac{1}{1+y}$ for $0 \le y \le 1/2$:<br />
\[ \lim_{i \to \infty} D_i = (1+y) \cdot \frac{1}{1+y} = 1 \]<br />
step 1: $D_{i+1} = D_i \cdot R_i$ and $N_{i+1} = N_i \cdot R_i$<br />
step 2: Two’s complement operation $R_{i+1} = 2 - D_{i+1}$<br />
$D_i = 1 - y^{2^i}$<br />
Example of Division by Convergence<br />
Ex11: $Q = N/D$, with $N = 0.101101_2 = 0.703125_{10}$ and $D =<br />
0.1101_2 = 0.8125_{10}$<br />
\[
\begin{array}{ll}
D_0 = D & 0.1101 \\
R_0 = 2 - D_0 & 1.0011 \\
N_1 = N \cdot R_0 & 0.1101010111 \\
D_1 = D_0 \cdot R_0 & 0.11110111 \\
R_1 = 2 - D_1 & 1.00001001 \\
N_2 = N_1 \cdot R_1 & 0.110111010100001\ldots \\
D_2 = D_1 \cdot R_1 & 0.111111111010111\ldots \\
R_2 = 2 - D_2 & 1.000000000101000\ldots \\
N_3 = N_2 \cdot R_2 & 0.110111011000100\ldots \\
D_3 = D_2 \cdot R_2 & 0.111111111111111\ldots
\end{array}
\]<br />
• To speed up the algorithm further, look-up tables or<br />
ROMs are used for the initial value of $R_0$.<br />
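The two iteration steps can be reproduced with exact fractions. This Python sketch uses the hypothetical name `divide_by_convergence` and assumes a normalized divisor $1/2 \le D < 1$; it does not model the look-up table for $R_0$.

```python
from fractions import Fraction


def divide_by_convergence(n, d, iterations):
    """Multiply N and D by R_i = 2 - D_i until D is close to 1; N then approximates Q."""
    n, d = Fraction(n), Fraction(d)
    for _ in range(iterations):
        r = 2 - d              # step 2: R_i = 2 - D_i (two's complement of D_i)
        n, d = n * r, d * r    # step 1: N_{i+1} = N_i * R_i, D_{i+1} = D_i * R_i
    return n, d


n3, d3 = divide_by_convergence(Fraction(45, 64), Fraction(13, 16), 3)
print(float(n3), float(d3))    # N_3 is close to 45/52, D_3 is close to 1
```

With $y = 1 - D$, the denominator after $i$ steps equals $1 - y^{2^i}$, so the number of correct bits roughly doubles in every iteration.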
4.6 Division by Newton-Raphson<br />
Based on the Newton-Raphson algorithm, a division by reciprocation<br />
can be performed. In this case, the reciprocal is first<br />
calculated using the Newton-Raphson iteration,<br />
followed by a multiplication. The Newton-Raphson algorithm is<br />
a method of finding a zero of a given function $f(x)$, where a<br />
zero is a solution of $f(x) = 0$.<br />
Division by Reciprocation using Newton-Raphson Algorithm<br />
• First the reciprocal $1/D$ is calculated using the Newton-Raphson<br />
iteration algorithm.<br />
• In a second step a multiplication is performed to calculate<br />
$Q = N \cdot (1/D) = N/D$<br />
• Newton-Raphson in general is used to calculate the zeros of<br />
a function $f(x)$, where a zero is a solution of $f(x) = 0$.<br />
[Figure: plot of a function $f(x)$ over $x$ with its zero marked on the $x$ axis.]<br />
Newton-Raphson Algorithm for Reciprocation<br />
• Let $x_0$ be the first approximation and $x_i$ be the estimate at the<br />
$i$th step, then $x_{i+1}$ is:<br />
\[ x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} \]<br />
• Substituting $f(x)$ by our reciprocal-finding function $f(x) =<br />
1/x - D$, with $f'(x) = -1/x^2$, we find:<br />
$x_{i+1} = x_i (2 - D x_i)$<br />
[Figure: plot of $f(x) = \frac{1}{x} - D$ with the tangent at $x_i$ (slope $dy/dx$) crossing the $x$ axis at $x_{i+1}$.]<br />
Iterations: Newton-Raphson Algorithm for Reciprocation<br />
• Example $D = 0.13$, choose initial value: $x_0 = 4.0$<br />
• Calculate iteration step 3: $x_3 = 7.67$<br />
• The convergence is quadratic.<br />
• Again, initial values can be generated by ROMs or other<br />
methods to reduce the number of iterations.<br />
[Figure: plot of $f(x) = \frac{1}{x} - D$ with the iteration $x_{i+1} = x_i \cdot (2 - D \cdot x_i)$ marked on the $x$ axis: $x_0 = 4$, $x_1 = 5.92$, $x_2 = 5.92 \cdot (2 - 0.13 \cdot 5.92) = 7.28$, approaching the zero at $x = 7.69$.]<br />
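The iteration values of the example can be reproduced with a few lines of Python. The function name `reciprocal` is hypothetical, and floating-point numbers stand in for the fixed-point hardware; the start value is assumed to satisfy $0 < x_0 < 2/D$, otherwise the iteration does not converge.

```python
def reciprocal(d, x0, iterations):
    """Return the Newton-Raphson estimates x_0, x_1, ... of 1/d."""
    x = x0
    history = [x]
    for _ in range(iterations):
        x = x * (2 - d * x)    # x_{i+1} = x_i * (2 - D * x_i)
        history.append(x)
    return history


for i, x in enumerate(reciprocal(0.13, 4.0, 5)):
    print(i, round(x, 4))      # the estimates approach 1 / 0.13
```

Only multiplications and a subtraction from 2 are needed, which is why the method maps well onto a fast parallel multiplier.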
5 Elementary Functions<br />
5.1 Basics<br />
In a seminar at MicroLab a new architecture has been found<br />
for reciprocal calculation based on the Newton-Raphson iteration<br />
and a second order polynomial for initial value calculation. The<br />
architecture is efficient as both steps, initial value calculation<br />
and Newton-Raphson iteration, can be executed on the identical<br />
hardware unit [HSGJ10].<br />
Elementary Functions<br />
• Elementary functions can be realized in hardware by different<br />
methods. Some elementary functions:<br />
$e^x$, $\ln(x)$, $\sin(x)$, $\cos(x)$, $\tanh(x)$, $\arcsin(x)$, etc.<br />
• ROM look-up tables might be used, but are large: for $n \ge 20$ bits<br />
(a table of $2^n$ entries of $n$ bits each) the memory size is $\ge 2.6$ MB.<br />
• The CORDIC (coordinate rotation digital computer) algorithm<br />
can be used.<br />
• Taylor series expansion can be used, but sometimes shows<br />
poor convergence; e.g.:<br />
$$e^x = \sum_{i=0}^{\infty} \frac{x^i}{i!}$$<br />
• Polynomial approximations can be used, etc.<br />
For different elementary functions there exist pre-calculated<br />
constants for the polynomial approximation. Tables with such<br />
constants can be found in [?].<br />
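The poor convergence of the Taylor series for $e^x$ can be made visible by counting how many terms are needed for a given accuracy. The sketch below is only an illustration: the helper `taylor_terms_needed` and the tolerance of $10^{-6}$ are assumptions made here, and the library function `math.exp` serves as the reference value.<br />

```python
import math


def taylor_terms_needed(x, rel_tol=1e-6, max_terms=200):
    """Count the terms of sum x^i / i! needed to reach rel_tol for e^x."""
    exact = math.exp(x)
    term = 1.0   # current term x^i / i!, starting with i = 0
    total = 1.0  # partial sum
    n = 1        # number of terms summed so far
    while abs(total - exact) > rel_tol * exact and n < max_terms:
        term *= x / n  # turns x^(n-1)/(n-1)! into x^n/n!
        total += term
        n += 1
    return n


for x in (0.5, 2.0, 10.0):
    print(f"x = {x:5.1f}: {taylor_terms_needed(x)} terms")
```

The term count grows with the magnitude of the argument, which is one reason to reduce the argument to a small range before evaluating a series or polynomial.<br />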
Polynomial Approximation<br />
• Approximation based on two degree-5 polynomials. Example: $e^x$<br />
• We can express the elementary function as:<br />
$$e^x = 2^{x \log_2 e}$$<br />
• Partitioning the exponent $x \log_2 e = I + f$ into its integer<br />
part $I$ and fraction part $f$ gives<br />
$$e^x = 2^I \cdot 2^f$$<br />
• Implementing $2^I$ is straightforward; evaluating $2^f$ can be<br />
done by a rational approximation, such as the quotient of two degree-5<br />
polynomials:<br />
$$2^f = \frac{((((a_5 f + a_4) f + a_3) f + a_2) f + a_1) f + a_0}{((((b_5 f + b_4) f + b_3) f + b_2) f + b_1) f + 1}$$<br />
• $a_i$ and $b_i$ are known constants that depend on the target<br />
elementary function.<br />
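The range reduction and the nested (Horner) evaluation can be sketched in Python. The slides do not list the constants $a_i$ and $b_i$, so this sketch substitutes the coefficients of the degree-5/degree-5 Padé approximant of $e^z$ with $z = f \ln 2$. That is an assumption made here for illustration and not necessarily the set of constants the slides refer to. All function and variable names are hypothetical.<br />

```python
import math


def pade_coefficients(n=5):
    """Stand-in constants a_k, b_k: [n/n] Pade approximant of e^z, z = f*ln(2)."""
    ln2 = math.log(2.0)
    a, b = [], []
    for k in range(n + 1):
        c = (math.factorial(2 * n - k) * math.factorial(n)) / (
            math.factorial(2 * n) * math.factorial(k) * math.factorial(n - k)
        )
        a.append(c * ln2 ** k)              # numerator coefficient of f^k
        b.append((-1) ** k * c * ln2 ** k)  # denominator coefficient, b_0 = 1
    return a, b


def horner(coeffs, f):
    """Evaluate coeffs[0] + coeffs[1]*f + ... in nested form, as on the slide."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * f + c
    return acc


A_COEFFS, B_COEFFS = pade_coefficients()


def exp_rational(x):
    t = x * math.log2(math.e)  # e^x = 2^t
    i = math.floor(t)          # integer part I
    f = t - i                  # fraction part f, 0 <= f < 1
    two_f = horner(A_COEFFS, f) / horner(B_COEFFS, f)
    return math.ldexp(two_f, i)  # 2^I * 2^f; 2^I is an exponent adjustment


print(exp_rational(0.375), math.exp(0.375))
```

In hardware the factor $2^I$ corresponds to a shift or an exponent adjustment, so only the two polynomial evaluations and one division remain.<br />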
5.2 Additive Normalization<br />
There are alternative methods to calculate such elementary functions<br />
which are better suited to hardware implementation.<br />
Many of these known algorithms for evaluating elementary functions<br />
are based on the division-by-convergence algorithm discussed<br />
earlier.<br />
Additive Normalization: Exponential Algorithm<br />
• The algorithm is based on the "division-by-convergence" idea:<br />
when one formula is forced to a constant, the other yields<br />
the result (35).<br />
• To evaluate $y = e^{x_0}$ for a fractional argument $x_0$, we use:<br />
$$x_{i+1} = x_i - \ln(b_i)$$<br />
$$y_{i+1} = y_i \cdot b_i$$<br />
• The $b_i$ are selected in such a way that the sequence of $x_i$<br />
approaches 0, i.e., $x_m = 0$.<br />
• Note that the product $y_i \cdot e^{x_i}$ is unchanged by each step, and that for $m \to \infty$ we have $x_m = 0$:<br />
$$y_{i+1} \cdot e^{x_{i+1}} = y_i \cdot b_i \cdot e^{x_i - \ln(b_i)} = y_i \cdot e^{x_i}$$<br />
$$y_m \cdot e^{x_m} = y_0 \cdot e^{x_0}$$<br />
$$y_m = y_0 \cdot e^{x_0}$$<br />
• The similarity to "division-by-convergence" is now apparent:<br />
instead of keeping $N_i / D_i$ constant, we now keep<br />
$y_i \cdot e^{x_i}$ constant.<br />
Iteration in Exponential Algorithm<br />
• To simplify the multiplication, the $b_i$ are given the form:<br />
$$b_i = (1 + s_i \cdot 2^{-i}) \quad \text{where} \quad s_i \in \{-1, 0, 1\}$$<br />
• The terms $\ln(1 \pm 2^{-i})$ have to be pre-calculated and stored<br />
in a look-up table.<br />
• Substituting into the equations from the previous slide we get:<br />
$$x_{i+1} = x_i - \ln(1 + s_i \cdot 2^{-i})$$<br />
$$y_{i+1} = y_i \cdot (1 + s_i \cdot 2^{-i})$$<br />
• To calculate the exponential function $e^{x_0}$ we have to find<br />
the vector $s = \{s_0, s_1, \cdots, s_{m-1}\}$.<br />
• Restricting $x_0$ to positive fractions we get a simpler scheme<br />
for selecting $s_i \in \{0, 1\}$.<br />
• In step $(i+1)$ we compute $D = x_i - \ln(1 + 2^{-i})$ and set:<br />
if $D \ge 0$ then $s_i = 1$, $x_{i+1} = D$<br />
else $s_i = 0$, $x_{i+1} = x_i$<br />
• The convergence is linear: with $n$ steps we get $n$ bits.<br />
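The selection rule for $s_i \in \{0, 1\}$ can be written as a short loop. The following is a minimal floating-point sketch under the restriction to positive fractional $x_0$ stated above. The function name is hypothetical, and `math.log` stands in for the look-up table of pre-calculated $\ln(1 + 2^{-i})$ values.<br />

```python
import math


def exp_additive_normalization(x0, steps):
    """Approximate e^x0 by forcing x towards 0 (digits s_i in {0, 1})."""
    x, y = x0, 1.0  # y_0 = 1, so y_m approaches e^x0
    digits = []
    for i in range(steps):
        d = x - math.log(1.0 + 2.0 ** -i)  # look-up table entry in hardware
        if d >= 0:
            digits.append(1)
            x = d
            y = y * (1.0 + 2.0 ** -i)  # y + (y shifted right by i) in hardware
        else:
            digits.append(0)           # x and y stay unchanged
    return y, digits


y, s = exp_additive_normalization(0.375, 6)
print(y, s, math.exp(0.375))
```

Because $b_i = 1 + 2^{-i}$, the multiplication $y_i \cdot b_i$ reduces to one shift and one addition.<br />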
Example: Exponential Algorithm<br />
$i$ | $(1+2^{-i})$ | $\ln(1+2^{-i})$ | $(1-2^{-i})$ | $\ln(1-2^{-i})$<br />
0 | $10.0000000000_2$ | 0.693147 | 0 | -<br />
1 | $1.1000000000_2$ | 0.405465 | $0.1000000000_2$ | -0.693147<br />
2 | $1.0100000000_2$ | 0.223144 | $0.1100000000_2$ | -0.287682<br />
3 | $1.0010000000_2$ | 0.117783 | $0.1110000000_2$ | -0.133531<br />
4 | $1.0001000000_2$ | 0.060625 | $0.1111000000_2$ | -0.064539<br />
5 | $1.0000100000_2$ | 0.030772 | $0.1111100000_2$ | -0.031749<br />
Calculate $e^{0.375}$ with 6-bit precision:<br />
• Initial values: $i = 0$, $x_0 = 0.375$, $y_0 = 1$; result: $y_6 = 1.450$<br />
iter 1:<br />
$D = x_0 - \ln(1+2^{-0}) = -0.318$, $s_0 = 0$<br />
$b_0 = (1 + s_0 \cdot 2^{-0}) = 1.000$<br />
$x_1 = x_0 = 0.375$, $y_1 = y_0 \cdot b_0 = 1.000$<br />
iter 2:<br />
$D = x_1 - \ln(1+2^{-1}) = -0.030$, $s_1 = 0$<br />
$b_1 = (1 + s_1 \cdot 2^{-1}) = 1.000$<br />
$x_2 = x_1 = 0.375$, $y_2 = y_1 \cdot b_1 = 1.000$<br />
iter 3:<br />
$D = x_2 - \ln(1+2^{-2}) = +0.152$, $s_2 = 1$<br />
$b_2 = (1 + s_2 \cdot 2^{-2}) = 1.250$<br />
$x_3 = D = 0.152$, $y_3 = y_2 \cdot b_2 = 1.250$<br />
iter 4:<br />
$D = x_3 - \ln(1+2^{-3}) = +0.034$, $s_3 = 1$<br />
$b_3 = (1 + s_3 \cdot 2^{-3}) = 1.125$<br />
$x_4 = D = 0.034$, $y_4 = y_3 \cdot b_3 = 1.406$<br />
iter 5:<br />
$D = x_4 - \ln(1+2^{-4}) = -0.027$, $s_4 = 0$<br />
$b_4 = (1 + s_4 \cdot 2^{-4}) = 1.000$<br />
$x_5 = x_4 = 0.034$, $y_5 = y_4 \cdot b_4 = 1.406$<br />
iter 6:<br />
$D = x_5 - \ln(1+2^{-5}) = +0.003$, $s_5 = 1$<br />
$b_5 = (1 + s_5 \cdot 2^{-5}) = 1.031$<br />
$x_6 = D = 0.003$, $y_6 = y_5 \cdot b_5 = 1.450$<br />
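The same six iterations can be sketched with integers only, which is closer to a hardware data path: a subtractor with a look-up table for $x$, and a shifter plus adder for $y$. The word format with 16 fractional bits and all names below are assumptions made for this sketch. The slides do not specify a word length beyond the 10 fractional bits shown in the table.<br />

```python
import math

FRAC_BITS = 16  # assumed number of fractional bits of the fixed-point format


def build_ln_table(steps, frac_bits=FRAC_BITS):
    """Pre-calculated ln(1 + 2^-i), rounded to fixed point (the ROM contents)."""
    return [round(math.log(1.0 + 2.0 ** -i) * (1 << frac_bits)) for i in range(steps)]


def exp_shift_add(x0, steps=6, frac_bits=FRAC_BITS):
    """Return e^x0 as a fixed-point integer with frac_bits fractional bits."""
    table = build_ln_table(steps, frac_bits)
    x = round(x0 * (1 << frac_bits))  # x_0 in fixed point
    y = 1 << frac_bits                # y_0 = 1.0 in fixed point
    for i in range(steps):
        d = x - table[i]
        if d >= 0:        # s_i = 1
            x = d
            y += y >> i   # y * (1 + 2^-i) as shift and add
    return y


y_fixed = exp_shift_add(0.375)
print(y_fixed, y_fixed / (1 << FRAC_BITS))
```

For $x_0 = 0.375$ the integer result corresponds to $1.4501953125$, the unrounded value behind $y_6 = 1.450$ in the example.<br />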
5.3 Multiplicative Normalization<br />
Multiplicative Normalization: Exponential Algorithm<br />
• To calculate $e^x$ we did continued summation of the terms<br />
$\ln(1 + s_i \cdot 2^{-i})$.<br />
• This procedure is called additive normalization.<br />
• In a similar way we may define a multiplicative normalization<br />
where $x_i$ is forced to 1 by continued multiplication<br />
with precalculated factors. We now approximate $y = \ln x$.<br />
• Selecting $b_i$ such that $x_{i+1}$ approaches 1, we thus have:<br />
$$x_{i+1} = x_i \cdot b_i$$<br />
$$y_{i+1} = y_i - g(b_i)$$<br />
$$x_{i+1} = x_0 \prod_{l=0}^{i} b_l \to 1$$<br />
• If we select $g(b_l) = \ln(b_l)$ then we finally have:<br />
$$y_m = y_0 - \ln\frac{1}{x_0} = y_0 + \ln x_0$$<br />
• Similar approaches can be taken for the other elementary<br />
functions, like trigonometric, inverse trigonometric, hyperbolic,<br />
etc.<br />
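A logarithm by multiplicative normalization can be sketched along the same lines. The slides do not give a selection rule for $b_i$ here, so the sketch assumes the same factor form $b_i = 1 + s_i \cdot 2^{-i}$ with $s_i \in \{0, 1\}$ as in the exponential algorithm and accepts a factor whenever the product stays at or below 1. With these assumed factors the argument must lie roughly in $0.21 \le x_0 \le 1$. The function name is hypothetical.<br />

```python
import math


def ln_multiplicative_normalization(x0, steps, y0=0.0):
    """Approximate y0 + ln(x0) by forcing x towards 1 from below."""
    x, y = x0, y0
    for i in range(steps):
        b = 1.0 + 2.0 ** -i   # assumed factor form, as in the exponential case
        if x * b <= 1.0:      # accept the factor only if x does not overshoot 1
            x = x * b         # x + (x shifted right by i) in hardware
            y = y - math.log(b)  # look-up table entry ln(1 + 2^-i) in hardware
    return y, x


y, x_final = ln_multiplicative_normalization(0.6, 30)
print(y, math.log(0.6), x_final)
```

The roles of the two recurrences are swapped compared with the exponential algorithm: the multiplicative recurrence is driven to a constant and the additive one collects the result.<br />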
References<br />
[HSGJ10] Andreas Habegger, Andreas Stahel, Josef Goette,<br />
and Marcel Jacomet. An efficient hardware implementation<br />
for a reciprocal unit. In submitted to: The<br />
5th IEEE International Symposium on Electronic Design,<br />
Test and Applications, Ho Chi Minh City, January<br />
13-15, 2010.<br />
[Kor02] Israel Koren. Computer Arithmetic Algorithms. A K<br />
Peters, Natick, Massachusetts, 2nd edition, 2002.<br />