12.07.2015 Views

Synthesis of Floating-Point Addition Clusters on FPGAs Using Carry ...

Synthesis of Floating-Point Addition Clusters on FPGAs Using Carry ...

Synthesis of Floating-Point Addition Clusters on FPGAs Using Carry ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

2010 Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong> Field Programmable Logic and Applicati<strong>on</strong>s<str<strong>on</strong>g>Synthesis</str<strong>on</strong>g> <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>Floating</str<strong>on</strong>g>-point <str<strong>on</strong>g>Additi<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>Clusters</str<strong>on</strong>g> <strong>on</strong><strong>FPGAs</strong> <strong>Using</strong> <strong>Carry</strong>-Save ArithmeticAmit Verma 1 Ajay K. Verma 2 Hadi Parandeh-Afshar 2 Philip Brisk 3 Paolo Ienne 21 Nati<strong>on</strong>al Institute <str<strong>on</strong>g>of</str<strong>on</strong>g> TechnologyRourkela, India2 Ecole Polytechnique Fédérale de LausanneSchool <str<strong>on</strong>g>of</str<strong>on</strong>g> Computer and Communicati<strong>on</strong>SciencesLausanne, Switzerland3 University <str<strong>on</strong>g>of</str<strong>on</strong>g> California, RiversideDepartment <str<strong>on</strong>g>of</str<strong>on</strong>g> Computer Scienceand EngineeringRiverside, CA, USAAbstract—A new method to synthesize clusters <str<strong>on</strong>g>of</str<strong>on</strong>g> floating-pointadditi<strong>on</strong> operati<strong>on</strong>s <strong>on</strong> <strong>FPGAs</strong> is presented. Similar to Altera’sfloating-point data path compiler, it performs normalizati<strong>on</strong><strong>on</strong>ce, at the output <str<strong>on</strong>g>of</str<strong>on</strong>g> the cluster operati<strong>on</strong>. All significands in theclustered operati<strong>on</strong> are denormalized in parallel with respect tothe largest exp<strong>on</strong>ent: a fixed-point compressor tree then sums thealigned significands, followed by normalizati<strong>on</strong> and rounding.Compared to Altera’s floating-point datapath compiler, ourmethod reduces the critical path delay by as much as 20%, andarea by as much as 29% <strong>on</strong> Altera Stratix III <strong>FPGAs</strong>.Keywords-<str<strong>on</strong>g>Floating</str<strong>on</strong>g>-point additi<strong>on</strong>, carry-save arithmetic, FPGAI. INTRODUCTIONResearchers from Altera recently published a floating-pointFPGA datapath compiler, which included some n<strong>on</strong>-standardtransformati<strong>on</strong>s [1, 2]. One <str<strong>on</strong>g>of</str<strong>on</strong>g> the most effective <strong>on</strong>es was t<str<strong>on</strong>g>of</str<strong>on</strong>g>use together additi<strong>on</strong> clusters <str<strong>on</strong>g>of</str<strong>on</strong>g> up to sixteen inputs. Thesignificand is extended with four overflow and three underflowbits, permitting up to sixteen inputs to be added and subtractedwithout overflow or underflow and eliminating intermediarynormalizati<strong>on</strong> steps inside the cluster; this significantly reducesarea and allows more floating-point operators to be synthesized<strong>on</strong> a fixed-size FPGA, increasing the throughput <str<strong>on</strong>g>of</str<strong>on</strong>g> the design.This paper presents a new approach to synthesize floatingpointadditi<strong>on</strong> clusters. All significands are denormalized withrespect to the largest exp<strong>on</strong>ent am<strong>on</strong>g all input operands, and acompressor tree [3] sums the denormalized significands; theresult is normalized, rounded, and packed into a floating-pointnumber al<strong>on</strong>g with the resulting exp<strong>on</strong>ent and sign bit.Compared to Altera’s compiler [1, 2] our approach improvescritical path delay up to 20% and area up to 29%.II.PRELIMINARIESA. IEEE-754 <str<strong>on</strong>g>Floating</str<strong>on</strong>g>-point StandardAn IEEE format number is represented as a triple (S, E, F),where S is a single bit representing the sign, E is the exp<strong>on</strong>ent,and F is the significand; taken together, S and F represent thesignificand in sign-magnitude format. S is a single bit; e and frespectively denote the bitwidth <str<strong>on</strong>g>of</str<strong>on</strong>g> the exp<strong>on</strong>ent andsignificand. For IEEE-754 single-precisi<strong>on</strong> format, usedthroughout this paper, e = 8 and f = 23.A normalized significand includes a hidden ‘1’ that is notexplicitly encoded, i.e., it has the form 1.F; if not, i.e., thesignificand has the form 0.F and is denormalized, as discussedbelow. The range <str<strong>on</strong>g>of</str<strong>on</strong>g> a normalized significand is− f1 ≤ 1.F ≤ 2 − 2 . (1)The exp<strong>on</strong>ent, E, has a base <str<strong>on</strong>g>of</str<strong>on</strong>g> 2 and a bias <str<strong>on</strong>g>of</str<strong>on</strong>g> B = 2 e-1 – 1;for single-precisi<strong>on</strong> format, B = +127. Thus, the valuerepresented by an IEEE-754 floating-point number (S, E, F) isSE−( )( B−11 × 1.F × 2)− . (2)The IEEE-754 format includes several “special” numbersthat are represented explicitly, including positive/negative zero,positive/negative infinity, denormalized numbers, and not-anumber(NaN), which, occurs, for example, when the squareroot <str<strong>on</strong>g>of</str<strong>on</strong>g> a negative number is taken.The number (S, 0, 0) represents zero; because <str<strong>on</strong>g>of</str<strong>on</strong>g> the signbit, both positive and negative zeroes exist. As a c<strong>on</strong>sequence,the value 1.0×2 -B cannot be represented.Any floating-point number <str<strong>on</strong>g>of</str<strong>on</strong>g> the form (S, 0, F ≠ 0) isdenormalized. Denormalized significands have the form 0.F,and represent values <str<strong>on</strong>g>of</str<strong>on</strong>g> the formS−( )( B-11 × 0.F × 2)− . (3)The maximum exp<strong>on</strong>ent value is E max = 2 e – 1 = 2B + 1.The floating-point representati<strong>on</strong> for NaN is (S, E max , F≠ 0) andfor positive/negative infinity is (S, E max , 0).The default rounding mode is round to nearest, and to evenwhen there is a tie; the user can also specify three alternativerounding modes: round toward positive or negative infinity, orround toward zero (truncati<strong>on</strong>).The IEEE-754 standard specifies six numerical operati<strong>on</strong>s:additi<strong>on</strong>, subtracti<strong>on</strong>, multiplicati<strong>on</strong>, divisi<strong>on</strong>, remainder, andsquare root. The standard also specifies rules for c<strong>on</strong>verting toand from the different floating-point formats (e.gshort/integer/l<strong>on</strong>g to/from single/double/quad-precisi<strong>on</strong>), andc<strong>on</strong>versi<strong>on</strong> between the different floating-point formats.The standard includes five excepti<strong>on</strong>s involving specialnumbers; they set a flag and permit computati<strong>on</strong> to c<strong>on</strong>tinue.The excepti<strong>on</strong>s are as follows: (1) if an operati<strong>on</strong> overflows,978-0-7695-4179-2/10 1946-1488/10 $26.00 © $26.00 2010 IEEE© 2010 IEEEDOI 10.1109/FPL.2010.1519


the result is positive or negative infinity, depending <strong>on</strong> the signbit; (2) underflow (when a rounded value is so small that itcannot be represented); (3) divisi<strong>on</strong> by zero; (4) inexact results,i.e., when the result <str<strong>on</strong>g>of</str<strong>on</strong>g> an infinite precisi<strong>on</strong> operati<strong>on</strong> cannot berepresented in the fixed-precisi<strong>on</strong> format being used; and (5)invalid (the result <str<strong>on</strong>g>of</str<strong>on</strong>g> the operati<strong>on</strong> is NaN). There are alsoadditi<strong>on</strong>al rules involving arithmetic involving positive andnegative infinities and NaN, e.g., x op NaN = NaN. Theinterested reader should c<strong>on</strong>sult the IEEE standard for details.B. <str<strong>on</strong>g>Floating</str<strong>on</strong>g>-point <str<strong>on</strong>g>Additi<strong>on</strong></str<strong>on</strong>g> and Subtracti<strong>on</strong>This secti<strong>on</strong> describes the c<strong>on</strong>cepts underlying floatingpointadditi<strong>on</strong>. The discussi<strong>on</strong> focuses <strong>on</strong> a naïve, but small,implementati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> a floating-point adder. More compleximplementati<strong>on</strong>s, such as dual-path adders [4], are generallypoor choices <strong>on</strong> which to base fused FPGA operators, becausethey c<strong>on</strong>sume too much area.The inputs are floating-point numbers N 1 = (S 1 , E 1 , F 1 ) and N 2=(S 2 , E 2 , F 2 ), al<strong>on</strong>g with a single bit, op that indicates whetheradditi<strong>on</strong> (op = 0) or subtracti<strong>on</strong> (op = 1) is to be performed.The output is a normalized IEEE-754 floating-point number N 3= (S 3 , E 3 , F 3 ) = N 1 op N 2 . The five key steps, as follows, areillustrated in Figure 1(a). Special cases (zero, infinity, NaN) arehandled externally; if the result is a special case, then the maindatapath will be preempted.Max.Exp: One cannot add or subtract two significands whoseexp<strong>on</strong>ents differ. The first step <str<strong>on</strong>g>of</str<strong>on</strong>g> the floating-point additi<strong>on</strong>operator is to compute the maximum <str<strong>on</strong>g>of</str<strong>on</strong>g> the two exp<strong>on</strong>ents.Denormalize: The sec<strong>on</strong>d step is to shift the significandcorresp<strong>on</strong>ding to the smaller exp<strong>on</strong>ent right by the exp<strong>on</strong>entdifference, ΔE = |E 1 – E 2 |, effectively dividing the significandby 2 ΔE ; implicitly, this is compensated by adding ΔE to thesmaller exp<strong>on</strong>ent, equalizing it with the larger exp<strong>on</strong>ent. Theright shift denormalizes the smaller significand if ΔE > 0.Let f be the precisi<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> the initial significand; thedenormalized significand has precisi<strong>on</strong> f+3. The bit in positi<strong>on</strong>3 is called the guard bit (G); the bit in positi<strong>on</strong> 2 is called theround bit (R); the bit in positi<strong>on</strong> 1, which is the logical-or <str<strong>on</strong>g>of</str<strong>on</strong>g> all<str<strong>on</strong>g>of</str<strong>on</strong>g> the remaining bits that have been shifted <str<strong>on</strong>g>of</str<strong>on</strong>g>f, is called thesticky bit (T); these bits are required later for rounding.Eff.Op: With equal exp<strong>on</strong>ents, it is now safe to add thesignificands. The effective operati<strong>on</strong> (EOP) refers to theoperati<strong>on</strong> that is performed (either additi<strong>on</strong> or subtracti<strong>on</strong>), anddepends <strong>on</strong> the input bit op and the signs <str<strong>on</strong>g>of</str<strong>on</strong>g> the two operands.Prior work has shown that EOP = op ⊕ S 1 ⊕ S 2 , where EOP =0 indicates additi<strong>on</strong> and EOP = 1 indicates subtracti<strong>on</strong> [4].Normalize: The result <str<strong>on</strong>g>of</str<strong>on</strong>g> the effective operati<strong>on</strong> may not benormalized. The next step is normalizati<strong>on</strong>. Different stepsmust be taken, depending <strong>on</strong> the effective operati<strong>on</strong>.First, c<strong>on</strong>sider effective additi<strong>on</strong> (EOP = 0). Assume that E 1 =E 2 , meaning that no alignment shift is required. In this case, theinputs to add are normalized, having the form 1.F 1 and 1.F 2 ,yielding a result <str<strong>on</strong>g>of</str<strong>on</strong>g> the form 10.F 3 , which indicates an overflow<str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>on</strong>e bit positi<strong>on</strong>. In this case, we shift 10.F 3 right by <strong>on</strong>e t<strong>on</strong>ormalize and increment the exp<strong>on</strong>ent. If the result <str<strong>on</strong>g>of</str<strong>on</strong>g> theeffective operati<strong>on</strong> is normalized, then no shift is required.For effective subtracti<strong>on</strong> (EOP = 1), the result may bedenormalized, and the number <str<strong>on</strong>g>of</str<strong>on</strong>g> leading zeroes before the first<strong>on</strong>e cannot be determined statically. The case where the resultis zero is handled as a special case. Assume that the significandresulting from the sec<strong>on</strong>d effective operati<strong>on</strong> has the form 0.F 3 ;we must first count the number <str<strong>on</strong>g>of</str<strong>on</strong>g> leading zeroes in 0.F 3 usinga comp<strong>on</strong>ent called a leading zero detector (LZD); supposingthat there are L leading zeroes, we shift 0.F 3 left by L(effectively multiplying it by 2 L ) and subtract L from theresulting exp<strong>on</strong>ent to compensate; the result is normalized.Round: This final step accounts for the error incurred due to thebits that were shifted right during significand alignment. Avalue <str<strong>on</strong>g>of</str<strong>on</strong>g> 1 is added to the significand based <strong>on</strong> the roundingmode specified and the values <str<strong>on</strong>g>of</str<strong>on</strong>g> the guard, round, and stickybits, G, R, and T, as discussed earlier; this may <strong>on</strong>ce againcause overflow, such that the result has the form 10.F; this caseis handled identically to normalizati<strong>on</strong> for effective additi<strong>on</strong>:the significand is shifted right by <strong>on</strong>e, normalizing it, and theexp<strong>on</strong>ent is incremented.C. Clustered <str<strong>on</strong>g>Additi<strong>on</strong></str<strong>on</strong>g> in Altera’s <str<strong>on</strong>g>Floating</str<strong>on</strong>g>-point DatapathCompilerAltera’s floating-point datapath compiler removes internalnormalizati<strong>on</strong> and rounding operati<strong>on</strong>s within a cluster; itperforms these operati<strong>on</strong>s <strong>on</strong>ly at the output, as illustrated inFig. 1(b). Traditi<strong>on</strong>al IEEE-754 floating-point operators use asign-magnitude representati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> significands inside a cluster;Altera’s compiler c<strong>on</strong>verts the significand to a two’scomplement integer instead, and extends the significand from24 bits (including the hidden ‘1’) to 32 bits (which includes thesign bit). Four additi<strong>on</strong>al bits are provided for overflow andthree additi<strong>on</strong>al bits are provided for underflow. The hidden ‘1’no l<strong>on</strong>ger remains in the most significant positi<strong>on</strong>.Within each internal two-input operator, the exp<strong>on</strong>entsmust still be aligned before the effective operati<strong>on</strong> can beperformed. The denormalized significand corresp<strong>on</strong>ding to thesmaller exp<strong>on</strong>ent must still be shifted right by the exp<strong>on</strong>entdifference, ΔE. After the final fixed-point additi<strong>on</strong> operati<strong>on</strong> inthe cluster, the signed significand is c<strong>on</strong>verted back to signmagnitudeform; this is required for normalizati<strong>on</strong>. Altera’spapers do not explicitly state which rounding modes aresupported, whether or not normalizati<strong>on</strong> uses a sticky bit, orhow the sticky bit is computed (bey<strong>on</strong>d the extendedsignificand format) if <strong>on</strong>e is used.With four additi<strong>on</strong>al overflow bits, up to 16 normalizedpositive numbers can be effectively added before overflow;overflow occurs when the signifcand has the form 10000.z.This permits <strong>on</strong>e normalizati<strong>on</strong> operati<strong>on</strong> to be performed afteradding 16 floating-point numbers; however, the length <str<strong>on</strong>g>of</str<strong>on</strong>g> theright shift required to normalize the significand will range from0 to 4; in the floating-point adder described in the precedingsecti<strong>on</strong>, the length <str<strong>on</strong>g>of</str<strong>on</strong>g> the shift ranged from 0 to 1.The three underflow bits help to mitigate negativewordgrowth during effective subtracti<strong>on</strong>; however, these bitsmay be lost due to the right shifting during significandalignment (denormalizati<strong>on</strong>). For large shift values duringdenormalizati<strong>on</strong> and effective subtracti<strong>on</strong>, the result may beless accurate than using IEEE-754 operators.20


N 1 N 2N 1 N 2 N 3 N 4N 1 N 2 N 3 N 4Max.ExpMax.ExpMax.Exp4-input Max.ExpDenormEff.OpDenormEff.OpDenormEff.Op4-way Denorm4-input Eff.OpNormRoundMax.ExpDenormNormRound(a)Eff.OpNorm(c)(b)RoundFigure 1. The five main stages <str<strong>on</strong>g>of</str<strong>on</strong>g> a traditi<strong>on</strong>al IEEE-754 floating point adder/subtractor (a); a four-input floating-point additi<strong>on</strong> cluster as generated byAltera’s floating-point datapath compiler (b) and our improved method (c).If a tree <str<strong>on</strong>g>of</str<strong>on</strong>g> IEEE-754 operators is used to perform clusteredadditi<strong>on</strong>, each normalizati<strong>on</strong> and rounding step at <strong>on</strong>e level willbe followed immediately by a denormalizati<strong>on</strong> at the next step;this normalizati<strong>on</strong>/denormalizati<strong>on</strong> is wholly redundant foreffective additi<strong>on</strong> when the expanded significand width isemployed, because normalizati<strong>on</strong> is <strong>on</strong>ly used to fix overflow;extending the significand with additi<strong>on</strong>al overflow bitseliminates the need to normalize after every effective additi<strong>on</strong>.In the general case, extending the significand by k bits permitseffective additi<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> up to 2 k operands followed by <strong>on</strong>enormalizati<strong>on</strong> operati<strong>on</strong> while guaranteeing that overflow willnot occur; underflow, in c<strong>on</strong>trast, cannot be eliminated.III. A NEW APPROACH TO FLOATING-POINT ADDITIONCLUSTER SYNTHESIS ON FPGASFigure 1(c) depicts a simplified versi<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> our approach t<str<strong>on</strong>g>of</str<strong>on</strong>g>loating-point clustered additi<strong>on</strong> for four input operands. First,the maximum <str<strong>on</strong>g>of</str<strong>on</strong>g> the four exp<strong>on</strong>ents is found: sec<strong>on</strong>d, the threesignificands corresp<strong>on</strong>ding to the n<strong>on</strong>-maximal exp<strong>on</strong>ents aredenormalized relative to the maximal exp<strong>on</strong>ent; third, theeffective operati<strong>on</strong> is applied to the four significands usingcarry-save additi<strong>on</strong> (they are c<strong>on</strong>verted to two’s complementand sign-extended to 32 bits before the effective operati<strong>on</strong>, andback to sign-magnitude afterward); lastly, normalizati<strong>on</strong> androunding are applied. Fig. 2 shows a detailed descripti<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> themulti-operand floating-point adder, including the signextensi<strong>on</strong> and two’s complement c<strong>on</strong>versi<strong>on</strong>/dec<strong>on</strong>versi<strong>on</strong>operati<strong>on</strong>s discussed in the preceding paragraph. The keyfeatures <str<strong>on</strong>g>of</str<strong>on</strong>g> this clustered adder are discussed in the followingsubsecti<strong>on</strong>s.A. Max.ExpThe most straightforward way to compute the maximum <str<strong>on</strong>g>of</str<strong>on</strong>g>a set <str<strong>on</strong>g>of</str<strong>on</strong>g> k numbers is to use a tree <str<strong>on</strong>g>of</str<strong>on</strong>g> 2-input MAX operators. Ifn is the bitwidth <str<strong>on</strong>g>of</str<strong>on</strong>g> the input operators, the delay complexity <str<strong>on</strong>g>of</str<strong>on</strong>g>the MAX operator is O(logn), assuming the use <str<strong>on</strong>g>of</str<strong>on</strong>g> a fast prefixadder to perform subtracti<strong>on</strong>. A tree <str<strong>on</strong>g>of</str<strong>on</strong>g> 2-input MAX operatorsthat computes the maximum <str<strong>on</strong>g>of</str<strong>on</strong>g> k operands has a height <str<strong>on</strong>g>of</str<strong>on</strong>g>O(logk). Therefore, the delay complexity <str<strong>on</strong>g>of</str<strong>on</strong>g> a tree <str<strong>on</strong>g>of</str<strong>on</strong>g> two-inputMAX operators is O(logn × logk).Our implementati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> the MAX operator uses a differentapproach, which shares some principle similarities to leading<strong>on</strong>e’s counters. Our input is a set <str<strong>on</strong>g>of</str<strong>on</strong>g> k n-bit unsigned numbers(exp<strong>on</strong>ents are unsigned). Intuitively, the maximum unsignedinteger has the l<strong>on</strong>gest string <str<strong>on</strong>g>of</str<strong>on</strong>g> c<strong>on</strong>secutive ‘1’s in the mostsignificant positi<strong>on</strong>s; however, a situati<strong>on</strong> may exist wherein all<str<strong>on</strong>g>of</str<strong>on</strong>g> the unsigned integers have at least <strong>on</strong>e leading zero.Step 1. The first step is to compute a dominati<strong>on</strong> table: if allbits in some positi<strong>on</strong> are ‘0’ for all integers, we flip these bitsto ‘1’. This guarantees that at least the maximum unsignedinteger am<strong>on</strong>g the inputs begins with a string <str<strong>on</strong>g>of</str<strong>on</strong>g> c<strong>on</strong>secutive‘1’s at the most significant positi<strong>on</strong>. If this occurs at bitpositi<strong>on</strong> i, this effectively adds 2 i to each <str<strong>on</strong>g>of</str<strong>on</strong>g> the integers, whichdoes not affect their relative ordering.Step 2. Next, we count the number <str<strong>on</strong>g>of</str<strong>on</strong>g> leading <strong>on</strong>es in each bitpositi<strong>on</strong>. For an n-bit unsigned integer, the number <str<strong>on</strong>g>of</str<strong>on</strong>g> leading<strong>on</strong>es itself is a ⎡log(n+1)⎤-bit unsigned integer. Thus, a list <str<strong>on</strong>g>of</str<strong>on</strong>g> kn-bit unsigned integers is reduced to a list <str<strong>on</strong>g>of</str<strong>on</strong>g> k ⎡log(n+1)⎤-bitunsigned integers. We then compute the maximum unsignedinteger in the sec<strong>on</strong>d list recursively. Terminati<strong>on</strong> is eventuallyguaranteed, since the bitwidth <str<strong>on</strong>g>of</str<strong>on</strong>g> the unsigned integers isreduced logarithmically at each stage.Figure 3 shows a list <str<strong>on</strong>g>of</str<strong>on</strong>g> four 8-bit unsigned integers: {22,59, 62, 61}. The maximum am<strong>on</strong>g them, 62, is computed inthree recursive steps; numbers in parenthesis are the base-10representati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> the input integers and number <str<strong>on</strong>g>of</str<strong>on</strong>g> leading <strong>on</strong>esin the dominati<strong>on</strong> table. In the example, the dominati<strong>on</strong> tablediffers from the integer <strong>on</strong>ly at the first stage.Each bit <str<strong>on</strong>g>of</str<strong>on</strong>g> the dominati<strong>on</strong> table is computed independentlyin time O(log k), while the table area is O(nk). The delay <str<strong>on</strong>g>of</str<strong>on</strong>g> ann-bit leading <strong>on</strong>e detector is O(logn) and its area is O(n). Sincewe count the leading ‘1’s <str<strong>on</strong>g>of</str<strong>on</strong>g> k integers in parallel, the spacecomplexity is O(nk). To account for the subsequent stages <str<strong>on</strong>g>of</str<strong>on</strong>g>the recursi<strong>on</strong>, we use the following recurrence relati<strong>on</strong>s, T(n, k)and A(n, k) for time and space complexity respectively:( n,k) O( logk + logn) T ( logn,k)T = +(4)( n,k ) O( nk) T ( logn, k)A = +(5)21


E 1… E kF 1 …F kInsert Hidden ‘1’Max Exp<strong>on</strong>entE Max-- -S 11 0ΔE 1… ΔE k01.F 1 1.F k0…S k-1 0C<strong>on</strong>vert Significands to2’s Complement FormExtend Significands to 32 bits>>>>…Denormalizati<strong>on</strong>signMulti-operand Fixed-point Adder0F Sum-1 0C<strong>on</strong>vert Significand toSign-magnitude FormCount Up to 4 Leading ‘1’sLZD+-Normalizati<strong>on</strong>SpecialCases0110E Norm = Norm(E Sum)0F Norm = Norm(F Sum)NaN,+/- infinity, inexact,exp<strong>on</strong>ent overflowand underflowS k+1+E k+1 = Round(E Norm)Round OverflowRound Significand and Shorten to 23-bits,Dropping the Hidden ‘1’F k+1 = Round(F Norm)Figure 2. Illustrati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> our multi-operand floating-point additi<strong>on</strong> clusterThe soluti<strong>on</strong>s to the two recurrence relati<strong>on</strong>s are as follows;we use the shorthand log*n to denote log(log(… (logn)...)):( n,k ) O(log n + log * n log k)T = ×(6)( n,k) O( nk)A = (7)This MAX circuit ourselves was discovered automaticallyby an internally-developed logic synthesis tool that usesdecompositi<strong>on</strong> methods to restructure arithmetic circuits [5]. Inactuality, this circuit may have been discovered in the past andpublished elsewhere; if so, we remain unaware <str<strong>on</strong>g>of</str<strong>on</strong>g> its existence.22


Integer Dom.Table Leading 1’s(22) 0010110 1010110 001 (1)(59) 0111011 1111011 100 (4)(62) 0111110 1111110 110 (6)(61) 0111101 1111101 101 (5)(1) 001 001 00 (0)(4) 100 100 01 (1)(6) 110 110 10 (2)(5) 101 101 01 (1)(0) 00 00 0(1) 01 01 0(2) 10 10 1 (MAX)(1) 01 01 0Figure 3. Illustrati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> a multi-operand MAX functi<strong>on</strong> implementedusing a dominati<strong>on</strong> table.B. C<strong>on</strong>versi<strong>on</strong> to Sign-MagnitudeThe output <str<strong>on</strong>g>of</str<strong>on</strong>g> the multi-operand adder is a two’scomplement integer; the expanded significand ensures thatoverflow does not occur if we add sixteen or fewer operands.The c<strong>on</strong>versi<strong>on</strong> back to sign-magnitude representati<strong>on</strong> includesa subtracti<strong>on</strong> operati<strong>on</strong>, which is applied when the sign isnegative. For a 16-operand implementati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> our adder, thec<strong>on</strong>versi<strong>on</strong>s to and from two’s complement create a maximalerror <str<strong>on</strong>g>of</str<strong>on</strong>g> at most 1 unit in the last positi<strong>on</strong> (ulp). This error isbounded because the significand’s precisi<strong>on</strong> has been extendedby four extra underflow bits in the least significant positi<strong>on</strong>s.The error occurs because right-shifting an integer and thennegating it is not the same as negating an integer and thenright-shifting it, i.e., –(A >> k) ≠ (–A) >> k. For example, let A= 10111 and k = 2; here, the bits shifted in are equal to the signbit that has been shifted right to preserve the positive/negativestatus <str<strong>on</strong>g>of</str<strong>on</strong>g> the shifter integer. In this case, –(A >> 2) = –(11101)= (00010 + 1) = 00011; in c<strong>on</strong>trast, (–A) >> 2 = –(10111) >>2 = (01000 + 1) >> 2 = 01001 >> 2 = 00010.Subtracti<strong>on</strong> produces a carry-out bit; when shifting beforesubtracting, some <str<strong>on</strong>g>of</str<strong>on</strong>g> the bits that may affect the value <str<strong>on</strong>g>of</str<strong>on</strong>g> thecarry-out bit are shifted right, and are not c<strong>on</strong>sidered whenperforming the subtracti<strong>on</strong>. In other words, if L k is the <strong>on</strong>e’scomplement <str<strong>on</strong>g>of</str<strong>on</strong>g> the k least significant bits <str<strong>on</strong>g>of</str<strong>on</strong>g> A, that wereshifted out in the former case, the following identity holds:( A >> k) = (( − A)>> k) + ( carry ( 1 + ))− (8)out L kSince the value <str<strong>on</strong>g>of</str<strong>on</strong>g> carry out (1 + L k ) is either 0 or 1, there is atmost ulp <str<strong>on</strong>g>of</str<strong>on</strong>g> error if we approximate –(A >> k) as (–A) >> k.Since we are using four underflow bits, this error willpropagate to the actual least significant bits <strong>on</strong>ly if we add 16integers and all <str<strong>on</strong>g>of</str<strong>on</strong>g> them have this error. If, for example, wewanted to guarantee a similar error for 32-operand additi<strong>on</strong>, wewould need to add a fifth underflow bit to the significand.C. RoundingAltera does not specify which rounding modes aresupported in their floating-point datapath compiler; nor do theyspecify a method to compute a sticky bit inside <str<strong>on</strong>g>of</str<strong>on</strong>g> an additi<strong>on</strong>cluster. One possibility is that they may compute the sticky bitbased <strong>on</strong> the right shift in the denormalizati<strong>on</strong> stage <str<strong>on</strong>g>of</str<strong>on</strong>g> the last2-input additi<strong>on</strong> in the cluster; however, this sticky bit wouldnot necessarily be accurate, as the input significands arethemselves denormalized. This latter possibility would beimpossible to implement with our approach, as a compressortree rather than a tree <str<strong>on</strong>g>of</str<strong>on</strong>g> adders, is used to perform significandadditi<strong>on</strong>. Since we use the same internal significandrepresentati<strong>on</strong> as Altera, we have implemented our own ad-hocrounding mode, and we have used it in both our design and ourimplementati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> Altera’s clustered additi<strong>on</strong> operati<strong>on</strong>s. Thisrounding mode uses the four underflow bits to mimic the IEEEround to +/– infinity, but it does not use a sticky bit. Thedrawback is that if all n<strong>on</strong>-zero bits were right-shifted outduring alignment, then the <strong>on</strong>ly way to account for theirexistence would be to use a sticky bit, which we do not.IV. EXPERIMENTAL RESULTSIn this secti<strong>on</strong>, we compare our proposed technique tosynthesize single-precisi<strong>on</strong> clustered floating-point operands tothe technique employed in Altera’s floating-point datapathcompiler [1, 2]; we also compare with a tree <str<strong>on</strong>g>of</str<strong>on</strong>g> IEEEcomplianttwo-input floating-point adders as well. We c<strong>on</strong>siderclusters <str<strong>on</strong>g>of</str<strong>on</strong>g> 2, 4, 8, and 16 operands.We implemented the three clustered floating-point adderdesigns in all four cluster sizes in VHDL and synthesized them<strong>on</strong> an Altera Stratix III FPGA, performing synthesis withAltera’s Quartus II s<str<strong>on</strong>g>of</str<strong>on</strong>g>tware. We synthesized each design as apurely combinati<strong>on</strong>al circuit as to enable a fair comparis<strong>on</strong>.Delay, in both cases, were reported in nanosec<strong>on</strong>ds (ns); area isreported in adaptive LUTs (ALUTs). It is important to note thatALUTs are an Altera-specific metric, and that area usage <strong>on</strong> adifferent vendor’s FPGA may be different due to architecturaldifferences, and variati<strong>on</strong>s in CAD flows targeting them.Fig. 4(a) and (b) show the respective delay and area. Trees<str<strong>on</strong>g>of</str<strong>on</strong>g> IEEE 2-input floating-point adders were n<strong>on</strong>-competitivecompared to Altera’s floating-point datapath compiler [1, 2] formore than two operands. Two implementati<strong>on</strong>s <str<strong>on</strong>g>of</str<strong>on</strong>g> our synthesismethod are shown: <strong>on</strong>e which builds the multi-operand adderas a tree <str<strong>on</strong>g>of</str<strong>on</strong>g> 3-input adders (with each ALM c<strong>on</strong>figured inShared Arithmetic Mode), and <strong>on</strong>e that implements it using acarry-save-based compressor tree [3]; the former achievestigher area bounds, while the latter improves critical path delay.For 4, 8, and 16-operand adder trees, the respectiveimprovements in critical path delay compared to Altera’sfloating-path datapath compiler were 9%, 8%, and 6% for ourapproach using adder trees, and 16%, 17%, and 20% usingcompressor trees; similarly, the improvements in area were21%, 29%, and 26% using adder trees, and 17%, 25%, and20% using compressor trees.23


8070Delay50004500Area6040003500ns504030ALMs3000250020002010150010005000(a)02 4 81616 2 48 16Number <str<strong>on</strong>g>of</str<strong>on</strong>g> Cluster Operands(b)Number <str<strong>on</strong>g>of</str<strong>on</strong>g> Cluster OperandsTree <str<strong>on</strong>g>of</str<strong>on</strong>g> IEEE 2-input floating-point addersAltera’s floating-point datapath compilerOur method (tree <str<strong>on</strong>g>of</str<strong>on</strong>g> 3-input adders)Our method (compressor tree [3])Figure 4. Experimental results that compare and c<strong>on</strong>trast our method for clustered single-precisi<strong>on</strong> floating-point additi<strong>on</strong> synthesis with Altera’smethod and a tree <str<strong>on</strong>g>of</str<strong>on</strong>g> IEEE 2-input floating-point adders for cluster sizes <str<strong>on</strong>g>of</str<strong>on</strong>g> 2, 4, 8, and 16 synthesized <strong>on</strong> an Altera Stratix III FPGA. For cluster sizes <str<strong>on</strong>g>of</str<strong>on</strong>g>4, 8, and 16, our method has the lowest delay and c<strong>on</strong>sumes the smallest area in comparis<strong>on</strong> to the others.We c<strong>on</strong>sider the improvement in area to be more significantthat the improvement in delay for two reas<strong>on</strong>s: the delays andlatencies (number <str<strong>on</strong>g>of</str<strong>on</strong>g> cycles) for our approach and Altera’s arelikely to be comparable after pipeline registers are inserted;moreover, the critical path <str<strong>on</strong>g>of</str<strong>on</strong>g> a complete design that includes apipelined floating-point datapath may occur elsewhere in thedesign, such as the c<strong>on</strong>trol path or the interface to memory. Inc<strong>on</strong>trast, the reducti<strong>on</strong> in area is not subject to pipelining or thec<strong>on</strong>text in which the additi<strong>on</strong> cluster is deployed.VI. CONCLUSIONThis paper has introduced a new method to synthesizedfused floating-point additi<strong>on</strong> clusters <strong>on</strong> <strong>FPGAs</strong>, whichimproves the area and critical path delay compared to themethods currently used in Altera’s floating-point datapathcompiler [1, 2]. It takes advantage <str<strong>on</strong>g>of</str<strong>on</strong>g> advances in <strong>FPGAs</strong>ynthesis for multi-operand additi<strong>on</strong>s using carry savearithmetic [3], and includes a new approach to computing themaximum value am<strong>on</strong>g a set <str<strong>on</strong>g>of</str<strong>on</strong>g> k unsigned integers.V. RELATED WORKTenca [6] introduced a 3-input floating-point adder thatachieves reduced error compared to two IEEE floating-pointadders; however, this approach requires the precisi<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> theeffective operati<strong>on</strong> to increase from f to 2f+5; this approach isunlikely to scale for more than three or four operands. TheFloPoCo compiler [7] can also generate effective 3-inputoperati<strong>on</strong>s when necessary, and can eliminate effectivesubtracti<strong>on</strong> when all input operands are known to be positive,e.g., when computing x 2 + y 2 + z 2 . Our approach, in c<strong>on</strong>trast,takes the same approach as Altera’s floating-point datapathcompiler [1, 2], and scales up to 16 operands, while improvingup<strong>on</strong> both area and delay.It is impossible to parallelize floating-point additi<strong>on</strong> whilemaintaining IEEE compliance due to n<strong>on</strong>-associativity. Kapreand DeH<strong>on</strong> [8] developed a speculative approach to parallelfloating-point accumulati<strong>on</strong> that maintains compliance.Operands are added two at a time, in parallel, using IEEE-754-compliant operators under the optimistic assumpti<strong>on</strong> that alloperators are associative for the given inputs. If a detecti<strong>on</strong>circuits asserts that this assumpti<strong>on</strong> is violated, the summati<strong>on</strong>process is serialized automatically. The drawback <str<strong>on</strong>g>of</str<strong>on</strong>g> thisapproach is that each operator performs normalizati<strong>on</strong>.REFERENCES[1] M. Langhammer, “<str<strong>on</strong>g>Floating</str<strong>on</strong>g>-point Datapath <str<strong>on</strong>g>Synthesis</str<strong>on</strong>g> for <strong>FPGAs</strong>,” Proc.Intl. C<strong>on</strong>f. Field Programmable Logic and Applicati<strong>on</strong>s (FPL ’08), pp.355-360, Sept. 2008.[2] M. Langhammer and T. VanCourt, “FPGA <str<strong>on</strong>g>Floating</str<strong>on</strong>g>-point DatapathCompiler,” Proc. IEEE Symp. Field-Programmable Custom ComputingMachines (FCCM ’09), Apr. 2009.[3] H. Parandeh-Afshar, P. Brisk, and P. Ienne, “Exploiting Fast <strong>Carry</strong>-Chains <str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>FPGAs</strong> for Designing Compressor Trees,” Proc. Intl. C<strong>on</strong>f.Field Programmable Logic and Applicati<strong>on</strong>s (FPL ’09), Aug.-Sept.2009.[4] P.-M. Seidel and G. Even, “Delay-Optimized Implementati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> IEEE<str<strong>on</strong>g>Floating</str<strong>on</strong>g>-<str<strong>on</strong>g>Point</str<strong>on</strong>g> <str<strong>on</strong>g>Additi<strong>on</strong></str<strong>on</strong>g>,” IEEE Trans. Computers, vol. 53, no. 2, pp.97-113, February, 2004.[5] A. K. Verma, P. Brisk, and P. Ienne, “Iterative Layering: OptimizingArithmetic Circuits by Structing the Informati<strong>on</strong> Flow,” Proc. Intl. C<strong>on</strong>f.Computer-Aided Design (ICCAD ’09), Nov. 2009.[6] A. Tenca, “Multi-operand <str<strong>on</strong>g>Floating</str<strong>on</strong>g>-point <str<strong>on</strong>g>Additi<strong>on</strong></str<strong>on</strong>g>,” Proc. 19 th IEEEIntl. Symp. Computer Arithmetic (ARITH-19 ‘09), pp. 161-168, June,2009.[7] F. de Dinechin, C. Klein, and B. Pasca, “Generating High-PerformanceCustom <str<strong>on</strong>g>Floating</str<strong>on</strong>g>-<str<strong>on</strong>g>Point</str<strong>on</strong>g> Pipeline,” Proc. Intl. C<strong>on</strong>f. Field ProgrammableLogic and Applicati<strong>on</strong>s (FPL ’09), Aug.-Sept. 2009, and TechnicalReport LIP-2009-16, Ens-Ly<strong>on</strong>, Ly<strong>on</strong>, France, April, 2009.[8] N. Kapre and A. DeH<strong>on</strong>, “Optimistic Parallelizati<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>Floating</str<strong>on</strong>g>-<str<strong>on</strong>g>Point</str<strong>on</strong>g>Accumulati<strong>on</strong>.” Proc. IEEE Symp. Computer Arithmetic (ARITH-18‘07), pp. 205-216, June, 2007.24

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!