11.07.2015 Views

ieee transactions on very large scale integration vlsi - Computer ...

ieee transactions on very large scale integration vlsi - Computer ...

ieee transactions on very large scale integration vlsi - Computer ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

2 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 5, NO. 1, MARCH 19973 Mult, 3 Add 2 Mult, 2 Add 1 Mult, 2 AddClock Csteps ns Csteps ns Csteps ns163 14 2282 16 2608 16 260882 17 1394 18 1476 21 172255 21 1155 22 1210 29 159548 25 1200 26 1248 37 177624 46 1104 48 1152 66 1584TABLE IResource-C<strong>on</strong>strained Scheduling Results for the EWFchoice eliminates an entire dimensi<strong>on</strong> of the search space,so even an optimal scheduler will explore <strong>on</strong>ly the corresp<strong>on</strong>ding2D slice of the design space, and will produce aschedule that is optimal <strong>on</strong>ly for that <strong>on</strong>e clock length. Abetter schedule may exist for a dierent clock length, butwill not be found.To motivate the need to explore this 3D design space,c<strong>on</strong>sider the problem of scheduling the well-known EllipticWave Filter [6, p.206] (EWF) benchmark, under a varietyof resource c<strong>on</strong>straints, to nd the fastest possible schedule.Assume that the VDP100 module library [7], [8] is used,which hasamultiplicati<strong>on</strong> delay of 163ns, and an additi<strong>on</strong>delay of 48ns.Forced to select a clock length for the scheduling algorithm,the designer would probably choose either a clocklength of 48ns or 163ns { the executi<strong>on</strong> delay of either additi<strong>on</strong>or multiplicati<strong>on</strong>. Given those clock lengths, an optimalscheduler that supports multi-cycle operati<strong>on</strong>s (suchas the ILP-based scheduler [9] in our Voyager design spaceexplorati<strong>on</strong> system) would produce the results shown <strong>on</strong>the rows labeled \48" and \163" in Table I.Now c<strong>on</strong>sider the other rows of Table I, which representother, perhaps less obvious, choices for the clock length.For each resource c<strong>on</strong>straint, the fastest design corresp<strong>on</strong>dsto a clock length of 24ns { a design that would not be foundby ascheduling methodology limited by ad hoc guesses. 2Thus it is important to explore a number of candidate clocklengths to nd the globally optimal soluti<strong>on</strong>.B. Exploring the 3D Design SpaceAvariety of methodologies can be used for design spaceexplorati<strong>on</strong>. The methodologies may use exact algorithmsto nd optimal soluti<strong>on</strong>s, may use heuristic algorithms t<strong>on</strong>d lower and upper bounds <strong>on</strong> the optimal soluti<strong>on</strong>, ormay use heuristic algorithms to estimate the optimal soluti<strong>on</strong>.In general, the tradeo between these three types ofmethodologies is <strong>on</strong>e of soluti<strong>on</strong> qualityversus computati<strong>on</strong>ti<strong>on</strong> [3], or reclocking [4] to determine the nal clock length. However,these techniques generally do not change the relative schedulingof the operati<strong>on</strong>s, and do not perform tradeos involving resourcesharing, so they do not explore the high-level design space as fully asscheduling techniques. Nevertheless, the later use of these techniques,possibly in c<strong>on</strong>juncti<strong>on</strong> with other transformati<strong>on</strong>s [5], can serve asavaluable complement to our methodologies.2 This small clock length also results in a <strong>large</strong>r numberofc<strong>on</strong>trolsteps, and thus a <strong>large</strong>r and more complex c<strong>on</strong>trol unit. However, notethat a clock length of 55ns { more comparable to the ad hoc guesses {results in a schedule almost as fast as the <strong>on</strong>e corresp<strong>on</strong>ding to a 24nsclock, and faster than those corresp<strong>on</strong>ding to the ad hoc guesses.Exhaustive Search:read in DFG, module library, and any c<strong>on</strong>straintsfor each clock lengthoptimally schedule the DFGpresent the best result(s) to the user for evaluati<strong>on</strong>Fig. 3. Exhaustive Search of the 3D Design Space (Impractical)time. This paper is c<strong>on</strong>cerned with nding optimal (exact)soluti<strong>on</strong>s to the scheduling problem in the 3D design space.One exact methodology for optimally solving this 3Dschedulingproblem shown in Figure 3. This methodologyexhaustively explores all potential clock lengths and all feasibleschedules, and guarantees a globally optimal soluti<strong>on</strong>.Unfortunately, the computati<strong>on</strong> time for this methodologyis too high to be practical for all but the simplest examples.In c<strong>on</strong>trast, this paper presents a more ecient exactmethodology, implemented in the Voyager design space explorati<strong>on</strong>system, for optimally solving this 3D-schedulingproblem. This methodology makes the problem tractablethrough: (1) careful pruning of provably inferior pointsfrom the design space, and (2) provably ecient exact algorithmsfor solving the individual problems.However, even this soluti<strong>on</strong> methodology is <strong>on</strong>ly the rststep toward the <strong>large</strong>r design space explorati<strong>on</strong> problemthat eventually needs to be solved. As described here,our methodology does not c<strong>on</strong>sider the module selecti<strong>on</strong>or type mapping problems, and does not support loops orc<strong>on</strong>diti<strong>on</strong>als 3 . It also does not incorporate register, wiring,or c<strong>on</strong>troller area, and <strong>on</strong>ly partially incorporates the delaysassociated with the c<strong>on</strong>troller and wiring. Nevertheless,the work described here can serve as a foundati<strong>on</strong>for an exact soluti<strong>on</strong> methodology that incorporates eachof these factors, either by adding extra dimensi<strong>on</strong>s to thesearch space or by adding other stages to the methodology.II. Methodology OverviewThis paper presents two methodologies to solve the clockdeterminati<strong>on</strong> and scheduling problem, that are guaranteedto nd the globally optimal design, and that are far moreecient than an exhaustive search of the design space. Onemethodology solves the Time-C<strong>on</strong>strained 3D Scheduling(TCS-3D) problem (Figure 4), while the other solves theResource-C<strong>on</strong>strained 3D Scheduling (RCS-3D) problem(Figure 5). Both methodologies are implemented in Rensselaer'sVoyager design space explorati<strong>on</strong> system.The core of each methodology is based roughly <strong>on</strong> the exhaustivesearch of Figure 3. Each methodology computesa set of candidate clock lengths, and then, for each candidateclock length, optimally solves the scheduling problem.However, a straightforward implementati<strong>on</strong> of thiscore methodology takes much too l<strong>on</strong>g to solve, even forsmall benchmarks. Thus it is important to (1) solve thescheduling problem for <strong>on</strong>ly a small, provably minimal setof candidate clock lengths, and (2) solve the scheduling3 However, module selecti<strong>on</strong> has since been incorporated [10], andc<strong>on</strong>diti<strong>on</strong>als and register area are currently being investigated.


CHAUDHURI, ET AL: A SOLUTION METHODOLOGY FOR EXACT DESIGN SPACE EXPLORATION IN A 3D DESIGN SPACE 3Time-C<strong>on</strong>strained 3D Scheduling (TCS-3D):read in DFG, module library, and time c<strong>on</strong>straintcompute minimal set of candidate clock lengthsfor each clock lengthperform Time-C<strong>on</strong>strained Scheduling (TCS)endpresent the results to the user for evaluati<strong>on</strong>Time-C<strong>on</strong>strained Scheduling (TCS):compute tight lower bounds <strong>on</strong> the number of functi<strong>on</strong>alunits of each typeuse these lower bounds as resource c<strong>on</strong>straints, and solveTRCS as a decisi<strong>on</strong> problemwhile no feasible schedule is foundincrease the resource c<strong>on</strong>straintssolve TRCS as a decisi<strong>on</strong> problemendFig. 4. Voyager's Time-C<strong>on</strong>strained 3D Scheduling (TCS-3D)Methodologyproblems as eciently as possible so that an optimal soluti<strong>on</strong>is found in a reas<strong>on</strong>able amount of time.To solve the scheduling problem, both methodologies usean Integer Linear Programming (ILP) formulati<strong>on</strong> (Secti<strong>on</strong>IV) that was developed after a careful formal analysisof that problem. This analysis was presented earlier in[9], where we proved that this formulati<strong>on</strong>, in particularthe formulati<strong>on</strong> of the TRCS problem, was well-structuredand can be solved eciently.Since the search spaces for the TCS and RCS problemsare each <strong>large</strong>r than that of the TRCS problem, thesemethodologies solve the TCS and RCS problems by generatingthe missing c<strong>on</strong>straints, in eect c<strong>on</strong>verting eachinto an easier-to-solve TRCS problem. For the TCS problem,the methodology computes c<strong>on</strong>straints <strong>on</strong> the numberof functi<strong>on</strong>al units of each type; for the RCS problem, itcomputes a time c<strong>on</strong>straint <strong>on</strong> the length of the schedule.Since these c<strong>on</strong>straints can also be found eciently, theentire methodology is ecient.A. Time-C<strong>on</strong>strained 3D Scheduling (TCS-3D)Voyager's methodology for solving the time-c<strong>on</strong>strained3D scheduling problem is outlined in Figure 4. Thismethodology begins by reading in the data ow graph(DFG), the executi<strong>on</strong> delays for the relevant functi<strong>on</strong>alunits in the module library, and the overall time c<strong>on</strong>straint.The minimal set of candidate clock lengths is then determined(see Secti<strong>on</strong> III), based <strong>on</strong> the executi<strong>on</strong> delaysof the relevant functi<strong>on</strong>al units in the module library, andfor chained designs, <strong>on</strong> the structure of the DFG. For theEWF and the module library described earlier, 10 candidateclock lengths would be generated (in the absence ofchaining). For each of these clock lengths, time-c<strong>on</strong>strainedscheduling is then performed, and the results are presentedto the user for evaluati<strong>on</strong>.To solve the TCS problem eciently, Voyager's ILP formulati<strong>on</strong>of the TRCS problem (described in Secti<strong>on</strong> IV) isused as follows. First, tight lower bounds <strong>on</strong> the number ofResource-C<strong>on</strong>strained 3D Scheduling (RCS-3D):read in DFG, module library, and resource c<strong>on</strong>straintscompute minimal set of candidate clock lengthsfor each clock lengthperform Resource-C<strong>on</strong>strained Scheduling (RCS)endpresent the results to the user for evaluati<strong>on</strong>Resource-C<strong>on</strong>strained Scheduling (RCS):compute a tight lower bound <strong>on</strong> the schedule lengthuse this lower bound as a time c<strong>on</strong>straint, and solve TRCSas a decisi<strong>on</strong> problemwhile no feasible schedule is foundincrease the time c<strong>on</strong>straintsolve TRCS as a decisi<strong>on</strong> problemendFig. 5. Voyager's Resource-C<strong>on</strong>strained 3D Scheduling (RCS-3D)Methodologyfuncti<strong>on</strong>al units of eachtype are computed (using a methodsketched out in Secti<strong>on</strong> V). These bounds are then usedas resource c<strong>on</strong>straints, and the TRCS problem is solvedas a decisi<strong>on</strong> problem. If TRCS produces a feasible schedule,then that schedule is guaranteed to be optimal; if not,the resource c<strong>on</strong>straints are increased, and this process isrepeated.This TCS-3D soluti<strong>on</strong> methodology is relatively ecientfor the following reas<strong>on</strong>s. First, the functi<strong>on</strong>al unit lowerbounds can be computed in polynomial time, by solvingat most two Linear Programs (LPs). Sec<strong>on</strong>d, TRCS issolved as a decisi<strong>on</strong> problem, rather than an optimizati<strong>on</strong>problem, using a formulati<strong>on</strong> that is well-structured,and requires few, if any, branches in a branch-and-boundsearch [9]. Finally, the functi<strong>on</strong>al unit lower bounds arehighly accurate [11] (in almost e<strong>very</strong> case they lead immediatelyto a feasible soluti<strong>on</strong>), so in practice the lowerbounds seldom have to be increased to solve TRCS again.Thus the TCS-3D problem can be solved quickly, even formedium-sized benchmarks (see Secti<strong>on</strong> VII).The eciency of the methodology can be further increasedif the goal is to nd the schedule with the fewestnumber of functi<strong>on</strong>al units. In this case, before each TCSproblem is solved as a TRCS problem, the FU lower boundsare compared to the number of FUs required in the bestprevious schedule. If the new bounds are smaller, then theTRCS problem is solved as explained above; if the newbounds are <strong>large</strong>r, then there is no need to solve the TRCSproblem since it would require more functi<strong>on</strong>al units thanthe best soluti<strong>on</strong> found so far.B. Resource-C<strong>on</strong>strained 3D Scheduling (RCS-3D)Voyager's methodology for solving the resourcec<strong>on</strong>strained3D scheduling problem is similar (see Figure5). This methodology reads in a resource c<strong>on</strong>straint,and generates a minimal set of candidate clocks usingthe clock length determinati<strong>on</strong> algorithm described inSecti<strong>on</strong> III. For each of these clock lengths, resourcec<strong>on</strong>strainedscheduling is then performed.


CHAUDHURI, ET AL: A SOLUTION METHODOLOGY FOR EXACT DESIGN SPACE EXPLORATION IN A 3D DESIGN SPACE 9formed by relaxing the precedence c<strong>on</strong>straints between operati<strong>on</strong>s.Precedence c<strong>on</strong>straints have been relaxed in [19](and a similar relaxati<strong>on</strong> in [33]), and in the tighter relaxati<strong>on</strong>sdescribed in [34], [35], [36], and [37] (based <strong>on</strong>a method originally proposed by Fernandez and Bussellin [38, Theorem 1]).A dierent approach, described more formally in [11], isused in Voyager. This approach starts with an ILP formulati<strong>on</strong>of the FU minimizati<strong>on</strong> problem (minimize thenumber m k of FUs of type k 2 K). The problem is then relaxedto a generic descripti<strong>on</strong> of an entire class of FU lowerboundingproblems (the problems above are special cases ofthis generic class). From this class, the FU lower-boundingproblem that produces the tightest possible bound is selectedand solved. This approach formalizes an entire classof FU lower-bounding problems and is guaranteed to producethe tightest possible bound in that class; this boundwas veried to be exact in most of our experiments.Furthermore, the soluti<strong>on</strong> to this FU lower-boundingproblem can be found in polynomial time by solving atmost two LP's, even though the original formulati<strong>on</strong> wasan ILP formulati<strong>on</strong>. Such an LP-based relaxati<strong>on</strong> is chosenbecause we want as tight as possible bounds to increase theeciency of solving the TCS problem. However, Voyageralso has a suite of more ecient heuristic algorithms [10]that may produce less accurate bounds and are suitable fora quick rst pass over the design space.VI. Bounding the Length of the Schedule toSolve RCS More EfficientlyThis secti<strong>on</strong> presents a method of generating a tightlower bound <strong>on</strong> the schedule length, so that the RCS problemcan be solved more eciently. The method is similar inprinciple to the method presented in the previous secti<strong>on</strong>for solving the FU lower-bounding problem.One early formulati<strong>on</strong> of the schedule length lowerboundingproblem in presence of resource c<strong>on</strong>straints ispresented in [19]; however, the bounds produced by thatapproach are <strong>very</strong> loose. More recent algorithms that producetighter bounds are those in [39] and [37] (based <strong>on</strong>Jacks<strong>on</strong>'s earliest deadline rule (ED-Rule) [40]), and thosein [34] and [36] (based <strong>on</strong> a theorem originally given byFernandez and Bussell in [38, Theorem 2]). Furthermore,those algorithms can be applied iteratively (Hu et al. applyFernandez's Theorem 2 in [41], and Langevin appliesED-Rule in [42]), producing even tighter bounds, althoughat the cost of increased algorithmic complexity.In much the same manner as FU-lower bounding, the ILPformulati<strong>on</strong> of the schedule-length minimizati<strong>on</strong> problemcan be relaxed to a generic descripti<strong>on</strong> of an entire class ofschedule-length lower-bounding problems. From this class,the lower-bounding problem that produces the tightest possiblebound (the problems above are special cases of thisgeneric class) is chosen and solved. This approach formalizesan entire class of schedule length lower-boundingproblems, and is guaranteed to produce the tightest boundof all possible precedence relaxati<strong>on</strong>s in polynomial time.TimeClock Csteps ns (Mult, Add)Time C<strong>on</strong>straint = 902ns82 11 902 (4, 2)55 15 825 (4, 2)48 18 864 (5, 2)41 22 902 (4, 2)33 26 858 (4, 2)28 30 840 (4, 2)24 34 816 (4, 2)21 41 861 (4, 2)19 45 855 (4, 2)Time C<strong>on</strong>straint = 760ns (tightest)24 31 744 (6, 2)TABLE IIAR { TCS-3D Results2*,1+ 2*,4+ 4*,2+ 6*,3+ 6*,4+Clock Cs ns Cs ns Cs ns Cs ns Cs ns163 13 2119 10 1630 8 1304 8 1304 8 130482 18 1476 18 1476 11 902 11 902 11 90255 26 1430 26 1430 15 825 14 770 14 77048 LB 1440 LB 1440 LB 864 LB 816 LB 81641 LB 1435 LB 1435 LB 861 LB 779 LB 77933 LB 1452 LB 1452 LB 858 LB 792 LB 79228 LB 1456 LB 1456 LB 840 LB 784 LB 78424 LB 1440 LB 1440 34 816 31 744 31 74421 LB 1449 LB 1449 LB 819 LB 756 LB 75619 LB 1444 LB 1444 LB 817 LB 760 LB 760TABLE IIIAR { RCS-3D Results Without ChainingVII. Experimental ResultsTo dem<strong>on</strong>strate the accuracy and performance of Voyager's3D scheduling methodology, we c<strong>on</strong>ducted a seriesof experiments using the well-known AR-lattice Filter(AR) [19], Elliptic Wave Filter (EWF) [6, p.206], andDiscrete Cosine Transform (DCT) [43] benchmarks. Weused the VDP100 module library from [7], [8], giving adatapath delay of 48ns for additi<strong>on</strong>, 56ns for subtracti<strong>on</strong>,and 163ns for multiplicati<strong>on</strong>. For each benchmark, we performedTime-C<strong>on</strong>strained 3D Scheduling (TCS-3D), andResource-C<strong>on</strong>strained 3D Scheduling (RCS-3D) with andwithout chaining, using the methodologies presented inSecti<strong>on</strong> II.A. AR FilterThe TCS-3D results for the AR lter are presented inTable II. They show, for each of two time c<strong>on</strong>straints,those clock lengths from the candidate set that lead to afeasible schedule (the other clock lengths lead to infeasibleschedules regardless of the number of functi<strong>on</strong>al unitsavailable).For a time c<strong>on</strong>straint of 902ns, eight clock lengths (82ns,55ns, 41ns, 33ns, 28ns, 24ns, 21ns, and 19ns) led to the minimumnumber of functi<strong>on</strong>al units. Of these, the schedulefor the 82ns clock (dd mult =2e) requires the fewest c<strong>on</strong>trolsteps (and thus potentially the smallest c<strong>on</strong>troller), and so


10 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 5, NO. 1, MARCH 19972*,1+ 2*,4+ 4*,2+ 6*,3+ 6*,4+Clock Cs ns Cs ns Cs ns Cs ns Cs ns211 12 2532 9 1899 6 1266 5 1055 4 844163 13 2119 9 1467 LB 1304 LB 1141 LB 97896 18 1728 LB 1536 11 1056 LB 1056 LB 864TABLE IVAR{RCS-3D Results for Type I Chaining2*,1+ 2*,4+ 4*,2+ 6*,3+ 6*,4+Clock Cs ns Cs ns Cs ns Cs ns Cs ns106 17 1802 17 1802 9 954 8 848 8 84871 LB 1491 LB 1491 LB 923 11 781 11 78153 LB 1431 LB 1431 LB 848 14 742 14 742TABLE VAR{RCS-3D Results for Type II Chainingwould be preferable. Note that the 48ns clock { <strong>on</strong>e of the\obvious" ad hoc guesses (d add ) { requires an additi<strong>on</strong>almultiplier.To nd the fastest possible design, the critical pathlength was used to derive the tightest possible time c<strong>on</strong>straintof760ns. For this time c<strong>on</strong>straint, <strong>on</strong>ly <strong>on</strong>e clocklength { 24ns { led to a feasible schedule, and thus to theoptimal 3D schedule.The RCS-3D results for the AR lter are presented in TablesIII-VI. Some schedule lengths of interest are shown inboldface, and those that were lower-bounded by the RCS-3D methodology are shown in gray al<strong>on</strong>g with the lowerboundedschedule length. As described in Secti<strong>on</strong> II-B, theTRCS problem was not solved for those clock lengths, sinceeach would result in a schedule that was l<strong>on</strong>ger than theshortest schedule found for the previous clock lengths.In the absence of chaining, schedule lengths are shownfor e<strong>very</strong> candidate clock length in Table III. In this table,the fastest schedules corresp<strong>on</strong>d to clock lengths of55ns, and when sucient resources are available, 24ns.Again, it is interesting to note that neither of these clocklengths is an obvious ad hoc guess (55 is dd mult =3e, and24 is both dd mult =7e and dd add =2e), which means that thefastest schedule might be missed using more c<strong>on</strong>venti<strong>on</strong>almethodologies. Furthermore, although the clock length of24ns would corresp<strong>on</strong>d to a <strong>large</strong>r numberofc<strong>on</strong>trol steps(and perhaps a <strong>large</strong>r c<strong>on</strong>troller), that small clock lengthdoes result in the optimal 3D schedule, because the smallerclock lengths tend to reduce the operati<strong>on</strong> slack.Similarly, in the presence of chaining, schedule lengthsare shown for e<strong>very</strong> candidate clock length in Tables IV-VI.For a given clock length (such as 163ns), chaining usuallyimproved the schedule when there were sucient resourcesavailable, but provided no improvement when the numberof resources was small. Furthermore, for some clock lengthsand resource c<strong>on</strong>straints, <strong>on</strong>e type of chaining provided the<strong>large</strong>st improvement, while for other clock lengths and resourcec<strong>on</strong>straints, another type provided the <strong>large</strong>st improvement.Finally, looking at all these results over all candidateclock lengths, note that type II chaining gave the fastest2*,1+ 2*,4+ 4*,2+ 6*,3+ 6*,4+Clock Cs ns Cs ns Cs ns Cs ns Cs ns163 13 2119 9 1467 7 1141 7 1141 6 978106 17 1802 LB 1484 10 1060 9 954 LB 84896 18 1728 LB 1536 11 1056 LB 960 LB 86482 18 1476 LB 1476 11 902 11 902 LB 90271 LB 1491 LB 1491 LB 923 12 852 11 78155 26 1430 26 1430 15 825 14 770 14 77053 LB 1431 LB 1431 LB 848 LB 795 LB 795TABLE VIAR{RCS-3D Results for Type III ChainingTimeClock Csteps ns (Mult, Add)Time C<strong>on</strong>straint = 1394ns82 17 1394 (3, 3)55 25 1375 (2, 2)48 29 1392 (2, 2)24 58 1392 (2, 2)Time C<strong>on</strong>straint = 1035ns (tightest)24 43 1032 (4, 3)TABLE VIIEWF { TCS-3D Resultsoverall chained schedule (742ns), slightly faster than thefastest unchained schedule (744ns), and with half the numberofc<strong>on</strong>trolsteps. Type III chaining's performance waspoorer (770ns), but at least tied with the unchained scheduleat that same clock length (55ns). The best schedulefrom type I chaining (the \standard" form of chaining),however, was c<strong>on</strong>siderably worse than the best unchainedschedule (844ns vs. 744ns).B. Elliptic Wave Filter (EWF)The TCS-3D results for the EWF are presented in TableVII. Again, they show, for eachoftwo time c<strong>on</strong>straints,those clock lengths from the candidate set that lead to afeasible schedule.For a time c<strong>on</strong>straint of 1394ns, three clock lengths(55ns, 48ns, and 24ns) led to the minimum number offuncti<strong>on</strong>al units. Of these, the schedule for the 55ns clock(dd mult =3e) requires the fewest c<strong>on</strong>trol steps (and thus potentiallya smaller c<strong>on</strong>troller), so would be preferable. Notethat this is a dierent clock length than the <strong>on</strong>e chosenfor the AR lter, illustrating the importance of taking thestructure of the DFG into account.To nd the fastest possible design, the critical pathlength was used to derive the tightest possible time c<strong>on</strong>straintof 1035ns. For this time c<strong>on</strong>straint, <strong>on</strong>ly <strong>on</strong>e clocklength { 24ns { led to a feasible schedule, and thus to theoptimal 3D schedule.The RCS-3D results for the EWF are shown in TablesVIII-XI. In the absence of chaining, the 24ns and55ns clock lengths corresp<strong>on</strong>d to the fastest schedules.Again, for a given clock length, chaining usually improvedthe schedule when there were sucient resources available.However, neither form of chaining was able able to nd an


CHAUDHURI, ET AL: A SOLUTION METHODOLOGY FOR EXACT DESIGN SPACE EXPLORATION IN A 3D DESIGN SPACE 11ASAP 1*,2+ 2*,2+ 3*,3+Clock Cs ns Cs ns Cs ns Cs ns163 14 2282 16 2608 16 2608 14 228282 17 1394 21 1722 18 1476 17 139455 20 1100 29 1595 22 1210 21 115548 23 1104 LB 1632 26 1248 25 120041 34 1394 LB 1599 LB 1230 LB 118933 37 1221 LB 1617 LB 1221 LB 119028 40 1120 LB 1596 44 1232 42 117624 43 1032 66 1584 48 1152 46 110421 57 1197 LB 1596 LB 1155 LB 111319 60 1140 LB 1596 LB 1159 LB 1121TABLE VIIIEWF { RCS-3D Results Without ChainingASAP 1*,2+ 2*,2+ 3*,3+Clock Cs ns Cs ns Cs ns Cs ns211 7 1477 14 2954 14 2954 10 2110163 9 1467 15 2445 15 2445 10 163096 12 1152 20 1920 16 1536 13 1248TABLE IXEWF { RCS-3D Results for Type I Chainingoverall faster schedule than the fastest unchained schedule.C. Discrete Cosine Transform (DCT)The TCS-3D results for the DCT are presented in TableXII. The rst set of results are for a time c<strong>on</strong>straint of500ns, corresp<strong>on</strong>ding to a design will run at 2MHz. Eightclock lengths produced feasible schedules, but <strong>on</strong>ly <strong>on</strong>e {24ns { led to the minimum number of functi<strong>on</strong>al units.To nd the fastest possible design, the critical path lengthwas used to derive the tightest possible time c<strong>on</strong>straint of434ns, and <strong>on</strong>ly <strong>on</strong>e clock length { 24ns { led to a feasibleschedule and thus to the optimal 3D schedule.The RCS-3D results for the DCT are presented in TableXIII. In the absence of chaining, the 56ns clock length(d sub ) corresp<strong>on</strong>ds to the fastest schedule. This time, not<strong>on</strong>ly could type I chaining not nd an overall faster schedulethan the fastest unchained schedule, but it could noteven improve the schedule for a given clock length over theunchained schedule, probably due to the severe resourcec<strong>on</strong>straints.D. Methodology Run TimesVoyager's design space explorati<strong>on</strong> methodologies c<strong>on</strong>sistsof three main tasks: computing the minimal set of candidateclock lengths, computing tight bounds <strong>on</strong> the numberof functi<strong>on</strong>al units or <strong>on</strong> the schedule length, and solvingthe TRCS problem. The minimal set of candidate clocklengths can be computed quickly, and the bounds can becomputed by solving at most two linear programs in polynomialtime, as discussed in Secti<strong>on</strong>s V and VI. Finally,the TRCS formulati<strong>on</strong> used in Voyager is well-structured,meaning that it c<strong>on</strong>verges <strong>on</strong> the optimal soluti<strong>on</strong> fasterthan an arbitrary formulati<strong>on</strong>.To motivate the need for solving the TCS or RCS problemby rst computing bounds and then solving the re-ASAP 1*,2+ 2*,2+ 3*,3+Clock Cs ns Cs ns Cs ns Cs ns106 11 1166 19 2014 15 1590 12 127271 17 1207 LB 1775 19 1349 LB 127853 20 1060 LB 1643 LB 1219 LB 1166TABLE XEWF { RCS-3D Results for Type II ChainingASAP 1*,2+ 2*,2+ 3*,3+Clock Cs ns Cs ns Cs ns Cs ns163 9 1467 15 2445 15 2445 10 1630106 11 1166 19 2014 15 1590 13 137896 12 1152 20 1920 16 1536 13 124882 17 1394 21 1722 18 1476 LB 139471 17 1207 LB 1775 19 1349 LB 142055 20 1100 29 1595 22 1210 21 115553 20 1060 LB 1643 LB 1219 LB 1166TABLE XIEWF { RCS-3D Results for Type III Chainingsulting TRCS problem, c<strong>on</strong>sider the result of solving theTCS problem directly for a time c<strong>on</strong>straint of 1394ns anda 24ns clock <strong>on</strong> the EWF benchmark. Even with a wellstructuredformulati<strong>on</strong> suchasVoyager's, solving this problemdirectly took over an hour of CPU time (using LINDO<strong>on</strong> a Sun SPARCstati<strong>on</strong> 2). In c<strong>on</strong>trast, we spent <strong>on</strong>ly 1.51sec to compute the lower bounds <strong>on</strong> the number of functi<strong>on</strong>alunits, and <strong>on</strong>ly 7.75 sec to solve the TRCS problem{ solving the same problem in two orders of magnitude lesstime!On a <strong>large</strong>r benchmark { the DCT { for a time c<strong>on</strong>straintof 500ns and a 24ns clock, we spent 8.28 sec to compute thelower bounds <strong>on</strong> the number of functi<strong>on</strong>al units, and 2.62sec to solve the TRCS problem. Again, directly solving theTCS problem for this case took over an hour.In general, the best designs for each example were generatedwithin sec<strong>on</strong>ds. However, for <strong>very</strong> small clock lengths(e.g. 19ns), the ILP for the TRCS problem becomes quite<strong>large</strong>, and in some cases would have taken hours to nd theexact soluti<strong>on</strong>. Fortunately, even in those cases the boundswere produced fairly quickly, and could often obviate theneed to solve the TRCS problem for those clock lengths asdescribed in Secti<strong>on</strong>s II-A and II-B.VIII. Summary and Future WorkThis paper has dened a new problem { the 3D schedulingproblem { and has presented an exact soluti<strong>on</strong> methodologyto solve that problem without resorting to a timec<strong>on</strong>sumingexhaustive search. This soluti<strong>on</strong> methodologyis exact {itisguaranteed to nd the optimal clock lengthand schedule. Furthermore, it is ecient {itprunes inferiorpoints in the design space through a careful selecti<strong>on</strong>of candidate clock lengths (an important design parametertoo often determined by guesswork or estimates), andthrough tight bounds <strong>on</strong> the number of functi<strong>on</strong>al units orthe length of the schedule. It can optimally solve mediumsizedproblems in sec<strong>on</strong>ds, as opposed to more c<strong>on</strong>venti<strong>on</strong>altechniques that might require hours. Thus it eliminates the


12 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 5, NO. 1, MARCH 1997TimeClock Csteps ns (Mult, Add, Sub)Time C<strong>on</strong>straint = 500ns (2MHz)56 8 448 (11, 7, 4)55 9 495 (15, 5, 6), (13, 6 4)48 10 480 (16, 5, 6)33 15 495 (11, 7, 4), (15, 5, 4)28 17 476 (11, 7, 4), (15, 5, 4)24 20 480 (11, 4, 4)21 23 483 (11, 7, 4), (15, 5, 4)19 26 494 (11, 5, 4), (15, 7, 4)Time C<strong>on</strong>straint = 434ns (tightest)24 18 432 (16, 5, 6)TABLE XIIDCT { TCS-3D Results5 Mult, 3 Add, 2 SubClock Csteps ns163 9 146782 10 82056 14 78455 LB 82548 LB 81641 LB 82033 LB 79228 28 78424 LB 79221 LB 79819 LB 7985 Mult, 3 Add, 2 SubClock Csteps ns219 9 1971211 9 1899163 9 1467104 10 104096 10 960TABLE XIIIDCT { RCS-3D Results Without Chaining, With Type IChainingneed to replace exact techniques with heuristic algorithmsor estimates in an eort to obtain acceptable performance{ a claim not possible with any existing soluti<strong>on</strong> methodology.Although this methodology is a step toward solving thescheduling problem in the c<strong>on</strong>text of a <strong>large</strong>r problem,much work remains for the future. The next step is tocarefully extend the methodology to support c<strong>on</strong>diti<strong>on</strong>als,module selecti<strong>on</strong> and type mapping, and wiring and registercosts.References[1] G. De Micheli, Synthesis and Optimizati<strong>on</strong> of Digital Circuits.McGraw-Hill series in electrical and computer engineering, NewYork, NY, USA: McGraw-Hill, 1994.[2] C. E. Leisers<strong>on</strong> and J. B. Saxe, \Retiming synchr<strong>on</strong>ous circuitry,"Algorithmica, no. 6, pp. 5{35, 1991.[3] R. B. Deokar and S. S. Sapatnekar, \A fresh look at retimingvia clock skew optimizati<strong>on</strong>," in [44], pp. 310{315.[4] P. Jha, S. Parameswaran, and N. Dutt, \Reclocking for HighLevel Synthesis," in Proc. of the Asia-South Pacic C<strong>on</strong>ference<strong>on</strong> Design Automati<strong>on</strong> (ASP-DAC), (Makuhari Messe, Chiba,Japan), IEEE <strong>Computer</strong> Society Press, Aug. 29-Sept. 1 1995.[5] M. Potk<strong>on</strong>jak and J. M. Rabaey, \Optimizing resource utilizati<strong>on</strong>using transformati<strong>on</strong>s," IEEE Transacti<strong>on</strong>s <strong>on</strong> <strong>Computer</strong>-Aided Design, vol. 13, pp. 277{292, Mar. 1994.[6] D. E. Thomas, E. D. Lagnese, R. A. Walker, J. A. Nestor, J. V.Rajan, and R. L. Blackburn, Algorithmic and Register TransferLevel Synthesis: The System Architect's Workbench. 101 PhilipDrive, Assinippi Park, Norwell, MA 02061: Kluwer AcademicPublishers Group, 1990.[7] S. Narayan and D. D. Gajski, \System Clock Estimati<strong>on</strong> based<strong>on</strong> Clock Slack Minimizati<strong>on</strong>," in [45], pp. 66{71.[8] VLSI Technologies Inc., VDP100 1.5 Micr<strong>on</strong> CMOS DatapathCell Library, 1988.[9] S. Chaudhuri, R. A. Walker, and J. E. Mitchell, \Analyzing andExploiting the Structure of the C<strong>on</strong>straints in the ILP Approachto the Scheduling Problem," IEEE Transacti<strong>on</strong>s <strong>on</strong> VLSI Systems,vol. 2, pp. 456{471, Dec. 1994.[10] S. A. Blythe and R. A. Walker, \Towards a Practical Methodologyfor Completely Characterizing the Optimal Design Space,"in Proc. of the 9th Internati<strong>on</strong>al Symposium <strong>on</strong> System-LevelSynthesis, (La Jolla, California), IEEE <strong>Computer</strong> Society Press,Nov. 6-8 1996.[11] S. Chaudhuri and R. A. Walker, \Computing Lower Bounds<strong>on</strong> Functi<strong>on</strong>al Units before Scheduling," IEEE Transacti<strong>on</strong>s <strong>on</strong>VLSI Systems, vol. 4, pp. 273{279, June 1996.[12] B. Rouzeyre, D. Dup<strong>on</strong>t, and G. Sagnes, \Comp<strong>on</strong>ent Selecti<strong>on</strong>,Scheduling and C<strong>on</strong>trol Schemes for High Level Synthesis," in[46].[13] H. B. Bakoglu, Circuits, Interc<strong>on</strong>necti<strong>on</strong>s, and Packaging forVLSI. Addis<strong>on</strong>-Wesley VLSI systems series, Reading, MA, USA:Addis<strong>on</strong>-Wesley, 1990.[14] V. Chaiyakul, A. C.-H. Wu, and D. D. Gajski, \Timing Modelsfor High Level Synthesis," in [45], pp. 60{65.[15] N. Park and A. C. Parker, \Synthesis of Optimal ClockingSchemes," in Proc. of the 23rd ACM/IEEE Design Automati<strong>on</strong>C<strong>on</strong>f., (Las Vegas), pp. 454{460, IEEE <strong>Computer</strong> Society Press,June 1986.[16] R. Jain, A. C. Parker, and N. Park, \Module Selecti<strong>on</strong> forPipeline Synthesis," in Proc. of the 25th ACM/IEEE DesignAutomati<strong>on</strong> C<strong>on</strong>f., (Anaheim, California), pp. 542{547, IEEE<strong>Computer</strong> Society Press, June 12-15 1988.[17] M. Corazao, M. Khalaf, L. Guerra, M. Potk<strong>on</strong>jak, and J. M.Rabaey, \Instructi<strong>on</strong> Set Mapping for Performance Optimizati<strong>on</strong>," in Proc. of the IEEE/ACM Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong><strong>Computer</strong>-Aided Design, (Santa Clara, California), pp. 518{521,IEEE <strong>Computer</strong> Society Press, Nov. 7-11 1993.[18] L.-G. Chen and L.-G. Jeng, \Optimal Module Set and Clock CycleSelecti<strong>on</strong> for DSP Synthesis," in Proc. of 1991 IEEE Internati<strong>on</strong>alSymp. <strong>on</strong> Circuits and Systems., (Singapore), pp. 2200{2203, IEEE <strong>Computer</strong> Society Press, June 11-14 1991.[19] R. Jain, A. C. Parker, and N. Park, \Predicting System-LevelArea and Delay for Pipelined and N<strong>on</strong>pipelined Designs," IEEETransacti<strong>on</strong>s <strong>on</strong> <strong>Computer</strong>-Aided Design, vol. 11, pp. 955{965,Aug. 1992.[20] B. M. Pangrle and D. D. Gajski, \Design Tools for IntelligentSilic<strong>on</strong> Compilati<strong>on</strong>," IEEE Transacti<strong>on</strong>s <strong>on</strong> <strong>Computer</strong>-AidedDesign, vol. 6, pp. 1098{112, Nov. 1987.[21] D. D. Gajski, N. D. Dutt, A. Wu, and S. Lin, High-Level Synthesis:Introducti<strong>on</strong> to Chip and System Design. Kluwer internati<strong>on</strong>alseries in engineering and computer science; VLSI,computer architecture, and digital signal processing, 101 PhilipDrive, Assinippi Park, Norwell, MA 02061: Kluwer AcademicPublishers Group, 1992.[22] B. Gregory, D. MacMillen, and D. Fogg, \ISIS: A System forPerformance Driven Resource Sharing," in [47], pp. 285{290.[23] C. Ramachandran, F. J. Kurdahi, D. D. Gajski, V. Chaiyakul,and A. C.-H. Wu, \Accurate layout area and delay modelling forsystem level design," in Proc. of the IEEE/ACM Internati<strong>on</strong>alC<strong>on</strong>ference <strong>on</strong> <strong>Computer</strong>-Aided Design, (Santa Clara, California),pp. 355{361, IEEE <strong>Computer</strong> Society Press, Nov. 8-121992.[24] D. Mintz and C. Dangelo, \Timing Estimati<strong>on</strong> for BehavioralDescripti<strong>on</strong>s," in Proc. of the 7th Internati<strong>on</strong>al Symposium <strong>on</strong>High-Level Synthesis, (Niagara-<strong>on</strong>-the-Lake, Canada), pp. 42{47, IEEE <strong>Computer</strong> Society Press, May 18-20 1994.[25] C. Ramachandran and F. J. Kurdahi, \Incorporating the c<strong>on</strong>trollereects during register transfer level synthesis," in [46],pp. 308{313.[26] D. Gelosh and D. E. Setli, \Deriving Ecient Area and DelayEstimates by Modelling Layout Tools," in [44], pp. 402{407.[27] K. N. Lalgudi and M. C. Papaefthymiou, \DELAY: An EcientTool for Retiming with Realistic Delay Modeling," in [44],pp. 304 { 309.[28] R. M. Russell, \The CRAY-1 <strong>Computer</strong> System," Communicati<strong>on</strong>sof the ACM, vol. 21, pp. 63{72, Jan. 1978.


CHAUDHURI, ET AL: A SOLUTION METHODOLOGY FOR EXACT DESIGN SPACE EXPLORATION IN A 3D DESIGN SPACE 13[29] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu, \A Formal Approach tothe Scheduling Problem in High-Level Synthesis," IEEE Transacti<strong>on</strong>s<strong>on</strong> <strong>Computer</strong>-Aided Design, vol. 10, pp. 464{475, Apr.1991.[30] B. Landwehr, P. Marwedel, and R. Domer, \OSCAR: OptimumSimultaneous Scheduling, Allocati<strong>on</strong> adn Resource BindingBased <strong>on</strong> Integer Programing," in [46].[31] M. Rim and R. Jain, \Estimating lower-bound performance ofschedules using a relaxati<strong>on</strong> technique," in Proc. of the IEEEInternati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong> <strong>Computer</strong> Design, (Cambridge,Massachusetts), pp. 290{294, IEEE <strong>Computer</strong> Society Press,Oct. 11-14 1992.[32] C. H. Gebotys, \Optimal Scheduling and Allocati<strong>on</strong> of EmbeddedVLSI Chips," in [47], pp. 116{119.[33] K. Kucukcakar, System-Level Synthesis Techiques with Emphasis<strong>on</strong> Partiti<strong>on</strong>ing and Design Timing. PhD thesis, ElectricalEngineering { Systems Department, University of Southern California,1991.[34] A. Sharma and R. Jain, \Estimating Architectural Resourcesand Performance for High-Level Synthesis Applicati<strong>on</strong>s," IEEETransacti<strong>on</strong>s <strong>on</strong> VLSI Systems, vol. 1, pp. 175{190, June 1993.[35] S. Y. Ohm, F. J. Kurdahi, and N. Dutt, \Comprehensive LowerBound Estimati<strong>on</strong> from Behavioral Descripti<strong>on</strong>s," in Proc. ofthe IEEE/ACM Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong><strong>Computer</strong>-AidedDesign, (San Jose, California), pp. 182{187, IEEE <strong>Computer</strong>Society Press, Nov. 6-10 1994.[36] Y. Hu, A. Ghouse, and B. S. Carls<strong>on</strong>, \Lower Bounds <strong>on</strong> theIterati<strong>on</strong> Time and the number of Resources for Functi<strong>on</strong>alPipelined Data Flow Graphs," in [48], pp. 21{24.[37] J. M. Rabaey and M. Potk<strong>on</strong>jak, \Estimating Implementati<strong>on</strong>Bounds for Real Time DSP Applicati<strong>on</strong> Specic Circuits," IEEETransacti<strong>on</strong>s <strong>on</strong> <strong>Computer</strong>-Aided Design, vol. 13, pp. 669{683,June 1994.[38] E. B. Fernandez and B. Bussell, \Bounds <strong>on</strong> the number of Processorsand Time for Multiprocessor Optimal Schedule," IEEETransacti<strong>on</strong>s <strong>on</strong> <strong>Computer</strong>s, vol. C-22, pp. 745{751, Aug. 1973.[39] M. Rim and R. Jain, \Lower-Bound Performance Estimati<strong>on</strong> forthe High-Level Synthesis Scheduling Problem," IEEE Transacti<strong>on</strong>s<strong>on</strong> <strong>Computer</strong>-Aided Design, vol. 13, pp. 451{458, Apr.1994.[40] J. B la_zewicz, \Simple Algorithms for Multiprocessor Schedulingto Meet Deadlines," Informati<strong>on</strong> Processing Letters, vol. 6,pp. 162 { 164, Oct. 1977.[41] Y. Hu and B. S. Carls<strong>on</strong>, \Improved Lower Bounds for theScheduling Optimizati<strong>on</strong> Problem," in Proc. of 1994 IEEE Internati<strong>on</strong>alSymp. <strong>on</strong> Circuits and Systems., (L<strong>on</strong>d<strong>on</strong>, England),pp. 295{298, IEEE <strong>Computer</strong> Society Press, May 30-June 2 1994.[42] M. Langevin and E. Cerny, \A Recursive Technique for ComputingLower-Bound Performance of Schedules," in [48], pp. 16{20.[43] J. A. Nestor and G. Krishnamoorthy, \SALSA: A New Approachto Scheduling with Timing C<strong>on</strong>straints," IEEE Transacti<strong>on</strong>s <strong>on</strong><strong>Computer</strong>-Aided Design, vol. 12, pp. 1107{1122, Aug. 1993.[44] Proc. of the 32nd ACM/IEEE Design Automati<strong>on</strong> C<strong>on</strong>f., (SanFransisco, California), IEEE <strong>Computer</strong> Society Press, June 12-16 1995.[45] Proc. of the European Design Automati<strong>on</strong> C<strong>on</strong>ference (Euro-DAC), (Hamburg, Germany), IEEE <strong>Computer</strong> Society Press,Feb. 1992.[46] Proc. of the European Design and Test C<strong>on</strong>ference, (Paris,France), IEEE <strong>Computer</strong> Society Press, Feb. 8-Mar. 3 1994.[47] Proc. of the 29th ACM/IEEE Design Automati<strong>on</strong> C<strong>on</strong>f., (Anaheim,California), IEEE <strong>Computer</strong> Society Press, June 8-121992.[48] Proc. of the IEEE Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong> <strong>Computer</strong> Design,(Cambridge, Massachusetts), IEEE <strong>Computer</strong> SocietyPress, Oct. 3-6 1993.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!