An Automatic Approach to Generate Haste Code from Simulink ...


generated by CodeSimulink, in order to keep consistency between the simulation performed at the highest level and the one used at lower levels.

This environment has already been extended to the asynchronous circuit domain [13], but that work focused on FPGA development and therefore lacks support for the back-ends used in ASIC development. To overcome this limitation, we used the TiDE flow offered by Handshake Solutions, which is introduced in the next section.

3.2. TiDE Flow

The Timeless Design Environment (TiDE) [14] is a set of tools, provided by Handshake Solutions, that can map hardware descriptions onto a self-timed gate-level netlist, starting from the Haste language. Haste [15] is a high-level behavioral language that supports asynchronous communication using CSP [16] constructs.

Values can be communicated between parallel processes using channels. Channels are objects on which read and write operations are "synchronized": a process that writes on a channel can only complete its communication when the corresponding reader has read the data.

The standard design flow based on TiDE (see Fig. 2) starts with the system description in Haste. This description is converted by the Haste compiler (htcomp) into a behavioral description that can be used to perform functional simulation and can then be mapped onto the desired technology (using htmap).

Using the tool htlink, external Verilog netlists can be linked to the netlist generated from Haste. After this operation the obtained netlist is optimized and adapted for the back-end part of the flow.
(For further details on this, please refer to the TiDE manual [14].)

There were several reasons why we chose to integrate CodeSimulink and TiDE:
• CodeSimulink can convert Simulink models into VHDL, but it lacks an ASIC back-end;
• Haste is a powerful language which allows you to easily describe asynchronous circuits using its native constructs;
• TiDE can synthesize both RTL code and Haste together.

By integrating these two environments we can cover the design process from high-level modeling and simulation to the physical implementation.

4. Preliminary Analysis

In this section we describe some considerations made during the implementation of Simulink diagrams in Haste. They are related to the Haste language itself and to the characteristics of Simulink.

[Figure 2. TiDE flow. Diagram labels: Haste description; Synthesis (htcomp); Handshake Circuits; Behavioral Simulation (htview); Mapping (htmap); Verilog Netlist; External Verilog Netlists; Linking (htlink); Logical Optimization (htlog); Netlist Verification; Back-End.]

[Figure 3. Small design used to test different coding styles in Haste.]

4.1. Haste Coding Style

Haste relies on a transparent compiler. This means that you will get what you described, and for this reason it is necessary to take care of the coding styles used in order to get the most efficient implementation of the design. To compare different coding styles, we developed a set of small models. One of these models is depicted in Fig.
3 and represents a simple datapath of 6 different operations on two inputs which can be selected from another input. Although the model is small, it includes several computational blocks that are often used in Simulink diagrams to describe more complex systems.

4.1.1. Communication. To pass data between modules we can use two different modes: shared variables or channel communications.

The implementation of shared variables is relatively


cheap, but they require explicit synchronization between readers and writers in order to avoid missed data and data duplication, since registers are shared between the writer and the readers.

Channels on the other hand automatically synchronize input and output actions of modules running in parallel and thereby guarantee a correct timing relationship between the read and write actions. Their implementation is more expensive in terms of area than a shared variable (around 1.5%, see Tab. 2 for further details). To keep the conversion of the Simulink model to Haste straightforward, we would like to avoid explicit synchronization between modules. Therefore we choose to use channels instead of shared variables.

Table 2. Comparisons among different blocks using channels or shared variables, with or without state. (These results refer to a different implementation of the design depicted in Fig. 3 and are in number of gates.)

    Implementation choices            |  Area [µm²]
    Registers  Channels  Variables    |  Memory     C-gates    Total
    X          X         -            |  15857.6    1441.0     54829.1
    X          -         X            |  15857.6     490.6     54140.9
    -          X         -            |        0    1134.4     44470.9
    -          -         X            |        0     367.9     43683.8

Table 3. Comparisons among different coding styles for the design depicted in Fig. 3.

    Design     Tuple               Registers           Area [µm²]
               Not Used   Used     Not Used   Used     Total/C-gates
    Datapath   X          -        -          X        11454.9/4804.4
    Datapath   -          X        -          X        11883.8/4792.2
    Datapath   X          -        X          -         4067.3/438.3
    Datapath   -          X        X          -         3670.3/254.5

A channel is a communication mechanism shared between different objects with at least one transmitter and at least one receiver. The implementation of a channel relies on the bundled data approach. This implementation consists of a data part and a control part.
The control part takes care of the communication protocol and the required delay matching of the data part.

The simplest way to describe the way the blocks communicate in a Simulink diagram is using separate channels for each input/output. This solution is straightforward to implement, but it can be more expensive since every input has its own control logic.

Haste allows the user to group together data channels, thereby sharing handshake control circuitry. Such a multiple data channel is called a tuple channel. This solution requires less area. However, deadlock can be introduced due to the fact that all the input communications are synchronized together, therefore not allowing individual completion.

[Figure 4. Example of a Simulink model that can lead to a deadlock (see Fig. 5). Block labels: I: o!I; A: i?v ; o! A(v); B: i?[[ao,io]]; o! B( ao, io ).]

A typical example is the one depicted in Fig. 5: block A needs to have a complete handshake on its inputs to compute; block I needs to wait until all the blocks fed by its output have captured its value before continuing. For this reason, before concluding the communication with A, it needs to wait for the completion of the communication with B. However, B cannot finish its communication with I until it receives data from A and this can never happen, since A cannot compute until it finishes its input communication with I. So the system is stuck waiting for a condition that will never happen.

4.1.2. Functions or Procedures. A module in Haste can be described as a fully combinational block or as a block with registers (Fig. 6). Data-flow networks usually do not include stages (since data is processed from input to output continuously).
However, in order to increase system throughput, decoupling stages (i.e. registers or latches) can be required. The results presented in Tab. 3 show a large difference in terms of area for the two implementations. As a design trade-off exists between area and speed, it is possible to choose the desired implementation.

4.1.3. Register Placement. As previously mentioned, Simulink models do not have the concept of registers as is usual in digital design. Most standard blocks perform operations regardless of the concept of time. Only a few blocks are related to timing events. We will come back to these blocks later.

Registers are necessary to achieve performance, but we have to decide where to insert them. Since each Simulink block has only one output, whereas it can have more than one input, it is natural to insert registers on the output in order to optimize area. Using the Haste language it is difficult to describe such an implementation, since when you get data from one or more input channels you have to store them into registers, and this results in latching the inputs.

In the present version of the TiDE flow (5.2) the compiler will put registers where the designer has inserted them in the Haste description. In the future release (6.0), the compiler will be able to optimize the number of registers automatically given the required number of decoupling stages. For this reason we


choose to use the more common way to describe modules (with input registers) and let the compiler decide where to put them.

    B: process(
        i0?chan [0..255]
      & i1?chan [0..255]
      & o!chan [0..255]
      ).
    begin
      & ao : var [0..255]
      & io : var [0..255]
    |
      forever do
        ( i0?ao || i1?io ); o! b( ao, io )
      od
    end
    (a)

    B: process(
      & i?chan [0..255]
      & o!chan [0..255]
      ).
    begin
      & ao : var [0..255]
      & io : var [0..255]
    |
      forever do
        i? [[ ao, io ]]; o! b( ao, io )
      od
    end
    (b)

[Panels (c) and (d): handshake sequence diagrams for the blocks o!I, i?v o!A(v), and i?[[ao,io]] o!B(ao,io). In (d) the annotations read "Ra+ cannot be generated" and "Input HS not yet completed".]

Figure 5. A valid Simulink-like diagram (Fig. 4) can be described using separate input channels (a) or with a tupled input (b); the latter can lead to a deadlock as we can see in the sequence diagram that describes its behavior (d), while the former works correctly (c).

4.2. Sampling Blocks

There are three main blocks in Simulink that deal with a fixed sampling time: the "unit delay", the "zero order hold", and the "rate transition" (Fig. 7). These blocks are often used to change the input-to-output data rate of a given function, especially when the system has to deal with interfaces providing (or requiring) data at slower (or faster) data rates. Such blocks are also used when it is necessary to explicitly insert a storage element in a design (e.g. for an accumulator).

The "unit delay" block acts as a memory element, which can also oversample the input data in order to increase the output data rate. The "zero order hold" block can reduce the output data rate. Finally the "rate transition" block is a superset of the previous ones.
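The rate-changing behavior of the first two blocks can be illustrated with a small software model. The sketch below is in Python, not Haste, and is only an approximation: it mirrors the loop structure that Fig. 11 gives for these blocks (emit the held value several times, then read; or read several times, then emit). The function names, the `ratio` and `initial` parameters, and the choice of 0 as the initial stored value are assumptions made for illustration.

```python
from typing import Iterable, Iterator


def unit_delay(samples: Iterable[int], ratio: int = 5, initial: int = 0) -> Iterator[int]:
    """Oversampling memory element: emit the stored value `ratio` times, then read."""
    held = initial  # assumed reset value of the storage element
    source = iter(samples)
    while True:
        for _ in range(ratio):
            yield held
        try:
            held = next(source)  # input acquisition
        except StopIteration:
            return  # no more input: the model simply stops


def zero_order_hold(samples: Iterable[int], ratio: int = 5) -> Iterator[int]:
    """Undersampling element: read `ratio` inputs, then emit only the last one."""
    source = iter(samples)
    while True:
        held = None
        for _ in range(ratio):
            try:
                held = next(source)  # each read overwrites the stored value
            except StopIteration:
                return  # an incomplete group produces no output
        yield held


if __name__ == "__main__":
    print(list(unit_delay([1, 2], ratio=2)))
    print(list(zero_order_hold(range(1, 11), ratio=5)))
```

Cascading the two models with the same ratio returns to the original data rate, with the assumed initial value appearing first; this is the one-sample delay that the block name suggests.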
There are also other blocks (like Buffer and Unbuffer) that are not taken into account now, since these are used less frequently than the previously cited ones.

These blocks are often used in Simulink diagrams for two main purposes:
• introducing an explicit storage element in a design (e.g. an accumulator, a decoupling register in loops, ...);
• an adaptation to different rates in multi-rated systems (e.g. high-speed ADC or DAC interfaced with lower speed circuitry or vice versa).

In the synchronous implementation, these blocks need to both sample and generate data at a given time, according to their parameters (input and output sampling time), to guarantee the same behavior as the Simulink model. In the asynchronous version we need the same behavior, and this can be achieved in two different ways:
• to introduce, in each of these blocks, a clock signal which can be used to derive the desired timing relationships;
• to move the clock interface only to the input blocks,


after which the following sampling block will just have to up-sample (or to down-sample) the incoming data, which was already synchronized.

Since the first solution introduces much interaction with the clock, it will result in more area overhead, whereas the second one maintains clock interactions only on the boundary of the system. For this reason we prefer the latter choice, even though both options are available and can be generated from the same model automatically.

    & sum2 = func (
        & A ? var [255..0]
        & B ? var [255..0]
      ): [511..0].
      ( A + B ) fit [511..0]
    (a)

    & sum2 = proc (
        & A ? chan [255..0]
        & B ? chan [255..0]
        & Y ! chan [511..0]
      ).
    begin
      & a : var [255..0]
      & b : var [255..0]
    |
      forever do
        ( A?a || B?b ); Y ! (a+b) fit [511..0]
      od
    end
    (b)

Figure 6. Examples of fully combinational logic (a) and logic with registers (b).

[Figure 7. Simulink blocks related to signal sampling: Unit Delay (a), Zero-Order Hold (b), and Rate Transition (c).]

5. Simulink to Haste Conversion

5.1. Proposed Flow

Our aim is to integrate TiDE and CodeSimulink together in order to automatically synthesize Simulink models without describing them in Haste by hand. Fig. 8 depicts the proposed integrated tool flow.

Input to this flow is a description of the desired algorithm in Simulink, using either pure Simulink blocks or CodeSimulink blocks (from the CodeSimulink libraries). At this level of abstraction both simulation and architectural exploration are well supported with a fast design-evaluation loop.
In particular this support is based on the Simulink environment being very suitable for dataflow design (e.g. filters, streaming data processing, control systems, ...) and the availability of simulation libraries which help in evaluating the system's behavior (e.g. with respect to the number of bits used in the implementation of each block, the usage of integer, fixed point or floating point representation, the presence of signed or unsigned numbers, ...). These choices can be modified at Simulink level and they affect the simulation, providing the designer with a powerful way to evaluate (manually or with user-developed scripts) the optimal implementation for each block. All these parameters are fully tunable and accessible to the designer, who can easily find the best trade-off between circuit complexity and accuracy required by the system under development.

In case of using pure Simulink blocks, a script can convert such a model into a CodeSimulink-compliant one, introducing all the hardware-specific parameters necessary for our environment in order to simulate and synthesize the desired hardware.
If specified, during this conversion step our scripts can automatically estimate the best characteristics of each block used in the design according to the simulation set in the model.

From this point CodeSimulink scripts will generate:
• a Haste language file which describes the structure present in the Simulink model;
• a list of all the library files needed for the synthesis;
• a set of VHDL files describing the functionality of each block in the model;
• a set of Cadence RTL Compiler scripts which will include all the library files needed and will generate from them a synthesizable Verilog version in order to include them in the standard TiDE flow.

With these files the standard TiDE flow can start, so we can first use htcomp and htmap to generate the Verilog netlist implementing the system structure, after which we can synthesize all the RTL VHDL files into gate-level Verilog ones and finally link them together. After this part the TiDE back-end flow can continue optimizing the design and performing the timing analysis necessary to insert the delay chains required in the control path.

At the end of this process, the final netlist is available and can be used to verify the system behavior at this level of abstraction, a step necessary to also analyze system performance.

In the next sections we describe how each block is converted into Haste and VHDL to allow automatic synthesis.
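The ordering constraints of the flow just described can be written down as a small dependency graph. The sketch below is in Python (3.9 or later, for `graphlib`) and is only a way to visualize which stage must precede which. The stage names are descriptive labels chosen for this illustration, not command lines, and the dependency edges are our reading of the paragraph above.

```python
from graphlib import TopologicalSorter

# Each key is a stage; its value is the set of stages that must finish first.
# Labels are illustrative only and do not correspond to real tool invocations.
FLOW_DEPENDENCIES = {
    "htcomp": {"codesimulink_scripts"},        # Haste file -> handshake circuits
    "htmap": {"htcomp"},                       # handshake circuits -> Verilog netlist
    "rtl_compiler": {"codesimulink_scripts"},  # RTL VHDL -> gate-level Verilog
    "htlink": {"htmap", "rtl_compiler"},       # link both Verilog netlists
    "backend": {"htlink"},                     # optimization, timing, delay chains
    "netlist_verification": {"backend"},       # simulate the final netlist
}


def flow_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order for the stages."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, stage in enumerate(flow_order(FLOW_DEPENDENCIES), start=1):
        print(step, stage)
```

The graph makes one property of the flow explicit: the Haste branch (htcomp, htmap) and the RTL branch (RTL Compiler) are independent of each other until htlink, so their relative order is not fixed.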


[Figure 8. Proposed Simulink to Haste flow. Diagram labels: CodeSimulink Front-End Flow (Simulink Model, modelConvert, CodeSimulink Model, DigHwCompile, HDL Code, Haste Code); Timeless Design Environment Flow (RTL VHDL, RTL Compiler, Verilog Netlist; Haste Description, htcomp, Handshake Circuits, htmap, Verilog Netlist; htlink; TiDE Flow Back-End).]

[Figure 9. Structure of a Simulink block described in Haste. Diagram labels: Functional Block, Protocol Controller.]

5.2. Block Structure

The structure of a Simulink block implemented in Haste will be the one depicted in Fig. 9. In that figure we can see how the block can be divided into:
• a functional circuit, which implements the combinational logic function of the block;
• a protocol controller, which coordinates all the operations within the block and describes the elements used for storage.

The automatic conversion of Simulink models into Haste and Verilog descriptions is based on the CodeSimulink environment. Thanks to this we reuse everything already developed for that environment in order to reduce both development and debug time.

Each Simulink block is automatically converted into Haste code (for controlling the data flows) and into Verilog code (for data elaboration).
Haste is used to describe the top module and the communication infrastructure among blocks. The TiDE flow uses Haste as its main specification language, but it is also possible to insert Verilog components, whereas CodeSimulink uses VHDL as the specification language for its implementation. CodeSimulink uses a library-based approach in which each block has a parametric description that is called when needed in the top module. In order to reuse this extensive library we decided to use a commercial synthesizer to convert its VHDL behavioral descriptions into gate-level Verilog netlists.

The proposed flow merges the CodeSimulink and the TiDE one, as depicted in Fig. 8. We start from a Simulink model (developed with pure Simulink blocks or with CodeSimulink ones); using the scripts we developed, such a model is converted into Haste and Verilog, which are processed differently along the TiDE flow.

In order to reuse the RTL libraries available in CodeSimulink and to guarantee modularity and maintainability, each block is divided into different files. In this way the common part can be shared among blocks without the burden of rewriting existing code.
Each module is composed of:
• a structure definition of the design, made of a single Haste file;
• a set of VHDL files, one for each block in the diagram, that describes the RTL behavior of the block.

Not all the blocks will have this structure: interfaces with synchronous environments and sampling blocks are quite different, since their function is more related to the protocol than to the processing part (which is usually not present). For this reason such blocks have been described completely in Haste. The following sections provide details on these categories after having described the common parts.

5.2.1. The Haste Shell. According to the ideas exposed in Sec. 4, the skeleton on which a block will be built needs to interface input data with the desired logic function and the results returned by such logic function to the block output. Figure 10 shows the Haste shell structure for a two-input adder; as we can notice, the structure is straightforward. Each block is represented by a Haste procedure in which each


input and each output is listed as an input or an output channel respectively. In the body of the procedure only the interface operations are performed: inputs are read and outputs are generated by the external function associated to the block itself. Note the order of execution: the inputs are collected in parallel, and only when all of them are available are the outputs generated.

    & sim_sum2 = proc (
        & Y1 ! chan VECTOR_17
        & A1 ? chan VECTOR_16
        & A2 ? chan VECTOR_16
      ). begin
        & v_A1 : var VECTOR_16
        & v_A2 : var VECTOR_16
      | forever do
          // input acquisition
          ( A1 ? v_A1 || A2 ? v_A2 )
          // output generation
          // (sim_sum2_f is imported)
          ; Y1 ! sim_sum2_f(.A1( v_A1 ), .A2( v_A2 ) )
        od
      end

Figure 10. Haste shell for a 2-input adder.

5.2.2. Sampling Blocks. Sampling blocks can have different implementations: synchronized with a global clock, in order to slow down the circuit operation (to make it operate at a certain Sampling Time), or completely asynchronous (see Sec. 4.2). In both modes the input data rate can differ from the output one. Using these blocks it is possible to make a multi-rate system in which the data rate is increased (using a unit delay block) or decreased (using a zero order hold block). Figure 11 shows the Haste description of such blocks.

5.2.3. RTL Processing Part / Parametric RTL Description. Each block has a set of parameters that can be configured to make the module able to deal with different scenarios (serial or parallel input/output representation, different data width, ...) and all these parameters can be configured in the VHDL description. For each block an HDL file will be generated with all the desired parameters set, together with an RTL Compiler script that can synthesize it into a Verilog netlist.

5.3. Simulink to CodeSimulink Conversion

The typical approach used to develop a design that should be converted into hardware is to build a diagram using CodeSimulink blocks from the start. The advantage of starting with CodeSimulink blocks instead of Simulink blocks is that their simulation behavior matches that of their hardware implementation. Since the CodeSimulink block set is one-to-one compatible with the standard Simulink one, we also provide a conversion utility which automatically converts a pure Simulink model into a CodeSimulink one by setting the parameters needed for the implementation according to the simulation results of the model.

    & sim_ud = proc (
        & Y1 ! chan VECTOR_16
        & A1 ? chan VECTOR_16
      ). begin
        & v_A1 : var VECTOR_16
      | forever do
          // output generation (oversampled)
          for 5 do ( Y1 ! v_A1 ) od
          // input acquisition
          ; A1 ? v_A1
        od
      end
    (a)

    & sim_zoh = proc (
        & Y1 ! chan VECTOR_16
        & A1 ? chan VECTOR_16
      ). begin
        & v_A1 : var VECTOR_16
      | forever do
          // input acquisition
          for 5 do ( A1 ? v_A1 ) od
          // output generation (undersampled)
          ; Y1 ! v_A1
        od
      end
    (b)

Figure 11. Haste descriptions of a "unit delay" block (a) and of a "zero order hold" block (b), both with an over-/undersampling ratio of 5.

5.4. System Description

Now that we have introduced the structure of each block in the design, we will explain how the whole system is described.

The main Haste file is composed of different sections (see Fig.
12):
• the definition of the types used across the design;
• the definition of the system interface;
• the external RTL functions import;
• the Haste declaration of each block;
• the block instance and connection.

6. Case Study: a Commercial Audio CODEC

To test our methodology we apply it to a Simulink model of a commercial Audio CODEC. Such a model describes one of the two channels in a stereo audio chip implementing a Sigma-Delta modulator [17].
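Before looking at the results, it is worth noting that the block shells of Sec. 5.2.1 are regular enough to be produced from a template. The sketch below is in Python and is not the CodeSimulink implementation, which the paper does not show. The function `haste_shell` and its parameters are hypothetical names, and the emitted text only imitates the layout of Fig. 10; whether htcomp accepts it exactly as printed has not been checked.

```python
def haste_shell(name, output, inputs, function):
    """Build shell text in the style of Fig. 10 (illustrative template only).

    name     -- procedure name, e.g. "sim_sum2"
    output   -- (port, type) pair for the single output channel
    inputs   -- list of (port, type) pairs for the input channels
    function -- name of the imported function computing the output
    """
    out_port, out_type = output
    # Port list: the output channel first, then one channel per input.
    ports = [f"& {out_port} ! chan {out_type}"]
    ports += [f"& {port} ? chan {typ}" for port, typ in inputs]
    # One local variable per input, used to latch the received value.
    variables = [f"& v_{port} : var {typ}" for port, typ in inputs]
    # All inputs are read in parallel, then the function result is sent.
    reads = " || ".join(f"{port} ? v_{port}" for port, _ in inputs)
    args = ", ".join(f".{port}( v_{port} )" for port, _ in inputs)

    lines = [f"& {name} = proc ("]
    lines += [f"    {port}" for port in ports]
    lines += ["  ). begin"]
    lines += [f"    {var}" for var in variables]
    lines += [
        "  | forever do",
        "      // input acquisition",
        f"      ( {reads} )",
        "      // output generation",
        f"      ; {out_port} ! {function}({args})",
        "    od",
        "  end",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(haste_shell(
        "sim_sum2",
        ("Y1", "VECTOR_17"),
        [("A1", "VECTOR_16"), ("A2", "VECTOR_16")],
        "sim_sum2_f",
    ))
```

Called with the names used in Fig. 10, the function prints a two-input adder shell with the same structure as that figure. Sampling blocks would need their own templates, since they do not follow this read-then-compute pattern.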


    & VECTOR_16 = type [0..2^16-1]
    & VECTOR_17 = type [0..2^17-1]
    & VECTOR_32 = type [0..2^32-1]

    & datapath = main proc (
        & O ! chan VECTOR_32
        & A ? chan VECTOR_16
        & B ? chan VECTOR_16
      ).
    begin
      // Internal channel declaration
      & Y1_6 : chan VECTOR_16 broad
      // ...
      // External function declaration
      & Sum = func (
          & A1 ? var VECTOR_16
          & A2 ? var VECTOR_16
        ): VECTOR_16. import
      // ...
      // Haste shell description of each block
      & Sum_sh = proc (
          & Y1 ! chan VECTOR_17
          & A1 ? chan VECTOR_16
          & A2 ? chan VECTOR_16
        ).
        begin
          & v_A1 : var VECTOR_16 := 0
          & v_A2 : var VECTOR_16 := 0
        |
          forever do
            ( A1 ? v_A1 || A2 ? v_A2 )
            ; Y1 ! Sum( .A1(v_A1), .A2(v_A2))
          od
        end
      // ...
    |
      // Block instance and connection
      // ...
      || Sum_sh ( .Y1( Y1_6 ),
                  .A1( Y1_8 ), .A2( Y1_3 ))
      // ...
    end

Figure 12. Example of the Haste code generated for the main procedure.

This model is quite complex, since it is composed of about 150 blocks, including about 30 16-bit wide multiplications by constant values, 15 8-bit wide multipliers, and 30 16-bit wide adders. It has been used to develop a hand-written implementation in Haste. Thanks to the collaboration with an industrial partner we had access to synthesis results of this asynchronous hand-written version and we could compare it with the Haste version generated by our tool. Comparisons for both versions are based on optimized pre-layout netlists mapped onto the same technology library.

The results of this analysis are reported in Tab. 4.

Table 4. Synthesis result comparisons of the same Simulink model in different implementations. The designs have been implemented using a 180nm technology library.

    Design              Hand written     Automatic Generated
    Tool                TiDE 5.2         TiDE 5.2     TiDE 6.0
    Sequential [µm²]    32018            89792        11632
    Logic [µm²]         138244           357368       152468
    Total [µm²]         173694           468746       164100
    Overhead            -                +170%        -5.5%
    Coding time         about 1 week     20 minutes

In this table we compare the hand-written Haste code with two versions of the automatically generated one: the first is passed through version 5.2 of the TiDE flow, while the second has been processed with the new pre-release version (6.0). Unfortunately it was not possible to compile the hand-written version with the TiDE 6.0 flow, since it no longer supports some low-level constructs available in the old release. We can notice a number of differences between the three versions proposed. The designs are not architecturally the same, since the number of registers is not the same in all of them. This is due to the code generated (or written):
• for the hand-written code, most of the blocks in the Simulink model have been implemented using Haste functions [15]. The number of blocks for which the designer decided to insert registers is small compared to the total number of blocks.
• for the TiDE 5.2 version, each block has registers on its inputs, which results in a high overhead, since many of them are not required.
• for the TiDE 6.0 version, the compiler automatically decides the minimum number of registers required for the described circuit.

For the reasons above, we can conclude that at the moment the code generated automatically and compiled with the TiDE 6.0 version represents the lower bound with respect to the number of registers. On the other hand the same design compiled with the 5.2 version is the upper bound, since the granularity at the Simulink level is very fine.

Since our work was targeted at the TiDE 6.0 version, the results shown in Tab. 4 are promising. The achieved implementation based on this new flow (see footnote 2) requires less area than the hand-written counterpart.

Footnote 2. At the moment TiDE 6.0 is not complete; indeed some operations have to be performed by hand, but the optimizations performed by the tool are stable and will not change significantly with the official tool release.

In order to guarantee the circuit equivalence, we simulate the netlist generated by TiDE 5.2 from the hand-written code and the automatically generated one with the same test bench. Since we did not have access to the test benches used to develop the original version, we had to create a new test bench based on the data streams derived from the Simulink simulation. Because we are still working on a feature which generates input patterns directly from the


Simulink environment, we had to do the verification partially by hand. The result of this analysis demonstrates the functional equivalence of the two circuits (with respect to the simulation performed).

The results shown do not include any figures on timing, since we did not have access to this data for the hand-written version. The only timing constraint we had was to be able to process all the samples provided at a given data rate, and this was easily achieved by the automatically generated code compiled with both TiDE flows.

7. Conclusions and Future Work

This paper has shown how we address a complete flow for generating asynchronous circuits starting from Simulink diagrams, using Haste as intermediate description language. At the moment only a subset of Simulink blocks is supported, but the methodology can easily be extended in order to cover all blocks.
Our proposal has been used on a commercial model of an audio CODEC, showing appealing advantages (code reuse, time-to-market reduction) without introducing area overhead.

A number of optimizations and improvements can still be added:
• high-level optimizations, such as block merging, in order to reduce the number of asynchronous controllers inserted in the circuit;
• integration between the Simulink and the RTL simulation, in order to have the same test set for both abstraction levels and, consequently, shorten the testing phase;
• a way to automatically select where to insert pipeline stages in the design.

We are looking forward to all these optimizations, since they can further reduce area overhead and development time. Moreover, we are working on applying the different methodologies presented here, in [8], and in [13], in order to compare which one produces better results.

8. Acknowledgment

We would like to thank Luciano Lavagno, who helped us during the writing of this paper with remarks and suggestions.

References

[1] E. A. Lee, S. Neuendorffer, and M. J. Wirthlin, “Actor-oriented design of embedded hardware and software systems,” Journal of Circuits, Systems, and Computers, 2002.
[2] W. Wong, “Model-Based Design,” Electronic Design, March 2006. [Online]. Available: http://electronicdesign.com/Files/29/12086/12086_01.pdf
[3] A. Taubin, J. Cortadella, and L. Lavagno, Design Automation Of Real-Life Asynchronous Devices And Systems. United States: Now Publishers Inc, 2007.
[4] C. Van Berkel, M. Josephs, and S. Nowick, “Scanning the technology: Applications of asynchronous circuits,” Proceedings of the IEEE, vol. 87, no. 2, February 1999.
[5] The MathWorks. Simulink on-line documentation. [Online]. Available: http://www.mathworks.com/products/simulink/
[6] Xilinx. System Generator. [Online]. Available: http://www.xilinx.com/ise/optional_prod/system_generator.htm
[7] Altera. DSP-Builder. [Online]. Available: http://www.altera.com/products/software/products/dsp/dspbuilder.html
[8] The MathWorks. HDL-Coder. [Online]. Available: http://www.mathworks.com/products/slhdlcoder/
[9] E. A. Lee and D. G. Messerschmitt, “Dataflow process network,” Proceedings of the IEEE, vol. 83, no. 5, May 1995.
[10] L. Reyneri, F. Cucinotta, A. Serra, and L. Lavagno, “A hardware/software co-design flow and IP library based on Simulink,” DAC, June 2001.
[11] E. Bellei, E. Bussolino, F. Gregoretti, L. Mari, F. Renga, and L. Reyneri, “Simulink-based codesign and cosimulation of a common-rail injector test bench,” Journal on Computer, Systems and Circuits, vol. 12, pp. 171–202, 2003.
[12] I. E. Sutherland, “Micropipelines,” Communications of the ACM, vol. 32, no. 6, June 1989.
[13] M. Tranchero and L. Reyneri, “Automatic generation of self-timed circuits from Simulink specifications,” International Conference on Electronics, Circuits and Systems, December 2007.
[14] TiDE Manual, Internal documentation, Handshake Solutions, 2007.
[15] A. Peeters and M. de Wit, Haste Manual, Handshake Solutions, 2007. [Online]. Available: http://www.handshakesolutions.com
[16] C. Hoare, “Communicating sequential processes,” Communications of the ACM, vol. 21, pp. 666–677, 1978.
[17] P. Allen and D. Holberg, CMOS Analog Circuits Design. New York: Oxford University Press, 2002.
