Integrating Compiler and System Toolkit Flow for Embedded VLIW DSP Processors

Chi Wu, Kun-Yuan Hsieh, Yung-Chia Lin, Chung-Ju Wu, Wen-Li Shih, S. C. Chen, Chung-Kai Chen, Chien-Ching Huang, Yi-Ping You, Jenq Kuen Lee
Department of Computer Science
National Tsing-Hua University
Hsinchu 300, Taiwan
Email: {wuchi, kyhsieh, yclin, jasonwu, wlshih, scchen, ckchen, cchuang, ypyou}@pllab.cs.nthu.edu.tw
Email: {jklee}@cs.nthu.edu.tw

Abstract

To support high performance and low power for multimedia applications and hand-held devices, embedded VLIW DSP processors are a research focus. Under tight resource constraints, distributed register files, variable-length instruction encodings, and special data paths are frequently adopted. These features create challenges in deploying software toolkits for new embedded DSP processors. This article presents our methods and experiences in developing software and toolkit flows for PAC (Parallel Architecture Core) VLIW DSP processors. Our toolkits include compilers, assemblers, a debugger, and DSP micro-kernels. First, we retarget the Open Research Compiler (ORC) and its toolkit chains to the PAC VLIW DSP processor and address the issues of supporting distributed register files and ping-pong data paths for embedded VLIW DSP processors. Second, the linker and assembler support variable-length encoding schemes for DSP instructions. In addition, the debugger and DSP micro-kernel were designed to handle dual-core environments. The footprint of the micro-kernel is around 10K, which addresses the code-size issues of embedded devices. We also present experimental results for the compiler framework, which incorporates software pipelining (SWP) policies for the distributed register files of the PAC architecture. The results indicate that our compiler framework gains a performance improvement of around 2.5 times over code generated without our proposed optimizations.

1. Introduction

With the fast growth of consumer electronic products, the demand for digital signal processing related to multimedia, such as image and video processing, has also increased significantly. To meet the performance challenges, high-end embedded processors and DSP processors are moving towards intensive exploitation of instruction-level parallelism (ILP). In addition, the tight resources of embedded systems lead DSP processors to adopt architectural features such as distributed register files, which reduce the number of register read and write ports and thereby power consumption [1]; special data paths that exploit the characteristics of DSP applications; variable-length instruction encoding schemes that reduce code size; and sub-word instructions for video applications. These new features create challenges in deploying software toolkits for new embedded DSP processors.

This article presents our methods and experiences in developing software and toolkit flows for PAC (Parallel Architecture Core) VLIW DSP processors. PAC is a five-way VLIW DSP processor with distributed register cluster files and multi-bank register architectures (known as ping-pong architectures) [2] [3] [4]. Our toolkits include compilers, assemblers, a debugger, and DSP micro-kernels. We first retarget the Open Research Compiler (ORC) [5] and its toolkit chains to the PAC VLIW DSP processor and address the issues of supporting distributed register files and ping-pong data paths for embedded VLIW DSP processors. We also deploy software pipelining techniques that take distributed register file architectures into account. The linker and assembler of our toolkits support variable-length encoding schemes for DSP instructions. In addition, the debugger was designed to handle dual-core environments and is integrated with the Eclipse IDE. The footprint of the micro-kernel is around 10K, which addresses the code-size issues of embedded devices. We also present experimental results for the compiler framework, which incorporates software pipelining policies for the distributed register files of the PAC architecture. Results indicated that

Proceedings of the 12th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA'06) 0-7695-2676-4/06 $20.00 © 2006


communication will consume additional 2-cycle latency.

2.2. Software Flow

We integrate the compiler, binutils, debugger, and simulator on Eclipse, an integrated development environment (IDE) platform, to provide a convenient development environment for application developers. The integrated software solution is shown in Figure 3.

Figure 3. Software flow for PAC platforms (diagram: the Eclipse IDE platform with the CDT plug-in hosts the PAC C compiler, with SWP for distributed register files, copy propagations for irregular register files, the SA-style instruction scheduler, and the PALF scheduler; the PAC binutils, with assembler, disassembler, linker, and objcopy; and the PAC debugger, with user interface, symbol side, and target side, connected to the PAC DSP ISS and PAC DSP ICE. C/C++ source files are compiled to assembly files, assembled to object files, and loaded onto the PAC DSP processor.)

A graphical user interface (GUI) integrates the representation of variables, disassembly, registers, and breakpoint/watchpoint information. Users can develop programs for PAC platforms through the IDE. With the support of the C development tools (CDT) plug-in, Eclipse [6] integrates the compiler, binutils, debugger, and simulator/emulator in one user interface. The compiler invokes the assembler and linker to produce the final ELF executable file, which can run on PAC. Programmers can also fetch information, such as variables and registers, from the debugger to identify bugs in programs.

3. Compiler

The compiler for PAC DSP is based on ORC, an open-source compiler infrastructure released by Intel that incorporates most industry-strength optimization techniques. This compiler is capable of generating code with good performance on its original IA-64 target by exploiting a number of EPIC/VLIW architectural advantages. The preliminary porting from IA-64 to PAC DSP includes a new implementation of the machine description tables and the essential support for PAC DSP code generation. Some optimization phases, such as copy propagation for irregular register files, code generators for PAC assembly, the SA-style instruction scheduler, and the PALF scheduler, have been published [7–9]. Our compiler with software pipelining optimization support for PAC DSP is now available. In this section, we focus on supporting the basic VLIW compiler infrastructure for PAC DSP processors in SWP optimization.

3.1. Overview of PAC DSP compiler

Compilation starts with the front-ends, which generate an intermediate representation (IR) of the source program and feed it to the back-end. The IR, called WHIRL, is part of the Pro64 compiler released by SGI [10]. It includes five representation levels, from “very high” to “very low”. Each level is invoked in the back-end to perform a series of lowering processes and optimizations on the WHIRL IR. In practice, the optimization phases are organized as dynamically shared libraries, loaded and executed on demand by the back-end. Users of the PAC DSP compiler can turn different optimization phases on and off. The optimizations at these levels are the inter-procedural analysis/optimizer, loop nest optimizer, global optimizer, and code generator.

The code generator takes over after the lowering and optimization phases, translating the WHIRL IR into CGIR (Code Generation Intermediate Representation), a low-level IR that reflects the instruction set architecture of the PAC DSP processor. Global and local register allocation and final assembly code emission are performed here. Moreover, the code generator also performs many target-dependent optimization phases, including control flow optimization, the extended block (peephole) optimizer, integrated global/local scheduling (IGLS), hyperblock formation, CG loop analysis and transformation, and software pipelining.

3.2. SWP optimization for PAC

Compiler techniques such as SWP and global instruction scheduling have proven to be necessary and effective methods for increasing the degree of ILP in programs. The original software pipelining implementation follows the papers by Richard Huff and B. R. Rau [11, 12]. Unfortunately, they do not discuss how to deal with such a highly partitioned register file architecture.

Providing software pipelining optimization for the PAC DSP architecture involves considerable work. The architectural differences between IA-64 and PAC are listed in Table 1. Each of them is described separately below.

PAC DSP is a clustered architecture but IA-64 is not.
Thus, the cluster assignment phase and the insertion of inter-cluster communication instructions need to be implemented in the compiler. The goal of the cluster assignment phase is to partition a loop into two sets that execute on different clusters without increasing the minimum initiation interval (MII), which is determined by the dependence relationships and the utilization of functional units.

Table 1. IA-64/PAC architecture comparison

Hardware features            IA-64   PAC
Multiple clusters            no      yes   Cluster assignment and inter-cluster communication
Distributed register files   no      yes   Register bank assignment
Ping-pong architecture       no      yes   Ping-pong constraint
Rotating register files      yes     no    Modulo variable expansion
Predication support          yes     yes   Reuse the methods in ORC

Each inserted communication instruction causes a 3-cycle delay: one cycle for instruction issue and two for additional instruction latency. To avoid increasing the MII, these communication instructions should not be placed on the critical path. In other words, the instructions on the critical path that determines the recurrence MII (RecMII) should be assigned to one cluster.

To deal with PAC's distributed register files, register bank assignment is necessary. Given the different types of register file, global and local, in one cluster, the problem is which kind of register file should be assigned to each instruction. If data must be transferred between two instructions executed by different functional units, the M-unit and the I-unit, in one cluster, then the global register file is assigned to them. However, this arrangement introduces a new problem with the global registers: the ping-pong constraint. Because of the ping-pong constraint, we constructed an additional phase to ensure that the instruction scheduler does not arrange instructions into the worst case, as in the following example. The motivating example illustrates how the ping-pong constraint damages performance.

Figure 4. Motivating example (data dependence graph over nodes A to E, split between the I-unit and the M-unit; edges are labeled (Delay, Omega) with the values (1,0) and (1,1))

Figure 4 is a data dependence graph (DDG) built by our compiler. The orange nodes in Figure 4 are executed by the I-unit and the blue nodes by the M-unit.
Delay indicates how many cycles must elapse before the data is ready, and Omega represents a dependence between different iterations. Due to the special register file organization, data transfer between the M-unit and the I-unit is performed by ping-pong bank switching. In this motivating example, the result of instruction A is used by instruction B, so the register allocator allocates a global register to instruction A. Without any constraint, instructions B and E could be scheduled in the same cycle. However, the ping-pong constraint forbids this arrangement because the M-unit and the I-unit cannot access the same register file in a single cycle. The code generator obtains the MII when it performs software pipelining optimization. The MII of this DDG is dominated by the RecMII, and its value is 4 without any constraint. Because of the ping-pong constraint, instructions B and E cannot be placed in the same cycle. The worst case is that instruction E is placed first, followed by instruction B. The MII calculated by the compiler is then 5, as shown in Figure 5. This example indicates that the ping-pong constraint can damage overall performance.

Figure 5. Scheduling result for motivating example

Our new decision phase is built into SWP to solve this problem. This decision phase identifies which node should be scheduled first when the ping-pong constraint occurs. In this example, the ping-pong constraint occurs between instructions B and E; therefore, the decision phase makes sure that instruction B is scheduled before instruction E. Unfortunately, in some situations increasing the MII is unavoidable under the ping-pong constraint, as shown in Figure 6.

Figure 6. Situation for increasing MII (data dependence graph over nodes A to E on the I-unit and the M-unit; the (Delay, Omega) edge labels include (3,1), (1,1), and (1,0))

Again, the result of instruction A is used by instruction B, which belongs to a different functional unit, so the register allocator allocates a global register to instruction A. In this situation, no matter which instruction, B or E, is scheduled first, the MII increases.

Moreover, IA-64 has hardware support for rotating register files, which improves SWP performance, but PAC does not. Therefore, modulo variable expansion also needs to be built into the compiler. Rotating register files solve the lifetime overlap problem of variables. So far PAC DSP does not provide such hardware support; hence unrolling and renaming are used to handle this problem. At present, SWP optimization for single-basic-block Do-loops is available in the compiler for PAC DSP. Experimental results indicate that SWP optimization with the proposed methods is effective at minimizing the performance damage caused by the ping-pong constraint and other hardware features of PAC DSP.

4. Toolkits development for PAC

In this section, we briefly describe our experiences in developing toolkit chains for this specific architecture.

4.1. Assembler/Linker

The PAC DSP Binary Utility is the collection of binary/object file tools for users who would like to build or manipulate PAC DSP executables. The collection contains the assembler, linker, disassembler, and other object file manipulation tools. The compiler invokes the assembler and linker to produce the desired ELF (Executable and Linking Format) executable files.

The binary format of PAC DSP instructions adopts variable-length encoding, an efficient strategy for building compact binary representations of different assembly code. Variable-length encoding encodes operators, registers, immediate values, and any other meaningful data into bit fields with the minimum number of bytes.
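As a concrete illustration of encoding an immediate value in the minimum number of bytes, the following is a minimal Python sketch, not the PAC DSP Binary Utility's actual code. It assumes that the hardware sign-extends a shortened immediate back to 32 bits and that only 1-, 2-, and 4-byte immediate fields exist; the function names `sign_extend` and `min_immediate_bytes` are hypothetical.

```python
def sign_extend(value: int, nbytes: int) -> int:
    """Interpret the low `nbytes` bytes of `value` as a two's-complement number."""
    bits = 8 * nbytes
    value &= (1 << bits) - 1            # keep only the low bytes
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def min_immediate_bytes(imm32: int) -> int:
    """Smallest field size (assumed to be 1, 2 or 4 bytes) whose sign
    extension reproduces the original 32-bit immediate."""
    imm32 &= 0xFFFFFFFF
    full = sign_extend(imm32, 4)        # value the hardware must end up with
    for nbytes in (1, 2, 4):
        if sign_extend(imm32, nbytes) == full:
            return nbytes
    return 4                            # not reached: 4 bytes always match


if __name__ == "__main__":
    for imm in (0x0000000A, 0x00000080, 0xFFFFFFFF, 0x12345678):
        print(f"0x{imm:08X} -> {min_immediate_bytes(imm)} byte(s)")
```

Under these assumptions, `0x0000000A` fits in one byte, while `0x00000080` needs two bytes because a single byte `0x80` would sign-extend to a negative value.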
Code size is saved by using shorter encoding lengths. Therefore, we prefer to encode instructions according to their frequencies of appearance: the more frequent instructions use the shorter encoding lengths, and vice versa.

In addition to supporting variable-length encoding in the PAC DSP Binary Utility, we also developed schemes to further reduce code size. The major idea of our scheme is to compress immediate values in the binary format. Even if we reserve a 32-bit field in the binary format for an immediate value, it is not necessary to encode the immediate value with the complete 32 bits. The basic compression concept is presented in Figure 7.

Figure 7. Compressing immediate value (the 4-byte immediate 0x0000000A, stored as 00 00 00 0A in bits 31..0 after the basic encoding in bits 39..32, is compressed either to the 2-byte form 0x000A in bits 15..0 with the basic encoding in bits 23..16, or to the 1-byte form 0x0A in bits 7..0 with the basic encoding in bits 15..8)

Since PAC DSP hardware is able to fetch the immediate value and automatically sign-extend it to 32-bit data, the PAC DSP Binary Utility tries to rebuild the original machine encoding by compressing the immediate value into fewer bytes. Each machine encoding in the object file is rescanned to find out whether its immediate value can be compressed. If it can, the PAC DSP Binary Utility shortens the encoding. Several steps accomplish the compression. First, the most significant bit of the immediate value is examined to identify whether the value is signed or unsigned. Second, we count the number of signed/unsigned bits in the original value encoding and discard the extra bits to form a shorter binary representation of the immediate value. Next, we determine the minimum number of bytes that can represent the value in terms of the revised immediate value length and the original signed/unsigned expression, taking into account that the hardware performs sign extension. Finally, the basic encoding of the instruction and the new immediate binary representation are recombined into a single binary machine encoding. Following the steps above, it is natural to examine all compressible values of the machine encodings in an object file and reduce the overall code size of an application.

Another scheme for improving variable-length encoding is to omit the predicate bit field. According to the PAC DSP instruction set architecture, almost all instructions are predicatable. This implies that predicate information should be encoded into the machine code. However, in a practical application, not every instruction needs to be marked as predicated. For instructions that will certainly be executed, we can discard their predicate information. This also saves a little code size.

To summarize our work on the PAC DSP Binary Utility, we not only deal with the complicated encoding introduced by the PAC DSP architecture but also provide more advanced code size reduction strategies.

4.2. Debugger

To provide efficient support for PAC DSP application developers in finding and reducing bugs or defects in programs, the GNU debugger (GDB) is employed as the basis of the debugging environment. In general, GDB consists of three major subsystems: user interface, symbol handling (the symbol side), and target system handling (the target side) [16].
The target side is an architecture-related component, which provides GDB with the common functions to access targets.

As illustrated in Figure 8, the PAC DSP GDB handles the debug information of object files generated by the PAC DSP compiler on the symbol side, and communicates with the instruction set simulator (ISS) or in-circuit emulator (ICE) via the GDB Remote Serial Protocol (GDB RSP) [17]. The GDB RSP defines the rules governing communication between debuggers and targets. In the PAC DSP GDB, a reduced and integrated GDB RSP for both ISS and ICE is proposed to control targets. For instance, we define that memory addresses must be word-aligned when accessing memory. Various protocols for setting breakpoints and watchpoints in the debugged programs are also defined. The modifications in our PAC GDB RSP decrease the complexity of designing the remote stub for both the PAC DSP ICE and the PAC DSP ISS. Besides physical registers, the PAC GDB RSP defines pseudo registers to transfer system information such as interrupts, timers, and control registers. By using pseudo registers, users can inspect sufficient system information at run time.

For source-level debugging, a translation from the register numbering used by the PAC DSP compiler to the numbering defined in the PAC GDB RSP is provided in the target system handling subsystem. The translation bridges the gap between the debug information and the PAC GDB RSP to guarantee the correctness of the collected information.

Figure 8. The PAC GDB setup (diagram: C source files are compiled by the PAC compiler into ELF-formatted object files; the GDB user interface, symbol handling (symbol side), and target system handling (target side) communicate through the PAC GDB RSP with the PAC DSP ICE and the PAC DSP ISS)

5. OS support

pCore (passive/compact runtime environment) is a microkernel operating system designed for PAC VLIW DSP processors in the dual-core/multi-core scenario; it tries to provide minimal but sufficient OS services with a small kernel. It is a preemptive, multitasking, static microkernel that cooperates with the embedded OS running on the main processor of a dual-core/multi-core platform and allows application developers to build the software infrastructure of PAC DSP processors. Figure 9 shows an architecture overview of the pCore microkernel, which is basically composed of resource management, synchronization modules, inter-process communication, and some kernel services.

Figure 9. pCore architecture overview (diagram modules: dual-core/multi-core support and communication protocols; IPC with mailbox and pipe; process scheduler; task management; synchronization with semaphore; interrupt handling with hardware and software interrupt handlers; kernel service; device driver; DSP initialization; boot kernel; devices and tasks)

The kernel structure defines a task as a running program on pCore that is managed by the kernel. The kernel maintains the runtime information of each task in a task control block (TCB). The TCB data structure in pCore contains no indirect pointers, which makes access to its data fields more efficient. Furthermore, the amount of memory required for TCBs is configurable once the running applications are determined. pCore adopts a priority-based scheduling algorithm to make the whole system deterministic and to give the master processor better control over it under the master-slave programming model.
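The TCB and scheduling policy described above can be modeled roughly as follows. This is a minimal Python sketch of the idea rather than pCore's implementation: the field names, the `TaskTable` class, the fixed table size, and the convention that a larger number means higher priority are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

READY, BLOCKED = "ready", "blocked"


@dataclass
class TCB:
    """Flat task control block: plain fields only, no nested references."""
    task_id: int
    priority: int            # assumption: larger number = higher priority
    state: str = READY
    stack_pointer: int = 0


class TaskTable:
    """Statically sized TCB table; the size is fixed at configuration time."""

    def __init__(self, max_tasks: int) -> None:
        self.slots: List[Optional[TCB]] = [None] * max_tasks

    def create(self, task_id: int, priority: int) -> TCB:
        # Place the new task in the first free slot of the fixed table.
        for i, slot in enumerate(self.slots):
            if slot is None:
                tcb = TCB(task_id, priority)
                self.slots[i] = tcb
                return tcb
        raise RuntimeError("TCB table full")

    def pick_next(self) -> Optional[TCB]:
        # Highest-priority ready task; ties go to the lowest slot index,
        # so the decision depends only on states and priorities.
        ready = [t for t in self.slots if t is not None and t.state == READY]
        return max(ready, key=lambda t: t.priority) if ready else None


if __name__ == "__main__":
    table = TaskTable(max_tasks=3)
    table.create(task_id=1, priority=5)
    urgent = table.create(task_id=2, priority=9)
    print(table.pick_next().task_id)   # task 2 runs first
    urgent.state = BLOCKED
    print(table.pick_next().task_id)   # then task 1
```

Because the table size is fixed at construction and ties are broken by slot order, the choice of the next task is fully determined by task states and priorities, which is the kind of determinism the text attributes to priority-based scheduling.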
Communication between tasks in pCore is based on a message-passing scheme, in which a message is a simple data structure that contains the identifiers of the sender and receiver, a pointer to data, a data field, and some other useful information. Two basic mechanisms based on message passing, PIPE and MAILBOX, are provided for the programmer to perform interprocess communication.

In a multitasking system like pCore, synchronization between tasks is required to ensure the correctness of concurrent access to data. pCore supports a mechanism for creating critical sections using counting semaphores, which are implemented by disabling interrupts. As the target processors, PAC DSP processors are intended to be employed in dual-core/multi-core platforms. pCore provides friendly and efficient system calls for programmers to register their own service routines. These system calls allow programmers to migrate easily to any customized dual-core/multi-core communication protocol.

pCore is a tiny microkernel-based OS designed for PAC DSP processors with a very small kernel size; it implements only a minimal set of services, such as inter-process communication, a scheduler, and some other kernel services. Furthermore, it supports the dual-core/multi-core programming model by providing the necessary system calls for users to port any customized dual-core/multi-core communication protocol to it.

6. Results

Preliminary experiments were done by running the DSP-stone benchmark suite [20] on the PAC DSP. Since the PAC DSP compiler is still in progress, we only evaluated optimizations at the O1 level for the early-stage performance evaluation of our designs. Some programs in the DSP-stone benchmark suite are not processed by SWP optimization because the trip counts of their loops are too small to gain any benefit. Table 2 shows the cycle counts of some DSP-stone programs estimated on the PAC ISS. Code generation with SWP always produces the better result.

Table 2. Cycle counts of DSP-stone benchmark suite

Benchmark           Without SWP   With SWP
fir                 2672          1208
lms                 3343          1865
fir2dim             14042         10636
matrix1             71467         36739
matrix2             68671         34363
mat1x3              681           480
n-real updates      3292          2153
n-complex updates   6792          3145
convolution         1524          773
dot product         258           178

Experimental results of the DSP-stone benchmark suite running on the PAC DSP are normalized and presented in Figure 10. The blue bars show the performance of code generated by the compiler without any optimization, and the purple bars show the performance of code generated with some O1 optimizations, such as EBO and WOPT. The yellow bars show the performance obtained from our previous work on register allocation, which uses the simulated annealing (SA) method to improve local instruction scheduling. The green bars show the performance obtained from the compiler with SWP optimization. These results indicate that our SWP policies gain a performance improvement of around 2.5 times over code generated without any optimization. Clearly, software pipelining is also an effective method for attaining performance on the highly partitioned register files of the PAC architecture.

Figure 10. Performance result (bar chart of normalized performance, vertical axis from 0 to 3, for the configurations O0, O1, O1+SA, and O1+SWP on fir, lms, fir2dim, matrix1, matrix2, mat1x3, n_real_updates, n_complex_updates, convolution, and dot_product)

However, when we investigated the code scheduled by SWP optimization, we noticed that resource utilization on the PAC architecture is low. In other words, most of the instructions are placed in one cluster. Owing to the characteristics of the DSP-stone benchmarks, when SWP optimization is performed on these programs, most of the MIIs are dominated by their RecMIIs. Such instruction arrangement therefore results from the cluster assignment phase, whose goal is to minimize the initiation interval. Hence, in future work we will try to boost the resource utilization rate. High-level loop transformations should also be used to enhance the partitioning scheme and improve performance.

In addition to the significant improvement from SWP optimization, the current implementation of pCore for the PAC DSP processor proves to be very small, with a size of less than 10K bytes. The kernel is impressively small compared with many existing embedded microkernel OSes and real-time kernels. Moreover, by configuring the kernel before run time, once the applications running on it are determined, the memory requirement of pCore can be shrunk even further. Figure 11 shows the module size distribution of pCore, which presents the result of the implementation. The task management module, which is the key feature of multitasking, turns out to have a large memory requirement. If the DSP is employed to run one application at a time, the kernel can be configured in single-tasking mode, which can reduce the memory requirement of the kernel by about 50%.

Figure 11. Module size distribution of pCore
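The cycle counts in Table 2 can be turned into per-benchmark speedups with a few lines of arithmetic. The sketch below, in Python, only divides the "Without SWP" count by the "With SWP" count for each row; the dictionary and function names are illustrative, and the ratios are relative to the non-SWP build of Table 2, not to the unoptimized baseline of Figure 10.

```python
# Cycle counts copied from Table 2: (without SWP, with SWP).
CYCLES = {
    "fir": (2672, 1208),
    "lms": (3343, 1865),
    "fir2dim": (14042, 10636),
    "matrix1": (71467, 36739),
    "matrix2": (68671, 34363),
    "mat1x3": (681, 480),
    "n_real_updates": (3292, 2153),
    "n_complex_updates": (6792, 3145),
    "convolution": (1524, 773),
    "dot_product": (258, 178),
}


def speedup(without_swp: int, with_swp: int) -> float:
    """Speedup of the SWP build over the non-SWP build (ratio of cycle counts)."""
    return without_swp / with_swp


SPEEDUPS = {name: speedup(*counts) for name, counts in CYCLES.items()}

if __name__ == "__main__":
    # List benchmarks from largest to smallest gain.
    for name, s in sorted(SPEEDUPS.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:18s} {s:.2f}x")
```

For the rows of Table 2, the largest ratio is for fir (about 2.21) and the smallest is for fir2dim (about 1.32).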


Furthermore, the toolkits we developed, such as the compiler, assembler/linker, and debugger, are integrated into Eclipse to provide a convenient environment for application developers. As shown in Figure 12, the graphical user interface (GUI) of our PAC DSP debugger presents variables, disassembly, registers, and breakpoint/watchpoint information in an integrated view.

Figure 12. A snapshot of Eclipse for PAC DSP

7. Conclusion

This article described our experiences in developing an OS and toolkit chains for the PAC architecture, which imposes strict constraints on register file access. We developed a tiny microkernel-based OS designed for PAC DSP processors, with a very small kernel size. This OS implements a minimal set of services and supports the dual-core/multicore programming model by providing the system calls that users need to port any customized dual-core/multicore communication protocol to it. We also built an assembler/linker that performs the basic machine encoding and provides more advanced code-size reduction strategies for PAC, and we provided a debugging environment and integrated software solutions for application developers of PAC DSP. Furthermore, we presented our experiences in retargeting the software pipelining optimization from IA-64 to PAC DSP. The experimental evaluation using the DSPstone benchmark indicates a significant improvement in cycle time. These experiences may benefit architecture designers and compiler developers who are interested in similar heterogeneous clustered VLIW architectures with port-restricted, distinctly partitioned register file structures.

8. Acknowledgements

This work was supported in part by the Ministry of Economic Affairs under grant no. 95-EC-17-A-01-S1-034, and by the National Science Council under grant nos. NSC 94-2220-E-007-019, NSC 94-2220-E-007-020, NSC 94-2213-E-007-074, and NSC 95-2752-E-007-004-PAE in Taiwan.

References

[1] S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens, "Register organization for media processing", International Symposium on High Performance Computer Architecture (HPCA), pp. 375-386, 2000.
[2] David Chang and Max Baron, "Taiwan's Roadmap to Leadership in Design", Microprocessor Report, In-Stat/MDR, Dec. 2004. http://www.mdronline.com/mpr/archive/mpr 2004.html
[3] T.-J. Lin, C.-C. Chang, C.-C. Lee, and C.-W. Jen, "An Efficient VLIW DSP Architecture for Baseband Processing", In Proceedings of the 21st International Conference on Computer Design, 2003.
[4] Tay-Jyi Lin, Chie-Min Chao, Chia-Hsien Liu, Pi-Chen Hsiao, Shin-Kai Chen, Li-Chun Lin, Chih-Wei Liu, and Chein-Wei Jen, "Computer architecture: A unified processor architecture for RISC & VLIW DSP", In Proceedings of the 15th ACM Great Lakes Symposium on VLSI, April 2005.
[5] Roy Ju, Sun Chan, and Chengyong Wu, "Open Research Compiler for the Itanium Family", Tutorial at the 34th Annual Intl Symposium on Microarchitecture, Dec. 2001.
[6] "Eclipse Platform Technical Overview", International Business Machines Corp., 2001. http://www.eclipse.org/articles/Whitepaper-Platform-3.1/eclipseplatform-whitepaper.pdf
[7] Yung-Chia Lin, Chung-Lin Tang, Chung-Ju Wu, Ming-Yu Hung, Yi-Ping You, Ya-Chiao Moo, Sheng-Yuan Chen, and Jenq-Kuen Lee, "Compiler Supports and Optimizations for PAC VLIW DSP Processors", LCPC, 2005.
[8] Yung-Chia Lin, Yi-Ping You, and Jenq-Kuen Lee, "Register Allocation for VLIW DSP Processors with Irregular Register Files", CPC, 2006.
[9] Cheng-Wei Chen, Yung-Chia Lin, Chung-Ling Tang, and Jenq-Kuen Lee, "ORC2DSP: Compiler Infrastructure Supports for VLIW DSP Processors", IEEE VLSI TSA, April 27-29, 2005.
[10] SGI - Developer Central Open Source - Pro64. http://oss.sgi.com/projects/Pro64/
[11] Richard Huff, "Lifetime-Sensitive Modulo Scheduling", PLDI SIGPLAN, 1993.
[12] B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker, "Register Allocation for Software Pipelined Loops", ACM SIGPLAN, 1992.
[13] A. Capitanio, N. Dutt, and A. Nicolau, "Partitioned register files for VLIWs: A preliminary analysis of tradeoffs", Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), Portland, December 1-4, 1992, pages 292-300.
[14] Texas Instruments, "TMS320C64x Technical Overview", Texas Instruments, Feb. 2000.
[15] The open research compiler official page. http://ipf-orc.sourceforge.net/
[16] John Gilmore and Stan Shebs, "GDB Internals: A guide to the internals of the GNU debugger", Free Software Foundation, Inc., 2006.
[17] Bill Gatliff, "Embedding with GNU: the gdb Remote Serial Protocol", Embedded System Programming, pp. 108-113, November 1999.
[18] Tay-Jyi Lin, Chen-Chia Lee, Chih-Wei Liu, and Chein-Wei Jen, "A Novel Register Organization for VLIW Digital Signal Processors", Proceedings of 2005 IEEE International Symposium on VLSI Design, Automation, and Test, 2005, pages 335-338.
[19] R. Leupers, "Instruction scheduling for clustered VLIW DSPs", Proc. Intl Conference on Parallel Architecture and Compilation Techniques, Oct. 2000, pages 291-300.
[20] V. Zivojnovic, J. Martinez, C. Schlager, and H. Meyr, "DSPstone: A DSP-Oriented Benchmarking Methodology", Proc. of ICSPAT, Dallas, 1994.
