An Automatic Superword Vectorization in LLVM
Kuan-Hsu Chen, Bor-Yeh Shen, and Wuu Yang
Department of Computer Science
National Chiao Tung University
Hsinchu City 300, Taiwan
{khchen, byshen, wuuyang}@cs.nctu.edu.tw

Abstract—More and more modern processors support SIMD instructions for improving performance in media applications. Programmers usually need detailed target-specific knowledge to use SIMD instructions directly. Thus, an auto-vectorizing compiler that automatically generates efficient SIMD instructions is urgently needed. We implement an automatic superword vectorization based on the LLVM compiler infrastructure, to which an auto-vectorization pass and an alignment-analysis pass have been added. The superword auto-vectorization pass exploits data parallelism and converts IR instructions from primitive types to vector types. Then, in the code generator, the alignment-analysis pass analyzes every memory access made by those vector instructions and produces the alignment information used to generate target-specific alignment instructions. In this paper, we use UltraSPARC as our experimental platform and two realignment instructions to perform misaligned accesses. We also present preliminary experimental results, which show that our optimization produces vectorized code with speedups of 4% to 35%.¹

Index Terms—Auto-vectorization, UltraSPARC T2, VIS Instruction Set.

I. INTRODUCTION

In recent years, in order to improve the performance of multimedia applications, multimedia-extension instructions have been added to most popular general-purpose microprocessors.
These instructions operate on fixed-length vectors. Existing multimedia extensions, such as MMX/SSE for x86, the Visual Instruction Set (VIS) for UltraSPARC, and AltiVec for PowerPC, are characterized as single-instruction multiple-data (SIMD) architectures. SIMD architectures exploit the data parallelism [1] that exists in many multimedia applications. SIMD execution typically involves applying the same instruction sequence simultaneously to different elements of a large data set. For example, some SIMD instructions simultaneously operate on a block of 128-bit data that is partitioned into four 32-bit integers. SIMD architectures include not only common arithmetic instructions but also other instructions, such as data alignment, data-type conversion, and data reorganization, that are needed to prepare the data in a proper format for SIMD execution [2].

¹The work reported in this paper is partially supported by the National Science Council, Taiwan, Republic of China, under grants NSC 96-2628-E-009-014-MY3, NSC 98-2220-E-009-050, and NSC 98-2220-E-009-051, and a grant from the Sun Microsystems OpenSPARC Project.

In an auto-vectorization technique, there are generally four major issues. First, not all code can be vectorized. For example, tasks with complicated control flow would not benefit from SIMD processors. Second, SIMD instructions are commonly used in in-line assembly routines or special library routines, which are usually not portable. Manually preparing in-line assembly routines is costly.
The third issue is the alignment problem caused by SIMD load/store instructions. These instructions usually require that a block of data be aligned at a machine-word boundary. The last issue is that SIMD instructions are always architecture-specific. For example, MMX provides a multiply-and-add instruction (a = a + (b × c)) but VIS does not.

We propose the design and implementation of an automatic superword vectorization based on the LLVM compiler infrastructure, which is widely used in the research community. LLVM supports powerful optimizations and analyses. However, LLVM does not yet support auto-vectorization. Therefore, we implement an auto-vectorization pass and an alignment-analysis pass in LLVM. Auto-vectorization exploits data-level parallelism in innermost loops and converts IR instructions from primitive types to vector types. Because the operands in a vector operation must be properly aligned, additional instructions must be used to access misaligned data correctly, and these additional instructions incur extra penalty. Therefore, we use an alignment-analysis pass that detects misaligned loads and stores and reduces the use of realignment instructions. The alignment information is used by the target code generator to generate target-specific alignment instructions. The optimization and analysis passes operate on the LLVM Intermediate Representation (IR).
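As a concrete illustration of the SIMD execution model described above (a scalar sketch of the semantics, not the paper's code), the following C routine applies the same addition to each 32-bit lane of a 128-bit block; a real SIMD add performs all four lane additions in a single instruction:

```c
#include <stdint.h>

/* Model of a 128-bit SIMD register partitioned into four 32-bit lanes. */
typedef struct { int32_t lane[4]; } v4i32;

/* One SIMD add: the same operation applied to every lane.  Hardware
   performs all lanes simultaneously; this sketch does them in a loop. */
static v4i32 simd_add(v4i32 a, v4i32 b)
{
    v4i32 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```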


The rest of this article is organized as follows. Related work and an introduction to LLVM are given in Section II. A detailed description of our auto-vectorization and alignment analysis is provided in Section III. The experimental results are presented in Section IV. Section V concludes and discusses future work.

II. BACKGROUND

This section reviews current auto-vectorization techniques and alignment mechanisms. In addition, we briefly introduce the LLVM compiler infrastructure.

A. Overview of Auto-Vectorization

Vectorization is a process which translates sequential loops into parallel versions that utilize SIMD instructions [3]. SIMD instructions are normally used within in-line assembly routines or processor-specific libraries. However, this approach is tedious; a better alternative is to use a compiler that automatically issues SIMD instructions. Many researchers have studied auto-vectorization [1, 3–7]. There are two types of vectorization: loop-based vectorization and basic-block vectorization (i.e., superword-level parallelism, SLP).

Loop-based vectorization attempts to extract data-level parallelism from a loop nest. In the first step, a loop-based vectorizer creates a data-dependence graph and identifies strongly connected components in the graph. In a dependence graph, nodes represent statements and edges denote dependences between statements. Note that nodes in the same strongly connected component are not vectorizable because they have cyclic dependences.
Next, the vectorizer combines each strongly connected component into a piblock [1] and rebuilds the dependences between operations in different piblocks. The resulting graph contains no cyclic dependences. The vectorizer then performs a topological sort of the graph to determine a valid execution order of the piblocks, and emits a vectorized statement for each vectorizable node. Finally, the vectorizer constructs a loop to execute the operations within each piblock in original program order.

Basic-block auto-vectorization (SLP) packs independent isomorphic statements in a basic block. Isomorphic statements are those that contain the same operations in the same order [3]; for example, Fig. 1 shows four isomorphic statements. They can be executed in parallel because they are data-independent of each other. In addition, both loops and non-iterative programs can benefit from this approach. The approach unrolls the innermost loop to increase parallelism inside a basic block, but the unroll factor relies on knowledge of the target vector register size. For example, to calculate the unroll factor, the compiler must know the architecture's vector register size (suppose it is 8 bytes); if the array is assumed to contain 2-byte elements in the loop, then the unroll factor will be 4 (8/2). In addition, if the loop's trip count is not divisible by the unroll factor, the loop is split in two: one vectorized loop and another for the remaining iterations.

Fig. 1: Isomorphic statements.

Fig. 2: Misaligned loads.

Many related techniques have been proposed in the past. Some focus on vectorizing scalar source [3, 5–9], enabling vectorization of otherwise non-vectorizable code and building cost models to select the most effective SIMD instructions or target-specific instructions. Others focus on the portability issue [10–12], because the various multimedia extensions differ greatly from one another. The usual solution is to design vector instructions in the IR. For example, Bocchino and Adve [11] designed a vector LLVA (Low Level Virtual Architecture) based on LLVM that can be translated to native code without using hardware-specific parameters and achieves code competitive with hand-coded native assembly, but the programmer must carefully hand-tune the vector LLVA code for each target. This method can obviously be more efficient than auto-vectorization. There has also been work on generating efficient code under the constraints imposed by alignment, vector size, and data movement [13].

B. Overview of Alignment Mechanisms

The misalignment problem often occurs in SIMD operations. For example, two SIMD load instructions which operate on misaligned data are shown in Fig. 2.
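The misaligned loads of Fig. 2 can be modeled in software as follows (a sketch, not the paper's implementation, and the boundary size of 8 bytes is an assumption): a misaligned 8-byte value is assembled from the two aligned 8-byte words that straddle it, which is the effect that realignment instruction pairs such as VIS's alignaddress/faligndata achieve in hardware:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load 8 bytes starting at base + offset, where the address need not
   be 8-byte aligned: perform the two aligned 8-byte loads that cover
   [offset, offset + 8), then splice out the wanted bytes.  Assumes
   base itself is the start of an (aligned) buffer with enough room. */
static uint64_t load_misaligned_u64(const uint8_t *base, size_t offset)
{
    size_t lo = offset & ~(size_t)7;        /* aligned word below offset */
    uint8_t spliced[16];
    memcpy(spliced, base + lo, 8);          /* first aligned load        */
    memcpy(spliced + 8, base + lo + 8, 8);  /* second aligned load       */
    uint64_t v;
    memcpy(&v, spliced + (offset - lo), 8); /* select the 8 wanted bytes */
    return v;
}
```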


TABLE I: Alignment mechanisms across different platforms [12, 14]

Target       | Unaligned Load | Aligned Load | Realign      | Load RT
AltiVec/VMX  |                | lvx          | vperm        | lvsl
SSE/SSE3     | movdqu, lddqu  | movdqa       |              |
MIPS-3D      | luxc1          |              | alnv.ps      | address
MIPS64       | ldl, ldr       |              |              |
Alpha        | ldq_u          |              | extql, extqh | address
VIS          |                | ldd          | faligndata   | alignaddress

1) Original:

void add(short *a, short *b, short *c, int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
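Dynamic loop peeling applied to the original add() above can be sketched as follows (a hedged illustration of the transformation, not the paper's generated code; the 8-byte alignment boundary is an assumption): scalar iterations run until the store pointer reaches the next boundary, so the main loop then operates on aligned addresses:

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 8   /* assumed vector-register / alignment boundary */

/* Peeled version of add(): a scalar prologue advances until c is
   8-byte aligned, then the (conceptually vectorized) main loop runs
   with aligned stores.  Assumes c is at least 2-byte aligned so the
   byte misalignment is a multiple of sizeof(short). */
static void add_peeled(short *a, short *b, short *c, int n)
{
    size_t mis = (uintptr_t)c & (VEC_BYTES - 1);     /* c mod 8 */
    int peel = mis ? (int)((VEC_BYTES - mis) / sizeof(short)) : 0;
    if (peel > n)
        peel = n;
    int i;
    for (i = 0; i < peel; i++)       /* peeled scalar prologue */
        c[i] = (short)(a[i] + b[i]);
    for (; i < n; i++)               /* aligned main loop      */
        c[i] = (short)(a[i] + b[i]);
}
```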


The realignment-support instruction (the "Load RT" column in Table I) computes a permutation mask or some other value, depending on the platform. For example, the AltiVec lvsr instruction computes a permutation mask which the permutation instruction vperm can use to select the correct data. The alignaddress instruction in VIS computes a mask and returns an aligned address for loading two aligned data blocks into registers; the faligndata instruction then uses the mask to select the correct data from the two registers. Moreover, on the MIPS-3D and Alpha platforms, the hardware automatically clears the lower bits of the effective address and returns an aligned address (RT); shifts, rotates, or permutes then extract the unaligned data elements.

Among software mechanisms, the common approaches are dynamic loop peeling and multi-versioning [8, 15]. Multi-versioning avoids misaligned memory accesses by using runtime checks, because not all alignment information can be obtained statically. Dynamic loop peeling focuses on enforcing aligned accesses. Fig. 3 illustrates these approaches.

C. The LLVM Compiler Infrastructure

The Low Level Virtual Machine (LLVM) is a compiler infrastructure composed of reusable modular components (passes). Each pass performs one analysis or one transformation at a certain granularity (function, loop, etc.). LLVM enables effective program optimizations across the entire program lifetime, which includes compile time, link time, and run time, and it provides a powerful Intermediate Representation (IR).
The LLVM IR is a RISC-like instruction set but with key higher-level information for effective analysis. This includes type information, explicit control-flow graphs, and an explicit dataflow representation (using an infinite, typed register set in SSA form) [16].

The type system is the most important feature in LLVM. It is low-level and language-independent, and it is used to implement data types and operations from high-level languages, exposing their implementation behavior to all stages of optimization [16]. IR instructions perform type conversions and low-level address arithmetic while preserving type information [16]. The vector type, which represents a vector of elements, is the key type used in this paper. A companion code generator in the LLVM back-end emits SIMD instructions for the vector type whenever the target platform supports SIMD instructions. Note that LLVM currently does not support auto-vectorization transformations; it only provides the vector type in the IR, which can be produced by intrinsic functions or built-in functions in the source code.

Fig. 4: Our automatic superword vectorization.

III. OVERVIEW OF AUTOMATIC SUPERWORD VECTORIZATION

This section explains our automatic superword vectorization, which is depicted in Fig. 4. In the front-end, source code is compiled to LLVM IR. LLVM supports many independent and reusable optimization and transformation passes at the LLVM IR level.
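The LLVM vector type mentioned above (e.g., <4 x i32>) is surfaced in C by the GCC/Clang vector extension, and clang lowers arithmetic on such types to IR vector instructions; the following sketch uses that analogy for illustration (the extension is compiler-specific and is not part of the paper's implementation):

```c
/* A C-level analogue of the LLVM vector type <4 x i32>: a 16-byte
   vector of four ints (GCC/Clang __attribute__((vector_size))). */
typedef int v4si __attribute__((vector_size(16)));

/* One C '+' on vector operands corresponds to a single IR
   'add <4 x i32>', which the back-end can emit as one SIMD add. */
static v4si vadd(v4si a, v4si b)
{
    return a + b;
}
```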
We add two passes: auto-vectorization and alignment analysis. The auto-vectorization pass explores data-level parallelism in the innermost loop and converts IR instructions from primitive types to vector types. The alignment analysis is an interprocedural pointer analysis which calculates alignment information for instruction selection. The LLVM back-end provides code generators for various platforms and supports some multimedia extensions, but it does not support the VIS extension on the SPARC platform. We modified an existing SPARC code generator in LLVM to generate SIMD instructions and realignment instructions with the help of the alignment-analysis pass.

A. Auto-Vectorization

The auto-vectorization pass works on each loop independently. It explores data-level parallelism and produces vector-type IR instructions from primitive types. Our basic-block auto-vectorization method is similar to Larsen and Amarasinghe's approach (SLP) [3], and we assume overflows will not occur. The detailed procedure is presented in this section.

1) Pre-pass: The pre-pass step performs several LLVM optimization and transformation passes. These passes include the general SSA-form optimizations, con-


Fig. 6: Vectorizable statement classification.

Fig. 7: Type transformation processing.

Fig. 8: After type transformation in IR.

If a strongly connected component contains more than one node (statement), these nodes cannot be combined to emit a SIMD instruction. Fig. 6 shows the vectorizable-statement classification result for the code of Fig. 5(b); there are two candidate sets.

5) Type Transformation: The type-transformation step compares candidate sets and groups them if they are isomorphic. Next, it combines two groups into a new one when they use the same candidate sets. Lastly, if the number of candidate sets in a group equals VF, the isomorphic statements are put together as vector instructions in original program order. In other words, this step transforms instructions from their original primitive data types to vector types in the LLVM IR type representation. Candidate sets are compared pairwise under the following conditions. First, they must have the same number of nodes in the dependency graph, and a node must not have a flow dependence on a node in the other set. Second, the nodes at identical positions in the two candidate sets are compared according to the following conditions:
• The operations are the same.
• The address operands must be adjacent memory references.

In other words, two candidate sets are isomorphic when they are in the same group. Next, we combine two groups into a new one when they use the same candidate sets and the distinct candidate sets have no dependence relationship (determined using LLVM alias analysis). In the next step, if the number of candidate sets in a group equals VF, the nodes in the candidate sets of that group are emitted as vector instructions. The best choice is to pack candidate sets into a group whose size equals the maximum degree of parallelism the target supports: for example, a group of eight candidate sets is suitable when the target provides eight-way 16-bit operations rather than four-way ones. Unfortunately, auto-vectorization is performed on the retargetable IR, so we assume that the target provides operations of degree VF. In LLVM IR, we use the bitcast instruction to convert load/store instructions from primitive types to vector types, and insert vector-type instructions equivalent to the remaining primitive-type instructions in the original order. The combine-and-transfer step also handles loop-invariant variables by using scalar expansion; a loop-invariant variable means the statement contains an invariant operand in its expression. The type-transformation processing is shown in Fig. 7, and Fig. 8 shows the resulting IR after type transformation. Note that the bitcast instruction is used to convert the original types to vector types.

B. Alignment Analysis

The alignment analysis is an interprocedural pointer analysis similar to Pryanishnikov's approach [6]. It analyzes every memory access with respect to vector instructions.
The goal of alignment analysis is to determine, for each memory reference, its alignment on all executions. The alignment information is computed by applying the modulo operator to the addresses of memory instructions. There are


two assumptions. First, the compiler must align data: when a program allocates memory, the first element is aligned. Second, an address is assumed to be misaligned when its alignment information cannot be obtained. The alignment information of a memory access is given by:

AlignmentInfo = P mod N

AlignmentInfo is a set of alignment values modulo N, P denotes the pointer address, and N denotes the memory boundary in bytes (usually, N equals the target vector register size). Alignment information changes when the pointer is involved in arithmetic operations, such as *(p+i). In order to evaluate pointer arithmetic, the transfer function F : M × M ↦→ M is given by:

F(A1 ± A2) = [(A1 mod N) ± (A2 mod N)] mod N
F(A1 · A2) = [(A1 mod N) · (A2 mod N)] mod N

M is the set of all possible alignment values modulo N, and A1 and A2 denote a scalar or a value from an AlignmentInfo set.

Alignment analysis traverses the IR and propagates the alignment information. It starts from the main function, and each function call is visited sequentially. When a procedure makes a function call, the alignment information of the argument variables is passed across the call; when a function is invoked multiple times, the alignment information from the different call sites is merged. In a loop block, the alignment information is propagated through each iteration. Consider the following example, in which the alignment information of b is calculated (the vector length is 8). Assume the input argument b's alignment-information set is {1}. In the first iteration, the alignment information of pointer b is 1, and after the statement *b += 2 it becomes 3 ((1 mod 8 + 2 mod 8) mod 8 = 3). By propagation (1 ∪ 3), the alignment information of pointer b is then {1, 3}. The analysis stops when the old set equals the new set; b's final alignment-information set is {1, 3, 5, 7}.

void sum(int *a, int *b, int *c)
{
    for (i = 0; i < n; i++) {
        ...
        *b += 2;
    }
}
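The propagation in the example above can be reproduced with a small fixed-point iteration (a sketch under the paper's definitions, with N = 8 and a step of +2 per iteration): starting from {1}, repeatedly add (x + 2) mod 8 until the set stops changing, which yields {1, 3, 5, 7}:

```c
#include <stdbool.h>

enum { MOD_N = 8 };  /* alignment boundary N from the example */

/* Fixed-point propagation of an alignment set modulo MOD_N for a
   pointer advanced by 'step' each loop iteration.  set[v] != false
   means alignment value v is possible.  Stops when an iteration
   introduces no new value (old set equals new set). */
static void propagate_alignment(bool set[MOD_N], int step)
{
    bool changed = true;
    while (changed) {
        changed = false;
        for (int v = 0; v < MOD_N; v++) {
            int next = (v + step) % MOD_N;   /* transfer function F */
            if (set[v] && !set[next]) {
                set[next] = true;            /* merge new alignment */
                changed = true;
            }
        }
    }
}
```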


Fig. 10: Pre-pass and auto-vectorization speedup evaluation.

Fig. 11: Auto-vectorization speedup compared with pre-pass optimization.

We evaluate the pre-pass optimization, the auto-vectorization optimization, and the configuration with alignment analysis disabled. Note that with alignment analysis disabled, realignment instructions are always used for both aligned and misaligned vector accesses. In the experiments, time is measured using the gethrtime() call, which provides high-resolution time in nanoseconds. Speedup rates compared with -O0 optimization for the test suites are shown in Fig. 10. The results of the pre-pass and auto-vectorization are shown in Fig. 11. Fig. 12 shows the comparison between enabling and disabling alignment analysis; with it disabled, a misaligned-access instruction sequence is generated for every vector memory access. In each figure, the star symbol marks the programs with misaligned accesses.

In certain cases, the speedup rate is lower because of misaligned accesses (e.g., lms, fir), because multiplications appear throughout the program (e.g., fir, iir), or because a smaller fraction of the benchmark code can be mapped to SIMD instructions (e.g., lms). In addition, with scalar expansion, the auxiliary array allocates extra memory and needs more instructions for access, which adds further penalty. Nevertheless, the evaluation shows a speedup of 4% up to 36%, and the average performance improvement over the pre-pass is up to 17.2%. Excluding matrix and matrix add, the average improvement is 14.5%, but those two programs still improve by 30.35% compared with the pre-pass.

V. CONCLUSION AND FUTURE WORK

In this paper we design and implement an automatic superword vectorization in LLVM. It produces SIMD instructions to improve performance. Our design includes an auto-vectorization pass and an alignment-analysis pass. Auto-vectorization exploits data parallelism and enables the base vectorization. The alignment-analysis pass lets the back-end select target-specific realignment instructions to handle the misalignment problem in our experimental environment. The vectorization is effective because LLVM carries high-level information in its IR. Moreover, our vectorization is independent of the other optimization and analysis passes in LLVM, so programs can apply other powerful optimizations after auto-vectorization. In addition, the vectorized code can generate SIMD instructions for other platforms supported by LLVM.

Fig. 12: Auto-vectorization speedup compared with disabling alignment analysis.

Future work will focus on enhancing the vectorization capability and on translating LLVM IR to vector LLVA [11], which provides a rich set of vector instructions based on LLVM IR (e.g., vector LLVA provides saturation arithmetic). In addition, because auto-vectorization may introduce extra code and cost, a suitable cost model should be added for SPARC on LLVM.

REFERENCES

[1] K. Kennedy and J. R. Allen, Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. San Francisco, CA, USA, 2001.
[2] A. Shahbahrami and B. Juurlink, "Performance improvement of multimedia kernels by alleviating overhead instructions on SIMD devices," in APPT '09: Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies. Berlin, Heidelberg: Springer-Verlag, August 24-25, 2009, pp. 389–407.


[3] S. Larsen and S. Amarasinghe, "Exploiting superword level parallelism with multimedia instruction sets," in PLDI '00: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation. New York, NY, USA: ACM, June 18-21, 2000, pp. 145–156.
[4] M. J. Wolfe, High Performance Compilers for Parallel Computing, C. Shanklin and L. Ortega, Eds. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1995.
[5] D. Naishlos, "Autovectorization in GCC," in the GCC Developers Summit, June 2-4, 2004, pp. 105–117.
[6] I. Pryanishnikov, A. Krall, and N. Horspool, "Compiler optimizations for processors with SIMD instructions," in Software: Practice and Experience, vol. 37, no. 1. New York, NY, USA: John Wiley & Sons, Inc., January 2007, pp. 93–113.
[7] S. Larsen, R. Rabbah, and S. Amarasinghe, "Exploiting vector parallelism in software pipelined loops," in MICRO 38: Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, November 12-16, 2005, pp. 119–129.
[8] A. Krall and S. Lelait, "Compilation techniques for multimedia processors," in International Journal of Parallel Programming, vol. 28, no. 4. Norwell, MA, USA: Kluwer Academic Publishers, August 2000, pp. 347–361.
[9] P. Wu, A. E. Eichenberger, A. Wang, and P. Zhao, "An integrated simdization framework using virtual vectors," in ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing. New York, NY, USA: ACM, June 20-22, 2005, pp. 169–178.
[10] M. Hohenauer, F. Engel, R. Leupers, G. Ascheid, and H. Meyr, "A SIMD optimization framework for retargetable compilers," in ACM Transactions on Architecture and Code Optimization, vol. 6, no. 1. New York, NY, USA: ACM, March 2009, pp. 1–27.
[11] R. L. Bocchino Jr. and V. S. Adve, "Vector LLVA: a virtual vector instruction set for media processing," in VEE '06: Proceedings of the 2nd International Conference on Virtual Execution Environments. New York, NY, USA: ACM, June 14-16, 2006, pp. 46–56.
[12] D. Nuzman and R. Henderson, "Multi-platform auto-vectorization," in CGO '06: Proceedings of the International Symposium on Code Generation and Optimization. Washington, DC, USA: IEEE Computer Society, March 26-29, 2006, pp. 281–294.
[13] A. E. Eichenberger, P. Wu, and K. O'Brien, "Vectorization for SIMD architectures with alignment constraints," in PLDI '04: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation. New York, NY, USA: ACM, June 09-11, 2004, pp. 82–93.
[14] Sun Microsystems, Inc., UltraSPARC T2 Supplement to the UltraSPARC Architecture 2007, 2007.
[15] A. J. C. Bik, M. Girkar, P. M. Grey, and X. Tian, "Automatic intra-register vectorization for the Intel® architecture," in International Journal of Parallel Programming, vol. 30, no. 2. Norwell, MA, USA: Kluwer Academic Publishers, April 2002, pp. 65–98.
[16] C. Lattner and V. Adve, "LLVM: a compilation framework for lifelong program analysis & transformation," in CGO '04: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization. Washington, DC, USA: IEEE Computer Society, March 20-24, 2004, pp. 75–86.
[17] C. Lattner, The LLVM Compiler Infrastructure Project. [Online]. Available: http://llvm.org/
[18] Sun Microsystems, Inc., VIS Instruction Set User's Manual, 2002.
[19] V. Zivojnovic, J. M. Velarde, C. Schlager, and H. Meyr, "DSPSTONE: A DSP-oriented benchmarking methodology," in Proceedings of the International Conference on Signal Processing and Technology (ICSPAT'94). Dallas, TX, USA: Miller Freeman, October 1994, pp. 715–720.
