An Automatic Superword Vectorization in LLVM
Kuan-Hsu Chen, Bor-Yeh Shen, and Wuu Yang
Department of Computer Science
National Chiao Tung University
Hsinchu City 300, Taiwan
{khchen, byshen, wuuyang}@cs.nctu.edu.tw

Abstract—More and more modern processors support SIMD instructions to improve performance in media applications. Programmers usually need detailed target-specific knowledge to use SIMD instructions directly. Thus, an auto-vectorizing compiler that automatically generates efficient SIMD instructions is urgently needed. We implement automatic superword vectorization on the LLVM compiler infrastructure, to which an auto-vectorization pass and an alignment analysis pass have been added. The superword auto-vectorization pass exploits data parallelism and converts IR instructions from primitive types to vector types. Then, in the code generator, the alignment analysis pass analyzes every memory access made by those vector instructions and produces the alignment information needed to generate target-specific alignment instructions. In this paper, we use UltraSPARC as our experimental platform and two realignment instructions to perform misaligned accesses. We also present preliminary experimental results, which reveal that our optimization generates vectorized code with speedups from 4% up to 35%.¹

Index Terms—Auto-vectorization, UltraSPARC T2, VIS Instruction Set.

I. INTRODUCTION

In recent years, multimedia-extension instructions have been added to most popular general-purpose microprocessors in order to improve the performance of multimedia applications.
These instructions operate on fixed-length vectors. Existing multimedia extensions, such as MMX/SSE for x86, the Visual Instruction Set (VIS) for UltraSPARC, and AltiVec for PowerPC, are characterized as single-instruction multiple-data (SIMD) architectures. SIMD architectures exploit the data parallelism [1] that exists in many multimedia applications. SIMD typically involves the simultaneous execution of the same instruction sequence on different elements of a large data set. For example, some SIMD instructions simultaneously operate on a block of 128-bit data that is partitioned into four 32-bit integers. SIMD architectures include not only common arithmetic instructions, but also other instructions, such as data alignment, data type conversion, and data reorganization, that are needed to prepare the data in a proper format for SIMD execution [2].

¹ The work reported in this paper is partially supported by National Science Council, Taiwan, Republic of China, under grants NSC 96-2628-E-009-014-MY3, NSC 98-2220-E-009-050, and NSC 98-2220-E-009-051 and a grant from Sun Microsystems OpenSparc Project.

Auto-vectorization generally faces four major issues. First, not all code can be vectorized. For example, tasks with complicated control flow would not benefit from SIMD processors. Second, SIMD instructions are commonly used in in-line assembly routines or special library routines, which are usually not portable. Manually preparing in-line assembly routines is costly.
The third issue is the alignment problem caused by SIMD load/store instructions. These instructions usually require that a block of data be aligned at a machine-word boundary. The last issue is that SIMD instructions are always architecture-specific. For example, MMX provides a multiply-and-add instruction (a = a + (b × c)) but VIS does not.

We propose the design and implementation of automatic superword vectorization based on the LLVM compiler infrastructure, which is widely used in the research community. LLVM supports powerful optimizations and analyses. However, LLVM does not yet support auto-vectorization. Therefore, we implement an auto-vectorization pass and an alignment analysis pass in LLVM. Auto-vectorization exploits data-level parallelism in the innermost loop and converts IR instructions from primitive types to vector types. Because the operands of a vector operation must be properly aligned, additional instructions must be used to access the correct data when the data is misaligned. These additional instructions incur an extra penalty. Therefore, we use an alignment-analysis pass that detects misaligned loads/stores and reduces the use of realignment instructions. The alignment information is used by the target code generator to generate target-specific alignment instructions. The optimization and analysis passes operate on the LLVM Intermediate Representation (IR).
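As a rough illustration of what converting primitive-type operations into vector-type operations means, the following C sketch contrasts a scalar loop with the same loop regrouped into blocks of isomorphic statements. It is not the pass described in this paper: the names `add_scalar` and `add_packed` and the factor `VF = 4` (an 8-byte register holding 2-byte elements) are assumptions chosen for illustration only.

```c
#include <assert.h>

/* Assumed vectorization factor: an 8-byte vector register holding
 * four 2-byte (short) elements. Hypothetical value for this sketch. */
enum { VF = 4 };

/* Scalar reference loop: one primitive-type add per iteration. */
void add_scalar(short *a, const short *b, const short *c, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        a[i] = (short)(b[i] + c[i]);
}

/* The same computation regrouped into blocks of VF isomorphic
 * statements on adjacent addresses. Each block is the unit that a
 * superword vectorizer could replace with two vector loads, one
 * vector add, and one vector store. */
void add_packed(short *a, const short *b, const short *c, int n)
{
    assert(n >= 0);
    int i = 0;
    for (; i + VF <= n; i += VF) {
        a[i + 0] = (short)(b[i + 0] + c[i + 0]);
        a[i + 1] = (short)(b[i + 1] + c[i + 1]);
        a[i + 2] = (short)(b[i + 2] + c[i + 2]);
        a[i + 3] = (short)(b[i + 3] + c[i + 3]);
    }
    /* Remainder loop for trip counts not divisible by VF. */
    for (; i < n; i++)
        a[i] = (short)(b[i] + c[i]);
}
```

Both functions compute the same result; only the grouping of statements differs. Whether the packed block can use aligned vector accesses depends on the addresses of `a`, `b`, and `c`, which is the question the alignment analysis answers.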
The rest of this article is organized as follows. Related work and an introduction to LLVM are given in Section II. A detailed description of our auto-vectorization and alignment analysis is provided in Section III. The experimental results are presented in Section IV. Section V presents the conclusion and future work.

II. BACKGROUND

This section reviews current auto-vectorization techniques and alignment mechanisms. In addition, we briefly introduce the LLVM compiler infrastructure.

A. Overview of Auto-Vectorization

Vectorization is a process that translates sequential loops into parallel versions that utilize SIMD instructions [3]. SIMD instructions are normally used within in-line assembly routines or processor-specific libraries. However, this approach is tedious; a better alternative is to use a compiler that automatically issues SIMD instructions. Many researchers have studied auto-vectorization [1, 3–7]. There are two types of vectorization: loop-based vectorization and basic-block vectorization (i.e., superword-level parallelism, SLP).

Loop-based vectorization attempts to exploit data-level parallelism in a loop nest. In the first step, a loop-based vectorizer creates a data-dependence graph and identifies the strongly connected components in it. In a dependence graph, nodes represent statements and edges denote dependences between statements. Note that nodes in the same strongly connected component are not vectorizable because they have cyclic dependences.
Next, the vectorizer combines each strongly connected component into a piblock [1] and rebuilds the dependences between operations in different piblocks. The resulting graph contains no cyclic dependences. The vectorizer then performs a topological sort of the graph to determine a valid execution order of the piblocks, and emits a vectorized statement for each vectorizable node. Finally, the vectorizer constructs a loop that executes the operations of a piblock in original program order.

Basic-block auto-vectorization (SLP) packs independent isomorphic statements in a basic block. Isomorphic statements are those that contain the same operations in the same order [3]; for example, Fig. 1 shows four isomorphic statements. They can be executed in parallel because they are data-independent of each other. Both loops and non-iterative programs can benefit from this approach. The approach also unrolls the innermost loop to increase parallelism inside a basic block, but the unroll factor is limited by the target's vector register size. For example, to calculate the unroll factor, the compiler must know the architecture's vector register size: if the register is 8 bytes wide and the array in the loop contains 2-byte elements, the unroll factor is 4 (8/2). In addition, if the loop's trip count is not divisible by the unroll factor, the loop is split into two loops: a vectorized loop and a loop for the remaining iterations.

Fig. 1: Isomorphic statements.

Fig. 2: Misaligned loads.

Many related techniques have been proposed in the past. Some focus on vectorizing scalar source code [3, 5–9], enabling vectorization of otherwise non-vectorizable code, and building cost models to select the most effective SIMD or target-specific instructions. Others focus on portability [10–12], because multimedia extensions differ greatly from one another. The usual solution is to design vector instructions in the IR. For example, Bocchino and Adve [11] designed Vector LLVA (Low Level Virtual Architecture) based on LLVM, which translates IR to native code while avoiding hardware-specific parameters and achieves code competitive with hand-coded native assembly; however, the programmer must carefully hand-tune the Vector LLVA code for each target. This method can obviously be more efficient than auto-vectorization. There has also been work on generating efficient code under alignment, vector size, and data movement constraints [13].

B. Overview of Alignment Mechanisms

The misalignment problem often occurs in SIMD operations. For example, two SIMD instructions which op-
TABLE I: Alignment mechanisms across different platforms [12, 14]

Target        Unaligned Load   Aligned Load   Realign Load    RT
AltiVec/VMX   -                lvx            vperm           lvsl
SSE/SSE3      movdqu, lddqu    movdqa         -               -
MIPS-3D       -                luxc1          alnv.ps         address
MIPS64        ldl, ldr         -              -               -
Alpha         -                ldq_u          extql, extqh    address
VIS1          -                ldd            faligndata      alignaddress

1) Original:
void add(short *a, short *b, short *c, int n) {
for (i = 0; i
a permutation mask or some other value in different platforms. For example, the AltiVec lvsr instruction computes a permutation mask that the permutation instruction vperm uses to select the correct data. The alignaddress instruction in VIS computes a mask and returns an aligned address for loading two aligned blocks of data into registers. The faligndata instruction then uses the mask to select the correct data from the two registers. Moreover, on the MIPS-3D and Alpha platforms, the hardware automatically clears the low-order bits of the effective address and returns an aligned address (RT) for the shifts, rotates, or permutes that extract the unaligned data elements.

Among software mechanisms, the common approaches are dynamic loop peeling and multi-versioning [?,8,15]. Multi-versioning uses runtime checks to avoid misaligned memory accesses, because not all alignment information can be obtained statically. Dynamic loop peeling focuses on enforcing aligned accesses. Fig. 3 illustrates these approaches.

C. The LLVM Compiler Infrastructure

The Low Level Virtual Machine (LLVM) is a compiler infrastructure composed of reusable modular components (passes). Each pass performs one analysis or one transformation at a certain granularity (function, loop, etc.). LLVM enables effective program optimization across the entire lifetime of a program, which includes compile time, link time, and run time, and provides a powerful Intermediate Representation (IR).
The LLVM IR is a RISC-like instruction set with key higher-level information for effective analysis. This includes type information, explicit control flow graphs, and an explicit dataflow representation (using an infinite, typed register set in SSA form) [16].

The type system is the most important feature of LLVM. It is low-level and language-independent, and is used to implement data types and operations from high-level languages, exposing their implementation behavior to all stages of optimization [16]. IR instructions perform type conversions and low-level address arithmetic while preserving type information [16]. The vector type, which represents a vector of elements, is the key type used in this paper. A companion code generator in the LLVM back-end always emits SIMD instructions for the vector type when the target platform supports SIMD instructions. Note that LLVM does not currently support auto-vectorization transformations; it only provides the vector type in the IR, which can be generated by intrinsic functions or built-in functions in the source code.

Fig. 4: Our automatic superword vectorization.

III. OVERVIEW OF AUTOMATIC SUPERWORD VECTORIZATION

This section explains our automatic superword vectorization, which is depicted in Fig. 4. In the front-end, source code is compiled to LLVM IR. LLVM supports many independent and reusable optimization and transformation passes at the LLVM IR level.
We add two passes: auto-vectorization and alignment analysis. The auto-vectorization pass exploits data-level parallelism in the innermost loop and converts IR instructions from primitive types to vector types. The alignment analysis is an interprocedural pointer analysis that calculates alignment information for instruction selection. The LLVM back-end provides several code generators for various platforms and supports some multimedia extensions, but it does not support the VIS extension on the SPARC platform. We modified an existing SPARC code generator in LLVM to generate SIMD instructions and realignment instructions with the help of the alignment analysis pass.

A. Auto-Vectorization

The auto-vectorization pass works on each loop independently. This pass exploits data-level parallelism and produces vector-type IR instructions from primitive-type ones. Our basic-block auto-vectorization method is similar to Larsen and Amarasinghe's approach (SLP) [3], and we assume that overflows will not occur. The detailed procedure is presented in this section.

1) Pre-pass: The pre-pass step performs several LLVM optimization and transformation passes. Those passes include the general SSA-form optimization, con-
Fig. 6: Vectorizable statement classification.

Fig. 7: Type transformation processing.

Fig. 8: After type transformation in IR.

contains more than one node (statement), it indicates that these nodes cannot be combined to emit a SIMD instruction. Fig. 6 shows the result of vectorizable statement classification for the code in Fig. 5 (b); there are two candidate sets.

5) Type Transformation: The type-transformation step compares candidate sets and groups them if they are isomorphic. Next, it combines two groups into a new one when they share a candidate set. Lastly, if the number of candidate sets in a group equals VF, the isomorphic statements are put together as vector instructions in original program order. In other words, this step transforms instructions from their original primitive data types to vector types in the LLVM IR type representation. Candidate sets are compared pairwise on the following terms. First, they must have the same number of nodes in the dependency graph, and no node may have a flow dependence on a node in the other set. Second, the nodes at the same position in each candidate set are compared according to the following conditions:
• The operations are the same.
• The address operands must be adjacent memory references.

In other words, two candidate sets are isomorphic when they are in the same group. Next, we combine two groups into a new one when they share a candidate set and the differing candidate sets have no dependence relationship (as determined by LLVM alias analysis). In the next step, if the number of candidate sets in a group equals VF, the nodes in the candidate sets of the group are emitted as vector instructions. In this step, the best choice is to pack candidate sets into a group whose size equals the maximum degree of parallelism that the target supports. For example, if a group has eight candidate sets, a target that provides eight-way 16-bit operations is a better fit than one that provides four-way 16-bit operations. Unfortunately, auto-vectorization is performed on the retargetable IR, so we assume that the target provides VF-way operations. In LLVM IR, we use the bitcast instruction to convert load/store instructions from primitive types to vector types, and insert vector-type instructions equivalent to the remaining primitive-type instructions in the original order. The combine-and-transfer step also handles loop-invariant variables by using scalar expansion. Here, loop-invariant means that the statements contain an invariant operand in the expression. The type transformation process is shown in Fig. 7, and Fig. 8 shows the resulting IR after type transformation. Note that the bitcast instruction is used to convert the original type to the vector type.

B. Alignment Analysis

The alignment analysis is an interprocedural pointer analysis similar to Pryanishnikov's approach [6]. It analyzes every memory access made by vector instructions.
The goal of alignment analysis is to identify memory references that are unaligned on all executions. The alignment information is calculated by applying the modulo operator to the addresses used by memory instructions. There are two assumptions. First, the compiler must align data: when a program allocates memory, the first element is aligned. Second, an address is assumed to be misaligned when its alignment information cannot be obtained. The alignment information of a memory access is given by:

AlignmentInfo = P mod N

AlignmentInfo is a set of alignment values modulo N, P denotes the pointer address, and N denotes the memory boundary in bytes (usually, N equals the target vector register size). The alignment information changes when the pointer takes part in an arithmetic operation, such as *(p+i). In order to evaluate pointer arithmetic, the transfer function F : M × M → M is given by:

F(A1 ± A2) = [(A1 mod N) ± (A2 mod N)] mod N
F(A1 · A2) = [(A1 mod N) · (A2 mod N)] mod N

M is the set of all possible alignment values modulo N, and A1 and A2 each denote a scalar or a value from an AlignmentInfo set.

Alignment analysis traverses the IR and propagates the alignment information. It starts from the main function. Each function call is visited sequentially. When a procedure involves a function call, the AlignmentInfo of each argument variable is passed across the function call. When a function is called more than once, the AlignmentInfo sets from the different calls are merged. In a loop, the alignment information is propagated across iterations. Consider the following example, in which the alignment information of b is calculated (the vector length is 8). Assume the alignment information set of the input argument b is {1}.
In the first iteration, the alignment information of pointer b is 1, and b's alignment information then becomes 3 ((1 mod 8 + 2 mod 8) mod 8 = 3) at the statement *b+=2. Next, the alignment information of pointer b is {1, 3} by propagation ({1} ∪ {3}). Finally, the analysis stops when the old set equals the new set; b's alignment information set is then {1, 3, 5, 7}.

void sum(int *a, int *b, int *c) {
for (i = 0; i
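The fixed-point propagation in this example can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bitmask representation and the names `align_set`, `align_add`, and `align_loop_fixpoint` are hypothetical, and the loop body is assumed to advance the pointer by a constant amount on every iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical representation: bit k of an align_set is set when the
 * address may satisfy (address mod n) == k. Requires n <= 32. */
typedef uint32_t align_set;

/* Transfer function for "pointer + constant": every possible residue
 * is shifted by delta, modulo the boundary n. */
align_set align_add(align_set s, int delta, int n)
{
    assert(n > 0 && n <= 32);
    int d = ((delta % n) + n) % n;   /* normalise negative deltas */
    align_set out = 0;
    for (int k = 0; k < n; k++)
        if (s & (1u << k))
            out |= 1u << ((k + d) % n);
    return out;
}

/* Propagate around a loop whose body advances the pointer by delta.
 * The old set is merged (set union) with the transferred set until
 * nothing changes. The set only grows and has at most n members, so
 * the iteration terminates. */
align_set align_loop_fixpoint(align_set entry, int delta, int n)
{
    align_set cur = entry;
    for (;;) {
        align_set next = cur | align_add(cur, delta, n);
        if (next == cur)
            return cur;
        cur = next;
    }
}
```

With N = 8, an entry set of {1}, and a step of 2, the propagation yields {1, 3, 5, 7}, which matches the example. An access can be treated as aligned only when its set is exactly {0}.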
Fig. 10: Pre-pass and auto-vectorization speedup evaluation.

Fig. 11: Auto-vectorization speedup compared with pre-optimization evaluation.

optimization, auto-vectorization optimization, and disabling alignment analysis result. Note that the configuration with alignment analysis disabled always uses realignment instructions for vector accesses, whether aligned or misaligned. In the experiments, time is measured using the gethrtime() call, which has nanosecond resolution. Speedups relative to -O0 for the test suites are shown in Fig. 10. The results of pre-pass and auto-vectorization are shown in Fig. 11. Fig. 12 compares enabling and disabling alignment analysis; disabling it means that realignment instructions are generated for every vector memory access. In each figure, the star symbol marks programs that contain misaligned accesses.

In certain cases, the speedup is lower because of misaligned accesses (e.g., lms, fir), because multiplication appears frequently in the program (e.g., fir, iir), or because only a small fraction of the benchmark code can be mapped to SIMD instructions (e.g., lms). In addition, with scalar expansion, the auxiliary array allocates extra memory and requires more instructions to access, which adds further penalty. However, the evaluation shows speedups of 4% up to 36%, and the average performance improvement is up to 17.2% compared with pre-pass. Comparing matrix and matrix add, the average performance impact is 14.5%, but the result still improves by 30.35% compared with pre-pass.

Fig. 12: Auto-vectorization speedup compared with disabling alignment analysis.

V. CONCLUSION AND FUTURE WORK

In this paper we design and implement automatic superword vectorization in LLVM. It produces SIMD instructions to improve performance. Our design includes auto-vectorization and alignment analysis passes. Auto-vectorization exploits data parallelism and enables basic vectorization. In addition, the alignment analysis pass is used by the back-end to select target-specific realignment instructions that handle misaligned accesses in our experimental environment. Furthermore, the vectorization is effective because LLVM keeps high-level information in its IR. Our vectorization is also independent of the other optimization and analysis passes in LLVM, so programs can apply other powerful optimizations after auto-vectorization. In addition, the vectorized code can generate SIMD instructions for other platforms supported by LLVM.

Future work will focus on enhancing the vectorization capability and on translating LLVM IR to Vector LLVA [11], which provides a rich set of vector instructions based on LLVM IR (e.g., Vector LLVA provides saturation arithmetic). In addition, because auto-vectorization may introduce extra code and costs, a suitable cost model will be added for SPARC on LLVM.

REFERENCES

[1] K. Kennedy and J. R. Allen, Optimizing compilers for modern architectures: a dependence-based approach, San Francisco, CA, USA, 2001.
[2] A. Shahbahrami and B. Juurlink, “Performance improvement of multimedia kernels by alleviating overhead instructions on SIMD devices,” in APPT ’09: Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies. Berlin, Heidelberg: Springer-Verlag, August 24-25 2009, pp. 389–407.
[3] S. Larsen and S. Amarasinghe, “Exploiting superword level parallelism with multimedia instruction sets,” in PLDI ’00: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation. New York, NY, USA: ACM, June 18-21 2000, pp. 145–156.
[4] M. J. Wolfe, High performance compilers for parallel computing, C. Shanklin and L. Ortega, Eds. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1995.
[5] D. Naishlos, “Autovectorization in GCC,” in the GCC Developers Summit, June 2-4 2004, pp. 105–117.
[6] I. Pryanishnikov, A. Krall, and N. Horspool, “Compiler optimizations for processors with SIMD instructions,” in Software Practice and Experience, vol. 37, no. 1. New York, NY, USA: John Wiley & Sons, Inc., January 2007, pp. 93–113.
[7] S. Larsen, R. Rabbah, and S. Amarasinghe, “Exploiting vector parallelism in software pipelined loops,” in MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, November 12-16 2005, pp. 119–129.
[8] A. Krall and S. Lelait, “Compilation techniques for multimedia processors,” in International Journal of Parallel Programming, vol. 28, no. 4. Norwell, MA, USA: Kluwer Academic Publishers, August 2000, pp. 347–361.
[9] P. Wu, A. E. Eichenberger, A. Wang, and P. Zhao, “An integrated simdization framework using virtual vectors,” in ICS ’05: Proceedings of the 19th Annual International Conference on Supercomputing. New York, NY, USA: ACM, June 20-22 2005, pp. 169–178.
[10] M. Hohenauer, F. Engel, R. Leupers, G. Ascheid, and H. Meyr, “A SIMD optimization framework for retargetable compilers,” in ACM Transactions on Architecture and Code Optimization, vol. 6, no. 1. New York, NY, USA: ACM, March 2009, pp. 1–27.
[11] R. L. Bocchino Jr. and V. S. Adve, “Vector LLVA: a virtual vector instruction set for media processing,” in VEE ’06: Proceedings of the 2nd International Conference on Virtual Execution Environments. New York, NY, USA: ACM, June 14-16 2006, pp. 46–56.
[12] D. Nuzman and R. Henderson, “Multi-platform auto-vectorization,” in CGO ’06: Proceedings of the International Symposium on Code Generation and Optimization. Washington, DC, USA: IEEE Computer Society, March 26-29 2006, pp. 281–294.
[13] A. E. Eichenberger, P. Wu, and K. O’Brien, “Vectorization for SIMD architectures with alignment constraints,” in PLDI ’04: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation. New York, NY, USA: ACM, June 09-11 2004, pp. 82–93.
[14] Sun Microsystems, Inc., UltraSPARC T2 supplement to the UltraSPARC architecture 2007, 2007.
[15] A. J. C. Bik, M. Girkar, P. M. Grey, and X. Tian, “Automatic intra-register vectorization for the Intel® architecture,” in International Journal of Parallel Programming, vol. 30, no. 2. Norwell, MA, USA: Kluwer Academic Publishers, April 2002, pp. 65–98.
[16] C. Lattner and V. Adve, “LLVM: a compilation framework for lifelong program analysis & transformation,” in CGO ’04: Proceedings of the International Symposium on Code Generation and Optimization: feedback-directed and runtime optimization. Washington, DC, USA: IEEE Computer Society, March 20-24 2004, pp. 75–86.
[17] C. Lattner, The LLVM Compiler Infrastructure Project. [Online]. Available: http://llvm.org/
[18] Sun Microsystems, Inc., VIS instruction set user’s manual, 2002.
[19] V. Zivojnovic, J. M. Velarde, C. Schlager, and H. Meyr, “DSPSTONE: A DSP-oriented benchmarking methodology,” in Proceedings of the International Conference on Signal Processing and Technology (ICSPAT’94). Dallas, Tex, USA: Miller Freeman, October 1994, pp. 715–720.