An Automatic Superword Vectorization in LLVM

Kuan-Hsu Chen, Bor-Yeh Shen, and Wuu Yang
Department of Computer Science
National Chiao Tung University
Hsinchu City 300, Taiwan
{khchen, byshen, wuuyang}@cs.nctu.edu.tw

Abstract: More and more modern processors support SIMD instructions to improve performance in media applications. Programmers usually need detailed target-specific knowledge to use SIMD instructions directly. Thus, an auto-vectorizing compiler that automatically generates efficient SIMD instructions is urgently needed. We implement an automatic superword vectorization based on the LLVM compiler infrastructure, to which an auto-vectorization pass and an alignment analysis pass have been added. The superword auto-vectorization pass exploits data parallelism and converts IR instructions from primitive types to vector types. Then, in the code generator, the alignment analysis pass analyzes every memory access made by those vector instructions and produces the alignment information needed to generate target-specific alignment instructions. In this paper, we use UltraSPARC as our experimental platform and use two realignment instructions to perform misaligned accesses. We also present preliminary experimental results, which reveal that our optimization generates vectorized code with speedups of 4% up to 35%.¹

Index Terms: Auto-vectorization, UltraSPARC T2, VIS Instruction Set.

I. INTRODUCTION

In recent years, multimedia-extension instructions have been added to most popular general-purpose microprocessors in order to improve the performance of multimedia applications. These instructions operate on fixed-length vectors. Existing multimedia extensions, such as MMX/SSE for X86, the Visual Instruction Set (VIS) for UltraSPARC, and AltiVec for PowerPC, are characterized as single-instruction multiple-data (SIMD) architectures. SIMD architectures exploit the data parallelism [1] that exists in many multimedia applications.
SIMD typically involves the simultaneous execution of the same instruction sequence on different elements of a large data set. For example, some SIMD instructions simultaneously operate on a block of 128-bit data that is partitioned into four 32-bit integers. SIMD architectures include not only common arithmetic instructions but also other instructions, such as data alignment, data type conversion, and data reorganization, that are needed to prepare the data in a proper format for SIMD execution [2].

¹ The work reported in this paper is partially supported by National Science Council, Taiwan, Republic of China, under grants NSC 96-2628-E-009-014-MY3, NSC 98-2220-E-009-050, and NSC 98-2220-E-009-051 and a grant from Sun Microsystems OpenSparc Project.

Auto-vectorization generally faces four major issues. First, not all code can be vectorized. For example, tasks with complicated control flow would not benefit from SIMD processors. Second, SIMD instructions are commonly used in in-line assembly routines or special library routines, which are usually not portable. Manually preparing in-line assembly routines is costly. The third issue is the alignment problem caused by SIMD load/store instructions. These instructions usually require that a block of data be aligned at a machine-word boundary. The last issue is that SIMD instructions are always architecture-specific. For example, MMX provides a multiply-and-add operation (a = a + (b × c)) but VIS does not.

We propose the design and implementation of an automatic superword vectorization based on the LLVM compiler infrastructure, which is widely used in the research community. LLVM supports powerful optimizations and analyses. However, LLVM does not yet support auto-vectorization. Therefore, we implement an auto-vectorization pass and an alignment analysis pass in LLVM. Auto-vectorization exploits data-level parallelism in the innermost loop and converts IR instructions from primitive types to vector types.
Because the operands of a vector operation must be properly aligned, additional instructions must be used to access the correct data when the data is misaligned. These additional instructions incur an extra penalty. Therefore, we use an alignment-analysis pass that detects misaligned loads and stores and reduces the use of realignment instructions. The alignment information is used by the target code generator to generate target-specific alignment instructions. The optimization and analysis passes operate on the LLVM Intermediate Representation (IR).

The rest of this article is organized as follows. Related work and an introduction to LLVM are given in Section II. A detailed description of our auto-vectorization and alignment analysis is provided in Section III. The experimental results are presented in Section IV. Section V gives the conclusion and future work.

II. BACKGROUND

This section reviews current auto-vectorization techniques and alignment mechanisms. In addition, we briefly introduce the LLVM compiler infrastructure.

A. Overview of Auto-Vectorization

Vectorization is a process that translates sequential loops into parallel versions that use SIMD instructions [3]. SIMD instructions are normally used within in-line assembly routines or processor-specific libraries. However, this approach is tedious; a better alternative is to use a compiler that automatically emits SIMD instructions. Many researchers have studied auto-vectorization [1, 3–7]. There are two types of vectorization: loop-based vectorization and basic-block vectorization (i.e., superword level parallelism, SLP).

Loop-based vectorization attempts to extract data-level parallelism from a loop nest. In the first step, a loop-based vectorizer creates a data-dependence graph and identifies the strongly connected components in that graph. In a dependence graph, nodes represent statements and edges denote dependences between statements. Note that nodes in the same strongly connected component are not vectorizable because they have cyclic dependences. Next, the vectorizer combines each strongly connected component into a piblock [1] and rebuilds the dependences between operations in different piblocks. The resulting graph contains no cyclic dependences. Next, the vectorizer performs a topological sort of the graph to determine a valid execution order of the piblocks, and then emits a vectorized statement for each vectorizable node.
Finally, the vectorizer constructs a loop to execute the operations in each piblock in the original program order.

Basic-block auto-vectorization (SLP) packs independent isomorphic statements in a basic block. Isomorphic statements are those that contain the same operations in the same order [3]. For example, Fig. 1 shows four isomorphic statements. They can be executed in parallel because they are data independent of each other. Both loops and non-iterative programs can benefit from this approach. This approach also unrolls the innermost loop to increase parallelism inside a basic block, but the unroll factor is limited by the target's vector register size, which the compiler must therefore know. For example, if the vector register is 8 bytes wide and the array in the loop contains 2-byte elements, the unroll factor is 4 (8/2). In addition, if the loop's trip count is not divisible by the unroll factor, the loop is split into two: a vectorized loop and a second loop for the remaining iterations.

[Fig. 1: Isomorphic statements.]
[Fig. 2: Misaligned loads.]

Many related techniques have been proposed in the past. Some techniques focus on vectorizing scalar source code [3, 5–9], enabling vectorization of otherwise non-vectorizable code, and building cost models to select the most effective SIMD or target-specific instructions. Others focus on portability [10–12] because multimedia extensions differ greatly from one another. The usual solution is to design vector instructions in the IR. For example, Bocchino and Adve [11] designed Vector LLVA (Low Level Virtual Architecture) based on LLVM, which translates IR to native code while avoiding hardware-specific parameters and achieves code competitive with hand-coded native assembly; however, the programmer must carefully hand-tune the Vector LLVA code for each target. This method can obviously produce more efficient code than auto-vectorization.
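The unroll-factor rule and the loop splitting described earlier in this subsection can be sketched in C as follows. This is only an illustrative sketch, not the code our pass produces: the function names unroll_factor and add_split are hypothetical, the 8-byte vector register is the value assumed in the text, and short is assumed to be 2 bytes wide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: unroll factor = vector register bytes / element bytes. */
size_t unroll_factor(size_t vector_bytes, size_t elem_bytes)
{
    assert(elem_bytes > 0 && vector_bytes % elem_bytes == 0);
    return vector_bytes / elem_bytes;
}

/* c[i] = a[i] + b[i], split into an unrolled main loop and a remainder loop. */
void add_split(const short *a, const short *b, short *c, size_t n)
{
    /* Assumption: 8-byte vector register, 2-byte short, so uf == 4. */
    const size_t uf = unroll_factor(8, sizeof(short));
    const size_t main_end = n - (n % uf);
    size_t i;

    assert(uf == 4); /* the loop body below is written for four lanes */

    /* Main loop: four isomorphic statements on adjacent elements,
       which an SLP vectorizer could pack into one vector add. */
    for (i = 0; i < main_end; i += uf) {
        c[i + 0] = (short)(a[i + 0] + b[i + 0]);
        c[i + 1] = (short)(a[i + 1] + b[i + 1]);
        c[i + 2] = (short)(a[i + 2] + b[i + 2]);
        c[i + 3] = (short)(a[i + 3] + b[i + 3]);
    }

    /* Remainder loop: the iterations left over when n is not a multiple of uf. */
    for (; i < n; i++) {
        c[i] = (short)(a[i] + b[i]);
    }
}
```

Calling add_split on arrays of 10 elements runs the main loop for elements 0 to 7 and the remainder loop for elements 8 and 9.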
There has also been work on generating efficient code under alignment, vector size, and data movement constraints [13].

B. Overview of Alignment Mechanisms

The misalignment problem often occurs in SIMD operations. For example, two SIMD instructions which op-

TABLE I: Alignment mechanisms across different platforms [12, 14]

Target        Unaligned Load   Aligned Load   Realign Load    RT
AltiVec/VMX   -                lvx            vperm           lvsl
SSE/SSE3      movdqu, lddqu    movdqa         -               -
MIPS-3D       -                luxc1          alnv.ps         address
MIPS64        ldl, ldr         -              -               -
Alpha         -                ldq_u          extql, extqh    address
VIS1          -                ldd            faligndata      alignaddress

1) Original:
void add(short *a, short *b, short *c, int n) {
for (i=0; i

a permutation mask or some other value in different platforms. For example, the AltiVec lvsr instruction computes a permutation mask that the permutation instruction vperm uses to select the correct data. The alignaddress instruction in VIS computes a mask and returns an aligned address for loading two aligned data blocks into registers. The faligndata instruction then uses the mask to select the correct data from the two registers. Moreover, on the MIPS-3D and Alpha platforms, the hardware automatically clears the low-order bits of the effective address and returns an aligned address (RT), which shift, rotate, or permute operations then use to extract the unaligned data elements.

Among the software mechanisms, the common approaches are dynamic loop peeling and multi-versioning [?, 8, 15]. Multi-versioning avoids misaligned memory accesses by using runtime checks, because not all alignment information can be obtained statically. Dynamic loop peeling focuses on enforcing aligned accesses. Fig. 3 illustrates these approaches.

C. The LLVM Compiler Infrastructure

The Low Level Virtual Machine (LLVM) is a compiler infrastructure composed of reusable modular components (passes). Each pass performs one analysis or one transformation on a certain unit (function, loop, etc.). LLVM enables effective program optimization across the entire lifetime of a program, including compile time, link time, and run time, and provides a powerful Intermediate Representation (IR). The LLVM IR is a RISC-like instruction set that carries key higher-level information for effective analysis. This includes type information, explicit control flow graphs, and an explicit dataflow representation (using an infinite, typed register set in SSA form) [16].

The type system is the most important feature in LLVM. It is low-level and language-independent, and it is used to implement data types and operations from high-level languages, exposing their implementation behavior to all stages of optimization [16].
IR instructions perform type conversions and low-level address arithmetic while preserving type information [16]. The vector type, which represents a vector of elements, is the key type used in this paper. A companion code generator in the LLVM back-end always emits SIMD instructions for the vector type when the target platform supports SIMD instructions. Note that LLVM does not currently support auto-vectorization transformations; it only provides the vector type in the IR, which can be generated by intrinsic functions or built-in functions in the source code.

[Fig. 4: Our automatic superword vectorization.]

III. OVERVIEW OF AUTOMATIC SUPERWORD VECTORIZATION

This section explains our automatic superword vectorization, which is depicted in Fig. 4. In the front-end, source code is compiled to LLVM IR. LLVM supports many independent and reusable optimization and transformation passes at the LLVM IR level. We add two passes: auto-vectorization and alignment analysis. The auto-vectorization pass aims to extract data-level parallelism in the innermost loop and converts IR instructions from primitive types to vector types. The alignment analysis is an interprocedural pointer analysis that calculates alignment information for instruction selection. The LLVM back-end provides several code generators for various platforms and supports some multimedia extensions, but it does not support the VIS extension on the SPARC platform. We modified an existing SPARC code generator in LLVM to generate SIMD instructions and realignment instructions with the help of the alignment analysis pass.

A. Auto-Vectorization

The auto-vectorization pass works on each loop independently. This pass aims to extract data-level parallelism and produces vector-type IR instructions from primitive types. Our basic-block auto-vectorization method is similar to Larsen and Amarasinghe's approach (SLP) [3], and we assume overflows will not occur. The detailed procedure is presented in this section.
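To make the primitive-to-vector conversion concrete, the sketch below contrasts four isomorphic scalar additions with a single vector addition. It relies on the GCC/Clang vector_size extension as a stand-in for the LLVM vector type; this is an assumption made for illustration, not the pass described in this paper, and the names v4i16, add4_scalar, and add4_vector are hypothetical. A 2-byte short is assumed.

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang extension: an 8-byte vector holding four 16-bit integers. */
typedef short v4i16 __attribute__((vector_size(8)));

static_assert(sizeof(v4i16) == 4 * sizeof(short), "expects 2-byte short");

/* Scalar form: four isomorphic statements on adjacent elements. */
void add4_scalar(const short *a, const short *b, short *c)
{
    c[0] = (short)(a[0] + b[0]);
    c[1] = (short)(a[1] + b[1]);
    c[2] = (short)(a[2] + b[2]);
    c[3] = (short)(a[3] + b[3]);
}

/* Vector form: the same work expressed as one vector-typed addition. */
void add4_vector(const short *a, const short *b, short *c)
{
    v4i16 va, vb, vc;

    /* memcpy avoids any alignment assumption on a, b, and c. */
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vc = va + vb;              /* element-wise add on all four lanes */
    memcpy(c, &vc, sizeof vc);
}
```

As in the pass itself, the sketch assumes that the additions do not overflow, in which case both functions store the same four results.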
Thedetail procedure is presented in this section.1) Pre-pass: The pre-pass step performs severalLLVM optimization and transformation passes. Thosepasses include the general SSA-form optimization, con-

[Fig. 6: Vectorizable statement classification.]
[Fig. 7: Type transformation processing.]
[Fig. 8: After type transformation in IR.]

contains more than one node (statement), it indicates that these nodes cannot be combined to emit a SIMD instruction. Fig. 6 shows the vectorizable statement classification for the code in Fig. 5 (b); there are two candidate sets.

5) Type Transformation: The type-transformation step compares candidate sets and groups them if they are isomorphic statements. Next, it combines two groups into a new one when they share a candidate set. Lastly, if the number of candidate sets in a group equals VF, the isomorphic statements are put together as vector instructions in the original program order. In other words, this step transforms instructions from their original primitive data types to vector types in the LLVM IR type representation. Candidate sets are compared one by one on the following criteria. First, they must have the same number of nodes in the dependency graph, and no node may have a flow dependency on a node in the other set. Second, the nodes at the same position in each candidate set are compared according to the following conditions:
• The operations are the same.
• The address operands must be adjacent memory references.
In other words, two candidate sets are isomorphic when they are in the same group. Next, we combine two groups into a new one when they share a candidate set and their remaining candidate sets have no dependency relationship (determined by using LLVM alias analysis). In the next step, if the number of candidate sets in a group equals VF, the nodes in the candidate sets of the group are emitted as vector instructions. The best choice when packing candidate sets into a group is for the group size to equal the maximum degree of parallelism that the target supports. For example, if a group has eight candidate sets, a target that provides eight-way 16-bit operations is more suitable than one that provides four-way 16-bit operations.
Unfortunately, auto-vectorization operates on retargetable IR, so we assume that the target provides VF-way operations. In LLVM IR, we use the bitcast instruction to convert load/store instructions from primitive types to vector types, and we insert vector-type instructions equivalent to the remaining primitive-type instructions in the original order. The combine-and-transfer step also handles loop-invariant variables by using scalar expansion. Loop invariant here means that a statement contains an invariant operand in its expression. The type transformation process is shown in Fig. 7, and Fig. 8 shows the resulting IR after type transformation. Note that the bitcast instruction is used to convert the original type to the vector type.

B. Alignment Analysis

The alignment analysis is an interprocedural pointer analysis, similar to the approach of Pryanishnikov et al. [6]. It analyzes every memory access made by vector instructions. The goal of alignment analysis is to identify memory references that are unaligned on all executions. The alignment information is calculated by applying the modulo operator to the addresses of memory instructions. There are

two assumptions. First, the compiler must align data: when a program allocates memory, the first element should be aligned. Second, an address is assumed to be misaligned when its alignment information cannot be obtained. The alignment information of a memory access is given by:

AlignmentInfo = P mod N

AlignmentInfo is a set of alignment values modulo N, P denotes the pointer address, and N denotes the memory boundary in bytes (usually, N equals the target vector register size). The alignment information changes when the pointer undergoes arithmetic, as in *(p+i). To evaluate pointer arithmetic, the transfer function F : M × M → M is given by:

F(A1 ± A2) = [(A1 mod N) ± (A2 mod N)] mod N
F(A1 · A2) = [(A1 mod N) · (A2 mod N)] mod N

where M is the set of all possible alignment values modulo N, and A1 and A2 each denote a scalar or a value from an AlignmentInfo set.

Alignment analysis traverses the IR and propagates the alignment information. It starts from the main function. Each function call is visited sequentially. When a procedure involves a function call, the alignment information of the argument variables is passed across the call. When a function is invoked at different times, the alignment information from those invocations is merged. In a loop block, the alignment information is propagated through each iteration. Consider the following example, in which the alignment information of b is calculated (the vector length is 8). Assume that the alignment information set of the input argument b is {1}. In the first iteration, the alignment information of pointer b is 1, and after the statement *b+=2 it is 3 ((1 mod 8 + 2 mod 8) mod 8 = 3). Next, the alignment information of pointer b becomes {1, 3} by propagation ({1} ∪ {3}). Finally, the analysis stops when the old set equals the new set; the alignment information set of b is then {1, 3, 5, 7}.

void sum(int *a, int *b, int *c) {
for (i=0; i
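The transfer function and the fixed-point propagation described above can be sketched in C as follows. This is a minimal sketch and not the implementation of our pass: the names align_add, align_mul, and propagate_increment are hypothetical, the alignment set is represented as a bitmask (an assumption that limits N to at most 32), and only the case of a pointer advanced by a constant step in each loop iteration is modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Transfer function for addition: [(A1 mod N) + (A2 mod N)] mod N. */
unsigned align_add(unsigned a1, unsigned a2, unsigned n)
{
    assert(n > 0);
    return ((a1 % n) + (a2 % n)) % n;
}

/* Transfer function for multiplication: [(A1 mod N) * (A2 mod N)] mod N. */
unsigned align_mul(unsigned a1, unsigned a2, unsigned n)
{
    assert(n > 0);
    return ((a1 % n) * (a2 % n)) % n;
}

/* Propagate alignment values through a loop that adds `step` to a pointer
   whose initial alignment is `initial` (both taken modulo n).
   The result is a bitmask: bit k is set when alignment value k is possible. */
uint32_t propagate_increment(unsigned initial, unsigned step, unsigned n)
{
    uint32_t set;

    assert(n > 0 && n <= 32);
    set = UINT32_C(1) << (initial % n);

    for (;;) {
        uint32_t next = set;
        unsigned k;

        /* Apply the transfer function to every value already in the set. */
        for (k = 0; k < n; k++) {
            if (set & (UINT32_C(1) << k)) {
                next |= UINT32_C(1) << align_add(k, step, n);
            }
        }
        if (next == set) {
            return set; /* fixed point: the old set equals the new set */
        }
        set = next;
    }
}
```

For the example in the text (initial alignment 1, step 2, N = 8), the returned mask has bits 1, 3, 5, and 7 set, which corresponds to the set {1, 3, 5, 7}.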

[Fig. 10: Pre-pass and auto-vectorization speedup evaluation.]
[Fig. 11: Auto-vectorization speedup compared with pre-optimization evaluation.]

optimization, auto-vectorization optimization, and the result of disabling alignment analysis. Note that when alignment analysis is disabled, a realignment instruction is always used for every vector access, whether aligned or misaligned. In the experiments, time is measured using the gethrtime() call, which provides high-resolution time in nanoseconds. Speedups relative to -O0 for the test suites are shown in Fig. 10. The results of pre-pass and auto-vectorization are shown in Fig. 11. Fig. 12 compares enabling and disabling alignment analysis; disabling it means that a realignment instruction is generated for every vector memory access. In each figure, the star symbol marks a program that contains misaligned accesses.

In certain cases, the speedup is lower because of misaligned accesses (e.g., lms, fir), because multiplication appears frequently in the program (e.g., fir, iir), or because only a small fraction of the benchmark code can be mapped to SIMD instructions (e.g., lms). In addition, with scalar expansion, the auxiliary array requires extra memory and more instructions to access, which adds further penalty. Nevertheless, the evaluation shows a speedup of 4% up to 36%, and the average performance improvement is up to 17.2% compared with pre-pass. If we compare matrix and matrix add, the average performance impact is 14.5%, but the improvement is still 30.35% compared with pre-pass.

V. CONCLUSION AND FUTURE WORK

In this paper, we design and implement an automatic superword vectorization in LLVM. It produces SIMD instructions to improve performance. Our design includes auto-vectorization and alignment analysis passes.
Auto-vectorization exploits data parallelism and enables basic vectorization. In addition, the alignment analysis pass is used by the back-end to select target-specific realignment instructions that handle misaligned accesses in our experimental environment. Furthermore, the vectorization is effective because LLVM retains high-level information in its IR. Besides, our vectorization is independent of other optimization and analysis passes in LLVM. Programs can also apply other powerful optimizations after auto-vectorization. In addition, the vectorized code can generate SIMD instructions for other platforms supported by LLVM.

[Fig. 12: Auto-vectorization speedup compared with disabling alignment analysis.]

Future work will focus on enhancing vectorization capability and on translating LLVM IR to Vector LLVA [11], which provides a rich set of vector instructions based on LLVM IR (e.g., Vector LLVA provides saturation arithmetic). In addition, because auto-vectorization may introduce extra code and costs, a suitable cost model will be added for SPARC on LLVM.

REFERENCES

[1] K. Kennedy and J. R. Allen, Optimizing compilers for modern architectures: a dependence-based approach, San Francisco, CA, USA, 2001.
[2] A. Shahbahrami and B. Juurlink, “Performance improvement of multimedia kernels by alleviating overhead instructions on SIMD devices,” in APPT ’09: Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies. Berlin, Heidelberg: Springer-Verlag, August 24-25 2009, pp. 389–407.

[3] S. Larsen and S. Amarasinghe, “Exploiting superword level parallelism with multimedia instruction sets,” in PLDI ’00: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation. New York, NY, USA: ACM, June 18-21 2000, pp. 145–156.
[4] M. J. Wolfe, High performance compilers for parallel computing, C. Shanklin and L. Ortega, Eds. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1995.
[5] D. Naishlos, “Autovectorization in GCC,” in the GCC Developers Summit, June 2-4 2004, pp. 105–117.
[6] I. Pryanishnikov, A. Krall, and N. Horspool, “Compiler optimizations for processors with SIMD instructions,” in Software Practice and Experience, vol. 37, no. 1. New York, NY, USA: John Wiley & Sons, Inc., January 2007, pp. 93–113.
[7] S. Larsen, R. Rabbah, and S. Amarasinghe, “Exploiting vector parallelism in software pipelined loops,” in MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, November 12-16 2005, pp. 119–129.
[8] A. Krall and S. Lelait, “Compilation techniques for multimedia processors,” in International Journal of Parallel Programming, vol. 28, no. 4. Norwell, MA, USA: Kluwer Academic Publishers, August 2000, pp. 347–361.
[9] P. Wu, A. E. Eichenberger, A. Wang, and P. Zhao, “An integrated simdization framework using virtual vectors,” in ICS ’05: Proceedings of the 19th Annual International Conference on Supercomputing. New York, NY, USA: ACM, June 20-22 2005, pp. 169–178.
[10] M. Hohenauer, F. Engel, R. Leupers, G. Ascheid, and H. Meyr, “A SIMD optimization framework for retargetable compilers,” in ACM Transactions on Architecture and Code Optimization, vol. 6, no. 1. New York, NY, USA: ACM, March 2009, pp. 1–27.
[11] R. L. Bocchino Jr. and V. S. Adve, “Vector LLVA: a virtual vector instruction set for media processing,” in VEE ’06: Proceedings of the 2nd International Conference on Virtual Execution Environments. New York, NY, USA: ACM, June 14-16 2006, pp. 46–56.
[12] D. Nuzman and R. Henderson, “Multi-platform auto-vectorization,” in CGO ’06: Proceedings of the International Symposium on Code Generation and Optimization. Washington, DC, USA: IEEE Computer Society, March 26-29 2006, pp. 281–294.
[13] A. E. Eichenberger, P. Wu, and K. O’Brien, “Vectorization for SIMD architectures with alignment constraints,” in PLDI ’04: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation. New York, NY, USA: ACM, June 09-11 2004, pp. 82–93.
[14] Sun Microsystems, Inc., UltraSPARC T2 supplement to the UltraSPARC architecture 2007, 2007.
[15] A. J. C. Bik, M. Girkar, P. M. Grey, and X. Tian, “Automatic intra-register vectorization for the Intel® architecture,” in International Journal of Parallel Programming, vol. 30, no. 2. Norwell, MA, USA: Kluwer Academic Publishers, April 2002, pp. 65–98.
[16] C. Lattner and V. Adve, “LLVM: a compilation framework for lifelong program analysis & transformation,” in CGO ’04: Proceedings of the International Symposium on Code Generation and Optimization: feedback-directed and runtime optimization. Washington, DC, USA: IEEE Computer Society, March 20-24 2004, pp. 75–86.
[17] C. Lattner, The LLVM Compiler Infrastructure Project. [Online]. Available: http://llvm.org/
[18] Sun Microsystems, Inc., VIS instruction set user’s manual, 2002.
[19] V. Zivojnovic, J. M. Velarde, C. Schlager, and H. Meyr, “DSPSTONE: A DSP-oriented benchmarking methodology,” in Proceedings of the International Conference on Signal Processing and Technology (ICSPAT’94). Dallas, Tex, USA: Miller Freeman, October 1994, pp. 715–720.
