
Suffix Array Performance Analysis for Multi-core Platforms


Cesar Ochoa (1), Veronica Gil-Costa (1,2) and A. Marcela Printista (1,2)
(1) LIDIC, University of San Luis, Argentina
(2) CONICET, Argentina

Abstract: Performance analysis helps to understand how a particular invocation of an algorithm executes. Using the information provided by specific tools such as the Perf profiler or the Performance Application Programming Interface (PAPI), the performance analysis process provides a bridge between the algorithm execution and processor events, according to the metrics defined by the developer. It is also useful for finding performance limitations that depend exclusively on the code. Furthermore, changing an algorithm in order to optimize the code requires more than understanding the obtained performance; it requires understanding the problem being solved. In this work we evaluate the performance achieved by the suffix array over a 32-core platform. Suffix arrays are efficient data structures for solving complex queries in a number of applications related to text databases, for instance biological databases. We run experiments to evaluate hardware features directly aimed at parallelizing computation. Moreover, according to the results obtained with the performance evaluation tools, we propose an optimization technique to improve the use of the cache memory. In particular, we aim to reduce the number of cache memory replacements performed every time a new query is processed.

I. INTRODUCTION

Several microprocessor design techniques have been used to exploit the parallelism inside a single processor. Among the most relevant we can mention bit-level parallelism, pipelining techniques and multiple functional units [CG99]. These three techniques assume a single sequential flow of control, provided by the compiler, which determines the order of execution when there are dependencies between instructions. For the programmer, these internal techniques have the advantage of allowing parallel execution of instructions from a sequential programming language. However, the degree of parallelism obtained by pipelining and multiple functional units is limited.

The work in [8, 9] showed that increasing the number of transistors on a chip leads to improvements in the architecture of a processor that reduce the average execution time of an instruction. Therefore, an alternative approach to take advantage of the increasing number of transistors on a chip is to place multiple cores on a single processor chip. This approach has been used in desktop processors since 2005 and is known as multi-core processors. There are significant differences in how the cores on a chip or socket may be arranged. These multi-core chips add new levels to the memory (cache) hierarchy [10]. Memory is divided into pages, which can be of variable size (4K, 64K, 4M). Each core has an L1 cache capable of storing a few KB. A second level, the L2 cache, is shared by all cores grouped on the same chip and has a storage capacity of a few MB. An example is the Intel Sandy Bridge-E, which has up to 8 cores on a single chip, a 32K L1 cache, a 256K L2 cache, and up to 20M of L3 cache shared by all cores on the same chip.

To take advantage of current commodity architectures, it is necessary to devise algorithms that exploit (or at least are not hindered by) these architectures. There are several libraries for multi-core systems, such as OpenMP [12], which has a high level of abstraction, TBB [13], which uses cycles to describe data parallelism, and others like IBM X10 [4] and Fortress [1], which focus on data parallelism but also provide task parallelism. A multi-core system offers all cores quick access to a single shared memory, but the amount of memory available is limited. The challenge in this type of architecture is to reduce the execution time of programs.

In this work, we analyze the performance of a string matching algorithm in a multi-core environment, where communication is made only through cache and memory, there are no dedicated instructions for synchronization or communication, and dispatch must be done by (slow) software. In particular, we use the Performance Application Programming Interface (PAPI) tool and the Perf tool to perform performance analysis (which helps to improve applications by revealing their drawbacks) over the suffix array index [11] and then, guided by the results obtained, we propose an optimization technique to increase the number of cache hits. The suffix array index is used for string matching, which is perhaps one of the tasks on which computers and servers spend most of their time. Research in this area spans from genetics (finding DNA sequences), to keyword search in billions of web documents, to cyber-espionage. In fact, several problems are recast as string matching problems to be tractable at all.

The remainder of this article is organized as follows. In Section 2 we describe the suffix array index and the search algorithm. In Section 3 we present the parallel algorithm and the proposed optimization. In Section 4 we present experimental results. Conclusions are given in Section 5.

II. SUFFIX ARRAY

String matching is perhaps one of the most studied areas of computer science [5]. This is not surprising, given that there are multiple areas of application for string processing, information retrieval and computational biology being among the most notable nowadays. The string matching problem consists of finding some string (called the pattern) in some usually much larger string (called the text).

Many index structures have been designed to optimize text indexing [14, 16]. In particular, the suffix array [11] is already 20 years old and is tending to be replaced by indices based on compressed suffix arrays or the Burrows-Wheeler transform [2, 3], which require less memory space. However, these newer indexing structures are slower to operate. A suffix array is a data structure that is used for quickly searching for a keyword in a text database.
Essentially, the suffix array is a particular permutation of all the suffixes of a word. Given a text T[1..n] over an alphabet Σ, the corresponding suffix array SA[1..n] stores pointers to the initial positions of the text suffixes. The array is sorted in lexicographical order of the suffixes, as shown in Fig. 1 for the example text "Performance$" (the symbol $ ∉ Σ is a special text terminator that acts as a sentinel). Given an interval of the suffix array, notice that all the corresponding suffixes in that interval form a lexicographical subinterval of the text suffixes.

Fig. 1: Suffix array for the example text "Performance$".

The suffix array stores information about the lexicographic order of the suffixes of the text T, but no information about the text itself. Therefore, to search for a pattern X[1..m] we have to access both the suffix array and the text T, of length |T| = n. If we want to find all the text suffixes that have X as a prefix (i.e., the suffixes starting with X), then, since the array is lexicographically sorted, the search proceeds by performing two binary searches over the suffix array: one for the immediate predecessor of X and another for its immediate successor. In this way we obtain the interval of the suffix array that contains the pattern occurrences. Fig. 2 illustrates how this search is carried out.

Fig. 2: Search algorithm for a pattern X of size L.

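Fig. 2 itself is not reproduced in this text-only rendering, so the following is only a minimal C sketch of the two binary searches described above; it assumes the text is kept in memory with a terminating '\0' after the '$' sentinel, and the function name sa_interval is illustrative rather than the authors' code.

```c
#include <string.h>

/* T  : text of length n, '$'-terminated and followed by '\0'
 * SA : suffix array; SA[i] is the start position of the i-th smallest suffix
 * X  : search pattern of length L
 * On return, [*lo, *hi) is the range of SA entries whose suffixes start
 * with X; the range is empty when *lo == *hi.                              */
void sa_interval(const char *T, const int *SA, int n,
                 const char *X, int L, int *lo, int *hi)
{
    int left = 0, right = n;

    /* First binary search: smallest index whose suffix is >= X. */
    while (left < right) {
        int mid = left + (right - left) / 2;
        if (strncmp(T + SA[mid], X, (size_t)L) < 0) left = mid + 1;
        else right = mid;
    }
    *lo = left;

    /* Second binary search: smallest index whose suffix exceeds X in its
     * first L characters.                                                  */
    right = n;
    while (left < right) {
        int mid = left + (right - left) / 2;
        if (strncmp(T + SA[mid], X, (size_t)L) <= 0) left = mid + 1;
        else right = mid;
    }
    *hi = left;
}
```

In this sketch, because '$' is smaller than the text characters in ASCII, a suffix shorter than L simply compares as smaller than any pattern, so no extra length checks are needed.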

Finally, Fig. 3 illustrates the code of the sequential search of NQ patterns. To this end, all patterns of length L are loaded into main memory (to avoid the interference of disk access overheads). The algorithm scans the patterns array sequentially using the next() function and, for each pattern X, executes the SA search algorithm.

Fig. 3: Sequential search algorithm.
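Fig. 3 is likewise not reproduced. A minimal C sketch of the sequential driver it describes could look as follows; the flat layout of the patterns array, the next() signature, the results array and the reuse of sa_interval() from the previous sketch are assumptions, not the authors' exact code.

```c
#include <stddef.h>

/* From the previous sketch. */
void sa_interval(const char *T, const int *SA, int n,
                 const char *X, int L, int *lo, int *hi);

/* Return a pointer to the i-th pattern, assuming the NQ patterns of length L
 * were loaded back to back into main memory (hypothetical layout).          */
static const char *next(const char *patterns, int i, int L)
{
    return patterns + (size_t)i * (size_t)L;
}

/* Sequential search of NQ patterns (sketch of the algorithm of Fig. 3). */
void sequential_search(const char *T, const int *SA, int n,
                       const char *patterns, int NQ, int L, int *results)
{
    for (int i = 0; i < NQ; i++) {
        const char *X = next(patterns, i, L);   /* fetch the next pattern     */
        int lo, hi;
        sa_interval(T, SA, n, X, L, &lo, &hi);  /* two binary searches        */
        results[i] = hi - lo;                   /* e.g. number of occurrences */
    }
}
```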
III. SHARED MEMORY PARALLELIZATION

Parallelization for shared memory hardware is expected to be both the simplest and the least scalable way to parallelize a given code [15]. It is simple because the shared memory paradigm requires few changes to the sequential code and is almost straightforward to understand for programmers and algorithm designers.

In this work we use the OpenMP library [12], which allows a very simple implementation of intra-node shared memory parallelism by adding only a few compiler directives. The OpenMP parallelization requires some simple changes to the sequential code. To avoid read/write conflicts, every thread has a local buff variable, which stores a suffix of length L, as well as local left, right and cmp variables, which are used to decide the subsection of the SA where the search continues (see Fig. 2 above). To avoid a critical section, which would incur an overhead (a delay in execution time), we replace the result variable of Fig. 3 with an array results[1..NQ], so that each thread stores the result for pattern Xi in results[i]. The "for" loop is divided among the threads by means of the "#pragma omp for" directive. Fig. 4 shows the threaded execution using OpenMP terminology. The sched_setaffinity function is also used in this implementation to obtain performance benefits by ensuring that threads are allocated to cores belonging to the same socket.

Fig. 4: Parallel search algorithm.
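A sketch of the OpenMP version described above, together with one common way of calling sched_setaffinity, is shown below; it reuses the assumptions of the previous sketches, and the thread-to-CPU mapping is an illustration, not necessarily the exact affinity policy used by the authors.

```c
#define _GNU_SOURCE            /* for cpu_set_t, CPU_SET, sched_setaffinity */
#include <sched.h>
#include <stddef.h>
#include <omp.h>

void sa_interval(const char *T, const int *SA, int n,
                 const char *X, int L, int *lo, int *hi);  /* earlier sketch */

void parallel_search(const char *T, const int *SA, int n,
                     const char *patterns, int NQ, int L, int *results)
{
    #pragma omp parallel
    {
        /* Pin each thread to one core; mapping thread t to CPU t is an
         * assumption about the CPU numbering of the target machine.         */
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(omp_get_thread_num(), &mask);
        sched_setaffinity(0, sizeof(mask), &mask);   /* 0 = calling thread   */

        /* Every iteration touches only thread-local variables (X, lo, hi)
         * and its own slot results[i], so no critical section is needed.    */
        #pragma omp for
        for (int i = 0; i < NQ; i++) {
            const char *X = patterns + (size_t)i * (size_t)L;
            int lo, hi;                  /* local left/right of the search    */
            sa_interval(T, SA, n, X, L, &lo, &hi);
            results[i] = hi - lo;
        }
    }
}
```

With gcc, such code is compiled with -fopenmp, and the number of threads is then controlled through the OMP_NUM_THREADS environment variable (the experiments below use up to 32 threads).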
We further investigate the effect of using a local SA index per thread versus a global shared SA index. To this end, we performed experiments measuring the speedup (sequential execution time divided by parallel execution time) achieved with both approaches. Fig. 5 shows the results obtained with up to 32 threads, an index size of 801M and pattern lengths of L=10 and L=20. In both cases, efficiency is improved by using a local data structure (by 17% with L=10 and by 41% with L=20). Hence, despite performing only read accesses to the SA index, there is an overhead associated with the search operation over the shared data structure when many threads process the search patterns. This overhead includes the administration (creation/deletion) of the local variables such as buff, cmp, left and right every time the Search function is called. However, as the local approach requires more memory, it can be used only when NT*sizeof(SA index) < memory size, where NT is the number of threads.

Fig. 5: Speedup achieved when each thread uses a local copy of the index (Local SA) and when all threads share the same SA index. (a) Results obtained with L=10 and (b) L=20.

A. Optimization
The suffix array search algorithm has features that make it suitable for improving its efficiency on multi-core processors. In particular, we propose to take advantage of the binary search algorithm, which makes some elements of the suffix array more likely to be accessed than others, namely the center element of the suffix array and the left and right centers, as shown in Fig. 6 for a pattern length L=3. This effect is known as temporal locality. Therefore, we propose to introduce an extra data structure, AUX, storing the most frequently referenced elements of the suffix array. This new data structure is small enough to fit into the low-level cache memory and is used to prune the search for patterns. The proposed algorithm is detailed in Fig. 6. To begin searching for a new pattern (iter=1), we access the second position of the AUX data structure. If we have to continue the search on the left side, the next iteration (iter=2) compares the first element of AUX with the search pattern X; otherwise the third element of AUX is compared with the pattern X.

Fig. 6: Cache optimization.
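Fig. 6 is not reproduced here, so the sketch below is only one plausible reading of the AUX idea: a tiny, cache-resident array caches the suffix prefixes found at the center and at the left and right centers of SA (the entries every binary search touches first), the first comparisons of each query are answered from it, and the normal search then continues on the pruned interval. The structure layout, the helper names (build_aux, search_with_aux) and the prefix length bound are assumptions, not the authors' code.

```c
#include <string.h>

#define AUX_LEN 3     /* left center, center, right center                  */
#define MAX_L   32    /* upper bound on the pattern length (assumption)     */

/* Hypothetical layout: AUX[1..3] caches the first L characters of the
 * suffixes at the quarter, half and three-quarter positions of SA.         */
typedef struct {
    char key[AUX_LEN + 1][MAX_L + 1];  /* AUX[1..3], 1-based as in the text */
    int  pos[AUX_LEN + 1];             /* SA positions the keys came from   */
} aux_t;

void build_aux(const char *T, const int *SA, int n, int L, aux_t *aux)
{
    int p[AUX_LEN + 1] = { 0, n / 4, n / 2, n / 2 + n / 4 };
    for (int k = 1; k <= AUX_LEN; k++) {
        aux->pos[k] = p[k];
        strncpy(aux->key[k], T + SA[p[k]], (size_t)L);
        aux->key[k][L] = '\0';
    }
}

/* Search for X[0..L-1]: the first one or two comparisons are answered from
 * AUX, and only then do the usual two binary searches run on the remaining
 * subinterval of SA. The interval of occurrences is returned in [*lo, *hi). */
void search_with_aux(const char *T, const int *SA, int n, const aux_t *aux,
                     const char *X, int L, int *lo, int *hi)
{
    int left = 0, right = n;

    /* iter = 1: compare against the center (AUX[2]); prune only on a strict
     * inequality, since on a match occurrences may lie on both sides.       */
    int cmp = strncmp(aux->key[2], X, (size_t)L);
    if (cmp < 0) left = aux->pos[2] + 1;
    else if (cmp > 0) right = aux->pos[2];

    /* iter = 2: left center (AUX[1]) or right center (AUX[3]).              */
    if (cmp != 0) {
        int k = (cmp < 0) ? 3 : 1;
        int c2 = strncmp(aux->key[k], X, (size_t)L);
        if (c2 < 0 && aux->pos[k] >= left)      left  = aux->pos[k] + 1;
        else if (c2 > 0 && aux->pos[k] < right) right = aux->pos[k];
    }

    /* Standard binary searches (as in the earlier sketch) on [left, right). */
    int a = left, b = right;
    while (a < b) {                       /* first suffix >= X               */
        int mid = a + (b - a) / 2;
        if (strncmp(T + SA[mid], X, (size_t)L) < 0) a = mid + 1; else b = mid;
    }
    *lo = a;
    b = right;
    while (a < b) {                       /* first suffix > X                */
        int mid = a + (b - a) / 2;
        if (strncmp(T + SA[mid], X, (size_t)L) <= 0) a = mid + 1; else b = mid;
    }
    *hi = a;
}
```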
IV. EVALUATION

A. Computing Hardware and Data Preparation

Experiments were performed on a 32-core platform based on AMD Opteron 6128 processors (2.0 GHz, 8 cores, 4M L2 / 12M L3 cache per processor), with 64 GB of memory (16x4 GB, 1333 MHz) and a 500 GB disk. The operating system is 64-bit Linux CentOS 6. As shown in Fig. 7, we used the hwloc (Hardware Locality) tool [7], which collects information from the operating system and builds an abstract and portable object tree: memory nodes (nodes and groups), caches, processor sockets, processor cores, hardware threads, etc.

Fig. 7: Four sockets with two nodes each. Each node has four cores.

To evaluate the performance of the SA index we use a 3-megabyte and a 100-megabyte DNA text from the Pizza&Chili Corpus [6]. The resulting suffix arrays require 25M and 801M, respectively, and the text lengths are n=3145728 and n=104857600. For the queries, we used 1,000,000 random search patterns of lengths 5, 10, 15 and 20.


B. Performance Evaluation

In this section we evaluate the performance of the SA index using version 5.0.1 of the PAPI tool. In particular, we measure the cache hit ratio as well as the execution time. We use the PAPI_thread_init function, which initializes thread support in the PAPI library, and when all threads have finished their work we execute the PAPI_unregister_thread function. For the Perf performance tool we used version 3.2.30; with it we count the L1 data cache hits and the last level cache (LLC) hits.
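The text names PAPI_thread_init and PAPI_unregister_thread; a minimal sketch of how per-thread counting can be wired up with the low-level PAPI API is given below. The measured preset (PAPI_L1_DCM, L1 data cache misses) and the overall structure are assumptions for illustration; which presets are available depends on the hardware.

```c
#include <stdio.h>
#include <pthread.h>
#include <omp.h>
#include <papi.h>

/* Sketch of per-thread event counting around the parallel search. */
void measure_parallel_search(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI initialization failed\n");
        return;
    }
    /* Tell PAPI how to obtain a unique id for each thread. */
    PAPI_thread_init((unsigned long (*)(void)) pthread_self);

    #pragma omp parallel
    {
        int evset = PAPI_NULL;
        long long count[1] = { 0 };

        PAPI_register_thread();
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_L1_DCM);   /* example preset */

        PAPI_start(evset);
        /* ... this thread's share of the pattern searches goes here ...     */
        PAPI_stop(evset, count);

        printf("thread %d: L1 data cache misses = %lld\n",
               omp_get_thread_num(), count[0]);

        PAPI_cleanup_eventset(evset);
        PAPI_destroy_eventset(&evset);
        PAPI_unregister_thread();             /* as described in the text    */
    }
}
```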


Fig. 8: Results obtained with the small index. (a) Execution time reported by the baseline and by the proposed approach for a pattern length L=5; (b) execution time for different pattern lengths with 32 threads; (c) LLC hits for L=5; and (d) ratio between the baseline execution time and the proposed approach execution time with 32 threads.

In Fig. 8 we show results obtained for the small index of 25M and different pattern lengths. Fig. 8(a) shows that the normalized running times obtained with the baseline and with our proposed optimization algorithm for a pattern length of L=5 are almost the same. We show results normalized to 1 in order to better illustrate the comparative performance; to this end, we divide all quantities by the observed maximum in each case. Fig. 8(c) shows the LLC hits reported for the same experiment. The proposal reports an improvement of almost 20% on average, which remains constant for different numbers of threads, although this benefit is not reflected in the running time. Fig. 8(b) shows that the best performance is achieved by our proposal for a pattern length of L=10. This last result is confirmed by Fig. 8(d), where the best ratio (running time reported by the baseline divided by running time reported by the proposal) is obtained with a pattern length of L=10.


Fig. 9: Results obtained with the larger index. (a) Execution time for different pattern lengths with 32 threads; (b) ratio between the baseline execution time and the proposed approach execution time; (c) LLC hits for L=5; and (d) LLC hits for L=20.

Fig. 9 shows results obtained with the large index of 801M. In this case, the best performance of our optimization is reported in Fig. 9(a) for patterns of length L=15 and L=20. This result is confirmed in Fig. 9(b), where the best ratio between the running time reported by the baseline and the running time reported by the proposal is obtained for a search pattern length of L=15. Fig. 9(c) and (d) show the LLC hits for L=5 and L=20. In both figures the LLC hits tend to go down as the number of threads increases; in the former our proposal obtains a gain of 15% on average, but in the latter the gain is reduced to 4%.

V. CONCLUSIONS

In this work we analyzed the performance of the suffix array index by means of the Perf and PAPI tools in a multi-core environment. The parallel codes were implemented with the OpenMP library, and experiments were performed with different index sizes and different pattern lengths over a 32-core platform. We ran experiments to evaluate hardware features directly aimed at parallelizing computation. The results show that read-only operations performed over shared data structures affect performance. We also proposed an optimization scheme that increases the number of cache hits and tends to improve the running time.

REFERENCES

[1] E. Allen, D. Chase, J. Hallett, V. Luchangco, W. Maessen, S. Ryu, and S. Tobin-Hochstadt. The Fortress Language Specification, version 1.0beta. 2007.
[2] D. Adjeroh, T. Bell, and A. Mukherjee. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer, 2008.
[3] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Research Report 124, Digital Systems Research Center, Palo Alto, California, May 1994.
[4] P. Charles, C. Grothoff, V. A. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In International Conference on Object-Oriented Programming, Systems, Languages and Applications, pages 519-538, 2005.
[5] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3rd ed.). MIT Press, 2009.
[6] P. Ferragina and G. Navarro. The Pizza&Chili corpus: compressed indexes and their testbeds. 2005.
[7] The Portable Hardware Locality (hwloc) project: http://www.open-mpi.org/projects/hwloc/
[8] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 4th edition, 2007.
[9] J. L. Hennessy and D. A. Patterson. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 4th edition, 2008.
[10] G. Hager and G. Wellein. Introduction to High Performance Computing for Scientists and Engineers. Chapman & Hall/CRC Computational Science, 2010.
[11] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, vol. 22, no. 5, pp. 935-948, 1993.
[12] OpenMP Architecture Review Board. OpenMP Application Program Interface, Version 3.1, July 2011. Available at http://openmp.org/wp/
[13] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly, 2007.
[14] J. Stoye. Suffix tree construction in RAM. In Encyclopedia of Algorithms. Springer, 2008.
[15] F. G. Tinetti and S. M. Martin. Sequential optimization and shared and distributed memory optimization in clusters: N-BODY/particle simulation. In Parallel and Distributed Computing and Systems (PDCS), 2012.
[16] P. Weiner. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1-11, 1973.
