
Suffix Array Performance Analysis for Multi-core Platforms


Cesar Ochoa (1), Veronica Gil-Costa (1,2) and A. Marcela Printista (1,2)
(1) LIDIC, University of San Luis, Argentina
(2) CONICET, Argentina

Abstract: Performance analysis helps to understand how a particular invocation of an algorithm executes. Using the information provided by specific tools such as the Perf profiler or the Performance Application Programming Interface (PAPI), the performance analysis process provides a bridge between the algorithm execution and processor events, according to the metrics defined by the developer. It is also useful for finding performance limitations that depend exclusively on the code. Furthermore, changing an algorithm in order to optimize the code requires more than understanding the obtained performance; it requires understanding the problem being solved. In this work we evaluate the performance achieved by the suffix array over a 32-core platform. Suffix arrays are efficient data structures for solving complex queries in a number of applications related to text databases, for instance biological databases. We run experiments to evaluate hardware features directly aimed at parallelizing computation. Moreover, according to the results obtained with the performance evaluation tools, we propose an optimization technique to improve the use of the cache memory. In particular, we aim to reduce the number of cache memory replacements performed every time a new query is processed.

I. INTRODUCTION

Several microprocessor design techniques have been used to exploit the parallelism inside a single processor. Among the most relevant we can mention bit-level parallelism, pipelining techniques and multiple functional units [CG99]. These three techniques assume a single sequential flow of control, provided by the compiler, which determines the order of execution when there are dependencies between instructions. For the programmer, these internal techniques have the advantage of allowing parallel execution of instructions from a sequential programming language. However, the degree of parallelism obtained by pipelining and multiple functional units is limited.

The work in [8, 9] showed that increasing the number of transistors on a chip leads to improvements in the architecture of a processor that reduce the average execution time of an instruction. Therefore, an alternative approach to take advantage of the increasing number of transistors on a chip is to place multiple cores on a single processor chip. This approach has been used in desktop processors since 2005 and is known as multi-core processors. There are significant differences in how the cores on a chip or socket may be arranged. These multi-core chips add new levels to the memory (cache) hierarchy [10]. Memory is divided into pages, which can be of variable size (4K, 64K, 4M). Each core has an L1 cache capable of storing a few KB. A second level, the L2 cache, is shared by all cores grouped on the same chip and has a storage capacity of a few MB. An example is the Intel Sandy Bridge-E, which has up to 8 cores on a single chip, a 32K L1 cache, a 256K L2 cache, and up to 20M of L3 cache shared by all cores on the same chip.

To take advantage of current commodity architectures, it is necessary to devise algorithms that exploit (or at least are not hindered by) these architectures. There are several libraries for multi-core systems, such as OpenMP [12], which has a high level of abstraction, TBB [13], which uses cycles to describe data parallelism, and others like IBM X10 [4] and Fortress [1], which focus on data parallelism but also provide task parallelism. A multi-core system offers all cores quick access to a single shared memory, but the amount of memory available is limited. The challenge in this type of architecture is to reduce the execution time of programs.

In this work, we analyze the performance of a string matching algorithm in a multi-core environment, where communication is made only through cache and memory, there are no dedicated instructions for synchronization or communication, and dispatch must be done by (slow) software. In particular, we use the Performance Application Programming Interface (PAPI) tool and the Perf tool to perform performance analysis (which helps to improve applications by revealing their drawbacks) over the suffix array index [11] and then, guided by the results obtained, we propose an optimization technique to increase the number of cache hits. The suffix array index is used for string matching, which is perhaps one of the tasks on which computers and servers spend most of their time. Research in this area spans from genetics (finding DNA sequences), to keyword search in billions of web documents, to cyber-espionage. In fact, several problems are recast as string matching problems to be tractable at all.

The remainder of this article is organized as follows. In Section 2 we describe the suffix array index and the search algorithm. In Section 3 we present the parallel algorithm and the proposed optimization. In Section 4 we present experimental results. Conclusions are given in Section 5.

II. SUFFIX ARRAY

String matching is perhaps one of the most studied areas of computer science [5]. This is not surprising, given that there are multiple areas of application for string processing, information retrieval and computational biology being among the most notable nowadays. The string matching problem consists of finding some string (called the pattern) in some usually much larger string (called the text).

Many index structures have been designed to optimize text indexing [14, 16]. In particular, the suffix array [11] is already 20 years old and is tending to be replaced by indices based on compressed suffix arrays or the Burrows-Wheeler transform [2, 3], which require less memory space. However, these newer indexing structures are slower to operate. A suffix array is a data structure that is used for quickly searching for a keyword in a text database.
Essentially, the suffix array is a particular permutation of all the suffixes of a word. Given a text T[1..n] over an alphabet Σ, the corresponding suffix array SA[1..n] stores pointers to the initial positions of the text suffixes. The array is sorted in lexicographical order of the suffixes, as shown in Fig. 1 for the example text "Performance$" (the symbol $ ∉ Σ is a special text terminator that acts as a sentinel). Given an interval of the suffix array, notice that all the corresponding suffixes in that interval form a lexicographical subinterval of the text suffixes.

Fig. 1: Suffix array for the example text "Performance$".

The suffix array stores information about the lexicographic order of the suffixes of the text T, but no information about the text itself. Therefore, to search for a pattern X[1..m] we have to access both the suffix array and the text T, of length |T| = n. If we want to find all the text suffixes that have X as a prefix (i.e., the suffixes starting with X), then, since the array is lexicographically sorted, the search proceeds by performing two binary searches over the suffix array: one for the immediate predecessor of X and another for its immediate successor. In this way we obtain the interval of the suffix array that contains the pattern occurrences. Fig. 2 illustrates how this search is carried out.

Fig. 2: Search algorithm for a pattern X of size L.

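Fig. 2 itself is not reproduced in this text-only rendering, so the following is only a minimal C sketch of the two binary searches described above; it assumes the text is kept in memory with a terminating '\0' after the '$' sentinel, and the function name sa_interval is illustrative rather than the authors' code.

```c
#include <string.h>

/* T  : text of length n, '$'-terminated and followed by '\0'
 * SA : suffix array; SA[i] is the start position of the i-th smallest suffix
 * X  : search pattern of length L
 * On return, [*lo, *hi) is the range of SA entries whose suffixes start
 * with X; the range is empty when *lo == *hi.                              */
void sa_interval(const char *T, const int *SA, int n,
                 const char *X, int L, int *lo, int *hi)
{
    int left = 0, right = n;

    /* First binary search: smallest index whose suffix is >= X. */
    while (left < right) {
        int mid = left + (right - left) / 2;
        if (strncmp(T + SA[mid], X, (size_t)L) < 0) left = mid + 1;
        else right = mid;
    }
    *lo = left;

    /* Second binary search: smallest index whose suffix exceeds X in its
     * first L characters.                                                  */
    right = n;
    while (left < right) {
        int mid = left + (right - left) / 2;
        if (strncmp(T + SA[mid], X, (size_t)L) <= 0) left = mid + 1;
        else right = mid;
    }
    *hi = left;
}
```

In this sketch, because '$' is smaller than the text characters in ASCII, a suffix shorter than L simply compares as smaller than any pattern, so no extra length checks are needed.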

Finally, Fig. 3 illustrates the code of the sequential search of NQ patterns. To this end, all patterns of length L are loaded into main memory (to avoid the interference of disk access overheads). The algorithm scans the patterns array sequentially using the next() function and, for each pattern X, executes the SA search algorithm.

Fig. 3: Sequential search algorithm.
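Fig. 3 is likewise not reproduced. A minimal C sketch of the sequential driver it describes could look as follows; the flat layout of the patterns array, the next() signature, the results array and the reuse of sa_interval() from the previous sketch are assumptions, not the authors' exact code.

```c
#include <stddef.h>

/* From the previous sketch. */
void sa_interval(const char *T, const int *SA, int n,
                 const char *X, int L, int *lo, int *hi);

/* Return a pointer to the i-th pattern, assuming the NQ patterns of length L
 * were loaded back to back into main memory (hypothetical layout).          */
static const char *next(const char *patterns, int i, int L)
{
    return patterns + (size_t)i * (size_t)L;
}

/* Sequential search of NQ patterns (sketch of the algorithm of Fig. 3). */
void sequential_search(const char *T, const int *SA, int n,
                       const char *patterns, int NQ, int L, int *results)
{
    for (int i = 0; i < NQ; i++) {
        const char *X = next(patterns, i, L);   /* fetch the next pattern     */
        int lo, hi;
        sa_interval(T, SA, n, X, L, &lo, &hi);  /* two binary searches        */
        results[i] = hi - lo;                   /* e.g. number of occurrences */
    }
}
```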
III. SHARED MEMORY PARALLELIZATION

Parallelization for shared memory hardware is expected to be both the simplest and the least scalable way to parallelize a given code [15]. It is simple because the shared memory paradigm requires few changes to the sequential code and is almost straightforward to understand for programmers and algorithm designers.

In this work we use the OpenMP library [12], which allows a very simple implementation of intra-node shared memory parallelism by adding only a few compiler directives. The OpenMP parallelization requires some simple changes to the sequential code. To avoid read/write conflicts, every thread has a local buff variable, which stores a suffix of length L, as well as local left, right and cmp variables, which are used to decide the subsection of the SA where the search continues (see Fig. 2 above). To avoid a critical section, which would incur an overhead (a delay in execution time), we replace the result variable of Fig. 3 with an array results[1..NQ], so that each thread stores the result for pattern Xi in results[i]. The "for" loop is divided among the threads by means of the "#pragma omp for" directive. Fig. 4 shows the threaded execution using OpenMP terminology. The sched_setaffinity function is also used in this implementation to obtain performance benefits by ensuring that threads are allocated to cores belonging to the same socket.

Fig. 4: Parallel search algorithm.
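A sketch of the OpenMP version described above, together with one common way of calling sched_setaffinity, is shown below; it reuses the assumptions of the previous sketches, and the thread-to-CPU mapping is an illustration, not necessarily the exact affinity policy used by the authors.

```c
#define _GNU_SOURCE            /* for cpu_set_t, CPU_SET, sched_setaffinity */
#include <sched.h>
#include <stddef.h>
#include <omp.h>

void sa_interval(const char *T, const int *SA, int n,
                 const char *X, int L, int *lo, int *hi);  /* earlier sketch */

void parallel_search(const char *T, const int *SA, int n,
                     const char *patterns, int NQ, int L, int *results)
{
    #pragma omp parallel
    {
        /* Pin each thread to one core; mapping thread t to CPU t is an
         * assumption about the CPU numbering of the target machine.         */
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(omp_get_thread_num(), &mask);
        sched_setaffinity(0, sizeof(mask), &mask);   /* 0 = calling thread   */

        /* Every iteration touches only thread-local variables (X, lo, hi)
         * and its own slot results[i], so no critical section is needed.    */
        #pragma omp for
        for (int i = 0; i < NQ; i++) {
            const char *X = patterns + (size_t)i * (size_t)L;
            int lo, hi;                  /* local left/right of the search    */
            sa_interval(T, SA, n, X, L, &lo, &hi);
            results[i] = hi - lo;
        }
    }
}
```

With gcc, such code is compiled with -fopenmp, and the number of threads is then controlled through the OMP_NUM_THREADS environment variable (the experiments below use up to 32 threads).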
We further investigate the effect of using a local SA index per thread versus a global shared SA index. To this end, we performed experiments measuring the speedup (sequential execution time divided by parallel execution time) achieved with both approaches. Fig. 5 shows the results obtained with up to 32 threads, an index size of 801M and pattern lengths of L=10 and L=20. In both cases, efficiency is improved by using a local data structure (by 17% with L=10 and by 41% with L=20). Hence, despite performing only read accesses to the SA index, there is an overhead associated with the search operation over the shared data structure when many threads process the search patterns. This overhead includes the administration (creation/deletion) of the local variables such as buff, cmp, left and right every time the Search function is called. However, as the local approach requires more memory, it can be used only when NT*sizeof(SA index) < memory size, where NT is the number of threads.

Fig. 5: Speedup achieved when each thread uses a local copy of the index (Local SA) and when all threads share the same SA index. (a) Results obtained with L=10 and (b) L=20.

A. Optimization
The suffix array search algorithm has features that make it suitable for improving its efficiency on multi-core processors. In particular, we propose to take advantage of the binary search algorithm, which makes some elements of the suffix array more likely to be accessed than others, namely the center element of the suffix array and the left and right centers, as shown in Fig. 6 for a pattern length L=3. This effect is known as temporal locality. Therefore, we propose to introduce an extra data structure, AUX, storing the most frequently referenced elements of the suffix array. This new data structure is small enough to fit into the low-level cache memory and is used to prune the search for patterns. The proposed algorithm is detailed in Fig. 6. To begin searching for a new pattern (iter=1), we access the second position of the AUX data structure. If we have to continue the search on the left side, the next iteration (iter=2) compares the first element of AUX with the search pattern X; otherwise the third element of AUX is compared with the pattern X.

Fig. 6: Cache optimization.
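Fig. 6 is not reproduced here, so the sketch below is only one plausible reading of the AUX idea: a tiny, cache-resident array caches the suffix prefixes found at the center and at the left and right centers of SA (the entries every binary search touches first), the first comparisons of each query are answered from it, and the normal search then continues on the pruned interval. The structure layout, the helper names (build_aux, search_with_aux) and the prefix length bound are assumptions, not the authors' code.

```c
#include <string.h>

#define AUX_LEN 3     /* left center, center, right center                  */
#define MAX_L   32    /* upper bound on the pattern length (assumption)     */

/* Hypothetical layout: AUX[1..3] caches the first L characters of the
 * suffixes at the quarter, half and three-quarter positions of SA.         */
typedef struct {
    char key[AUX_LEN + 1][MAX_L + 1];  /* AUX[1..3], 1-based as in the text */
    int  pos[AUX_LEN + 1];             /* SA positions the keys came from   */
} aux_t;

void build_aux(const char *T, const int *SA, int n, int L, aux_t *aux)
{
    int p[AUX_LEN + 1] = { 0, n / 4, n / 2, n / 2 + n / 4 };
    for (int k = 1; k <= AUX_LEN; k++) {
        aux->pos[k] = p[k];
        strncpy(aux->key[k], T + SA[p[k]], (size_t)L);
        aux->key[k][L] = '\0';
    }
}

/* Search for X[0..L-1]: the first one or two comparisons are answered from
 * AUX, and only then do the usual two binary searches run on the remaining
 * subinterval of SA. The interval of occurrences is returned in [*lo, *hi). */
void search_with_aux(const char *T, const int *SA, int n, const aux_t *aux,
                     const char *X, int L, int *lo, int *hi)
{
    int left = 0, right = n;

    /* iter = 1: compare against the center (AUX[2]); prune only on a strict
     * inequality, since on a match occurrences may lie on both sides.       */
    int cmp = strncmp(aux->key[2], X, (size_t)L);
    if (cmp < 0) left = aux->pos[2] + 1;
    else if (cmp > 0) right = aux->pos[2];

    /* iter = 2: left center (AUX[1]) or right center (AUX[3]).              */
    if (cmp != 0) {
        int k = (cmp < 0) ? 3 : 1;
        int c2 = strncmp(aux->key[k], X, (size_t)L);
        if (c2 < 0 && aux->pos[k] >= left)      left  = aux->pos[k] + 1;
        else if (c2 > 0 && aux->pos[k] < right) right = aux->pos[k];
    }

    /* Standard binary searches (as in the earlier sketch) on [left, right). */
    int a = left, b = right;
    while (a < b) {                       /* first suffix >= X               */
        int mid = a + (b - a) / 2;
        if (strncmp(T + SA[mid], X, (size_t)L) < 0) a = mid + 1; else b = mid;
    }
    *lo = a;
    b = right;
    while (a < b) {                       /* first suffix > X                */
        int mid = a + (b - a) / 2;
        if (strncmp(T + SA[mid], X, (size_t)L) <= 0) a = mid + 1; else b = mid;
    }
    *hi = a;
}
```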
IV. EVALUATION

A. Computing Hardware and Data Preparation

Experiments were performed on a 32-core platform based on AMD Opteron 6128 processors (2.0 GHz, 8 cores, 4M L2 / 12M L3 cache per processor), with 64 GB of memory (16x4 GB, 1333 MHz) and a 500 GB disk. The operating system is 64-bit Linux CentOS 6. As shown in Fig. 7, we used the hwloc (Hardware Locality) tool [7], which collects information from the operating system and builds an abstract and portable object tree: memory nodes (nodes and groups), caches, processor sockets, processor cores, hardware threads, etc.

Fig. 7: Four sockets with two nodes each. Each node has four cores.

To evaluate the performance of the SA index we use a 3-megabyte and a 100-megabyte DNA text from the Pizza&Chili Corpus [6]. The resulting suffix arrays require 25M and 801M, respectively, and the text lengths are n=3145728 and n=104857600. For the queries, we used 1,000,000 random search patterns of lengths 5, 10, 15 and 20.


B. Performance Evaluation

In this section we evaluate the performance of the SA index using version 5.0.1 of the PAPI tool. In particular, we measure the cache hit ratio as well as the execution time. We use the PAPI_thread_init function, which initializes thread support in the PAPI library, and when all threads have finished their work we execute the PAPI_unregister_thread function. For the Perf performance tool we used version 3.2.30; with it we count the L1 data cache hits and the last level cache (LLC) hits.
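The text names PAPI_thread_init and PAPI_unregister_thread; a minimal sketch of how per-thread counting can be wired up with the low-level PAPI API is given below. The measured preset (PAPI_L1_DCM, L1 data cache misses) and the overall structure are assumptions for illustration; which presets are available depends on the hardware.

```c
#include <stdio.h>
#include <pthread.h>
#include <omp.h>
#include <papi.h>

/* Sketch of per-thread event counting around the parallel search. */
void measure_parallel_search(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI initialization failed\n");
        return;
    }
    /* Tell PAPI how to obtain a unique id for each thread. */
    PAPI_thread_init((unsigned long (*)(void)) pthread_self);

    #pragma omp parallel
    {
        int evset = PAPI_NULL;
        long long count[1] = { 0 };

        PAPI_register_thread();
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_L1_DCM);   /* example preset */

        PAPI_start(evset);
        /* ... this thread's share of the pattern searches goes here ...     */
        PAPI_stop(evset, count);

        printf("thread %d: L1 data cache misses = %lld\n",
               omp_get_thread_num(), count[0]);

        PAPI_cleanup_eventset(evset);
        PAPI_destroy_eventset(&evset);
        PAPI_unregister_thread();             /* as described in the text    */
    }
}
```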


Fig. 8: Results obtained with the small index. (a) Execution time reported by the baseline and by the proposed approach for a pattern length L=5; (b) execution time for different pattern lengths with 32 threads; (c) LLC hits for L=5; and (d) ratio between the baseline execution time and the proposed approach execution time with 32 threads.

In Fig. 8 we show results obtained for the small index of 25M and different pattern lengths. Fig. 8(a) shows that the normalized running times obtained with the baseline and with our proposed optimization algorithm for a pattern length of L=5 are almost the same. We show results normalized to 1 in order to better illustrate the comparative performance; to this end, we divide all quantities by the observed maximum in each case. Fig. 8(c) shows the LLC hits reported for the same experiment. The proposal reports an improvement of almost 20% on average, which remains constant for different numbers of threads, although this benefit is not reflected in the running time. Fig. 8(b) shows that the best performance is achieved by our proposal for a pattern length of L=10. This last result is confirmed by Fig. 8(d), where the best ratio (running time reported by the baseline divided by running time reported by the proposal) is obtained with a pattern length of L=10.


Fig. 9: Results obtained with the larger index. (a) Execution time for different pattern lengths with 32 threads; (b) ratio between the baseline execution time and the proposed approach execution time; (c) LLC hits for L=5; and (d) LLC hits for L=20.

Fig. 9 shows results obtained with the large index of 801M. In this case, the best performance of our optimization is reported in Fig. 9(a) for patterns of length L=15 and L=20. This result is confirmed in Fig. 9(b), where the best ratio between the running time reported by the baseline and the running time reported by the proposal is obtained for a search pattern length of L=15. Fig. 9(c) and (d) show the LLC hits for L=5 and L=20. In both figures the LLC hits tend to go down as the number of threads increases; in the former our proposal obtains a gain of 15% on average, but in the latter the gain is reduced to 4%.

V. CONCLUSIONS

In this work we analyzed the performance of the suffix array index by means of the Perf and PAPI tools in a multi-core environment. The parallel codes were implemented with the OpenMP library, and experiments were performed with different index sizes and different pattern lengths over a 32-core platform. We ran experiments to evaluate hardware features directly aimed at parallelizing computation. The results show that read-only operations performed over shared data structures affect performance. We also proposed an optimization scheme that increases the number of cache hits and tends to improve the running time.

REFERENCES

[1] E. Allen, D. Chase, J. Hallett, V. Luchangco, W. Maessen, S. Ryu, and S. Tobin-Hochstadt. The Fortress Language Specification, version 1.0beta. 2007.
[2] D. Adjeroh, T. Bell, and A. Mukherjee. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer, 2008.
[3] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Research Report 124, Digital Systems Research Center, Palo Alto, California, May 1994.
[4] P. Charles, C. Grothoff, V. A. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In International Conference on Object-Oriented Programming, Systems, Languages and Applications, pages 519-538, 2005.
[5] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3rd ed.). MIT Press, 2009.
[6] P. Ferragina and G. Navarro. The Pizza&Chili corpus: compressed indexes and their testbeds. 2005.
[7] The Portable Hardware Locality (hwloc) project: http://www.open-mpi.org/projects/hwloc/
[8] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 4th edition, 2007.
[9] J. L. Hennessy and D. A. Patterson. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 4th edition, 2008.
[10] G. Hager and G. Wellein. Introduction to High Performance Computing for Scientists and Engineers. Chapman & Hall/CRC Computational Science, 2010.
[11] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, vol. 22, no. 5, pp. 935-948, 1993.
[12] OpenMP Architecture Review Board. OpenMP Application Program Interface, Version 3.1, July 2011. Available at http://openmp.org/wp/
[13] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly, 2007.
[14] J. Stoye. Suffix tree construction in RAM. In Encyclopedia of Algorithms. Springer, 2008.
[15] F. G. Tinetti and S. M. Martin. Sequential optimization and shared and distributed memory optimization in clusters: N-BODY/particle simulation. In Parallel and Distributed Computing and Systems (PDCS), 2012.
[16] P. Weiner. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1-11, 1973.
