
286 Int. J. Bio-Inspired Computation, Vol. 3, No. 5, 2011

A distributed multilevel ant-colony algorithm for the multi-way graph partitioning

K. Tashkova*, P. Korošec and J. Šilc
Computer Systems Department,
Jožef Stefan Institute,
Jamova cesta 39, SI-1000 Ljubljana, Slovenia
E-mail: katerina.taskova@ijs.si
E-mail: peter.korosec@ijs.si
E-mail: jurij.silc@ijs.si
*Corresponding author

Abstract: The graph-partitioning problem arises as a fundamental problem in many important scientific and engineering applications. A variety of optimisation methods are used for solving this problem, and among them the meta-heuristics stand out for their efficiency and robustness. Here, we address the performance of the distributed multilevel ant-colony algorithm (DMACA), a meta-heuristic approach for solving the multi-way graph partitioning problem that is based on the ant-colony optimisation paradigm and integrated with a multilevel procedure. The basic idea of the DMACA consists of parallel, independent runs enhanced with cooperation in the form of a solution exchange among the concurrent searches. The objective of the DMACA is to reduce the overall computation time while preserving the quality of the solutions obtained by the sequential version. The experimental evaluation on two-way and four-way partitioning with 1% and 5% imbalance confirms that, with respect to the sequential version, the DMACA obtains statistically equally good solutions at a 99% confidence level within a reduced overall computation time.

Keywords: ant-colony optimisation; bio-inspired computation; distributed computing; graph partitioning; multilevel approach.

Reference to this paper should be made as follows: Tashkova, K., Korošec, P. and Šilc, J. (2011) 'A distributed multilevel ant-colony algorithm for the multi-way graph partitioning', Int. J. Bio-Inspired Computation, Vol. 3, No. 5, pp.286–296.

Biographical notes: K. Tashkova is a PhD student at the Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. She received her BS in Electrical Engineering from the 'St. Cyril and Methodius' University, Skopje, Macedonia, in 2005. Since 2007, she has been a Young Researcher at the Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia. Her current areas of research include numerical optimisation, mathematical modelling and equation discovery.

P. Korošec is a Researcher at the Jožef Stefan Institute, Ljubljana, and an Assistant Professor at the University of Primorska, Koper, Slovenia. His current areas of research include combinatorial and numerical optimisation with ant-based meta-heuristics, and distributed computing.

J. Šilc is the Deputy Head of the Department of Computer Systems at the Jožef Stefan Institute, Ljubljana, Slovenia, and an Assistant Professor at the Jožef Stefan Postgraduate School, Ljubljana, Slovenia. His research interests include processor architecture, parallel computing, and combinatorial and numerical optimisation.

1 Introduction

The problem of finding a partitioning of a given graph G into several subgraphs with respect to constraints (determined by the specific application), while minimising a given objective function, is the most general formulation of the graph-partitioning problem. It arises as a fundamental problem in many important scientific and engineering applications, like parallel computation, sparse matrix-vector multiplication, sparse Gaussian elimination, VLSI design, image segmentation, telephone-network design, air-traffic management, data clustering, the physical mapping of DNA and many others (Alpert and Kahng, 1995; Jain et al., 1999; Simon, 1991; Ucar et al., 2007; Toril et al., 2010).

The most common formulation of this problem is known as the multi-way graph partitioning problem. It consists of finding a partitioning of the given graph into k subgraphs in such a way that the sum of the vertex weights is almost equal in each subgraph, while the number of edges crossing between the subgraphs is minimised. The multi-way graph partitioning problem is most probably NP-hard (Garey et al., 1974). Typically, this problem is too difficult to be

Copyright © 2011 Inderscience Enterprises Ltd.


288 K. Tashkova et al.

Based on the formal definition given by Bichot (2007), the graph-partitioning problem can be formulated as follows. Given a graph G = (V, E), where V is the set of vertices and E is the set of edges. If edges or vertices are not weighted, then all of them are set to a unit weight. For each vertex v_i ∈ V, let w(v_i) be its weight, and for each edge e = (v_i, v_j) ∈ E, let w(v_i, v_j) be its weight. Find a partition π_k of k subsets V_1, …, V_k of V such that: ∪_{i=1}^{k} V_i = V; ∀i, j ∈ {1, …, k}, i ≠ j, V_i ∩ V_j = ∅; the constraint C(V_i) is true; and the cost function cut_size(π_k) is minimised. Let cut be the cut between two distinct parts:

    cut(V_i, V_j) = Σ_{u ∈ V_i, v ∈ V_j} w(u, v),    (1)

i.e., the sum of the weights of the edges between part V_i and part V_j for i ≠ j. Let W be the weight of a part V_i:

    W(V_i) = Σ_{v ∈ V_i} w(v),    (2)

i.e., the sum of the weights of the vertices of part V_i. The k-way graph-partitioning problem is defined with the constraint C_k:

    ∀i, W(V_i) < β ⌈W(V)/k⌉,    (3)

where the function ⌈x⌉ returns the smallest integer greater than or equal to x, and the imbalance factor β is low (from 1.0 to 1.1).
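A minimal sketch of these definitions, with an illustrative edge-weight dictionary (the function and variable names are ours, not the paper's):

```python
# Sketch of the cut between two parts and the balance constraint C_k
# from equations (1)-(3), for a graph given as an edge-weight dict.
import math

def cut_size(parts, edge_weights):
    """Sum of weights of edges whose endpoints lie in different parts."""
    total = 0
    for (u, v), w in edge_weights.items():
        pu = next(i for i, p in enumerate(parts) if u in p)
        pv = next(i for i, p in enumerate(parts) if v in p)
        if pu != pv:
            total += w
    return total

def balanced(parts, vertex_weight, beta):
    """Constraint C_k: every part weighs less than beta * ceil(W(V)/k)."""
    total = sum(vertex_weight.values())
    bound = beta * math.ceil(total / len(parts))
    return all(sum(vertex_weight[v] for v in p) < bound for p in parts)

# Unit-weight example: a 4-cycle split into two halves cuts two edges.
edges = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1}
parts = [{0, 1}, {2, 3}]
print(cut_size(parts, edges))                           # 2
print(balanced(parts, {v: 1 for v in range(4)}, 1.05))  # True
```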
The k-way graph-partitioning problem uses the cost function:

    cut_size(π_k) = Σ_{i<j} cut(V_i, V_j).    (4)

3 Multilevel ant-colony algorithm

The MACA is an algorithm for k-way graph partitioning that combines the ant-colony optimisation paradigm with a multilevel technique (Walshaw and Cross, 2001) in a way that provides more efficient behaviour and higher flexibility when dealing with real-world and large-scale problems. The MACA is a recursive-like approach that combines four basic methods: graph partitioning (Solver, i.e., the method based on the ant-colony optimisation paradigm), graph contraction (Coarsening), graph expansion (Refinement) and vertex arrangement (Bucket_Sorting). Algorithm 1 outlines the top-level MACA pseudo code.

In order to be able to present the distributed version of the MACA, a brief description of the particular methods is given in this section. Further details about these methods can be found in Korošec et al. (2004).

Algorithm 1 MACA

1: Graph[0] = Initialisation(Parameters)
2: for l = 0 to L – 1 do
3:   Graph[l + 1] = Coarsening(Graph[l])
4: end for
5: for l = L down to 0 do
6:   BestLevelPartition = Solver(Graph[l])
7:   if l > 0 then
8:     Graph[l – 1] = Refinement(Graph[l])
9:     Bucket_Initialisation()
10:  end if
11: end for
12: BestPartition = BestLevelPartition

3.1 Solver

The main idea of the solver, i.e., the algorithm for k-way graph partitioning, is very simple (Langham and Grant, 1999). It uses k colonies of ants (artificial agents), which are mediated by pheromone trails and a local heuristic, to perform probabilistic moves on a grid (which represents the ants' habitat), while competing for food (initially randomly placed on the grid cells) that is represented by the vertices of the graph. The result of the foraging behaviour of the k colonies is the food stored in k nests, i.e., they decompose the graph into k subgraphs.

3.2 Multilevel framework

The multilevel framework (Barnard and Simon, 1994), as presented in Algorithm 1 and Figure 1, combines a level-based coarsening strategy with a level-based refinement method (in reverse order) to promote faster convergence and scaling to larger problems.

Figure 1 The three phases of multilevel k-way graph partitioning
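The coarsen-solve-refine scheme of Algorithm 1 can be sketched as follows; the coarsening, solver and refinement bodies below are stand-in stubs (halving a vertex list, splitting it in two), not the paper's actual methods:

```python
# Minimal sketch of the multilevel scheme: coarsen L times, partition the
# coarsest graph, then expand and re-optimise level by level.

def coarsen(graph):
    # stub contraction: keep every second vertex of a vertex list
    return graph[::2]

def solve(graph, hint=None):
    # stub solver: split the vertices into two halves
    # (in the MACA this would be the ant-colony search, seeded by `hint`)
    half = len(graph) // 2
    return [set(graph[:half]), set(graph[half:])]

def refine(partition, finer_graph):
    # stub expansion: re-partition the finer graph starting from `partition`
    return solve(finer_graph, hint=partition)

def multilevel(graph, levels):
    hierarchy = [graph]
    for _ in range(levels):
        hierarchy.append(coarsen(hierarchy[-1]))
    partition = solve(hierarchy[-1])          # partition the coarsest graph
    for finer in reversed(hierarchy[:-1]):    # expand back to the original
        partition = refine(partition, finer)
    return partition

parts = multilevel(list(range(16)), levels=3)
print(sorted(len(p) for p in parts))          # [8, 8]
```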


A distributed multilevel ant-colony algorithm for the multi-way graph partitioning 289

Coarsening is a graph contraction procedure that is iterated L times (on L levels). A coarser graph G_{l+1}(V_{l+1}, E_{l+1}) is obtained from a graph G_l(V_l, E_l) by finding the largest independent subset of graph edges and then collapsing them. On the other hand, refinement is a graph expansion procedure that is applied to a partitioned graph G_l and expands it onto its parent graph G_{l–1}. The idea behind this is to solve the problem iteratively, step by step, starting with a very condensed problem representation (the smallest graph), which is then partitioned with the solver. The obtained solution is then expanded to the next-level graph (bigger in size) and its partitioning is further refined with a new iteration of optimisation.
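The contraction step above — picking an independent set of edges (a matching) and collapsing each matched pair into one coarse vertex — can be sketched as follows; note that a single greedy pass gives a maximal, not a maximum, matching:

```python
# Sketch of matching-based coarsening: collapse an independent edge set.

def coarsen(num_vertices, edges):
    matched, matching = set(), []
    for u, v in edges:                      # greedy pass over the edges
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    coarse_of, next_id = {}, 0
    for u, v in matching:                   # each matched pair collapses
        coarse_of[u] = coarse_of[v] = next_id
        next_id += 1
    for v in range(num_vertices):           # unmatched vertices survive alone
        if v not in coarse_of:
            coarse_of[v] = next_id
            next_id += 1
    # coarse edges: fine edges whose endpoints land in distinct coarse vertices
    coarse_edges = {(min(coarse_of[u], coarse_of[v]),
                     max(coarse_of[u], coarse_of[v]))
                    for u, v in edges if coarse_of[u] != coarse_of[v]}
    return next_id, sorted(coarse_edges)

# A path 0-1-2-3 collapses to two coarse vertices joined by one edge.
n, e = coarsen(4, [(0, 1), (1, 2), (2, 3)])
print(n, e)                                 # 2 [(0, 1)]
```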
In this way, we expand the graph to its original size, and on every level l of the expansion we run the solver.

Large graph problems, and the multilevel process itself, induce a rapid increase in the number of vertices in a single grid cell as the number of levels goes up. To overcome this problem, the MACA employs a method based on the basic bucket-sort idea (Fiduccia and Mattheyses, 1982) that accelerates and improves the algorithm's convergence by choosing the most 'promising' vertex from a given cell. Inside the cell, all vertices with a particular gain g are put together in a 'bucket' ranked g, and all non-empty buckets, implemented as doubly-linked lists of vertices, are organised in a 2-3 tree (Bayer and McCreight, 1972). Additionally, the MACA keeps a separate 2-3 tree for each colony on every grid cell that has vertices, in order to achieve even faster searches.

4 Distributed multilevel ant-colony algorithm

An initial study on the parallelisation of the MACA (Tashkova et al., 2008) examined two distributed versions of the MACA.

The first one was based on the parallel interactive colony approach, which, by definition, implied a master/slave implementation and synchronised communication.
A disadvantage of this version was the synchronisation/communication overhead, since an information exchange across the concurrent processors was initiated every time a piece of food was taken or dropped at a new position. Furthermore, the master kept and updated its own local grid matrix of temporal food positions (playing the role of a shared memory) in order to maintain normal and consistent slave activities.

To avoid this communication and still exploit some level of parallelism, the second version distributes the MACA based on the idea of parallel, independent runs (Stützle, 1998) enhanced with cooperation in the form of a solution exchange among the concurrent searches. In this paper, we consider the second approach in a slightly modified version of the one initially introduced as the SIDMACA in Tashkova et al. (2008), and we refer to it simply as the DMACA. The DMACA modifies the SIDMACA search method with respect to the number of iterations, the imbalance setting and the buffer size used for communication, in the way described in the following paragraphs.

The DMACA is basically an approach that allows the exchange of the best temporal solution at the end of every level of the multilevel optimisation process. This exchange requires that the parallel executions of the MACA instances on the available processors be synchronised once per level. This means that the master processor is responsible for synchronising the work of all the slave processors that execute a copy of the DMACA, by managing the information exchange and communication process. The slave processors execute the instances of the DMACA code, signal when the current level of optimisation is finished and send the best partition to the master. When all the slaves finish the current level, the master determines the best solution and broadcasts it to the slaves. In order to proceed with the next level of optimisation, the slave processors first have to update their local memory structures (grid matrix) and afterwards perform a partition expansion (refinement). The main idea of the DMACA is outlined in Algorithm 2.

Algorithm 2 DMACA

Master:
1: Start_All_Slaves()
2: repeat
3:   while all slaves not finished level do
4:     Receive_From_Slave(SlaveBestLevelPartition)
5:     BestLevelPartitions = Add(SlaveBestLevelPartition)
6:   end while
7:   BestLevelPartition = Calculate(BestLevelPartitions)
8:   Broadcast_To_Slaves(BestLevelPartition)
9:   if last level finished then
10:    BestPartition = BestLevelPartition
11:  end if
12: until last level finished
13: Stop_All_Slaves()

Slave:
1: Receive_From_Master(Parameters)
2: Graph[0] = Initialisation(Parameters)
3: for l = 0 to L – 1 do
4:   Graph[l + 1] = Coarsening(Graph[l])
5: end for
6: for l = L down to 0 do
7:   SlaveBestLevelPartition = Solver(Graph[l])
8:   Send_To_Master(SlaveBestLevelPartition)
9:   Receive_From_Master(BestLevelPartition)
10:  Update(Graph[l], BestLevelPartition)
11:  if l > 0 then
12:    Graph[l – 1] = Refinement(Graph[l])


290 K. Tashkova et al.

13:    Bucket_Initialisation()
14:  end if
15: end for

The first characteristic of the MACA is that it does not rigidly fix the total number of iterations per level. More precisely, if the number of iterations per level is set to some arbitrary number m, the search procedure stops when in the last successive m iterations no improvement over the best solution is obtained. Second, it employs a limited search for the optimal subgraph inside a given constant imbalance (only) on the last three levels.

Furthermore, the initial distributed version SIDMACA inherited the above-described search procedure completely unmodified from the original MACA, meaning that the comparison in the previous study was not based on a fixed number of iterations. In order to obtain an appropriate comparison of parallel performance, the DMACA employs a fixed number of iterations per corresponding level l, according to the formula

    iter[l] = m                  for l = 0, 1
    iter[l] = iter[l – 1] + m    for l = 2, …, L – 2    (5)
    iter[l] = 2·iter[l – 1]      for l = L – 1.

This scaling of the number of iterations is, in a way, proportional to the size of the graphs on the different levels. An intuitive explanation comes from the fact that at the higher levels, associated with the smaller (contracted) graphs, the search needs fewer iterations than at the lower levels with the larger graphs.
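The schedule in formula (5) can be sketched directly; here index 0 denotes the level that receives the fewest iterations (the coarsest graph, on our reading of the surrounding text), and the finest level gets double the previous count:

```python
# Sketch of the per-level iteration schedule of formula (5):
# m iterations on the first two levels, then +m per level,
# doubling at the last (finest) level.

def iteration_schedule(m, L):
    iters = []
    for l in range(L):
        if l <= 1:
            iters.append(m)                # l = 0, 1
        elif l <= L - 2:
            iters.append(iters[-1] + m)    # l = 2, ..., L-2
        else:
            iters.append(2 * iters[-1])    # l = L-1, the original graph
    return iters

# With the paper's m = 200 and, say, L = 6 levels:
print(iteration_schedule(200, 6))          # [200, 200, 400, 600, 800, 1600]
```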
Consequently, at the final stage, when we obtain the original graph, we allow the search the largest number of iterations for the final refinement of the graph partitioning. Moreover, the imbalance factor was scaled by levels as well: starting from level L – 3, a constraint on the imbalance was introduced, and its value was decreased down to a specified threshold (in our case β = 1.01 or β = 1.05) at the first level. There was no limitation on the imbalance at the last two levels, as the graphs are very small in size.

Finally, the buffer used for communication in the SIDMACA version was allocated in advance. It was fixed at a value big enough to support the transfer of the best solution of the largest (in terms of the number of vertices) graph tested. To avoid unnecessary communication overhead when partitioning differently sized graphs, in the DMACA the buffer was dynamically allocated according to the size of the graph.

5 Experimental evaluation

The proposed DMACA was applied to a set of benchmark graphs, and the results from the experimental evaluation on two-way and four-way graph partitioning are presented and discussed in this section.

5.1 Performance measures

The quality of the graph partitioning is described by the cut-size measure and the imbalance factor.
Since the imbalance of the obtained solutions is kept in a predefined range of values, we report on the quality in terms of the number of cut edges, cut_size(π_k).

A statistical significance test was performed to check the difference in the quality of the solutions obtained with the MACA and the DMACA. We used pairwise comparisons with the signed-rank test proposed by Wilcoxon (1945) and multiple comparisons with the dynamic post-hoc procedure proposed by Bergmann and Hommel (1988). Based on these procedures, with a chosen significance level α (in our case 0.01), we make a decision about the null hypothesis that 'there is no difference in performance between the compared methods'. If the p-value is smaller than α, the null hypothesis is rejected; otherwise it is not rejected. Here, the p-value is determined according to Friedman's statistic using the cut-size results.

Finally, the effectiveness of the parallel algorithm is, in our case, given by the speed-up measures

    S_a(n) = t_S / t_P(n)    (6)

and

    S_r(n) = t_P(1) / t_P(n),    (7)

where t_S is the time to solve a problem with the sequential code (MACA), and t_P(1) and t_P(n) are the times to solve the same problem with the parallel code (DMACA) on a single processor and on n processors, respectively.
Following the study of Barr and Hickman (1993), the speed-up results were calculated based on the mean value of the time for the serial code, while the final result is presented as the harmonic mean of the speed-up values over all the runs.

5.2 Setup

The DMACA is implemented in Borland Delphi, using the TCP/IP protocol for the server/client communication, based on the open-source library Indy Sockets 10. All the experiments were performed on an eight-node cluster connected via a Gigabit switch, where each node consists of two AMD Opteron 1.8 GHz processors, 2 GB of RAM, and the Windows XP operating system.

The benchmark graphs used in the experimental analysis were taken from The Graph Partitioning Archive, available online at http://staffweb.cms.gre.ac.uk/~c.walshaw/partition/. Their description, in terms of the number of graph vertices and the number of graph edges, is presented in Table 1, while the best available solutions for the particular graphs up to June 2010 are presented in Table 2.
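The speed-up measures (6)-(7) and the harmonic-mean reporting can be sketched as follows; the run times below are made-up illustrative numbers, not measurements from the paper:

```python
# Sketch of absolute/relative speed-up, eqs. (6)-(7), summarised with the
# harmonic mean over runs as recommended by Barr and Hickman (1993).
from statistics import harmonic_mean

def absolute_speedup(t_serial_mean, t_parallel_n):
    return t_serial_mean / t_parallel_n      # S_a(n), eq. (6)

def relative_speedup(t_parallel_1, t_parallel_n):
    return t_parallel_1 / t_parallel_n       # S_r(n), eq. (7)

serial_mean = 120.0                          # mean serial time over runs (made up)
parallel_times = [32.0, 30.0, 40.0]          # per-run times on n processors (made up)
speedups = [absolute_speedup(serial_mean, t) for t in parallel_times]
print(round(harmonic_mean(speedups), 2))     # 3.53
```

The harmonic mean is the conservative choice here: it is dominated by the slowest runs, so one lucky fast run cannot inflate the reported speed-up.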


A distributed multilevel ant-colony algorithm for the multi-way graph partitioning 291

Table 1 Benchmark graphs

                                     Degree
Graph name    |V|      |E|       min   max   avg.
add20         2,395    7,462     1     123   6.23
data          2,851    15,093    3     17    10.59
uk            4,824    6,837     1     3     2.83
bcsstk33      8,738    291,583   19    140   66.74
crack         10,240   30,380    3     9     5.93
wing_nodal    10,937   75,488    5     28    13.80
vibrobox      12,328   165,250   8     120   26.81
4elt          15,606   45,878    3     10    5.88
memplus       17,758   54,196    1     573   6.10
cs4           22,499   43,858    2     4     3.90

The total number of ants per colony was set to 120. The number of ants per sub-colony was determined from the number n of processors as 1/n of the total number of ants. With regard to the imbalance, we performed two sets of experiments, for β = 1.01 and β = 1.05. All the experiments were run 30 times.

Since the original MACA method and the proposed DMACA have slightly different search settings, in order to give them an equal chance in the experimental evaluation, we defined the MACA with a scaled number of iterations and a scaled imbalance, as described in the previous section when the DMACA was introduced. The number of iterations m was set to 200.

5.3 Results

Reporting results from experiments with parallel algorithms is not a straightforward task (Barr and Hickman, 1993).
Moreover, in the case of stochastic algorithms (like the DMACA), the repeatability of the algorithm's outcome is questionable, making the performance-evaluation procedure even more difficult. The standard way of reporting results, with the mean value and the corresponding variance of the best found solutions over all the performed executions (runs), is not always sufficient (the mean value can be far away from the best obtained solution). Because of the general practice of reporting the best obtained solutions in the field of multi-way graph partitioning, we report the best value for the cut-size measure obtained from 30 runs. The results on cut-size performance obtained with the MACA and the DMACA are presented in Tables 2 and 3, respectively.
The relative distance dist [%] of the DMACA solutions with regard to the best available solutions is also calculated and given in Table 3.

Compared with the currently available best solutions for the given graphs, the best solutions obtained with the DMACA are worse than the ones given in Table 2, except for the two-way partitioning of the uk graph in the case of imbalance β = 1.01, where we get the same solution. According to Table 3, the relative distances of the best solutions obtained by the DMACA for the two-way partitioning problem are smallest for the crack (less than 2.2%), wing_nodal (less than 3.8%) and uk (less than 5.6%) graphs. The largest deviations from the best available solutions in the case of the two-way partitioning problem are observed for the add20 (up to 30.4%) and 4elt (up to 42.3%) graphs.
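The relative distance reported in Table 3 is simply the percentage by which a cut-size exceeds the best available solution; illustrated here on a value taken from the text (uk, two-way, β = 1.05: best available 18, DMACA 19):

```python
# dist [%]: how far a cut-size lies above the best available solution.

def relative_distance(cut, best):
    return 100.0 * (cut - best) / best

print(round(relative_distance(19, 18), 1))   # 5.6
```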
In the case of the four-way partitioning problem, the relative distances of the best solutions obtained by the DMACA are smallest for the add20 (less than 2.4%) and wing_nodal (less than 4.5%) graphs; the solutions for the crack, vibrobox and memplus graphs are worse by up to 8.6% than the best available, while the largest deviation of 40.5% is observed for the solution of the uk graph.

Table 2 Best available (June 2010) and best cut-size values for the benchmark graphs obtained with the MACA

              Best known value for cut-size measure       MACA
              β = 1.01         β = 1.05         β = 1.01         β = 1.05
Graph name    k = 2   k = 4    k = 2   k = 4    k = 2   k = 4    k = 2   k = 4
add20         594     1,177    550     1,157    718     1,165    698     1,192
data          188     383      181     368      208     432      208     432
uk            19      42       18      40       19      56       19      56
bcsstk33      10,097  21,508   9,914   20,584   10,814  23,003   10,710  22,744
crack         183     362      182     360      184     375      184     370
wing_nodal    1,696   3,572    1,668   3,536    1,725   3,644    1,709   3,670
vibrobox      10,310  19,199   10,310  18,778   11,256  19,996   11,421  20,099
4elt          138     321      137     315      140     356      139     342
memplus       5,489   9,559    5,267   9,299    6,227   10,076   6,122   10,072
cs4           367     940      365     936      397     1,033    394     1,025


292 K. Tashkova et al.

Table 3 Best values for cut_size(π_k) measure obtained with the DMACA and corresponding relative distance dist in percentage with regard to the best available solution from Table 2

                  β = 1.01                          β = 1.05
                  k = 2           k = 4             k = 2           k = 4
Graph name    n   cut_size dist%  cut_size dist%    cut_size dist%  cut_size dist%
add20         1   701      18.0   1,178    0.1      716      30.2   118      2.2
              2   717      20.7   1,184    0.6      706      28.4   118      2.1
              4   717      20.7   1,184    0.6      717      30.4   118      2.3
              8   717      20.7   1,184    0.6      717      30.4   118      2.4
              16  717      20.7   1,187    0.8      717      30.4   118      2.4
data          1   208      10.6   425      11.0     208      14.9   40       10.6
              2   208      10.6   407      6.3      208      14.9   40       10.9
              4   208      10.6   428      11.7     210      16.0   40       11.1
              8   210      11.7   432      12.8     210      16.0   43       18.5
              16  223      18.6   408      6.5      230      27.1   43       19.0
uk            1   19       0.0    59       40.5     19       5.6    5        35.0
              2   19       0.0    51       21.4     19       5.6    5        27.5
              4   19       0.0    46       9.5      19       5.6    4        20.0
              8   19       0.0    48       14.3     19       5.6    5        30.0
              16  19       0.0    50       19.0     19       5.6    4        22.5
bcsstk33      1   10,719   6.2    23,170   7.7      10,706   8.0    2,285    11.0
              2   10,805   7.0    23,430   8.9      10,874   9.7    2,332    13.3
              4   10,884   7.8    23,156   7.7      10,545   6.4    2,308    12.2
              8   10,555   4.5    23,463   9.1      10,586   6.8    2,324    12.9
              16  10,882   7.8    22,810   6.1      10,922   10.2   2,355    14.4
crack         1   184      0.5    369      1.9      184      1.1    37       3.3
              2   184      0.5    382      5.5      184      1.1    37       2.8
              4   184      0.5    374      3.3      184      1.1    37       3.1
              8   187      2.2    378      4.4      184      1.1    37       5.0
              16  185      1.1    384      6.1      184      1.1    38       8.1
wing_nodal    1   1,722    1.5    3,619    1.3      1,723    3.3    364      2.9
              2   1,715    1.1    3,659    2.4      1,710    2.5    366      3.8
              4   1,713    1.0    3,676    2.9      1,716    2.9    367      3.8
              8   1,718    1.3    3,651    2.2      1,714    2.8    367      3.8
              16  1,736    2.4    3,700    3.6      1,732    3.8    369      4.5
vibrobox      1   11,583   12.3   19,905   3.7      11,528   11.8   2,000    6.5
              2   11,313   9.7    20,198   5.2      11,471   11.3   2,035    8.4
              4   11,376   10.3   20,078   4.6      11,567   12.2   2,003    6.7
              8   11,244   9.1    20,340   5.9      11,226   8.9    2,038    8.6
              16  11,693   13.4   20,244   5.4      11,626   12.8   2,038    8.5
4elt          1   140      1.4    348      8.4      154      12.4   36       16.5
              2   173      25.4   334      4.0      139      1.5    34       7.9
              4   139      0.7    350      9.0      140      2.2    35       14.0
              8   185      34.1   344      7.2      141      2.9    34       9.2
              16  179      29.7   348      8.4      195      42.3   34       9.8


Table 3  Best values for the cut_size(π_k) measure obtained with the DMACA and the corresponding relative distance dist (in %) with regard to the best available solution from Table 2 (continued)

                       β = 1.01                           β = 1.05
                  k = 2            k = 4            k = 2            k = 4
Graph name    n   cut_size  dist   cut_size  dist   cut_size  dist   cut_size  dist
memplus       1    6,207    13.1    10,058    5.2    6,191    17.5    1,005     8.2
              2    6,219    13.3     9,977    4.4    6,198    17.7    1,005     8.1
              4    6,168    12.4    10,071    5.4    6,196    17.6      997     7.2
              8    6,239    13.7    10,030    4.9    6,172    17.2    1,002     7.8
             16    6,206    13.1     9,994    4.6    6,204    17.8    1,001     7.7
cs4           1      391     6.5     1,022    8.7      391     7.1    1,008     7.7
              2      399     8.7     1,040   10.6      403    10.4    1,038    10.9
              4      403     9.8     1,033    9.9      402    10.1    1,032    10.3
              8      406    10.6     1,060   12.8      408    11.8    1,041    11.2
             16      411    12.0     1,041   10.7      422    15.6    1,040    11.1

As our primary goal was not to get the best possible solutions out of the DMACA and compare them with the state-of-the-art algorithms for graph partitioning, but to preserve the quality and improve the execution time of the MACA when distributed in a multiprocessor environment, we did not fine-tune the algorithms' parameters when applied to a specific graph problem.
This means that for all the experiments with all the graphs we used the same settings, regardless of how big or complex the graph was.

Based on the cut-size results, Table 4 presents pairwise comparisons with the Wilcoxon signed-rank test. The test confirms that, in general, there is no significant difference in the quality of the solutions generated with the MACA and the DMACA, except in the case of the two-way partitioning problem with imbalance β = 1.05, where the MACA is significantly better than the DMACA with n = 16 at the 1% significance level (α = 0.01).

Table 4  Pairwise comparisons with the Wilcoxon test (p-value)

                            β = 1.01          β = 1.05
Hypothesis                  k = 2    k = 4    k = 2    k = 4
MACA vs. DMACA n=1          0.44     0.32     0.08     0.36
MACA vs. DMACA n=2          0.94     0.61     0.03     1
MACA vs. DMACA n=4          0.78     0.43     0.20     0.55
MACA vs. DMACA n=8          0.85     0.36     0.72     0.31
MACA vs. DMACA n=16         0.78     0.94     0.01     0.23

Similarly, Table 5 presents the results of multiple comparisons with the Bergmann-Hommel dynamic post-hoc procedure between instances of the DMACA run on different numbers of processors. At the 1% significance level all the hypotheses are retained, meaning that there is no significant difference between the solutions generated with the DMACA on different numbers of processors.

Since the quality of the DMACA solutions is preserved, any speed-up we can gain is beneficial.
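The pairwise tests in Table 4 are based on the Wilcoxon signed-rank statistic. A minimal pure-Python sketch of the statistic follows; the p-values in the table would normally come from a statistics package, the paired values below are illustrative (not the paper's raw data), and the function name is our own:

```python
def wilcoxon_w(x, y):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired samples.

    Zero differences are discarded; tied |d| values receive average ranks."""
    d = [a - b for a, b in zip(x, y) if a != b]
    abs_sorted = sorted(abs(v) for v in d)

    def avg_rank(v):
        # average rank of absolute difference v (handles ties)
        first = abs_sorted.index(v) + 1
        count = abs_sorted.count(v)
        return first + (count - 1) / 2.0

    w_plus = sum(avg_rank(abs(v)) for v in d if v > 0)
    w_minus = sum(avg_rank(abs(v)) for v in d if v < 0)
    return min(w_plus, w_minus)

# illustrative paired cut-size values for five graphs
maca  = [718, 208, 19, 10814, 184]
dmaca = [717, 208, 19, 10882, 185]
print(wilcoxon_w(maca, dmaca))  # -> 1.5
```

The two zero differences are dropped, leaving three signed differences to rank, which is what produces the small W value here.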
The mean (harmonic) values of the absolute and relative speed-ups obtained when the DMACA was applied to the 10 graphs are presented in Figure 2.

Table 5  Multiple comparisons with the Bergmann-Hommel procedure (adjusted p-value)

                                 β = 1.01          β = 1.05
Hypothesis                       k = 2    k = 4    k = 2    k = 4
DMACA n=1 vs. DMACA n=2          1        1        1        1
DMACA n=1 vs. DMACA n=4          1        1        1        1
DMACA n=1 vs. DMACA n=8          0.82     1        1        0.24
DMACA n=1 vs. DMACA n=16         0.20     0.05     1        0.24
DMACA n=2 vs. DMACA n=4          1        1        1        1
DMACA n=2 vs. DMACA n=8          0.82     1        1        0.24
DMACA n=2 vs. DMACA n=16         0.20     0.02     1        0.24
DMACA n=4 vs. DMACA n=8          0.82     1        1        0.24
DMACA n=4 vs. DMACA n=16         0.20     0.08     1        0.24
DMACA n=8 vs. DMACA n=16         1        0.05     1        1

In general, the observed speed-ups for the four-way partitioning task are slightly higher than those for the two-way partitioning task, reaching up to 2, 3, 5.3, and 8 when executing the DMACA with n = 2, 4, 8, and 16, respectively. When solving the two-way partitioning task with the DMACA for n = 2, 4, 8, and 16, the speed-up reaches up to 2, 2.6, 4, and 6, respectively.

Based on the speed-up results visualised in Figure 2, Table 6 summarises the performance of the DMACA with respect to the minimum and maximum speed-up values over all test graphs for both partitioning problems. The results show that the maximal speed-ups are obtained for the uk, crack and cs4 graphs.
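The speed-up figures can be reproduced with a few lines, assuming the usual definitions (absolute speed-up: versus the sequential MACA; relative speed-up: versus the DMACA on one processor) and the harmonic-mean aggregation stated above. The run times below are illustrative, not taken from the paper, and the helper names are our own:

```python
def harmonic_mean(values):
    """Harmonic mean, the aggregation used for the mean speed-up figures."""
    return len(values) / sum(1.0 / v for v in values)

def speedups(t_serial, t_single, t_parallel):
    """Absolute speed-up (vs. the sequential run) and relative speed-up
    (vs. the distributed run on one processor)."""
    return t_serial / t_parallel, t_single / t_parallel

# illustrative run times in seconds
s_abs, s_rel = speedups(t_serial=100.0, t_single=110.0, t_parallel=27.5)
print(round(s_abs, 2), s_rel)  # -> 3.64 4.0

# aggregating two per-graph speed-ups
print(round(harmonic_mean([2.27, 1.94]), 2))  # -> 2.09
```

The harmonic mean penalises outliers on the high side, which is why it is the conventional choice for averaging speed-up ratios.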
While the uk graph is the smallest of these three graphs, and cs4 is the biggest of all the tested graphs, all three have, on average, a relatively small number of connections per vertex. Moreover, the minimal speed-ups observed when executing the DMACA for n = 2, 4, 8, and 16 on the bcsstk33 graph, together with the relatively small speed-ups for the vibrobox graph, reveal that the DMACA code is potentially weak on graphs with a high degree of connections per vertex and on graphs of bigger size, in terms of the number of vertices. This stems from the bucket-sorting procedure in the MACA, which the DMACA code inherits completely unmodified. In this procedure, the food (vertices) inside a grid cell is sorted into buckets of particular gain, organised in 2-3 trees, for every colony separately. The procedure is triggered every time food is taken by an ant: a bigger graph means more food for foraging, and consequently more frequent calls to the procedure. In addition, a more densely connected graph, like bcsstk33, means a bigger 2-3 tree to search and update. All of this is maintained by every processor that executes an instance of the DMACA.

Figure 2  Observed DMACA speed-ups for the two-way and four-way partitioning tasks on the benchmark graphs (add20, data, uk, bcsstk33, crack, wing_nodal, vibrobox, 4elt, memplus, cs4), constrained to 1% (triangle marker) and 5% (square marker) imbalance

Note: Solid lines with black markers correspond to the absolute speed-up values, while the dashed lines with white markers correspond to the relative speed-up values.
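The update cost described above can be illustrated with a simplified gain-bucket store. The paper keeps the buckets in 2-3 trees, one per colony; the dict-of-sets used below is purely illustrative, and all names are our own, but it shows why every piece of food taken by an ant triggers re-bucketing work that grows with graph size and density:

```python
from collections import defaultdict

class GainBuckets:
    """Simplified gain-bucket store: vertices grouped by their current gain.

    The DMACA maintains these buckets in 2-3 trees per colony; a dict of
    sets is used here only for illustration."""

    def __init__(self):
        self.buckets = defaultdict(set)   # gain -> set of vertices
        self.gain_of = {}                 # vertex -> current gain

    def insert(self, vertex, gain):
        self.buckets[gain].add(vertex)
        self.gain_of[vertex] = gain

    def update(self, vertex, new_gain):
        # called whenever an ant takes food and neighbouring gains change;
        # with a balanced tree each update costs O(log n), and dense graphs
        # such as bcsstk33 trigger many such updates
        old = self.gain_of[vertex]
        self.buckets[old].discard(vertex)
        if not self.buckets[old]:
            del self.buckets[old]
        self.insert(vertex, new_gain)

    def best(self):
        """A vertex with maximal gain (ties broken arbitrarily)."""
        top = max(self.buckets)
        return next(iter(self.buckets[top]))

gb = GainBuckets()
gb.insert("v1", 3); gb.insert("v2", 5); gb.insert("v3", 5)
gb.update("v2", -1)   # v2's gain drops after a move
print(gb.best())      # -> v3
```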


Table 6  Absolute and relative speed-up obtained with the DMACA

                     k = 2                                     k = 4
Speed-up   n    Best          Mean ± StD    Worst          Best          Mean ± StD    Worst
S_a(n)     2    2.27 uk       1.75 ± 0.23   1.39 vibrobox  1.94 add20    1.59 ± 0.19   1.31 data
           4    2.56 uk       1.97 ± 0.27   1.50 bcsstk33  2.77 crack    2.09 ± 0.33   1.49 bcsstk33
           8    3.62 uk       2.83 ± 0.51   1.74 bcsstk33  4.51 crack    3.33 ± 0.75   1.65 bcsstk33
          16    5.27 crack    3.84 ± 0.92   1.88 bcsstk33  6.94 crack    4.93 ± 1.39   1.65 bcsstk33
S_r(n)     2    2.11 cs4      1.86 ± 0.25   1.41 bcsstk33  2.33 cs4      1.80 ± 0.32   1.25 vibrobox
           4    2.56 data     2.11 ± 0.31   1.42 bcsstk33  3.13 data     2.36 ± 0.44   1.53 bcsstk33
           8    3.97 crack    3.03 ± 0.63   1.64 bcsstk33  5.26 cs4      3.78 ± 0.99   1.69 bcsstk33
          16    6.01 crack    4.12 ± 1.39   1.78 bcsstk33  8.26 cs4      5.62 ± 1.75   1.65 bcsstk33

Note: Statistics are calculated from the results obtained for all graphs and both imbalance factors.

A general observation is that the parallel performance of the system with respect to speed-up over the serial MACA is poor compared to the theoretically expected speed-up of n when using n processors.
This is to some extent expected, since the MACA was originally developed for single-processor execution.

6 Conclusions

This paper addressed the distributed multilevel ant-colony algorithm for multi-way graph partitioning, which is based on the idea of parallel, independent runs enhanced with cooperation in the form of a solution exchange among the concurrent searches. Driven by the primary goals of parallel computation, the objective of the paper was not to find the optimum solution in terms of quality, but to find reasonably good solutions in shorter computation times. The experimental evaluation on two-way and four-way partitioning of benchmark graphs, using an eight-node cluster with distributed memory, showed that the distributed algorithm can obtain the same quality as the sequential algorithm, while reducing the overall computation time.

A high degree of graph connectivity can noticeably degrade the parallel performance of the distributed algorithm in terms of speed-up. This is mainly because of the computationally demanding updates in the memory structures used by the bucket-sorting procedure and maintained by every processor on all levels.
Consequently, the bucket-sorting procedure combined with the multilevel process can result in high time consumption.

Since the proposed distributed implementation suffers from increased communication and local memory updates, as initially discussed by Tashkova et al. (2008), a logical next step will be to test a corresponding shared-memory implementation.

References

Alpert, C.J. and Kahng, A.B. (1995) 'Recent directions in netlist partitioning: a survey', Integration, Vol. 19, Nos. 1–2, pp.1–81.

Bahreininejad, A., Topping, B.H.V. and Khan, A.I. (1996) 'Finite element mesh partitioning using neural networks', Advances in Engineering Software, Vol. 27, Nos. 1–2, pp.103–115.

Baños, R., Gil, C., Ortega, J. and Montoya, F.G. (2003) 'Multilevel heuristic algorithm for graph partitioning', Lecture Notes in Computer Science, Vol. 2611, pp.143–153.

Barnard, S.T. and Simon, H.D. (1994) 'Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems', Concurrency and Computation: Practice and Experience, Vol. 6, No. 2, pp.101–117.

Barr, R.S. and Hickman, B.L. (1993) 'Reporting computational experiments with parallel algorithms: issues, measures, and experts' opinion', ORSA Journal on Computing, Vol. 5, No. 1, pp.2–18.

Bayer, R. and McCreight, E.M. (1972) 'Organization and maintenance of large ordered indexes', Acta Informatica, Vol. 1, No. 3, pp.173–189.

Bergmann, B. and Hommel, G. (1988) 'Improvements of general multiple test procedures for redundant systems of hypotheses', in Bauer, P., Hommel, G. and Sonnemann, E. (Eds.): Multiple Hypothesenprüfung – Multiple Hypotheses Testing, pp.100–115, Springer-Verlag.

Bichot, C-E. (2007) 'A new method, the fusion fission, for the relaxed k-way graph partitioning problem, and comparisons with some multilevel algorithms', Journal of Mathematical Modelling and Algorithms, Vol. 6, No. 3, pp.319–344.

Dorigo, M. (1992) 'Optimization, learning and natural algorithms', PhD thesis, Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy.

Dorigo, M. and Stützle, T. (2004) Ant Colony Optimization, The MIT Press, Cambridge, MA.

Fiduccia, C.M. and Mattheyses, R.M. (1982) 'A linear time heuristic for improving network partitions', Proceedings of the 19th IEEE Design Automation Conference, pp.175–181, Las Vegas, NV.

Flynn, M.J. (1972) 'Some computer organizations and their effectiveness', IEEE Transactions on Computers, Vol. 21, No. 9, pp.948–960.

Garey, M.R., Johnson, D.S. and Stockmeyer, L. (1974) 'Some simplified NP-complete problems', Proceedings of the 6th Annual ACM Symposium on Theory of Computing, pp.47–63, Seattle, WA.

Hendrickson, B. and Leland, R. (1995) 'A multilevel algorithm for partitioning graphs', Proceedings of Supercomputing '95, San Diego, CA.


Jain, A.K., Murty, M.N. and Flynn, P.J. (1999) 'Data clustering: a review', ACM Computing Surveys, Vol. 31, No. 3, pp.264–323.

Kadłuczka, P. and Wala, K. (1995) 'Tabu search and genetic algorithms for the generalized graph partitioning problem', Control and Cybernetics, Vol. 24, No. 4, pp.459–476.

Karypis, G. and Kumar, V. (1998) 'Multilevel k-way partitioning scheme for irregular graphs', Journal of Parallel and Distributed Computing, Vol. 48, No. 1, pp.96–129.

Kaveh, A. and Shojaee, S. (2008) 'Optimal domain decomposition via p-median methodology using ACO and hybrid ACGA', Finite Elements in Analysis and Design, Vol. 44, No. 8, pp.505–512.

Kernighan, B.W. and Lin, S. (1970) 'An efficient heuristic procedure for partitioning graphs', The Bell System Technical Journal, Vol. 49, No. 2, pp.291–307.

Korošec, P., Šilc, J. and Robič, B. (2004) 'Solving the mesh-partitioning problem with an ant-colony algorithm', Parallel Computing, Vol. 30, Nos. 5–6, pp.785–801.

Langham, A.E. and Grant, P.W. (1999) 'Using competing ant colonies to solve k-way partitioning problems with foraging and raiding strategies', Lecture Notes in Computer Science, Vol. 1674, pp.621–625.

Randall, M. and Lewis, A. (2002) 'A parallel implementation of ant colony optimization', Journal of Parallel and Distributed Computing, Vol. 62, No. 9, pp.1421–1432.

Schloegel, K., Karypis, G. and Kumar, V. (2003) 'Graph partitioning for high performance scientific simulation', in Dongarra, J., Foster, I., Fox, G., Gropp, W., Kennedy, K., Torczon, L. and White, A. (Eds.): Sourcebook of Parallel Computing, pp.491–541, Morgan Kaufmann Publishers, San Francisco, CA.

Simon, H.D. (1991) 'Partitioning of unstructured problems for parallel processing', Computing Systems in Engineering, Vol. 2, Nos. 2–3, pp.135–148.

Soper, A.J., Walshaw, C. and Cross, M. (2000) 'A combined evolutionary search and multilevel approach to graph partitioning', Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2000), pp.674–681, Las Vegas, NV.

Stützle, T. (1998) 'Parallelization strategies for ant colony optimization', Lecture Notes in Computer Science, Vol. 1498, pp.722–731.

Tashkova, K., Korošec, P. and Šilc, J. (2008) 'A distributed multilevel ant colonies approach', Informatica, Vol. 32, No. 3, pp.307–317.

Toril, M., Molina-Fernández, I., Wille, V. and Walshaw, C. (2010) 'Analysis of heuristic graph partitioning methods for the assignment of packet control units in GERAN', Wireless Personal Communications, in press, doi: 10.1007/s11277-010-9963-1.

Ucar, D., Neuhaus, I., Ross-MacDonald, P., Tilford, C., Parthasarathy, S., Siemers, N. and Ji, R-R. (2007) 'Construction of a reference gene association network from multiple profiling data: application to data analysis', Bioinformatics, Vol. 23, No. 20, pp.2716–2724.

Walshaw, C. and Cross, M. (2001) 'Mesh partitioning: a multilevel balancing and refinement algorithm', SIAM Journal on Scientific Computing, Vol. 22, No. 1, pp.63–80.

Wilcoxon, F. (1945) 'Individual comparisons by ranking methods', Biometrics Bulletin, Vol. 1, No. 6, pp.80–83.
