
Parallelized Critical Path Search in Electrical Circuit Designs

Pascal Bolzhauser¹, Anthony Sulistio², Gerhard Angst¹ and Christoph Reich²
¹Concept Engineering GmbH, Boetzinger Str. 29, Freiburg, Germany
²Department of Computer Science, Hochschule Furtwangen University, Germany
{pascal, gerhard}@concept.de, {anthony.sulistio, christoph.reich}@hs-furtwangen.de

Abstract

For finding the critical path in electrical circuit designs, a shortest-path search must be carried out. This paper introduces a new two-level shortest-path search algorithm specially adapted for parallelization. The proposed algorithm is based on a module-based partitioning algorithm and a shortest-path search parallelized for use on multi-core systems. Experimental results show the impact of this approach.

1. Introduction

One of the most fundamental problems in numerous applications across various IT fields is the shortest-path search (SPS) problem, i.e. finding a path between two nodes in a weighted directed graph such that the sum of the weights of the edges is minimized. In chip design, SPS is used to find the logic on the shortest path from one clocked element to another in the electrical circuit. The vertices represent logic gates of the circuit and the edges represent the electrical connections by nets or net buses. The cost can be calculated by counting the logic levels, or the nets can be weighted by the specific timing values needed to reach the next gate. Such timing values can come, e.g., from static timing analysis tools.

Edsger Dijkstra presented the well-known Dijkstra algorithm [5] to solve this problem. Dijkstra's graph search algorithm solves the single-source SPS problem for a graph with non-negative edge weights. The algorithm visits the vertices of the graph starting at the starting point, repeatedly examining the closest not yet examined vertex, and expands from the starting point towards the target until it reaches the goal. Since then, many other algorithms have been developed to improve on the original algorithm, e.g. the A* algorithm [4]. The Dijkstra algorithm, as well as most variations of the original algorithm, also exists in a parallelized version [17]. The parallel algorithm can be used to perform the shortest-path calculation in a distributed environment, e.g. a multi-core system or a cluster.

Unfortunately, a parallel algorithm often does not scale well with an increasing number of processors or available calculation nodes, due to long memory latencies and high synchronization costs [1]. To address this problem, this paper introduces a combined technique of partitioning the original graph into smaller ones, and uses an adapted Arc-Flag approach [16] for speeding up the SPS problem on multi-core systems.

The rest of this paper is organized as follows. Section 2 gives a brief overview of the shortest path problem and the bidirectional Dijkstra search algorithm. Section 3 discusses several graph partitioning models, whereas Section 4 explains the Arc-Flag approach. Section 5 presents an experiment, whereas Section 6 discusses related work. Finally, Section 7 concludes the paper and gives future work.

2. Shortest Path Problem

In graph theory, the problem of finding a path between two vertices such that the sum of the edge weights is minimized is called the shortest path problem [3]. A weighted graph is formally given as a set V of vertices, a set E of edges, and a weight function f : E → R. To find a shortest path, start from one element v of V and find a path P from v to a vertex v′ of V such that

    Σ_{p∈P} f(p)

is minimal among all paths connecting v to v′. This is also called the single-pair shortest path problem.

Two generalizations are:

• The single-source shortest path problem is more general than the single-pair one. The goal is to find shortest paths from a source vertex v to all other vertices in the graph.

• The all-pairs shortest path problem is the most general problem of all. All shortest paths between every pair of vertices v, v′ in the graph need to be found.


In practice, for both of the aforementioned generalizations, there exist algorithms that are faster than running a single-pair shortest path algorithm on all relevant pairs of vertices.
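For illustration, a single-pair search on a small weighted digraph can be sketched as follows (a hypothetical Python sketch with a toy graph; the paper's actual implementation is part of RTLvision PRO, not shown here):

```python
import heapq

def dijkstra(graph, start, target):
    """Single-pair shortest path on a weighted digraph.

    graph: dict mapping vertex -> list of (neighbor, weight) pairs.
    Returns (cost, path) or (float('inf'), []) if target is unreachable.
    """
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    visited = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        if v == target:
            # Backtrace from target to start to collect the path.
            path = [target]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for w, weight in graph.get(v, []):
            nd = d + weight
            if nd < dist.get(w, float('inf')):
                dist[w] = nd
                prev[w] = v
                heapq.heappush(heap, (nd, w))
    return float('inf'), []
```

The dictionary-of-adjacency-lists representation keeps the sketch short; an industrial netlist graph would use a more compact encoding.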

2.1. Bidirectional Dijkstra Search

If a point-to-point shortest path search is performed, then the start point as well as the target point are known. The original unidirectional shortest path search algorithm introduced by Dijkstra [5] starts from the start point and performs a best-first shortest path search until the target point is reached. The runtime for this search is O(n log n).

Luby and Ragde [15] presented a bidirectional version of the Dijkstra algorithm with an expected run time of O(√n log n). Such a bidirectional search consists of two phases. In the first phase, two unidirectional Dijkstra runs start from the start point and from the target, respectively. Both runs span a tree by alternating between start and target and expanding the next level of reachable nodes. From these trees, the minimum distance to the start and to the target, respectively, is known. As long as the two runs share no visited edges, nodes can be expanded alternately and added to the trees. Thus, the shortest path between the start and target nodes lies within the two expanded search trees. In the second phase, the shortest path must be collected out of the two trees.

According to Fu et al. [9], the time complexity of the bidirectional Dijkstra algorithm is (1/8)·O(n²) on a single-core or uniprocessor system. It is obvious that this approach can optimally be distributed over two processes running on a dual-core or multi-processor machine.
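The two alternating runs described above can be sketched roughly as follows (a hypothetical Python sketch, not the paper's implementation; `graph` holds the forward edges, `rgraph` the same edges reversed for the backward run, and the search stops once the two frontier radii together exceed the best meeting cost found so far):

```python
import heapq

def bidirectional_dijkstra(graph, rgraph, start, target):
    """Bidirectional Dijkstra: two searches expand alternately from the
    start and from the target until their frontiers meet.

    graph:  dict vertex -> list of (neighbor, weight), forward edges.
    rgraph: the same edges reversed, for the backward search.
    Returns the shortest-path cost, or float('inf') if unreachable.
    """
    INF = float('inf')
    dist = [{start: 0}, {target: 0}]        # forward / backward distances
    heaps = [[(0, start)], [(0, target)]]
    done = [set(), set()]
    adj = [graph, rgraph]
    best = 0 if start == target else INF
    side = 0
    while heaps[0] and heaps[1]:
        d, v = heapq.heappop(heaps[side])
        if v not in done[side]:
            done[side].add(v)
            for w, weight in adj[side].get(v, []):
                nd = d + weight
                if nd < dist[side].get(w, INF):
                    dist[side][w] = nd
                    heapq.heappush(heaps[side], (nd, w))
                if w in dist[1 - side]:
                    # Meeting edge: candidate path through v and w.
                    best = min(best, nd + dist[1 - side][w])
        # Stop once the two radii together exceed the best meeting cost.
        if heaps[0] and heaps[1] and heaps[0][0][0] + heaps[1][0][0] >= best:
            break
        side = 1 - side                     # alternate between the two runs
    return best
```

The two runs here alternate in one thread for clarity; on a dual-core machine each side would run in its own process, with the meeting test as the only synchronization point.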

3. General Approach to the Shortest Path Problem

A more general approach to speed up the point-to-point shortest path problem is to divide the graph into separate sub-graphs. Each of these sub-graphs can be processed on its own in a separate process.

The main challenge is to divide the circuit represented by the graph into meaningful units. Figure 1(a) shows a graph with 16 nodes. The only meaningful criteria by which the graph could be divided are partition size and the number of interlinking nodes. Usually, the created sub-graphs are balanced, i.e. each partition should have the same number of nodes, as shown in Figure 1(b).

The graph partitioning of our approach uses arithmetic and logic module information, as depicted in Figure 2. As a result, this heuristic can be used to speed up the partitioning process. In general, there are two methods of graph partitioning: static and dynamic. These methods are explained next.

Figure 1. Graph divided by minimizing the number of interlink nodes: (a) a graph with 16 nodes; (b) a graph with three balanced partitions.

Figure 2. Two different ways to partition a circuit depending on the electrical modules.


3.1. Static Partitioning of Graphs

The partitioning is done only once, before the actual search. Thus, a static partitioning cannot reflect the current situation in a multi-core system or cluster. The following are various static partitioning methods; detailed explanations of them can be found in [16].

Rectangular Partitioning. The easiest way to partition a graph with a 2D representation is to divide the graph into rectangular regions, using an n × m grid of rectangles. A rectangular region is defined by its bounding box. This method respects only the geography of the graph, but not the structure (geometry), node density or any other attributes of the underlying graph.

Quad Trees [8] represent a two-dimensional space which is recursively divided into four quadrants or regions until the desired resolution is achieved and the recursion ends in a leaf of the tree. Quad Trees are not only a graph partitioning method, but also an effective data structure for storing points, lines or curves in a plane. They are typically used for geometric algorithms and image processing, such as spatial indexing, image representation or efficient collision detection in two dimensions.

k-Dimensional (kd) Trees generalize the Quad Tree partitioning in a so-called kd-tree [2]. A kd-tree is a partitioning data structure which also recursively divides a plane into rectangles. Moreover, this data structure can deal with a k-dimensional Euclidean space with exactly k orthogonal axes.

Kernighan-Lin (KL) Heuristics [12] is a 2-way local refinement algorithm used for bisecting graphs. It is also known as a min-cut or group migration procedure. The objective of the KL heuristic is to partition a graph or a circuit in such a way that the number of connections between the subgraphs is minimized. In addition, it is able to reduce the edge-cut of an existing bisection. However, the disadvantage of this heuristic is that it can only be used on graphs with an even number of nodes. As a consequence, the two bipartitions are equally sized, and the complexity of the Kernighan-Lin heuristic is O(n³), which makes it unusable for large graphs. Furthermore, it cannot handle multi-terminal nets, which are common in the field of electrical circuits.

Fiduccia-Mattheyses (FM) Heuristics [7] for partitioning hypergraphs is an iterative algorithm which improves the result with every iteration and promises to solve all these problems. It is an improvement of the KL heuristic and can operate on both even and odd numbers of nodes. The bi-partitions can be unequally sized. The complexity is O(n).
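As an illustration of the median-split idea behind kd-trees (a minimal sketch, assuming the graph nodes carry 2D coordinates; this is not the module-based partitioner used in the paper):

```python
def kd_partition(points, depth=0, leaf_size=4):
    """Recursively split 2D points into balanced partitions, kd-tree style.

    Alternates the split axis (x, then y) and cuts at the median, so the
    two halves differ in size by at most one point.
    Returns a list of leaf partitions (lists of points).
    """
    if len(points) <= leaf_size:
        return [points]
    axis = depth % 2                        # 0: split on x, 1: split on y
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (kd_partition(pts[:mid], depth + 1, leaf_size)
            + kd_partition(pts[mid:], depth + 1, leaf_size))
```

The median cut guarantees balanced partition sizes, which is exactly the balancing property discussed at the start of this section, but like rectangular partitioning it ignores edge structure.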

3.2. Dynamic Partitioning of Graphs

For dynamic partitioning, one very important aspect is that the workload is balanced and the interprocess communication overhead is minimized. This is an NP-complete problem [22]; therefore, heuristics have been developed to solve it [21].

A dynamic partitioning algorithm needs to respect the cost of re-balancing. If frequent load balancing is required, the re-balancing costs need to be low in proportion to the solution algorithm. If a node is migrated to a new processor for better load balancing, this could also involve heavy data migration. Reusing already migrated data should be considered while calculating the new balance.

Looking at the complete graph as a whole, a graph reduction can help to reduce the aforementioned problems. The idea of graph reduction, or coarsening, is to form clusters by grouping vertices together. These clusters are used to form a new graph. This procedure is then repeated recursively until the desired coarsening is reached.
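One coarsening step can be sketched as follows (a hypothetical sketch; the cluster assignment is assumed to come from some grouping heuristic, which is not specified here):

```python
def coarsen(edges, clusters):
    """Collapse vertex clusters into super-nodes, forming a coarser graph.

    edges:    iterable of (u, v, weight) tuples of the fine graph.
    clusters: dict mapping each vertex to its cluster id.
    Edges inside a cluster disappear; parallel edges between two
    clusters are merged and their weights summed.
    """
    coarse = {}
    for u, v, w in edges:
        cu, cv = clusters[u], clusters[v]
        if cu != cv:                        # drop intra-cluster edges
            coarse[(cu, cv)] = coarse.get((cu, cv), 0) + w
    return coarse
```

Applying this function repeatedly, with a fresh cluster assignment on each resulting graph, yields the recursive coarsening described above.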

Reachable Function. In order to decide whether dynamic partitioning and distribution over multiple nodes makes sense, a very fast reachable function could be used. This function should be able to answer the question whether a target is reachable in a fraction of the time the real query would need. Based on the result of the reachable function, dynamic partitioning and distribution can be started. The function can also return a negative result, which means that the desired target is not reachable. In this case, the shortest path search is finished and no further effort needs to be put into partitioning and distribution.

Ideally, the runtime of the reachable function should be measured. Depending on the run time of the reachable function, the real search time can be estimated by interpolation of the measured time. Moreover, considerations about the run time of dynamic partitioning and the overhead for distribution should be combined with the estimated search time to decide how much effort should be put into partitioning and distribution.

Our implementation of the reachable function performs the same operations as a standard Dijkstra algorithm, and is based on [13]. Once the target is found, the recursion terminates and a backtrace is started. During the backtrace process, no result list is created and allocated; thus, no nodes are stored to remember the way from the start to the target. Waiving the creation of the result list promises a faster execution time, because no memory allocation needs to be done and no insertion of new elements into a result list takes place.
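A sketch of such a reachable function, in the recursive style the paper describes (a hypothetical Python sketch; it returns only a boolean, builds no result list, and ignores the edge weights since only reachability is asked):

```python
def reachable(graph, start, target, _seen=None):
    """Recursive reachability test without result collection.

    Returns True as soon as the target is found; no path list is built,
    so no per-node memory is allocated for a backtrace.
    """
    if _seen is None:
        _seen = set()
    if start == target:
        return True
    _seen.add(start)
    for neighbor, _weight in graph.get(start, []):
        if neighbor not in _seen and reachable(graph, neighbor, target, _seen):
            return True
    return False
```

As in the paper's implementation, the recursion depth grows with the path length, so a deep netlist graph would need a large stack limit (cf. the ulimit adjustment in Section 5.1).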

4. The Arc-Flag Approach

The Arc-Flag approach [16] presumes that a graph has already been divided into partitions. It is irrelevant which of the partitioning methods has been used to perform the partitioning. The Arc-Flag approach calculates all shortest paths from each possible entry point into a region to all possible exit points. The path to the exit and the name of the exit point are stored at each entry point. This annotation is called the arc to the exit.

The arcs can be created dynamically. Each time a shortest path calculation is performed and a path through a region from a new entry point is requested, this shortest path is calculated and stored at the entry point. If a shortest path calculation enters a region again at a point where the shortest path through the region has been calculated before, the stored path is used. This saves a lot of calculation time, because the shortest path search through this partition is not performed again. Over time, more and more paths are cached; the more shortest path calculations are performed, the more speed-up can be achieved.

The calculation of the arcs can also be performed as a preprocessing step before the actual shortest path calculation. As a result, all shortest path pairs through a partition are already known when a shortest path search on the graph is performed.
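The dynamic arc creation amounts to memoizing region traversals by entry point, roughly as follows (a hypothetical sketch; `search_fn` stands in for whatever intra-region shortest-path search is in use):

```python
class RegionPathCache:
    """Dynamic arc creation: cache shortest paths through a region.

    The first query from a given entry point computes the path with the
    supplied search function and stores it as an arc; later queries
    entering the region at the same point reuse the stored arc.
    """
    def __init__(self, search_fn):
        self.search_fn = search_fn   # (region, entry) -> (exit, path)
        self.arcs = {}               # entry point -> (exit, path)
        self.hits = 0

    def path_through(self, region, entry):
        if entry in self.arcs:
            self.hits += 1           # cached arc: skip the region search
        else:
            self.arcs[entry] = self.search_fn(region, entry)
        return self.arcs[entry]
```

The hit counter makes the claimed effect observable: the more queries enter a region at already-seen points, the larger the fraction of region searches that are skipped.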

4.1. Preprocessing the Graph

Preprocessing a graph is needed to calculate and store the Arc-Flag entries for each region of a statically partitioned graph. Therefore, a one-to-all shortest path computation using a standard Dijkstra algorithm (all-pairs shortest path) is performed. This Dijkstra run can be interrupted once all nodes in a region are marked as visited.

In the worst case scenario (n nodes and m pairs), the complexity is O(m(m + n + n log n)); with m = O(n), this results in O(n² log n). For large n, it is obvious that this preprocessing takes far too long.

There are two possible solutions suggested by Moehring et al. [16]. First, they showed that it is possible to preprocess the graph without calculating all pairs of shortest paths. Second, the storage of pruned shortest path trees can help to avoid this complexity problem.

4.2. Two-Level Partitioning Arc Flag Approach

Using one of the partitioning methods mentioned earlier, in combination with the Arc-Flag approach and preprocessing the graph, the region containing the target node can be reached very fast. However, inside the target region, the path from the entrance point to the target node needs to be found by a separate search. Depending on the granularity of the partition and the size of each region, such a search may need to visit many nodes in the region the target belongs to. To avoid this bad behavior, two partition levels can be used. The first partition level is a coarse level, while the second one is a detailed level.

Figure 3. Coarsening.

As an optimization, the detailed level could be stored only for heavily loaded regions. In Figure 3, a 5 × 5 coarse partition and a 3 × 3 fine partition for each coarse partition are used. Coarsening and storing detailed levels is more memory efficient than using a 15 × 15 grid in this case.
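For a geometric two-level grid like the one in Figure 3, locating a point's coarse and fine cell can be sketched as follows (a hypothetical sketch assuming coordinates in the unit square; the paper's actual partitions follow module boundaries rather than a plain grid):

```python
def two_level_cell(x, y, size=1.0, coarse=5, fine=3):
    """Map a point to its (coarse, fine) grid cell in a two-level partition.

    The square is cut into a coarse x coarse grid; each coarse cell is cut
    again into a fine x fine grid. The equivalent flat resolution is
    coarse*fine cells per axis, but flags can be stored per level.
    """
    cx = min(int(x / size * coarse), coarse - 1)
    cy = min(int(y / size * coarse), coarse - 1)
    # Position within the coarse cell, rescaled to [0, 1).
    rx = x / size * coarse - cx
    ry = y / size * coarse - cy
    fx = min(int(rx * fine), fine - 1)
    fy = min(int(ry * fine), fine - 1)
    return (cx, cy), (fx, fy)
```

With coarse = 5 and fine = 3 this addresses the same cells as a flat 15 × 15 grid, but the fine level only needs to be materialized for the heavily loaded coarse regions.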

5. Experiment

In this section, we evaluate three methods: the bidirectional search, the reachable function and our adapted Arc-Flag approach. We compare them against a base algorithm, i.e. the standard Dijkstra algorithm. The bidirectional search is chosen because it is a very simple method, whereas the reachable function is compared due to its fast runtime. In addition, we compare the Intel C++ compiler [10] (Intel 10.0.023), which supports auto-parallelization, with gcc version 3.4.6.

For the test environment, we use an Opteron server with two Dual Core AMD Opteron processors (275 HE) at 2.2 GHz. Thus, four CPU cores are available for calculations. This machine is equipped with 16 GB of physical main memory, all of which can be accessed from each core. The available hard disk space adds up to 1 TB, mirrored using RAID level 1. As for the operating system, a 64-bit Red Hat Enterprise Linux Workstation version 4 (update 4) is installed.

For the test data, we use a freely available chip design, namely the Verilog register transfer level (RTL) description of Sun's OpenSPARC T1 (Niagara) processor [18]. The data represent a large graph of an electrical circuit.

5.1. Test Method

The RTLvision PRO program [19] is used to implement the algorithms and the shortest path extraction. Listing 1 shows the script used to start and measure the shortest path extraction. It is written in a Tcl/Tk-like language which is used to customize and program RTLvision PRO with user-defined functions called Userware.

After reading the OpenSPARC RTL test data with the RTLvision PRO software, a synthesized netlist with a gate equivalent of over 30 million gates is produced. In particular, this results in a graph with 31,593,068 vertices and 160,617,454 edges.

    set db [zdb open -readonly /tmp/T1.zdb]
    set inPort {}
    set outPort {}
    set top {}
    $db foreach top top break
    $db foreach port $top port {
        switch [$db directionOf $port] {
            "input"  { lappend inPort $port }
            "inout"  -
            "output" { lappend outPort $port }
        }
    }

    foreach startPort $inPort {
        foreach endPort $outPort {
            set t [time {
                set res [$db cone -out -targetObj $endPort $startPort]
            }]
            set t [lindex $t 0]
            set t [expr {round($t / 1000.0)}]  ;# ms
            set t [expr {$t / 1000.0}]         ;# seconds
            puts "$startPort to I/O\t$t\t[llength $res]"
        }
    }

    $db close

Listing 1. A script running the all-pairs path extraction.

At the start point, each top-level input port is selected. The end point of the path extraction on the netlist database is one of the output ports. Thus, from each top-level input port a shortest path extraction to all output ports is performed. The time needed to extract all paths between all input and output port pairs is measured using the Tcl time command, which returns the CPU time of a command in microseconds.

Figure 4. Runtime of the Intel icc compiler compared to the gcc compiler.

This kind of test is, compared to real usage, somewhat artificial, but it is the worst-case scenario: no longer-running path extraction can be performed on the circuit, because the extraction starts at a top-level input port and ends at a top-level output port. In addition, the implementation of the Dijkstra algorithm is highly recursive; due to this, the program exhaustively uses stack memory during the runtime. In order to run this test script without an out-of-stack-memory error, the operating system limits need to be adjusted with the command

    ulimit -s unlimited

During the tests, a stack memory consumption of about 2 GB was measured.

5.2. Experiment Results

5.2.1 Auto-Parallelizing

Intel offers a commercial compiler with an auto-parallelization option [10]. Based on a static source code analysis, the Intel compiler promises to create code that runs as parallel threads and takes advantage of multi-core CPUs. Depending on the complexity of the source code and on the number of independent loops, a significant speed-up can be expected. Therefore, we first look into how the Intel icc compiler can offer such a benefit with zero effort.

For the comparison of Intel's icc compiler against GNU's gcc compiler, the existing path extraction code without any optimization for parallel execution is used. The existing code uses a standard Dijkstra algorithm. We compile the code with the same optimization level.

Figure 4 compares the runtimes of the path search described in Listing 1. From this figure, the Intel compiler performs better in the first half of the experiment, being up to 28 seconds faster. However, as more samples are taken, the gain is greatly reduced, with the two compilers within 5 seconds of each other. Overall, with 40 measurements, the Intel compiler offers a better performance of approximately 4.5 seconds on average compared to gcc.

Due to the highly recursive implementation, we did not expect too much speed-up from the Intel compiler. The result, shown in Figure 4, confirms our assumption. To achieve better performance, manual changes to the code using pragma directives could assist the compiler in parallelizing the code. However, this would need an in-depth investigation of the source code and is a very time-consuming task. Therefore, this experiment emphasizes the need to adapt shortest path algorithms for running on multi-core systems.

5.2.2 Bi-Directional Search Experiment

In Figure 5, the runtimes of bidirectional search implementations on one and two CPU cores are compared with the runtime of a standard Dijkstra run. From this figure, the average runtime of the bidirectional search on two cores is about 22 seconds faster than the runtime of the standard Dijkstra run. If only one processor is used, the runtime is about 7 seconds below the standard Dijkstra runtime on average.

Although the bidirectional search algorithm performs better than the standard Dijkstra algorithm, in theory it is not suitable for massively parallel execution. The best environment is one with two processor cores, as mentioned in Section 2.1.

5.2.3 Reachable Experiment

To optimize the execution time of the reachable function, our implementation skips collecting results during backtracking, as mentioned in Section 3.2. Figure 6 shows the advantage of this approach compared to the standard Dijkstra algorithm. From this figure, the runtime of the reachable function is approximately a straight line, whereas the curve which represents the runtime of the standard Dijkstra algorithm varies depending on the source and target. This can be explained as follows: the standard Dijkstra algorithm collects the result during the backtrace, and this collection can take a significant amount of time, mainly due to memory allocation.

Figure 5. Runtime of the bidirectional search function compared to the standard Dijkstra algorithm.

However, comparing the runtime of the reachable function with the standard Dijkstra algorithm, the speed-up is not as high as expected: on average, about 20 seconds can be achieved. Thus, in order to decide whether dynamic partitioning and distribution over multiple nodes makes sense, the reachable function does not perform well in this test. In this test case, waiting for 170 seconds only answers the reachable question, while waiting for 20 more seconds yields the fully calculated result. There is no doubt which one is better.

Nevertheless, one positive feature of the reachable function is that its memory consumption is very low. In cases where the real path search cannot be performed due to an out-of-memory error, the reachable function is still helpful to answer the question whether there is a path to a specific target or not. Once the partitioning is implemented, the reachable query can also be distributed, and its runtime should then be compared against the standard Dijkstra algorithm.

5.2.4 Arc Flag Approach Experiment

The adapted Arc-Flag approach uses the given partitions of an electrical circuit described in the Verilog RTL code. It is common to divide a design into modules. A module is a functional or logical unit that performs a specific task. In such a module, arithmetic or logical operations on the data are mapped to operators such as adder, multiplier, equal, greater than, less than, NAND and NOR. These operators operate on bus signals, e.g. a 32-bit data bus.

The implementation of the operator is done by s<strong>in</strong>gle bit<br />

gates. However, search<strong>in</strong>g a path through an implementation<br />

of a 32-bit NAND or 32-bit full adder is time consum<strong>in</strong>g.<br />

Thus, the prototypical implementation of this method<br />

stores only Arc Flags for the operators. The runtime measurement<br />

and comparison to the standard Dijkstra algorithm


Figure 6. Runtime of the reachable function<br />

compared to standard Dijkstra algorithm.<br />

that is presented <strong>in</strong> Figure 7, is based on this implementation.<br />

In Figure 7, the ga<strong>in</strong>ed average speed-up <strong>in</strong> the runtime<br />

is 15 seconds on average. Due to the limitation that only operators<br />

are flagged with the arcs, this value oscillates with<br />

the way how the design under test is implemented. For example,<br />

if long cha<strong>in</strong>s of operators are used for the design<br />

than a higher speed-up can be expected.<br />
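The general Arc-Flag pruning that this experiment adapts can be sketched as follows. This is a hypothetical illustration of the technique, not the paper's operator-level prototype: each arc (u, v) carries one flag per target partition, set during preprocessing if some shortest path into that partition uses the arc, and the query-time Dijkstra simply skips unflagged arcs. The dict-based `flags` and `partition` structures are assumptions made for the sketch.

```python
import heapq


def dijkstra_arc_flags(adj, flags, partition, source, target):
    """Dijkstra pruned by arc flags: edge (u, v) is relaxed only if
    its flag for the target's partition is set, i.e. some shortest
    path into that partition runs through this arc."""
    goal_part = partition[target]
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if not flags.get((u, v), {}).get(goal_part, False):
                continue  # no shortest path into goal_part uses this arc
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None  # target unreachable via flagged arcs
```

In the prototype described above, only arcs belonging to operators carry flags, which is why the observed speed-up depends on how operator-heavy the design under test is.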

6. Related Work

Several articles on dynamic partitioning, by Walshaw et al. [21], Diniz et al. [6] and Lohner et al. [14], discuss parallel algorithms that dynamically partition unstructured grids or mesh networks for load balancing, a problem closely related to graph partitioning. All of them aim to improve performance on multi-core systems.

To handle search in large graphs, the memory of the machine has to be taken into account. A special graph partitioning algorithm using hMetis partitioning is proposed in [11]. An approach adapted and optimized for the BlueGene/L system is the scalable parallel breadth-first search algorithm [20]. However, this algorithm is limited to Poisson random graphs.

7. Conclusion and Future Work

Partitioning large graph data is a compute-intensive task. However, once the partitioning is done, subsequent shortest-path queries can be performed reasonably fast. Combined with preprocessing and the Arc-Flag approach, the response time can be further reduced.

Figure 7. Runtime of the Arc-Flag approach compared to the standard Dijkstra algorithm.

To achieve a significant speed-up, a combination of several methods is useful. First, the graph needs to be partitioned using one of the static partitioning algorithms introduced above. Once the partitioning is done, each partition needs to be preprocessed, and at each entry point the “arcs” to all exit points need to be annotated. Calculating all the arcs requires an all-pairs shortest-path search, which can be combined with the bidirectional approach for better runtime performance. Preprocessing the partitioned graph can be parallelized and scales almost linearly. Finally, to be able to run arbitrary parallel algorithms on a graph and gain a speed-up, partitioning is the most promising way.
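The per-partition preprocessing step described above is embarrassingly parallel, since partitions are independent of each other. A minimal sketch under stated assumptions (hypothetical names; Floyd-Warshall standing in as the all-pairs search inside one partition, a thread pool standing in for the multi-core distribution):

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess_partition(nodes, adj):
    """All-pairs shortest paths inside one partition (Floyd-Warshall),
    from which the entry-point-to-exit-point 'arcs' can be annotated.
    Edges leaving the partition are ignored here."""
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for u in nodes:
        for v, w in adj.get(u, []):
            if v in dist[u] and w < dist[u][v]:
                dist[u][v] = w
    for k in nodes:
        for i in nodes:
            dik = dist[i][k]
            for j in nodes:
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist


def preprocess_all(partitions, adj, workers=4):
    """Preprocess every partition concurrently: no shared state is
    written, so the partitions can be handled fully in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: preprocess_partition(p, adj),
                             partitions))
```

In practice a process pool (with a named, picklable worker function) would be used for CPU-bound work to sidestep Python's GIL; the thread pool here only illustrates the structure of the parallelization.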

As for future work, the suggested and implemented reachable function needs further analysis, because the experimental result is much slower than expected. There is a good chance that the reachable function can be further improved for use in an application. In addition, more experimental results, collected from more prototypical implementations, are needed to rate the performance and usability of the various speed-up techniques presented.

Acknowledgment

This work is funded by the Federal Ministry of Education and Research (BMBF) project “Hardware Design Techniques for Zero Defect Designs” (HERKULES), grant number 01M3082.


References

[1] D. Bader, G. Cong, and J. Feo. On the Architectural Requirements for Efficient Execution of Graph Algorithms. In Proceedings of the 33rd International Conference on Parallel Processing (ICPP), Oslo, Norway, June 14–17 2005.
[2] J. L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Communications of the ACM, 18(9):509–517, 1975.
[3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, Cambridge, USA, 2001.
[4] R. Dechter and J. Pearl. Generalized Best-First Search Strategies and the Optimality of A*. Journal of the ACM, 32(3):505–536, 1985.
[5] E. W. Dijkstra. A Note on Two Problems in Connexion with Graphs. Numerische Mathematik, 1:269–271, 1959.
[6] P. Diniz, S. Plimpton, B. Hendrickson, and R. Leland. Parallel Algorithms for Dynamically Partitioning Unstructured Grids. In Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, USA, Feb. 15–17 1995.
[7] C. M. Fiduccia and R. M. Mattheyses. A Linear-Time Heuristic for Improving Network Partitions. In Proceedings of the 19th Conference on Design Automation (DAC), Las Vegas, USA, June 14–16 1982.
[8] R. Finkel and J. L. Bentley. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica, 4:1–9, 1974.
[9] M.-Y. Fu, J. Li, and P.-D. Zhou. Design and Implementation of Bidirectional Dijkstra Algorithm. Journal of Beijing Institute of Technology, 12(4):366–370, 2003.
[10] Intel Compiler Professional Editions. http://software.intel.com/en-us/intel-compilers/, March 2009.
[11] S. Idwan and W. Etaiwi. Computing Breadth First Search in Large Graph Using hMetis Partitioning. European Journal of Scientific Research, 29(2):215–221, 2009.
[12] B. W. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell Systems Technical Journal, 49(2):291–307, 1970.
[13] R. E. Korf, W. Zhang, I. Thayer, and H. Hohwald. Frontier Search. Journal of the ACM, 52(5):715–748, 2005.
[14] R. Lohner, R. Ramamurti, and D. Martin. A Parallelizable Load Balancing Algorithm. In Proceedings of the AIAA 31st Aerospace Sciences Meeting and Exhibit, Reno, Nevada, USA, Jan. 11–14 1993.
[15] M. Luby and P. Ragde. A Bidirectional Shortest-Path Algorithm with Good Average-Case Behavior. Algorithmica, 4(4):551–567, 1989.
[16] R. H. Möhring, H. Schilling, B. Schütz, D. Wagner, and T. Willhalm. Partitioning Graphs to Speedup Dijkstra’s Algorithm. Journal of Experimental Algorithmics (JEA), 11:2.8, 2006.
[17] A. S. Nepomniaschaya and M. A. Dvoskina. A Simple Implementation of Dijkstra’s Shortest Path Algorithm on Associative Parallel Processors. Fundamenta Informaticae, 43(1–4):227–243, 2000.
[18] OpenSPARC. http://www.opensparc.net/, March 2009.
[19] Concept Engineering: RTLvision PRO. http://www.concept.de/rtl_index.html, March 2009.
[20] D. Scarpazza, O. Villa, and F. Petrini. Efficient Breadth-First Search on the Cell/BE Processor. IEEE Transactions on Parallel and Distributed Systems, 10:1381–1395, October 2008.
[21] C. Walshaw, M. Cross, and M. G. Everett. Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm. Technical Report 95/IM/06, University of Greenwich, UK, 1995.
[22] C. Walshaw, M. Cross, and M. G. Everett. Parallel Dynamic Partitioning for Adaptive Unstructured Meshes. Journal of Parallel and Distributed Computing, 47:102–108, 1997.
