
Parallelized Critical Path Search in Electrical Circuit Designs

Pascal Bolzhauser¹, Anthony Sulistio², Gerhard Angst¹ and Christoph Reich²
¹Concept Engineering GmbH, Boetzinger Str. 29, Freiburg, Germany
²Department of Computer Science, Hochschule Furtwangen University, Germany
{pascal, gerhard}@concept.de, {anthony.sulistio, christoph.reich}@hs-furtwangen.de

Abstract

For finding the critical path in electrical circuit designs, a shortest-path search must be carried out. This paper introduces a new two-level shortest-path search algorithm specially adapted for parallelization. The proposed algorithm is based on a module-based partitioning algorithm and a shortest-path search parallelized for use on multi-core systems. Experimental results show the impact of this approach.

1. Introduction

One of the most fundamental problems in numerous applications across various IT fields is the shortest-path search (SPS) problem, i.e. finding a path between two nodes in a weighted directed graph such that the sum of the weights of the edges is minimized. In chip design, SPS is used to find the logic on the shortest path from one clocked element to another in the electrical circuit. The vertices represent logic gates of the circuit and the edges represent the electrical connections by nets or net buses. The cost can be calculated by counting the logic levels, or the nets can be weighted by the specific timing values needed to reach the next gate. Such timing values can come, e.g., from static timing analysis tools.

Edsger Dijkstra presented the well-known Dijkstra algorithm [5] to solve this problem. Dijkstra's graph search algorithm solves the single-source SPS problem for a graph with non-negative edge weights. The algorithm visits the vertices of the graph starting at the starting point, repeatedly examining the closest not yet examined vertex, and expands from the starting point towards the target until it reaches the goal. Since then, many other algorithms have been developed to improve on the original algorithm, e.g. the A* algorithm [4]. The Dijkstra algorithm, as well as most variations of the original algorithm, also exists in a parallelized version [17]. The parallel algorithm can be used to perform the shortest-path calculation in a distributed environment, e.g. a multi-core system or a cluster.

Unfortunately, a parallel algorithm often does not scale well with an increasing number of processors or available calculation nodes, due to long memory latencies and high synchronization costs [1]. To address this problem, this paper introduces a combined technique of partitioning the original graph into smaller ones, and uses an adapted Arc-Flag approach [16] for speeding up the SPS problem on multi-core systems.

The rest of this paper is organized as follows. Section 2 gives a brief overview of the shortest path problem and the bidirectional Dijkstra search algorithm. Section 3 discusses several graph partitioning models, whereas Section 4 explains the Arc-Flag approach. Section 5 presents an experiment, whereas Section 6 discusses related work. Finally, Section 7 concludes the paper and gives future work.

2. Shortest Path Problem

In graph theory, the problem of finding a path between two vertices such that the sum of the edge weights is minimized is called the shortest path problem [3]. A weighted graph is formally given as a set V of vertices, a set E of edges, and a weight function f : E → R. To find a shortest path, start from one element v of V and find a path P from v to a vertex v′ of V such that

    Σ_{p∈P} f(p)

is minimal among all paths connecting v to v′. This is also called the single-pair shortest path problem.

Two generalizations are:

• The single-source shortest path problem is more general than the single-pair one. The goal is to find shortest paths from a source vertex v to all other vertices in the graph.

• The all-pairs shortest path problem is the most general problem of all. All shortest paths between every pair of vertices v, v′ in the graph need to be found.


In practice, for both of the aforementioned generalizations, there exist algorithms that are faster than running a single-pair shortest path algorithm on all relevant pairs of vertices.
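For illustration, a single-pair search on a small weighted digraph can be sketched as follows (a hypothetical Python sketch with a toy graph; the paper's actual implementation is part of RTLvision PRO, not shown here):

```python
import heapq

def dijkstra(graph, start, target):
    """Single-pair shortest path on a weighted digraph.

    graph: dict mapping vertex -> list of (neighbor, weight) pairs.
    Returns (cost, path) or (float('inf'), []) if target is unreachable.
    """
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    visited = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        if v == target:
            # Backtrace from target to start to collect the path.
            path = [target]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for w, weight in graph.get(v, []):
            nd = d + weight
            if nd < dist.get(w, float('inf')):
                dist[w] = nd
                prev[w] = v
                heapq.heappush(heap, (nd, w))
    return float('inf'), []
```

The dictionary-of-adjacency-lists representation keeps the sketch short; an industrial netlist graph would use a more compact encoding.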

2.1. Bidirectional Dijkstra Search

If a point-to-point shortest path search is performed, then the start point as well as the target point are known. The original unidirectional shortest path search algorithm introduced by Dijkstra [5] starts from the start point and performs a best-first shortest path search until the target point is reached. The runtime for this search is O(n log n).

Luby and Ragde [15] presented a bidirectional version of the Dijkstra algorithm with an expected run time of O(√n log n). Such a bidirectional search consists of two phases. In the first phase, two unidirectional Dijkstra runs start from the start point and from the target, respectively. Both runs span a tree by alternating between start and target and expanding the next level of reachable nodes. From these trees, the minimum distance to the start and to the target, respectively, is known. As long as the two runs share no visited edges, nodes can be expanded alternately and added to the trees. Thus, the shortest path between the start and target nodes lies within the two expanded search trees. In the second phase, the shortest path must be collected out of the two trees.

According to Fu et al. [9], the time complexity of the bidirectional Dijkstra algorithm is (1/8)·O(n²) on a single-core or uniprocessor system. It is obvious that this approach can optimally be distributed over two processes running on a dual-core or multi-processor machine.
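The two alternating runs described above can be sketched roughly as follows (a hypothetical Python sketch, not the paper's implementation; `graph` holds the forward edges, `rgraph` the same edges reversed for the backward run, and the search stops once the two frontier radii together exceed the best meeting cost found so far):

```python
import heapq

def bidirectional_dijkstra(graph, rgraph, start, target):
    """Bidirectional Dijkstra: two searches expand alternately from the
    start and from the target until their frontiers meet.

    graph:  dict vertex -> list of (neighbor, weight), forward edges.
    rgraph: the same edges reversed, for the backward search.
    Returns the shortest-path cost, or float('inf') if unreachable.
    """
    INF = float('inf')
    dist = [{start: 0}, {target: 0}]        # forward / backward distances
    heaps = [[(0, start)], [(0, target)]]
    done = [set(), set()]
    adj = [graph, rgraph]
    best = 0 if start == target else INF
    side = 0
    while heaps[0] and heaps[1]:
        d, v = heapq.heappop(heaps[side])
        if v not in done[side]:
            done[side].add(v)
            for w, weight in adj[side].get(v, []):
                nd = d + weight
                if nd < dist[side].get(w, INF):
                    dist[side][w] = nd
                    heapq.heappush(heaps[side], (nd, w))
                if w in dist[1 - side]:
                    # Meeting edge: candidate path through v and w.
                    best = min(best, nd + dist[1 - side][w])
        # Stop once the two radii together exceed the best meeting cost.
        if heaps[0] and heaps[1] and heaps[0][0][0] + heaps[1][0][0] >= best:
            break
        side = 1 - side                     # alternate between the two runs
    return best
```

The two runs here alternate in one thread for clarity; on a dual-core machine each side would run in its own process, with the meeting test as the only synchronization point.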

3. General Approach to the Shortest Path Problem

A more general approach to speed up the point-to-point shortest path problem is to divide the graph into separate sub-graphs. Each of these sub-graphs can be processed on its own in a separate process.

The main challenge is to divide the circuit represented by the graph into meaningful units. Figure 1(a) shows a graph with 16 nodes. The only meaningful criteria by which the graph could be divided are partition size and the number of interlinking nodes. Usually, the created sub-graphs are balanced, i.e. each partition should have the same number of nodes, as shown in Figure 1(b).

The graph partitioning of our approach uses arithmetic and logic module information, as depicted in Figure 2. As a result, this heuristic can be used to speed up the partitioning process. In general, there are two methods of graph partitioning: static and dynamic. These methods are explained next.

Figure 1. Graph divided by minimizing the number of interlink nodes: (a) a graph with 16 nodes; (b) a graph with three balanced partitions.

Figure 2. Two different ways to partition a circuit depending on the electrical modules.


3.1. Static Partitioning of Graphs

The partitioning is done only once, before the actual search. Thus, a static partitioning cannot reflect the current situation in a multi-core system or cluster. The following are various static partitioning methods; detailed explanations of them can be found in [16].

Rectangular Partitioning. The easiest way to partition a graph with a 2D representation is to divide the graph into rectangular regions, using an n × m grid of rectangles. A rectangular region is defined by its bounding box. This method respects only the geography of the graph, but not the structure (geometry), node density or any other attributes of the underlying graph.

Quad Trees [8] represent a two-dimensional space which is recursively divided into four quadrants or regions until the desired resolution is achieved and the recursion ends in a leaf of the tree. Quad Trees are not only a graph partitioning method, but also an effective data structure for storing points, lines or curves in a plane. They are typically used for geometric algorithms and image processing, such as spatial indexing, image representation or efficient collision detection in two dimensions.

k-Dimensional (kd) Trees generalize the Quad Tree partitioning in a so-called kd-tree [2]. A kd-tree is a partitioning data structure which also recursively divides a plane into rectangles. Moreover, this data structure can deal with a k-dimensional Euclidean space with exactly k orthogonal axes.

Kernighan-Lin (KL) Heuristics [12] is a 2-way local refinement algorithm used for bisecting graphs. It is also known as a min-cut or group migration procedure. The objective of the KL heuristic is to partition a graph or a circuit in such a way that the number of connections between the subgraphs is minimized. In addition, it is able to reduce the edge-cut of an existing bisection. However, the disadvantage of this heuristic is that it can only be used on graphs with an even number of nodes. As a consequence, the two bipartitions are equally sized, and the complexity of the Kernighan-Lin heuristic is O(n³), which makes it unusable for large graphs. Furthermore, it cannot handle multi-terminal nets, which are common in the field of electrical circuits.

Fiduccia-Mattheyses (FM) Heuristics [7] for partitioning hypergraphs is an iterative algorithm which improves the result with every iteration and promises to solve all these problems. It is an improvement of the KL heuristic and can operate on both even and odd numbers of nodes. The bi-partitions can be unequally sized. The complexity is O(n).
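As an illustration of the median-split idea behind kd-trees (a minimal sketch, assuming the graph nodes carry 2D coordinates; this is not the module-based partitioner used in the paper):

```python
def kd_partition(points, depth=0, leaf_size=4):
    """Recursively split 2D points into balanced partitions, kd-tree style.

    Alternates the split axis (x, then y) and cuts at the median, so the
    two halves differ in size by at most one point.
    Returns a list of leaf partitions (lists of points).
    """
    if len(points) <= leaf_size:
        return [points]
    axis = depth % 2                        # 0: split on x, 1: split on y
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (kd_partition(pts[:mid], depth + 1, leaf_size)
            + kd_partition(pts[mid:], depth + 1, leaf_size))
```

The median cut guarantees balanced partition sizes, which is exactly the balancing property discussed at the start of this section, but like rectangular partitioning it ignores edge structure.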

3.2. Dynamic Partitioning of Graphs

For dynamic partitioning, one very important aspect is that the workload is balanced and the interprocess communication overhead is minimized. This is an NP-complete problem [22]; therefore, heuristics have been developed to solve it [21].

A dynamic partitioning algorithm needs to respect the cost of re-balancing. If frequent load balancing is required, the re-balancing costs need to be low in proportion to the solution algorithm. If a node is migrated to a new processor for better load balancing, this could also involve heavy data migration. Reusing already migrated data should be considered while calculating the new balance.

Looking at the complete graph as a whole, a graph reduction can help to reduce the aforementioned problems. The idea of graph reduction, or coarsening, is to form clusters by grouping vertices together. These clusters are used to form a new graph. This procedure is then repeated recursively until the desired coarsening is reached.
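One coarsening step can be sketched as follows (a hypothetical sketch; the cluster assignment is assumed to come from some grouping heuristic, which is not specified here):

```python
def coarsen(edges, clusters):
    """Collapse vertex clusters into super-nodes, forming a coarser graph.

    edges:    iterable of (u, v, weight) tuples of the fine graph.
    clusters: dict mapping each vertex to its cluster id.
    Edges inside a cluster disappear; parallel edges between two
    clusters are merged and their weights summed.
    """
    coarse = {}
    for u, v, w in edges:
        cu, cv = clusters[u], clusters[v]
        if cu != cv:                        # drop intra-cluster edges
            coarse[(cu, cv)] = coarse.get((cu, cv), 0) + w
    return coarse
```

Applying this function repeatedly, with a fresh cluster assignment on each resulting graph, yields the recursive coarsening described above.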

Reachable Function. In order to decide whether dynamic partitioning and distribution over multiple nodes makes sense, a very fast reachable function could be used. This function should be able to answer the question whether a target is reachable in a fraction of the time the real query would need. Based on the result of the reachable function, dynamic partitioning and distribution can be started. The function can also return a negative result, which means that the desired target is not reachable. In this case, the shortest path search is finished and no further effort needs to be put into partitioning and distribution.

Ideally, the runtime of the reachable function should be measured. Depending on the run time of the reachable function, the real search time can be estimated by interpolation of the measured time. Moreover, considerations about the run time of dynamic partitioning and the overhead for distribution should be combined with the estimated search time to decide how much effort should be put into partitioning and distribution.

Our implementation of the reachable function performs the same operations as a standard Dijkstra algorithm, and is based on [13]. Once the target is found, the recursion terminates and a backtrace is started. During the backtrace process, no result list is created and allocated; thus, no nodes are stored to remember the way from the start to the target. Waiving the creation of the result list promises a faster execution time, because no memory allocation needs to be done and no insertion of new elements into a result list takes place.
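A sketch of such a reachable function, in the recursive style the paper describes (a hypothetical Python sketch; it returns only a boolean, builds no result list, and ignores the edge weights since only reachability is asked):

```python
def reachable(graph, start, target, _seen=None):
    """Recursive reachability test without result collection.

    Returns True as soon as the target is found; no path list is built,
    so no per-node memory is allocated for a backtrace.
    """
    if _seen is None:
        _seen = set()
    if start == target:
        return True
    _seen.add(start)
    for neighbor, _weight in graph.get(start, []):
        if neighbor not in _seen and reachable(graph, neighbor, target, _seen):
            return True
    return False
```

As in the paper's implementation, the recursion depth grows with the path length, so a deep netlist graph would need a large stack limit (cf. the ulimit adjustment in Section 5.1).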

4. The Arc-Flag Approach

The Arc-Flag approach [16] presumes that a graph has already been divided into partitions. It is irrelevant which of the partitioning methods has been used to perform the partitioning. The Arc-Flag approach calculates all shortest paths from each possible entry point into a region to all possible exit points. The path to the exit and the name of the exit point are stored at each entry point. This annotation is called the arc to the exit.

The arcs can be created dynamically. Each time a shortest path calculation is performed and a path through a region from a new entry point is requested, this shortest path is calculated and stored at the entry point. If a shortest path calculation enters a region again at a point where the shortest path through the region has been calculated before, the stored path is used. This saves a lot of calculation time, because the shortest path search through this partition is not performed again. Over time, more and more paths are cached; the more shortest path calculations are performed, the more speed-up can be achieved.

The calculation of the arcs can also be performed as a preprocessing step before the actual shortest path calculation. As a result, all shortest path pairs through a partition are already known when a shortest path search on the graph is performed.
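The dynamic arc creation amounts to memoizing region traversals by entry point, roughly as follows (a hypothetical sketch; `search_fn` stands in for whatever intra-region shortest-path search is in use):

```python
class RegionPathCache:
    """Dynamic arc creation: cache shortest paths through a region.

    The first query from a given entry point computes the path with the
    supplied search function and stores it as an arc; later queries
    entering the region at the same point reuse the stored arc.
    """
    def __init__(self, search_fn):
        self.search_fn = search_fn   # (region, entry) -> (exit, path)
        self.arcs = {}               # entry point -> (exit, path)
        self.hits = 0

    def path_through(self, region, entry):
        if entry in self.arcs:
            self.hits += 1           # cached arc: skip the region search
        else:
            self.arcs[entry] = self.search_fn(region, entry)
        return self.arcs[entry]
```

The hit counter makes the claimed effect observable: the more queries enter a region at already-seen points, the larger the fraction of region searches that are skipped.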

4.1. Preprocessing the Graph

Preprocessing a graph is needed to calculate and store the Arc-Flag entries for each region of a statically partitioned graph. Therefore, a one-to-all shortest path computation using a standard Dijkstra algorithm (all-pairs shortest path) is performed. This Dijkstra run can be interrupted once all nodes in a region are marked as visited.

In the worst case scenario (n nodes and m pairs), the complexity is O(m(m + n + n log n)); with m = O(n), this results in O(n² log n). For large n, it is obvious that this preprocessing takes far too long.

There are two possible solutions suggested by Moehring et al. [16]. First, they showed that it is possible to preprocess the graph without calculating all pairs of shortest paths. Second, the storage of pruned shortest path trees can help to avoid this complexity problem.

4.2. Two-Level Partitioning Arc Flag Approach

Using one of the partitioning methods mentioned earlier, in combination with the Arc-Flag approach and preprocessing the graph, the region containing the target node can be reached very fast. However, inside the target region, the path from the entrance point to the target node needs to be found by a separate search. Depending on the granularity of the partition and the size of each region, such a search may need to visit many nodes in the region the target belongs to. To avoid this bad behavior, two partition levels can be used. The first partition level is a coarse level, while the second one is a detailed level.

Figure 3. Coarsening.

As an optimization, the detailed level could be stored only for heavily loaded regions. In Figure 3, a 5 × 5 coarse partition and a 3 × 3 fine partition for each coarse partition are used. Coarsening and storing detailed levels is more memory efficient than using a 15 × 15 grid in this case.
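For a geometric two-level grid like the one in Figure 3, locating a point's coarse and fine cell can be sketched as follows (a hypothetical sketch assuming coordinates in the unit square; the paper's actual partitions follow module boundaries rather than a plain grid):

```python
def two_level_cell(x, y, size=1.0, coarse=5, fine=3):
    """Map a point to its (coarse, fine) grid cell in a two-level partition.

    The square is cut into a coarse x coarse grid; each coarse cell is cut
    again into a fine x fine grid. The equivalent flat resolution is
    coarse*fine cells per axis, but flags can be stored per level.
    """
    cx = min(int(x / size * coarse), coarse - 1)
    cy = min(int(y / size * coarse), coarse - 1)
    # Position within the coarse cell, rescaled to [0, 1).
    rx = x / size * coarse - cx
    ry = y / size * coarse - cy
    fx = min(int(rx * fine), fine - 1)
    fy = min(int(ry * fine), fine - 1)
    return (cx, cy), (fx, fy)
```

With coarse = 5 and fine = 3 this addresses the same cells as a flat 15 × 15 grid, but the fine level only needs to be materialized for the heavily loaded coarse regions.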

5. Experiment

In this section, we evaluate three methods: the bidirectional search, the reachable function and our adapted Arc-Flag approach. We compare them against a base algorithm, i.e. the standard Dijkstra algorithm. The bidirectional search is chosen because it is a very simple method, whereas the reachable function is compared due to its fast runtime. In addition, we compare the Intel C++ compiler [10] (Intel 10.0.023), which supports auto-parallelization, with gcc version 3.4.6.

For the test environment, we use an Opteron server with two Dual Core AMD Opteron processors (275 HE) at 2.2 GHz. Thus, four CPU cores are available for calculations. This machine is equipped with 16 GB of physical main memory, all of which can be accessed from each core. The available hard disk space adds up to 1 TB, mirrored using RAID level 1. As for the operating system, a 64-bit Red Hat Enterprise Linux Workstation version 4 (update 4) is installed.

For the test data, we use a freely available chip design, namely the Verilog register transfer level (RTL) description of Sun's OpenSPARC T1 (Niagara) processor [18]. The data represent a large graph of an electrical circuit.

5.1. Test Method

The RTLvision PRO program [19] is used to implement the algorithms and the shortest path extraction. Listing 1 shows the script used to start and measure the shortest path extraction. It is written in a Tcl/Tk-like language which is used to customize and program RTLvision PRO with user-defined functions called Userware.

After reading the OpenSPARC RTL test data with the RTLvision PRO software, a synthesized netlist with a gate equivalent of over 30 million gates is produced. In particular, this results in a graph with 31,593,068 vertices and 160,617,454 edges.

    set db [zdb open -readonly /tmp/T1.zdb]
    set inPort {}
    set outPort {}
    set top {}
    $db foreach top top break
    $db foreach port $top port {
        switch [$db directionOf $port] {
            "input"  { lappend inPort $port }
            "inout"  -
            "output" { lappend outPort $port }
        }
    }

    foreach startPort $inPort {
        foreach endPort $outPort {
            set t [time {
                set res [$db cone -out -targetObj $endPort $startPort]
            }]
            set t [lindex $t 0]
            set t [expr {round($t / 1000.0)}]  ;# ms
            set t [expr {$t / 1000.0}]         ;# seconds
            puts "$startPort to I/O\t$t\t[llength $res]"
        }
    }

    $db close

Listing 1. A script running the all-pairs path extraction.

At the start point, each top-level input port is selected. The end point of the path extraction on the netlist database is one of the output ports. Thus, from each top-level input port a shortest path extraction to all output ports is performed. The time needed to extract all paths between all input and output port pairs is measured using the Tcl time command, which returns the CPU time of a command in microseconds.

Figure 4. Runtime of the Intel icc compiler compared to the gcc compiler.

This kind of test is, compared to real usage, somewhat artificial, but it is the worst-case scenario: no longer-running path extraction can be performed on the circuit, because the extraction starts at a top-level input port and ends at a top-level output port. In addition, the implementation of the Dijkstra algorithm is highly recursive; due to this, the program exhaustively uses stack memory during the runtime. In order to run this test script without an out-of-stack-memory error, the operating system limits need to be adjusted with the command

    ulimit -s unlimited

During the tests, a stack memory consumption of about 2 GB was measured.

5.2. Experiment Results

5.2.1 Auto-Parallelizing

Intel offers a commercial compiler with an auto-parallelization option [10]. Based on a static source code analysis, the Intel compiler promises to create code that runs as parallel threads and takes advantage of multi-core CPUs. Depending on the complexity of the source code and on the number of independent loops, a significant speed-up can be expected. Therefore, we first look into how the Intel icc compiler can offer such a benefit with zero effort.

For the comparison of Intel's icc compiler against GNU's gcc compiler, the existing path extraction code without any optimization for parallel execution is used. The existing code uses a standard Dijkstra algorithm. We compile the code with the same optimization level.

Figure 4 compares the runtimes of the path search described in Listing 1. From this figure, the Intel compiler performs better in the first half of the experiment, being up to 28 seconds faster. However, as more samples are taken, the gain is greatly reduced, with the two compilers within 5 seconds of each other. Overall, with 40 measurements, the Intel compiler offers a better performance of approximately 4.5 seconds on average compared to gcc.

Due to the highly recursive implementation, we did not expect too much speed-up from the Intel compiler. The result, shown in Figure 4, confirms our assumption. To achieve better performance, manual changes to the code using pragma directives could assist the compiler in parallelizing the code. However, this would need an in-depth investigation of the source code and is a very time-consuming task. Therefore, this experiment emphasizes the need to adapt shortest path algorithms for running on multi-core systems.

5.2.2 Bi-Directional Search Experiment

In Figure 5, the runtimes of bidirectional search implementations on one and two CPU cores are compared with the runtime of a standard Dijkstra run. From this figure, the average runtime of the bidirectional search on two cores is about 22 seconds faster than the runtime of the standard Dijkstra run. If only one processor is used, the runtime is about 7 seconds below the standard Dijkstra runtime on average.

Although the bidirectional search algorithm performs better than the standard Dijkstra algorithm, in theory it is not suitable for massively parallel execution. The best environment is one with two processor cores, as mentioned in Section 2.1.

5.2.3 Reachable Experiment

To optimize the execution time of the reachable function, our implementation skips collecting results during backtracking, as mentioned in Section 3.2. Figure 6 shows the advantage of this approach compared to the standard Dijkstra algorithm. From this figure, the runtime of the reachable function is approximately a straight line, whereas the curve which represents the runtime of the standard Dijkstra algorithm varies depending on the source and target. This can be explained as follows: the standard Dijkstra algorithm collects the result during the backtrace, and this collection can take a significant amount of time, mainly due to memory allocation.

Figure 5. Runtime of the bidirectional search function compared to the standard Dijkstra algorithm.

However, comparing the runtime of the reachable function with the standard Dijkstra algorithm, the speed-up is not as high as expected: on average, about 20 seconds can be achieved. Thus, in order to decide whether dynamic partitioning and distribution over multiple nodes makes sense, the reachable function does not perform well in this test. In this test case, waiting for 170 seconds only answers the reachable question, while waiting for 20 more seconds yields the fully calculated result. There is no doubt which one is better.

Nevertheless, one positive feature of the reachable function is that its memory consumption is very low. In cases where the real path search cannot be performed due to an out-of-memory error, the reachable function is still helpful to answer the question whether there is a path to a specific target or not. Once the partitioning is implemented, the reachable query can also be distributed, and its runtime should then be compared against the standard Dijkstra algorithm.

5.2.4 Arc Flag Approach Experiment

The adapted Arc-Flag approach uses the given partitions of an electrical circuit described in the Verilog RTL code. It is common to divide a design into modules. A module is a functional or logical unit that performs a specific task. In such a module, arithmetic or logical operations on the data are mapped to operators such as adder, multiplier, equal, greater than, less than, NAND and NOR. These operators operate on bus signals, e.g. a 32-bit data bus.

The implementation of the operator is done by s<strong>in</strong>gle bit<br />

gates. However, search<strong>in</strong>g a path through an implementation<br />

of a 32-bit NAND or 32-bit full adder is time consum<strong>in</strong>g.<br />

Thus, the prototypical implementation of this method<br />

stores only Arc Flags for the operators. The runtime measurement<br />

and comparison to the standard Dijkstra algorithm


Figure 6. Runtime of the reachable function<br />

compared to standard Dijkstra algorithm.<br />

that is presented <strong>in</strong> Figure 7, is based on this implementation.<br />

In Figure 7, the ga<strong>in</strong>ed average speed-up <strong>in</strong> the runtime<br />

is 15 seconds on average. Due to the limitation that only operators<br />

are flagged with the arcs, this value oscillates with<br />

the way how the design under test is implemented. For example,<br />

if long cha<strong>in</strong>s of operators are used for the design<br />

than a higher speed-up can be expected.<br />
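The general Arc-Flag pruning that this experiment adapts can be sketched as follows. This is a hypothetical illustration of the technique, not the paper's operator-level prototype: each arc (u, v) carries one flag per target partition, set during preprocessing if some shortest path into that partition uses the arc, and the query-time Dijkstra simply skips unflagged arcs. The dict-based `flags` and `partition` structures are assumptions made for the sketch.

```python
import heapq


def dijkstra_arc_flags(adj, flags, partition, source, target):
    """Dijkstra pruned by arc flags: edge (u, v) is relaxed only if
    its flag for the target's partition is set, i.e. some shortest
    path into that partition runs through this arc."""
    goal_part = partition[target]
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if not flags.get((u, v), {}).get(goal_part, False):
                continue  # no shortest path into goal_part uses this arc
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None  # target unreachable via flagged arcs
```

In the prototype described above, only arcs belonging to operators carry flags, which is why the observed speed-up depends on how operator-heavy the design under test is.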

6. Related Work

Several articles on dynamic partitioning, by Walshaw et al. [21], Diniz et al. [6] and Lohner et al. [14], discuss parallel algorithms that dynamically partition unstructured grids or mesh networks for load balancing, a problem closely related to graph partitioning. All of them aim to improve performance on multi-core systems.

To handle search in large graphs, the memory of the machine has to be taken into account. A special graph partitioning algorithm using hMetis partitioning is proposed in [11]. An approach adapted and optimized for the BlueGene/L system is the scalable parallel breadth-first search algorithm [20]. However, this algorithm is limited to Poisson random graphs.

7. Conclusion and Future Work

Partitioning large graph data is a compute-intensive task. However, once the partitioning is done, subsequent shortest-path queries can be performed reasonably fast. Combined with preprocessing and the Arc-Flag approach, the response time can be further reduced.

Figure 7. Runtime of the Arc-Flag approach compared to the standard Dijkstra algorithm.

To achieve a significant speed-up, a combination of several methods is useful. First, the graph needs to be partitioned using one of the static partitioning algorithms introduced above. Once the partitioning is done, each partition needs to be preprocessed, and at each entry point the “arcs” to all exit points need to be annotated. Calculating all the arcs requires an all-pairs shortest-path search, which can be combined with the bidirectional approach for better runtime performance. Preprocessing the partitioned graph can be parallelized and scales almost linearly. Finally, to be able to run arbitrary parallel algorithms on a graph and gain a speed-up, partitioning is the most promising way.
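The per-partition preprocessing step described above is embarrassingly parallel, since partitions are independent of each other. A minimal sketch under stated assumptions (hypothetical names; Floyd-Warshall standing in as the all-pairs search inside one partition, a thread pool standing in for the multi-core distribution):

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess_partition(nodes, adj):
    """All-pairs shortest paths inside one partition (Floyd-Warshall),
    from which the entry-point-to-exit-point 'arcs' can be annotated.
    Edges leaving the partition are ignored here."""
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for u in nodes:
        for v, w in adj.get(u, []):
            if v in dist[u] and w < dist[u][v]:
                dist[u][v] = w
    for k in nodes:
        for i in nodes:
            dik = dist[i][k]
            for j in nodes:
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist


def preprocess_all(partitions, adj, workers=4):
    """Preprocess every partition concurrently: no shared state is
    written, so the partitions can be handled fully in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: preprocess_partition(p, adj),
                             partitions))
```

In practice a process pool (with a named, picklable worker function) would be used for CPU-bound work to sidestep Python's GIL; the thread pool here only illustrates the structure of the parallelization.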

As for future work, the suggested and implemented reachable function needs further analysis, because the experimental result is much slower than expected. There is a good chance that the reachable function can be further improved for use in an application. In addition, more experimental results, collected from more prototypical implementations, are needed to rate the performance and usability of the various speed-up techniques presented.

Acknowledgment

This work is funded by the Federal Ministry of Education and Research (BMBF) project “Hardware Design Techniques for Zero Defect Designs” (HERKULES), grant number 01M3082.


References

[1] D. Bader, G. Cong, and J. Feo. On the Architectural Requirements for Efficient Execution of Graph Algorithms. In Proceedings of the 33rd International Conference on Parallel Processing (ICPP), Oslo, Norway, June 14–17 2005.
[2] J. L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Communications of the ACM, 18(9):509–517, 1975.
[3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, Cambridge, USA, 2001.
[4] R. Dechter and J. Pearl. Generalized Best-First Search Strategies and the Optimality of A*. Journal of the ACM, 32(3):505–536, 1985.
[5] E. W. Dijkstra. A Note on Two Problems in Connexion with Graphs. Numerische Mathematik, 1:269–271, 1959.
[6] P. Diniz, S. Plimpton, B. Hendrickson, and R. Leland. Parallel Algorithms for Dynamically Partitioning Unstructured Grids. In Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, USA, Feb. 15–17 1995.
[7] C. M. Fiduccia and R. M. Mattheyses. A Linear-Time Heuristic for Improving Network Partitions. In Proceedings of the 19th Conference on Design Automation (DAC), Las Vegas, USA, June 14–16 1982.
[8] R. Finkel and J. L. Bentley. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica, 4:1–9, 1974.
[9] M.-Y. Fu, J. Li, and P.-D. Zhou. Design and Implementation of Bidirectional Dijkstra Algorithm. Journal of Beijing Institute of Technology, 12(4):366–370, 2003.
[10] Intel Compiler Professional Editions. http://software.intel.com/en-us/intel-compilers/, March 2009.
[11] S. Idwan and W. Etaiwi. Computing Breadth First Search in Large Graph Using hMetis Partitioning. European Journal of Scientific Research, 29(2):215–221, 2009.
[12] B. W. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell Systems Technical Journal, 49(2):291–307, 1970.
[13] R. E. Korf, W. Zhang, I. Thayer, and H. Hohwald. Frontier Search. Journal of the ACM, 52(5):715–748, 2005.
[14] R. Lohner, R. Ramamurti, and D. Martin. A Parallelizable Load Balancing Algorithm. In Proceedings of the AIAA 31st Aerospace Sciences Meeting and Exhibit, Reno, Nevada, USA, Jan. 11–14 1993.
[15] M. Luby and P. Ragde. A Bidirectional Shortest-Path Algorithm with Good Average-Case Behavior. Algorithmica, 4(4):551–567, 1989.
[16] R. H. Möhring, H. Schilling, B. Schütz, D. Wagner, and T. Willhalm. Partitioning Graphs to Speedup Dijkstra’s Algorithm. Journal of Experimental Algorithmics (JEA), 11:2.8, 2006.
[17] A. S. Nepomniaschaya and M. A. Dvoskina. A Simple Implementation of Dijkstra’s Shortest Path Algorithm on Associative Parallel Processors. Fundamenta Informaticae, 43(1–4):227–243, 2000.
[18] OpenSPARC. http://www.opensparc.net/, March 2009.
[19] Concept Engineering: RTLvision PRO. http://www.concept.de/rtl_index.html, March 2009.
[20] D. Scarpazza, O. Villa, and F. Petrini. Efficient Breadth-First Search on the Cell/BE Processor. IEEE Transactions on Parallel and Distributed Systems, 10:1381–1395, October 2008.
[21] C. Walshaw, M. Cross, and M. G. Everett. Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm. Technical Report 95/IM/06, University of Greenwich, UK, 1995.
[22] C. Walshaw, M. Cross, and M. G. Everett. Parallel Dynamic Partitioning for Adaptive Unstructured Meshes. Journal of Parallel and Distributed Computing, 47:102–108, 1997.
