Parallelized Critical Path Search in Electrical Circuit Designs


existing code uses a standard Dijkstra algorithm. We compile the code with the same optimization level. Figure 4 compares the runtime of the path search described in Listing 1. The Intel compiler performs better in the first half of the experiment, by up to 28 seconds. However, as more samples are taken, the gain shrinks and the two runtimes stay within 5 seconds of each other. Overall, across 40 measurements, the code built with the Intel compiler is approximately 4.5 seconds faster on average than the code built with gcc. Due to the highly recursive implementation, we did not expect much speed-up from the Intel compiler. The result, shown in Figure 4, confirms our assumption. To achieve better performance, manual changes to the code using pragma directives can help the compiler parallelize the code. However, this would require an in-depth investigation of the source code and is a very time-consuming task. Therefore, this experiment emphasizes the need to adapt shortest path algorithms to multi-core systems.

5.2.2 Bi-Directional Search Experiment

Figure 5 compares the runtime of the bidirectional search implementation on one and two CPU cores with the runtime of a standard Dijkstra run. On average, the bidirectional search on two cores is about 22 seconds faster than the standard Dijkstra run. If only one core is used, the bidirectional search is still about 7 seconds faster than the standard Dijkstra run on average. Although the bidirectional search performs better than the standard Dijkstra algorithm, it is in theory not suitable for massively parallel execution. The best environment is a system with two processor cores, as mentioned in Section 2.1.

5.2.3 Reachable Experiment

To optimize the execution time of the reachable function, our implementation does not collect results during backtracking, as mentioned in Section 3.2.
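The difference between a full path search and a pure reachability query can be sketched as follows. This is a minimal Python illustration, not the implementation measured in this paper: the adjacency-dictionary graph format and the names `dijkstra_with_path` and `reachable` are assumptions made for the example, and the reachability check uses a plain depth-first traversal as a stand-in.

```python
import heapq


def dijkstra_with_path(graph, source, target):
    """Standard Dijkstra that also collects the path by backtracking.

    graph is assumed to map a node to a list of (successor, weight) pairs.
    Returns (distance, path), or (None, []) if target is unreachable.
    """
    dist = {source: 0}
    pred = {}  # predecessor map, needed only to rebuild the path
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        if node == target:
            break
        for succ, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(succ, float("inf")):
                dist[succ] = nd
                pred[succ] = node
                heapq.heappush(heap, (nd, succ))
    if target not in dist:
        return None, []
    # Backtracking step: this is where the result is collected.
    path = [target]
    while path[-1] != source:
        path.append(pred[path[-1]])
    path.reverse()
    return dist[target], path


def reachable(graph, source, target):
    """Answer only whether target can be reached from source.

    No predecessors are stored and no path is built, so the memory
    footprint is limited to the set of visited nodes.
    """
    seen = {source}
    stack = [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for succ, _weight in graph.get(node, []):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return False


# Hypothetical example graph
graph = {
    "a": [("b", 2), ("c", 5)],
    "b": [("c", 1)],
    "c": [("d", 3)],
    "d": [],
    "e": [("a", 1)],
}
print(dijkstra_with_path(graph, "a", "d"))
print(reachable(graph, "a", "d"), reachable(graph, "a", "e"))
```

The reachability variant returns only a Boolean answer, which is why it can still be useful when the full search runs out of memory.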
Figure 6 shows the advantage of this approach compared to the standard Dijkstra algorithm. The runtime of the reachable function follows an approximately straight line, whereas the runtime of the standard Dijkstra algorithm varies with the source and target. This can be explained as follows. The standard Dijkstra algorithm collects the result during backtracking, and collecting the result can take a significant amount of time, mainly due to memory allocation. However, comparing the runtime of the reachable function with the standard Dijkstra algorithm, the speed-up is not as high as expected. On average, a gain of about 20 seconds can be achieved.

Figure 5. Runtime of the bidirectional search function compared to standard Dijkstra algorithm.

Thus, as a means of deciding whether dynamic partitioning and distribution onto multiple nodes makes sense, the reachable function does not perform well in this test. In this test case, 170 seconds of waiting only answers the reachability question, whereas about 20 seconds more yields the complete path. The full search is clearly preferable here. Nevertheless, one positive feature of the reachable function is its very low memory consumption. In cases where the real path search cannot be performed due to an out-of-memory error, the reachable function can still answer the question of whether there is a path to a specific target. Once the partitioning is implemented, the reachable query can also be distributed, and its runtime should be compared against the standard Dijkstra algorithm.

5.2.4 Arc-Flag Approach Experiment

The adapted Arc-Flag approach uses the partitions that are already given by the Verilog RTL description of an electrical circuit. It is common to divide the design into modules. A module is a functional or logical unit that performs a specific task.
In such a module, arithmetic or logical operations on the data are mapped to operators such as adder, multiplier, equal, greater than, less than, NAND and NOR. These operators operate on bus signals, e.g. a 32-bit data bus. Each operator is implemented with single-bit gates. However, searching a path through an implementation of a 32-bit NAND or a 32-bit full adder is time consuming. Thus, the prototypical implementation of this method stores Arc-Flags only for the operators. The runtime measurement and comparison to the standard Dijkstra algorithm presented in Figure 7 is based on this implementation.

Figure 6. Runtime of the reachable function compared to standard Dijkstra algorithm.

Figure 7 shows an average speed-up of 15 seconds. Because only operators are flagged with arcs, this value varies with how the design under test is implemented. For example, if the design uses long chains of operators, then a higher speed-up can be expected.

6. Related Work

Walshaw et al. [21], Diniz et al. [6] and Lohner et al. [14] discuss parallel algorithms that dynamically partition unstructured grids or mesh networks for load balancing, a problem related to graph partitioning. All of them aim to improve performance on multi-core systems. To handle search in large graphs, the memory of the machine has to be taken into account. A special graph partitioning algorithm using hMetis partitioning is proposed in [11]. An approach adapted and optimized for the BlueGene/L system is the scalable parallel breadth-first search algorithm [20]. However, this algorithm is limited to Poisson random graphs.

7. Conclusion and Future Work

Partitioning of large graph data is a compute-intensive task. However, once the partitioning is done, subsequent shortest path queries can be performed reasonably fast. Combined with preprocessing and the Arc-Flag approach, the response time can be further reduced.

Figure 7. Runtime of the Arc-Flag approach compared to standard Dijkstra algorithm.

To achieve a significant speed-up, a combination of various methods is useful. First, the graph needs to be partitioned using one of the static partitioning algorithms introduced. Once the partitioning is done, each partition needs to be preprocessed, and at every entry point the "arcs" to all exit points need to be annotated. To calculate all the arcs, an all-pairs shortest path search is required.
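The annotation step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the adjacency-dictionary format, the helper names `shortest_distances` and `annotate_partition`, and the choice to store the distance as the annotation are hypothetical and not taken from the prototypical implementation.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra from source; returns {node: distance} for reachable nodes.

    graph is assumed to map a node to a list of (successor, weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for succ, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(succ, float("inf")):
                dist[succ] = nd
                heapq.heappush(heap, (nd, succ))
    return dist


def annotate_partition(graph, entries, exits):
    """Annotate every entry point with an arc to each reachable exit point.

    Returns {(entry, exit): shortest distance}. One search is run per
    entry point, and the searches do not depend on each other.
    """
    arcs = {}
    for entry in entries:
        dist = shortest_distances(graph, entry)
        for exit_node in exits:
            if exit_node in dist:
                arcs[(entry, exit_node)] = dist[exit_node]
    return arcs


# Hypothetical partition with two entry points and two exit points
partition = {
    "in1": [("g1", 1)],
    "in2": [("g1", 2), ("g2", 1)],
    "g1": [("out1", 3)],
    "g2": [("out1", 1), ("out2", 4)],
}
arcs = annotate_partition(partition, ["in1", "in2"], ["out1", "out2"])
print(arcs)
```

A later query that crosses this partition could then use the stored arcs instead of searching through the partition's internal gates again.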
This path search can be combined with the bidirectional approach to achieve better runtime performance. The preprocessing of the partitioned graph can also be parallelized and scales almost linearly. Finally, partitioning is the most promising way to run arbitrary parallel algorithms on a graph and gain a speed-up.

As for future work, the suggested and implemented reachable function needs further analysis because its measured runtime is much slower than expected. There is a good chance that the reachable function can be improved further for use in an application. In addition, more experimental results from more prototypical implementations are needed to rate the performance and usability of the various speed-up techniques presented.

Acknowledgment

This paper is funded by the Federal Ministry of Education and Research (BMBF) project "Hardware Design Techniques for Zero Defect Designs" (HERKULES), grant number 01M3082.
