
Parallelizing the Construction of Static Single Assignment Form


(e.g. the control flow graph and basic block data structures) could perhaps have been optimized for parallel access. However, since these data structures cross-cut the entire analysis framework, any changes would have far-reaching implications. Instead, we modify the existing codebase as little as possible, only where absolutely necessary to support parallel SSA construction.

This involves replacing some non-thread-safe objects (e.g. HashMap) with thread-safe alternatives (e.g. ConcurrentHashMap). We also had to introduce a limited number of Semaphore objects to guard multi-threaded access to data, e.g. the first and last fields in Soot HashChain objects. The doall loops from the parallelized versions of Algorithms 1 and 3 are implemented as invokeAll() operations on ArrayLists of RecursiveTask objects.
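The doall pattern just described can be sketched as follows. This is not the paper's code: the per-task work here (summing array slices) is a hypothetical stand-in for the real per-method SSA work, but the structure — an ArrayList of RecursiveTask objects executed via ForkJoinTask.invokeAll() from inside a fork-join pool — matches the mechanism the text names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveTask;

public class DoAllSketch {
    // One independent "iteration" of the doall loop; here it just sums a slice.
    static class SliceTask extends RecursiveTask<Long> {
        final int[] data; final int from, to;
        SliceTask(int[] data, int from, int to) {
            this.data = data; this.from = from; this.to = to;
        }
        @Override protected Long compute() {
            long s = 0;
            for (int i = from; i < to; i++) s += data[i];
            return s;
        }
    }

    // Root task: builds the ArrayList of subtasks and runs them all in parallel.
    static class DoAll extends RecursiveTask<Long> {
        final int[] data;
        DoAll(int[] data) { this.data = data; }
        @Override protected Long compute() {
            List<SliceTask> tasks = new ArrayList<>();
            for (int start = 0; start < data.length; start += 256)
                tasks.add(new SliceTask(data, start, Math.min(start + 256, data.length)));
            long total = 0;
            // invokeAll() forks every task and waits for all of them to complete.
            for (SliceTask t : ForkJoinTask.invokeAll(tasks))
                total += t.join();
            return total;
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        Arrays.fill(data, 1);
        long total = new ForkJoinPool().invoke(new DoAll(data));
        System.out.println(total); // 1000
    }
}
```

Because the loop iterations are independent (a doall loop), no ordering constraints exist between the subtasks, so invokeAll() can schedule them freely across worker threads.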

Most of the rewritten code is localized to the soot.shimple.internal package. We intend to contribute a patch back to the Soot maintainers. The whole implementation took around three man-weeks of development effort.

4. EVALUATION

4.1 Benchmarks

We use standard Java benchmark programs from the DaCapo and Java Grande suites to evaluate our parallel SSA construction algorithm.

The DaCapo suite of Java benchmarks [4], version 9.12 (bach), consists of large, widely used, real-world open-source applications: in fact, typical input for a client-side JIT compiler. The Java Grande suite of benchmarks [27] contains five large-scale scientific applications. We use the sequential versions of these programs. We observe that the Java Grande codebase is significantly smaller than DaCapo's; however, it is representative of code from the scientific domain.

We execute each analysed application with a standard workload (default for DaCapo and SizeA for Java Grande) and record the classes that the JVM classloader accesses. We ignore classes that do not belong to the benchmark distribution, e.g. Java standard library classes. Given a list of classes for each application, we then run these classes through the Soot analysis. Table 1 summarizes the applications we select. For each benchmark, we report (i) the number of methods that we will analyse in Soot, (ii) the arithmetic mean of the bytecode instruction lengths of these selected methods, and (iii) the bytecode instruction length of the longest method.

4.2 Platform

Our commodity multi-core evaluation platform is described in Table 2. We use a Java fork-join pool with twice the number of worker threads as hardware threads in the system. This parameter may be tuned; it has some effect on performance, but we cannot explore tuning due to space restrictions.
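The pool configuration described above can be expressed directly with the standard java.util.concurrent API; this is a generic sketch, not the paper's code (Soot and the experiments are not shown):

```java
import java.util.concurrent.ForkJoinPool;

public class PoolConfig {
    public static void main(String[] args) {
        // Number of hardware threads visible to the JVM.
        int hw = Runtime.getRuntime().availableProcessors();
        // Oversubscribe 2x, as in the evaluation setup (16 workers on 8 hardware threads).
        ForkJoinPool pool = new ForkJoinPool(2 * hw);
        System.out.println(pool.getParallelism() == 2 * hw); // prints true
    }
}
```

Oversubscription can help when some workers block (e.g. on the Semaphore guards mentioned in Section 3), since spare workers keep the cores busy.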

4.3 Experiments

We compare two different SSA construction techniques. The sequential version is the default Shimple builder pass supplied in Soot v2.4.0. The parallel version is our custom pass that uses Java fork/join parallelism to implement the parallel SSA construction algorithm outlined in Section 2 earlier in this paper.

Vendor                 Intel
Codename               Nehalem
Architecture           Core i7-920
Cores x SMT contexts   4x2
Per-core L1 i/d        32KB/32KB
Per-core L2            256KB
Shared L3              8MB
Core freq              2.67GHz
RAM size               6GB
OS                     Linux 2.6.31 (64-bit)
JVM (1.6)              14.0-b16
Max JVM heap           4GB
# FJ threads           16

Table 2: Evaluation platform for experiments

For all tests, we measure execution times of compilation phases using the Java library call System.nanoTime(). All times are the arithmetic means of five measurements, which have a low coefficient of variation. We reduce timing variance by using large, fixed-size JVM heaps. We compute speedups as: mean sequential time / mean parallel time.
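The measurement methodology is simple enough to sketch. The harness below is illustrative only — busyWork is a hypothetical stand-in for a compilation phase — but the timing call (System.nanoTime()), the five-run arithmetic mean, and the speedup ratio match the procedure described:

```java
public class SpeedupTiming {
    static volatile long sink; // defeats dead-code elimination of busyWork

    // Hypothetical workload standing in for an SSA construction phase.
    static void busyWork(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += i;
        sink = s;
    }

    // Arithmetic mean of five timed runs, in nanoseconds.
    static double meanNanos(Runnable r, int runs) {
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            r.run();
            total += System.nanoTime() - start;
        }
        return total / (double) runs;
    }

    public static void main(String[] args) {
        double seq = meanNanos(() -> busyWork(4_000_000), 5);
        double par = meanNanos(() -> busyWork(1_000_000), 5);
        // Speedup = mean sequential time / mean parallel time.
        System.out.printf("speedup = %.2f%n", seq / par);
    }
}
```

A real harness would also warm up the JIT before timing; the paper's use of means over five runs and large fixed heaps serves the same goal of stable measurements.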

For the benchmark applications, we report the speedup in overall SSA construction time, i.e. the sum of φ-function insertion and variable renaming. This SSA construction time is summed over all methods analysed, for each individual benchmark.

4.3.1 Comparison on Standard Benchmarks

Figure 1 shows speedups for parallel SSA construction on the selected benchmark applications, on the Intel Core i7 platform. The speedup varies with the method size threshold t. For a particular point on a benchmark's speedup curve, we apply the sequential SSA construction algorithm to all methods with size < t, whereas we apply the parallel SSA construction algorithm to all methods with size ≥ t.
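The threshold policy amounts to a one-line dispatch per method. The sketch below uses hypothetical names (chooseBuilder is not a Soot API) purely to make the t-based rule concrete:

```java
public class ThresholdDispatch {
    // Dispatch rule from the text: methods smaller than the size threshold t
    // use the sequential builder; all others use the parallel builder.
    static String chooseBuilder(int methodSize, int t) {
        return (methodSize < t) ? "sequential" : "parallel";
    }

    public static void main(String[] args) {
        int t = 100; // hypothetical threshold, in bytecode instructions
        System.out.println(chooseBuilder(40, t));  // prints sequential
        System.out.println(chooseBuilder(400, t)); // prints parallel
    }
}
```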

If t is set too small, then almost all methods are analysed indiscriminately using the parallel algorithm. For many small methods, the overhead of parallelism outweighs any overlapped execution gains. In many cases, this causes an overall slowdown (speedup scores below 1 in the graph). On the other hand, if t is set too large, then hardly any methods are analysed in parallel, so the SSA construction is almost always sequential. In all cases, the curves tend to speedup = 1 as t → ∞.

Parallel SSA construction is more beneficial for large methods. The benchmarks with the largest mean method sizes in Table 1, namely fop and batik in DaCapo and euler and moldyn in Java Grande, show the best speedups in Figure 1.

We have not investigated alternative heuristics for selecting the individual methods for which the parallel SSA construction algorithm is to be preferred to the sequential algorithm. Since the parallelism depends on (i) the number of variables, and (ii) the dominance properties of the control flow graph, method size seems like a simple proxy measure for overall complexity. Other heuristics might include software metrics such as cyclomatic complexity [20].

4.3.2 Comparison on Inlined Benchmarks

The reason why many benchmarks do not give significant speedups with parallel SSA construction is that most meth-
