
Compiler Optimizations Using Data-Triggered Threads

Alex Limpaecher
alimpaecher@gmail.com

Deby Katz
dskatz@cs.cmu.edu

May 1, 2013

1 Introduction

Inspired by research conducted by Hung-Wei Tseng and Dean M. Tullsen, we implemented a compiler optimization based on data-triggered threads.

1.1 Motivation

Many programs perform redundant work, and few programs take full advantage of the parallel processing capabilities of modern hardware. As presented in [3] and [4], data-triggered threads are designed to reduce the wastefulness in both scenarios. Tseng and Tullsen's data-triggered threads concept is discussed further in the next section.

1.2 Related Work

In [3] and [4], Tseng and Tullsen presented the concept of data-triggered threads. The basic idea behind data-triggered threads is that a section of code (the "skippable region") is only executed if a certain piece of data (its "trigger") has changed. If the skippable region is executed, it is spun off into a different thread (a "support thread") to enable parallelization. If the trigger's value changes while the support thread is running, so that the support thread is computing with a stale input value, the support thread is canceled. If the support thread has not finished running by the time the main thread reaches a use of a value computed in the support thread, the main thread halts until the support thread completes its work.

The general operation of data-triggered threads is illustrated in Figure 1 and Figure 2.

1.3 Contributions

We implemented some of the concepts presented in Tseng and Tullsen's work, and we attempted to provide a user-friendly interface that allows a user to easily incorporate the mechanisms for data-triggered threads. We also implemented a profiling tool that may help users determine where their code might benefit from the use of data-triggered threads. Finally, we demonstrate that, although this technique can be useful, there can be significant overhead from checking whether a value has changed, spawning threads, canceling threads, waiting for threads to complete, and other actions taken in a data-triggered thread version of a program that would not occur in standard execution. This overhead can outweigh the speedups from executing fewer redundant instructions and taking advantage of parallel processing. In our work, we demonstrate the varying usefulness of the technique in different scenarios.


Figure 1: The execution of a single-threaded program that could take advantage of data-triggered threads. (The figure shows a timeline of the main thread executing, in order: a store to the trigger variable, the skippable region, non-dependent code, and dependent code.)

2 Implementation Details

We used LLVM to implement a simplified version of software data-triggered threads.

2.1 End-User Instructions

Our implementation makes it easy for end users to incorporate data-triggered threads into their C code. We have written a header file, triggerThreading.h, which the user must include to take advantage of our features. We have provided mechanisms for the user to identify the trigger variable, the skippable region, and the dependent variables. Specifically, the global variable t_triggerVariable is declared in triggerThreading.h, and the user should store the data that serves as the trigger variable in this global. When the data contained in this variable changes, it triggers the trigger function. The user should place the trigger function, which contains the calculations that depend on the value of the trigger variable, in a function called t_triggerFunction. This function should contain only code without other complex dependencies, because the code will be executed out of order. Our compiler pass arranges for the code within that function to run, in a new thread, only when the value of t_triggerVariable changes. The user must also mark each variable that depends on data calculated in t_triggerFunction by giving its name the prefix m_.
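To make this interface concrete, the following is a minimal sketch of a conforming program. The header name, t_triggerVariable, t_triggerFunction, and the m_ prefix come from the description above; the integer trigger type, the m_sum variable, and the computation itself are our own illustration.

#include <stdio.h>
#include "triggerThreading.h"  /* declares the t_triggerVariable global */

int m_sum;                     /* m_ prefix: depends on the trigger function */

/* Skippable region: runs in a support thread when t_triggerVariable changes. */
void t_triggerFunction(void) {
    int sum = 0;
    for (int i = 0; i < t_triggerVariable; i++)
        sum += i;              /* stand-in for an expensive computation */
    m_sum = sum;
}

int main(void) {
    t_triggerVariable = 1000;  /* this store triggers the support thread */
    /* ... independent work can proceed here in the main thread ... */
    printf("%d\n", m_sum);     /* first use of an m_ variable: the inserted
                                  barrier waits here for the support thread */
    return 0;
}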



Figure 2: The execution of the same program, running in two threads with data-triggered threads. (The figure shows a timeline in which the main thread stores to the trigger variable and then runs non-dependent code, while a support thread runs the skippable region in parallel; the dependent code in the main thread runs only after the support thread finishes.)

We have provided a script, run.sh, that compiles a .c file conforming to the above properties. The script takes the main part of the filename as an argument (omitting the .c extension) and emits two executables corresponding to the file: one with our optimizations and one without.

2.2 Technical Implementation

Our code adds the following to the user's code:

• When a value is stored into t_triggerVariable, our software inserts code to check at runtime whether the value of t_triggerVariable has changed. If it has changed, the inserted code spins off a new thread that executes the skippable region, designated in the function t_triggerFunction.

• Our code removes the skippable region, contained in t_triggerFunction, from the user's main thread. That code runs only in a support thread, never as part of the main thread.

• We find all the points in the code that depend on the skippable code. These dependence points are the uses of the variables that the user has prefixed with m_. At each such point, our code adds a barrier that is not crossed until the thread executing the skippable code has finished. The overall transformation is sketched below.
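Concretely, the rewritten main thread behaves as if the following had been written. This is a sketch only: the real transformation happens at the LLVM IR level, and transformedMain, newValue, m_result, and consumeResult are placeholder names of ours.

extern int t_triggerVariable;
extern int m_result;
void createThread(void);   // runtime helper, Section 2.3.1
void waitForThread(void);  // runtime helper, Section 2.3.3
void consumeResult(int);

void transformedMain(int newValue) {
    t_triggerVariable = newValue;  // user's original store
    createThread();                // inserted: if the value really changed,
                                   // spawn a support thread running
                                   // t_triggerFunction
    // ... the skippable region no longer appears here; it runs only in
    // the support thread, in parallel with the user's independent code ...
    waitForThread();               // inserted barrier before the first m_ use
    consumeResult(m_result);       // user's original dependent code
}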

2.3 Technical Details

We implemented the functionality described above by writing high-level helper functions in C++. We then designed an LLVM pass to manipulate those functions along with the user code. We understand that this approach is standard practice in implementing compiler optimizations.

2.3.1 Creating a Thread

Designing our threading code was challenging because we needed code that we could easily manipulate in our LLVM pass. We settled on using the pthread library. We created a function called createThread, which checks whether the trigger variable has changed; if it has, createThread spins off a thread that runs threadFunction.
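The report does not reproduce these helpers; the following is a minimal sketch of how they might look, assuming an integer trigger variable and at most one outstanding support thread. The cachedTrigger and threadActive globals and the cancel-then-join policy are our own simplifications.

#include <pthread.h>

extern int t_triggerVariable;         /* global from triggerThreading.h */
extern void t_triggerFunction(void);  /* user's skippable region */

int cachedTrigger;                    /* last value a thread was spawned for */
pthread_t supportThread;
int threadActive = 0;

static void *threadFunction(void *arg) {
    int oldState;
    (void)arg;
    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &oldState); /* Section 2.3.2 */
    t_triggerFunction();  /* call inserted here by the LLVM pass */
    return NULL;
}

void createThread(void) {
    if (t_triggerVariable == cachedTrigger)
        return;                          /* unchanged: skip the region entirely */
    if (threadActive) {                  /* a running support thread is stale */
        pthread_cancel(supportThread);
        pthread_join(supportThread, NULL);
    }
    cachedTrigger = t_triggerVariable;
    threadActive =
        (pthread_create(&supportThread, NULL, threadFunction, NULL) == 0);
}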

Because we wanted the user-designated t_triggerFunction to run in the new thread, our LLVM pass creates a new instruction that calls t_triggerFunction and inserts it at the appropriate location within threadFunction.

Our pass then locates store instructions that store a value into t_triggerVariable. Immediately after each of these store instructions, the pass inserts a new instruction that calls createThread, described above. In this manner, the compiler pass ensures that every time the value of t_triggerVariable may have changed, createThread checks whether it has, in fact, changed and, if so, spawns a new thread to do the necessary computation.
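The report does not include the pass source itself. As an illustration only, the core of this step might look like the following sketch, written against a recent LLVM C++ API rather than the 2013 API the project used; insertTriggerChecks is our own name.

#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Insert a call to createThread() after every store to t_triggerVariable.
static bool insertTriggerChecks(Function &F) {
  Module *M = F.getParent();
  GlobalVariable *Trigger = M->getNamedGlobal("t_triggerVariable");
  if (!Trigger)
    return false;
  FunctionCallee CreateThread = M->getOrInsertFunction(
      "createThread", Type::getVoidTy(M->getContext()));
  bool Changed = false;
  for (BasicBlock &BB : F)
    for (Instruction &I : make_early_inc_range(BB))
      if (auto *SI = dyn_cast<StoreInst>(&I))
        if (SI->getPointerOperand() == Trigger) {
          IRBuilder<> Builder(SI->getNextNode());  // just after the store
          Builder.CreateCall(CreateThread);
          Changed = true;
        }
  return Changed;
}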

2.3.2 Cancelling Threads

If t_triggerVariable changes while a support thread is running, the results of the current thread may no longer be valid, and the thread itself may be out of date and unnecessary. Because we have barriers that wait at the next use of data depending on t_triggerFunction, we know that any given support thread is out of date if t_triggerVariable changes while it is running. Therefore, our code provides a mechanism to cancel an out-of-date thread to reduce the number of unnecessary computations. The pthread library requires that, in order for threads to be cancelled, the thread code must handle cancellation calls.

Code run in support threads is primarily user-written, so our LLVM pass had to modify the code in the support threads by modifying threadFunction. In threadFunction, we mark the thread as cancelable, and when we insert the user's code from t_triggerFunction into threadFunction, we insert it after the call that makes the thread cancelable.

In addition, the support thread needs certain threading handles that allow it to be cancelled; without those handles, a cancelled thread would still run to completion. Because the user defines the code running in the support thread, our LLVM pass inserts these hooks into t_triggerFunction. At the beginning of every BasicBlock of t_triggerFunction, we insert a call to pthread_testcancel, which stops the thread from executing if it has been canceled. The pthread_testcancel call is designed to use very little computation, and we place the call at the beginning of every BasicBlock to trade the additional computation of the calls against the additional computation and waiting caused by a thread running on after it has been cancelled.
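Again as an illustration only (the original pass code is not shown in the report), the insertion of cancellation points might look like this:

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Insert pthread_testcancel() at the top of every basic block of
// t_triggerFunction, so a cancelled support thread exits promptly.
static void insertCancellationPoints(Function &TriggerFn) {
  Module *M = TriggerFn.getParent();
  FunctionCallee TestCancel = M->getOrInsertFunction(
      "pthread_testcancel", Type::getVoidTy(M->getContext()));
  for (BasicBlock &BB : TriggerFn) {
    IRBuilder<> Builder(&*BB.getFirstInsertionPt());  // skip any PHI nodes
    Builder.CreateCall(TestCancel);
  }
}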

2.3.3 Waiting for Threads

When variables' values are calculated in t_triggerFunction, the support thread must complete before those variables can be used. Therefore, our LLVM pass inserts barriers that halt execution before the uses of those variables (identified by the prefix m_) until the thread completes execution. The inserted instruction calls waitForThread, a function that calls pthread_join to wait for any outstanding support thread to finish.
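Completing the runtime sketch from Section 2.3.1 (threadActive and supportThread are our assumed globals), waitForThread might look like:

#include <pthread.h>

extern pthread_t supportThread;  /* from the Section 2.3.1 sketch */
extern int threadActive;

/* Barrier inserted by the pass before the first use of any m_-prefixed
   variable: block until the support thread has finished its work. */
void waitForThread(void) {
    if (threadActive) {
        pthread_join(supportThread, NULL);
        threadActive = 0;
    }
}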

2.4 Profiling

As also mentioned in Section 3, we designed and implemented a profiling tool that may help users identify regions that could benefit from the addition of data-triggered threads. Because our implementation requires users to identify regions that are potentially skippable, this information should help users find such regions. Among other functions, the profiling tool can identify the number of "silent" loads and stores in each module of the program. A silent load, as used in this paper, is a load that loads a value from memory into a register, where the value loaded is identical to the value the register previously contained. Similarly, a silent store is a store that stores a value from a register into memory, where the value stored is identical to the value previously contained at the affected memory addresses. Implemented in the dynamic binary instrumentation tool Pin [1], our profiling tool provides a listing of the numbers of silent loads and stores in each program region. Some usability enhancements restrict the output to routines likely to interest the programmer, such as those the programmer wrote.
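As a concrete illustration of these definitions (our own example, not output from the tool):

int x;
x = 5;      /* ordinary store: the memory holding x changes value  */
x = 2 + 3;  /* silent store: x already contains 5, nothing changes */
int y = x;  /* ordinary load: x's value enters a register          */
y = x;      /* silent load: the register already holds x's value   */

A region dominated by such silent operations is doing redundant work, which makes it a natural candidate for a skippable region.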

3 Experimental Setup

To evaluate the effectiveness and correctness of our optimizations, we designed several toy programs with specific intended behaviors after being processed by our compiler. We evaluated the programs compiled with standard Clang and LLVM compiler settings against the programs compiled with our compiler pass using the -pthread flag. For each toy program, we used our profiling tool to report the total number of instructions executed, the total numbers of silent loads and silent stores, and the numbers of silent loads and stores in each program region, for both the original program and our optimized program. We also ran most programs 1,000 times consecutively and compared the running times of the original version and our optimized version.

4 Experimental Evaluation

We will discuss each example toy program in turn. For the purposes of this discussion, we will call the version compiled with Clang and LLVM's standard options the unoptimized case and the version compiled using our pass the optimized case, even though our pass does not always perform better than the version compiled with the standard options. Results are shown in Figures 3, 4, and 5.

Example 1 Example 1 ran with 152,449 instructions in the unoptimized case and 277,228 instructions in the optimized case: 124,779 more instructions in the optimized case, an increase of about 82%. The unoptimized runtime was 1 second, while the optimized runtime was 2 seconds. We believe that this case produced a longer runtime and more instructions executed because there was not much work taking place inside the spawned thread, so the overhead of spawning a new thread was much greater than any time saved through parallelization or any work saved through skipping.

Example 2 Example 2 ran with 7,183,517 instructions in the unoptimized case and 35,875,135 instructions in the optimized case, meaning there were 28,691,618 more instructions in the optimized case. The unoptimized runtime was about 22 seconds, while the optimized runtime was about 2 minutes and 9 seconds, more than five times longer. This example was designed as a situation where data-triggered threads would not give a good result. The trigger variable changes value on each iteration of a loop, leading the optimized program to spawn a new thread at each iteration. Furthermore, there is no independent work for the main thread to complete before it needs a value calculated in the support thread. In such a scenario, data-triggered threads are not a good approach.



Figure 3: The percentage improvement and decline in performance for toy programs compiled with our compiler pass. Some examples were designed to illustrate pathological cases. (The chart plots time improvement and instruction improvement for each example: Example 1, Example 2, bestCase, cancel, redundantWork, threadCase, threadSometimes, and worstCase, on a scale from a 225% improvement down to a 900% decline.)

Example BestCase Example BestCase ran with 2,147,483,647 instructions in the unoptimized case and 47,348,764 instructions in the optimized case, meaning that there were 2,100,134,883 fewer instructions in the optimized case, a savings of about 97.8%. The unoptimized runtime was about two minutes, 36 seconds, while the optimized runtime was about 13 seconds, a speedup of roughly 12x. Example BestCase was designed as a scenario well adapted to the use of data-triggered threads. The value of the trigger variable changes only once, so the trigger function executes only once, and because the trigger function is so time-consuming, executing it only once saves a great deal of time.

Example Cancel Example Cancel ran with 5,821,578 instructions in the unoptimized case and 2,012,340 instructions in the optimized case, meaning that there were 3,809,238 fewer instructions in the optimized case. However, the runtime increased from 2 seconds to 13 seconds in the optimized case. We designed this example to test and demonstrate the ability of programs compiled with our pass to cancel threads that have become out of date. We believe that the additional overhead of spawning and canceling threads accounts for the discrepancy between executing fewer instructions and taking a longer time.



                       Original       Optimized
Example 1               152,449         277,228
Example 2             7,183,517      35,875,135
bestCase          2,147,483,647      47,348,764
cancel                5,821,578       2,012,340
redundantWork        14,387,393      12,264,835
threadCase          433,235,441     433,024,417
threadSometimes     433,240,969     420,622,307
worstCase             1,682,046      15,370,376

Figure 4: Instruction counts of unoptimized and optimized toy programs.

                  Original   Optimized
Example 1         00:00:01    00:00:02
Example 2         00:00:22    00:02:09
bestCase          02:36:47    00:00:13
cancel            00:00:02    00:00:13
redundantWork     00:03:44    00:03:42
threadCase        00:36:15    00:36:47
threadSometimes   00:36:56    00:37:14
worstCase         00:00:22    00:02:02

Figure 5: Running times (HH:MM:SS) of unoptimized and optimized toy programs.

Example RedundantWork We created an example with a great deal of redundant work. This example ran with 14,387,393 instructions in the unoptimized case and 12,264,835 instructions in the optimized case, a savings of 2,122,558 instructions, or about 14.7%. The unoptimized version ran in three minutes, 44 seconds, while the optimized version ran in three minutes, 42 seconds, a savings of only two seconds. We believe that in this example we were able to skip a great deal of redundant work, bringing the instruction count down. However, because the calculated value is used right after it is calculated, and because there is not much time-consuming work in the function run in the support thread, the time saved by skipping is about equal to the overhead generated.

Example ThreadCase In this example, the number of instructions and the elapsed time both stayed relatively flat. The trigger variable changes on each loop iteration, meaning that we get no benefit from skipping unnecessary work. There is some work in the main thread and in the support thread that can run in parallel, but the amount of work run in parallel is likely not enough to outweigh the overhead of spawning threads.

Example WorstCase This example demonstrates the high cost of threading when there is no benefit to be had from skipping work or from parallelization. The unoptimized version executed 1,682,046 instructions, while the optimized version executed 15,370,376 instructions, an increase of 13,688,330 instructions, making the optimized count more than eight times the original. Similarly, the unoptimized running time was 22 seconds, while the optimized running time was two minutes and two seconds, more than five times as long as the unoptimized running time.

Note that the same program might increase in running time while decreasing in the number of instructions executed in the optimized version versus the unoptimized version (see, for example, the Cancel toy program). We believe that this may occur because of system overhead due to spawning threads or because of synchronization overhead, such as waiting for a lock to be released.

While our examples do not all show much improvement, we believe there is substantial potential in this technique. We designed several examples to show pathological situations, and their time losses swamp the time savings of the good cases. A well-designed implementation of data-triggered threads would include safeguards against using the threads in pathological cases. It is vitally important to weigh the overhead of creating new threads against the potential gains of parallelization and of skipping redundant work. Our experiments can contribute to finding the right level for that trade-off.

5 Surprises and Lessons Learned

We were too ambitious with the scope of the project, which is unfortunate, because the parts of the project we did not get to involved the more interesting questions. We underestimated the work involved in executing our original approach, and by the time we settled on an alternate approach, we did not have the chance to investigate all the questions we wanted to.

Our initial approach involved making our tools as portable and generic as possible. This included modifying the compiler front end (Clang) [2] to accept new pragmas. When we ran into difficulty with pragmas, we attempted to modify Clang to accept new attributes instead. We spent a great deal of time trying to learn the ins and outs of Clang, when those modifications were not central to our core goals. Eventually, we abandoned the approach of modifying Clang in favor of using the pre-identified variable name, function name, and variable prefix described in Section 2.1. With this approach, we could convey the information to the LLVM intermediate representation without modifying Clang. One of our biggest remaining challenges was designing threading code that could be easily manipulated by LLVM, which we addressed with the pthread library and the createThread function described in Section 2.3.1.

6 Conclusions and Future Work

6.1 Future Work

There is an immense amount of work we had hoped to do for this project but did not get to. We had hoped to use our profiler to automatically detect regions that would be amenable to becoming support threads. However, while the profiling tool can provide a starting point for the user to look for potentially redundant work, the tool requires further development to detect code that is safe to move into a new thread. The tool would also benefit from closer coordination with the compiler, to detect patterns in the code that produce situations amenable to data-triggered threads.



We also believe there are interesting areas of investigation in code movement to increase parallelization: if more of the code that is independent of the support thread could be moved to execute while the support thread runs, the gains from parallelization would grow. We had hoped to investigate this further, but we were too ambitious. In addition, there are areas for further investigation in applying these techniques to real-world code.

Finally, because of the overhead of threading, a version that simply skips code when that code is unnecessary, without involving threading, may provide better performance in many situations.

6.2 Conclusions

Data-triggered threads are a promising idea, but they achieve observable benefits only in particular situations. Any production implementation would need to carefully determine the scenarios in which the threads should be used and carefully tune the relevant trade-offs for its use cases.

Division of Credit Credit should be divided equally, 50%-50%.

References

[1] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, and Artur Klauser. Pin: Building customized program analysis tools with dynamic instrumentation. In PLDI, pages 190–200, 2005.

[2] University of Illinois at Urbana-Champaign. The LLVM compiler infrastructure project, May 2013.

[3] Hung-Wei Tseng and Dean M. Tullsen. Data-triggered threads: Eliminating redundant computation. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA), pages 181–192, February 2011.

[4] Hung-Wei Tseng and Dean Michael Tullsen. Software data-triggered threads. In Proceedings of the ACM International Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '12), page 703, 2012.
