
Compiler Optimizations Using Data-Triggered Threads

Alex Limpaecher
alimpaecher@gmail.com

Deby Katz
dskatz@cs.cmu.edu

May 1, 2013

1 Introduction

Inspired by research conducted by Hung-Wei Tseng and Dean M. Tullsen, we implemented a compiler optimization based on data-triggered threads.

1.1 Motivation

Many programs perform redundant work, and few programs take full advantage of the parallel processing capabilities of modern hardware. As presented in [3] and [4], data-triggered threads are designed to reduce the wastefulness in both scenarios. Tseng and Tullsen's data-triggered threads concept is discussed further in the next section.

1.2 Related Work

In [3] and [4], Tseng and Tullsen presented the concept of data-triggered threads. The basic idea behind data-triggered threads is that a section of code (the "skippable region") is only executed if a certain piece of data (its "trigger") has changed. If the skippable region is executed, it is spun off into a different thread (a "support thread") to enable parallelization. If the trigger's value changes while the support thread is running, so that the support thread is computing with a stale input value, the support thread is canceled. If the support thread has not finished running by the time the main thread reaches a use of a value computed in the support thread, the main thread halts until the support thread completes its work.

The general operation of data-triggered threads is illustrated in Figure 1 and Figure 2.

1.3 Contributions

We implemented some of the concepts presented in Tseng and Tullsen's work, and we attempted to provide a user-friendly interface that allows a user to easily incorporate the mechanisms for data-triggered threads. We also implemented a profiling tool that may help users determine where their code might benefit from the use of data-triggered threads. Finally, we demonstrate that, although this technique can be useful, there can be significant overhead from checking whether a value has changed, spawning threads, canceling threads, waiting for threads to complete, and other actions taken in a data-triggered thread version of a program that would not occur in standard execution. This overhead can outweigh the speedups from executing fewer redundant instructions and taking advantage of parallel processing. In our work, we demonstrate the varying usefulness of the technique in different scenarios.


Figure 1: The execution of a single-threaded program that could take advantage of data-triggered threads. (The figure shows a timeline of the main thread executing, in order: a store to the trigger variable, the skippable region, non-dependent code, and dependent code.)

2 Implementation Details

We used LLVM to implement a simplified version of software data-triggered threads.

2.1 End-User Instructions

Our implementation makes it easy for end users to incorporate data-triggered threads into their C code. We have written a header file, triggerThreading.h, which the user must include to take advantage of our features. We have provided mechanisms for the user to identify the trigger variable, the skippable region, and the dependent variables. Specifically, the global variable t_triggerVariable is declared in triggerThreading.h, and the user should store the data that serves as the trigger variable in this global. When the data contained in this variable changes, it triggers the trigger function. The user should place the trigger function, which contains the calculations that depend on the value of the trigger variable, in a function called t_triggerFunction. This function should contain only code without other complex dependencies, because the code will be executed out of order. Our compiler pass arranges for the code within that function to run, in a new thread, only when the value of t_triggerVariable changes. The user must also mark each variable that depends on data calculated in t_triggerFunction by giving its name the prefix m_.
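To make this interface concrete, the following is a minimal sketch of a conforming program. The header name, t_triggerVariable, t_triggerFunction, and the m_ prefix come from the description above; the integer trigger type, the m_sum variable, and the computation itself are our own illustration.

#include <stdio.h>
#include "triggerThreading.h"  /* declares the t_triggerVariable global */

int m_sum;                     /* m_ prefix: depends on the trigger function */

/* Skippable region: runs in a support thread when t_triggerVariable changes. */
void t_triggerFunction(void) {
    int sum = 0;
    for (int i = 0; i < t_triggerVariable; i++)
        sum += i;              /* stand-in for an expensive computation */
    m_sum = sum;
}

int main(void) {
    t_triggerVariable = 1000;  /* this store triggers the support thread */
    /* ... independent work can proceed here in the main thread ... */
    printf("%d\n", m_sum);     /* first use of an m_ variable: the inserted
                                  barrier waits here for the support thread */
    return 0;
}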



Figure 2: The execution of the same program, running in two threads with data-triggered threads. (The figure shows a timeline in which the main thread stores to the trigger variable and then runs non-dependent code, while a support thread runs the skippable region in parallel; the dependent code in the main thread runs only after the support thread finishes.)

We have provided a script, run.sh, that compiles a .c file conforming to the above properties. The script takes the main part of the filename as an argument (omitting the .c extension) and emits two executables corresponding to the file: one with our optimizations and one without.

2.2 Technical Implementation

Our code adds the following to the user's code:

• When a value is stored into t_triggerVariable, our software inserts code to check at runtime whether the value of t_triggerVariable has changed. If it has changed, the inserted code spins off a new thread that executes the skippable region, designated in the function t_triggerFunction.

• Our code removes the skippable region, contained in t_triggerFunction, from the user's main thread. That code runs only in a support thread, never as part of the main thread.

• We find all the points in the code that depend on the skippable code. These dependence points are the uses of the variables that the user has prefixed with m_. At each such point, our code adds a barrier that is not crossed until the thread executing the skippable code has finished. The overall transformation is sketched below.
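Concretely, the rewritten main thread behaves as if the following had been written. This is a sketch only: the real transformation happens at the LLVM IR level, and transformedMain, newValue, m_result, and consumeResult are placeholder names of ours.

extern int t_triggerVariable;
extern int m_result;
void createThread(void);   // runtime helper, Section 2.3.1
void waitForThread(void);  // runtime helper, Section 2.3.3
void consumeResult(int);

void transformedMain(int newValue) {
    t_triggerVariable = newValue;  // user's original store
    createThread();                // inserted: if the value really changed,
                                   // spawn a support thread running
                                   // t_triggerFunction
    // ... the skippable region no longer appears here; it runs only in
    // the support thread, in parallel with the user's independent code ...
    waitForThread();               // inserted barrier before the first m_ use
    consumeResult(m_result);       // user's original dependent code
}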

2.3 Technical Details

We implemented the functionality described above by writing high-level helper functions in C++. We then designed an LLVM pass to manipulate those functions along with the user code. We understand that this approach is standard practice in implementing compiler optimizations.

2.3.1 Creating a Thread

Designing our threading code was challenging because we needed code that we could easily manipulate in our LLVM pass. We settled on using the pthread library. We created a function called createThread, which checks whether the trigger variable has changed; if it has, createThread spins off a thread that runs threadFunction.
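The report does not reproduce these helpers; the following is a minimal sketch of how they might look, assuming an integer trigger variable and at most one outstanding support thread. The cachedTrigger and threadActive globals and the cancel-then-join policy are our own simplifications.

#include <pthread.h>

extern int t_triggerVariable;         /* global from triggerThreading.h */
extern void t_triggerFunction(void);  /* user's skippable region */

int cachedTrigger;                    /* last value a thread was spawned for */
pthread_t supportThread;
int threadActive = 0;

static void *threadFunction(void *arg) {
    int oldState;
    (void)arg;
    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &oldState); /* Section 2.3.2 */
    t_triggerFunction();  /* call inserted here by the LLVM pass */
    return NULL;
}

void createThread(void) {
    if (t_triggerVariable == cachedTrigger)
        return;                          /* unchanged: skip the region entirely */
    if (threadActive) {                  /* a running support thread is stale */
        pthread_cancel(supportThread);
        pthread_join(supportThread, NULL);
    }
    cachedTrigger = t_triggerVariable;
    threadActive =
        (pthread_create(&supportThread, NULL, threadFunction, NULL) == 0);
}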

Because we wanted the user-designated t_triggerFunction to run in the new thread, our LLVM pass creates a new instruction that calls t_triggerFunction and inserts it at the appropriate location within threadFunction.

Our pass then locates store instructions that store a value into t_triggerVariable. Immediately after each of these store instructions, the pass inserts a new instruction that calls createThread, described above. In this manner, the compiler pass ensures that every time the value of t_triggerVariable may have changed, createThread checks whether it has, in fact, changed and, if so, spawns a new thread to do the necessary computation.
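The report does not include the pass source itself. As an illustration only, the core of this step might look like the following sketch, written against a recent LLVM C++ API rather than the 2013 API the project used; insertTriggerChecks is our own name.

#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Insert a call to createThread() after every store to t_triggerVariable.
static bool insertTriggerChecks(Function &F) {
  Module *M = F.getParent();
  GlobalVariable *Trigger = M->getNamedGlobal("t_triggerVariable");
  if (!Trigger)
    return false;
  FunctionCallee CreateThread = M->getOrInsertFunction(
      "createThread", Type::getVoidTy(M->getContext()));
  bool Changed = false;
  for (BasicBlock &BB : F)
    for (Instruction &I : make_early_inc_range(BB))
      if (auto *SI = dyn_cast<StoreInst>(&I))
        if (SI->getPointerOperand() == Trigger) {
          IRBuilder<> Builder(SI->getNextNode());  // just after the store
          Builder.CreateCall(CreateThread);
          Changed = true;
        }
  return Changed;
}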

2.3.2 Cancelling Threads

If t_triggerVariable changes while a support thread is running, the results of the current thread may no longer be valid, and the thread itself may be out of date and unnecessary. Because we have barriers that wait at the next use of data depending on t_triggerFunction, we know that any given support thread is out of date if t_triggerVariable changes while it is running. Therefore, our code provides a mechanism to cancel an out-of-date thread to reduce the number of unnecessary computations. The pthread library requires that, in order for threads to be cancelled, the thread code must handle cancellation calls.

Code run in support threads is primarily user-written, so our LLVM pass had to modify the code in the support threads by modifying threadFunction. In threadFunction, we mark the thread as cancelable, and when we insert the user's code from t_triggerFunction into threadFunction, we insert it after the call that makes the thread cancelable.

In addition, the support thread needs certain threading handles that allow it to be cancelled; without those handles, a cancelled thread would still run to completion. Because the user defines the code running in the support thread, our LLVM pass inserts these hooks into t_triggerFunction. At the beginning of every BasicBlock of t_triggerFunction, we insert a call to pthread_testcancel, which stops the thread from executing if it has been canceled. The pthread_testcancel call is designed to use very little computation, and we place the call at the beginning of every BasicBlock to trade the additional computation of the calls against the additional computation and waiting caused by a thread running on after it has been cancelled.
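Again as an illustration only (the original pass code is not shown in the report), the insertion of cancellation points might look like this:

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Insert pthread_testcancel() at the top of every basic block of
// t_triggerFunction, so a cancelled support thread exits promptly.
static void insertCancellationPoints(Function &TriggerFn) {
  Module *M = TriggerFn.getParent();
  FunctionCallee TestCancel = M->getOrInsertFunction(
      "pthread_testcancel", Type::getVoidTy(M->getContext()));
  for (BasicBlock &BB : TriggerFn) {
    IRBuilder<> Builder(&*BB.getFirstInsertionPt());  // skip any PHI nodes
    Builder.CreateCall(TestCancel);
  }
}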

2.3.3 Waiting for Threads

When variables' values are calculated in t_triggerFunction, the support thread must complete before those variables can be used. Therefore, our LLVM pass inserts barriers that halt execution before the uses of those variables (identified by the prefix m_) until the thread completes execution. The inserted instruction calls waitForThread, a function that calls pthread_join to wait for any outstanding support thread to finish.
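Completing the runtime sketch from Section 2.3.1 (threadActive and supportThread are our assumed globals), waitForThread might look like:

#include <pthread.h>

extern pthread_t supportThread;  /* from the Section 2.3.1 sketch */
extern int threadActive;

/* Barrier inserted by the pass before the first use of any m_-prefixed
   variable: block until the support thread has finished its work. */
void waitForThread(void) {
    if (threadActive) {
        pthread_join(supportThread, NULL);
        threadActive = 0;
    }
}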

2.4 Profiling

As also mentioned in Section 3, we designed and implemented a profiling tool that may help users identify regions that could benefit from the addition of data-triggered threads. Because our implementation requires users to identify regions that are potentially skippable, this information should help users find such regions. Among other functions, the profiling tool can identify the number of "silent" loads and stores in each module of the program. A silent load, as used in this paper, is a load that loads a value from memory into a register, where the value loaded is identical to the value the register previously contained. Similarly, a silent store is a store that stores a value from a register into memory, where the value stored is identical to the value previously contained at the affected memory addresses. Implemented in the dynamic binary instrumentation tool Pin [1], our profiling tool provides a listing of the numbers of silent loads and stores in each program region. Some usability enhancements restrict the output to routines likely to interest the programmer, such as those the programmer wrote.
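As a concrete illustration of these definitions (our own example, not output from the tool):

int x;
x = 5;      /* ordinary store: the memory holding x changes value  */
x = 2 + 3;  /* silent store: x already contains 5, nothing changes */
int y = x;  /* ordinary load: x's value enters a register          */
y = x;      /* silent load: the register already holds x's value   */

A region dominated by such silent operations is doing redundant work, which makes it a natural candidate for a skippable region.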

3 Experimental Setup

To evaluate the effectiveness and correctness of our optimizations, we designed several toy programs with specific intended behaviors after being processed by our compiler. We evaluated the programs compiled with standard Clang and LLVM compiler settings against the programs compiled with our compiler pass using the -pthread flag. For each toy program, we used our profiling tool to report the total number of instructions executed, the total numbers of silent loads and silent stores, and the numbers of silent loads and stores in each program region, for both the original program and our optimized program. We also ran most programs 1,000 times consecutively and compared the running times of the original version and our optimized version.

4 Experimental Evaluation

We will discuss each example toy program in turn. For the purposes of this discussion, we will call the version compiled with Clang and LLVM's standard options the unoptimized case and the version compiled using our pass the optimized case, even though our pass does not always perform better than the version compiled with the standard options. Results are shown in Figures 3, 4, and 5.

Example 1 Example 1 ran with 152,449 instructions in the unoptimized case and 277,228 instructions in the optimized case: 124,779 more instructions in the optimized case, an increase of about 82%. The unoptimized runtime was 1 second, while the optimized runtime was 2 seconds. We believe that this case produced a longer runtime and more instructions executed because there was not much work taking place inside the spawned thread, so the overhead of spawning a new thread was much greater than any time saved through parallelization or any work saved through skipping.

Example 2 Example 2 ran with 7,183,517 instructions in the unoptimized case and 35,875,135 instructions in the optimized case, meaning there were 28,691,618 more instructions in the optimized case. The unoptimized runtime was about 22 seconds, while the optimized runtime was about 2 minutes and 9 seconds, more than five times longer. This example was designed as a situation where data-triggered threads would not give a good result. The trigger variable changes value on each iteration of a loop, leading the optimized program to spawn a new thread at each iteration. Furthermore, there is no independent work for the main thread to complete before it needs a value calculated in the support thread. In such a scenario, data-triggered threads are not a good approach.



Figure 3: The percentage improvement and decline in performance for toy programs compiled with our compiler pass. Some examples were designed to illustrate pathological cases. (The chart plots time improvement and instruction improvement for each example: Example 1, Example 2, bestCase, cancel, redundantWork, threadCase, threadSometimes, and worstCase, on a scale from a 225% improvement down to a 900% decline.)

Example BestCase Example BestCase ran with 2,147,483,647 instructions in the unoptimized case and 47,348,764 instructions in the optimized case, meaning that there were 2,100,134,883 fewer instructions in the optimized case, a savings of about 97.8%. The unoptimized runtime was about two minutes, 36 seconds, while the optimized runtime was about 13 seconds, a speedup of roughly 12x. Example BestCase was designed as a scenario well adapted to the use of data-triggered threads. The value of the trigger variable changes only once, so the trigger function executes only once, and because the trigger function is so time-consuming, executing it only once saves a great deal of time.

Example Cancel Example Cancel ran with 5,821,578 instructions in the unoptimized case and 2,012,340 instructions in the optimized case, meaning that there were 3,809,238 fewer instructions in the optimized case. However, the runtime increased from 2 seconds to 13 seconds in the optimized case. We designed this example to test and demonstrate the ability of programs compiled with our pass to cancel threads that have become out of date. We believe that the additional overhead of spawning and canceling threads accounts for the discrepancy between executing fewer instructions and taking a longer time.



                       Original       Optimized
Example 1               152,449         277,228
Example 2             7,183,517      35,875,135
bestCase          2,147,483,647      47,348,764
cancel                5,821,578       2,012,340
redundantWork        14,387,393      12,264,835
threadCase          433,235,441     433,024,417
threadSometimes     433,240,969     420,622,307
worstCase             1,682,046      15,370,376

Figure 4: Instruction counts of unoptimized and optimized toy programs.

                  Original   Optimized
Example 1         00:00:01    00:00:02
Example 2         00:00:22    00:02:09
bestCase          02:36:47    00:00:13
cancel            00:00:02    00:00:13
redundantWork     00:03:44    00:03:42
threadCase        00:36:15    00:36:47
threadSometimes   00:36:56    00:37:14
worstCase         00:00:22    00:02:02

Figure 5: Running times (HH:MM:SS) of unoptimized and optimized toy programs.

Example RedundantWork We created an example with a great deal of redundant work. This example ran with 14,387,393 instructions in the unoptimized case and 12,264,835 instructions in the optimized case, a savings of 2,122,558 instructions, or about 14.7%. The unoptimized version ran in three minutes, 44 seconds, while the optimized version ran in three minutes, 42 seconds, a savings of only two seconds. We believe that in this example we were able to skip a great deal of redundant work, bringing the instruction count down. However, because the calculated value is used right after it is calculated, and because there is not much time-consuming work in the function run in the support thread, the time saved by skipping is about equal to the overhead generated.

Example ThreadCase In this example, the number of instructions and the elapsed time both stayed relatively flat. The trigger variable changes on each loop iteration, meaning that we get no benefit from skipping unnecessary work. There is some work in the main thread and in the support thread that can run in parallel, but the amount of work run in parallel is likely not enough to outweigh the overhead of spawning threads.

Example WorstCase This example demonstrates the high cost of threading when there is no benefit to be had from skipping work or from parallelization. The unoptimized version executed 1,682,046 instructions, while the optimized version executed 15,370,376 instructions, an increase of 13,688,330 instructions, making the optimized count more than eight times the original. Similarly, the unoptimized running time was 22 seconds, while the optimized running time was two minutes and two seconds, more than five times as long as the unoptimized running time.

Note that the same program might increase in running time while decreasing in the number of instructions executed in the optimized version versus the unoptimized version (see, for example, the Cancel toy program). We believe that this may occur because of system overhead due to spawning threads or because of synchronization overhead, such as waiting for a lock to be released.

While our examples do not all show much improvement, we believe there is substantial potential in this technique. We designed several examples to show pathological situations, and their time losses swamp the time savings of the good cases. A well-designed implementation of data-triggered threads would include safeguards against using the threads in pathological cases. It is vitally important to weigh the overhead of creating new threads against the potential gains of parallelization and of skipping redundant work. Our experiments can contribute to finding the right level for that trade-off.

5 Surprises and Lessons Learned

We were too ambitious with the scope of the project, which is unfortunate, because the parts of the project we did not get to involved the more interesting questions. We underestimated the work involved in executing our original approach, and by the time we settled on an alternate approach, we did not have the chance to investigate all the questions we wanted to.

Our initial approach involved making our tools as portable and generic as possible. This included modifying the compiler front end (Clang) [2] to accept new pragmas. When we ran into difficulty with pragmas, we attempted to modify Clang to accept new attributes instead. We spent a great deal of time trying to learn the ins and outs of Clang, when those modifications were not central to our core goals. Eventually, we abandoned the approach of modifying Clang in favor of using the pre-identified variable name, function name, and variable prefix described in Section 2.1. With this approach, we could convey the information to the LLVM intermediate representation without modifying Clang. One of our biggest remaining challenges was designing threading code that could be easily manipulated by LLVM, which we addressed with the pthread library and the createThread function described in Section 2.3.1.

6 Conclusions and Future Work

6.1 Future Work

There is an immense amount of work we had hoped to do for this project but did not get to. We had hoped to use our profiler to automatically detect regions that would be amenable to becoming support threads. However, while the profiling tool can provide a starting point for the user to look for potentially redundant work, the tool requires further development to detect code that is safe to move into a new thread. The tool would also benefit from closer coordination with the compiler, to detect patterns in the code that produce situations amenable to data-triggered threads.



We also believe there are interesting areas of investigation in code movement to increase parallelization: if more of the code that is independent of the support thread could be moved to execute while the support thread runs, the gains from parallelization would grow. We had hoped to investigate this further, but we were too ambitious. In addition, there are areas for further investigation in applying these techniques to real-world code.

Finally, because of the overhead of threading, a version that simply skips code when that code is unnecessary, without involving threading, may provide better performance in many situations.

6.2 Conclusions

Data-triggered threads are a promising idea, but they achieve observable benefits only in particular situations. Any production implementation would need to carefully determine the scenarios in which the threads should be used and carefully tune the relevant trade-offs for its use cases.

Division of Credit Credit should be divided equally, 50%-50%.

References

[1] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, and Artur Klauser. Pin: Building customized program analysis tools with dynamic instrumentation. In PLDI, pages 190–200, 2005.

[2] University of Illinois at Urbana-Champaign. The LLVM compiler infrastructure project, May 2013.

[3] Hung-Wei Tseng and Dean M. Tullsen. Data-triggered threads: Eliminating redundant computation. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA), pages 181–192, February 2011.

[4] Hung-Wei Tseng and Dean Michael Tullsen. Software data-triggered threads. In Proceedings of the ACM International Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '12), page 703, 2012.
