Proceedings of the Seminar Hardware/Software Codesign

Lecturer:
Jun.-Prof. Dr. Christian Plessl

Participants:
Erik Bonner
Wei Cao
Denis Dridger
Christoph Kleineweber
Sandeep Korrapati
André Koza
Pavithra Rajendran
Maryam Sanati
Gavin Vaz

WS 2011/12
University of Paderborn
Contents

1 An Introduction to Automatic Memory Partitioning
  Erik Bonner
2 Error Detection Technique and its Optimization for Real-Time Embedded Systems
  Wei Cao
3 CPU vs. GPU: Which One Will Come Out on Top? Why There is no Simple Answer
  Denis Dridger
4 Will Dark Silicon Limit Multicore Scaling?
  Christoph Kleineweber
5 Guiding Computation Accelerators to Performance Optimization Dynamically
  Sandeep Korrapati
6 A Case for Lifetime-Aware Task Mapping in Embedded Chip Multiprocessors
  André Koza
7 Warp processing
  Maryam Sanati
8 Performance Modeling of Embedded Applications with Zero Architectural Knowledge
  Pavithra Rajendran
9 Improving Application Launch Times
  Gavin Vaz
An Introduction to Automatic Memory Partitioning

Erik Bonner
University of Paderborn
berik@mail.uni-paderborn.de

January 12, 2012
Abstract

This paper presents Automatic Memory Partitioning, a method for automatically increasing a program's data parallelism by splitting its data structures into segments and assigning them to separate, simultaneously accessible memory banks. Unlike other data optimization methods, Automatic Memory Partitioning uses dynamic analysis to identify partitionable memory. After partitioning, the set of partitioned memory regions is assigned to a set of available memory banks by solving a budgeted graph colouring problem by means of Integer Linear Programming (ILP). After introducing Automatic Memory Partitioning, this paper offers a discussion of its merits and pitfalls.
1 Introduction

Field Programmable Gate Arrays (FPGAs) and other embedded systems can organize memory into multiple memory banks, which can be accessed simultaneously. Since many applications are memory-bound, organizing memory into separate banks such that data parallelism is increased during execution can be a powerful means of improving program performance. Consider, for example, the code in Listing 1. If all memory were organized in a single memory bank, and the latency of a read were a single clock cycle, 3 clock cycles would be necessary to access the memory required to compute the result sum. If, however, each of the arrays a, b and c were stored in a separate memory bank, the necessary data could be obtained in a single clock cycle.
for (int i = 0; i < ARRAY_SIZE; i++)
    sum[i] = a[i] + b[i] + c[i];

Listing 1: An example of code that benefits from memory parallelization. Listing source: [3].
Arrays of data structures are often traversed linearly and, in each iteration, several components are accessed in the same basic block. An example is given in Listing 2. Two for-loops traverse an array of structs of type point3d, accessing all three fields x, y and z in each iteration. When serviced by a single memory bank, extracting the contents of each point3d object will have a latency of 3 cycles. However, if the contents of each point3d object can be distributed across several different memory banks, each position extraction can be performed in a single clock cycle.
void find_starship(point *stars, int n, point3d *ship,
                   int m, int *avail)
{
    int sx = 0, sy = 0, sz = 0, b = 0;

    // find galaxy center
    for (int i = 0; i <
Figure 1: An example of the memory used in Listing 1 partitioned and distributed across 3 memory banks. Figure inspired by a similar diagram in [3].

Automatic Memory Partitioning identifies separately accessed memory regions and, by solving a budgeted graph-colouring problem using Integer Linear Programming, assigns (partitions) them to a minimal set of memory banks (see Section 3.2). The particular focus of this technique is the splitting of complex data structures into their constituent fields and the assignment of those fields to different memory banks, thus greatly accelerating code similar to the example given in Listing 2.
After the introduction to the problem addressed by Automatic Memory Partitioning given in this section, Section 2 discusses current approaches to memory parallelization in the literature. Section 3 then discusses the Automatic Memory Partitioning method in detail. The evaluation results reported by the technique's authors, Asher and Rothem [3], are given in Section 4. Finally, a critical discussion is provided in Section 5, before the paper is concluded in Section 6.
2 Related Work

A number of memory reshaping and partitioning techniques have been proposed in the literature for improving application performance. The majority of these are based on static analysis of application source code, i.e., analysis that can be performed at compile time.
Zhao et al. [9] proposed Forma, an automatic data reshaping technique performed (transparently to the programmer) at compile time. The aim of Forma is to reshape arrays of structs in order to improve data locality and thereby optimize cache usage. For example, consider the code given in Listing 3. The for-loop on lines 3-6 traverses an array of point3d objects, accessing only the .x field in each iteration. If left unmodified, the code in Listing 3 exhibits poor cache performance: although only the .x field is used in each iteration, the .y and .z fields of each structure element will, due to their proximity in memory to the .x field, also be fetched into the cache, causing significant cache clutter.
1 // compute average x coordinate
2 int sumx = 0, avg = 0;
3 for (int i = 0; i < NUM_STARS; i++)
4 {
5     sumx += stars[i].x;
6 }
7 avg = sumx / NUM_STARS;

Listing 3: Code suitable for optimization with Forma.
By combining statistics gathered from execution profiling, which identify the usage frequency and affinity of the structure fields, with static code analysis, the data structure fields in Listing 3 can be partitioned and the stars array reshaped to match the data locality present in the program execution. Figure 2 shows how the stars array could be reshaped to improve cache performance. Although Forma is primarily targeted at devices with traditional memory hierarchies, its data structure partitioning and array reshaping can be adapted to target platforms with multiple memory banks.
Figure 2: An example of reshaping an array of point3d objects using Forma. Using the restructured array, the traversal in Listing 3 would enjoy significantly improved cache performance.
Lattner and Adve [7] proposed a technique called Automatic Pool Allocation, which, by means of static pointer analysis, improves the performance of heap-based data structures (such as linked lists or trees) by partitioning the allocation of individual complex objects into different memory pools. For example, the nodes of a linked list can automatically be allocated in a dedicated memory pool. By controlling the allocation of objects within pools, the compiler can ensure that memory is structured in an aligned format, which greatly improves data locality. Figure 3 compares the memory structure of a linked list allocated using traditional allocators (such as malloc()) with one whose nodes have
been allocated using Automatic Pool Allocation. Since the linked-list nodes are not scattered throughout memory in the latter case, a traversal of the linked list will benefit from improved cache performance.
(a) (b)

Figure 3: An example of Automatic Pool Allocation. The figure on the left (a) shows a set of nodes belonging to a linked list, allocated in main memory using traditional methods. The nodes are scattered randomly throughout memory. The figure on the right (b) shows the same nodes allocated using Automatic Pool Allocation. Using this method, new nodes are allocated in so-called "pools", which are dedicated memory regions ensuring contiguous node allocation.
Like Forma, Automatic Pool Allocation is a static technique primarily intended for use during compilation for traditional, hierarchy-based memory architectures. However, also like Forma, Automatic Pool Allocation can be readily adapted to architectures using multiple memory banks.
Curial et al. [5] proposed a method called MPADS (Memory-Pooling-Assisted Data Splitting), which can be considered a combination of Forma and Automatic Pool Allocation. Using this method, individual objects of complex data structure types are split among memory pools. In this respect, MPADS offers functionality very similar to the Automatic Memory Partitioning technique described in this paper. Unlike Automatic Memory Partitioning, however, MPADS accomplishes its memory splitting and allocation purely through static code analysis, which, its authors argue, has the advantage of avoiding the generation of large memory traces. On the other hand, MPADS is designed for use with commercial compilers, and therefore must be more minimalistic and pessimistic in its approach than other, research-specific methods. For example, if there is a chance that a potential memory transformation could modify the semantics of the target program, the transformation is abandoned.
The main contribution of Automatic Memory Partitioning, which is not addressed by the related work, is the combination of data structure partitioning with dynamic code analysis. This entails analysing the program according to its dynamic behaviour, rather than
analysing its code statically at compile time. The pros and cons of this approach are discussed in Section 5.
3 Proposed Technique

Automatic Memory Partitioning is a technique for optimizing linear traversals of data structure arrays on embedded devices (primarily FPGAs) that organize memory into a set of simultaneously accessible memory banks. By automatically partitioning program data structures such that individual structure components are placed in different memory banks, linear traversals of data structure arrays are significantly accelerated (see the example in Section 1).

Automatic Memory Partitioning consists of two main stages: identifying the set of disjoint memory access patterns within a program/kernel execution, and assigning memory regions to a minimal set of memory banks. These stages are described in Sections 3.1 and 3.2, respectively. Once memory has been redistributed into banks, all pointers accessing this memory must be updated. This process is described in Section 3.3.
3.1 Linear Memory Pattern decomposition

3.1.1 Linear Memory Patterns (LMPs)

The first step in the proposed method is to decompose the overall memory signature of a program execution into a set {lp0, ..., lpk} of disjoint Linear Memory Patterns (LMPs), where:

• Each load in the code is associated with an LMP lpi.
• Each LMP lpi represents a set of sequentially spaced memory addresses of the form αx + β, where β is the offset of the first memory access, α is the stride separating adjacent accesses, and x is an integer between 0 and some upper bound n.
• Each memory operation in the program is mapped to exactly one LMP, which spans all memory addresses associated with that operation's signature.
3.1.2 Memory profiling

Unlike the memory partitioning methods discussed in Section 2, the set of LMPs existing in a program's memory signature is identified by means of dynamic program analysis. To obtain the memory trace of an execution, the program source code is instrumented such that a call to a custom function is inserted immediately prior to each memory opcode. When the instrumented binary is executed, these custom functions write the identifier and operand address(es) of each memory operation to a log on disk. After execution, the contents of the log make up a complete memory trace of the program execution. Figure 4 shows a portion of a sample memory trace log.
Figure 4: An example memory trace log. Image source: [3].

The example trace in Figure 4 logs four opcodes (referred to as #7, #12, #17 and #22) consecutively operating on the fields of an array of adjacently allocated data structure objects. The address on which each opcode operates is given in the left-most table column, and the basic block to which it belongs is specified in the right-most column.
3.1.3 Data structure decomposition

Once the memory trace of an execution has been generated, it is analysed to determine a set of LMPs that can correctly represent the program's memory profile. For this analysis, an LMP is defined as a 4-tuple (Rl, Rh, Op, S), where Rl and Rh define the lower and upper bounds of the memory range, respectively; Op defines the set of memory operations that operate on addresses within this range; and S, which corresponds to α in Section 3.1.1, defines the stride between potential accesses.

Listing 4 shows an example code snippet that loops through an array of point3d objects and accesses the .x structure field. Since the array of structs is allocated as a contiguous memory region of adjacent struct elements, each access to the .x field is separated from the next by a distance of sizeof(point3d) bytes. Furthermore, since the memory operation applied to this field alternates between reading and writing, the LMP property Op contains both read and write opcodes. Finally, the memory range defined by Rl and Rh spans 100*sizeof(point3d) bytes. The diagram in Figure 5 visualizes the LMP, denoted lp0, constructed from the code in Listing 4.
point3d parray[100];
for (int i = 0; i < 100; i++)
{
    if (i % 2 == 0)
        do_some_computation(parray[i].x);
    else
        parray[i].x = some_other_computation();
}

Listing 4: Simple code for looping through an array of structs, alternating between reading and writing.
Figure 5: A view of memory during the execution of the code in Listing 4. An LMP, lp0, can be constructed to represent the accesses to the field parray[i].x (marked in yellow). The LMP range, Rl and Rh; set of operations, Op; and stride, S, which in this case is equal to sizeof(point3d), are marked in the diagram.
The pseudocode given in Figure 6 demonstrates how the set of LMPs for a given program execution can be extracted from its memory trace. The first loop, on Lines 1 to 11, creates an LMP for each opcode in the set of all opcodes found in the memory trace. Note that this part of the algorithm can be performed online, while the memory trace is being generated. The second loop compares each identified LMP with all other identified LMPs to determine whether any two can be merged. Two LMPs can be merged if they operate on common memory cells, which is the case if both of the following conditions hold:

1. The candidate ranges intersect.

2. Both candidates have the same offset within their stride.
When traversing an array of complex data types, the traversal stride represents the size of the complex data type object, and the offset within the stride indicates which field within the data structure is being accessed. For example, consider two functions, compute_x() and compute_y(). The body of compute_x() is made up of the code given in Listing 4, while the body of compute_y() is nearly identical, except that it operates on the parray[i].y field. The LMPs extracted from the traces of these functions would have identical strides and largely overlapping ranges. However, since they access different fields of the point3d data structure, the offsets within their strides differ. Therefore, the LMPs of the compute_x() and compute_y() functions are not mergeable.
Two candidate LMPs are merged by setting the merged range bounds to the minimum and maximum of their respective lower and upper range bounds, setting the merged Op field to the union of both candidates' Op sets, and setting the merged stride to the greatest common divisor of the two candidate strides. After the second, nested loop (Lines 12-23), the set of disjoint LMPs present in the program execution has been identified and is ready for assignment to the available memory banks.
Figure 6: The algorithm used for extracting a set of LMPs from a memory trace. Image source: [3].
3.2 Memory bank allocation

Once the set of LMPs present during execution has been identified, the memory referenced by each LMP must be assigned to memory banks in an optimal manner. To accomplish this, the set of LMPs must be assigned to a set of K memory banks with known capacities, such that:

• Maximum memory parallelism is achieved.

• The capacity of each memory bank is sufficient to store all LMPs assigned to it.

• A minimal number of banks is used.
The optimal assignment of LMPs to memory banks is attained by solving a modified graph colouring problem. The traditional graph colouring problem is formulated as follows. Given a graph G = (V, E), where V is a set of vertices and E is the set of edges connecting them, a mapping c : V → C is sought such that ∀(u, v) ∈ E, c(u) ≠ c(v), where the function c() assigns a "colour" from the set C to each vertex. In other words, given a graph, the graph colouring problem involves assigning a set of colours (or, generally, some values) to the graph vertices such that no adjacent vertices are assigned the same colour. For the assignment of LMPs to memory banks, the LMPs are the graph vertices and the memory banks are the assignable colours. Two vertices are connected by an edge if their LMPs cannot be assigned to the same memory bank. Furthermore, an additional constraint is added to the problem: each LMP, or vertex, has an associated size, and each bank, or colour, has a limited capacity. LMPs must be assigned to banks such that no bank has its capacity exceeded. This is known as a budgeted graph colouring problem. Figure 7 shows a simple example of the budgeted graph colouring problem, solved for a set of 5 nodes and 3 colours.
Figure 7: An example of a solved budgeted graph colouring problem. Each node has an associated size value and each colour has an associated capacity. Nodes must be assigned to colours such that the total size of all nodes assigned to a given colour does not exceed that colour's capacity. Figure redrawn from [3].
In general, the graph colouring problem is NP-complete [6]. A common problem for which graph colouring is used is the assignment of variables to registers in compilers [4]. Accordingly, a number of heuristic-based solution strategies have been proposed. In Automatic Memory Partitioning, the memory bank allocation problem is solved using Integer Linear Programming (ILP). Budgeted graph colouring is structured as an ILP problem as follows. For n LMPs and m memory banks, a set of m×n boolean variables is defined such that the variable xij is 1 if LMP i is assigned to memory bank j. Furthermore, for each memory bank, a boolean variable cj indicates whether that memory bank is currently being used. By minimizing c0 + ... + cm subject to a number of constraints, an optimal bank allocation can be found. The constraints are defined as:

• Each LMP is assigned to exactly one memory bank:
  ∀i: Σj xij ≥ 1 and Σj xij ≤ 1 (i.e., Σj xij = 1)
• No memory bank is overfilled:
  ∀j: Σi xij · sizeof(LMPi) ≤ sizeof(bankj)

• Conflicting LMPs cannot be assigned to the same bank:
  ∀j: xvj + xwj ≤ 1, where v and w are conflicting LMPs.
The above ILP problem is solved using the freeware CVXOPT software package.
3.3 Pointer synthesis

Once memory has been rearranged into a minimal set of memory banks, all pointers in the target program accessing this memory must be reassigned accordingly. Consider the memory bank depicted in Figure 8, which contains three LMPs. In the original memory, each LMP has an associated starting address (Rl), size (Rh − Rl) and stride (S). When assigned to a memory bank, these LMP properties must be updated such that memory is correctly addressed within the assigned bank.

Figure 8: A single memory bank with three LMPs (lpi, lpj and lpk) assigned to it. Each LMP has an associated size and offset within the bank.
For each pointer Pold that accesses the LMP in the original memory, the following steps are taken to determine its new value Pnew within the assigned memory bank. First, the start address Rl is subtracted from Pold. Then, since the memory accessed by each LMP will be packed linearly into the assigned memory bank, the LMP stride must be adjusted; this is accomplished by scaling each old pointer value by a factor ŝ, where ŝ is a multiple of its LMP stride. Finally, the starting address b̂ of the LMP within its newly assigned memory bank must be added. The complete pointer mapping is given by:
Pnew = (Pold − Rl)/ŝ + b̂ = Pold/ŝ − Rl/ŝ + b̂ = Pold/ŝ ± C

where C is a constant for each LMP. The most expensive part of this mapping is the division Pold/ŝ. However, when ŝ is a power of two, this can be implemented using bit-shifting, which is a cheap operation on FPGAs.
4 Reported Results

Automatic Memory Partitioning performance was evaluated by synthesising a collection of memory-intensive programs from the NVIDIA CUDA SDK [8], the CLAPACK SDK [1] and the SystemRacer test suite [2]. The samples were synthesized with single, as well as
(a) (b)

Figure 9: Evaluation. The table on the left (a) lists the name of each test program (left column), the number of cycles per iteration when using a single memory bank (center-left column) and multiple memory banks (center-right column), as well as the number of memory banks used for Automatic Memory Partitioning (right column). These results are visualized in the graph on the right (b). Images from [3].
multiple memory banks, and the resulting performances were compared. All programs were synthesized to Verilog using the SystemRacer synthesis engine. Each memory bank was synthesized with a single memory port, and each memory port had a latency of 3 cycles. A comparison of the performance measured for the test programs synthesized with a single vs. multiple memory banks is given in Figure 9.
In most cases, it was possible to synthesize the target code using more than one memory bank. In all such cases, performance improvements were recorded when running the multiple-bank versions. As can be expected, the more banks used, the greater the memory parallelism, and hence the greater the performance gains.
5 Discussion

This section provides additional discussion and remarks regarding the Automatic Memory Partitioning method.
In the original paper by Ben-Asher and Rotem [3] it is claimed that, unlike previously existing methods, Automatic Memory Partitioning performs memory optimization by means of dynamic analysis. Although this is true, there are some significant limitations. A target application's memory is partitioned based on an analysis of its memory trace, generated during a profiling run. For the method to work, it is necessary that memory addresses and usage are identical between runs. For many applications, particularly those whose control flow is data-dependent, this means that the memory partitioning will only work on the exact input for which the memory trace was generated. Furthermore, to ensure that memory will be located in the same place between runs, the method relies on custom memory allocators, rather than traditional functions such as malloc(), whose allocation addresses can vary between runs (for instance because of address-space layout randomization introduced for security reasons). Since such allocators allocate memory in a predefined, predictable manner that is persistent between runs, a program using them can also be correctly analysed using static analysis. This weakens the claim that, by using dynamic analysis techniques, Automatic Memory Partitioning achieves results that are not obtainable using static methods.
Another point of discussion is the reported results. As discussed in Section 4, results were gathered by synthesizing a collection of sample programs with a single memory bank, and comparing performance with the same programs synthesized with multiple memory banks. Unsurprisingly, the programs synthesized with multiple memory banks outperformed those with a single memory bank. This is more a proof that the method works than that it works well. Far more interesting would have been a comparison between sample programs optimized with Automatic Memory Partitioning and those optimized using other methods in the literature, such as MPADS. Furthermore, a number of the samples taken from the CUDA SDK are already hand-optimized to use multiple (shared) memory banks. Synthesizing these to use a single memory bank would involve significant modifications to the original source code, with the explicit goal of reducing performance. When synthesized for use with multiple memory banks, did they use the modified, single-bank code, or the original SDK sample, written with a multiple memory bank architecture in mind? In the paper, this is not clear.
In addition to the performance of the synthesized application, the performance of the Automatic Memory Partitioning procedure itself is also of interest. Discussion of this is largely left out of the original paper. Both major phases of Automatic Memory Partitioning (the memory partitioning itself, and the assignment of memory regions to available memory banks) can potentially be slow under certain circumstances. The partitioning of data structures relies on execution traces, which can become very large, particularly for applications that process large amounts of data and contain frequent data-dependent branching. The authors of the MPADS method (described in Section 2) explicitly state the importance of avoiding execution traces when performance is a concern [5]. Furthermore, when the number of identified LMPs becomes large, the task of assigning memory banks becomes increasingly complex. In Automatic Memory Partitioning, this task is formulated as an ILP problem and solved using a heuristic solver. The authors reported running times of under a second for a set of 10 LMPs. It would be interesting to see the performance for larger LMP sets, and to know how many LMPs can be expected when synthesizing larger programs.
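The bank-assignment problem closely resembles graph colouring (compare the colouring references [4, 6]). The paper's actual ILP formulation is not reproduced here; as a purely illustrative sketch, a greedy largest-degree-first colouring assigns interfering LMPs (those that must be accessible in the same cycle) to distinct banks while letting non-interfering LMPs share a bank. All names and the interference model are assumptions for illustration:

```python
def assign_banks(lmps, interferes, max_banks):
    """Greedy bank assignment: LMPs that interfere must land in different
    banks so they can be accessed in parallel; non-interfering LMPs may
    share a bank, keeping the bank count low.

    lmps:       iterable of LMP identifiers
    interferes: dict mapping an LMP to the set of LMPs it interferes with
    Returns {lmp: bank} or None if more than max_banks would be needed.
    """
    banks = {}
    # Classic heuristic: colour the most-constrained (highest-degree) LMPs first.
    for lmp in sorted(lmps, key=lambda l: len(interferes.get(l, ())), reverse=True):
        used = {banks[o] for o in interferes.get(lmp, ()) if o in banks}
        bank = next(b for b in range(max_banks + 1) if b not in used)
        if bank >= max_banks:
            return None            # the heuristic could not fit within max_banks
        banks[lmp] = bank
    return banks
```

Three mutually interfering LMPs need three banks, while two non-interfering LMPs happily share bank 0.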
One advantage of using memory traces as the sole basis for memory analysis is the relative simplicity of the method. Static techniques often need to employ complex, language-dependent pointer analysis, with additional measures for type-unsafe languages such as C and C++. By analysing the memory trace rather than the code itself, these complex methods can be avoided. Moreover, using memory traces makes the memory analysis largely language-independent; Automatic Memory Partitioning can easily be used for any language that can be instrumented to generate suitable memory trace logs. On the other hand, the generation and analysis of memory traces can be a cumbersome process, since the traces can become very large.
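The core of such a trace-based analysis can be sketched in a few lines. Assuming a hypothetical trace format of (access site, address) pairs produced by an instrumented profiling run, the sketch reports which pairs of access sites never touch a common address and could therefore be served from separate memory banks:

```python
from collections import defaultdict

def disjoint_access_sites(trace):
    """Given a memory trace of (site_id, address) pairs, return the pairs
    of access sites whose address sets never overlap.  Such sites do not
    interfere and are candidates for placement in separate memory banks."""
    touched = defaultdict(set)
    for site, addr in trace:
        touched[site].add(addr)          # record every address a site accesses
    sites = sorted(touched)
    return {(a, b)
            for i, a in enumerate(sites)
            for b in sites[i + 1:]
            if touched[a].isdisjoint(touched[b])}
```

A loop reading two separate arrays yields a disjoint pair; two sites that ever alias yield none. The sketch also makes the scalability concern concrete: the `touched` sets grow with the trace, which is exactly why large traces become cumbersome.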
6 Conclusion

This paper introduced a technique for automatically partitioning data structures across multiple memory banks on embedded devices such as FPGAs, which enhances application performance by increasing memory parallelism.
After using a number of simple examples in Section 1 to illustrate the advantages of memory partitioning on architectures with simultaneously accessible memory banks, a number of relevant data partitioning methods from the literature were discussed in Section 2. Although several of the existing methods show promising results, all rely on static code analysis to identify memory partitioning opportunities. Following the literature review, Section 3 introduced a memory optimization technique that uses dynamic analysis: Automatic Memory Partitioning. Automatic Memory Partitioning identifies a target program's memory access patterns by analysing its memory trace. Once a set of non-interfering memory access patterns has been identified, the patterns are assigned to a set of memory banks, taking care to minimize the number of banks used while maximizing data parallelism. The results reported by the authors of the technique were given in Section 4. Finally, Section 5 offered a critical discussion of the Automatic Memory Partitioning technique, evaluating its strengths and weaknesses.
References

[1] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.

[2] Y. Ben-Asher and N. Rotem. Synthesis for variable pipelined function units. In International Symposium on System-on-Chip (SOC 2008), pages 1-4, November 2008.

[3] Yosi Ben-Asher and Nadav Rotem. Automatic memory partitioning: increasing memory parallelism via data structure partitioning. In Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS '10), pages 155-162, New York, NY, USA, 2010. ACM.

[4] G. J. Chaitin. Register allocation & spilling via graph coloring. SIGPLAN Notices, 17:98-101, June 1982.

[5] Stephen Curial, Peng Zhao, José Nelson Amaral, Yaoqing Gao, Shimin Cui, Raúl Silvera, and Roch Archambault. MPADS: memory-pooling-assisted data splitting. In Proceedings of the 7th International Symposium on Memory Management (ISMM '08), pages 101-110, New York, NY, USA, 2008. ACM.

[6] M. R. Garey and D. S. Johnson. The complexity of near-optimal graph coloring. Journal of the ACM, 23:43-49, January 1976.

[7] Chris Lattner and Vikram Adve. Automatic pool allocation: improving performance by controlling data structure layout in the heap. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '05), Chicago, Illinois, June 2005.

[8] NVIDIA. NVIDIA CUDA SDK, 2011.

[9] Peng Zhao, Shimin Cui, Yaoqing Gao, Raúl Silvera, and José Nelson Amaral. Forma: a framework for safe automatic array reshaping. ACM Transactions on Programming Languages and Systems, 30, November 2007.
Error Detection Technique and its Optimization for Real-Time Embedded Systems

Wei Cao
University of Paderborn
wcao@mail.upb.de

January 12, 2012
Abstract

This paper discusses error detection techniques and the optimization of the error detection implementation (EDI) in the context of different FPGAs, including FPGAs with static configuration and FPGAs with partial dynamic reconfiguration (PDR). In these error detection techniques, path tracking and variable checking are the main sources of performance overhead. Depending on how the two are implemented, there are three basic error detection implementations: the software-only (SW-only) approach, in which both path tracking and variable checking are implemented in software; the mixed software/hardware (mixed SW/HW) approach, in which path tracking, which causes significant time overhead, is moved into hardware while variable checking remains in software; and the hardware-only (HW-only) approach, in which both are performed in hardware. This paper introduces error detection approaches based on these basic implementations and discusses them in detail. Furthermore, since an application normally consists of a number of processes, error detection can be optimized by applying it to every process individually, i.e. an efficient overall implementation is achieved through the refinement of the error detection. Therefore, two optimization algorithms are presented as well: one for FPGAs supporting only static configuration, and one for FPGAs supporting PDR. The improvement achieved by the optimization is shown through experimental results.
1 Introduction

Errors are unavoidable in any system. If they are not detected in time, they can cause deviating results or even program crashes. Detecting errors is the only way to guarantee the validity of an application's execution. Error detection is therefore indispensable for any system, and especially for real-time systems, in which errors must be detected not only effectively but also efficiently. To achieve this goal, many error detection techniques have been developed. Each technique either causes a certain time overhead or requires a certain amount of hardware resources; some techniques incur both. In real-time systems, each application has a deadline. Because of this deadline, time overhead is a more important factor for error detection in real-time systems than hardware cost. Consequently, the time for error detection should be minimized in order to satisfy the deadline of the application. There are various ways to optimize this time; determining an appropriate error detection implementation for each process of the application in an intelligent manner is a particularly promising one.

The main focus of this paper is a systematic discussion of the error detection technique, including the corresponding approaches, and the explanation of an approach to the optimization of the error detection implementation.
2 Error Detection Technique

Although the traditional, so-called "one-size-fits-all" approach to error detection is capable of providing a certain error coverage, this coverage can be rather low and may not meet the expected requirements; i.e. the traditional approach is not able to supply sufficient reliability. Since every application has its own characteristics, the reliability provided by error detection can be dramatically improved if the EDI for a specific application is adjusted according to these characteristics. To take full advantage of the characteristics of each application, the application-aware technique has been developed.
2.1 Working Principle

The purpose of the application-aware technique is to improve the reliability of an application with the help of its characteristics, as stated above. This raises the next question: how is error detection implemented in the application-aware technique? The answer is as follows:

1. The first step is to identify critical variables in a program. A critical variable is defined as "a program variable that exhibits high sensitivity to random data errors in the application" [6].

2. Once critical variables have been identified, the backward program slice, defined as "the set of all program statements/instructions that can affect the value of the variable at a program location" [8], is extracted as the second step.

3. After the extraction of the backward program slice, checking expressions are generated during the optimization of each slice at compile time. These expressions are then inserted into the original code and are chosen by checking instructions to compare the results.
Thus, along with the execution of the original code, instructions for tracking control paths and the checking expressions are utilized to implement error detection. The above three steps briefly introduce the principle of the application-aware technique; more details are explained in Section 2.3.1.
2.2 Error Detection Implementations

In this paper, only transient faults are considered. Path tracking and variable checking can each be implemented either in software, potentially resulting in high time overheads, or in hardware, possibly exceeding the amount of available hardware resources. Based on the different implementation combinations of path tracking and variable checking, there are three types of error detection implementations:

• SW-only: In the SW-only implementation, both path tracking and variable checking are implemented in software. Compared with variable checking, path tracking causes significant time overhead when implemented in software. Hence, the time overhead of the SW-only implementation is the largest among all the error detection implementations. On the other hand, because all error detection is implemented in software, almost no hardware resources are needed.

• HW-only: In the HW-only implementation, both path tracking and variable checking are performed in hardware. Thus, the time overhead decreases considerably. But the disadvantage of the hardware implementation is equally obvious: a large amount of hardware is required, sometimes even beyond the amount of available hardware resources.

• Mixed SW/HW: Since path tracking causes significant time overhead, moving it into hardware is a natural way to reduce the overall overhead drastically. After this move, path tracking is performed in parallel with the execution of the application, so that plenty of time can be saved. The checking expressions for the critical variables remain in software, so the hardware requirement of the mixed SW/HW implementation is lower than that of the HW-only implementation. To some degree, the mixed SW/HW implementation can be regarded as a composition absorbing the advantages of both the SW-only and the HW-only implementation.

These basic error detection implementations are the foundation on which the error detection approaches (see Section 2.3) and the optimization of the error detection implementation (see Section 3) are realized.
2.3 Error Detection Approaches

In this section, two extreme error detection approaches are discussed: the complete SW-only approach and the complete HW-only approach. In the complete approaches, all error detection is implemented in software or performed in hardware, respectively. Given that the principle of path tracking in the mixed SW/HW approach is similar to the one in the complete HW-only approach, and likewise the principle of variable checking in the mixed SW/HW approach is similar to the one in the complete SW-only approach, the mixed SW/HW approach is not discussed separately here.
2.3.1 Complete SW-Only Approach

An approach to deriving error detectors using static analysis [1] of an application is presented in [6]. A detector is defined as "the set of all checking expressions for a critical variable, one for each acyclic, intraprocedural control path in the program" [6]. The main steps of deriving error detectors are as follows:

1. Identify critical variables in the program. Critical variables are the program variables with the highest fan-outs (defined as the number of forward dependencies). These variables are of prime importance, as errors in them can propagate to many locations in the program and result in program failure. If these variables can be protected, a larger error coverage can be achieved. The approach for identifying critical variables can be found in [5].

2. Compute the backward program slice of the critical variables. Starting with the instruction that computes the value of a critical variable, the static dependence graph of the program is traversed backwards to the beginning of the function. The backward program slice is specialized for each acyclic control path and consists of the instructions that can legally modify the critical variable.

3. Generate checking expressions through the optimization of the backward slice of the critical variables. These checking expressions are inserted into the program immediately after the computation of the critical variable. In order to choose the corresponding checking expression for each control path, the program is instrumented with tracking instructions that track the control paths.

4. Check at runtime. At runtime, the corresponding checks are performed at the appropriate points, while each control path is tracked. When the checks are executed, they recompute the value of the critical variable and compare it with the value computed by the original program. If the values do not match, the original program stops and initiates the recovery.
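The runtime check of step 4 can be illustrated with a toy sketch: a critical variable x is computed along one of two control paths, and the inserted detector recomputes it from the tracked path and compares. The program, the variable names and the per-path checking expressions are invented for illustration; in the real technique the expressions are derived from the backward slice at compile time.

```python
def compute_with_detector(s, t, w, take_branch):
    """Original computation of the critical variable x, followed by the
    inserted CVR-style check for the tracked control path."""
    # --- original program: x is the critical variable, path is tracked ---
    if take_branch:
        path, x = 1, w            # control path 1
    else:
        path, x = 2, s - 2 * t    # control path 2
    # --- inserted detector: recompute x along the tracked path ---
    x_check = w if path == 1 else s - 2 * t
    if x_check != x:
        # mismatch: a transient fault corrupted x or its slice
        raise RuntimeError("error detected, initiating recovery")
    return x
```

In a fault-free run both values match and the program continues; a fault that corrupts x (or any instruction in its slice) makes the comparison fail and triggers recovery.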
2.3.2 Complete HW-Only Approach

The technique mentioned in Section 2.3.1 is called the Critical Variable Recomputation (CVR) technique. While Section 2.3.1 described the complete software implementation of CVR, the approach explained in this section, introduced in [4], is its hardware implementation. The core part of this approach is the Static Detector Module (SDM), which consists of a path tracking submodule, a checking submodule and, if necessary, an argument buffer called ARGQ, as shown in Figure 1.
Figure 1: Static Detector Module [4]. (Block diagram: the Leon3 core issues the CHK instructions check, emitEdge, enterFunc and leaveFunc, together with their arguments, to the SDM, which contains the Checking submodule, the Path Tracking submodule, the StateStack and the ARGQ buffer.)

The path tracking submodule tracks the control path and indicates which instruction
is currently being executed, in order to indicate which operations should be recomputed next. This submodule consists of hardware state machines and a stack structure, the StateStack. Each state machine corresponds to a particular check and is constantly updated during program execution. For each state machine, a corresponding stack is set up in the StateStack; the StateStack is thus a set of individual stacks. The benefit of this structure is that the overhead of accessing the StateStack is minimized, because each stack can be accessed in parallel with the others. Three types of CHK instructions, which are viewed as analogous to no-operation instructions, are recognized by the path tracking submodule:
• emitEdge(src,dest): This instruction is issued at branches during program execution. Both of its arguments, src and dest, are inserted into the ARGQ buffer, and according to these arguments the state machines for path tracking are updated.

• enterFunc: This instruction is issued when the program enters a function. In this case, the current states of the state machines are pushed onto the StateStack.

• leaveFunc: Corresponding to enterFunc, leaveFunc is issued when the program leaves a function. In this case, the states stored on the StateStack are popped, and the state machines are thereby restored to their previous states.
The Checking submodule is responsible for recomputation in parallel with program execution and for determining when to recompute. In contrast to the path tracking submodule, only one type of CHK instruction is recognized by the checking submodule:
• check(num): This instruction is issued when a check needs to be performed. The argument num indicates the ID of the check. As shown in Figure 1, the checking submodule receives the output of the path tracking submodule and, with the help of this output, executes the appropriate check.
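The interplay of the three path-tracking instructions with the StateStack can be sketched as a small software model. The transition table, the resetting of the state machines on function entry, and all names are assumptions made for illustration; the real submodule is a set of hardware state machines as described in [4]:

```python
class PathTracker:
    """Toy model of the SDM path-tracking submodule: one state machine per
    check, all updated on emitEdge, saved/restored around function calls via
    the per-machine stacks that make up the StateStack."""

    def __init__(self, transitions, n_checks):
        self.transitions = transitions                    # {(state, (src, dest)): next_state}
        self.state = [0] * n_checks                       # one state machine per check
        self.state_stack = [[] for _ in range(n_checks)]  # StateStack: one stack per machine

    def emit_edge(self, src, dest):
        """Update every state machine according to the taken branch edge."""
        for i, s in enumerate(self.state):
            self.state[i] = self.transitions.get((s, (src, dest)), s)

    def enter_func(self):
        """Push the current states and restart tracking inside the callee."""
        for i, stack in enumerate(self.state_stack):
            stack.append(self.state[i])
            self.state[i] = 0

    def leave_func(self):
        """Pop the saved states, restoring the machines to their previous states."""
        for i, stack in enumerate(self.state_stack):
            self.state[i] = stack.pop()

    def check(self, num):
        """Tell the checking submodule which path-specific expression to use."""
        return self.state[num]
```

Because each stack in the StateStack belongs to exactly one state machine, all of them can be pushed and popped in parallel in hardware, which is the access-overhead benefit noted above.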
3 Optimization of Error Detection Implementation

The error detection approaches elaborated in the previous section provide error detection for applications at a certain time cost, under the limitation of the available hardware resources. For real-time systems, however, there is an additional timing requirement: the execution of an application, together with its error detection, must be finished before the deadline. In consideration of this point, error detection has to be accelerated, i.e. optimized, in order to reduce the overall execution time. How can an efficient implementation of error detection be achieved? The general idea is to determine an appropriate error detection implementation for each process of the application according to various factors. In this section, all relevant aspects of the optimization are explained. First, the general optimization framework is illustrated. Next, the system model is explained. Finally, two optimization algorithms are given to show how the error detection implementations can be optimized.
3.1 Optimization Framework
Figure 2: Framework Overview [3]
Figure 2 shows an overview of the general framework. The component emphasized in bold is the optimization framework presented in this section. The function of each component, including the optimization framework, is explained below. The goal is to minimize the worst-case schedule length (WCSL) of the application under hardware constraints.
• C code: represents the initial application.
Error Detection Technique and its Optimization for Real-Time Embedded Systems
• Process graphs: are obtained from the initial application and specify the precedence relationships among all processes.
• Error detection instrumentation framework: processes the initial application code by embedding error detectors into the code, and estimates the time overheads and hardware costs using the instrumented code.
• Optimization framework: takes as its input the process graphs, the overheads computed by the error detection instrumentation framework, the mapping of processes to computation nodes and the system hardware architecture. As its output, the optimization framework produces an error detection implementation that is close to the optimal one.
• Fault-tolerant schedule synthesis tool: generates the worst-case schedule length (WCSL) as the cost function, according to the optimization result. More details about this tool will be explained in Section 3.2.
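As an illustration of the data flowing between these components, consider the following sketch; all class and field names are assumptions made here for illustration, not the interface of the actual tool chain:

```python
# Illustrative data model of the optimization framework's inputs and outputs.
# Every name in this sketch is an assumption, not the real tool interface.
from dataclasses import dataclass

@dataclass
class OptimizationInput:
    process_graph: dict   # Pj -> list of successor processes
    overheads: dict       # Pj -> {EDI name: (time overhead, HW cost)}
    mapping: dict         # Pj -> computation node
    architecture: list    # computation nodes (each with an FPGA)

@dataclass
class OptimizationResult:
    edi_assignment: dict  # Pj -> chosen EDI
    wcsl: int             # cost value from the fault-tolerant schedule synthesis

inp = OptimizationInput(
    process_graph={"P1": ["P2"], "P2": []},
    overheads={"P1": {"SW-only": (180, 0)}, "P2": {"SW-only": (90, 0)}},
    mapping={"P1": "N1", "P2": "N1"},
    architecture=["N1"],
)
res = OptimizationResult(edi_assignment={"P1": "SW-only", "P2": "SW-only"}, wcsl=0)
print(len(inp.process_graph))  # -> 2
```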
3.2 Synthesis of Fault-Tolerant Schedules

In [2] an approach to the generation of fault-tolerant schedules is proposed. The input of the algorithm consists of the process graph obtained from the application, the worst-case execution times (WCET) of the processes, the worst-case transmission times (WCTT) of the messages, the error detection and recovery overheads for each process, the architecture on which the application is mapped, and the maximum number k of transient faults that can affect the system during one period. To tolerate these faults, re-execution is used: once a fault is detected, the initial state of the affected process is restored and the process is re-executed. The output of the algorithm is a set of schedule tables that capture the alternative execution scenarios corresponding to possible fault occurrences. Among all fault scenarios there is one that is worst in terms of schedule length; its schedule length is called the worst-case schedule length (WCSL), which must meet the deadline of the application.
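To make the re-execution model concrete, a minimal sketch follows. The function name and the exact composition of the overheads are assumptions for illustration; the real schedule synthesis of [2] distributes the k faults over all processes of a schedule, not just one.

```python
def worst_case_length(wcet: int, detection_overhead: int,
                      recovery_overhead: int, k: int) -> int:
    """Worst-case length of one process if all k transient faults hit it:
    every execution pays the error detection overhead, and each of the k
    detected faults triggers a state restoration plus one re-execution."""
    one_run = wcet + detection_overhead
    return (k + 1) * one_run + k * recovery_overhead

# Example: WCET = 60, detection adds 20, recovery costs 10, at most k = 2 faults.
print(worst_case_length(60, 20, 10, 2))  # -> 3 * 80 + 2 * 10 = 260
```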
3.3 System Model
Taking from [3]: "a set of real-time applications Ai is considered, modeled as acyclic directed graphs Gi(Vi, Ei) and executed with period Ti. The graphs Gi are merged into a single graph G(V, E), having the period T equal with the least common multiple of all Ti. This graph corresponds to a virtual application A. Each vertex Pj ∈ V represents a process, and each edge ejk ∈ E, from Pj to Pk, indicates that the output of Pj is an input for Pk. Processes are non-preemptable and all data dependencies have to be satisfied before a process can start executing. A global deadline D is considered, representing the time interval during which the application A has to finish."
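The merging of the periodic graphs can be sketched as follows; the dictionary encoding of the graphs and the example periods are illustrative assumptions:

```python
from math import lcm  # Python 3.9+

# Two toy applications as acyclic directed graphs: process -> list of successors.
G1 = {"P1": ["P2"], "P2": []}   # executed with period T1
G2 = {"P3": ["P4"], "P4": []}   # executed with period T2
T1, T2 = 20, 30

# Merged virtual application A: one graph G, period T = lcm of all periods.
G = {**G1, **G2}
T = lcm(T1, T2)
print(T)  # -> 60
```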
Figure 3 gives an intuitive illustration of the system model. P1 to P4 are four processes in an application, and m1 and m2 are two messages sent from one process to another. Figure 3c and e show the distributed architecture on which the application runs. It is composed of a set of computation nodes connected to a bus. In Figure 3a, b and d, the
Wei Cao
processes are mapped to these nodes, and the mapping is illustrated with shading. Each node consists of a central processing unit, a communication controller, a memory subsystem, and also includes a reconfigurable device (FPGA). For all messages sent over the bus (between processes mapped on different computation nodes), their worst-case transmission time (WCTT) is given. Such a transmission is modeled as a communication process inserted on the edge connecting the sender and the receiver process.
Three error detection implementations (see Section 2.2) are considered for each process in the application, and any of them may be selected and applied to any process.
Table 1: WCET and overheads [3]

Proc.  WCET | SW-only: WCETi hi ρi | Mixed HW/SW: WCETi hi ρi | HW-only: WCETi hi ρi
P1     60   |   240    0   0       |   100      15   20       |   80     40   45
P2     50   |   140    0   0       |    80      15   20       |   60     40   45
P3     40   |   150    0   0       |    60      10   15       |   50     30   35
P4     30   |   100    0   0       |    60      15   20       |   40     40   45

Figure 3: System Model [3]
3.4 EDI Optimization
Based on the different characteristics of FPGAs with static reconfiguration and FPGAs with PDR capabilities, two alternative optimization solutions, each based on a Tabu Search heuristic [7], will be proposed. Before the optimization algorithms are described in Section 3.4.4 and Section 3.4.5, some concepts used inside the algorithms are introduced in Section 3.4.1, Section 3.4.2 and Section 3.4.3.

3.4.1 Moves

Two types of moves are used in the algorithms: simple moves and swaps. A simple move applied to a process is defined as the transition from its current error detection implementation to an adjacent one in the ordered set H = {SW-only, mixed HW/SW, HW-only}, while a swap consists of two "opposite" simple moves concerning two processes mapped onto the same computation node. A swap is needed because of the hardware limitation on each computation node: the EDI of one process has to be moved from hardware towards software before the EDI of another process can be implemented more in hardware.
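These moves can be sketched in a few lines; the solution encoding (a process-to-EDI dictionary) and the function names are assumptions for illustration:

```python
H = ["SW-only", "mixed HW/SW", "HW-only"]  # ordered set of EDIs

def simple_moves(solution, process):
    """Transitions of one process to an EDI adjacent to its current one in H."""
    i = H.index(solution[process])
    for j in (i - 1, i + 1):
        if 0 <= j < len(H):
            neighbour = dict(solution)
            neighbour[process] = H[j]
            yield neighbour

def swaps(solution, p, q):
    """Two 'opposite' simple moves for processes p and q on the same node:
    one EDI moves towards software exactly when the other moves towards hardware."""
    for after_p in simple_moves(solution, p):
        for after_pq in simple_moves(after_p, q):
            delta_p = H.index(after_pq[p]) - H.index(solution[p])
            delta_q = H.index(after_pq[q]) - H.index(solution[q])
            if delta_p == -delta_q:
                yield after_pq

sol = {"P1": "SW-only", "P2": "HW-only"}
print([m["P1"] for m in simple_moves(sol, "P1")])  # -> ['mixed HW/SW']
```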
3.4.2 Selection of the Best Move

In the algorithms, the operation "select the best move" has to be performed, i.e. the best move has to be selected from among all possible moves. But considering the
efficiency of the algorithms, only moves that affect processes on the critical path of the worst-case schedule of the current solution are explored. When the best move needs to be selected, the processes on the critical path of the current solution are first identified; then the search for the best move proceeds according to the following criteria:
1. Simple moves into HW are explored first; if such a move is not possible, swap moves are tried.
2. If there exist moves that are not tabu, whether simple or swap, the move producing the best improvement is selected and the exploration of further moves stops.
3. If the WCSL gets closer to a minimum with the help of this move, the move is accepted. If no such simple or swap move exists, the search has to be diversified.
3.4.3 Diversification Strategy

The diversification strategy in the algorithms consists of a continuous diversification strategy and a restart strategy. The former uses an intermediate-term frequency memory to guarantee that a process which has not been involved in a move for a long time will eventually be selected. Complementary to the continuous diversification strategy, the restart strategy restarts the search process if there is no improvement of the best known solution for a certain number of iterations.
3.4.4 EDI with static configuration

Figure 4 shows the pseudocode of the optimization algorithm for EDI with static configuration. The EDI assignment optimization algorithm for FPGAs with static reconfiguration begins with a random initial solution, which is taken as the current best solution. Next, the WCSL of this solution is calculated for evaluation, and the tabu list is initialized as empty. After recording the WCSL, the algorithm selects the best move among the possible moves for the current solution. This best move is then put into the tabu list and applied to the current solution. Once the current solution has been updated, the new WCSL is recalculated, and based on a comparison of WCSLs it is decided whether the current best solution is updated. Following the diversification strategy, if no improvement occurs for a certain number of iterations, the search process is restarted. Finally, when the maximal allowed number of iterations has been reached, the algorithm returns the current best solution and stops.
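The loop of Figure 4 can be made concrete with a small runnable sketch. It reuses the values of Table 1 but deliberately simplifies the cost function: all four processes run serially on one node, the WCSL is just the sum of the chosen WCETi, the hardware constraint is a simple budget on the sum of the hi values, and the diversification step is omitted. It is therefore an illustration of the tabu loop, not the actual fault-tolerant schedule synthesis.

```python
H = ["SW-only", "mixed HW/SW", "HW-only"]

# process -> {EDI: (WCETi, hi)}, values taken from Table 1.
TABLE = {
    "P1": {"SW-only": (240, 0), "mixed HW/SW": (100, 15), "HW-only": (80, 40)},
    "P2": {"SW-only": (140, 0), "mixed HW/SW": (80, 15),  "HW-only": (60, 40)},
    "P3": {"SW-only": (150, 0), "mixed HW/SW": (60, 10),  "HW-only": (50, 30)},
    "P4": {"SW-only": (100, 0), "mixed HW/SW": (60, 15),  "HW-only": (40, 40)},
}
HW_BUDGET = 60  # assumed FPGA area available for error detection

def wcsl(sol):
    return sum(TABLE[p][sol[p]][0] for p in sol)   # serial schedule on one node

def hw_cost(sol):
    return sum(TABLE[p][sol[p]][1] for p in sol)

def neighbours(sol):
    """Simple moves to an adjacent EDI that respect the HW budget."""
    for p in sol:
        i = H.index(sol[p])
        for j in (i - 1, i + 1):
            if 0 <= j < len(H):
                n = dict(sol)
                n[p] = H[j]
                if hw_cost(n) <= HW_BUDGET:
                    yield (p, H[j]), n

def edi_optimization(max_iterations=100, tabu_tenure=4):
    current = {p: "SW-only" for p in TABLE}        # feasible initial solution
    best, best_wcsl = dict(current), wcsl(current)
    tabu = []                                      # recently applied moves
    for _ in range(max_iterations):
        candidates = [(m, n) for m, n in neighbours(current) if m not in tabu]
        if not candidates:
            break
        move, current = min(candidates, key=lambda mn: wcsl(mn[1]))
        tabu = (tabu + [move])[-tabu_tenure:]      # short-term tabu memory
        if wcsl(current) < best_wcsl:
            best, best_wcsl = dict(current), wcsl(current)
    return best, best_wcsl

best, length = edi_optimization()
print(length)  # -> 300
```

Under this toy cost model the search settles on the mixed HW/SW EDI for all four processes (total HW cost 55 ≤ 60); every HW-only upgrade is rejected because it would exceed the budget.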
3.4.5 EDI with PDR FPGAs

Since FPGAs now support partial dynamic reconfiguration (PDR), it is possible to overlap the execution of one process with the reconfiguration of another process's error detector module. Thus, the
EDI_Optimization(G, N, M, W, C, k)
  best_Sol = current_Sol = Random_Initial_Solution();
  best_WCSL = current_WCSL = WCSL(current_Sol);
  Tabu = Ø;
  while (iteration_count < max_iterations) {
    best_Move = Select_Best_Move(current_Sol, current_WCSL);
    Tabu = Tabu U {best_Move};
    current_Sol = Apply(best_Move, current_Sol);
    current_WCSL = WCSL(current_Sol); Update(best_Sol);
    if (no_improvement_count > diversification_count)
      Restart_Diversification();
  }
  return best_Sol;
end EDI_Optimization

Figure 4: Optimization Algorithm of EDI with Static Configuration [3]
WCSL of the application can be further improved. Because of the limited hardware resources, however, this is not always possible: in such cases the reconfiguration of the error detector module of a process has to wait until the execution of another process is done. In the optimization algorithm for EDI with PDR FPGAs, the scheduling of processes on the processors and the placement of the corresponding EDIs on the FPGAs are performed simultaneously. The fault-tolerant schedule synthesis tool discussed in Section 3.2 cannot be used directly for this, because the particular issues related to PDR are not taken into account by the priority function of this tool, which decides the order of process execution. So, under the PDR assumptions, the priority function has to be modified. The new priority function for the optimization algorithm is:

f(EST, WCET, area, PCP) = x × EST + y × WCET + z × area + w × PCP

In this priority function, the parameter EST (earliest execution start time of a process) gives information about the placement and reconfiguration of EDI modules on the FPGA, WCET and the EDI area characterize the EDI of each process, and PCP captures the particular characteristics of each application. The value of each coefficient (x, y, z, w) in the priority function lies between -1 and 1, with a step of 0.25. Because of these coefficients, a new type of move concerning the weights x, y, z and w can be added. Thus, under the assumptions of PDR FPGAs, the optimization algorithm for FPGAs with static configuration is extended as follows: in each iteration, different values for the weights are explored before different EDI assignments to processes. It is first checked whether changing the coefficient values yields a better priority function leading to a smaller WCSL. If not, different EDI assignments to processes are explored, exactly as in the previous optimization algorithm.
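The weight grid and the priority computation can be sketched as follows; the example parameter values for a process are invented for illustration:

```python
from itertools import product

STEPS = [i * 0.25 for i in range(-4, 5)]   # -1.0, -0.75, ..., 0.75, 1.0

def priority(weights, est, wcet, area, pcp):
    """f(EST, WCET, area, PCP) = x*EST + y*WCET + z*area + w*PCP"""
    x, y, z, w = weights
    return x * est + y * wcet + z * area + w * pcp

print(len(STEPS))                            # -> 9 values per coefficient
print(len(list(product(STEPS, repeat=4))))   # -> 9**4 = 6561 weight combinations
print(priority((1.0, 0.5, -0.25, 0.25), est=10, wcet=60, area=15, pcp=2))  # -> 36.75
```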
4 Experimental Results

In [3], experiments were performed on synthetic examples to show the results of applying the optimization algorithms. Process graphs were generated with 20, 40, 60, 80, 100 and 120 processes each, mapped on architectures consisting of 3, 4, 5, 6, 7 and 8 nodes, respectively. 15 graphs were generated for each application size, out of which 8 have a random structure and 7 have a tree-like structure. Worst-case execution times for processes were assigned randomly within the range of 10 to 250 time units.
To determine the time overheads and hardware costs for each EDI, two experiment classes were generated: the first one, testcase 1, was based on the estimation of overheads done by Pattabiraman et al. in [6] and by Lyle et al. in [4]. For the other one, testcase 2, the hardware was assumed to be slower; thus, to reach the same time overheads as in testcase 1, more hardware is required. Figure 5 shows the ranges used for randomly generating the overheads. Figure 5a shows the ranges for testcase 1. As
Figure 5: Ranges for random generation of EDI overheads [3]
shown, for the SW-only EDI, the time overhead ranges from a minimum of 80% to a maximum of 300% of the worst-case execution time of the corresponding process, while the HW cost is zero. For the mixed HW/SW EDI, the time overhead range is between 30% and 70%, and the HW cost range is between 5% and 15%. Finally, for the HW-only EDI, the time overhead range decreases to 5%-25%, while the range for the HW cost increases to 50%-100%. In Figure 5b, the time overhead ranges stay the same, but the HW cost ranges are pushed more to the right.
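As a sketch, overheads in the spirit of testcase 1 could be drawn like this; only the ranges come from the text, while the generator itself is an assumption:

```python
import random

# EDI -> ((time overhead range as a fraction of the WCET), (HW cost range))
RANGES_TESTCASE1 = {
    "SW-only":     ((0.80, 3.00), (0.00, 0.00)),
    "mixed HW/SW": ((0.30, 0.70), (0.05, 0.15)),
    "HW-only":     ((0.05, 0.25), (0.50, 1.00)),
}

def random_overheads(wcet, rng=random):
    """Draw a (time overhead, HW cost) pair per EDI, uniformly in its ranges."""
    result = {}
    for edi, ((t_lo, t_hi), (h_lo, h_hi)) in RANGES_TESTCASE1.items():
        result[edi] = (wcet * rng.uniform(t_lo, t_hi), rng.uniform(h_lo, h_hi))
    return result

random.seed(1)
overheads = random_overheads(wcet=100)
print(80 <= overheads["SW-only"][0] <= 300)  # -> True
```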
4.1 Results for static reconfiguration
Here the SW-only EDI serves as the baseline to show the results after the optimization algorithm for FPGAs with static configuration is applied. To show the effectiveness of the optimization algorithm, the results it generates (indicated by "heuristic" in Figure 6) were compared with the theoretical optimum generated by a Branch and Bound (BB) algorithm. The performance improvement (PI) was calculated as follows:

PI = (WCSLbaseline − WCSLstatic) / WCSLbaseline × 100%
execution time overheads and <strong>the</strong> HW cost overheads for <strong>the</strong> proc-<br />
We would also like to p<br />
esses in our syn<strong>the</strong>tic examples are distributed uniformly in <strong>the</strong><br />
we can reduce <strong>the</strong> WCS<br />
intervals depicted in Figure 12a (testcase1) and Figure 12b (test-<br />
ment >50%), for testcas<br />
case2).<br />
WCSL by half, we need<br />
We also varied <strong>the</strong> size <strong>of</strong> every FPGA 28<br />
available for placement <strong>of</strong><br />
to <strong>the</strong> assumptions we m<br />
error detection. We proceeded as follows: we sum up all <strong>the</strong> HW<br />
(see Figure 12), namely<br />
cost overheads corresponding to <strong>the</strong> HW-only implementation, for<br />
need more HW in order<br />
all processes <strong>of</strong> a certain application:<br />
case1. As we can see from<br />
HW<br />
only<br />
1<br />
HW<br />
cost<br />
x100%<br />
Average improvement<br />
70%<br />
60%<br />
50%<br />
40%<br />
30%<br />
20%<br />
10%<br />
0%<br />
5%:<br />
10%:<br />
15%:<br />
testcase<br />
20%:<br />
25%:<br />
30%
d<br />
SW<br />
testcase2<br />
HW<br />
only<br />
0.55 0.75<br />
1<br />
HW<br />
cost<br />
x100%<br />
The WCSL_static is the worst-case schedule length calculated by the optimization algorithm, while the WCSL_baseline is that of the SW-only baseline solution. Figure 6 shows the final results. From
Figure 6: Comparison with theoretical optimum [3]
a general view, for testcase 1, the biggest difference between the optimization algorithm and the optimum reached 1%, while for testcase 2, the biggest difference went up to 2.5%.

4.2 PDR Approach

Here the efficiency of implementing error detection on FPGAs with partial dynamic reconfiguration was tested; the experiment setup was the same as in the static case. The efficiency was evaluated through comparison with the results of the static approach. Similarly to the static approach, the performance improvement is described as:

PI_PDR = (WCSL_static - WCSL_PDR) / WCSL_static × 100%

The WCSL_PDR is the result generated by the optimization algorithm for the FPGA with PDR. Figure 7 shows the final results. Through the comparison with the static approach, one can observe that the schedule length is shortened by up to 36% for testcase 1 (with a HW fraction of 5%) and by up to 34% for testcase 2 (with a HW fraction of 25%).

5 Conclusion

For error detection implementation, the SW-only approach, in which both path tracking and variable checking are implemented in software, does not require hardware resources, but it leads to considerable performance overhead; the HW-only approach, in which both path tracking and variable checking are performed in hardware, reduces the performance overhead, but it may lead to costs sometimes exceeding the amount of available resources. Since
Figure 7: Improvement - PDR over Static Approach [3]

each application consists of a certain number of processes, EDI can be applied to each process. Through the optimization of the EDI for each process, the optimization of the
WCSL for the application can be achieved. Two optimization algorithms are introduced, one for EDI on FPGA with static configuration and the other for EDI on FPGA with PDR. For EDI on FPGA with static configuration, the optimization algorithm assigns different EDIs to processes to minimize the WCSL, while the optimization algorithm for EDI on FPGA with PDR explores different weight values of the priority function before the assignment of EDIs to processes. Experimental results have shown the improvement of the WCSL of the application after applying the corresponding algorithms and proved their effectiveness.
References

[1] D. Evans, J. Guttag, J. Horning, and Y.M. Tan. LCLint: A tool for using specifications to check code. In ACM SIGSOFT Software Engineering Notes, volume 19, pages 87–96. ACM, 1994.

[2] V. Izosimov, P. Pop, P. Eles, and Z. Peng. Synthesis of fault-tolerant schedules with transparency/performance trade-offs for distributed embedded systems. In Proceedings of the conference on Design, Automation and Test in Europe, pages 706–711. European Design and Automation Association, 2006.
[3] A. Lifa, P. Eles, Z. Peng, and V. Izosimov. Hardware/software optimization of error detection implementation for real-time embedded systems. In Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2010 IEEE/ACM/IFIP International Conference on, pages 41–50. IEEE, 2010.

[4] G. Lyle, S. Chen, K. Pattabiraman, Z. Kalbarczyk, and R. Iyer. An end-to-end approach for the automatic derivation of application-aware error detectors. In Dependable Systems & Networks, 2009. DSN'09. IEEE/IFIP International Conference on, pages 584–589. IEEE, 2009.

[5] K. Pattabiraman, Z. Kalbarczyk, and R.K. Iyer. Application-based metrics for strategic placement of detectors. In Dependable Computing, 2005. Proceedings. 11th Pacific Rim International Symposium on, pages 8 pp. IEEE, 2005.

[6] K. Pattabiraman, Z.T. Kalbarczyk, and R.K. Iyer. Automated derivation of application-aware error detectors using static analysis: The Trusted ILLIAC approach. Dependable and Secure Computing, IEEE Transactions on, 8(1):44–57, 2011.

[7] C.R. Reeves. Modern heuristic techniques for combinatorial problems. John Wiley & Sons, Inc., 1993.

[8] F. Tip. A survey of program slicing techniques. 1994.
CPU vs. GPU: Which One Will Come Out on Top?
Why There is no Simple Answer
Denis Dridger
University of Paderborn
dridger@mail.upb.de
January 12, 2012
Abstract

Today's applications need to process an enormous amount of data due to ever-growing user requirements. Since traditional single-core CPUs have reached their speed limits, vendors nowadays provide powerful multi-core architectures to cope with the computation load. Although these architectures provide significant speedups compared to single-core CPUs, another trend has emerged in the past few years: performing general purpose computations on graphics processing units (GPUs). The fast-paced evolution of GPUs makes ever more computing power available along with a reasonable programming model. Ever since, many publications have presented phenomenal speedups of up to several hundred fold over CPUs.

In this paper we take a critical look at those claims and clarify that such speedups should be interpreted carefully. In doing so we discuss the question whether achieving such speedups is realistic or just a myth. There are many parameters that should be considered when conducting speedup measurements in order to obtain a meaningful result. Unfortunately, many publications omit or conceal important details such as the time for data transfers between GPU and CPU or the optimizations performed on the CPU code. In fact we find that many reported speedups might easily decrease by a factor of 10 or more if such considerations were made.
1 Introduction

Today, applications require immense computing power to satisfy the ever-growing needs of the high-performance computing community. In recent years the computing industry recognized that traditional single-core architectures cannot meet these demands anymore, and began to move toward multi-core and many-core systems [3]. Given the fact that parallelism is the future of computing, hardware designers continuously focus on adding more processing cores. The recent trend is to perform high-performance computations also on graphics processing units (GPUs). GPUs have evolved into powerful graphics engines, which feature programmability as well as peak arithmetic performance and memory bandwidth that can compete with modern CPU architectures [1]. The number of available processing units in a GPU exceeds the number of available CPU cores by far. For example, NVIDIA's GTX 280 graphics card (which is not a high-end GPU anymore) possesses 240 processing units, while Intel's Core i7 CPU provides only 4 cores. In addition, GPU vendors also provide powerful programming models that enable the user to port many applications to the GPU and leverage its massive parallel computing power. The most notable programming model is NVIDIA's Compute Unified Device Architecture (CUDA) [7], which allows programming GPUs in a C-like language. After CUDA's appearance in 2007, many researchers grabbed the opportunity to accelerate diverse algorithms on GPUs and reported significant speedups as high as 100X and far beyond, compared to CPU based approaches.

However, Lee et al. [14] claim that achieving such speedups is a myth. Although this paper is very recent, it has already become immensely popular. Motivated by this publication, we take an objective look at it, as well as at many other papers that debate CPU vs. GPU performance. In doing so we try to find evidence that supports or contradicts this claim. Studying different publications that report on

• speedups that have been achieved on GPUs ([9, 12, 13, 18, 20, 22, 23, 24, 25, 27])
• optimization opportunities for CPU and GPU ([8, 17, 19])
• considerations when conducting performance comparisons between CPU and GPU ([2, 10, 14, 26])

we find that many papers in fact do not provide completely fair performance comparisons or conceal important details concerning them. The study shows that there is a number of parameters that influence the performance comparison results, which implies that reported results should be interpreted carefully. In many cases it is not very meaningful to say that the GPU is X times faster than the CPU because of the following parameters:

• used hardware (e.g. single-threaded CPU vs. high-end GPU)
• performed optimizations (e.g. non-optimized CPU code vs. optimized GPU code)
• consideration of data transfers between CPU and GPU
• used application (e.g. serial code vs. highly parallel code)
• intention of the author (e.g. CPU vendor vs. GPU vendor)

In this work we discuss the above mentioned influence parameters and try to answer the question whether achieving such great speedups is a myth or really possible. The answer is: it depends! Though it is not possible to provide a definite answer to this question, this work provides some interesting insights that may help to understand where tremendous speedups of more than 100X might come from.
The remainder of this paper is structured as follows. The next section introduces the new trend of performing computations on GPUs. It covers a brief overview of the CUDA programming model and several examples of applications for which great speedups have been achieved. Section 3 provides information on technical aspects of CPUs and GPUs and highlights the differences between the two platforms. Here the features of each platform are described on a level that is reasonable for understanding the differences between the platforms as well as their approaches to processing data. The next two sections form the core of the paper. Section 4 tries to clarify why comparing the performance of CPU and GPU is not an easy task. In particular, it explains why the results of such comparisons may vary from paper to paper by several orders of magnitude. In section 5 we impartially discuss the claim that achieving 100X GPU speedups is just a myth, as suggested by Lee et al. [14], with the help of our previous considerations. Finally, the work is concluded in section 6.
2 The New Trend: General Purpose Computing on GPUs

The GPU is no longer just a fixed-function processor designed to accelerate 3D applications. Over the past few years the GPU has evolved into a highly parallel and flexibly programmable processor featuring special purpose arithmetic units. With GPUs one gets much computing power for low cost. Today's GPUs can provide a peak performance of over 1 TFlop/s and a peak bandwidth of over 100 GiB/s [9]. Figure 1 shows the performance increase over the past few years. As the figure suggests, the theoretical performance nearly doubled each year, which attracted the interest of more and more application developers and researchers.

Another very important reason why today's GPUs are so attractive is their programmability. With the appearance of CUDA, programmers no longer need to deal with cumbersome graphics APIs (that were actually designed to handle polygons and pixels) when porting an application to the GPU. CUDA is probably the best known and most used programming model that is currently available. All studied publications concerning GPU performance or optimizations use CUDA; therefore we will also focus on CUDA and NVIDIA's GPU architecture in this work.

CUDA also refers to NVIDIA's hardware architecture, which is tightly coupled to the programming model [7]. The hardware architecture is introduced in the next section. In
Figure 1: GPU performance increase over the years. Figure is adapted from [5].

this section we want to take a brief look at CUDA, the programming model, and some application examples for which notable speedups have been achieved using CUDA.

2.1 The CUDA Programming Model

In the CUDA model a GPU is considered an accelerator that is capable of executing parallel code and special purpose code like mathematical arithmetic. The code that shall be accelerated on the GPU is referred to as a kernel. CUDA programs are basically C programs with extensions to leverage the GPU's parallelism and consist of two parts: the non-critical part that shall run on the CPU and the critical part, the kernel, that shall run on the GPU. When executing a kernel, the GPU runs many threads concurrently, each of which executes the same program on different data. This approach is known as SPMD (Single Program, Multiple Data). An illustration of the thread execution in the CUDA model is shown in Figure 2.

CUDA programs consist of mixed code for CPU and GPU. The CPU (host) code is an ordinary C program, whereas the GPU code is written as a C kernel, using additional keywords and structures. In addition, there are several restrictions on the kernel code: no recursion, no static variables and no variable number of function parameters. Both code fragments are compiled separately by the NVIDIA CUDA C compiler, as shown in Figure 3. The kernel execution on the GPU is launched by the host. The host code is also responsible for transferring data to and from the GPU's global memory, with the help of special API calls.
Figure 2: The CUDA model considers the CPU as host, which runs code with no/low parallelism. The GPU is treated as an accelerator, which executes parallel code by running thousands of threads at the same time. Figure is adapted from [12].

Figure 3: The CUDA compilation flow. Figure is adapted from [19].
2.2 Application Examples

Over the past few years researchers have ported different applications to the GPU, in particular using CUDA. The accelerated applications come from various areas including engineering, medicine, finance, cryptography and multimedia. In the majority of cases the applied algorithms solve problems that deal with searching, sorting, mathematical computations and image processing.

Next, several examples of accelerated algorithms are presented that were taken from recent publications. Although there were no special criteria for selecting the papers, most of the chosen publications report significant speedups compared to corresponding CPU implementations. At this point we do not want to consider performance comparison details such as the exact hardware used or the optimizations performed. We will take a closer look at these details in sections 4 and 5. In all cases, the algorithms were implemented on high-end (or almost high-end) NVIDIA GPUs that were available at that time. The
corresponding CPU implementations, in contrast, were run on high-end CPUs only in the best case. In addition, these implementations were optimized questionably or not optimized at all.

• The sparse matrix-vector product (SpMV) is of great importance in linear algebra and hence in engineering and scientific programs. There has been much work improving the performance of SpMV on various systems in the last years. Vazquez et al. [24] implemented SpMV on the GPU and achieved a speedup of 30X.
• The fast Fourier transform (FFT) is also a very important algorithm, which transforms signals from the time domain into the frequency domain. Naga et al. [9] achieved a speedup of 40X.
• The fast multipole method (FMM) is widely used for problems arising in diverse areas (molecular dynamics, astrophysics, acoustics, fluid mechanics, electromagnetics, scattered data interpolation etc.) because of its ability to compute dense matrix-vector products in linear time and memory with a fixed prescribed accuracy. Gumerov et al. [11] achieved a speedup of 60X.
• Database operations also have parallelization potential. Bakkum et al. [4] implemented a subset of the SQLite command processor on the GPU and achieved speedups between 20X and 70X.
• Password recovery algorithms provide excellent opportunities to exploit parallelism since passwords can be checked independently. Hu et al. [12] and Phong et al. [18] achieved speedups of over 50X and 170X, respectively.
• Image processing is another important application domain, which promises good speedup results due to the low data dependency. Zhiyi et al. [27] achieved speedups of up to 200X.
• The sum-product or "marginalize a product of functions" problem is a rather simple kernel, which is used in different real-life applications. Silberstein et al. [22] achieved a speedup of 270X.
3 Differences Between Today's CPUs and GPUs

In this section we highlight the differences between the two platforms and try to state some reasons why computing on GPUs may be a reasonable option.

3.1 The CPU

CPUs are designed to support a wide variety of applications, which can be single-threaded or multi-threaded. In order to improve the performance of single-threaded applications, the
CPU makes use of instruction-level parallelism, where several instructions can be issued at the same time. Multi-threaded applications may leverage additional cores along with SIMD (Single Instruction, Multiple Data) technology. Modern CPUs possess four to eight cores, run at frequencies above 3 GHz and provide other useful features such as branch prediction. Intel's Hyper-Threading technology allows a single physical processor to execute two heavyweight threads (processes) at the same time, dynamically sharing the processor resources [15]. An example of such a processor is Intel's Core i7 CPU, which is used by Lee et al. in [14] to show that CPUs can/might compete against GPUs.

However, providing all these architectural advances in order to support general purpose computing well results in rather complex chips, and thus large chip areas, which in turn limits the number of cores that can be placed onto the chip. Since the number of application pieces that can be processed in parallel is limited by the available parallel processing resources of the processor, GPUs become more interesting to researchers and application developers.
3.2 The GPU

The GPU provides many scalar processor cores, each of which is rather simple compared to a CPU core. Scalar processors are grouped into multiprocessors (also known as streaming multiprocessors) and can execute the same program in parallel using threads. CUDA threads are similar to ordinary operating system threads, with the difference that the overhead for creating and scheduling threads is extremely low and can be safely ignored [6]. The threads, in turn, are grouped into thread blocks that are scheduled by the GPU onto the multiprocessors. A modern GPU is capable of running thousands of threads at the same time, which helps to hide memory latencies. If a thread block issues a long-latency memory operation, the multiprocessor will quickly switch to another block while the memory request is satisfied by the memory controller. The GPU provides different memory types. Each processor core has a very small cache, and each multiprocessor has a shared memory, which can be accessed by all cores located on this multiprocessor. The device itself provides a large global memory, which can be accessed by all multiprocessors. Shared memory is an on-chip memory and can be accessed extremely fast, while accessing the global memory, which is off-chip, takes much longer. For example, a GeForce 8800 consumes only 4 clock cycles for fetching data from shared memory, while the same operation takes 400 to 600 clock cycles for the global memory [27]. However, the shared memory is, at ca. 16 KB, quite small, while the global memory provides several hundreds of megabytes.
Figure 4 illustrates the organization of multiprocessors, processor cores and memory on a GTX 280 GPU. Although this GPU was already introduced in 2008, and is surely not a high-end graphics device anymore, it was used in most of the recent publications that were studied in this work.

However, having many cores and being able to run many threads in parallel does not by itself make the GPU that fast. Data throughput is the feature that can be considered the most important one. Today's GPUs provide a bandwidth of over 100 GiB/s to keep the
Figure 4: GeForce GTX 280 GPU with 240 scalar processor cores, organized in 30 multiprocessors. Figure is adapted from [20].
processors busy and thus exploit as much computational power as possible. Gather/Scatter is another profitable feature of the GPU, which allows reading and writing data from and to non-contiguous memory addresses in the global memory. This is important for treating applications with irregular memory accesses in SIMD fashion [1, 14, 23]. Last but not least, each multiprocessor has several built-in function units to support fast execution of texture sampling and frequently used arithmetic operations like square root, sine and cosine. These units also contribute to a kernel's speedup if the kernel makes use of the supported functions. Ryoo et al. [19] found that these special units contribute about 30% to the speedups of the evaluated trigonometry benchmarks. Lee et al. [14] suggest that the texture sampling unit of the GTX 280 GPU greatly contributed to the speedup of a collision detection algorithm (namely GJK).
In addition, the performance of graphics hardware increases rapidly, and notably faster than that of CPUs. But how can this be? Both chips consist of transistors, after all. The reason is that many transistors built into CPUs do not contribute to the actual computational work; instead, they are used for non-computational tasks like branch prediction and caching, while the highly parallel nature of GPUs allows them to use the additional transistors for computation [16]. A few years ago, GPU vendors introduced support for double-precision floating-point arithmetic for the first time. This innovation removed one of the major obstacles to the adoption of the GPU in many scientific computing applications [1].
3.3 Summary in Table Form
The table below summarizes the features of CPU and GPU and highlights the differences between the two platforms. Here, we ignore characteristics such as performance growth rate, cost and power consumption, because they do not directly contribute to the performance achievable on the device.
Table 1: Comparison of CPU and GPU features that are relevant for computing performance. In order to present the differences in an easily comprehensible way, we rate each feature with plus (+) symbols, where + means that the respective feature is poorly supported and +++++ means that it is very well supported. The table is based on information obtained from [1, 8, 16, 14].
                           CPU      GPU      Comment
Application domain         +++++    ++       GPU requires highly parallel applications
Number of cores            +        +++++
Processor frequency        +++++    ++
Peak throughput            +++      +++++
Caches/shared memory       +++++    +
Gather/Scatter             +        +++++    Usually no hardware support on CPU
Special function units     +        +++      Usually none or very few in CPUs, a few in GPUs
Chip area contributing     ++       +++++    CPU "wastes" many transistors for caching
to computation                               and control logic
4 Considerations When Conducting Performance Comparisons
The authors of [2], [10], [14] and [26] highlight important details regarding CPU/GPU speedup comparisons. They all agree that comparisons found in publications are often taken out of context. In this section we introduce four parameters that influence performance comparisons and should therefore be considered when conducting them.
4.1 The Application
It is obvious that some applications are perfectly suited to run on the CPU, whereas others fit perfectly on the GPU. In the extreme case, we have a single-threaded application, which would leverage the corresponding CPU features and run very well. Running the same application on the GPU would even result in a slowdown, because only a single processor would be active, which is comparatively slow. In addition, the performance would suffer from the overhead of migrating the data to and from the GPU's memory. On the other hand, running perfectly parallelizable code that is compute bound and largely independent of other operations would provide tremendous speedups on the GPU, while the CPU implementation would have to make do with the few parallel units it has. Applications that can work on small input data sets, or can generate input data directly on the GPU (i.e., without the need to fetch it from the CPU), may also perform well on GPUs.
4.2 The Hardware
When comparing the performance of CPU and GPU, the achieved speedups strongly depend on which CPU and GPU are used. For example, using the next better GPU model instead of the chosen one can double the theoretically deliverable performance. That is because GPUs evolve rapidly, and thus a newer GPU usually features more processing cores and higher memory bandwidth. Usually there is also a performance gain from choosing a better CPU model, though the expected gain is not as large as in the GPU case, since the number of additional cores is very limited. It seems obvious that speedups measured on a GPU would (probably) decrease if the CPU used featured more cores. Thus, speedups may decrease by half if a dual-core processor is used instead of a single-core processor, and so forth.
But how can one provide meaningful measurement results if there is such a wide variety of CPUs and GPUs available on the market? It is probably best to take the best available hardware for both platforms and, as Lee et al. [14] suggest, to compare GPUs against thread- and SIMD-parallelized CPU code. The result would then state the performance gain achievable on state-of-the-art hardware.
For example, comparing the execution time of a kernel using a high-end GPU on the one hand and an obsolete single-threaded CPU on the other does yield high speedup numbers, but not very useful results. Authors whose primary aim is not to report GPU speedups, but to inform about other concerns such as optimization techniques, often choose better comparable hardware to produce objective results. Such publications include [13], [14], [17] and [23]. In [8], the achieved GPU results are even compared to several CPU platforms, which is very useful since notable performance gaps to other CPUs become directly visible. Correspondingly, the measured speedups in these publications are all less than 10X. In contrast, it is not very surprising that authors who try to deliver GPU speedups that are as high as possible (in particular, higher than any reported speedups for similar algorithms) tend to choose weaker CPUs. If we take a look at the papers that report great speedups (as mentioned in section 2.2), we find evidence of this. In [4], [11], [12], [18], [22] and [27], a sequential CPU program is used as the reference, while state-of-the-art GPUs are used on the other side. In [24], a dual-core CPU is used, although quad-core processors had already existed for several years. Only Govindaraju et al. [9] implemented their algorithm on a high-end quad-core CPU.
4.3 Performing the Optimizations
A program's code may be optimized to better leverage the given hardware resources. Differences in execution time between an optimized and an unoptimized program can be significant. For example, in [14], Lee et al. report that the speedup of an algorithm, which had been reported to be 114X over CPUs, decreased to only 5X after their careful optimizations. Ryoo et al. [19] researched tree search algorithms on CPU and GPU. They confirm that the speedup gap shrinks significantly when optimized CPU code is used: the gap was reduced from 8X to 1.7X for large trees. For smaller trees, the
CPU implementation was even two times faster than the GPU implementation.
In most of the studied publications that achieve great speedups on GPUs, the description of the CPU optimizations is thin, whereas the optimizations of the GPU version are explained in detail. Often, authors do not consider CPU optimizations at all, or merely mention that they use "optimized" CPU code.
We now take a look at the available tuning opportunities for both platforms to gain some insight into how performance can be increased. One basic optimization approach is to reduce or hide memory latencies: for this purpose, CPU designs use large caches, whereas GPU designs seek to keep thousands of threads in flight. The efficient utilization of the computing resources also depends on how well instruction-level, thread-level and data-level parallelism can be extracted.
4.3.1 CPU Optimizations
• Scatter/Gather can be realized by hand-coding the instruction sequence, which significantly improves SIMD performance. For example, Smelyanskiy et al. [23] managed to reduce the number of instructions needed to fetch data from 4 non-contiguous memory locations from 20 (as generated by the compiler) to 13.
• Cache blocking is the standard technique for reducing cache misses on CPUs. Cache blocking restructures loops that iterate frequently over large data arrays by dividing them into smaller blocks. Each data element is then reused within a block that fits into the data cache before the next block is processed. Lee et al. made intensive use of cache blocking in [14] and observed that the performance of the "Sort" and "Search" benchmarks improved by 3-5X when applying the technique.
• Data layout is critical for processing data in parallel, especially if no hardware support for scatter/gather is available. Reordering data requires a good understanding of the underlying memory structure. For example, in [14], Lee et al. improve the performance of the Lattice Boltzmann method (also known as LBM) by 1.5X by reordering array data structures.
4.3.2 GPU Optimizations
Accessing the GPU's off-chip memory is a major bottleneck in GPU computing. Hence, reducing global memory latency is the main concern when optimizing GPU code [19, 27]. The basic techniques for hiding memory latency are listed below.
• Using as many threads as possible is a very common approach to hiding memory latency. It improves the utilization of the processors, because a great number of threads can run while many other threads are waiting for their read or write requests to the global memory to be satisfied. Switching between active and inactive threads is very fast on GPUs and hence does not cause notable
overhead, as already mentioned in the previous section. A GPU code developer should therefore try to create as many threads as possible. To fully utilize today's GPUs, it is necessary to create 5,000 to 10,000 threads [20].
• Reusing data that is already located in the shared memory avoids expensive accesses to the global memory. The thread that loads a datum into the shared memory may perform a synchronization operation, so that other threads of the same block can access this data too, instead of fetching it from global memory.
• Loading data in blocks helps to reduce global memory latency for applications that can take advantage of contiguity in main memory. An example of such an application is matrix multiplication: Ryoo et al. [19] load parts of the matrix as n×n blocks, which are then processed by n×n threads in parallel. For example, the results for two 16×16 input blocks are computed by 256 threads.
4.4 Data Transfers Between CPU and GPU
The time needed for memory transfers between CPU and GPU is critical to the overall performance of an application [1, 8, 10, 13, 21]. Since CPU and GPU cannot exchange data while a kernel is running, executing a kernel on the GPU usually involves the following steps:
1. CPU: copy input data from CPU memory to GPU memory
2. CPU: launch n instances of the kernel
3. GPU: process n pieces of data in parallel
4. CPU: copy output data from GPU memory to CPU memory
Gregg et al. [10] observed that many published performance comparisons do not state exactly where the data resides before kernel execution and what happens to it afterwards. They argue that taking memory transfer times into account may reduce the achieved speedups significantly. Indeed, they show that the execution time of the benchmarked kernels increases by a factor of 2 to 50 when transfer times are included. Furthermore, they point out that measuring only the raw kernel execution time is meaningless if the results produced by the GPU have to be used by the CPU afterwards: the kernel may be fast, but the execution time of the whole application would also include the time for copying the results from GPU to CPU. Ignoring transfer times in publications also makes it hard to judge whether executing on the GPU is worthwhile at all. Figure 5 and Figure 6 show examples that demonstrate the impact of data transfer times.
Surprisingly, many of the studied publications, including [9], [14], [27] and [20], ignore the time for memory transfers completely in their performance comparisons. No (or unclear) information on memory transfers is provided by [12], [18] and [24]. Bakkum et al. [4], who achieved 20X-70X speedups by porting database operations to the GPU, do include memory transfers in their comparisons. They state that excluding memory transfers would lead to speedups close to 200X, which would not be a fair comparison, though. The authors of [8], [23] and [25] also consider memory transfer times and achieve correspondingly low speedups.
Figure 5: Execution times of the SpMV kernel for growing input matrices. The time for moving the matrix to the GPU's memory heavily dominates the overall execution time of the kernel. Figure is adapted from [10].
Figure 6: Measured performance for stencil computations. Blue bars represent GPU implementations; the other bars represent CPU-based implementations. Taking the time for data transfers to/from the CPU into account degrades the GPU performance dramatically. Figure is adapted from [8].
5 Discussion: is the "100X GPU Speedup" Just a Myth?
5.1 Motivation
In recent years we have seen many claims concerning program speedups on GPUs. Put roughly, these claims often sound like this: "You can compute a matrix 100 times faster using a graphics card instead of a CPU" or "Password cracking on a graphics card is 200 times faster than on a CPU". But is this true? Can one state just like that that GPUs are so much better? Lee et al. [14] say no. Moreover, they argue that achieving such speedups is generally a myth: several parameters need to be considered to provide fair performance comparisons, and the reported speedups would decrease significantly if they were evaluated adequately. In their work, they re-evaluated various claims of GPU speedups of about 100X and ended up with much lower numbers. To this end, they implemented 14 algorithms on CPU and GPU respectively, and managed to bring the originally reported GPU speedups for these algorithms down to an average of 2.5X. The trick was to use a state-of-the-art Intel CPU along with several code optimization techniques.
Motivated by Lee et al., we investigated several publications in order to figure out which parameters these are and how they might influence the speedups. In fact, we found evidence that many performance comparisons seem to be taken out of context. Especially noticeable is the fact that authors who report huge speedups tend to conceal important details of their performance comparisons, or compare their GPU implementations to poorly optimized or outdated CPUs. Some examples of such publications were already mentioned in sections 2.2 and 4.2. Moreover, as shown in section 4.4, almost all publications (especially, again, those from section 2.2) ignore the time needed to transfer the data to and from the GPU. Since considering the transfers is essential in real-life applications, the reported speedups would decrease even further, because moving data is a very costly operation. Summing up, it is likely that these speedups would decrease significantly if just these two parameters were considered. For example, if we assume that the program is run in parallel on a quad-core CPU instead of on a single-threaded CPU, and that memory transfers account for "only" 2X of the speedup, then a 100X speedup would (theoretically) decrease to 12.5X. If we then apply elaborate optimizations to the CPU code, we might end up with a GPU speedup of less than 10X, which would be close to the results achieved by Lee et al.
5.2 Intention of the Author
So far we can say that reported speedups should be interpreted with care in order to draw meaningful conclusions. How and whether the influence parameters elaborated above play a role during performance comparisons depends on the author. Anderson et al. [2] point out that there are two distinct perspectives from which to make comparisons: that of application developers and that of computer architecture researchers. Application developers focus on demonstrating new application capabilities, designing algorithms for a particular
domain under a set of implementation constraints. Hence, when application developers report a 100X speedup using a GPU, the speedup numbers should not be misinterpreted as architectural comparisons claiming that GPUs are 100X faster than CPUs.
Architecture researchers, on the other hand, do not focus on a specific application domain but design architectures that perform well for a variety of application domains. To evaluate the designed architectures, researchers often use benchmark suites rather than elaborate data structures and algorithms that solve a concrete problem. Benchmark suites are designed to evaluate architectural features, not to produce great speedups. Anderson et al. also ask that every future comparison provide enough reference information to allow the reported speedups to be reproduced.
As mentioned before, Lee et al. [14] state that published GPU speedup numbers are generally exaggerated, and that CPUs can keep up with GPUs in many cases. However, the fact that Lee et al. are members of the Intel Corporation, which does not want to lose market share in general-purpose computing, may hint at their intention: namely, to push down the speedup numbers that were achieved using GPUs. Indeed, if we consult our influence parameters, we discover that Lee et al. used an outdated GPU for their comparisons, while next-generation GPUs, which could provide as much as twice the performance, were already available. In addition, Lee et al. do not detail the implementations used for comparison, which again makes it hard to comprehend or reproduce the results.
5.3 The Answer
The answer to our question of whether GPUs can achieve 100X speedups over CPUs is: it depends. The claim that a GPU implementation is 100X faster than a legacy sequential implementation is valid, and may be of great interest to application developers using the legacy implementation [2]. However, one can push this speedup down almost arbitrarily by adjusting the influence parameters introduced above. Even if the parallelism of a CPU implementation is limited by the number of available cores, one can still argue that adding further CPU sockets will match the GPU in performance, as shown by Vuduc et al. in [26].
Nevertheless, we can agree that GPUs have the potential to significantly accelerate parallel algorithms. Even though many reported speedups are exaggerated, today's, and especially future, GPUs are capable of providing notable speedups for certain, well-optimized applications.
6 Conclusions
In this work we have shown that reported speedups of GPU-accelerated algorithms often appear to be exaggerated. To this end, we first looked at the basic concepts of general-purpose computing on GPUs, presenting the GPU architecture, its programming model and application examples. Next, we discussed several parameters that influence performance comparisons, based on a study of several publications that deal with algorithm acceleration on GPUs. We have seen that the parameters (1) chosen application, (2) chosen hardware, (3) performed code optimizations and (4) consideration of memory transfers have a strong impact on the resulting speedup. Many authors, however, do not provide fair performance comparisons, tuning these parameters so that their GPU implementations outperform the corresponding CPU implementations by far. How the parameters are tuned is mainly driven by the author's intention, which can lead to speedups of 100X (and far beyond) over the CPU implementation. In turn, conducting absolutely "fair" performance comparisons often shows that GPU implementations provide only reasonable speedups, or even do not outperform the corresponding CPU implementations at all.
References
[1] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. "GPU Computing". Proceedings of the IEEE, 96(5):879-899, 2008.
[2] Michael Anderson, Bryan Catanzaro, Jike Chong, Ekaterina Gonina, Kurt Keutzer, Chao-Yue Lai, Mark Murphy, David Sheffield, Bor-Yiing Su, and Narayanan Sundaram. "Considerations When Evaluating Microprocessor Platforms". In Proceedings of the 3rd USENIX Conference on Hot Topics in Parallelism, HotPar '11, pages 1-1, Berkeley, CA, USA, 2011. USENIX Association.
[3] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. "The Landscape of Parallel Computing Research: A View from Berkeley". Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[4] Peter Bakkum and Kevin Skadron. "Accelerating SQL Database Operations on a GPU With CUDA". In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU '10, pages 94-103, New York, NY, USA, 2010. ACM.
[5] NVIDIA Corporation. "Compute Unified Device Architecture Programming Guide Version 2.0". http://www.nvidia.com/object/cudadevelop.htm, 2008.
[6] NVIDIA Corporation. "NVIDIA CUDA C Programming Guide". 2010.
[7] NVIDIA Corporation. "NVIDIA CUDA Zone". http://www.nvidia.com/object/cuda_home.html, 2011.
[8] Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. "Stencil Computation Optimization and Auto-tuning on State-of-the-art Multicore Architectures". In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 4:1-4:12, Piscataway, NJ, USA, 2008. IEEE Press.
[9] Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. "High Performance Discrete Fourier Transforms on Graphics Processors". In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 2:1-2:12, Piscataway, NJ, USA, 2008. IEEE Press.
[10] Chris Gregg and Kim Hazelwood. "Where is the Data? Why You Cannot Debate CPU vs. GPU Performance Without the Answer". In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS '11, pages 134-144, Washington, DC, USA, 2011. IEEE Computer Society.
[11] Nail A. Gumerov and Ramani Duraiswami. "Fast Multipole Methods on Graphics Processors". J. Comput. Phys., 227:8290-8313, September 2008.
[12] Guang Hu, Jianhua Ma, and Benxiong Huang. "Password Recovery for RAR Files Using CUDA". In Proceedings of the 2009 Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing, DASC '09, pages 486-490, Washington, DC, USA, 2009. IEEE Computer Society.
[13] Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, and Pradeep Dubey. "FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs". In Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10, pages 339-350, New York, NY, USA, 2010. ACM.
[14] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. "Debunking the 100X GPU vs. CPU Myth: an Evaluation of Throughput Computing on CPU and GPU". In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 451-460, New York, NY, USA, 2010. ACM.
[15] Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J. Alan Miller, and Michael Upton. "Hyper-Threading Technology Architecture and Microarchitecture". Intel Technology Journal, 6(1):4-16, 2002.
[16] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, and Timothy J. Purcell. "A Survey of General-Purpose Computation on Graphics Hardware". Computer Graphics Forum, 26(1):80-113, 2007.
Denis Dridger<br />
[17] S. J. Pennycook, S. D. Hammond, S. A. Jarvis, and G. R. Mudalige. "Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark". SIGMETRICS Perform. Eval. Rev., 38:23–29, March 2011.

[18] Pham Hong Phong, Phan Duc Dung, Duong Nhat Tan, Nguyen Huu Duc, and Nguyen Thanh Thuy. "Password Recovery for Encrypted ZIP Archives Using GPUs". In Proceedings of the 2010 Symposium on Information and Communication Technology, SoICT ’10, pages 28–33, New York, NY, USA, 2010. ACM.

[19] Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei W. Hwu. "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA". In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’08, pages 73–82, New York, NY, USA, 2008. ACM.

[20] Nadathur Satish, Mark Harris, and Michael Garland. "Designing Efficient Sorting Algorithms for Manycore GPUs". In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS ’09, pages 1–10, Washington, DC, USA, 2009. IEEE Computer Society.

[21] Dana Schaa and David Kaeli. "Exploring the Multiple-GPU Design Space". In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS ’09, pages 1–12, Washington, DC, USA, 2009. IEEE Computer Society.

[22] Mark Silberstein, Assaf Schuster, Dan Geiger, Anjul Patney, and John D. Owens. "Efficient Computation of Sum-products on GPUs Through Software-managed Cache". In Proceedings of the 22nd Annual International Conference on Supercomputing, ICS ’08, pages 309–318, New York, NY, USA, 2008. ACM.

[23] Mikhail Smelyanskiy, David Holmes, Jatin Chhugani, Alan Larson, Douglas M. Carmean, Dennis Hanson, Pradeep Dubey, Kurt Augustine, Daehyun Kim, Alan Kyker, Victor W. Lee, Anthony D. Nguyen, Larry Seiler, and Richard Robb. "Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures". IEEE Transactions on Visualization and Computer Graphics, 15:1563–1570, November 2009.

[24] F. Vazquez, G. Ortega, J. J. Fernandez, and E. M. Garzon. "Improving the Performance of the Sparse Matrix Vector Product with GPUs". In Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology, CIT ’10, pages 1146–1151, Washington, DC, USA, 2010. IEEE Computer Society.

[25] Vasily Volkov and James W. Demmel. "Benchmarking GPUs to Tune Dense Linear Algebra". In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, pages 31:1–31:11, Piscataway, NJ, USA, 2008. IEEE Press.
[26] Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. "On the Limits of GPU Acceleration". In Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, HotPar ’10, pages 13–13, Berkeley, CA, USA, 2010. USENIX Association.

[27] Zhiyi Yang, Yating Zhu, and Yong Pu. "Parallel Image Processing Based on CUDA". In International Conference on Computer Science and Software Engineering, 3:198–201, 2008.
Will Dark Silicon Limit Multicore Scaling?

Christoph Kleineweber
University of Paderborn
chkl@mail.uni-paderborn.de

January 12, 2012
Abstract

The performance of processors has grown exponentially over decades, but it is doubtful whether this scaling will hold for upcoming multicore processors. To answer this question, this work reflects on a study published by Esmaeilzadeh et al. [7], which presents an analytical model for making scaling predictions based on empirical data of current processor technologies. One of the most significant results is that dark silicon might become a relevant problem. Dark silicon is the fraction of the die area that remains unused because of power or application parallelism limits. We come to the conclusion that the level of parallelism is the most relevant cause of dark silicon.
1 Introduction

The exascale challenge is a frequently discussed topic in computer engineering. During the last decades, a continuous performance growth of CPUs was sustained. While energy efficiency improved with each new technology generation, the total power consumption of a CPU has historically grown along with its performance.
To avoid an exorbitant growth of power consumption, multicore CPUs and GPUs were established as an alternative to further increases of the single-core frequency. This strategy, however, requires applications with a certain level of parallelism to achieve performance improvements. In addition, memory and communication bandwidth remain an open challenge. To answer the question of whether current technology can meet upcoming performance needs with acceptable energy and chip area demands, Esmaeilzadeh et al. [7] carried out a detailed analysis of different models and empirical measurements of currently available devices and used this knowledge to estimate the scalability of upcoming technologies. An interesting aspect in this area is the fraction of dark silicon in upcoming processor generations. Dark silicon is the part of a die that remains unused, e.g. because of missing parallelism in an application or because of power constraints. In the worst case, dark silicon may limit the possible performance improvements of upcoming chip generations, even if the growth of chip complexity continues as in the past. This paper reflects the work of Esmaeilzadeh et al. and compares the results to alternative models.
1.1 Overview

The remainder of this paper is structured as follows: the rest of this section introduces basic models related to scaling compute performance and explains the different types of considered multicore topologies. Section 2 presents an empirical study on current processor technologies and makes predictions on upcoming technologies and the resulting performance; it consists of a device model, a core model, and a multicore model. Section 3 derives the scaling limitations and presents the sources of dark silicon. Section 4 summarizes related work. The last section concludes and discusses the feasibility of the presented work.
1.2 Basic Models

In the past, different performance and scaling models have been proposed. These models are necessary to predict the upcoming processor technology and performance. This section presents Moore’s Law, Pollack’s Rule, and Amdahl’s Law. In the remainder of this paper, we will discuss whether these models are sufficient to make detailed scaling predictions and, in particular, to predict the fraction of dark silicon.
1.2.1 Moore’s Law

Gordon E. Moore, one of the founders of Intel, observed in 1965 that the complexity of integrated circuits doubles every 18 months [11]. Complexity means in this context the number of transistors per die area. Unexpectedly, this rule has held for decades, and it was thereby the basis for the observed growth of compute performance. An interesting question is what effect further increases of processor complexity will have on performance, even if Moore’s Law continues to hold.
1.2.2 Pollack’s Rule<br />
One model to answer <strong>the</strong> question <strong>of</strong> <strong>the</strong> effect <strong>of</strong> an increased processor complexity is<br />
Pollack’s Rule [4]. Pollack’s Rule proposes that <strong>the</strong> increase <strong>of</strong> <strong>the</strong> performance <strong>of</strong> a chip<br />
is proportional to <strong>the</strong> growth <strong>of</strong> <strong>the</strong> square root <strong>of</strong> its complexity. This rule implies for<br />
instance that doubling <strong>the</strong> processor complexity results only in a performance growth <strong>of</strong><br />
40 %.<br />
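As a minimal numeric sketch (values chosen for illustration, not taken from the paper), Pollack's Rule can be stated as performance being proportional to the square root of the complexity:

```python
import math

def pollack_speedup(complexity_ratio: float) -> float:
    """Relative performance gain predicted by Pollack's Rule for a
    given growth factor of core complexity (transistor count)."""
    return math.sqrt(complexity_ratio)

# Doubling the complexity yields only ~41 % more performance.
print(round(pollack_speedup(2.0), 2))  # → 1.41
```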
1.2.3 Amdahl’s Law

One important question when analyzing processor performance is the speedup achieved by a new processor generation. To this end, Amdahl formulated a very general rule [1] in 1967 that enables us to compare two processor generations. According to Amdahl, the speedup of a system is

Speedup = 1 / ( (1 − f) + f / S )    (1)

where f represents the fraction that is optimized by an improved system, e.g. the affected parts of the code, and S represents the speedup of this fraction. We will see some corollaries of Amdahl’s Law, fitted to multicore processors, in Section 2.3.1.
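As a small worked example (the inputs are chosen for illustration, not taken from the paper), equation 1 can be evaluated directly; note how the serial fraction bounds the total speedup no matter how large S becomes:

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the work is
    accelerated by a factor s (equation 1)."""
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up 90 % of the work by 8x gives less than 5x overall.
print(round(amdahl_speedup(0.9, 8), 2))     # → 4.71

# Even with a near-infinite S, the serial 10 % caps the speedup at 10.
print(round(amdahl_speedup(0.9, 1e12), 2))  # → 10.0
```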
1.3 Multicore Topologies

We consider different types of processors for the following analysis, which are also presented by Esmaeilzadeh et al. [7]. First, we distinguish between regular multicore processors and GPU-like processors, which execute many threads per core. For each of these types, we consider the following topologies.
1.3.1 Symmetric Multicore

A symmetric multicore processor is the most obvious design and consists of multiple identical cores. The parallel fraction of a program is distributed across all of these cores. Serial code, in contrast, is executed on one single core, during which large parts of the processor may be unused.
1.3.2 Asymmetric Multicore

This kind of multiprocessor consists of one large core and multiple small cores of the same type. Typically the performance of the large core is much higher than that of the small cores; thus sequential tasks can be executed with good performance on the large core, while parallel tasks run on the small cores and the large core together.
1.3.3 Dynamic Multicore<br />
The dynamic multicore topology is very similar to <strong>the</strong> asymmetric multicore topology.<br />
Contrary to <strong>the</strong> asymmetric multicore, ei<strong>the</strong>r <strong>the</strong> large core or <strong>the</strong> small cores are usable<br />
at <strong>the</strong> same time. During <strong>the</strong> execution <strong>of</strong> a sequential task, <strong>the</strong> small cores are shut down<br />
and during <strong>the</strong> execution <strong>of</strong> parallel tasks, <strong>the</strong> large core is shut down.<br />
1.3.4 Composed Multicore<br />
The composed multicore topology, in literature also called fused multicore, consists <strong>of</strong><br />
multiple small cores, which can be composed to one large core. This architecture implies<br />
<strong>the</strong> same behavior as <strong>the</strong> dynamic multicore topology where ei<strong>the</strong>r one large core or<br />
multiple small cores can be used at <strong>the</strong> same time.<br />
2 Performance Models

This section describes three models used for estimating the upcoming performance scaling. We model future devices, CPU cores, and multicore CPUs, and combine them to make predictions on future compute performance and the impact of dark silicon. The device model describes upcoming semiconductor technologies. On top of it, a core model estimates the upcoming performance per core from the performance per die area and the power consumption of current processors; in combination with the device model, it yields the core performance of upcoming processors. In the last step, we estimate the upcoming multicore speedup by combining the results from the core model with Amdahl’s Law and with a second, more realistic model.
2.1 Device Model

The authors of [7] presented two different device scaling models. The first one is based on the ITRS technology roadmap¹; the second is a more conservative model presented by Borkar [5]. Both models provide a roadmap of upcoming technologies, which is the basis for the further predictions made in the remainder of this section: for feature sizes from 45 nm down to 8 nm, they give the expected frequency, voltage, capacitance, and power scaling factors. The results of both roadmaps are shown in Figure 1. We have to keep in mind that the ITRS roadmap assumes different types of transistors than the conservative projection.
Figure 1: Scaling factors for ITRS and conservative projections [7]

¹ Online at http://www.itrs.net
2.2 Core Model<br />
2.2.1 Current Performance Behavior<br />
Esmaeilzadeh et al. used empirical performance data, measured with the SPECmark benchmarks, of 152 real processors from 600 nm down to 45 nm. The benchmark results, shown in Figure 2, were taken from the SPEC website². They relate the single-threaded core performance, called q, to the power consumption P(q) and the chip area A(q); details of the processor and system architecture are not considered in this model. The performance q is given as the SPEC CPU2006 score. The power consumption of a processor core was taken from the data sheets. The Thermal Design Power (TDP) was used in this study, i.e. the power a processor can dissipate without exceeding the junction temperature of its transistors. To build a model that predicts upcoming performance, only one technology generation, in this case 45 nm, was considered (Figure 3). To estimate the core area, die photos were used; the area consumed by level 2 and level 3 caches was excluded.
Power and area constraints were considered decoupled in this study. Previous studies on multicore performance used Pollack’s Rule and assumed power consumption to be proportional to the number of transistors, which means proportional to the chip area when only one feature size is considered. Given that frequency and voltage no longer scale as they did historically, Pollack’s Rule is not practical for modeling the power consumption of a current or upcoming processor core.
Figure 2: Power/Performance across nodes [7]

2.2.2 Estimate Optimal Design Points
To identify the most relevant design points, the Pareto frontier of the 45 nm design space was derived. For the power/performance design space, a cubic polynomial P(q) was assumed; the Pareto frontier of the area/performance design space A(q) was assumed to be a quadratic polynomial. This choice was made according to Pollack’s Rule, which implies a quadratic increase of chip area with a performance increase. The coefficients of the polynomials P(q) and A(q) were fitted using least-squares regression. The results are presented in Figure 3 and Figure 4.

² Online at http://www.spec.org

Figure 3: Power/Performance frontier, 45 nm [7]

Figure 4: Area/Performance frontier, 45 nm [7]
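As an illustration of this fitting step, the two frontier polynomials can be obtained by least-squares regression. The data points below are synthetic placeholders that lie exactly on a cubic and a quadratic; the measured SPECmark values used in the study are given in [7]:

```python
import numpy as np

# Hypothetical (performance, power, area) design points standing in
# for the 45 nm frontier data (power in W, area in mm^2).
q     = np.array([ 5.0, 10.0, 15.0, 20.0, 25.0])
power = np.array([ 0.5,  4.0, 13.5, 32.0, 62.5])
area  = np.array([ 2.5,  7.0, 14.5, 25.0, 38.5])

# Cubic power/performance frontier P(q) and quadratic
# area/performance frontier A(q), both fitted by least squares.
P = np.poly1d(np.polyfit(q, power, deg=3))
A = np.poly1d(np.polyfit(q, area, deg=2))

# Evaluate the fitted frontiers at an intermediate design point.
print(round(float(P(12.0)), 3), round(float(A(12.0)), 3))
```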
2.2.3 Predicting Upcoming Performance

To make predictions on upcoming processor core performance, we combine the results of the presented device model and the core model: the 45 nm Pareto frontier was scaled down to 8 nm, yielding a new Pareto frontier for each technology generation, by applying the scaling factors from the device model (Section 2.1) to the data points of the core model. The SPECmark performance is thereby assumed to scale with the frequency, which ignores aspects like memory latency and bandwidth; the presented model therefore has to be considered as an upper bound for upcoming processor performance. The predictions, based on the ITRS roadmap and on the conservative model by Borkar, are shown in Figure 5 and Figure 6.
Figure 5: Conservative frontier scaling [7]

Figure 6: ITRS frontier scaling [7]

2.3 Multicore Model
The next model estimates the possible scaling of multicore processors. We will consider two different scaling models: the first is a corollary of Amdahl’s Law; the second is a more realistic model, originally proposed by Guz et al. [8] and extended in [7]. The latter is applicable to both CPU- and GPU-like processors.
2.3.1 Upper Bound by Amdahl’s Law

To apply Amdahl’s Law to multicore processors, Hill and Marty [10] derived the speedup of all presented multicore topologies. This model can be considered as an upper bound for the multicore speedup. The model was extended to consider power and area constraints, but it does not differentiate between CPU- and GPU-like processor architectures. The possible speedups, depending on the processor topology, are given by equations 2 to 9, where the possible number of cores depends on the chip area and power restrictions. DIE_AREA denotes the maximum area budget and TDP the power budget. The parameter q denotes the performance of a single core; the speedup is measured relative to a baseline core with performance qBaseline. The speedup of a single core cannot be larger than SU(q) = q/qBaseline.
For the symmetric multicore topology, the parallel fraction f of the code is distributed over all NSym available cores; the serial fraction runs on only one core.
NSym(q) = min( DIE_AREA / A(q), TDP / P(q) )    (2)

SpeedupSym(f, q) = 1 / ( (1 − f) / SU(q) + f / (NSym(q) · SU(q)) )    (3)
For the asymmetric multicore topology, the large core dominates the area constraint and the small cores dominate the power constraint. The variables qL and qS describe the performance of the large core and of a single small core, respectively. On this topology, parallel code is executed on the large core and the small cores; sequential code is executed only on the large core.
NAsym(qL, qS) = min( (DIE_AREA − A(qL)) / A(qS), (TDP − P(qL)) / P(qS) )    (4)

SpeedupAsym(f, qL, qS) = 1 / ( (1 − f) / SU(qL) + f / (NAsym(qL, qS) · SU(qS) + SU(qL)) )    (5)
With a dynamic multicore topology, the area is still bounded by the area of the large core if the area constraint is the dominating part, but the number of small cores is not limited by the power consumption of the large core. For this topology, parallel code is executed only on the small cores.
NDyn(qL, qS) = min( (DIE_AREA − A(qL)) / A(qS), TDP / P(qS) )    (6)

SpeedupDyn(f, qL, qS) = 1 / ( (1 − f) / SU(qL) + f / (NDyn(qL, qS) · SU(qS)) )    (7)
One characteristic of the composed multicore topology is an area overhead caused by the composition technology, described by the parameter τ. The model assumes that the composed core has the same performance and power consumption as a scaled-up single core. The execution behavior of parallel and sequential code is the same as for the dynamic multicore.
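The four constrained topology models of equations 2 to 9 can be sketched directly in Python. The power and area polynomials, the budgets, and the baseline performance below are hypothetical placeholders, not the fitted Pareto frontiers from [7]:

```python
# Area and power budgets roughly in the ballpark of the study's
# assumptions; P(q), A(q) and the baseline are placeholders.
DIE_AREA, TDP, Q_BASELINE, TAU = 111.0, 125.0, 10.0, 0.1

def P(q): return 0.01 * q**3        # power model, W (placeholder)
def A(q): return 0.05 * q**2        # area model, mm^2 (placeholder)
def S_U(q): return q / Q_BASELINE   # single-core speedup bound

def n_sym(q):                       # equation 2
    return min(DIE_AREA / A(q), TDP / P(q))

def speedup_sym(f, q):              # equation 3
    return 1.0 / ((1 - f) / S_U(q) + f / (n_sym(q) * S_U(q)))

def n_asym(qL, qS):                 # equation 4
    return min((DIE_AREA - A(qL)) / A(qS), (TDP - P(qL)) / P(qS))

def speedup_asym(f, qL, qS):        # equation 5
    return 1.0 / ((1 - f) / S_U(qL) + f / (n_asym(qL, qS) * S_U(qS) + S_U(qL)))

def n_dyn(qL, qS):                  # equation 6: large core powered off in parallel phases
    return min((DIE_AREA - A(qL)) / A(qS), TDP / P(qS))

def speedup_dyn(f, qL, qS):         # equation 7
    return 1.0 / ((1 - f) / S_U(qL) + f / (n_dyn(qL, qS) * S_U(qS)))

def n_composed(qL, qS):             # equation 8: area overhead factor (1 + tau)
    return min(DIE_AREA / ((1 + TAU) * A(qS)), (TDP - P(qL)) / P(qS))

def speedup_composed(f, qL, qS):    # equation 9
    return 1.0 / ((1 - f) / S_U(qL) + f / (n_composed(qL, qS) * S_U(qS)))

print(round(speedup_sym(0.9, 10.0), 2))  # → 5.81
```

For these placeholder models, the symmetric topology is power-limited (12.5 cores fit the TDP but 22 would fit the area), which already illustrates how a power budget can leave die area dark.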
NComposed(qL, qS) = min( DIE_AREA / ((1 + τ) · A(qS)), (TDP − P(qL)) / P(qS) )    (8)

SpeedupComposed(f, qL, qS) = 1 / ( (1 − f) / SU(qL) + f / (NComposed(qL, qS) · SU(qS)) )    (9)

2.3.2 Realistic Model
The model presented next is a more realistic model of the speedup of upcoming multicore processors. It also considers technological details such as the number of threads per core (and thereby the difference between CPU- and GPU-like architectures), the cache behavior, the memory bandwidth, the frequency, and the cycles per instruction (CPI). Also important for the performance of a processor is the executed application, whose behavior is characterized by its level of parallelism and its memory access pattern. The performance of a fully parallel application, measured as the number of instructions per second, is given by equation 10.
Perf = min( N · (freq / CPIexe) · η , BWmax / (rm · mL1 · b) )    (10)
Thereby η represents the core utilization, which depends on the memory behavior, rm is the fraction of instructions with memory access, mL1 is the predicted miss rate of the first-level cache, and b is the number of bytes per memory access. The CPIexe value and the frequency were estimated from the presented Pareto frontiers; details on the values are explained in [7].

To model application characteristics, PARSEC applications were considered, based on previous studies [2], [3]. The level of parallelism f, obtained from these studies using Amdahl’s Law, lies between 0.75 and 0.9999, depending on the considered benchmark.
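A sketch of equation 10 follows; all parameter values are hypothetical illustrations, not the fitted values used in the study (those are reported in [7]):

```python
def perf(n_cores, freq_hz, cpi_exe, eta, bw_max, r_m, m_l1, b):
    """Upper bound on instructions/s as the minimum of the
    compute-bound and memory-bandwidth-bound terms (equation 10)."""
    compute_bound = n_cores * (freq_hz / cpi_exe) * eta
    memory_bound = bw_max / (r_m * m_l1 * b)
    return min(compute_bound, memory_bound)

# 16 cores at 3 GHz, CPI 1, 80 % utilization, against a 200 GB/s
# memory system with 30 % memory instructions, a 10 % L1 miss rate,
# and 8-byte accesses: this configuration is compute-bound.
print(perf(16, 3e9, 1.0, 0.8, 200e9, 0.3, 0.1, 8))
```

With many more cores the same memory system becomes the bottleneck, which is exactly how the model captures bandwidth-induced dark silicon.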
Now we compute the serial performance PerfS and the parallel performance PerfP for each type of multicore processor using equation 10. The number of cores N is computed using the topology-dependent equations 2, 4, 6, and 8. We consider a 45 nm Nehalem core as the baseline with performance PerfB and obtain a speedup SSerial = PerfS/PerfB for the serial part of the benchmark and SParallel = PerfP/PerfB for the parallel part. The total speedup is given by equation 11 for each of the topologies.
Speedup = 1 / ( (1 − f) / SSerial + f / SParallel )    (11)

2.4 Combining the Models
In this section we put everything together and predict the performance of an upcoming multicore processor. We assume a power limit of 125 W and an area budget of 111 mm², which corresponds to a Nehalem-based 4-core processor in 45 nm technology, excluding level 2 and level 3 caches. For this prediction, each area/performance design point of the Pareto frontier is considered. Starting from a single core, one core is added per iteration, and the new power consumption and speedup are computed. The speedup is computed both with the upper bound given by Amdahl’s Law and with the more realistic model; the power consumption is computed using the power/performance Pareto frontier. The iteration stops when the power or area limit is reached or when the performance decreases. The difference between the chip area allocated up to this step and the total area budget is the fraction of dark silicon. These steps are repeated for all scaled Pareto frontiers with both multicore performance models, considering GPU- and CPU-like processors; the power and area budgets are kept constant. Detailed results of this model are presented in [7]. Esmaeilzadeh et al. came to the conclusion that, using Amdahl’s Law, the maximum speedup at 8 nm is 11.3 with the conservative device scaling and 59 with the ITRS roadmap. In both cases the typical number of cores is predicted to be smaller than 512. Relying on the ITRS roadmap, they expect that dark silicon will dominate in 2024.
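The iterative search described above can be sketched as follows. The power and area models are again hypothetical placeholders, and the speedup is computed with the Amdahl upper bound only:

```python
# Budgets from the text: 125 W and 111 mm^2 (Nehalem 4-core, 45 nm,
# excluding L2/L3 caches). P(q) and A(q) are placeholder models.
DIE_AREA, TDP = 111.0, 125.0

def P(q): return 0.01 * q**3
def A(q): return 0.05 * q**2

def amdahl(f, n, s_u):
    """Symmetric-multicore corollary of Amdahl's Law (equation 3)."""
    return 1.0 / ((1 - f) / s_u + f / (n * s_u))

def best_core_count(q, f, q_baseline=10.0):
    """Add one core per iteration until the power or area budget is
    exhausted or the speedup stops improving; report the fraction of
    the area budget left dark."""
    s_u = q / q_baseline
    best_n, best_speedup = 0, 0.0
    n = 1
    while n * A(q) <= DIE_AREA and n * P(q) <= TDP:
        s = amdahl(f, n, s_u)
        if s <= best_speedup:   # performance decrease → stop
            break
        best_n, best_speedup = n, s
        n += 1
    dark_fraction = 1.0 - best_n * A(q) / DIE_AREA
    return best_n, best_speedup, dark_fraction

n, s, dark = best_core_count(q=10.0, f=0.9)
print(n, round(s, 2), round(dark, 2))  # → 12 5.71 0.46
```

Under these placeholder models the search is stopped by the power budget after 12 cores, leaving roughly 46 % of the area budget dark.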
3 Scaling Limitations and Dark Silicon

Figure 7: Dark silicon bottleneck relaxation using CPU organization and dynamic topology at 8 nm with ITRS scaling [7]
From the previous observations we know that limited application parallelism and a limited power budget are the main sources of dark silicon. To analyze in more detail which of these factors dominates, we take a closer look at a hypothetical CPU-like processor in 8 nm technology derived from the ITRS roadmap. In the first part of Figure 7 only the power budget is limited. The different curves present the speedup of the different PARSEC benchmarks, normalized to a 45 nm Nehalem
quad-core processor. We consider a parallelism of 75 % to 99 % and assume that programmers can somehow achieve these levels. The markers indicate the parallelism of the current implementations. We notice that most of the benchmarks reach a speedup of only 15, even at a parallelism level of 99 %.
In the second part of Figure 7 we consider a fixed level of parallelism and vary the power budget. We see that eight of twelve benchmarks are accelerated by no more than a factor of ten, even with a practically unlimited power budget.
This analysis shows that the limited level of parallelism is the dominant source of dark silicon, while a varying power budget affects the fraction of dark silicon only marginally.
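This is consistent with Amdahl's Law: even with an unlimited number of cores, the serial fraction caps the achievable speedup. As a worked check (not taken from the paper itself),

```latex
S(f, N) = \frac{1}{(1 - f) + f/N},
\qquad
\lim_{N \to \infty} S(0.99, N) = \frac{1}{1 - 0.99} = 100 .
```

So even the ideal ceiling at 99 % parallelism is a speedup of 100; that most benchmarks reach only about 15 under the realistic core model shows how far below the Amdahl bound the practical limit sits.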
4 Alternative Models
4.1 General Models
Several other studies have been published in the area of performance and scaling predictions, but most of them do not reach the generality and level of detail presented by Esmaeilzadeh et al. [7]. Examples are the corollaries to Amdahl's Law by Hill and Marty [10] or the presentation of many-core architectures by Borkar [4].
4.2 Specialization-Oriented Models
A promising approach to overcome the problems pointed out by this work is the use of custom logic. Chung et al. [6] presented a model that combines traditional processors with custom logic, called unconventional cores (U-cores), implemented as FPGAs or GPGPUs. They came to the conclusion that these technologies are useful when reducing power consumption is a primary goal, but they also require a significant level of application parallelism to work efficiently. Such solutions may help to reduce energy demands in some areas, but since limited parallelism is the most critical source of dark silicon (Section 3), it is doubtful that these technologies are suitable for the majority of applications.
Hempstead, Wei, and Brooks presented a modeling framework for upcoming technology generations called Navigo [9]. They also came to the conclusion that specialization for specific applications may overcome energy problems. However, they made very optimistic assumptions regarding the achievable parallelism, so it remains problematic to solve the dark silicon problem with this approach.
5 Conclusions
Historically, processor speedups were achieved by increasing chip complexity and clock frequency. This scaling failed in recent years due to an exorbitant growth of energy consumption. The answer of computer engineers was multicore processors, which in turn raise many new problems. This work presented an analysis
of the performance scaling of multicore CPUs and GPUs with a focus on the effect of dark silicon. Three models were presented: a device model that predicts upcoming semiconductor technologies, a core model that predicts upcoming single-core performance, and a multicore model that enables predictions of the speedup achievable with multicore processors. We have seen that even with the optimistic technology scaling proposed by the ITRS roadmap, it is impossible to sustain the historical performance growth.
Finally, we have to consider the significance of this work. The relevant factor here is the plausibility of the assumptions made and the techniques used. To simplify the analysis, the proposed models do not consider simultaneous multithreading (SMT). SMT may yield an additional speedup, but it may also be a performance drawback.
Another problem is that only on-chip components were considered in the power analysis. There is a consensus that the power share of the remaining system components will increase in the future: these components will demand a larger part of the total power consumption, which may reduce the speedup and increase the fraction of dark silicon.
The presented empirical data contained only Intel and AMD processors; in particular, ARM and Tilera cores were not considered due to missing SPECmark results.
However, the presented model seems feasible in general, even though some smaller assumptions in different sections of the study were optimistic. In particular, the identified sources of dark silicon appear realistic. The fact that limited application parallelism is the most important reason for dark silicon shows that programmers, too, bear a large part of the upcoming challenge of speeding up applications.
References
[1] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, AFIPS '67 (Spring), pages 483–485, New York, NY, USA, 1967. ACM.
[2] Major Bhadauria, Vincent M. Weaver, and Sally A. McKee. Understanding PARSEC performance on contemporary CMPs. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), IISWC '09, pages 98–107, Washington, DC, USA, 2009. IEEE Computer Society.
[3] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT '08, pages 72–81, New York, NY, USA, 2008. ACM.
[4] Shekhar Borkar. Thousand Core Chips: A Technology Perspective. In 2007 44th ACM/IEEE Design Automation Conference, pages 746–749. IEEE, June 2007.
[5] Shekhar Borkar. The Exascale Challenge. In Proceedings of 2010 International Symposium on VLSI Design, Automation and Test, pages 2–3, April 2010.
[6] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pages 225–236, December 2010.
[7] Hadi Esmaeilzadeh, Emily Blem, Karthikeyan Sankaralingam, and Doug Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, New York, NY, USA, 2011. ACM.
[8] Zvika Guz, Evgeny Bolotin, Idit Keidar, Avinoam Kolodny, Avi Mendelson, and Uri C. Weiser. Many-core vs. many-thread machines: Stay away from the valley. IEEE Computer Architecture Letters, 8:25–28, January 2009.
[9] Mark Hempstead, Gu-Yeon Wei, and David Brooks. Navigo: An Early-Stage Model to Study Power-Constrained Architectures and Specialization. 2009.
[10] Mark D. Hill and Michael R. Marty. Amdahl's Law in the Multicore Era. Computer, 41(7):33–38, July 2008.
[11] Gordon E. Moore. Cramming more components onto integrated circuits. Reprinted from Electronics, volume 38, number 8, April 19, 1965, pp. 114 ff. IEEE Solid-State Circuits Newsletter, 20(3):33–35, September 2006.
Guiding Computation Accelerators to Performance Optimization Dynamically
Sandeep Korrapati
University of Paderborn
sandeep@uni-paderborn.de
January 13, 2012
Abstract
The constant demand for performance optimization and increased computational efficiency has led to many advancements in the design of embedded processors. The use of application-specific instruction set processors (ASIPs) is one of the most popular approaches. The computation accelerators used in ASIPs are hardware units customized according to instruction set extensions (ISEs). To capitalize on the performance gain provided by these customized accelerators, applications have to be compiled with these ISEs. This paper explains in detail (1) an approach to dynamically utilize these customized accelerators for applications that are not compiled with the ISEs, (2) the problems faced due to the dynamic approach, and (3) the methods used to resolve them.
1 Introduction
1.1 Introduction to Terminology
Compilation of an application involves decoding the instructions and storing them in a convenient form so that they can easily be referenced later. The compiler views these decoded instructions as a graph, referred to as a dataflow graph (DFG). Portions of this dataflow graph are often extracted to fuse them into macro-ops or to map them onto specialized hardware; these portions of the DFG are referred to as subgraphs. The compiler also requires a description of the flow of control within these instructions. Hence, it extracts a graph depicting the flow of control from the DFG, referred to as a control-flow graph (CFG).
1.2 Origin
Present-day embedded systems are expected to efficiently perform complex computations such as processing images, signals, and video streams. General-purpose processors may
fail to meet the demands of such complex computations in terms of performance and power costs. Customizing hardware is a common method of meeting these performance requirements within limited power and cost constraints. Traditionally, application-specific integrated circuits (ASICs) are used in embedded systems to perform computation-intensive tasks. ASICs are non-programmable hardware customizations that aid in realizing efficient solutions. In ASICs, the critical functionality is mapped directly onto hardware implementations, reducing the burden on the processor and thereby resulting in better performance. Although ASICs yield better performance than other solutions, their lack of programmability makes them a poor choice, as only few applications can fully benefit from them. Any change in the application may deprive it of the advantages of the ASIC. Moreover, the introduction of an ASIC requires rewriting the application to be able to take advantage of it.
An alternative approach is to employ smaller, but compilable, hardware units referred to as computation accelerators. These accelerators are customized for certain specific complex operations, and the instruction set must incorporate the corresponding instructions. Application-specific instruction set processors (ASIPs) utilize computation accelerators incorporated into their processor pipeline. Computation accelerators can provide several advantages, including reduced latency for subgraph execution, increased execution bandwidth, improved utilization of pipeline resources, and a reduced burden on the register file for storing temporary values. ASIPs, unlike ASICs, are reprogrammable, have a time-to-market advantage over ASICs, and provide better performance than traditional general-purpose processors.
The multiply-accumulate (MAC) unit is one of the most widely used accelerators in industry. Accelerators are commonly used in DSPs, where frequent computations from signal and image processing, such as dot product, sum of absolute differences, and compare-select, are mapped onto them. Accelerators are further classified into two types: generalized and specialized accelerators. The design of generalized accelerators is mainly architecture dependent; examples are 3-1 ALUs, closed-loop ALUs, etc. The larger the accelerator, the bigger the subgraphs it can support and thus the higher the performance enhancement. However, increasing the capacity of an accelerator narrows its deployability, as fewer applications can benefit from it. FPGA-style accelerators, configurable compute accelerators, and programmable carry functions are some of the successful larger accelerators. As the name suggests, specialized accelerators target a particular application. Such synthesized accelerators are mostly employed in commercial tool chains, e.g. Tensilica Xtensa, ARC Architect, and ARM OptimoDE.
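To illustrate the idea of collapsing a subgraph into one accelerator operation, the following minimal sketch (hypothetical, not from the paper) contrasts a dot product executed as separate multiply and add micro-ops with the same computation expressed through a fused MAC operation:

```python
def dot_product_scalar(xs, ys):
    """Dot product with separate operations: each iteration issues a
    multiply micro-op followed by a dependent add micro-op."""
    acc = 0
    for x, y in zip(xs, ys):
        p = x * y       # multiply micro-op
        acc = acc + p   # dependent add micro-op
    return acc

def mac(acc, x, y):
    """Model of a fused multiply-accumulate accelerator operation:
    one instruction instead of a dependent mul/add pair."""
    return acc + x * y

def dot_product_mac(xs, ys):
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)  # one accelerator instruction per iteration
    return acc
```

Both variants compute the same result; the accelerator version replaces a two-node multiply/add subgraph in the DFG with a single instruction per iteration.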
Complex algorithms have been developed over time to identify the subgraphs that can be executed on the accelerators. These algorithms require instruction set extensions (ISEs) for the instructions supported by the accelerators in order to select the subgraphs; the control-flow graph is then used to isolate the subset of subgraphs that would improve overall performance. Usually these algorithms are incorporated into the compilation process, making the approach static. Hence, applications that are not compiled with these ISEs face a binary compatibility problem and cannot benefit from these accelerators. The authors have proposed a dynamic binary translation (DBT) approach to overcome binary incompatibility: it enables applications that were not compiled with these ISEs to benefit from the computation accelerators as well.
In principle, dynamic binary translation looks at a short sequence of code, typically on the order of a single basic block, translates it, and caches the resulting sequence. Code is only translated as it is discovered and where possible. The translation-time overhead can be amortized if translated code sequences are executed multiple times.
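The translate-and-cache principle can be sketched as follows; this is a generic illustration with hypothetical `fetch_block`, `translate`, and `run` callbacks, not a description of any particular DBT system:

```python
# Minimal sketch of a lazy translate-and-cache loop as used in dynamic
# binary translation: a basic block is translated on first execution,
# and the cached translation is reused afterwards, amortizing the
# one-time translation cost over repeated executions.

translation_cache = {}  # start address of basic block -> translated code

def execute(pc, fetch_block, translate, run):
    """Execute the basic block at address `pc`, translating it on first
    encounter and reusing the cached translation on later ones."""
    if pc not in translation_cache:
        block = fetch_block(pc)                   # discover code lazily
        translation_cache[pc] = translate(block)  # pay translation cost once
    return run(translation_cache[pc])             # reuse cached translation
```

Repeated calls with the same `pc` hit the cache, so `fetch_block` and `translate` run only once per block.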
Dynamic binary translation has proven effective in embedded systems for tasks such as power management, security, software caches, instruction set translation, and memory management. The authors use this technique to collapse critical computation subgraphs into ISEs at runtime, thereby mapping them onto the accelerators without the need to recompile. As this processing has to be done at runtime, it poses certain limitations. The authors describe their implementation using a dynamic binary translator, the difficulties in achieving it, and the methods used to overcome them.
In this document, the work done in [3] is explained in detail. Section 2 discusses similar work done to improve performance. Section 3 describes the methodology of the algorithms employed in static approaches. Section 4 explains a similar implementation to give a better understanding of the work done in [3], and Section 5 explains the implementation used in [3]. Finally, Section 6 concludes the work with an overview.
2 Related Work
Attempts to improve the performance of embedded systems have been made in many areas. Most research has been in the field of automating the generation of ISEs. Whenever a new accelerator is developed, or an existing accelerator is modified, an ISE suitable for the hardware must also be developed. The development of this ISE has to be monitored and tested well enough to guarantee the full benefits of the hardware. Automating the generation of the ISE avoids the time that would otherwise be invested in its design and testing, thereby allowing an earlier release of the product to market.
There has also been research on the hardware structure of accelerators. One example is an attempt to serialize register file accesses to increase the effective number of register file ports. Another is a flexible configurable compute accelerator that can be integrated into a pre-designed processor core through a simple interface.
Next is the usage of an accelerator: as described in Section 1, most other approaches are static. The identification of subgraphs and their mapping onto the accelerator is done during compilation, along with generating the ISEs. Some research also covers dynamic hardware approaches designed for trace-based systems.
The research most closely related to that of the authors is explained in [1]. It involves fusing dependent micro-ops into macro-ops to run on 3-1 ALUs, thereby increasing instruction-level parallelism. One limitation of this approach is that it focuses on only a specific architecture. It is a co-designed virtual machine approach with an enhanced superscalar microarchitecture and is explained in detail in Section 4.
3 Static Approach
The standard implementations of ASIPs incorporate accelerator support into the compilation. Hence, the performance of accelerators in these ASIPs greatly depends on compiler support. The compiler has two major tasks when targeting a computation accelerator. First, it must identify the candidate subgraphs in the target application that can be executed on the accelerator. This task, commonly known as subgraph isomorphism, gets complicated when an accelerator supports multiple functionalities, especially when some of them are supersets of others. The second task is to select which of these candidate subgraphs will actually be executed on the accelerator. Candidates often overlap, so the compiler must select a subset of them in order to maximize the performance gain.
For the compiler to be able to identify these subgraphs, the instructions supported by the accelerator have to be incorporated into the instruction set, i.e., an instruction set extension (ISE) has to be designed to match the accelerator. When an application is compiled with these ISEs, the subgraphs that can be executed by the accelerators are identified and replaced with suitable instructions that invoke an accelerator.
Initially, a greedy compiler approach was common. In this approach an operation (referred to as the seed) is selected and expanded as long as the result remains compatible with the accelerator. However, this approach produces only a sub-optimal solution and mostly breaks down for larger accelerators. There has been a lot of research in this area, and better, more complex algorithms have been developed.
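A minimal sketch of such a greedy seed expansion follows. It is an illustration, not a compiler implementation: the DFG is a hypothetical adjacency map, and the `is_supported` predicate stands in for the accelerator's constraints (size, operand count, supported operations).

```python
# Hypothetical greedy subgraph identification: starting from a seed
# operation, neighbors in the dataflow graph (DFG) are absorbed one at a
# time, as long as the grown subgraph is still supported by the
# accelerator. `dfg` maps each operation to its neighboring operations.

def grow_subgraph(dfg, seed, is_supported):
    subgraph = {seed}
    grown = True
    while grown:
        grown = False
        # candidate operations adjacent to the current subgraph
        frontier = {n for op in subgraph for n in dfg[op]} - subgraph
        for op in sorted(frontier):
            if is_supported(subgraph | {op}):
                subgraph.add(op)   # greedily absorb the operation
                grown = True
                break              # re-examine the enlarged frontier
    return subgraph
```

The sketch also shows why the approach is sub-optimal: expansion stops at the first infeasible neighbor ordering and never backtracks, which is exactly the failure mode on larger accelerators.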
Since the identification of subgraphs and the selection of candidate subgraphs for execution on the computation accelerator are done during compilation, the complexity and execution time of the algorithms are not highly restricted. Moreover, the dataflow and control-flow information of these subgraphs is available from the compilation, as the subgraphs are already identified, thereby placing no additional burden on execution. The availability of control-flow information eases the scheduling of the instructions and helps avoid conflicts.
4 Dynamic Approach for CISC Processors
The authors of [1] describe a dynamic approach to improve the performance of a traditional x86 processor using an enhanced superscalar microarchitecture and a layer of concealed dynamic binary translation software that is co-designed with the hardware. The main concept behind the proposed optimization is to combine dependent micro-op pairs into fused "macro-ops" that are managed throughout the pipeline as single entities. The authors state that, although a CISC instruction set architecture (ISA) already has instructions that are essentially fused micro-ops, higher efficiency and performance can be achieved by first cracking the CISC instructions and then rearranging and fusing them into different combinations than in the original code.
Figure 1: Overview of the proposed x86 design in [1]
The proposed implementation contains two major components: the software binary translator and the supporting hardware architecture. The interface between the two is the x86-specific implementation instruction set. A two-level decoder has been introduced as part of the proposed architecture. The decoder first translates the x86 instructions into micro-ops; the second decode level generates the decoded control signals used by the pipeline.
The pipeline is designed to have two modes: one to process the x86 instructions (x86 mode) and the other for fused macro-ops (macro-op mode). Profiling hardware is used to identify frequently used code regions (hotspots). As hotspots are discovered, they are organized into special blocks called superblocks, then translated and optimized as fused macro-ops. These fused macro-ops are placed into a concealed code cache. To reduce pipeline complexity, fusing is performed only for dependent micro-op pairs that have a combined total of two or fewer unique input register operands. When these macro-ops are executed, the first decode level shown in Figure 1 is bypassed; they only pass through the second decode level.
The dynamic binary translation software optimizes these hotspots by finding critical micro-op pairs for fusing: it analyzes the overall micro-ops, reorders them, and fuses pairs of operations taken from different x86 instructions. In the optimized macro-op code, paired dependent micro-ops are placed in adjacent memory locations and are identified via a special fuse bit. Two main strategies are used for fusing. First, single-cycle micro-ops are given higher priority as the head of a pair. Second, higher priority is given to pairing micro-ops that are close together in the original x86 code sequence. The reason is that these pairs are more likely to be on the program's critical path and should be scheduled for fused execution in order to reduce the critical-path latency. Another constraint is that the order of memory operations has to be maintained.
Algorithm Functionality
Figure 2: The two-pass algorithm used in [1]
Figure 3: Example of the two-pass algorithm from [1]
A forward two-pass scan algorithm is used to create fused macro-ops quickly and effectively. Once a data dependence graph has been created, the first pass considers single-cycle micro-ops one by one as tail candidates. For each tail candidate, the algorithm looks backward in the micro-op stream to find a head for it, scanning from the second micro-op in backward order to the last one (i.e., the first of the actual stream) in the block containing the translated code (the superblock). The constraints are to find the nearest preceding micro-op as head; this micro-op must be single-cycle and, most importantly, must produce one of the tail candidate's input operands. The fusing rules also favor dependent pairs with a condition code dependence. The pairs that have
satisfied the above conditions then go through a set of fusing tests. These tests make sure that no fused macro-op has more than two distinct source operands, breaks any dependence in the original code, or breaks memory ordering. Macro-ops with more than two source operands become an overhead on the pipeline, inducing more latency than the performance gain obtained. Clearly, breaking a dependence in the original code would produce incorrect results. Furthermore, the memory ordering hardware can be kept simple if the memory ordering is not broken while fusing the operations.
This concludes the first scan. In the second scan, the multi-cycle micro-ops are considered as candidate tails, and the same steps are run again to detect whether a suitable head can be located in the superblock.
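The two passes can be sketched as follows. This is a strongly simplified model, not the implementation from [1]: micro-ops are plain records with hypothetical fields, and only the single-cycle-head and two-operand constraints are modeled (the dependence-breaking and memory-ordering tests are omitted).

```python
# Hypothetical model of the two-pass fusing scan. A micro-op is a dict
# with a destination register, source registers, and a latency class.
# Pass 1 tries single-cycle tails, pass 2 multi-cycle tails; for each
# tail the nearest preceding single-cycle producer of one of its inputs
# is taken as head, if the pair has at most two distinct source operands.

def two_pass_fuse(uops):
    pairs = []    # list of (head_index, tail_index)
    used = set()  # indices already belonging to a fused pair

    def scan(tail_pred):
        for t, tail in enumerate(uops):
            if t in used or not tail_pred(tail):
                continue
            # look backward for the nearest preceding candidate head
            for h in range(t - 1, -1, -1):
                head = uops[h]
                if h in used or head["cycles"] != 1:
                    continue  # head must be a free single-cycle micro-op
                if head["dst"] not in tail["srcs"]:
                    continue  # head must produce one of tail's inputs
                # fusing test: at most two distinct source operands overall
                srcs = set(head["srcs"]) | (set(tail["srcs"]) - {head["dst"]})
                if len(srcs) <= 2:
                    pairs.append((h, t))
                    used.update({h, t})
                break  # only the nearest producing head is considered

    scan(lambda u: u["cycles"] == 1)  # pass 1: single-cycle tails
    scan(lambda u: u["cycles"] > 1)   # pass 2: multi-cycle tails
    return pairs
```

Note how the structure enforces the constraint discussed below: a multi-cycle micro-op may become a tail in the second pass, but the head is always single-cycle.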
Figure 3 illustrates a good example, showing how x86 code is decoded into micro-ops and how dependent pairs are then fused into macro-ops. The translator first cracks the x86 operations into micro-ops, as depicted in Figure 3b. Reax denotes the native register to which the x86 eax register is mapped. The long immediate 080b8658 is allocated to register R18, as it is used often. First, a dependence graph is
Guiding Computation Accelerators to Performance Optimization Dynamically
built for the translated instructions. Then the two-pass fusing algorithm looks for pairs of dependent single-cycle ALU micro-ops during the first scan. In the current example, the AND and the first ADD are fused (marked by :: in Figure 3c). The fused pair causes a reordering of the instructions: the AND operation, moved up, would overwrite the value of Reax that the store operation still needs. Register assignment is used to resolve such issues; here R20 is assigned to hold the value from the ADD operation so that it can be used by both the AND and the ST operation. As the fusing algorithm also considers multi-cycle micro-ops as candidate tails during the second pass, the last two dependent micro-ops are fused together. Even though the tail is a multi-cycle micro-op, the head remains a single-cycle micro-op, a constraint this algorithm enforces.
The two-pass algorithm described here proves more advantageous than the single-pass algorithm used in [2]. The single-pass algorithm would aggressively fuse the first ADD with the following ST operation, even though that pair is not on the critical path. Using memory instructions as tails may also slow down the wakeup of the entire pair, losing cycles when the head micro-op is critical for another dependent micro-op. Although the two-pass algorithm comes with slightly higher translation overhead and fewer fused micro-ops overall, the generated code runs significantly faster in pipelined issue logic.
Observation

A co-designed virtual machine paradigm is applied to improve the efficiency and performance of an x86 processor. With cost-effective hardware support and co-designed runtime software optimizers, the VM approach achieves higher performance in macro-op mode with minimal performance loss in x86 mode during startup. It optimizes the many micro-ops generated by the translator from the x86 code and is applicable to CISC processors in general. The proposed implementation improves x86 IPC performance by 20% on average over a comparable conventional superscalar design. The large performance gain comes from macro-op fusing, which treats fused micro-ops as single entities throughout the pipeline to improve instruction-level parallelism (ILP) and reduces communication and management overhead. Other features, such as superblock code re-layout, a shorter decode pipeline for optimized hotspot code (the first-level decoder is skipped), and the use of a 3-1 ALU (which reduces latency for some branches and loads), also contribute to the performance improvement. This implementation is a promising approach to the thorny and challenging issues present in CISC ISAs such as the x86.
5 Dynamic Optimization for Computation Accelerators

The authors of [3] have proposed an approach to dynamically optimize the utilization of computation accelerators. It is more generic than the approach
discussed in [1], which focuses mainly on CISC processors. Another significant feature of this approach is that it is a purely software-oriented optimization. The authors describe the techniques used to incorporate accelerator utilization into dynamic binary translation, overcoming the binary compatibility problems posed by not compiling applications with the ISEs. Because it is applied at runtime, the implementation faces certain limitations; the methods used to overcome them are also explained here.
5.1 Integration

The accelerator utilization process is integrated into a dynamic binary translation system by introducing the authors' optimization technique between the trace-formation and superblock-cache modules.
The basic flow of a dynamic binary translation system consists of three stages and a manager module responsible for high-level control. In the first stage, instructions are interpreted and emulated, and hotspot regions are searched for during emulation. If a hotspot region is identified, it is forwarded to the trace formation stage, where the translator continuously translates instructions until the stopping conditions are met. The translated instructions are formed into large blocks called superblocks. These superblocks undergo several optimization techniques, and the optimized code is placed into a cache called the superblock cache, whose code blocks are indexed through an address map table. After an initial warmup, once some optimized blocks have been placed into the superblock cache, each instruction to be interpreted is first looked up in the cache to check whether a suitable mapping is already present. On a hit, the code is fetched from the cache and executed; on a miss, the instruction is passed to the interpretation stage and the flow continues.
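As a toy illustration of this warmup behavior (not the authors' system), the dispatch between interpreter and superblock cache can be sketched as follows; the hot-threshold value and all names are invented:

```python
def simulate_dispatch(block_addresses, hot_threshold=3):
    """Dispatch a stream of basic-block addresses the way a DBT manager
    would: interpret and profile until a block becomes hot, then place
    its optimized superblock into the cache and serve later executions
    from there. Returns the number of superblock-cache hits."""
    superblock_cache = set()   # stands in for the address map table
    exec_counts = {}
    hits = 0
    for addr in block_addresses:
        if addr in superblock_cache:
            hits += 1          # hit: fetch optimized code from the cache
            continue
        # miss: interpret/emulate the block and update its hotness counter
        exec_counts[addr] = exec_counts.get(addr, 0) + 1
        if exec_counts[addr] >= hot_threshold:
            # hotspot detected: trace formation, optimization, cache insert
            superblock_cache.add(addr)
    return hits
```

For a block executed five times with a threshold of three, the first three executions are interpreted and the last two are served from the superblock cache.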
The accelerator utilization process proposed by the authors of [3] is incorporated as one of the optimization techniques in the optimization stage (indicated as the gray part of Figure 4). It is regarded as a special kind of instruction-set-specific optimization. Apart from it, only a few other required optimization techniques were used in their implementation, so as to fully measure the performance of their technique. These additional techniques include indirect branch (e.g. jump) removal and superblock chaining (identifying the dependencies among the superblocks and scheduling them appropriately).
Unlike the static approach, which operates on compiled code for which the data-flow and control-flow graphs have already been constructed, the dynamic approach lacks such graphs and therefore faces many problems. Constructing the exact control flow graph at runtime can be time-consuming or even impossible, and without proper control-flow information the dependencies among the data blocks cannot be identified.
The authors therefore concentrate on dataflow analysis and subgraph mapping, using dynamic binary translation to map critical dataflow subgraphs onto ISEs at runtime without any control flow information from compilation.
5.2 Functional Description

Figure 4: A typical DBT workflow from [3]
The main factor to be considered is execution time. In a static approach, dataflow analysis and subgraph mapping are performed on intermediate code with the help of control flow information from the compilation framework, and their cost is therefore not counted toward the application's actual execution time. In a dynamic approach, by contrast, dataflow analysis and subgraph mapping are performed on the final binary code, and without any control flow information. Because they run at runtime, the complexity of the algorithms has to be kept in check, since their execution time is counted into the actual execution time of the application. A further constraint of working on the final binary is that the number of intermediate variables is limited to the architecture registers. Although limited in number, the use of these registers offers extra benefits. Sections 5.3 and 5.4 explain the major functionalities in detail.
5.3 Dataflow Analysis

Dataflow analysis is an important prerequisite for compiler optimizations. The identified dataflow graphs are mapped onto the accelerators, where they can be executed efficiently, to increase performance. The dataflow analysis is split into two parts: (1) intra-block dataflow analysis, which identifies the dependent instructions within a superblock, and (2) inter-block dataflow analysis, which avoids unsafe code transformations that could be caused by live-out registers of one block being used in another.

Obtaining complete dataflow information at runtime is not a good option, as it could take too long and in turn affect overall performance. Hence, in the current implementation the dataflow is analyzed block by block.
5.3.1 Intra-block Dataflow Analysis

The usual algorithm used to build a dataflow graph is a simple brute-force pass over the list of instructions with a nested scan: for each instruction, all previous instructions are checked to see whether the current instruction uses one of their results. If so, a dataflow edge is set from the previous instruction to the current one. This results in an algorithm of O(n²). Moreover, such algorithms typically run on intermediate code before register allocation, so they can use any number of variables to store temporary values.
As dynamic binary translation systems perform this analysis on the final binary form of an application, the number of variables is restricted to the number of architecture registers. However, the use of architecture registers provides an extra benefit, which the authors exploit. Their algorithm maintains an array with one entry per register, storing the number of the instruction that modified that register last. Each instruction has one target register, where the result is stored, and at most two further registers, the source registers, which contain the data required for performing the operation. For each instruction, the source registers are looked up in this array to see whether they were modified by a previous instruction in the current block. If the corresponding entry is not zero, a dataflow edge is set from that instruction to the current one. Thereby the complexity of the algorithm is reduced to O(n). It also proved to be between 68% and 96.82% effective in the benchmarks run by the authors.
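A sketch of this last-writer scheme follows; the instruction tuples and the register count are illustrative, not the paper's data structures:

```python
def build_dataflow_edges(instrs, num_regs=32):
    """Build intra-block dataflow edges in a single O(n) pass using an
    array that records, per register, the last instruction that wrote it.
    `instrs` is a list of (dst_reg, src_regs) tuples; a table entry of 0
    means the register has not been modified in this block yet."""
    last_writer = [0] * num_regs
    edges = []
    for i, (dst, srcs) in enumerate(instrs, start=1):   # 1-based indices
        for s in srcs:
            if last_writer[s] != 0:                     # produced in-block
                edges.append((last_writer[s] - 1, i - 1))
            # else: the value is live-in from a predecessor block
        last_writer[dst] = i                            # record the writer
    return edges
```

Each instruction does only constant work per source register, which is where the reduction from the O(n²) pairwise scan comes from.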
5.3.2 Inter-block Dataflow Analysis

Although the dataflow of a subgraph is contained within the superblock in most cases, subgraphs near block borders have to be handled with care. If a block has live-out nodes (registers written within the block that are still live at its end), they have to be killed in the successor block; otherwise unsafe code transformations may result.
For example, if a target register defined within the current subgraph is not consumed by its end, it is considered a live-out node. Successor blocks using such registers have to be informed of them so that they can redefine these registers before using them. Consider the subgraph surrounded by dashed lines in Figure 5, which corresponds to instructions 1, 3 and 5 of the machine code. From this subgraph it can be seen that register $2 is a live-out register. If the successor subgraphs outside the superblock redefine register $2 before using it, the authors suggest that the subgraph can be ported to a 1-output accelerator; otherwise the accelerator must be at least a 2-output one.
The algorithm proposed by the authors uses register masks to identify these live-out nodes and kill them. The registers used in the block are the input of the algorithm; whenever a register is read, its mask bit is set to zero, indicating that its value has been consumed by the current instruction. If the mask bit of a modified register is still set to one at the end of the block, that register is live-out. The bit mask is passed on to the successor block to notify it of the live-out nodes. If these live-out nodes are killed by the end of the successor block, there is a dependency between the two subgraphs; this dependency information is used during scheduling to avoid unsafe code transformations.
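A hedged reconstruction of the mask bookkeeping (one bit per architecture register; each instruction given as a (dst, srcs) tuple, which is an assumed encoding):

```python
def liveout_mask(instrs):
    """Compute the live-out bit mask for a block: reading a register
    clears its bit (the value was consumed inside the block), writing a
    register sets it. Bits still set at the end mark live-out registers,
    and the mask is handed to the successor block."""
    mask = 0
    for dst, srcs in instrs:
        for s in srcs:
            mask &= ~(1 << s)   # source consumed within this block
        mask |= 1 << dst        # (re)defined: live-out candidate
    return mask
```

For the three-instruction block in the test below, registers $2 and $4 are written but never read again, so exactly their bits remain set.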
Figure 5: An example of inter-block dataflow [3]
Figure 6: Examples of unsafe subgraphs [3]
This algorithm has proven to be 19.9% to 54.51% effective for different applications. The downside is that, being an unrestricted depth-first search, it takes a long time for certain applications. This can be resolved by placing a limit on the maximum search depth.
5.4 Subgraph Mapping

5.4.1 Safety Checking

Now that the dataflow information for the superblock is available, subgraphs have to be identified and formed into ISEs. Subgraph mapping involves (1) collapsing several instructions into an ISE and (2) reordering code to group the dependent instructions. The subgraphs have to be chosen such that the safety of the code remains intact.
Figure 7: An example of subgraphs among blocks [3]
Some unsafe subgraph mappings can be seen in Figure 6. Figures 6(a) and 6(b) show subgraphs with cyclic dependences. The problem in Figure 6(a) is referred to as a non-convex subgraph: a cyclic dependence is formed between operations inside and outside the subgraph. Hence the authors' implementation makes sure the instructions of a subgraph have no such side path. Figure 6(b) shows two subgraphs that could each become an ISE but are interdependent. Such situations are avoided by choosing only one of the subgraphs as an ISE at a time.
Figure 6(c) shows another form of unsafe code transformation: it would be unsafe to place the subgraph at the third instruction, as register $10 is overwritten by the second operation. Hence the placement of the subgraphs has to be chosen carefully.
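The convexity condition of Figure 6(a) can be tested with a simple reachability check. This is a simplified sketch under an assumed edge-list encoding, not the authors' implementation:

```python
from collections import defaultdict, deque

def is_convex(edges, subgraph):
    """Return False if some dataflow path leaves the candidate subgraph
    and re-enters it (the non-convex case of Figure 6(a)). `edges` is a
    list of (producer, consumer) node pairs."""
    succ = defaultdict(list)
    for a, b in edges:
        succ[a].append(b)
    sub = set(subgraph)
    # start from successors that lie outside the subgraph
    frontier = deque(d for n in sub for d in succ[n] if d not in sub)
    seen = set()
    while frontier:
        n = frontier.popleft()
        if n in seen:
            continue
        seen.add(n)
        if n in sub:
            return False        # a side path re-enters the subgraph
        frontier.extend(succ[n])
    return True
```

In the test below, grouping nodes 0 and 1 is unsafe because node 2 sits on a path from 0 back into the group, while grouping 0 and 2 is convex.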
5.4.2 Subgraph Mapping among Blocks

A further advantage of runtime optimization is that the block boundaries are known. Additionally, a profiler can be used to identify the critical paths, which is not possible in static approaches. Using this information, instructions can be moved among the blocks to form better subgraphs. An example of this can be seen in Figure 7.
5.4.3 Subgraph Mapping Strategy

After the initial checks are done on the subgraphs obtained from the basic blocks, the mapping strategy comes down to two basic steps. First, the subgraphs have to be enumerated to obtain the critical sections that can be executed on an accelerator. Second, a subset of these subgraphs is selected that results in optimal performance. As the mapping has to be done at runtime, the authors have come up with a variant of the greedy approach, which marks the
nodes that have been considered once. An operation is selected as a seed and expanded until a jump in control flow is observed. When selecting a new seed, only unmarked nodes are considered. The subgraphs obtained in this way are then mapped onto the accelerators.
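One way to realize this marking seed-and-grow enumeration is sketched below; the dataflow-graph encoding and branch flags are illustrative assumptions:

```python
def enumerate_subgraphs(dfg, is_branch):
    """Greedy enumeration: take the lowest-numbered unmarked node as a
    seed, grow it along dataflow edges until a control transfer is hit,
    and mark every visited node so later seeds skip it. `dfg` maps a
    node to its dataflow successors; `is_branch` flags control jumps."""
    marked = set()
    subgraphs = []
    for seed in sorted(dfg):
        if seed in marked:
            continue
        group, stack = [], [seed]
        while stack:
            n = stack.pop()
            if n in marked:
                continue
            marked.add(n)
            group.append(n)
            if not is_branch.get(n, False):   # stop expanding at a jump
                stack.extend(dfg.get(n, []))
        subgraphs.append(sorted(group))
    return subgraphs
```

Because every node is marked at most once, the enumeration stays linear in the graph size, which matters given that it runs at translation time.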
6 Conclusion

Most research on improving the performance of accelerators has concerned the hardware (static or dynamic) and the automatic generation of ISEs. The authors of [3], in contrast, have proposed a dynamic approach to utilizing accelerators. Another approach, the co-designed virtual machine paradigm of [1], is also explained here to provide a better understanding of the accelerator workflow. The algorithms proposed in [3] for dataflow analysis and subgraph mapping at runtime using dynamic binary translation have proven to be relatively effective for applications that are not compiled with the ISEs. Although many safety checks must be performed, the use of the architecture registers in the runtime algorithms has paved the way for good results.
References

[1] S. Hu, I. Kim, M. H. Lipasti, and J. E. Smith. An approach for implementing efficient superscalar CISC processors. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1598111&tag=1, February 2006.

[2] S. Hu and J. E. Smith. Using dynamic binary translation to fuse dependent instructions. http://dl.acm.org/citation.cfm?id=977395.977670&coll=DL&dl=ACM&CFID=61907142&CFTOKEN=18787638, March 2004.

[3] Ya-shuai Lü, Li Shen, Zhi-ying Wang, and Nong Xiao. Dynamically utilizing computation accelerators for extensible processors in a software approach. http://dl.acm.org/citation.cfm?doid=1629435.1629443, October 2009.
A Case for Lifetime-Aware Task Mapping in Embedded Chip Multiprocessors

André Koza
University of Paderborn
koza@mail.uni-paderborn.de
January 13, 2012
Abstract

The lifetime of embedded systems is an important reliability factor: unpredicted failures of essential components can become a bottleneck for overall system lifetime. There are different approaches to increasing lifetime. One is to add resources to the system that cover for component failures; another is to change the way existing resources are used. In this seminar paper, three approaches that enhance system lifetime are presented. One focuses on lifetime-cost Pareto-optimal slack allocation, where slack denotes resources that are initially not required but to which tasks and memory of failed components can be remapped. The other two approaches focus on lifetime-aware task mappings, i.e. task mappings whose goal is to improve lifetime. All three approaches increase system lifetime; while slack allocation needs additional investment in hardware, task mapping only needs a change in software.
1 Introduction

Lifetime reliability of embedded chip multiprocessors has become important, as unforeseen system failures can have dramatic consequences, e.g. the failure of a security system in an automobile. System lifetime therefore has to be addressed in the design of the system [6]. Recent strategies either take a system-level approach, in which the hardware or the communication architecture is changed [9], or improve lifetime by changing the way resources are used, e.g. by task mapping [6][7].
In this seminar paper, three recent approaches to improving system lifetime in embedded systems are discussed. First, we look at a method for cost-effective slack allocation [9], which focuses on how to allocate additional resources so that the system survives when single parts of it fail. The authors use slack to increase the system lifetime of NoC-based (Network-on-Chip) MPSoCs (MultiProcessor Systems-on-Chip). Slack means additional execution and storage resources that are not required in the
standard running state; when components fail, tasks and data of the failed components can be scheduled and mapped to these resources. In their Critical Quantity Slack Allocation (CQSA) technique, the authors try to find an optimal tradeoff between cost and lifetime improvement. The challenge in slack allocation is that the design space can be large and complex, i.e. there are many choices of where and how much slack to allocate. With CQSA it is possible to find designs within 1.4% of the lifetime-cost Pareto-optimal front while exploring only 1.4% of the design space.
After this system-level approach, which changes the hardware, the next two approaches are based on nature-inspired techniques: simulated annealing (SA) [7] and ant colony optimization (ACO) [6]. They target the allocation and scheduling of tasks so as to avoid overusing some resources while others are idle or at least less used. Overused resources age faster than others and, due to wearout, eventually fail earlier; they therefore become a reliability bottleneck that reduces system lifetime. The authors of [7] propose a lifetime reliability-aware task allocation for MPSoCs that uses simulated annealing. Their motivation is that wearout-related failures of components have to be considered during the task allocation and scheduling process, since the failure of important components reduces reliability and system lifetime. To compensate for this, a task allocation is developed that takes several wearout-related factors such as temperature, circuit structure, and voltage into account; the algorithm used for that task allocation is based on simulated annealing.
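As a bare-bones illustration of how simulated annealing can drive a task allocation, consider the sketch below. The cost function, cooling schedule, and all parameters are placeholders; the actual wearout model of [7] is far more detailed:

```python
import math
import random

def sa_allocate(tasks, cores, cost, steps=2000, t0=5.0, seed=1):
    """Simulated-annealing task allocation: start from a random
    task->core mapping and accept worsening moves with a probability
    that shrinks as the temperature cools. `cost` scores a mapping
    (lower is better, e.g. a predicted wearout penalty)."""
    rng = random.Random(seed)
    cur = {t: rng.choice(cores) for t in tasks}
    cur_cost = cost(cur)
    best, best_cost = dict(cur), cur_cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9      # linear cooling
        cand = dict(cur)
        cand[rng.choice(tasks)] = rng.choice(cores)  # move one task
        c = cost(cand)
        # always accept improvements; accept regressions with prob e^(-d/T)
        if c < cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = dict(cand), c
    return best
```

With a cost function that penalizes piling tasks onto one core (a crude stand-in for accelerated wearout of that core), the search converges to a balanced mapping.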
The third approach presented in this seminar paper also works on the task allocation to gain lifetime improvements. The authors of [6] propose a lifetime-aware task mapping technique based on nature-inspired ant colony optimization. They sought a method for improving system lifetime without having to invest in additional hardware, as slack allocation does. Their starting point was temperature-aware task mapping, but they concluded that considering temperature alone leads to high fluctuation in system lifetime. Therefore they considered further factors such as electromigration and time-dependent dielectric breakdown. In their ACO-based method, artificial ants explore a graph representation of a task mapping. The ants share information about good paths in the task graph, and based on that information, subsequent ants select paths that have previously proven to be good. The authors showed on a wide spectrum of benchmarks that their approach reaches a system mean time to failure within 17.9% of the observed optimum.
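The ant-and-pheromone mechanism can be sketched as follows. The ant count, evaporation rate, deposit rule, and cost function are invented placeholders; the real method of [6] scores mappings with a detailed wearout/MTTF model:

```python
import random

def aco_map(tasks, cores, cost, iters=50, evap=0.5, seed=1):
    """Tiny ant-colony sketch of task mapping: each ant builds a
    task->core assignment biased by pheromone, and the best mapping
    found so far deposits extra pheromone on its task->core choices.
    `cost` scores a full mapping (lower is better)."""
    rng = random.Random(seed)
    pher = {(t, c): 1.0 for t in tasks for c in cores}
    best, best_cost = None, float("inf")
    for _ in range(iters):
        for _ant in range(5):
            # each ant assigns every task, weighted by pheromone level
            m = {t: rng.choices(cores, [pher[(t, c)] for c in cores])[0]
                 for t in tasks}
            c = cost(m)
            if c < best_cost:
                best, best_cost = m, c
        for k in pher:                 # evaporation
            pher[k] *= evap
        for t, cr in best.items():     # reinforce the best-known path
            pher[(t, cr)] += 1.0 / (1.0 + best_cost)
    return best
```

Evaporation keeps old, poor choices from dominating, while reinforcement makes later ants favor assignments that scored well earlier, mirroring the path-sharing behavior described above.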
This seminar paper is organized as follows: Section 2 briefly introduces work related to the presented approaches. Section 3 describes the different methods for improving system lifetime in detail, with a focus on ACO-based task mapping. Section 4 then compares the methods with each other with respect to effectiveness and cost. The paper ends with a conclusion in Section 5.
2 Related Work

Two other approaches besides the one presented in this paper also use slack allocation to optimize cost and lifetime. The first minimizes area while selecting processing elements and then changes the processor selection to increase lifetime [10]. The other works similarly to the presented approach but does not use storage slack [5].
The meta-heuristic simulated annealing was first introduced in [8] and [2], where it was used to approximate solutions to the NP-complete traveling salesman problem. The task allocation problem has been shown to be NP-complete as well, so the authors of [7] adapted simulated annealing to it.
Ant colony optimization is also a meta-heuristic; it was first described in [4]. Prior to the work of [6], which is presented in this paper, ACO was used to solve task mapping problems in [1] and [3], although there, in contrast to [6], performance rather than system lifetime was optimized.
3 Lifetime Improvements in Embedded Systems

In this section the previously introduced approaches are described in detail, with a focus on ACO-based task mapping. To allow a comparison, the two other methods for lifetime improvement are presented first: we take a close look at the system-level approach of slack allocation before coming to the task allocations based on simulated annealing and ant colony optimization.
3.1 Lifetime Improvement by Slack Allocation

One way to increase the lifetime of embedded systems is to provide additional, not directly required resources, called slack, which compensate for failed components. Both data and tasks are remapped and rescheduled to these previously underused resources to avoid complete system failure. While this method gives the system a chance to survive the failure of single components, the drawback is that one has to invest in additional hardware. In a system as a whole there are many possibilities for where and how much slack should be allocated. The goal is to find a lifetime-cost Pareto-optimal front [9], i.e., a slack allocation that offers the best trade-off between lifetime and cost.

The authors of [9] focus on embedded network-on-chip-based multiprocessor systems-on-chip (NoC-based MPSoCs) and try to optimize system lifetime and system manufacturing cost by selecting where and how much slack to allocate. The challenge in finding an optimal slack allocation is that the number of possible allocations is exponential in the number of resources [9]. The authors developed a technique called Critical Quantity Slack Allocation (CQSA) to reach these goals.
The lifetime of embedded systems can be increased at the system level in three ways. First, execution slack can be allocated by replacing slow processors with faster processors. Second, storage slack can be allocated by replacing small memories with bigger memories. Third, the communication architecture can be changed: switches and links are added or modified, and additional processors and memories are put into the system. The task is then to determine how to increase lifetime cost-effectively. CQSA focuses on slack allocation and does not deal with changing the communication architecture.
3.1.1 General Working of CQSA

For CQSA to work, the following is assumed to be given. The computation, storage and communication requirements are known for each task that is executed. There is also a fixed communication architecture for a single-chip multiprocessor. Finally, an initial mapping of computational tasks to processors, storage tasks to memories and communication to links and switches is given [9]. With this, CQSA determines a slack allocation that optimizes both system lifetime and cost.

To survive a component failure, enough slack has to be allocated. The amount of slack needed to compensate for the failure of a component is defined as the critical quantity of slack for that component [9]. For a component C the critical quantity is written as a pair (es, ss), where es is the execution slack and ss the storage slack required to replace the resources of C; these resources would become unreachable in case of a failure. There is a distinction between processor, memory and switching components: processors only have critical quantities of execution slack (es, 0), and memories only have critical quantities of storage slack (0, ss), while switches can have both execution and storage slack.

The authors of CQSA state that it is most cost-effective to allocate slack around switches [9]. If slack is allocated to handle processor and memory failures, this allocation can, at no additional cost, also be used for the switch which interconnects the processors and memories. Allocating slack for switches partitions the design space, and because switches connect many components, the complexity of CQSA grows only slowly with an increasing number of overall components.
3.1.2 CQSA Algorithm

The CQSA algorithm consists of three stages. Stage 0 begins by allocating execution slack to overcome single component failures of processors. To achieve this, execution slack is greedily increased until the smallest execution-slack-only critical quantity (es, 0) is reached [9]; that is, the amount of slack can at least cover each single processor failure. Next, stage 1 also considers execution slack but now focuses on situations in which switches may fail. For switches that only need execution slack, additional slack is allocated; for that, each critical quantity (es, 0) with es > 0 is considered. In stage 2, storage slack is considered as well. This stage is executed for each critical quantity (es, ss) with es ≥ 0 and ss > 0. First, an exhaustive search is executed to find a slack allocation of (es, ss) that optimizes the mean time to failure (MTTF). This allocation probably does not lie on the Pareto-optimal front because it only considers MTTF and ignores cost. The MTTF-optimized allocation is used as an initial slack allocation which is compared with other allocations. The algorithm then executes a loop that computes two new allocations for comparison: in the first, execution slack is greedily increased (with regard to MTTF), and in the second, storage slack is greedily increased (also with regard to MTTF). Of the three allocations (the two new ones and the current one), the one with the best cost-MTTF trade-off is selected and used as the starting point for the next iteration of the loop (and in the comparison). This loop is repeated until no more allocations can be found.
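The stage-2 loop could be sketched as follows. This is a minimal Python illustration under our own assumptions, not the authors' implementation: the helpers `mttf`, `cost`, `grow_es` and `grow_ss`, as well as using the ratio MTTF/cost as the trade-off metric, are hypothetical choices made here for concreteness.

```python
def stage2_loop(initial, mttf, cost, grow_es, grow_ss):
    """Sketch of CQSA's stage-2 refinement (assumed interfaces).

    `grow_es`/`grow_ss` greedily add execution or storage slack and return
    a new allocation, or None when no further allocation exists.
    """
    explored = [initial]
    current = initial
    while True:
        # compute two new candidate allocations for comparison
        candidates = [c for c in (grow_es(current), grow_ss(current))
                      if c is not None]
        if not candidates:
            return explored               # no more allocations can be found
        # of the computed allocations, pick the best cost-MTTF trade-off
        nxt = max(candidates, key=lambda a: mttf(a) / cost(a))
        if mttf(nxt) / cost(nxt) <= mttf(current) / cost(current):
            return explored               # current allocation stays best
        current = nxt                     # starting point for next iteration
        explored.append(current)
```

A toy run with allocations modeled as (es, ss) pairs shows the loop walking along increasingly slack-rich allocations until the trade-off stops improving.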
3.1.3 Evaluation of CQSA

The authors used two setups to evaluate CQSA. In the first, smaller setup they performed an exhaustive search for the globally Pareto-optimal allocation of slack and compared it with the allocation found by CQSA. In the second setup they used a large benchmark to estimate how CQSA scales. In addition to the comparison with the Pareto-optimal allocation, three other slack allocation approaches were compared to CQSA: optimal execution slack allocation (Optimal ESA), greedy slack allocation (Greedy SA) and random slack allocation (Random SA). Optimal ESA finds a set of Pareto-optimal designs that only allocate execution slack. Greedy SA adds execution and storage slack greedily in iterations, where each iteration selects the allocation with the best cost-lifetime trade-off. Random SA chooses a random allocation from all possible allocations.

The authors observed that their approach is the most accurate in the first setup, where the optimal result found by exhaustive search was used as a reference: CQSA finds allocations within 1.81% of the optimum while exploring only 1.7% of the design space. The other approaches all produced worse results.

In the larger setup the authors used the best allocation found by any approach as the observed optimum (as exhaustive search is impractical due to the large size of the setup). In that benchmark CQSA again showed the best results. Another important observation was that the number of allocations CQSA evaluated grew only by a factor of 10 while the whole design space increased by a factor of 10^5.

To sum up, over all examples CQSA found slack allocations within 1.4% of the lifetime-cost Pareto-optimal front while exploring only 1.4% of the design space on average [9]. In the smaller benchmark CQSA was able to increase system lifetime by 22%. The authors, however, do not mention at what cost this lifetime improvement was achieved; only for one example run do they explicitly state that lifetime improved by 50% at a 62% cost increase. This also shows the big drawback of slack allocation: one has to invest a significant amount of money to increase system lifetime. The next two sections present methods that improve lifetime without additional investments in hardware.
Andre Koza
3.2 Simulated Annealing

In contrast to the previously introduced approach of increasing lifetime by slack allocation, this section presents a method that targets the task allocation and scheduling process for lifetime improvement. In [7] the authors state that if tasks are allocated in such a way that some processors are used more heavily than others, those processors will age faster and eventually fail earlier. If these processors are mandatory for the system, they become a reliability bottleneck and reduce overall system lifetime. To handle this, the authors developed a lifetime-reliability-aware task allocation and scheduling algorithm for MPSoCs, based on the nature-inspired technique of simulated annealing (SA).

Task allocations in prior work that seek to increase system lifetime focused mainly on reducing the system temperature, due to the strong relationship between temperature and lifetime [7]. It has been shown, however, that considering temperature alone does not substantially increase the lifetime of embedded systems [6]. Thus the authors propose to take other factors, such as internal structure, operating frequency or voltage, into account in a lifetime-aware task allocation. They investigated which errors can occur and how to increase the lifetime reliability of embedded systems, and came to the conclusion that avoiding permanent hard errors yields the best reliability and therefore the best lifetime improvement. The work focuses on time-dependent dielectric breakdown, electromigration and negative bias temperature instability; these failure mechanisms are used to estimate the MTTF of the systems.

The problem of allocating tasks to processors is NP-complete [7]. Thus, except for very small problems, exact approaches cannot be realized in an acceptable runtime. To overcome this, the authors developed a heuristic approach based on SA to solve the task scheduling problem.
3.2.1 Simulated Annealing Algorithm

Simulated annealing is a meta-heuristic for finding approximations to the global optimum of very large functions for which exhaustive search is infeasible in an appropriate runtime. To approximate an optimal solution with SA, a random initial solution is chosen at the beginning. In the case of task allocation, a random valid allocation of tasks to processors is chosen, where valid means that no precedence constraints or deadlines are violated. That solution is probably not the optimum. In the next step of the algorithm, a single random change to the task allocation is made. If the new allocation is better (i.e., closer to the optimum than the previous solution), it is always accepted. If, on the other hand, the solution becomes worse, it is only accepted with a certain probability. This probability is controlled by a variable called temperature: the higher the temperature, the higher the probability that a worse solution is accepted. This is done because otherwise the algorithm could get stuck in a local minimum. The temperature starts at a high value and decreases over time via a cooling rate until an end temperature is reached; at each temperature the algorithm makes a certain number of moves before the temperature is decreased. With lower temperature, the probability that worse solutions are accepted decreases. At the beginning of the algorithm the decision whether to accept a worse solution is nearly random; at the end this probability is very small and almost only improvements are accepted. If SA is run infinitely long, it will eventually output the optimal result. It has been shown that SA finds good approximations for the traveling salesman problem [2] [8], and in [7] it is adapted to find lifetime-aware task allocations.

Figure 1: Example of a simple task graph (taken from [7])

Figure 2: Example of task graph transformations: (a) expanded graph Ĝ, (b) complement graph G̃ (taken from [7])
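The annealing loop described in this section can be sketched as a generic Python skeleton. This is not the code of [7]: the solution representation, neighbor function and energy (cost) function are supplied by the caller, and only the default parameter values mirror those reported for the SA-based task allocation in Section 3.2.2 (start temperature 100, cooling rate 0.95, end temperature 10^-5, 1000 moves per temperature).

```python
import math
import random

def simulated_annealing(initial, neighbor, energy,
                        t_start=100.0, t_end=1e-5, cooling=0.95, moves=1000):
    """Generic SA skeleton (a sketch under assumed interfaces).

    `neighbor` returns a random single-change variant of a solution and
    `energy` is the cost to minimize (lower is better).
    """
    current = initial
    best = current
    t = t_start
    while t > t_end:
        for _ in range(moves):
            cand = neighbor(current)
            delta = energy(cand) - energy(current)
            # always accept improvements; accept worse solutions with a
            # probability exp(-delta/t) that shrinks as temperature drops
            if delta < 0 or random.random() < math.exp(-delta / t):
                current = cand
                if energy(current) < energy(best):
                    best = current
        t *= cooling                      # geometric cooling schedule
    return best
```

For a quick sanity check one can minimize a one-dimensional function such as (x - 3)^2 with small step perturbations as the neighbor function.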
3.2.2 SA-based Task Allocation

A task allocation is described using a directed acyclic task graph G = (V, E), where each node v ∈ V represents a task and each edge e ∈ E represents a precedence constraint. An illustration of a task graph can be found in Figure 1. A task allocation is then represented as (schedule order sequence; resource assignment sequence). An example is (0, 2, 1, 3, 4; P1, P1, P2, P1, P2): there are five tasks and two processors (P1 and P2); task 0 is scheduled first, followed by tasks 2, 1, 3 and 4; tasks 0, 2 and 3 are executed on processor P1, and tasks 1 and 4 on P2 [7].
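As a tiny illustration (the helper is ours, not from [7]): the two sequences pair up positionally, so the i-th entry of the resource assignment sequence names the processor of the i-th scheduled task.

```python
def tasks_per_processor(schedule_order, assignment):
    """Decode a (schedule order; resource assignment) pair by grouping
    tasks under the processor they are assigned to."""
    mapping = {}
    for task, proc in zip(schedule_order, assignment):
        mapping.setdefault(proc, []).append(task)
    return mapping
```

Applied to the example above, this yields tasks 0, 2 and 3 on P1 and tasks 1 and 4 on P2, matching the text.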
To find new solutions from a random initial solution within the simulated annealing process, graph transformations are executed. First, there is an expanded task graph Ĝ = (V, Ê). This graph has the same nodes as G but additional edges: if there is a (possibly transitive) precedence constraint between two nodes of G, a directed edge between these two nodes is added in Ĝ. In the graph from Figure 1, an edge would be added from node 2 to node 4. An illustration of the Ĝ resulting from G is given in Figure 2(a). Next, another graph is created: an undirected complement graph G̃ = (V, Ẽ). In this graph there is an undirected edge (vi, vj) in Ẽ if and only if there is no precedence constraint between vi and vj in Ĝ [7]. An illustration is shown in Figure 2(b).

The authors define a valid schedule order as an order of tasks that conforms to the partial order defined by the task graph G. Furthermore, they formulate a lemma as follows: "Given a valid schedule order A = (a1, a2, ..., a|V|), swapping adjacent nodes leads to another valid schedule order, provided there is an edge between those two nodes in graph G̃" [7]. Next, they state a theorem: "Starting from a valid schedule order A = (a1, a2, ..., a|V|), we are able to reach any other valid schedule order B = (b1, b2, ..., b|V|) after finite times of adjacent swapping" [7]. Then, to reach all possible solutions, three kinds of moves are used in the algorithm: "M1: Swap two adjacent nodes in both schedule order sequence and resource assignment sequence, if there is an edge between these two nodes in graph G̃. M2: Swap two adjacent nodes in resource assignment sequence. M3: Change the resource assignment of a task" [7].
With these definitions and the introduced moves, all possible task allocations can be reached: with M1, all other valid schedules can be reached, and with M2 and M3 all resource assignments can be chosen. The authors set the start temperature for simulated annealing to 100, the cooling rate to 0.95 and the end temperature to 10^-5. At each temperature, 1000 random moves are executed before the temperature is reduced. A found solution counts as an improvement if the MTTF of the system increases. For that, a cost function is introduced which reflects whether a solution is valid and computes the MTTF according to the failure mechanisms mentioned above.
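The three moves could be implemented as follows. This is a sketch under our own assumptions, not the authors' code: the allocation representation is the (schedule order; resource assignment) pair from above, and the complement graph G̃ is assumed to be given as a set of edge tuples.

```python
import random

def move(order, assign, complement_edges, processors):
    """Apply one random move (M1, M2 or M3 from [7]) to an allocation;
    data layout and helper names are our assumptions."""
    order, assign = list(order), list(assign)
    kind = random.choice(("M1", "M2", "M3"))
    i = random.randrange(len(order) - 1)          # position of an adjacent pair
    if kind == "M1":
        # M1: swap adjacent tasks in BOTH sequences, only if the tasks are
        # unordered, i.e. connected by an edge in the complement graph G̃
        if (order[i], order[i + 1]) in complement_edges or \
           (order[i + 1], order[i]) in complement_edges:
            order[i], order[i + 1] = order[i + 1], order[i]
            assign[i], assign[i + 1] = assign[i + 1], assign[i]
    elif kind == "M2":
        # M2: swap adjacent entries in the resource assignment sequence only
        assign[i], assign[i + 1] = assign[i + 1], assign[i]
    else:
        # M3: reassign one task to a (possibly different) processor
        assign[random.randrange(len(assign))] = random.choice(processors)
    return tuple(order), tuple(assign)
```

Because M1 checks G̃ before swapping, every move preserves validity of the schedule order per the lemma quoted above.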
3.2.3 Benchmarks of SA-based Task Allocation

To test the lifetime improvements, the authors generated random task graphs with 20 to 260 tasks and tested them on different hypothetical MPSoC platforms with 2 to 8 processor cores. They benchmarked the SA-based task allocation against a temperature-aware task scheduling algorithm based on list scheduling, and showed that their approach achieves longer system lifetimes than temperature-aware task mappings. Depending on how many processors are used and how many tasks have to be mapped, SA showed improvements from 0% to 81.81%; the more tasks have to be mapped and the more processor cores are used, the better the improvement of SA gets.

All in all, the simulated-annealing-based task allocation improves system lifetime compared to a task allocation that only regards temperature. The authors, however, did not compare their approach to other lifetime-aware task mappings, and there is no benchmark showing the lifetime increase compared to a random task mapping that ignores lifetime. Compared to slack allocation, this method requires no further investment in additional hardware.
3.3 ACO-based Task Mapping

This section presents a method for increasing lifetime in embedded systems that focuses on task mappings. The authors of [6] have developed a lifetime-aware task mapping technique based on ant colony optimization (ACO). In contrast to approaches like slack allocation, the authors wanted a method that does not increase system cost.

Other approaches that seek to increase system lifetime through task mapping focused on mappings that optimize system temperature. It has been shown that there is a strong relationship between system temperature and system lifetime, so reducing temperature can result in a better lifetime [6]. However, the authors observed a high fluctuation in lifetime when only temperature is considered. They therefore concluded that additional factors influencing the task mapping have to be considered when lifetime optimization is the goal.

In general, finding an optimal task mapping is an NP-complete problem [1]. To handle this, a heuristic approach is needed that finds a solution close to the optimum. The authors therefore developed a task mapping based on ant colony optimization. They chose ACO because task mappings have been solved effectively with ACO in the past, and because it remains usable in a changing environment (failure of components).
3.3.1 Problem Definition

The authors developed a lifetime-aware task mapping. In their approach a task mapping is application-dependent and defined as the assignment of tasks to processors and of data arrays to memories [6]. The general goal of task mapping is to optimize one or more objectives; here the goal is to optimize system lifetime, for which several objectives have to be considered.

Because of the strong dependence of component lifetime on component temperature, minimizing system temperature is one factor to be considered. For that, either the peak system temperature Tmax or the average system temperature Tavg is minimized. Furthermore, it is not enough to minimize the overall temperature; component temperatures are also important. For example, even if the overall temperature is low, the system fails if one essential component experiences high temperature and fails early.

Regarding only temperature ignores other physical factors that can influence system lifetime. To overcome this, the authors of [6] additionally consider electromigration, time-dependent dielectric breakdown and thermal cycling. These three factors influence the system MTTF and cause what are called wearout-related permanent faults.

With the use of temperature and these physical parameters to address component failure, a lifetime-aware task mapping is designed. The task mapping is based on ACO, which is described next.
3.3.2 Ant Colony Optimization

Ant colony optimization (ACO) is a nature-inspired approach in which artificial ants explore paths in the solution space of a problem, leaving pheromone trails in which information about the quality of a path is stored [1]. Nature-inspired means that natural processes are imitated.

ACO imitates the indirect communication of ants as they explore new food sources: an ant swarm can find shortest paths between food sources and the nest by this indirect communication [1]. When ants move out, they emit a chemical substance called pheromone. The amount of pheromone on a trail increases the more ants take the same path. Following ants can detect the pheromone, and the higher the pheromone concentration on a path, the higher the probability that an ant will take that path. To avoid early convergence to a particular path during the exploration process, the pheromone evaporates over time; through evaporation, paths that are of no further use are eventually ignored as all the pheromone on them fades away [1].

This natural behavior is adapted in an artificial way to optimize a constructive search process for combinatorial problems [1]. Artificial ants explore a search space, and when they take a path that leads to a good solution they leave an artificial pheromone trail on it, so that following ants will take that path with a higher probability than other paths.
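The two ACO primitives just described, pheromone-weighted path choice and evaporation, can be sketched generically. This is our own illustration, not code from [1] or [6]; the pheromone store is assumed to be a plain dictionary from options to pheromone levels.

```python
import random

def choose(options, pheromone):
    """Pick an option with probability proportional to its pheromone level
    (roulette-wheel selection)."""
    total = sum(pheromone[o] for o in options)
    r = random.uniform(0, total)
    acc = 0.0
    for o in options:
        acc += pheromone[o]
        if r <= acc:
            return o
    return options[-1]          # numeric fallback

def evaporate(pheromone, rate=0.1):
    """Let a fraction of every trail's pheromone fade away, so unused
    paths are eventually ignored."""
    for o in pheromone:
        pheromone[o] *= (1.0 - rate)
```

With a heavily reinforced option, `choose` still occasionally picks the weaker one, which is exactly the exploration behavior the evaporation mechanism relies on.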
3.3.3 Task Mapping

To adapt this to the task mapping problem, the authors of [6] developed an approach based on ACO. In the following, some basics needed for the method are introduced. First, the task mapping requires a system description, consisting of a list of components including their capacities and the links between them [6]. Second, a task graph is needed, consisting of a list of tasks including their requirements and communication rates. The authors then define their goal as follows: "Our goal is to determine the initial mapping of tasks to processors and data arrays to memories which results in the longest system lifetime" [6]. They only define an initial task mapping and do not address efficient remapping of tasks in their paper.

The ACO strategy is implemented via a construction graph (see Figure 3). This graph consists of nodes and directed edges. The set of nodes contains all system components and all tasks of the application. There are two types of edges: decision edges connect components to tasks, and mapping edges connect tasks to components.

The graph is traversed by artificial ants. At the beginning, a decision edge is chosen which ends in a task; this task is the first to be executed. Next, the ant chooses a mapping edge that connects the task to a component, completing a single task-to-component mapping. After that, another decision edge is taken. This process is repeated until all tasks are mapped to components. An illustration of the process is given in Figure 3: there is a task graph containing all tasks, and a communication architecture containing all components, which are connected via a switch. The colors indicate associated tasks and components. At the beginning of the mapping, an ant starts at node T1 and chooses one of the four mapping edges. In this case, the ant selects the edge that ends in node C2, meaning that task T1 is executed on component C2. After that, again a task is chosen, until all tasks are mapped to components.

The ants choose edges by weighted random selection, where the weight of an edge depends on the amount of pheromone on it. Thus ants take paths that have been shown to be part of a good solution in the past with a higher probability than other paths, while the procedure still allows other paths to be selected in order to search for new solutions that might be better than older ones. The evaporation of pheromone prevents the algorithm from getting stuck in a local minimum.
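One ant's traversal of the construction graph could be sketched as follows. This is a simplified illustration under our own assumptions, not the authors' implementation: decision edges are keyed as `("dec", task)` instead of originating from the last-mapped component, pheromone is a dictionary with a default level of 1.0, and all names are ours.

```python
import random

def ant_walk(tasks, components, pheromone):
    """One ant alternates between a decision edge (pick the next unmapped
    task) and a mapping edge (pick a component for it), both chosen by
    pheromone-weighted random selection. Returns a task -> component map."""
    def weighted_pick(edges):
        total = sum(pheromone.get(e, 1.0) for e in edges)
        r = random.uniform(0, total)
        acc = 0.0
        for e in edges:
            acc += pheromone.get(e, 1.0)
            if r <= acc:
                return e
        return edges[-1]

    unmapped = list(tasks)
    mapping = {}
    while unmapped:
        _, task = weighted_pick([("dec", t) for t in unmapped])   # decision edge
        _, comp = weighted_pick([(task, c) for c in components])  # mapping edge
        mapping[task] = comp
        unmapped.remove(task)
    return mapping
```

Feeding pheromone onto good edges (e.g. a high level on the T1-to-C2 mapping edge) biases later ants toward repeating that assignment, as described above.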
Figure 3: Task mapping process after completion (taken from [6])

Figure 4: Overview of the ACO-based task mapping (taken from [6])

After an ant has traversed the construction graph, the found solution is checked for validity and given a score; details about the validity check and the scoring follow in the next paragraphs. An illustration of the whole task mapping process can be found in Figure 4. Beginning with an ant traversing the construction graph, a mapping is found; this phase is called task mapping synthesis. After that, the task mapping is checked for validity. If the solution is valid, the lifetime of the mapping is evaluated, which results in a system MTTF. The task mapping is then given a score that depends on its validity and the MTTF: invalid mappings get a bad score, while valid mappings get a score that reflects the MTTF. Only if the found solution has the best score so far is the construction graph fed with pheromone.
3.3.4 Task Mapping Evaluation<br />
After an ant has traversed the construction graph, the resulting task mapping must be evaluated. First, it is checked whether the mapping is valid. Valid means, on the one hand, that no component capacities have been violated. Component capacities are given in MIPS (million instructions per second) for processors and in KB (kilobytes) for memories. The processing requirements of compute tasks and the storage requirements of data tasks are determined, and possible violations are identified. On the other hand, the communication traffic between tasks is checked to determine whether any bandwidth capacities have been violated. If neither component capacities nor bandwidth capacities are violated, the solution is valid.
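The two-part validity check can be sketched as follows. All names (`mapping_is_valid`, the dictionary layouts) are our own illustration of the constraints just described, not an interface from [6].

```python
def mapping_is_valid(mapping, tasks, components, traffic, links):
    """Check a candidate task mapping against capacity constraints.

    mapping:    task -> component it is placed on
    tasks:      task -> required capacity (MIPS for compute tasks,
                KB for data tasks)
    components: component -> available capacity (MIPS or KB)
    traffic:    (task, task) -> required bandwidth
    links:      (component, component) -> available link bandwidth
    """
    # Part 1: component capacities. The summed requirements of the
    # tasks placed on a component must not exceed what that processor
    # (MIPS) or memory (KB) offers.
    load = {}
    for task, comp in mapping.items():
        load[comp] = load.get(comp, 0) + tasks[task]
    if any(load[c] > components[c] for c in load):
        return False
    # Part 2: bandwidth capacities. Traffic between tasks placed on
    # different components must fit on the link connecting them.
    use = {}
    for (a, b), bw in traffic.items():
        ca, cb = mapping[a], mapping[b]
        if ca != cb:
            link = (ca, cb) if (ca, cb) in links else (cb, ca)
            use[link] = use.get(link, 0) + bw
    return all(use[l] <= links[l] for l in use)
```

A mapping that overloads either a component or a link is rejected before any lifetime evaluation takes place.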
André Koza
To determine the MTTF of a valid solution, the authors of [6] use a system lifetime model, described as follows. The system lifetime resulting from a task mapping is defined as the amount of time between powering up a system and the failure of the system, i.e., the point at which its performance constraints can no longer be satisfied [6]. The performance constraints can be fulfilled as long as valid task re-mappings exist.
The physical factors listed in Section 3.3.1 are used to estimate permanent component failures due to wearout. The authors used a lognormal failure distribution for each of the factors and normalized them so that the MTTF is 30 years at the characterization temperature of 345 K [6].
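The normalization step has a simple closed form: a lognormal distribution with parameters mu and sigma has mean exp(mu + sigma^2/2), so mu can be solved for directly. The sketch below assumes a shape parameter sigma = 0.5 purely for illustration; [6] does not state the value used.

```python
import math

def lognormal_mu(target_mttf_years, sigma):
    # A lognormal distribution has mean exp(mu + sigma**2 / 2).
    # Solving for mu pins the mean failure time (the MTTF) at the
    # characterization temperature of 345 K to the target value.
    return math.log(target_mttf_years) - sigma ** 2 / 2.0

# Hypothetical shape parameter; only mu is fixed by the normalization.
mu = lognormal_mu(30.0, sigma=0.5)
recovered_mean = math.exp(mu + 0.5 ** 2 / 2.0)  # equals 30.0 years
```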
In the next step, component temperatures have to be determined in order to obtain component MTTFs and the resulting system MTTF. The temperature of a component depends on its utilization and power dissipation. The utilization of a component is determined from the task mapping and the system description (a list of components including their capacities and the links between them, see above). From this data the component power dissipation can be derived, which leads to a temperature for each component. The temperature can then be used to determine the component MTTF based on the above-mentioned normalized MTTF of 30 years at 345 K. As this seminar paper focuses on lifetime improvement through the ACO technique, details on how component power dissipation and the resulting temperatures are determined are omitted.
Overall system MTTF is then determined by an iterative simulation. In each iteration, failure times of components are randomly selected based on the task mapping, component utilization, and temperature. This means that not the MTTF of a component is chosen, but one concrete failure time. When a component fails, the remaining tasks and data are remapped; this remapping process is not lifetime-aware. It is then checked whether the remapping still satisfies the system's performance constraints. If it does, component utilization and temperature are recalculated based on the remapping, and the resulting data is used in the next iteration. This is repeated until the system fails. The process is executed for several sample systems, and the system MTTF is finally determined as the mean of all sample failure times.
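The Monte Carlo structure of this simulation can be sketched as follows. This is a simplified model under our own assumptions: `draw_failure_time` stands in for the utilization- and temperature-dependent failure-time sampling, and `remap` stands in for the (not lifetime-aware) remapping step returning `None` when no valid mapping remains.

```python
import random

def simulate_system_mttf(components, draw_failure_time, remap,
                         samples=200, seed=1):
    # Each sample plays one system life: components fail one after
    # another at randomly drawn concrete failure times (not their
    # MTTFs), tasks are remapped after every failure, and the system
    # dies once no valid remapping exists. System MTTF is the mean
    # of all sample failure times.
    rng = random.Random(seed)
    lifetimes = []
    for _ in range(samples):
        alive = set(components)
        t = 0.0
        while True:
            # Draw one concrete failure time per live component; in the
            # full model the distribution depends on utilization and
            # temperature under the current mapping.
            times = {c: t + draw_failure_time(c, rng) for c in alive}
            victim = min(times, key=times.get)
            t = times[victim]
            alive.discard(victim)
            if remap(alive) is None:  # performance constraints violated
                break
        lifetimes.append(t)
    return sum(lifetimes) / len(lifetimes)
```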
The system MTTF is used to score a task mapping solution. The score equals the ratio of the MTTF to a baseline MTTF for that system. The authors, however, do not explain how to obtain these baseline MTTFs; they only state that they are obtained from hand-crafted task mappings for example systems. Invalid solutions are scored so that they are never chosen over a valid solution.
The score determines the amount of pheromone placed on the edges of the construction graph: the amount deposited equals the score, and pheromones are only deposited on the path if the score of the solution is the highest found so far. To simulate the evaporation of pheromones over time, each time an ant has explored a new task mapping and its score has been computed, the pheromones on the edges (i.e., the edge weights) experience decay [6]. That means each weight changes by a certain percentage that depends on the number of valid task mappings.
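The deposit-only-on-best rule combined with per-evaluation decay can be sketched like this. For simplicity the sketch uses a fixed decay rate, whereas [6] makes the rate depend on the number of valid task mappings; the function name and signature are our own.

```python
def update_pheromones(pheromone, path, score, best_score, decay=0.05):
    # Decay runs every time a new task mapping has been scored,
    # simulating evaporation of the trails over time.
    for edge in pheromone:
        pheromone[edge] *= (1.0 - decay)
    # Pheromone equal to the score is deposited on the solution's path
    # only if this solution beats the best score found so far.
    if score > best_score:
        for edge in path:
            pheromone[edge] += score
        best_score = score
    return best_score
```

Because only best-so-far solutions deposit, mediocre mappings still erode old trails through decay without reinforcing their own path.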
3.3.5 Benchmarks
The ACO-based task mapping and a simulated annealing based task mapping are benchmarked in order to compare the resulting system MTTFs. As benchmark applications, the authors of [6] used a synthetic application (synth), a Multi-Window Display (MWD), and an MPEG-4 Core Profile Level 1 decoder (CPL1).
For the benchmarks, two variants of the ACO-based approach and two variants of a simulated annealing (SA) based approach were used. SA was chosen for comparison because it here represents a temperature-aware task mapping approach. It is important to note that this is not the SA approach described in Section 3.2.
The first variant of the ACO-based approach, called agnosticAnts, simulates a random selection of a task mapping: a single valid task mapping is generated before the search is stopped. Because no pheromone trails have been laid out at the beginning, all possible solutions are equally likely to be chosen first.
The second approach used in the benchmarks is lifetimeAnts. Here, a lifetime-aware task mapping is executed, and the ants explore 20 valid task mappings before the search is stopped. The task mapping with the highest MTTF is chosen as the result. The authors chose the value 20 because, according to experiments with higher and lower numbers, it offers a good tradeoff between MTTF and runtime.
In addition, two variants of SA were used in the benchmarks. The first one, called avgSA, finds task mappings with an optimized average initial component temperature. The second one, maxSA, emphasizes the optimization of the maximum initial component temperature. The SA-based approaches were stopped once they had reached 50 valid task mappings.
The authors used different design points for each benchmark. A design point is a communication architecture consisting of different processors and memories interconnected via switches. Different design points can share the same communication architecture but differ in the types of processors and/or memories. Additionally, the authors introduced different amounts of slack in each design point according to the method presented in [9].
The following paragraphs present the benchmark results. First, the synthetic application is evaluated. The authors designed this application small enough that an exhaustive search of all possible valid task mappings is practicable. They compared the best MTTF found by the exhaustive search with that found by lifetimeAnts and observed that lifetimeAnts was able to create task mappings with an equivalent MTTF. For this benchmark they used 16 different design points.
After that, the authors executed so-called real-world benchmarks with MWD and CPL1. In these benchmarks, optimal results could not be obtained due to the large number of possible valid task mappings; even in the smallest of the real-world benchmarks, the authors counted 1.224e10 possible valid task mappings. To overcome this, they executed all four approaches several hundred times to obtain an observed optimal task mapping, which acted as a reference. This observed optimal task mapping is the one with the highest system MTTF found across all runs of all approaches.
The results of the real-world benchmarks are shown in Table 1.

Benchmark   agnosticAnts   lifetimeAnts   avgSA    maxSA
MWD 4-s     65.6%          77.3%          83.4%    82.4%
CPL1 4-s    61.4%          83.9%          81.8%    81.8%
CPL1 5-s    64.0%          85.1%          84.3%    83.1%

Table 1: Benchmark of task mapping approaches as a percentage of the observed optimal results. Taken from [6].

            Max. Initial Temp.    Avg. Initial Temp.
Benchmark   Avg       Max         Avg       Max
MWD 4-s     27.4%     44.3%       32.3%     47.9%
CPL1 4-s    17.5%     24.5%       33.5%     53.2%
CPL1 5-s    15.3%     23.2%       31.9%     101.7%

Table 2: Lifetime ranges of task mappings whose initial component temperature lies within 1% of the observed optimum. Lifetimes vary greatly even within this small temperature interval. Taken from [6].

The percentages in the columns of Table 1 show the fraction of the observed optimal lifetime; for example, lifetimeAnts reached 83.9% of the observed optimal lifetime in the benchmark CPL1 4-s (4 switches). These percentages are averaged across all design points used in a benchmark. The benchmark shows that lifetimeAnts outperformed agnosticAnts in all test cases, while the results of avgSA and maxSA are nearly the same as those of lifetimeAnts.
The authors performed another evaluation of their benchmarks, in which they compared the lifetime ranges of task mappings with temperatures within 1% of the observed optimum temperature. The results can be found in Table 2. The first column gives the benchmark application. The second column, labeled Max. Initial Temp., shows the lifetime ranges of all approaches within 1% of the observed optimal maximum initial component temperature; both the average and the maximum range are shown. The third column shows the same for the observed optimal average initial component temperature. For example, the maximum lifetime range across all task mappings whose maximum initial component temperature lies within 1% of the lowest is 44.3% for MWD 4-s. From this table the following conclusion can be drawn: task mappings that result in a low system temperature are often not optimized for lifetime. On the other hand, the authors observed in their benchmarks that task mappings resulting in a high system lifetime also result in a low temperature. They therefore concluded that temperature-aware task mapping is a subset of lifetime-aware task mapping: temperature-aware approaches only find task mappings that are optimized for temperature but not necessarily for lifetime, while lifetime-aware task mappings show good results in both lifetime and temperature.
To sum up, ACO-based task mapping showed an improvement of 32.3% in lifetime compared to a random task mapping approach [6], achieved with no additional investment in hardware. The authors focused their work on the comparison between lifetime-aware and temperature-aware task mappings and concluded that when only temperature is considered, system lifetime fluctuates strongly.
4 Comparison
In this section, the previously presented approaches to increasing lifetime in embedded systems are compared to each other. The first approach we looked at was slack allocation, in which additional resources are brought into the system to cover potential future failures of components. In case of a failure, tasks and data of failed components are remapped to the slack resources. The goal of this approach is to find lifetime-cost Pareto-optimal slack allocations. The CQSA approach found slack allocations within 1.4% of the Pareto optimum while exploring only 1.4% of the design space on average. Lifetime could be increased by 22% in a small benchmark. No data is provided for real-world benchmarks as in Section 3.3.
The next two approaches we presented focus on task mappings to increase lifetime. Both adapted nature-inspired methods to the task mapping problem, and both considered not only system temperature but additional physical failure mechanisms. The simulated annealing technique provides task mappings that showed lifetime improvements compared to a temperature-aware method. The results of this method vary from 0% (in only one benchmark) up to 81.81%, depending on how many tasks had to be mapped and how many processor cores were used.
The third approach presented in this seminar paper, and the focus of this work, was ACO-based task mapping. The authors adapted the behavior of an ant swarm searching for new food sources to the task mapping problem. In benchmarks, the ACO-based task mapping was compared to a random approach and to two SA-based temperature-aware approaches: one targeting the average temperature and one targeting the maximum temperature. The ACO-based task mapping showed the best lifetime improvements on average with the lowest runtimes, reaching a 32.3% longer lifetime than a random task mapping approach.
All three examined approaches showed lifetime improvements. The advantage of the task mapping approaches is that no additional investments in hardware have to be made. The authors of [9] do not clearly state how much must be invested to achieve a certain lifetime improvement; in the only example run they mention, they obtained 50% more lifetime at a cost increase of 62%. Compared to the task mapping approaches, whose additional hardware cost is 0%, this is substantial. A common benchmark would be needed for a meaningful comparison of the SA-based and the ACO-based approach. Both approaches were benchmarked against temperature-aware task mappings, but the two benchmarks differ considerably: for example, the ACO-based approach is compared to two different temperature-aware approaches while the SA-based approach is compared to only one, and the ACO-based approach additionally used slack according to the method in [9]. From the available data, no conclusion can be drawn as to which approach increases system lifetime the most.
5 Conclusion
In this seminar paper we discussed three approaches to improving system lifetime in embedded systems. On the one hand, there was a system-level approach that improves lifetime by providing additional resources; on the other hand, there were two approaches that change the way resources are utilized within the system. Due to the lack of comparable benchmarks, it cannot be said which approach yields the best lifetime improvements. The advantage of task mapping over slack allocation is that there is no additional cost for new hardware. As proposed in [6], a combination of slack allocation and lifetime-aware task mapping is promising: the system benefits from both approaches, and the system designer can decide how much to invest in slack to obtain an increase in lifetime.
References
[1] Markus Bank and Udo Honig. An ACO-based approach for scheduling task graphs with communication costs. In Proceedings of the 2005 International Conference on Parallel Processing, pages 623–629, Washington, DC, USA, 2005. IEEE Computer Society.

[2] V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45:41–51, 1985. doi:10.1007/BF00940812.

[3] C.-W. Chiang, Y.-C. Lee, C.-N. Lee, and T.-Y. Chou. Ant colony optimisation for task matching and scheduling. Computers and Digital Techniques, IEE Proceedings, 153(6):373–380, Nov. 2006.

[4] M. Dorigo, V. Maniezzo, and A. Colorni. Ant system: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 26(1):29–41, Feb. 1996.

[5] M. Glass, M. Lukasiewycz, F. Reimann, C. Haubelt, and J. Teich. Symbolic reliability analysis and optimization of ECU networks. In Design, Automation and Test in Europe, DATE '08, pages 158–163, March 2008.

[6] Adam S. Hartman, Donald E. Thomas, and Brett H. Meyer. A case for lifetime-aware task mapping in embedded chip multiprocessors. In Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES/ISSS '10, pages 145–154, New York, NY, USA, 2010. ACM.

[7] Lin Huang, Feng Yuan, and Qiang Xu. Lifetime reliability-aware task allocation and scheduling for MPSoC platforms. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE '09, pages 51–56, Leuven, Belgium, 2009. European Design and Automation Association.

[8] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

[9] Brett H. Meyer, Adam S. Hartman, and Donald E. Thomas. Cost-effective slack allocation for lifetime improvement in NoC-based MPSoCs. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE '10, pages 1596–1601, Leuven, Belgium, 2010. European Design and Automation Association.

[10] Changyun Zhu, Zhenyu (Peter) Gu, Robert P. Dick, and Li Shang. Reliable multiprocessor system-on-chip synthesis. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '07, pages 239–244, New York, NY, USA, 2007. ACM.
Warp processing
Maryam Sanati
University of Paderborn
msanati@mail.uni-paderborn.de
November 2011
Abstract
This paper presents a framework for the dynamic synthesis of thread accelerators, or thread warping. Warp processing is the process of dynamically converting a typical software instruction binary into an FPGA circuit binary for speedup. FPGAs can be much faster than microprocessors: while a microprocessor may execute several operations in parallel, an FPGA can implement thousands of operations in parallel. Warp processing uses an on-chip processor to remap critical code regions from processor instructions to an FPGA circuit using runtime synthesis. Basic warp processing provides dynamic synthesis for a single-process, single-thread system. Performance can be improved further by thread warping, which can adapt the system to changing thread behavior and different mixes of resident applications.
1 Introduction
This section describes a new processing architecture known as the warp processor. Warp processing gives a computer chip the ability to improve its own performance: a program runs on a microprocessor chip, and the chip tries to detect the most frequently executed parts of the program. It then moves these parts to a field-programmable gate array (FPGA). An FPGA has the ability to execute some, but not all, programs 10, 100, or even 1,000 times faster than a microprocessor. If the microprocessor determines that the FPGA is faster for a particular part of the program, it causes the program execution to "warp": the microprocessor moves the selected part to the FPGA.
While some applications see no speedup on FPGAs, highly parallelizable applications such as image processing, encryption, encoding, video/audio processing, and mathematical simulations may achieve 2x, 10x, 100x, or even 1000x speedups compared to fast microprocessors. Consumers who enhance their photos using Photoshop or edit videos on their PCs would find their systems sped up by warp processing. Because optimization happens at runtime, warp processing may also eliminate the tool flow restrictions and extra designer effort associated with traditional compile-time optimizations.
2 Warp processing
A warp processor dynamically detects the critical regions of a binary and reimplements them, which results in 2x to 100x speedups compared to execution on a microprocessor. In general, software bits are downloaded into a hardware device. In a traditional microprocessor, these bits represent sequential instructions to be executed by the programmable microprocessor. In an FPGA, the software bits describe a circuit to be mapped onto the FPGA's configurable logic fabric. In both cases, developers download the software bits to a prefabricated hardware device to implement their desired computation; for both kinds of software, no hardware has to be designed.
A computation might execute faster as a circuit on an FPGA than as sequential instructions on a microprocessor because a circuit allows concurrency from the bit level to the process level [1]. The most difficult part of warp processing is dynamically reimplementing code regions on an FPGA, which involves many steps, such as decompilation, partitioning, synthesis, and placement and routing, and needs special tools for these stages in order to minimize computation time and data memory compared to the main processor.
From an electrical point of view, programming an FPGA is the same as programming a microprocessor. Many research tools aim to compile popular high-level programming languages such as C, C++, and Java to FPGAs. Many of these compilers use profiling to detect the kernels of a program, i.e., its most frequently executed parts, map those parts to a circuit on an FPGA, and let the microprocessor execute the rest of the program.
Recent studies showed that designers can perform hardware/software partitioning starting from binaries rather than from high-level code by using decompilation. In other words, warp processing is a process in which an executing binary is dynamically and transparently optimized by moving parts of it to on-chip configurable logic.
2.1 Components of a warp processor
Figure 1 provides an overview of a warp processor. The warp processor consists of a microprocessor, which is the main processor, and a warp-oriented FPGA (W-FPGA) sharing instruction and data caches or memory, an on-chip profiler, and an on-chip computer-aided design module (dynamic CAD tools). Initially, a developer or end user downloads a program, and it executes only on the main processor. During the execution of the application, the profiler monitors the execution and dynamically detects the critical kernels. After the binary's kernels have been detected, the dynamic CAD tools map those critical regions to an FPGA circuit. The binary updater then updates the program binary to use the new circuit. Once the update has taken place, the execution warps: the program's execution speeds up by a factor of two, 10, or even more.
Figure 1: Warp processor architecture/overview
As we mentioned before pr<strong>of</strong>iler is in charge <strong>of</strong> monitoring application’s behavior to<br />
determine <strong>the</strong> critical kernels, which can be implemented as hardware by warp processor.<br />
Branch frequencies are stored in a cache that the profiler updates whenever a backward branch occurs. In this way, the profiler can determine the critical kernels accurately. After profiling has detected the critical regions, the on-chip CAD module executes the partitioning, synthesis, mapping, and routing algorithms. The dynamic CAD first analyzes the profiling results to decide which critical kernels should be implemented in hardware. After selecting the binary kernels, the CAD tool decompiles the critical regions into a control/data flow graph and synthesizes the critical kernels to produce an optimized hardware circuit, which is later mapped onto the W-FPGA using mapping, placement, and routing technology. Warp processors synthesize circuits from the executing binary code rather than from source code. Because binary code lacks high-level constructs such as loops, arrays, and functions, synthesizing from it might produce slower or bigger circuits. Alternatively, the on-chip CAD tools can be replaced by a software task on the main processor; this software task then shares computation and memory resources with the main application. We can also build a multiprocessor system with multiple warp processors on a single device. In that case we do not need multiple on-chip CAD modules: a single one is sufficient, supporting each of the processors in a round-robin fashion [2]. Here, too, the CAD can be executed as a software task instead of being implemented in hardware.
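The backward-branch profiling idea can be sketched in software; the trace format, addresses, and top-k policy below are invented for illustration and are not the authors' hardware profiler design:

```python
# Sketch of backward-branch profiling: a small table of branch frequencies,
# updated only on backward branches (loop back-edges), whose hottest entries
# identify the critical loops (kernels). All addresses are hypothetical.

from collections import Counter

def profile(trace):
    """trace: iterable of (pc, target) pairs for executed branches.
    A backward branch (target <= pc) closes a loop iteration, so its
    frequency approximates how hot that loop is."""
    freq = Counter()
    for pc, target in trace:
        if target <= pc:          # backward branch -> loop back-edge
            freq[target] += 1
    return freq

def critical_kernels(freq, k=2):
    # The k most frequent back-edge targets are the candidate kernels.
    return [addr for addr, _ in freq.most_common(k)]

# Hypothetical branch trace: a hot loop at 0x100, a cold one at 0x200,
# and one forward branch that the profiler ignores.
trace = [(0x120, 0x100)] * 1000 + [(0x210, 0x200)] * 3 + [(0x130, 0x300)]
freq = profile(trace)
print(critical_kernels(freq, k=1))  # [256] i.e. the loop at 0x100
```

Counting only back-edges keeps the table small, which matches the constraint that the on-chip profiler must use minimal resources.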
Researchers have developed many decompilation techniques to recover high-level constructs such as loops, arrays, and functions. Two efficient techniques are:
Maryam Sanati
Figure 2: Dynamic CAD tools
-Loop rolling
-Operator strength promotion
Loop rolling detects an unrolled loop in a binary and replaces the code with a rerolled loop, thus letting a circuit synthesizer unroll the loop by an amount that matches the available FPGA resources. Previous decompilation techniques also use loops to detect arrays, and synthesizers need arrays to make effective use of FPGA smart buffers, which increase data reuse and thus reduce time-consuming memory accesses [1]. Loop rerolling also significantly reduces circuit synthesis time by shrinking the control/data flow graph. Operator strength promotion detects strength-reduced operations: sequences of weak operations such as shifts and adds are replaced by a single, stronger multiplication. The compiler can then use a multiplier, which is a fast functional unit, if one is available on the FPGA.
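Operator strength promotion can be illustrated on a tiny invented IR; the tuple format and the `(x << 2) + x` pattern are my assumptions for the sketch, not the actual warp tooling:

```python
# Illustrative sketch of operator strength promotion: a shift-and-add
# sequence computing x*5 as (x << 2) + x is collapsed back into a single
# multiplication, which a synthesizer can map to a fast FPGA multiplier.

def promote(ir):
    """ir: list of (dst, op, a, b) tuples. Rewrite t = (x << c); d = t + x
    into d = x * (2**c + 1)."""
    out, i = [], 0
    while i < len(ir):
        if (i + 1 < len(ir)
                and ir[i][1] == 'shl' and ir[i + 1][1] == 'add'
                and ir[i + 1][2] == ir[i][0]        # add uses the shift result
                and ir[i + 1][3] == ir[i][2]):      # ... plus the shifted value
            dst, _, x, c = ir[i]
            out.append((ir[i + 1][0], 'mul', x, (1 << c) + 1))
            i += 2
        else:
            out.append(ir[i])
            i += 1
    return out

def run(ir, env):
    """Evaluate the toy IR to check the rewrite preserves semantics."""
    ops = {'shl': lambda a, b: a << b, 'add': lambda a, b: a + b,
           'mul': lambda a, b: a * b}
    for dst, op, a, b in ir:
        av = env[a] if isinstance(a, str) else a
        bv = env[b] if isinstance(b, str) else b
        env[dst] = ops[op](av, bv)
    return env

ir = [('t', 'shl', 'x', 2), ('d', 'add', 't', 'x')]   # d = (x << 2) + x
opt = promote(ir)                                      # d = x * 5
print(opt, run(ir, {'x': 7})['d'], run(opt, {'x': 7})['d'])
```

Both versions compute the same value; the promoted form is shorter and exposes the multiply to the synthesizer.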
Without these two new decompilation techniques, the binary approach would have yielded 33 percent less average speedup, with a worst case of 65 percent less. Without any decompilation, the binary approach actually yielded an average slowdown (not a speedup) of 4x [1]. By using warp processors, we can improve performance and energy efficiency for embedded applications. Warp processors are well suited for embedded systems that execute the same application repeatedly for extended periods, and for systems in which software updates and backward compatibility are essential. These processors are extremely useful and efficient for data-intensive applications such as image/video processing, scientific research, or even games.
2.2 Dynamic CAD
The FPGA CAD tasks, shown in Figure 2, include:
Warp processing
-Decompilation
-Behavioral synthesis: converting a control/data flow graph to a datapath and register transfers
-Register-transfer synthesis: converting register transfers to logic
-Logic synthesis: minimizing logic
-Technology mapping: mapping logic to FPGA-compatible resources
-Placement: placing logic/compute resources within specific FPGA resources
-Routing: creating connections between logic/compute resources
Traditional desktop tools that perform the same tasks have long execution times, ranging from minutes to hours, require large memory resources, sometimes more than 50 megabytes, and can comprise hundreds of thousands of lines of source code. The on-chip CAD algorithms, in contrast, must provide very fast execution times, use only small instruction and data memory resources, minimize the amount of data memory used during execution, and still deliver excellent results. Our on-chip CAD tool starts with the software binary; the decompilation step converts the software loops into a high-level representation that is more suitable for synthesis. First, each assembly instruction is converted into equivalent register transfers, which provides an instruction-set-independent representation of the binary. After converting the instructions into register transfers, the decompilation tool builds a control flow graph for the software region and then generates a data flow graph by parsing the semantic strings of each register transfer. The parser uses definition-use and use-definition analysis to build the data flow graph by combining the register transfer trees. Once the control and data flow graphs have been generated, decompilation applies standard compiler optimizations to remove the overhead introduced by the assembly code and instruction set. The next step is to recover high-level constructs such as loops and if statements from the control/data flow graph. After all these steps, the on-chip CAD tool performs partitioning to decide which of the critical software kernels identified by the on-chip profiler are most suitable for implementation in hardware, maximizing speedup while reducing energy. In behavioral and register-transfer synthesis, our dynamic CAD converts the control/data flow graph of each critical kernel into a hardware circuit description. The next job is logic synthesis, which optimizes the hardware circuit. The core of the logic synthesis algorithm is an efficient two-level logic minimizer that is 15x faster and uses 3x less memory than Espresso-II; the trade-off is a two percent increase in circuit size [1].
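The definition-use step described above can be sketched as follows; the register-transfer string syntax is invented for illustration:

```python
# Sketch of def-use data-flow-graph construction: each register transfer
# becomes a node, and an edge is drawn from the most recent definition of
# a register to every later transfer that uses it.

import re

def build_dfg(transfers):
    """transfers: list of strings like 'r3 = r1 + r2'.
    Returns edges (producer_index, consumer_index)."""
    last_def = {}          # register -> index of transfer that last wrote it
    edges = []
    for i, t in enumerate(transfers):
        dst, rhs = [s.strip() for s in t.split('=', 1)]
        # dict.fromkeys dedupes uses while keeping their order
        for reg in dict.fromkeys(re.findall(r'r\d+', rhs)):
            if reg in last_def:                    # use-definition link
                edges.append((last_def[reg], i))
        last_def[dst] = i                          # this transfer defines dst
    return edges

rts = ['r1 = r0 + 4', 'r2 = r1 * r1', 'r1 = r2 - r0', 'r3 = r1 + r2']
print(build_dfg(rts))   # [(0, 1), (1, 2), (2, 3), (1, 3)]
```

Note that the redefinition of r1 in the third transfer correctly redirects the last use to the newest definition, which is what makes the resulting graph suitable for synthesis.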
After this step, the CAD tool performs technology mapping to map the hardware circuit onto the configurable logic blocks (CLBs) and lookup tables (LUTs) of the configurable logic fabric; our technology mapper uses a hierarchical, bottom-up graph-clustering algorithm. After mapping the hardware circuit onto a network of CLBs, the on-chip CAD tool places the CLB nodes onto the configurable logic. The most compute- and memory-intensive FPGA CAD task is routing: typically a tool reroutes a circuit many times until it finds a valid or sufficiently optimized routing, which requires large amounts of memory for updating and restoring the routing resource graph, as well as long execution times. We reduced execution time and memory use by developing a fast, lean routing algorithm and by designing a CAD-oriented FPGA fabric [4].
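Routing's cost can be illustrated with a toy breadth-first router; real routers (including the authors' lean router) iterate and negotiate congestion, so this is only a sketch under my own simplifications:

```python
# Toy illustration of FPGA routing: breadth-first search of one net
# through a routing grid, avoiding wires already claimed by earlier nets.
# Greedy and single-pass; no rip-up or congestion negotiation.

from collections import deque

def route(src, dst, used, w, h):
    """BFS shortest path on a w x h routing grid from src to dst,
    avoiding cells already claimed by previously routed nets."""
    prev = {src: None}
    frontier = deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:                      # reached the sink: backtrack
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < w and 0 <= nxt[1] < h
                    and nxt not in used and nxt not in prev):
                prev[nxt] = node
                frontier.append(nxt)
    return None                              # unroutable without rip-up

used = set()
for src, dst in [((0, 0), (3, 0)), ((0, 1), (3, 1))]:
    path = route(src, dst, used, 4, 4)
    used.update(path)                        # claim the wires of this net
    print(src, dst, path)
```

Even this toy shows why routing dominates memory use: the search touches a large fraction of the routing-resource graph per net.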
2.3 Warp processing scenarios
Figure 3: Warp processing scenarios
There are two different scenarios, depending on application runtime. Figure 3a shows the execution of a short-running application. In this case, running the dynamic CAD tools takes more time than the application itself, so the first few executions see no speedup from warp processing; the application can nevertheless benefit, as long as the warp processor remembers the application's hardware configuration. Figure 3b depicts longer-running applications, such as scientific computing, whose executions take hours or even days. In this case, profiling and dynamic CAD finish some time before the end of the first execution, and the rest of the application can benefit from warped execution. The difference between the two scenarios is therefore that short-running applications are mapped only after several executions, by saving and then reusing the application's FPGA configuration, whereas longer-running applications can be warped even during a single execution; saving the FPGA configuration is not required, although the application can still use a saved configuration for future executions.
3 Single-threaded Applications
Each program has one or more paths of execution. A program with only one path of execution is called single-threaded, and one with two or more paths is called multi-threaded. A single-threaded program can execute only one task at a time and must finish each task in sequence before starting the next one. Depending on the demands, single-threaded programs sometimes work perfectly well; however, the need to accomplish multiple simultaneous tasks sometimes leads to the use of multiple threads.

Thread warping can improve the performance of a multiprocessor by speeding up individual threads and by executing more threads concurrently.
We followed the results of many experiments on single-threaded benchmark applications. Warp processing does not provide speedup for all of them, so we consider only those amenable to speedup using FPGAs; the others would need to be rewritten, or new decompilation techniques would have to be developed. On the other hand, warp processing cannot cause a slowdown: if it cannot speed up the application, the binary updater lets the binary execute on the microprocessor alone. Our present warp FPGA fabric supports approximately 50,000 equivalent logic gates, roughly equal in logic capacity to a small Xilinx Spartan-3 FPGA [1].
In the current architecture, communication between the microprocessor and the FPGA is implemented using a combination of shared memory, memory-mapped communication, and interrupts. Like the data address generators in digital signal processors (DSPs), the FPGA uses address generators to stream the data required by the FPGA circuit from memory. The microprocessor uses interrupts to become aware of hardware completion and uses memory-mapped communication to initialize and enable the FPGA. A single data transfer between the microprocessor and the FPGA requires at least one and at most two cycles.
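The handshake can be modelled in software; every register address and the "circuit" behavior below are hypothetical, serving only to show the initialize/enable-then-interrupt protocol:

```python
# Toy model of microprocessor<->FPGA communication: shared memory for bulk
# data, memory-mapped registers to configure and start the FPGA, and an
# interrupt (modelled as a callback) signalling hardware completion.

FPGA_ENABLE = 0xFFFF0000   # hypothetical memory-mapped control register
FPGA_ARG    = 0xFFFF0004   # hypothetical argument register

class System:
    def __init__(self):
        self.mmio = {}               # memory-mapped register file
        self.shared = [0] * 16       # shared memory region
        self.done = False

    def irq_handler(self):           # interrupt: hardware completion
        self.done = True

    def write(self, addr, value):    # a memory-mapped store
        self.mmio[addr] = value
        if addr == FPGA_ENABLE and value == 1:
            self.fpga_run()

    def fpga_run(self):
        # The "circuit": stream shared memory, accumulate, raise interrupt.
        n = self.mmio[FPGA_ARG]
        self.shared[0] = sum(self.shared[1:1 + n])
        self.irq_handler()

system = System()
system.shared[1:5] = [1, 2, 3, 4]
system.write(FPGA_ARG, 4)      # initialize the FPGA via a memory-mapped store
system.write(FPGA_ENABLE, 1)   # enable it; completion arrives as an interrupt
print(system.done, system.shared[0])   # True 10
```

The pattern mirrors the text: bulk data flows through shared memory, control flows through memory-mapped stores, and completion flows back through an interrupt.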
Comparing a DSP to a warp processor shows that a DSP, like warp processing, uses arithmetic-level parallelism to improve performance, but warp processing is usually faster, although there are some benchmarks for which the DSP is slightly faster. A DSP can execute only a few operations in parallel, whereas warp processing supports a wider range of parallelism. Cases with little parallelism are faster on the DSP because of its higher clock frequency.
4 Multi-threaded Applications
Thread warping is a dynamic optimization technique that uses a single processor on a multiprocessor system to dynamically synthesize threads into custom accelerator circuits on an FPGA. In modern processing architectures, multicore devices are connected on boards or backplanes to build large multiprocessor systems. A single-threaded program contains only one execution sequence, but there can be more execution paths as well. The first step is therefore to create threads that execute a function f(). If there are not enough processors for the number of threads (step 1), the OS puts the waiting threads in a queue until a processor becomes available (step 2). Our framework analyzes the waiting threads and invokes the on-chip CAD tools, which create custom accelerator circuits for f() (step 3). The CAD takes 32 minutes to finish mapping the accelerators onto the FPGA. If the application has not finished by then, the operating system schedules threads onto the accelerators and microprocessors, exploiting thread-level and fine-grained parallelism.
Thread warping hides the FPGA by dynamically synthesizing accelerators, allowing software developers to take advantage of the performance improvements of custom circuits without any changes to the tool flow, just as multi-threaded programs make use of additional processors without rewriting or recompiling code [3]. At different points during execution, thread warping can create different accelerator versions according to the amount of FPGA resources available.

Figure 4: (a) On-chip CAD tool flow, (b) accelerator synthesis tool flow
4.1 On-chip CAD tools
Figure 4 shows the on-chip CAD tool flow, which first analyzes the thread queue and then creates custom accelerators for the waiting threads using the accelerator synthesis tool flow. We first need to define some terms. A thread creator is a function that contains an application programming interface (API) call that creates threads. A thread is the unit of execution that the operating system schedules. A thread group is a collection of threads, created from the same instruction address, that share input data. A thread function is the function that a thread executes.
As we can see in Figure 4a, queue analysis determines the union of waiting thread functions, and thread counts gives the number of occurrences of each thread function in the queue. If an accelerator has not been created before, accelerator synthesis creates a custom circuit for each thread function and puts it in the accelerator library. Accelerator synthesis also updates the software binary so that the microprocessor can communicate with the created accelerators. Specifying the number of accelerators of each thread function to place in the FPGA is the responsibility of accelerator instantiation. The output of this step is converted to an FPGA bitstream by the place and route tool. The schedulable resource list (SRL) holds the available processing resources in order to inform the operating system about them. The thread queue has a limited size; if the number of threads reaches the predefined size, the OS invokes the on-chip CAD. As mentioned before, accelerator synthesis creates a new accelerator when a new thread function arrives whose accelerator does not yet exist in the library. Then, because of the change in thread counts, accelerator instantiation changes the type and number of accelerators in the FPGA.
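The queue-analysis and library-lookup steps might look like the following sketch; the data shapes, placeholder netlists, and proportional instantiation policy are my assumptions, not the paper's algorithm:

```python
# Sketch of queue analysis: compute the set of waiting thread functions
# and their counts, synthesize an accelerator only for functions not yet
# in the library, and apportion FPGA area by demand.

from collections import Counter

accel_library = {}                 # thread function -> accelerator netlist

def on_chip_cad(thread_queue):
    counts = Counter(t['func'] for t in thread_queue)   # thread counts
    for func in counts:                                 # union of functions
        if func not in accel_library:                   # synthesize once
            accel_library[func] = f'netlist<{func}>'    # placeholder circuit
    # accelerator instantiation: share FPGA area in proportion to demand
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()}

queue = [{'func': 'f'}] * 6 + [{'func': 'g'}] * 2
share = on_chip_cad(queue)
print(sorted(accel_library), share)   # ['f', 'g'] {'f': 0.75, 'g': 0.25}
```

Invoking `on_chip_cad` again with the same queue leaves the library unchanged, mirroring the text's point that a new accelerator is synthesized only for previously unseen thread functions.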
Figure 4b shows the tool flow of accelerator synthesis, which starts with decompilation and hardware/software partitioning. Then memory access synchronization analyzes the thread functions, detects threads with similar memory access patterns, and combines them into thread groups that share memory channels and execute synchronously. High-level synthesis converts the decompiled representation of each thread function into a custom circuit, represented as a netlist. If the entire thread function cannot be implemented on the FPGA, the binary updater modifies the software binary so that the software can communicate with the accelerators.
With parallel access, multiple threads can read the same data from memory. Thus, memory access synchronization (MAS) can combine memory accesses from multiple accelerators onto a single channel and use a single read to service many accelerators. MAS unrolls loops to generate fixed-address reads in the control/data flow graph of each thread function.
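The read-combining idea can be sketched as a set union over the fixed-address reads that unrolling exposes; the overlapping read windows below are an invented example:

```python
# Sketch of MAS read combining: once loop unrolling has fixed each
# accelerator's read addresses, reads shared by several accelerators can
# be serviced by a single memory fetch on the shared channel.

def combine_reads(read_sets):
    """read_sets: list of per-accelerator sets of fixed read addresses.
    Returns the single fetch schedule and the saving versus naive reads."""
    fetch_schedule = sorted(set().union(*read_sets))    # one read per address
    naive = sum(len(s) for s in read_sets)
    return fetch_schedule, naive - len(fetch_schedule)

# Four accelerator copies reading overlapping 4-word windows of an array.
reads = [set(range(i, i + 4)) for i in range(4)]  # {0..3},{1..4},{2..5},{3..6}
schedule, saved = combine_reads(reads)
print(schedule, saved)   # [0, 1, 2, 3, 4, 5, 6] 9
```

Here 16 naive reads collapse to 7 fetches, which is the data-reuse effect the text attributes to combining accesses onto a single channel.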
The OS gives priority to the fastest resource compatible with the thread function, which is usually an accelerator. However, when a thread function contains other calls (such as create, join, mutex, or semaphore functions), the OS schedules that thread onto a microprocessor. In some cases no microprocessor or accelerator is available for the first thread in the queue, but other threads in the queue may have available accelerators. The problem is that when the head of the queue cannot be scheduled, the other threads cannot be scheduled either, even though they have available accelerators. To avoid this problem, the scheduler scans the thread queue until it finds a thread that can be scheduled. If no resource is available, or the available resources do not apply to any waiting thread, the scheduler avoids the worst case by not scanning the queue. The scheduler is invoked when a thread is created or completed, when a lock is released, and when a synchronization request blocks a software thread.
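A minimal sketch of this scheduling policy, with invented thread and resource shapes (`plain` marks a thread function free of other pthread calls):

```python
# Sketch of the scheduler: prefer the fastest compatible resource (an
# accelerator), fall back to a microprocessor, and scan past a queue head
# that cannot currently be scheduled.

def pick(queue, free_accels, free_cpus):
    """free_accels: dict func -> number of idle accelerators for it.
    Returns (queue_index, resource) or None if nothing is schedulable."""
    for i, t in enumerate(queue):
        # Threads using other pthread calls (join, mutex, ...) must run
        # in software on a microprocessor.
        if not t['plain'] and free_cpus > 0:
            return i, 'cpu'
        if t['plain'] and free_accels.get(t['func'], 0) > 0:
            return i, 'accel'      # fastest compatible resource
        if t['plain'] and free_cpus > 0:
            return i, 'cpu'
    return None                    # nothing schedulable right now

queue = [{'func': 'f', 'plain': True},    # head: no accelerator, no CPU free
         {'func': 'g', 'plain': True}]    # but g has an idle accelerator
print(pick(queue, free_accels={'g': 1}, free_cpus=0))  # (1, 'accel')
```

The scan past the head is exactly the fix described above: the thread for g is dispatched to its accelerator even though the thread for f is stuck at the front of the queue.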
To evaluate the performance of the framework, the authors developed a C++ simulator that creates a parallel execution graph (PEG). Nodes in this graph represent sequential execution blocks (SEBs), which are blocks that end with a pthread call or with the end of a thread. (Pthreads define a set of C programming language types, functions, and constants.) The edges of the graph represent the synchronization between SEBs.
5 Conclusion
FPGAs can benefit a wide range of applications, such as video and audio processing, encryption and decryption, encoding, compression and decompression, bioinformatics, and anything that requires intensive computing on large streams of data. We studied much research and various experiments and showed that the basic concept of warp processing, namely dynamically mapping software kernels to an on-chip FPGA to improve performance and energy efficiency, is feasible. The simplicity of the W-FPGA's configurable logic fabric lets us achieve lower power consumption and higher execution frequencies than a traditional FPGA for the applications considered. The benefits of warp processing were most apparent for applications with much concurrency. For multi-threaded warping we need additional CAD tools that determine which and how many threads to synthesize.
References
[1] Frank Vahid, Greg Stitt, Roman Lysecky, "Warp Processing: Dynamic Translation of Binaries to FPGA Circuits", IEEE Computer Society, 2008
[2] Roman Lysecky, Greg Stitt, Frank Vahid, "Warp Processors", ACM Transactions on Design Automation of Electronic Systems, Vol. 11, No. 3, July 2006
[3] Frank Vahid, Greg Stitt, "Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators", ACM, 2007
[4] Frank Vahid, Roman Lysecky, S. Tan, "Dynamic FPGA Routing for Just-in-Time Compilation", IEEE/ACM, 2004
[5] http://www.en.wikipedia.org
[6] http://www.cs.ucr.edu
Performance Modeling of Embedded Applications
with Zero Architectural Knowledge
University of Paderborn
Pavithra Rajendran
January 4, 2012
Abstract
Performance evaluation is a key phase in the design and development of embedded systems. Modern embedded systems have short product development life cycles, so it is essential to build a performance model early in the design phase so that rework can be minimized. Most performance estimation techniques require knowledge of the system architecture if they are to be applied during the design phase; unfortunately, not all target architecture information is available that early.

The objective of this paper is to present a model by Marco Lattuada and Fabrizio Ferrandi that estimates performance without requiring any information about the processor architecture other than the GNU GCC intermediate representation, and to compare it against other similar models. The model applies linear regression to the internal register-level representation of the GNU GCC compiler so that compiler optimizations are exploited. The paper also briefly describes my ideas on how the model can be extended to evaluate the performance of modern embedded systems, which are highly complex, with advanced architectural features such as branching, pipelining, streaming, buffer caches, and power management that cannot be efficiently captured by linear methods.
1 INTRODUCTION
Early performance evaluation in design and minimal architectural dependency are primary criteria for modern embedded systems. Flexibility, time-to-market, and cost requirements form an integral part of the development cycle, and they can only be met by early performance evaluation. Fixing timing-related constraints later in the development cycle costs more, as it may cause rework in design and development. This complexity demands a new model that can evaluate performance with minimal architectural knowledge. The increased use of Multi-Processor System-on-Chip (MPSoC) designs in embedded systems has further complicated evaluation, because the multiple components and their heterogeneity demand architectural knowledge. Performance estimation should therefore be done early in the design phase, so that alternative solutions can be compared without knowing all the details of the components that will be used later in product development. Results of similar work show that early evaluation techniques [5] aptly fit the modern time-to-market pressure and the short product life dictated by market competition. But modern embedded systems are real-time and more complex. For example, a modern real-time embedded system may run a multimedia application that has to encode or decode a stream at high speed without compromising quality. Performance with quality is the key for time-critical embedded systems: for a monitoring device used in a nuclear power plant, or a device that monitors forest fires, missing a deadline or a time-critical decision can cause severe damage. Moreover, these systems are developed under huge market competition and must be produced at low cost; they have to be reliable but at the same time show competitive performance. The proposed methodology does not require any knowledge of the target processor; instead, the system design exploits the information about the target processor provided by the GNU GCC compiler.
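The regression idea at the heart of the methodology, fitting per-operation costs to measured execution times, can be sketched with ordinary least squares; the operation names, counts, and costs below are invented, and the real model works on GCC's internal representation rather than this toy feature vector:

```python
# Sketch of linear-regression performance modeling: execution time is
# modeled as a linear combination of intermediate-representation
# operation counts, with coefficients fitted by least squares.

def least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gaussian elimination."""
    n = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(n)]
         for i in range(n)]
    b = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(n)]
    for col in range(n):                       # forward elimination w/ pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for i in reversed(range(n)):               # back substitution
        coef[i] = (b[i] - sum(A[i][j] * coef[j]
                              for j in range(i + 1, n))) / A[i][i]
    return coef

# Rows: per-benchmark counts of (add, mul, load) operations; y: cycles.
X = [[10, 2, 5], [4, 8, 1], [7, 3, 9], [1, 1, 2]]
true_cost = [1.0, 3.0, 4.0]                    # hypothetical cycles per op
y = [sum(c * x for c, x in zip(true_cost, row)) for row in X]
coef = least_squares(X, y)
print([round(c, 6) for c in coef])             # ~[1.0, 3.0, 4.0]
```

With more benchmarks than operation classes the system is overdetermined, and the fitted coefficients act as architecture-independent "costs" per IR operation, which is what lets the approach avoid detailed processor knowledge.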
The remainder of this paper is organized as follows. Section 2 compares other similar works. Section 3 describes the methodology proposed by Lattuada. Section 4 compares the experimental results of similar models. Section 5 describes enhancements that can be made to the methodology for modern embedded systems. Section 6 concludes the paper.
2 COMPARISON OF RELATED WORK
Generic methods for performance evaluation can be categorized as:
1. Direct measures.
2. Estimation by simulation.
3. Estimation using a mathematical model.
4. Prediction.
Most of the time, direct measurement requires developers to have accurate knowledge of the target architecture. This is not possible, because not all components are available early in the design phase, and the components are prone to change later in the design due to cost, new chip technology, or other factors. So this model cannot be fully utilized early in the design phase, and techniques based on simulation are preferred. In the simulation methodology, each component can be simulated by running a behavioral simulator model using MATLAB or a neural network. The advantage of the simulation model is its accuracy; at the same time, it can be applied only to smaller components and cannot be generalized to a bigger set, since the simulated behavior could change. This disadvantage leads to the third approach, based on mathematical models. Here an estimate is derived by correlating numerical functions with the performance of the component; this is less accurate but at the same time much faster. A prediction model can be based on simulation results or on a profiling study. A simulation-based predictive model retains the limitations of the simulation model, while a profile-based study requires the designer to know the architecture of the target system.
2.1 Direct Estimation Model
Direct measures for performance evaluation require deep knowledge of the architectural characteristics of the target system to be designed.
Brandolese et al. [1] presented a model that divides the source code into basic elements called atoms, which are used in a hierarchical analysis of performance. In this model, the performance estimate is computed by summing the execution times of all the atoms plus different overhead scenarios in the system. The execution time of each atom is the time taken to execute a particular program path under ideal conditions plus a deviation factor derived from a mathematical model. The disadvantages of this model are that reference times and deviations cannot be mapped linearly, and that estimating the execution time and its deviation becomes increasingly complex for a larger system. Also, this model does not consider target architecture characteristics such as parallelism, memory, etc.
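In generic form (my notation, reconstructed from the description above rather than taken from [1]), the atom-based estimate is a sum of per-atom reference times and deviations plus system-level overheads:

```latex
% T_est   : total estimated execution time
% t_i^ref : ideal-path execution time of atom i
% delta_i : deviation factor of atom i (from the mathematical model)
% O_sys   : overhead scenarios of the system
T_{\mathrm{est}} \;=\; \sum_{i \in \mathrm{atoms}} \left( t_i^{\mathrm{ref}} + \delta_i \right) \;+\; O_{\mathrm{sys}}
```

The criticisms in the text then map onto the terms: the delta_i are not linearly related to the reference times, and both grow hard to estimate as the number of atoms increases.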
To overcome this disadvantage, Beltrame, Brandolese et al. [4] proposed a subsequent, more flexible model. It derives the performance estimate by summing the execution delay of an operation, the overhead due to deviations, and a coefficient factor that accounts for target system performance characteristics such as parallelism. The problem with this model is that it does not consider the heterogeneity of the target system, which may use multiple processors.
Hwang et al. [12] proposed a model that considers pipelining, branch delays, and memory organization, but it still requires the exact timings for executing the different basic blocks on the different processors.
Most direct estimation techniques share the same disadvantage: they require the designer to have some knowledge of the architecture of the target component to guarantee accuracy. This requirement was affordable when the designer dealt with a single processing element or a few of them, but with MPSoCs in modern real-time embedded systems it is no longer a realistic approach.
2.2 Simulation and Mathematical Model
Performance techniques based on automation like simulation or ma<strong>the</strong>matical models<br />
are faster and more accurate than direct estimation. They can easily apply multiprocessor<br />
characteristic to figure performance evaluation on memory access and parallelism.<br />
Question is how much degree <strong>of</strong> target system architecture should be known by <strong>the</strong><br />
designer.<br />
Lajolo et al. [6] used a mathematical model with the GNU GCC compiler to generate assembler-level C code with timing annotations. This can provide very accurate and fast estimates. The disadvantage of the model is that regenerating the C code for the target system requires understanding the target architecture, or at least the instruction set of the target processor.
Oyamada et al. [7] propose a simulation-based model that is likewise based on the instruction set of the target processor but follows a non-linear approach based on neural networks. Using a neural network makes the model more accurate and faster, but it complicates the estimation if the developer wants to break the code into subparts.
University of Paderborn
Pavithra Rajendran
2.3 Prediction Model
Prediction techniques are also used in performance estimation. Suzuki et al. [10] used a prediction model that considers a set of benchmark execution times and average cycle counts to determine the performance of the system. The drawback of this model is that it does not consider overheads, loops, or recursion. Giusto et al. [9] came out with a similar model but with a linear approach, which can be applied to similar application execution paths without even estimating them. Moreover, these prediction models do not consider architectural features such as parallelism, pipelining, compiler optimization, etc. Above all, they lack accuracy when applied blindly across different processors.
In summary:
1. Direct evaluation model: cannot be used effectively, as most of the target components will not be available during the performance evaluation design phase.
2. Simulation model: requires knowledge of the target architecture for accuracy.
3. Mathematical model: linear and additive in nature, but the deviations are higher.
4. Prediction model: lacks accuracy.
3 PROPOSED METHODOLOGY - Marco Lattuada and Fabrizio Ferrandi [2]
A comparison of all the related work shows that a performance estimation model is needed which:
(a) considers the possible characteristics of the target processors, but without requiring knowledge of the architecture itself or of its instruction set, and is hence extensible;
(b) considers target architecture characteristics like compile-time optimizations, pipelining, parallelism, etc.;
(c) is linear, so that every component can be analyzed individually;
(d) takes into account the dynamic behavior of the application to find correlations among source code, input data, and performance.
3.1 Linear Regression Technique
In mathematical notation, a linear regression model is of the form:
Y = f(X, β, ɛ) (1)
where Y is the execution time of the model (or of a subset of it) and is the dependent variable, X is the vector of source code parameters (the independent variables), β is the vector of coefficients for those parameters, and ɛ is the error term.
Performance Modeling of Embedded Applications with Zero Architectural Knowledge
Expanding the function, it can be written as
Y = β0 + β1X1 + β2X2 + ... + βkXk + ɛ (2)
This can be simplified to
Execution time/Cycle time = β0 + Σi∈F βi·xi (3)
The linear regression technique can be divided into two steps: model building and model application. During model building we measure benchmark execution times and develop and tune the characteristics, which we can call training sets, as in the simulation model. This is usually done by running a profiler such as IPROF on the target system, or by generating neural networks or simulators in MATLAB or similar simulation tools. During the latter step, we apply the analyzed factors to another subset of the application and arrive at the execution time directly.
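As an illustration of the two steps, the least-squares fit behind equations (2) and (3) can be sketched as follows. All benchmark names, sequence counts, and cycle figures below are invented for illustration; the real methodology derives its features from profiled RTL sequences.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_linear_model(X, y):
    """Model building: least-squares fit of y = b0 + sum_i bi*xi
    via the normal equations (A^T A) beta = A^T y."""
    A = [[1.0] + [float(v) for v in row] for row in X]  # intercept column
    n = len(A[0])
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(n)] for i in range(n)]
    Aty = [sum(A[k][i] * y[k] for k in range(len(A))) for i in range(n)]
    return solve(AtA, Aty)

def predict(beta, counts):
    """Model application: plug new sequence counts into the fitted model."""
    return beta[0] + sum(b * c for b, c in zip(beta[1:], counts))

# Invented training set: per-benchmark counts of three RTL sequence
# classes, and the measured cycle totals for each benchmark.
X = [[120, 30, 10], [200, 80, 25], [50, 10, 5], [300, 60, 40], [90, 20, 8]]
y = [1500.0, 3100.0, 620.0, 4100.0, 1150.0]
beta = fit_linear_model(X, y)
estimated = predict(beta, [150, 40, 15])  # unseen code fragment
```

Because the model is linear and additive, the fitted coefficients can be applied to any subset of an application by counting its sequences, which is exactly what makes component-wise analysis possible.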
3.2 Model Description
The proposed model basically consists of the following major steps:
1. Convert the source code into a language-independent intermediate representation called GIMPLE.
2. Perform the target-independent optimizations.
3. Translate the GIMPLE representation into the RTL (Register Transfer Language) representation.
4. Perform the target-dependent optimizations.
5. Convert the RTL representation into assembly language.
Each RTL instruction is composed of a combination of RTL operations: an RTL operation is mainly characterized by an operator (e.g., plus, minus), a data type (e.g., SI, single integer), some operands (e.g., registers, results of other RTL operations), and annotations.
For example, as illustrated in Figure 2 and Figure 3, an RTL instruction can be composed of a set operation that writes into a register (reg) the result of a PLUS operation executed on a register and on a constant integer.
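To make the notion of operator/type classes concrete, the sketch below extracts operator:mode pairs from a GCC-style RTL expression matching the set/PLUS example just described. The RTL line follows GCC's s-expression syntax, but the parsing is a simplified illustration, not GCC's actual reader, and the operator list is an assumption.

```python
import re

# A GCC-style RTL instruction: set a register to (plus reg const_int),
# matching the example described in the text.
rtl = "(set (reg:SI 60) (plus:SI (reg:SI 61) (const_int 4)))"

# Collect operator:mode pairs such as plus:SI. 'reg' is an operand, not
# an operator, so this sketch keeps only a small assumed operator set.
OPERATORS = {"plus", "minus", "mult", "ashift"}
pairs = [m for m in re.findall(r"([a-z_]+):(SI|DI|SF|DF)", rtl)
         if m[0] in OPERATORS]
print(pairs)  # [('plus', 'SI')]
```

Counting such pairs over a whole function yields the sequence-class statistics that feed the regression model.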
The RTL-sequence-based analysis meets the requirements listed in the previous section for the following reasons:
1. The RTL representations of the same application differ for different target processors. This is because code generation in the GNU GCC compiler considers the characteristics of the target architecture; hence it captures target performance characteristics like compiler optimization, pipelining, and memory hierarchy.
Figure 1: Lattuada and Ferrandi’s Model
2. The RTL language itself is target independent: the same constructs are used when generating assembly code for any target processor system.
3. Target-independent optimizations have already been performed, because the code is generated after the compiler middle end.
4. Portions of the target application can be analyzed independently.
Figure 2: C Code and GIMPLE
5. Profiling can be done on the target machine and coupled with the RTL representation.
3.3 Model Building
The proposed model consists of three preprocessing steps that are performed before the linear regression: normalization, main introduction, and clustering.
Normalization is applied for accuracy. Estimation techniques usually consider the overall execution delay without considering either the magnitude of the input or the size of the application. An absolute error or deviation cannot provide accurate information; hence the relative error must be considered. This is achieved through normalization in the proposed model, where:
Input: for each RTL sequence class, the fraction of the sequences of the application that belong to that class, relative to the whole application.
Output: the average number of cycles required by an RTL sequence of that application; the range of this new dependent variable is less sensitive than that of the original one.
These values are easily calculated by dividing the number of occurrences of a sequence by the overall count. For example, the normalized value of the operation ashift:SI-plus:SI is 1/11 ≈ 0.09, obtained by dividing its single occurrence by the overall count of eleven sequences.
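The normalization step can be sketched in a few lines. The list of sequence classes below is invented for illustration (eleven sequences in total, one of which is the ashift:SI-plus:SI class from the example above):

```python
from collections import Counter

# Hypothetical RTL sequence classes observed in an application:
# 11 sequences in total, one of them ashift:SI-plus:SI.
sequences = (["plus:SI-plus:SI"] * 4 + ["set:SI"] * 3 +
             ["mult:SI-plus:SI"] * 3 + ["ashift:SI-plus:SI"])

counts = Counter(sequences)
total = sum(counts.values())

# Normalization: each class becomes its fraction of all sequences.
normalized = {cls: n / total for cls, n in counts.items()}
print(round(normalized["ashift:SI-plus:SI"], 2))  # 0.09
```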
Simulation normally does not consider the startup time of the application itself or the function call overhead. The model compensates for this by introducing a fake operation called Main introduction, which can be treated as a constant term.
Last comes clustering, where similar RTL sequences are grouped. In a large application there may be millions of RTL sequences. Their number can be minimized by defining an equivalence relation among <op:type> classes. This relation should describe which operations can be considered performance-equivalent: for example, plus and minus, less-than and greater-than, or the same operation on similar types of data should possess the same execution time. This reduces the number of training sets and hence simplifies the model.
Figure 3: RTL Representation and Assembly Language
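A minimal sketch of such performance-equivalence clustering follows. The groupings mirror the examples in the text (plus/minus, comparisons), but the cluster names and the mapping table are invented for illustration:

```python
# Map each <op:type> class to a representative cluster; classes in the
# same cluster are assumed to take the same execution time. The table
# and cluster names are hypothetical.
EQUIVALENT = {
    "plus:SI": "additive:SI", "minus:SI": "additive:SI",
    "lt:SI": "compare:SI", "gt:SI": "compare:SI",
}

def cluster(op_type):
    """Return the performance-equivalence cluster of an <op:type> class."""
    return EQUIVALENT.get(op_type, op_type)  # unknown classes stay as-is

sequence = ["plus:SI", "minus:SI", "lt:SI", "mult:SI"]
print(sorted({cluster(s) for s in sequence}))  # fewer distinct classes
```

Fewer distinct classes means fewer regression coefficients to train, which is exactly why clustering simplifies the model.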
3.4 Model Application
Once the analysis and model building are done, the linear formula explained in Section 3.1 is applied. The basic execution cycle time is calculated first, and repeated cycles are executed to calculate the deviations.
4 COMPARISON OF EXPERIMENTAL RESULTS
Compared to the other models of Section 2, the proposed RTL methodology exploits the linear regression technique as follows:
1. It is more accurate on heterogeneous systems than [9], as it converts the source code only into RTL form and regenerates assembler code irrespective of the target architecture. RTL generation also makes use of the target compiler's optimization features.
2. The average error deviation obtained by model [10] is 6.03%, the lowest among the compared models, but it can only be applied to simple applications without loops, recursion, etc.
3. Most linear models described in Section 2 exhibit errors ranging from 0.06% to 19.3%, and non-linear models from 0.03% to 20.5%. The deviation is minimal if the architecture is known and the input data is unknown. The error of the RTL linear model, in contrast, does not depend on the architecture and shows an 8.6% deviation in the worst case.
4. Lajolo's model [6] exhibits the least deviation, less than 4%, but the system requires architectural knowledge to regenerate the code, and the cycle iteration is minimal.
5. Oyamada et al. [7] successfully created a similar model that produced almost the same result, around 10.8% in the worst case. The model works well on heterogeneous systems, but it relies largely on a neural network to train the sets; hence the model is non-linear and is not as simple to extend as the RTL model, which uses clustering.
6. All models based on assembly-level code show better results than RTL and are more accurate, but they require the developer to know the instruction set of the target processor.
5 PROPOSED FUTURE WORK
Lattuada's work, which was reviewed above, offers certain features like the linear regression technique with early evaluation during the design phase. However, it does not consider the evaluation of modern embedded systems, for which the RTL sequence model may produce millions of complex sequences. Creating the training sets would take ages without neural-network support; hence, for complex systems, the approach will start tilting towards non-linear techniques.
The major drawbacks of Lattuada's model are:
1. It does not consider the length of the sequences created by the RTL analysis.
2. Clustering becomes complex for large applications.
3. There is no automated clustering.
C/C++-based models [8] can be executed to simulate the complete behavior of a system and obtain some performance information. Just like testing, these approaches can give good confidence in the correctness of the system, but no formal guarantees on the upper limits of performance. Abstract interpretation models can be used to formally and automatically verify properties such as "the system never takes more than X units of time to process an event". These analyses provide formal guarantees, but the analysis can take a huge amount of time and memory. The approach should therefore be to opt for a model that analyzes the critical components in detail using a modular approach [11] [3] and the less critical components using an abstract translation technique, while at the same time making it easy to create training sets. The above model can be extended as represented in Figure 4. The following are the ideal characteristic steps for a fast and portable performance analysis that needs zero architectural knowledge of the target systems:
Figure 4: Proposed Model
1. Convert the source code into machine-independent virtual code.
2. Cluster the operations using a neural network.
3. Regenerate code for the target architecture.
4. Execute the performance estimation cycle using the trained neural network.
5. Apply the deviation coefficient using dynamic programming.
6. Apply a backtracking algorithm to decide which execution path must be used when estimating real-time applications.
6 CONCLUSION
Early performance estimation is the way to go due to the complexity and heterogeneity of current and future embedded systems. Today's market requires comparing multiple architectures during design time; hence fast and accurate performance estimation tools are needed to support design architecture exploration. The proposed future work is an integrated methodology for faster estimation without architectural knowledge, supported by neural networks. The estimator provides flexibility and precision even for complex processors with pipelines and cache memories. It is fast compared to other linear models and better than non-linear models in the worst-case scenario.
References
[1] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto. Source-level execution time estimation of C programs. Pages 98–103, 2001.
[2] Marco Lattuada and Fabrizio Ferrandi. Performance modeling of embedded applications with zero architectural knowledge. Pages 277–286, New York, NY, USA, 2010. ACM.
[3] F. Ferrandi, M. Lattuada, C. Pilato, and A. Tumeo. Performance estimation for task graphs combining sequential path profiling and control dependence regions. In MEMOCODE '09: Proceedings of the 7th IEEE/ACM International Conference on Formal Methods and Models for Codesign, pages 131–140, 2009.
[4] G. Beltrame, C. Brandolese, W. Fornaciari, F. Salice, D. Sciuto, and V. Trianni. Modeling assembly instruction timing in superscalar architectures. In ISSS '02: 15th International Symposium on System Synthesis, 2002.
[5] M. Gries. Methods for evaluating and covering the design space during early design development. Tech. Rep. UCB/ERL M03/32, Electronics Research Lab, University of California at Berkeley, 2003.
[6] M. Lajolo, M. Lazarescu, and A. Sangiovanni-Vincentelli. A compilation-based software estimation scheme for hardware/software co-simulation. In CODES '99: Seventh International Workshop on Hardware/Software Codesign, pages 85–89, 1999.
[7] M. S. Oyamada, F. Zschornack, and F. R. Wagner. Applying neural networks to performance estimation of embedded software. J. Syst. Archit., 54(1-2):224–240, 2008.
[8] Moo-Kyoung Chung, Sangkwon Na, and Chong-Min Kyung. System-level performance analysis of embedded system using behavioral C/C++ model. IEEE INSPEC Accession Number: 8540449, no. 14, pages 188–191, 2005.
[9] P. Giusto, G. Martin, and E. Harcourt. Reliable estimation of execution time of embedded software. In DATE '01: Conference on Design, Automation and Test in Europe, pages 580–589, 2001.
[10] K. Suzuki and A. Sangiovanni-Vincentelli. Efficient software performance estimation methods for hardware/software codesign. In DAC '96: 33rd Design Automation Conference, pages 605–610, 1996.
[11] E. Wandeler, L. Thiele, M. Verhoef, and P. Lieverse. System architecture evaluation using modular performance analysis: A case study. 2006.
[12] Y. Hwang, S. Abdi, and D. Gajski. Cycle-approximate retargetable performance estimation at the transaction level. In DATE '08: Conference on Design, Automation and Test in Europe, pages 38, 2008.
Improving Application Launch Times
Gavin Vaz
University of Paderborn
gavinvaz@mail.uni-paderborn.de
December 2, 2011
Abstract
Application launch times are very noticeable to the user. The user has to wait for the entire application to load, and only then can he interact with it. If this wait is too long, it affects the user's satisfaction. The primary cause of slow application launch times is hard disk latency. This paper looks at how application launch times can be reduced by predicting when an application might be launched and preloading it into main memory in order to reduce disk latencies. It also looks at how hybrid hard disks could be used to reduce application launch times by around 24%, and at optimization techniques that are able to reduce the application launch times of already fast solid state drives by 28%.
1 Introduction
Application launch times are one of the most evident performance parameters from the user's perspective. Waiting for an application to load is not a pleasant experience and reduces user satisfaction. Over the past decade, the computational power of processors and the speed of main memory have been steadily improving. However, the size of applications has also been growing rapidly, resulting in slow application launch times in spite of faster processors and memory.
Youngjin Joo et al. [9] performed a study on application launch times in order to determine how much time was used by the CPU, the memory, the hard disk drive (HDD), and data transfer during an application launch. Their study (see Fig. 1) showed that the CPU and memory accounted for merely 20 to 30 percent of the application launch time; the remaining time was accounted for by the HDD, with disk rotational latency and seek times making up nearly half of the total application launch time.
HDDs are block devices. A block is the smallest addressable unit of a HDD and is addressed using its logical block address (LBA). In order to read a block, the HDD controller must first move the head into position over the appropriate cylinder; the time taken to do this is known as the seek time of the disk. The desired disk block might not be below the head, so the HDD controller must wait for the disk to rotate until the desired block is under the head; this time is known as the rotational latency of the disk. Thus, seek time and rotational latency together constitute the disk latency and are the outcome of the mechanical limitations of a HDD.
Figure 1: Breakdown of an application's launch time [9].
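A rough back-of-the-envelope calculation shows why these mechanical latencies dominate. The drive parameters and block count below are assumed typical values, not figures from the paper:

```python
# Back-of-the-envelope disk latency estimate; all numbers are assumed
# typical values, not measurements from the paper.
RPM = 7200                 # spindle speed of an assumed desktop HDD
AVG_SEEK_MS = 9.0          # assumed average seek time

revolution_ms = 60_000 / RPM            # time for one full rotation
avg_rotational_ms = revolution_ms / 2   # on average, half a rotation
avg_access_ms = AVG_SEEK_MS + avg_rotational_ms

# Launching an application that touches 500 scattered blocks would pay
# this latency repeatedly (ignoring caching and request reordering).
launch_latency_ms = 500 * avg_access_ms
print(round(avg_rotational_ms, 2), round(launch_latency_ms / 1000, 2))
```

With these assumed numbers, the average rotational latency alone is about 4.17 ms, and 500 scattered reads cost several seconds in pure mechanical latency before any data transfer.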
An application (file) is made up of many such blocks, which might not be contiguous and in reality might be distributed across many cylinders of the HDD. In addition, most applications nowadays use shared libraries, which also need to be loaded from the disk when the application is launched. So when an application is launched, hundreds of blocks are requested from the HDD, and a lot of time is wasted purely on seek and rotational latencies. Hence, seek and rotational latencies are the primary and most important cause of slow application launches.
The problem of disk latency has been addressed by many techniques, one of them being disk caches. Disk caches are effective only when the same data is requested (accessed) repeatedly, eliminating seek and rotational latencies by reading the data directly from the cache instead of from the disk. However, in the case of an application launch, unless the application has been launched before, the data that the application requests will not be present in the disk cache, making the cache ineffective in reducing application launch times.
Nowadays, some computer manufacturers provide application developers with “Launch Time Performance Guidelines” [8] that need to be followed in order to improve application performance at launch time. These involve delaying the loading and initialization of subsystems that are not required immediately. This helps to speed up the launch time considerably. However, this approach cannot reduce the latency of loading the code that is absolutely necessary to launch the application.
On the other hand, some applications load a part of their application code into main memory when the operating system boots up. This is done so that the application appears to load faster when the user launches it. In addition to wasting precious main memory, this scheme gives the user the perception that the operating system takes a long time to load, and it does not really reduce the overall application launch time.
Another approach that operating systems commonly employ is to optimize the HDD by reducing file fragmentation on the disk. This is done by periodically defragmenting the HDD, which results in lower seek and rotational latencies, meaning that applications are able to load faster.
Microsoft claims that “Windows ReadyBoot” [4] helps decrease the time required to boot the system by preloading the files required during the booting phase. ReadyBoot saves a file trace of the files used when the system boots up. It then uses idle CPU time to analyze the file traces from the five previous boots, making a note of the accessed files along with their locations on disk. During subsequent system boots, ReadyBoot prefetches these files into an in-RAM cache, saving the boot process the time required to retrieve the files from the disk.
This paper looks at the different approaches that have been employed to tackle the problem of slow application launch times. Section 2 looks at how adaptive prefetching can be used to predict which applications a user might run in the near future and fetch them into main memory in order to achieve faster application launch times. Section 3 looks at how hybrid hard disks (H-HDDs) could be used to improve application launch times. Section 4 looks at how the performance of solid state drives (SSDs) could be further improved to reduce application launch times. Section 5 compares the pricing of HDDs, H-HDDs, and SSDs. We conclude the paper in Section 6.
2 Adaptive Prefetching
Prefetching is a well-known concept and has been used to prefetch instructions for processors, to prefetch data from main memory into the processor cache [15], and to prefetch links on webpages [7], to name a few. This section takes a closer look at Preload [6], an adaptive prefetcher that is capable of predicting when an application might be launched by the user and preloads it into main memory. This helps reduce HDD latencies and hence application launch times.
2.1 Preload
Preload consists of the following two components:
1. Data gathering and model training
2. Predictor
These components are fairly isolated and are connected together by a shared probabilistic model. Data is gathered by monitoring the user's actions and is used to train the model. The predictor uses this model to predict which application will be launched and then prefetches that application.
Typical GUI applications have larger binaries, larger working sets, and longer running times, and are inherently more complex than other Unix programs. The goal of Preload is to achieve faster “application” start-up times. In order to do this, it needs to distinguish between an “application” and any other program. Preload ignores any processes that are very short-lived or whose address space is smaller than a specified size. By ignoring these processes, Preload is able to keep the size of the model down.
The processes running on the system are filtered according to the above criteria to obtain a list of running applications. Information on all these running applications is collected periodically by the data gathering component. The period of this cycle is a configurable parameter and is set to twenty seconds if not explicitly specified. Finally, a list of memory maps is fetched for each application and is used to update the model [6].
The predictor, like the data gathering component, is also invoked periodically. It uses the trained model along with the list of currently running applications to predict which applications should be prefetched. For every application that is not running, the probability of it starting in the next cycle is computed. The predictor then uses these per-application probabilities to assign probabilities to their maps. It then sorts the maps by probability and prefetches the top ones into main memory. In order to minimize the system load due to prefetching, system load and memory statistics are used to decide how much prefetching is performed in each cycle [6].
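The predictor step described above can be sketched as follows. All application names, probabilities, map sizes, and the memory-budget policy are invented for illustration; this is not Preload's actual algorithm or data.

```python
# Hypothetical per-application launch probabilities for applications
# that are currently not running.
launch_prob = {"browser": 0.7, "editor": 0.4, "mailer": 0.1}
app_maps = {  # (memory map name, size in MB) per application; invented
    "browser": [("libhtml.so", 12), ("browser.bin", 30)],
    "editor":  [("editor.bin", 8)],
    "mailer":  [("mailer.bin", 20)],
}

# Assign each map the probability of its owning application.
maps = [(p, name, size)
        for app, p in launch_prob.items()
        for name, size in app_maps[app]]
maps.sort(reverse=True)  # highest probability first

budget_mb = 50           # assumed free-memory budget for this cycle
prefetched, used = [], 0
for p, name, size in maps:
    if used + size <= budget_mb:  # prefetch only while budget allows
        prefetched.append(name)
        used += size
print(prefetched)
```

The budget stands in for the system-load and memory statistics mentioned in the text: when memory is tight, fewer maps make the cut in that cycle.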
2.2 Implementation Overhead

Preload runs as a daemon process on the system and has a modest memory footprint [6]. The model, which resides in main memory, consumes less than 3 MB of memory for around a hundred applications. The process is asleep most of the time, waking up periodically or whenever the processor is idle. This ensures that it does not affect the performance of other applications running on the system. Once launched, Preload takes a few cycles to settle into a steady state. After this, it stops making new I/O requests and hence does not interfere with the power-saving schemes used in most modern systems.
2.3 Performance Evaluation

To evaluate its performance, the application launch times obtained with Preload were compared to those obtained when the page cache was cleared (cold cache) and to those when the application was already present in the page cache (warm cache). The cold-cache scheme represents an application launch when a user has not launched the application before, so there are no application-related entries in the page cache. The warm-cache scheme, on the other hand, represents an application launch when a user has previously launched the application. Table 1 shows the time taken for various applications to launch in the three scenarios. It is apparent from the results that Preload is able to reduce application launch times compared to a cold launch; the average reduction with Preload is around 44%. It can also be seen that Preload is more effective at reducing launch times for large applications, making it a good solution for improving application launch times.
Improving Application Launch Times
Application            Cold   Warm   Preload   Gain   Size
OpenOffice.org Writer  15s    2s     7s        53%    90 MB
Firefox Web Browser    11s    2s     5s        55%    38 MB
Evolution Mailer        9s    1s     4s        55%    85 MB
Gedit Text Editor       6s    0.1s   4s        33%    52 MB
Gnome Terminal          4s    0.4s   3s        25%    27 MB

Table 1: Application start-up time with cold and warm caches, and with Preload [6].
3 Hybrid Disks

Figure 2: Hybrid disk logical hierarchy [9].

A hybrid disk (H-HDD) is a traditional HDD combined with embedded flash memory. The embedded flash memory can be arranged either as a new level of the hierarchy between the main memory and the disk (see Fig. 2(a)) or at the same level of the hierarchy as the disk (see Fig. 2(b)).
When used in the configuration shown in Figure 2(a), flash memory can serve as a second-level disk cache [11]. Because flash is nonvolatile, the contents of this disk cache are retained even after the system is rebooted. However, this scheme yields a low hit ratio unless the flash cache is very large [9]. Flash memory in this configuration can also be used as a Write-Only Disk Cache (WODC) [14]. The WODC holds blocks of data that are to be written to the disk; writes complete against the fast flash memory, and the data is then transferred to the HDD asynchronously, resulting in improved HDD performance. However, application launches generate very little write traffic, making a WODC ineffective in this scenario.
When flash memory is used at the same hierarchy level as the disk (see Fig. 2(b)), a small portion of it can be used to pin data. This is referred to as “OEM-pinned data”. Table 2 shows the cache allocation recommended for different flash memory sizes.

Flash size                128 MB   256 MB
H-HDD firmware             10 MB    10 MB
Write cache                32 MB    32 MB
OEM-pinned data            15 MB    79 MB
SuperFetch™ pinned data    71 MB   135 MB

Table 2: Manufacturer recommendation for the flash memory partition in the H-HDD [9].

The OEM-pinned data cache can be used to pin application data in order to improve application launch times. However, due to the size limitation of the OEM-pinned data cache, it is not possible to pin all the data required to launch an application (see Table 3).

Linux Ubuntu 8.04            Windows Vista Ultimate
Evolution 2.22.1   16.9 MB   Excel 2007        15.0 MB
Firefox 3.0b5      27.1 MB   Labview 8.5.1     45.0 MB
F-Spot 0.4.2       27.4 MB   Outlook 2007      16.7 MB
Gimp 2.4.5         15.6 MB   Photoshop CS2     62.4 MB
Rhythmbox 0.11.5   17.9 MB   Powerpoint 2007   14.7 MB
Totem 2.22.1       10.7 MB   Word 2007         27.3 MB

Table 3: Code block size required for application launch [9].

This section looks at a method proposed by Youngjin Joo et al. [9] that improves the application launch time by pinning only a small subset of the application data. The idea is to select an optimal pinned-set for an application, given the size limitation of the OEM-pinned data cache, so that the seek time and rotational latency of the HDD are minimized.
3.1 Pinned-set Selection

The following steps need to be performed to obtain the pinned-set of an application:

1. Determine the application launch sequence from the raw block requests
2. Derive an access cost model of H-HDDs
3. Formulate pinned-set optimization as an ILP problem

Figure 3 shows the framework of the method used to determine the pinned-set selection.
The first step is to extract the application launch sequence for a given application. Software-based disk I/O profiling tools like Blktrace [3] (Linux) and TraceView [12] (Windows) are able to capture the raw block requests issued during the application launch. However, on a typical computer system, other processes may be running as well, and these processes might also request blocks from the disk. Such rogue block requests have no connection to the application launch but are nonetheless captured by the profiling tools.

Figure 3: Framework of the proposed method of pinned-set selection [9].

The application launch sequence extractor is used to clean up the application launch sequence by eliminating these rogue block requests. After a sufficient number of raw block request sequences have been obtained from the disk I/O profiling tool, the extractor performs the following steps to identify and eliminate rogue block requests:

1. Any block requests that access read-write blocks are removed, as only application code blocks are considered for pinning.
2. Block requests which do not occur in all the raw block request sequences are removed.
3. Block requests which do not occur in the same position in all the raw block request sequences are removed.
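A minimal sketch of the extractor's three filtering steps might look as follows. This is our own illustration, not the paper's implementation: it works on abstract block IDs rather than real LBA ranges, and it treats steps 2 and 3 together by keeping only positions on which every filtered trace agrees.

```python
def extract_launch_sequence(traces, read_only):
    """Derive a clean application launch sequence from several raw traces.

    traces    -- list of raw block-request sequences, one per recorded launch
    read_only -- set of block IDs known to be read-only (application code)
    """
    # Step 1: drop requests to read-write blocks; only code blocks are pinnable.
    filtered = [[b for b in t if b in read_only] for t in traces]
    # Steps 2 and 3: keep a request only if the same block appears at the
    # same position in every filtered trace.
    length = min(len(t) for t in filtered)
    return [filtered[0][i] for i in range(length)
            if all(t[i] == filtered[0][i] for t in filtered)]
```

For example, two traces that differ only in an interleaved rogue write request reduce to the same common sequence.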
Once the clean application launch sequence has been obtained, the access cost matrix can be built from the launch sequence together with the H-HDD performance specification. Youngjin Joo et al. also proposed an ILP formulation [9] that, given an application launch sequence and access cost matrix, selects the pinned-set.
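The full formulation in [9] is an ILP, which also captures how pinning one block changes the seek path between its neighbours. As a rough illustration of the objective only, the greedy knapsack-style sketch below (our own simplification, not the paper's method) pins the blocks with the highest avoided access cost per byte until the OEM-pinned-data partition is full.

```python
def select_pinned_set(blocks, capacity):
    """Greedy stand-in for the ILP.

    blocks   -- list of (block_id, size_bytes, saved_cost) tuples, where
                saved_cost is the seek + rotational latency avoided when the
                block is served from flash (from the access cost model)
    capacity -- size of the pinned-set partition in bytes
    """
    pinned, used = [], 0
    # Pin blocks in decreasing order of cost saved per byte of flash.
    for bid, size, saved in sorted(blocks, key=lambda b: b[2] / b[1], reverse=True):
        if used + size <= capacity:
            pinned.append(bid)
            used += size
    return pinned
```

Because the greedy heuristic ignores inter-block seek interactions, it only approximates the ILP optimum; the ILP is what makes the computation times in Section 3.2 significant.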
3.2 Implementation Overhead

Generating a clean application launch sequence can take up to 0.6 seconds, while computing the access cost matrix takes up to 1.5 seconds. However, the time taken to solve the ILP problem dominates the computation time. The time required to solve the ILP is proportional to the size of the application launch sequence; i.e., the larger the application launch sequence, the more time it takes to solve the ILP. Figure 4 shows how the computation time increases with the size of the application launch sequence. This, however, seems to be an acceptable tradeoff, as the computation does not have to be repeated once the pinned-set has been obtained. Over the course of time, though, the application data may change, or the blocks of an application might be relocated during disk optimization, making the current pinned-set ineffective and forcing a re-computation.

Figure 4: Computation times required to solve the ILP problem (pinned-set size: 10% of the application launch sequence size) [9].

The time taken to compute the ILP solution can be reduced, but only by compromising the quality of the solution. For example, a solution within 0.01% of the theoretical bound can be obtained in 65 seconds, but this can be reduced to 26 seconds by accepting an error of 0.2% [9].
3.3 Performance Evaluation

To evaluate the performance of their proposed pinning method, Youngjin Joo et al. compared it with the following two pinning approaches [9].

3.3.1 First-Come First-Pinned

The first-come first-pinned (FCFP) policy pins the blocks in the order in which they appear in the application launch sequence. Blocks are pinned until they fill the pinned-set partition of the flash memory. When an application is launched, all of the initial block requests are then serviced by the flash memory, eliminating disk seek times and rotational latencies during this phase. As a result, the total H-HDD access time is reduced, and this reduction is proportional to the size of the pinned data set.
3.3.2 Small-Chunks-First

Disk seek time and rotational latency are independent of the block size; i.e., whether the requested block is large or small, the delays caused by disk latencies are nearly the same. The small-chunks-first (SCF) policy fills the pinned-set partition of the flash memory by pinning the smallest blocks first, thereby maximizing the number of blocks stored in flash memory. This in turn reduces the number of block requests that are sent to the disk and hence avoids the delays caused by disk seek time and rotational latency.

Figure 5: Values of thdd for various sizes of pinned-set. The x-axes are normalized to the size of the application launch sequence [9].
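Assuming blocks are given as (id, size) pairs in launch order, the two baseline policies can be sketched as follows (our own illustration, not the paper's code):

```python
def fcfp(seq, capacity):
    """First-come first-pinned: pin blocks in launch-sequence order
    until the pinned-set partition is full."""
    pinned, used = [], 0
    for bid, size in seq:
        if used + size > capacity:
            break  # partition full; remaining requests go to the disk
        pinned.append(bid)
        used += size
    return pinned

def scf(seq, capacity):
    """Small-chunks-first: pin the smallest blocks first to maximize the
    number of requests served from flash."""
    pinned, used = [], 0
    for bid, size in sorted(seq, key=lambda b: b[1]):
        if used + size > capacity:
            continue  # this block does not fit, but a smaller one might
        pinned.append(bid)
        used += size
    return pinned
```

For the same capacity, SCF never pins fewer blocks than FCFP, which is exactly why it avoids more per-request disk latencies.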
To evaluate these approaches, ten raw block request sequences for each benchmark application were captured and used as input to the application launch sequence extractor. The resulting clean application launch sequence was then used to calculate the access cost matrix for each application. This was then used with the ILP solver to obtain the pinned-set for different sizes of flash memory.

Figure 5 shows the H-HDD access time (thdd) for various pinned-set sizes for Evolution, Firefox, Photoshop and Powerpoint. The shaded area represents the region where Youngjin Joo et al. consider it beneficial to increase the pinned-set size while using their proposed method. The optimal pinned-set size for applications running on Microsoft Windows is around 30% of the application launch sequence, and that for applications running on Linux is around 20%. This suggests that relatively small pinned-sets are effective with their proposed method.
Table 4 shows the results of the experiment when 10% of the application data was pinned to the flash memory. It also shows the improvement in the application launch time (tlaunch) and the H-HDD access time (thdd) for the different pinning approaches. The proposed method is able to reduce the H-HDD access time by 34% when 10% of the application data is pinned, and this improvement in H-HDD performance translates into a reduction of 24% in the average application launch time [9].

Application   No pinning (sec)    FCFP             SCF              Proposed
              thdd    tlaunch     thdd    tlaunch  thdd    tlaunch  thdd    tlaunch
Evolution      5.70    7.26       93.1%   94.6%    77.7%   82.5%    59.4%   68.1%
Firefox        6.82    8.23       89.8%   91.6%    65.3%   71.3%    53.8%   61.7%
Photoshop     17.36   30.78       89.7%   94.2%    78.1%   87.7%    71.6%   84.0%
Powerpoint     7.25   12.95       95.3%   97.4%    84.9%   91.6%    80.1%   88.8%

Table 4: thdd and tlaunch for a pinned-set of 10% of the application launch sequence size [9].

4 Solid State Drives

A solid state drive (SSD) is made up of a number of NAND flash memory modules and has no mechanical parts, thereby eliminating the disk seek time and rotational latency observed in traditional HDDs. A reasonable solution for improving application launch times would therefore be to replace a traditional HDD with an SSD. But with growing application sizes, it is only a matter of time before even SSDs appear slow. This section looks at how application launch times can be further improved on SSDs by using the Fast Application STarter (FAST) application prefetching method proposed by Youngjin Joo et al. [10].
Many of the optimization techniques used with traditional HDDs cannot be used with SSDs. For example, defragmenting an SSD to improve its performance makes no sense, as the physical location of data does not affect access latency; employing such a technique would only shorten the life of the SSD. In fact, when a modern operating system detects an SSD, it disables the optimization techniques used for traditional HDDs. For example, when Windows 7 detects that an SSD is being used, it disables disk defragmentation, Superfetch, and Readyboost [13].
4.1 FAST
Figure 6(a) shows how a typical application launch is handled. Here, si is the i-th block request generated during the launch and n is the total number of blocks requested. After a block is fetched, the CPU can proceed with the launch process (ci) until another page miss occurs. This cycle is repeated until the application is launched.
Let the time spent for si and ci be denoted by t(si) and t(ci), respectively. Then the computation (CPU) time, tcpu, is expressed as

    tcpu = Σ_{i=1}^{n} t(ci)                  (1)

and the SSD access (I/O) time, tssd, is expressed as

    tssd = Σ_{i=1}^{n} t(si)                  (2)

Then the application launch time can be expressed as

    tlaunch = tssd + tcpu                     (3)
Figure 6: Various application launch scenarios (n = 4) [10].
The main idea of FAST is to overlap the I/O with the CPU computation so as to hide tssd. This is achieved by running an application prefetcher concurrently with the application: the prefetcher fetches the application launch sequence (s1, ..., sn) while the application itself is being launched (tcpu).
One possible scenario for FAST is when the computation time is larger than the SSD access time (tcpu > tssd). This is illustrated in Figure 6(b). At time t = 0, the application and the prefetcher are started simultaneously. They compete with one another for access to the SSD; however, since both request the same block s1, it does not matter which of them is granted the bus first. After s1 has been fetched, the application can start with the launch (c1) while the prefetcher continues to fetch the subsequent blocks. When the application requests the next block, it is already present in memory, so there is no page miss. Hence, the resulting application launch time (tlaunch) becomes

    tlaunch = t(s1) + tcpu                    (4)

Another possible scenario is when the computation time is smaller than the SSD access time (tcpu < tssd). This is illustrated in Figure 6(c). Here, the prefetcher is not able to fetch the entire block s2 before the application requests it. This is nevertheless still faster than the scenario in Figure 6(a), and the improvement accumulates over the remaining block requests, resulting in a launch time of

    tlaunch = tssd + t(cn)                    (5)

However, n ranges up to a few thousand for typical applications, and thus t(s1) ≪ tcpu and t(cn) ≪ tssd [10]. Consequently, Eqs. (4) and (5) can be combined into a single equation:

    tlaunch ≈ max(tssd, tcpu)                 (6)
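The behaviour captured by Eqs. (4)-(6) can be checked with a small simulation. This is an illustrative sketch with function names of our own choosing, not code from FAST: the prefetcher streams s1..sn back to back, while each computation step ci starts as soon as both c(i-1) has finished and si is in memory.

```python
def fast_launch_time(t_s, t_c):
    """Simulated launch time with FAST running alongside the application.

    t_s -- list of I/O times t(s_i); t_c -- list of CPU times t(c_i)
    """
    io_done = 0.0   # time at which block s_i is in memory
    cpu_done = 0.0  # time at which computation c_i finishes
    for ts, tc in zip(t_s, t_c):
        io_done += ts                           # the prefetcher never idles
        cpu_done = max(cpu_done, io_done) + tc  # c_i may have to wait for s_i
    return cpu_done

def cold_launch_time(t_s, t_c):
    """Without FAST, I/O and computation strictly alternate (Fig. 6(a)):
    tlaunch = tssd + tcpu, as in Eq. (3)."""
    return sum(t_s) + sum(t_c)
```

With t(si) = 1 and t(ci) = 3 for n = 4 (CPU-bound), the simulation returns 13 = t(s1) + tcpu, matching Eq. (4); with the values swapped (I/O-bound) it returns 13 = tssd + t(cn), matching Eq. (5); the cold launch takes 16 in both cases.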
4.2 Implementation

Figure 7: The proposed application prefetching [10].
The processes of FAST can be divided into two broad categories, depending on whether they run during the application launch or as an idle process. Figure 7 shows the different components of FAST and how they interact with one another.

Blktrace [3], a disk I/O profiler, is used to record the raw block request sequence issued during the application launch; the device number, LBA, I/O size and completion time are also recorded. However, the operating system or some other process might also access the disk during the application launch, so the raw block request sequence captured by Blktrace varies from one launch to another. The application launch sequence extractor cleans up the raw block request sequence by collecting two or more raw block request sequences and then extracting a common sequence. This common sequence is known as the application launch sequence.
A block can be represented as a file and an offset within that file. The application prefetcher can request a specific block (LBA) by issuing a system call with the file name and offset. However, finding the file name and offset for a given LBA is not supported by most file systems. To find this mapping, a system-call profiler (strace) is used to obtain a complete list of the files that were accessed during the application launch. The LBA-to-inode reverse mapper is then used to create an LBA-to-inode map from these files; it uses a red-black tree to reduce the search time of the LBA-to-inode map.
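The reverse mapping can be sketched as follows. The paper's mapper uses a red-black tree; the sketch below, with class and method names of our own invention, substitutes a sorted list with binary search, which likewise gives O(log n) lookups over file extents.

```python
import bisect

class LbaToInodeMap:
    """Map an LBA to the (file path, file offset) that backs it."""

    def __init__(self):
        self._starts = []   # extent start LBAs, kept sorted
        self._extents = []  # (start_lba, length, path, file_offset), same order

    def add_extent(self, start_lba, length, path, file_offset):
        """Register one contiguous on-disk extent of a file."""
        i = bisect.bisect_left(self._starts, start_lba)
        self._starts.insert(i, start_lba)
        self._extents.insert(i, (start_lba, length, path, file_offset))

    def lookup(self, lba):
        """Return (path, offset) for this LBA, or None if unmapped."""
        i = bisect.bisect_right(self._starts, lba) - 1
        if i < 0:
            return None
        start, length, path, off = self._extents[i]
        if lba < start + length:
            return path, off + (lba - start)
        return None
```

Lookups find the last extent starting at or before the LBA and check that the LBA falls inside it, exactly the query pattern a red-black tree of extents would serve.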
The application prefetcher is a user-level program that replays the disk access requests made by a target application [10]. The application prefetcher generator automatically creates an application prefetcher for each target application. It performs the following operations:

1. Read si one by one from the application launch sequence of the target application.
2. Convert si into its associated data items stored in the LBA-to-inode map.
3. Depending on the type of block, generate an appropriate system call using the converted disk access information.
4. Repeat Steps 1–3 until all si have been processed.

Once the application prefetcher for an application has been created, it is invoked by the application launch manager whenever the application is launched.

Running processes                            Runtime (sec)
1. Application only (cold start scenario)    0.86
2. strace + blktrace + application           1.21
3. blktrace + application                    0.88
4. Prefetcher generation                     5.01
5. Prefetcher + application                  0.56
6. Prefetcher + blktrace + application       0.59
7. Miss ratio calculation                    0.90

Table 5: Runtime overhead (application: Firefox) [10].
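A prefetcher generated by the steps above essentially issues one prefetch request per launch-sequence entry. The sketch below is our own illustration, not FAST's generated code: it uses Linux's `posix_fadvise(POSIX_FADV_WILLNEED)` as a stand-in for the readahead-style system calls the real prefetcher issues, and it assumes a `lba_to_inode` callback that returns a (path, offset, length) triple or None.

```python
import os

def prefetch(launch_sequence, lba_to_inode):
    """Replay an application's launch sequence: for each block, advise the
    kernel to read the backing file region into the page cache ahead of use."""
    fds = {}
    try:
        for lba in launch_sequence:
            entry = lba_to_inode(lba)
            if entry is None:
                continue  # block not backed by a known file; skip it
            path, offset, length = entry
            if path not in fds:
                fds[path] = os.open(path, os.O_RDONLY)
            # POSIX_FADV_WILLNEED asks the kernel to start reading the
            # region asynchronously, so the data is cached before the
            # application's own page miss would have fetched it.
            os.posix_fadvise(fds[path], offset, length, os.POSIX_FADV_WILLNEED)
    finally:
        for fd in fds.values():
            os.close(fd)
```

Run concurrently with the application (as in Figure 6(b)/(c)), such a loop turns the application's synchronous page misses into cache hits.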
4.3 Implementation Overhead

Table 5 shows the runtime overhead of FAST for Firefox. Case 2 is run only once. Case 3 runs once for each raw block request sequence that is captured. However, Cases 2 and 3 are run only when no application prefetcher is found for the application. The application prefetcher is generated in Case 4, which has the highest runtime; this, however, can be hidden from the user by running it in the background. Cases 5–7 are part of the application prefetcher and are repeated until the application prefetcher is invalidated. Case 7 can also be run in the background, effectively hiding it from the user.
FAST also creates some temporary files, but they can be deleted once the application prefetcher has been created. However, the actual application prefetchers and application launch sequences do occupy disk space. In the experiments performed by Youngjin Joo et al., the total size of the application prefetchers and application launch sequences for all 22 applications was 7.2 MB [10].
4.4 Performance Evaluation

To evaluate the performance of FAST, Youngjin Joo et al. compared it with the following scenarios [10].

• Cold start: The application is launched immediately after flushing the page cache. The resulting launch time is denoted by tcold.

• Warm start: First, only the application prefetcher is run, so that all the application launch sequence blocks are loaded into the page cache. The application is then run immediately afterwards. The resulting launch time is denoted by twarm.

• Sorted prefetch: The application prefetcher was modified to fetch the block requests in the application launch sequence in the sorted order of their LBAs. After flushing the page cache, the modified application prefetcher was run and the application was then immediately launched. The resulting launch time is denoted by tsorted.

• FAST: The application was run simultaneously with the application prefetcher after flushing the page cache. The resulting launch time is denoted by tFAST.

• Prefetcher only: The application prefetcher is run after the page cache is flushed. The completion time of the application prefetcher is denoted by tssd and is used to calculate a lower bound on the application launch time, tbound = max(tssd, tcpu), where tcpu = twarm is assumed.

Figure 8: Measured application launch time (normalized to tcold) [10].
Launch times were recorded for all of the above scenarios. Figure 8 shows the results, normalized to tcold. FAST achieved an average reduction of 28% in launch time compared to the cold start scenario, while the HDD-aware sorted prefetch showed only a 7% reduction. FAST was able to achieve this with no additional overhead, demonstrating the need for, and the utility of, a new SSD-aware optimizer [10].
5 HDDs, H-HDDs & SSDs

When HDDs made their first appearance, they were expensive. With advancements in technology and their ever-growing demand, however, they have become affordable, with costs as low as $0.16 per GB. SSDs today are all about performance, with sequential read speeds of up to 270 megabytes per second (MB/s). However, they are relatively expensive, with an average cost of $2.15 per GB, nearly thirteen times that of traditional HDDs; the improved performance does come at a high price. With time, SSDs might follow the trend seen in HDDs and eventually become affordable; but for the time being, is there something that can match the performance of an SSD at the price of an HDD? The answer is yes: H-HDDs bridge this gap by embedding flash memory into a traditional HDD. They perform nearly three times better than traditional HDDs [1] and, at a cost of $0.33 per GB, are nearly one sixth the cost of SSDs. Table 6 compares the prices of HDDs, H-HDDs and SSDs of various capacities. From the looks of it, H-HDDs give you plenty of bang for the buck and are here to stay.

Capacity   HDD (Seagate Momentus)   H-HDD (Seagate Momentus XT)   SSD (Intel 320 Series)
750 GB     $120                     $245 (8 GB flash)             -
600 GB     -                        -                             $1260
500 GB     $80                      $150 (4 GB flash)             -
320 GB     $130                     $125 (4 GB flash)             -
300 GB     -                        -                             $630
250 GB     $90                      $140 (4 GB flash)             -
160 GB     $160                     -                             $340
120 GB     $80                      -                             $260
80 GB      $65                      -                             $200
40 GB      $45                      -                             $110

Table 6: Prices for 2.5" drives [2, 5].

Approach          HDD   H-HDD   SSD   Smartphone
Preload           ✓     ✗       ✗     ✗
OEM-pinned data   ✗     ✓       ✗     ✗
FAST              ✗     ✗       ✓     ✓

Table 7: Approaches and supported devices
6 Conclusion

This paper looked at three approaches that can be used to improve application launch times. Table 7 shows the approaches and the devices they can be used with. Preload uses prefetching to improve application launch times: it tries to predict when an application might be launched and then preloads it into main memory. When the application is eventually launched, its launch data is already present in main memory, resulting in a faster launch. The paper also looked at how the OEM-pinned data cache of an H-HDD can be used effectively to reduce the average application launch time. With this approach, the average launch time could be reduced by 24% by pinning only 10% of the application launch sequence. Finally, the paper looked at FAST, an optimization technique that can be applied to already fast SSDs; using FAST, application launch times on SSDs could be reduced by a further 28%. FAST has excellent portability [10], and it would be interesting to see how it could be used with state-of-the-art devices like smartphones or tablets.
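To illustrate the prediction step at the heart of a Preload-style daemon, the sketch below records which applications have historically been launched after one another and uses the most frequent successor as the preload candidate. This is a deliberately simplified, hypothetical model; the actual daemon described in [6] uses a more elaborate statistical approach:

```python
from collections import defaultdict

class LaunchPredictor:
    """Toy follow-on predictor: after app A is launched, suggest the app
    that has most often followed A in past sessions, so it can be
    preloaded into main memory before the user starts it."""

    def __init__(self):
        # follows[a][b] counts how often b was launched directly after a.
        self.follows = defaultdict(lambda: defaultdict(int))
        self.last_app = None

    def record_launch(self, app):
        # Update the successor counts with the observed launch.
        if self.last_app is not None:
            self.follows[self.last_app][app] += 1
        self.last_app = app

    def predict_next(self, app):
        # Return the most frequent successor of `app`, or None if unknown.
        successors = self.follows.get(app)
        if not successors:
            return None
        return max(successors, key=successors.get)

# Hypothetical launch history: "mail" usually follows "browser".
predictor = LaunchPredictor()
for app in ["browser", "mail", "browser", "mail", "browser", "editor"]:
    predictor.record_launch(app)

print(predictor.predict_next("browser"))  # "mail" would be preloaded next
```

A real prefetching daemon would additionally weigh free memory, time of day, and prediction confidence before spending I/O bandwidth on a preload.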
Gavin Vaz<br />
References

[1] http://www.seagate.com/www/en-us/products/laptops/laptop-hdd/. [Online; accessed 30-November-2011].

[2] http://www.amazon.com. [Online; accessed 30-November-2011].

[3] Jens Axboe. Block IO tracing. https://git.kernel.org/?p=linux/kernel/git/axboe/blktrace.git;a=blob;f=README, September 2006. [Online; accessed 26-November-2011].

[4] Microsoft Corporation. Windows PC accelerators. http://www.microsoft.com/whdc/system/sysperf/perfaccel.mspx, October 2010. [Online; accessed 25-November-2011].

[5] Nathan Edwards. Seagate Momentus XT 750GB review. http://www.maximumpc.com/article/reviews/seagate_momentus_xt_750gb_review, November 2011. [Online; accessed 30-November-2011].

[6] Behdad Esfahbod. Preload - an adaptive prefetching daemon. Master's thesis, University of Toronto, 2006.

[7] Darin Fisher and Gagan Saksena. Link prefetching in Mozilla: A server-driven approach. In Fred Douglis and Brian Davison, editors, Web Content Caching and Distribution, pages 283–291. Springer Netherlands, 2004.

[8] Apple Computer Inc. Launch time performance guidelines. https://developer.apple.com/library/mac/#documentation/Performance/Conceptual/LaunchTime/LaunchTime.html, April 2006. [Online; accessed 25-November-2011].

[9] Yongsoo Joo, Youngjin Cho, Kyungsoo Lee, and Naehyuck Chang. Improving application launch times with hybrid disks. In Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '09, pages 373–382, New York, NY, USA, 2009. ACM.

[10] Yongsoo Joo, Junhee Ryu, Sangsoo Park, and Kang G. Shin. FAST: Quick application launch on solid-state drives. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST '11, Berkeley, CA, USA, 2011. USENIX Association.

[11] B. Marsh, F. Douglis, and P. Krishnan. Flash memory file caching for mobile computers. In Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, volume 1, pages 451–460, January 1994.
[12] Microsoft. Windows Driver Kit. http://msdn.microsoft.com/en-us/library/ff553872.aspx, September 2011. [Online; accessed 26-November-2011].

[13] Steven Sinofsky. Support and Q&A for solid-state drives. https://blogs.msdn.com/b/e7/archive/2009/05/05/support-and-q-a-for-solid-state-drives-and.aspx, May 2009. [Online; accessed 28-November-2011].

[14] Jon A. Solworth and Cyril U. Orji. Write-only disk caches. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, SIGMOD '90, pages 123–132, New York, NY, USA, 1990. ACM.

[15] Steven P. Vanderwiel and David J. Lilja. Data prefetch mechanisms. ACM Computing Surveys, 32:174–199, June 2000.